Skip to content

[improve][broker] Improve no-filter dispatch performance by skipping filter calls#26056

Open
void-ptr974 wants to merge 2 commits into
apache:masterfrom
void-ptr974:optimize-dispatch-filter-fast-path
Open

[improve][broker] Improve no-filter dispatch performance by skipping filter calls#26056
void-ptr974 wants to merge 2 commits into
apache:masterfrom
void-ptr974:optimize-dispatch-filter-fast-path

Conversation

@void-ptr974

@void-ptr974 void-ptr974 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Motivation

AbstractBaseDispatcher#filterEntriesForConsumer is on the dispatch hot path. When no entry filters are configured, runFiltersForEntry only returns ACCEPT, so calling it once per entry is avoidable work.

Modifications

  • Cache the dispatcher filter state in a local variable for the batch.
  • Skip runFiltersForEntry when no entry filters are configured.

Scope

This intentionally leaves the filtered/redelivery list allocation and ack-set handling unchanged. Those are separate optimizations and are kept out of this PR to make the review focused on the no-filter fast path only.

Performance

This improves no-filter dispatch performance by removing one no-op filter call and the REJECT/RESCHEDULE checks per entry when no filters are configured. Allocation remains effectively unchanged; this is a CPU-path optimization.

JMH data gathered locally using a local-only DispatcherDispatchPathBenchmark.(noFilterRunFiltersPerEntry|noFilterFastPath) benchmark, with -p entriesCount=32,256,1000 -p batchSize=1,10 -wi 2 -i 5 -w 1s -r 1s -f 1 -bm avgt -tu ns -prof gc:

batchSize entriesCount old path fast path improvement
1 32 44.8 ns/op 28.0 ns/op 37.6%
1 256 329.1 ns/op 186.1 ns/op 43.4%
1 1000 3151.4 ns/op 1603.3 ns/op 49.1%
10 32 52.4 ns/op 31.2 ns/op 40.4%
10 256 361.7 ns/op 182.9 ns/op 49.4%
10 1000 2658.4 ns/op 1593.4 ns/op 40.1%

Verification

  • Added AbstractBaseDispatcherTest coverage for a 512-entry no-filter dispatch batch with varying payload sizes and batch sizes.
  • ./gradlew :pulsar-broker:compileJava :pulsar-broker:checkstyleMain --max-workers=1
  • ./gradlew :pulsar-broker:test --tests org.apache.pulsar.broker.service.AbstractBaseDispatcherTest --max-workers=1 -PtestRetryCount=0
  • ./gradlew :pulsar-broker:checkstyleTest --max-workers=1

@void-ptr974 void-ptr974 force-pushed the optimize-dispatch-filter-fast-path branch 3 times, most recently from ab3f3eb to 08414a3 Compare June 18, 2026 11:56
@void-ptr974 void-ptr974 changed the title [improve][broker] Reduce base dispatcher fast-path overhead [improve][broker] Skip entry filters when none are configured Jun 18, 2026
@void-ptr974 void-ptr974 force-pushed the optimize-dispatch-filter-fast-path branch from 08414a3 to c487ddb Compare June 18, 2026 12:09
@void-ptr974 void-ptr974 force-pushed the optimize-dispatch-filter-fast-path branch from c487ddb to ba8c764 Compare June 18, 2026 12:20
@void-ptr974 void-ptr974 changed the title [improve][broker] Skip entry filters when none are configured [improve][broker] Improve no-filter dispatch performance by skipping filter calls Jun 18, 2026
@void-ptr974 void-ptr974 marked this pull request as ready for review June 18, 2026 12:39
…ilter-fast-path

# Conflicts:
#	pulsar-broker/src/test/java/org/apache/pulsar/broker/service/AbstractBaseDispatcherTest.java
filteredBytesCount += metadataAndPayload.readableBytes();
entry.release();
continue;
if (hasFilter) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a if (hasFilter) { line a few lines above. join the blocks

@lhotari

lhotari commented Jun 24, 2026

Copy link
Copy Markdown
Member

This improves no-filter dispatch performance by removing one no-op filter call and the REJECT/RESCHEDULE checks per entry when no filters are configured. Allocation remains effectively unchanged; this is a CPU-path optimization.

The results seem odd since runFiltersForEntry is a method call which will check for hasFilter and return EntryFilter.FilterResult.ACCEPT so no allocations are involved.

I think that results might be flawed. Please check a recent blog post https://quarkus.io/blog/when-the-jit-cant-keep-up/ by Francesco Nigro. The url line has "when the JIT can't keep up". It might be a bottleneck in your case.
The JIT optimizations might not kick in when the warmup time is too short and when CPU is starved so that the JIT compiler postpones optimizations. Claude recommends running with a longer warmup time.

Claude review for JMH parameters

Config under review:

-p entriesCount=32,256,1000 -p batchSize=1,10 -wi 2 -i 5 -w 1s -r 1s -f 1 -bm avgt -tu ns -prof gc

-f 1 is the weak point for reviewer-facing evidence

A single fork means no run-to-run variance estimate across JVM launches. JIT
compilation, GC ergonomics, and code layout vary between JVM instances; with one
fork you can't tell whether a delta between the two methods is real or an
artifact of that single launch. For a committed performance claim, -f 3 (or 5)
and reporting the spread is much more defensible. This is exactly the kind of
single-fork result the reviewer's JIT objection can poke at.

-wi 2 -w 1s = 2 seconds of warmup is short

For a claim about compiled-code behavior, 2 seconds of warmup is short. With
sub-microsecond per-op work at entriesCount=32, you'll get millions of ops, so
the iteration count is fine — but 2 seconds may not be enough wall-clock for C2
to finish and the loop to stabilize, especially if anything deopts. To answer the
JIT objection cleanly, bump to -wi 5 -w 2s or add -prof perfnorm to confirm
settling.

-prof gc is the right call for the allocation half

If gc.alloc.rate.norm comes back identical (ideally 0 B/op, or the same
non-zero value) for both noFilterRunFiltersPerEntry and noFilterFastPath
across all six param combos, you've directly confirmed the commit's "allocation
remains effectively unchanged" — and shown the ACCEPT singleton path never
allocated to begin with.

Suggested config for the PR

This config is reasonable for a quick local look, but for the version attached to
the PR, prefer:

-f 3 -wi 5 -w 2s -prof gc -prof perfnorm

That covers both the allocation claim (B/op) and the "is the delta real CPU work"
claim (branches/op and cycles/op), with cross-fork variance to back it.

-prof perfnorm isn't available on MacOS.

Some other details from Claude:

Measurement Noise: JMH on a 2020 13" MacBook Pro (4-core 2.0 GHz i5)

On this machine, noise on the timing numbers (avgt ns/op) is likely
substantial and comes from several stacking sources. The allocation numbers
(gc.alloc.rate.norm, B/op) are essentially noise-free — only the time-based
results are affected.

Sources of noise

Thermal throttling (dominant)

2.0 GHz base, Turbo to ~3.8 GHz. A cool chip runs the early part of a run near
Turbo, then clocks down as it heats under sustained load. Across -i 5 one-second
iterations you can be measuring at meaningfully different clock speeds. This
produces both variance (iterations scatter) and bias (a downward trend — later
iterations slower as heat builds, or the reverse if a prior run left it hot). It's
the worst kind of noise because it's correlated with time, so averaging doesn't
cleanly cancel it.

Single fork hides all of it

With -f 1 you get one JVM's worth of one thermal trajectory. JMH's reported
error bars are within-fork only, so they look deceptively tight while completely
missing the run-to-run swing that throttling, code layout, and GC ergonomics
introduce. The error interval JMH prints is not the error you actually have.

Background OS activity

Spotlight, Time Machine, browser, WindowServer all share the same 4 cores, and
macOS gives no clean CPU-pinning equivalent to Linux taskset/chrt. A stray
background spike during a 1-second iteration is a large fraction of that iteration.
Short iterations are least robust to this.

Core saturation if threads >= cores

Only 4 physical cores. If the benchmark runs multi-threaded or JMH defaults threads
high, the measurement threads compete with each other and with JIT/GC threads for
those 4 cores — adding scheduling jitter on top of everything.

What to do about it

Fixable in config

  • -f 3 or more — the single most important change. Surfaces the real variance
    instead of hiding it, and averages across independent thermal trajectories.
  • -w 2s -wi 5 — longer warmup lets the chip approach thermal steady-state
    before measurement starts, so you're not measuring the cool-to-hot transient.
  • -i 8 -r 2s — longer/more measurement iterations let background blips average
    out and make any residual thermal trend visible rather than aliased.
  • -t 1 — keep it single-threaded so measurement threads don't fight each other or
    the compiler for the 4 cores.

Not fixable in config

  • Let the machine idle/cool between runs, plugged in (not on battery — power
    management changes clocks), lid open, ideally with heavy background apps quit.
    Don't compare a cold run against one done right after another run.
  • Run the two arms (noFilterRunFiltersPerEntry vs noFilterFastPath) interleaved
    or back-to-back under the same thermal conditions, not one in the morning and one
    in the afternoon. JMH already interleaves forks, which is another reason -f 3+
    helps — both methods get sampled across the same set of thermal states.

Honest framing for the reviewer

Even with all of the above, a throttling laptop gives a directional result, not a
precise one:

  • If noFilterFastPath is ~15–20% faster across all param combos with -f 3,
    that's a real signal that survives the noise.
  • If the delta is 2–3%, this machine genuinely cannot tell you whether it's real —
    that's within the thermal noise floor. You'd need a controlled Linux box (fixed
    clocks, cpupower governor pinned to performance, perfnorm for per-op counters)
    to make the claim.

Bottom line: trust the -prof gc B/op result from the Mac unconditionally;
treat the ns/op result as a sanity check whose resolution is maybe ±10% on this
hardware; put any tight CPU claim on Linux.

It's possible to mitigate issues, but micro benchmarking on Linux x86-64 on a modern CPU with CPU powersaving etc. turned off would have less measurement noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants