[improve][broker] Improve no-filter dispatch performance by skipping filter calls by void-ptr974 · Pull Request #26056 · apache/pulsar

void-ptr974 · 2026-06-18T06:51:56Z

Motivation

AbstractBaseDispatcher#filterEntriesForConsumer is on the dispatch hot path. When no entry filters are configured, runFiltersForEntry only returns ACCEPT, so calling it once per entry is avoidable work.

Modifications

Cache the dispatcher filter state in a local variable for the batch.
Skip runFiltersForEntry when no entry filters are configured.

Scope

This intentionally leaves the filtered/redelivery list allocation and ack-set handling unchanged. Those are separate optimizations and are kept out of this PR to make the review focused on the no-filter fast path only.

Performance

This improves no-filter dispatch performance by removing one no-op filter call and the REJECT/RESCHEDULE checks per entry when no filters are configured. Allocation remains effectively unchanged; this is a CPU-path optimization.

JMH data gathered locally using a local-only DispatcherDispatchPathBenchmark.(noFilterRunFiltersPerEntry|noFilterFastPath) benchmark, with -p entriesCount=32,256,1000 -p batchSize=1,10 -wi 2 -i 5 -w 1s -r 1s -f 1 -bm avgt -tu ns -prof gc:

batchSize	entriesCount	old path	fast path	improvement
1	32	44.8 ns/op	28.0 ns/op	37.6%
1	256	329.1 ns/op	186.1 ns/op	43.4%
1	1000	3151.4 ns/op	1603.3 ns/op	49.1%
10	32	52.4 ns/op	31.2 ns/op	40.4%
10	256	361.7 ns/op	182.9 ns/op	49.4%
10	1000	2658.4 ns/op	1593.4 ns/op	40.1%

Verification

Added AbstractBaseDispatcherTest coverage for a 512-entry no-filter dispatch batch with varying payload sizes and batch sizes.
./gradlew :pulsar-broker:compileJava :pulsar-broker:checkstyleMain --max-workers=1
./gradlew :pulsar-broker:test --tests org.apache.pulsar.broker.service.AbstractBaseDispatcherTest --max-workers=1 -PtestRetryCount=0
./gradlew :pulsar-broker:checkstyleTest --max-workers=1

…ilter-fast-path # Conflicts: # pulsar-broker/src/test/java/org/apache/pulsar/broker/service/AbstractBaseDispatcherTest.java

lhotari · 2026-06-24T19:32:18Z

-                filteredBytesCount += metadataAndPayload.readableBytes();
-                entry.release();
-                continue;
+            if (hasFilter) {


there's a if (hasFilter) { line a few lines above. join the blocks

lhotari · 2026-06-24T20:40:23Z

This improves no-filter dispatch performance by removing one no-op filter call and the REJECT/RESCHEDULE checks per entry when no filters are configured. Allocation remains effectively unchanged; this is a CPU-path optimization.

The results seem odd since runFiltersForEntry is a method call which will check for hasFilter and return EntryFilter.FilterResult.ACCEPT so no allocations are involved.

I think that results might be flawed. Please check a recent blog post https://quarkus.io/blog/when-the-jit-cant-keep-up/ by Francesco Nigro. The url line has "when the JIT can't keep up". It might be a bottleneck in your case.
The JIT optimizations might not kick in when the warmup time is too short and when CPU is starved so that the JIT compiler postpones optimizations. Claude recommends running with a longer warmup time.

Claude review for JMH parameters

Config under review:
-p entriesCount=32,256,1000 -p batchSize=1,10 -wi 2 -i 5 -w 1s -r 1s -f 1 -bm avgt -tu ns -prof gc
-f 1 is the weak point for reviewer-facing evidence

A single fork means no run-to-run variance estimate across JVM launches. JIT
compilation, GC ergonomics, and code layout vary between JVM instances; with one
fork you can't tell whether a delta between the two methods is real or an
artifact of that single launch. For a committed performance claim, -f 3 (or 5)
and reporting the spread is much more defensible. This is exactly the kind of
single-fork result the reviewer's JIT objection can poke at.

-wi 2 -w 1s = 2 seconds of warmup is short

For a claim about compiled-code behavior, 2 seconds of warmup is short. With
sub-microsecond per-op work at entriesCount=32, you'll get millions of ops, so
the iteration count is fine — but 2 seconds may not be enough wall-clock for C2
to finish and the loop to stabilize, especially if anything deopts. To answer the
JIT objection cleanly, bump to -wi 5 -w 2s or add -prof perfnorm to confirm
settling.

-prof gc is the right call for the allocation half

If gc.alloc.rate.norm comes back identical (ideally 0 B/op, or the same
non-zero value) for both noFilterRunFiltersPerEntry and noFilterFastPath
across all six param combos, you've directly confirmed the commit's "allocation
remains effectively unchanged" — and shown the ACCEPT singleton path never
allocated to begin with.

Suggested config for the PR

This config is reasonable for a quick local look, but for the version attached to
the PR, prefer:
-f 3 -wi 5 -w 2s -prof gc -prof perfnorm
That covers both the allocation claim (B/op) and the "is the delta real CPU work"
claim (branches/op and cycles/op), with cross-fork variance to back it.

-prof perfnorm isn't available on MacOS.

Some other details from Claude:

Measurement Noise: JMH on a 2020 13" MacBook Pro (4-core 2.0 GHz i5)

On this machine, noise on the timing numbers (avgt ns/op) is likely
substantial and comes from several stacking sources. The allocation numbers
(gc.alloc.rate.norm, B/op) are essentially noise-free — only the time-based
results are affected.

Sources of noise

Thermal throttling (dominant)

2.0 GHz base, Turbo to ~3.8 GHz. A cool chip runs the early part of a run near
Turbo, then clocks down as it heats under sustained load. Across -i 5 one-second
iterations you can be measuring at meaningfully different clock speeds. This
produces both variance (iterations scatter) and bias (a downward trend — later
iterations slower as heat builds, or the reverse if a prior run left it hot). It's
the worst kind of noise because it's correlated with time, so averaging doesn't
cleanly cancel it.

Single fork hides all of it

With -f 1 you get one JVM's worth of one thermal trajectory. JMH's reported
error bars are within-fork only, so they look deceptively tight while completely
missing the run-to-run swing that throttling, code layout, and GC ergonomics
introduce. The error interval JMH prints is not the error you actually have.

Background OS activity

Spotlight, Time Machine, browser, WindowServer all share the same 4 cores, and
macOS gives no clean CPU-pinning equivalent to Linux taskset/chrt. A stray
background spike during a 1-second iteration is a large fraction of that iteration.
Short iterations are least robust to this.

Core saturation if threads >= cores

Only 4 physical cores. If the benchmark runs multi-threaded or JMH defaults threads
high, the measurement threads compete with each other and with JIT/GC threads for
those 4 cores — adding scheduling jitter on top of everything.

What to do about it

Fixable in config

-f 3 or more — the single most important change. Surfaces the real variance
instead of hiding it, and averages across independent thermal trajectories.

-w 2s -wi 5 — longer warmup lets the chip approach thermal steady-state
before measurement starts, so you're not measuring the cool-to-hot transient.

-i 8 -r 2s — longer/more measurement iterations let background blips average
out and make any residual thermal trend visible rather than aliased.

-t 1 — keep it single-threaded so measurement threads don't fight each other or
the compiler for the 4 cores.

Not fixable in config

Let the machine idle/cool between runs, plugged in (not on battery — power
management changes clocks), lid open, ideally with heavy background apps quit.
Don't compare a cold run against one done right after another run.

Run the two arms (noFilterRunFiltersPerEntry vs noFilterFastPath) interleaved
or back-to-back under the same thermal conditions, not one in the morning and one
in the afternoon. JMH already interleaves forks, which is another reason -f 3+
helps — both methods get sampled across the same set of thermal states.

Honest framing for the reviewer

Even with all of the above, a throttling laptop gives a directional result, not a
precise one:

If noFilterFastPath is ~15–20% faster across all param combos with -f 3,
that's a real signal that survives the noise.

If the delta is 2–3%, this machine genuinely cannot tell you whether it's real —
that's within the thermal noise floor. You'd need a controlled Linux box (fixed
clocks, cpupower governor pinned to performance, perfnorm for per-op counters)
to make the claim.

Bottom line: trust the -prof gc B/op result from the Mac unconditionally;
treat the ns/op result as a sanity check whose resolution is maybe ±10% on this
hardware; put any tight CPU claim on Linux.

It's possible to mitigate issues, but micro benchmarking on Linux x86-64 on a modern CPU with CPU powersaving etc. turned off would have less measurement noise.

void-ptr974 force-pushed the optimize-dispatch-filter-fast-path branch 3 times, most recently from ab3f3eb to 08414a3 Compare June 18, 2026 11:56

void-ptr974 changed the title ~~[improve][broker] Reduce base dispatcher fast-path overhead~~ [improve][broker] Skip entry filters when none are configured Jun 18, 2026

void-ptr974 force-pushed the optimize-dispatch-filter-fast-path branch from 08414a3 to c487ddb Compare June 18, 2026 12:09

[improve][broker] Reduce base dispatcher fast-path overhead

ba8c764

void-ptr974 force-pushed the optimize-dispatch-filter-fast-path branch from c487ddb to ba8c764 Compare June 18, 2026 12:20

void-ptr974 changed the title ~~[improve][broker] Skip entry filters when none are configured~~ [improve][broker] Improve no-filter dispatch performance by skipping filter calls Jun 18, 2026

void-ptr974 marked this pull request as ready for review June 18, 2026 12:39

merlimat approved these changes Jun 18, 2026

View reviewed changes

Merge remote-tracking branch 'origin/master' into optimize-dispatch-f…

4641053

…ilter-fast-path # Conflicts: # pulsar-broker/src/test/java/org/apache/pulsar/broker/service/AbstractBaseDispatcherTest.java

dao-jun approved these changes Jun 21, 2026

View reviewed changes

dao-jun assigned void-ptr974 Jun 21, 2026

lhotari reviewed Jun 24, 2026

View reviewed changes

void-ptr974 mentioned this pull request Jun 25, 2026

Flaky-test: GeoReplicationTest.testTopicReplication #15913

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[improve][broker] Improve no-filter dispatch performance by skipping filter calls#26056

[improve][broker] Improve no-filter dispatch performance by skipping filter calls#26056
void-ptr974 wants to merge 2 commits into
apache:masterfrom
void-ptr974:optimize-dispatch-filter-fast-path

void-ptr974 commented Jun 18, 2026 •

edited

Loading

Uh oh!

lhotari Jun 24, 2026

Uh oh!

lhotari commented Jun 24, 2026

`-f 1` is the weak point for reviewer-facing evidence

`-wi 2 -w 1s` = 2 seconds of warmup is short

`-prof gc` is the right call for the allocation half

Suggested config for the PR

Measurement Noise: JMH on a 2020 13" MacBook Pro (4-core 2.0 GHz i5)

Sources of noise

Thermal throttling (dominant)

Single fork hides all of it

Background OS activity

Core saturation if threads >= cores

What to do about it

Fixable in config

Not fixable in config

Honest framing for the reviewer

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

void-ptr974 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Scope

Performance

Verification

Uh oh!

lhotari Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

lhotari commented Jun 24, 2026

-f 1 is the weak point for reviewer-facing evidence

-wi 2 -w 1s = 2 seconds of warmup is short

-prof gc is the right call for the allocation half

Suggested config for the PR

Measurement Noise: JMH on a 2020 13" MacBook Pro (4-core 2.0 GHz i5)

Sources of noise

Thermal throttling (dominant)

Single fork hides all of it

Background OS activity

Core saturation if threads >= cores

What to do about it

Fixable in config

Not fixable in config

Honest framing for the reviewer

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

void-ptr974 commented Jun 18, 2026 •

edited

Loading

`-f 1` is the weak point for reviewer-facing evidence

`-wi 2 -w 1s` = 2 seconds of warmup is short

`-prof gc` is the right call for the allocation half