Skip to content

[fix][client] Serialize chunked-message bookkeeping to fix use-after-…#26084

Open
SongOf wants to merge 1 commit into
apache:masterfrom
SongOf:fix/client-chunked-message-bookkeeping-race
Open

[fix][client] Serialize chunked-message bookkeeping to fix use-after-…#26084
SongOf wants to merge 1 commit into
apache:masterfrom
SongOf:fix/client-chunked-message-bookkeeping-race

Conversation

@SongOf

@SongOf SongOf commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Motivation

ConsumerImpl's chunked-message reassembly state — the per-uuid ChunkedMessageCtx,
its chunkedMsgBuffer, pendingChunkedMessageCount and pendingChunkedMessageUuidQueue
— is mutated from two different threads with no synchronization:

  • the receive/assembly path (processMessageChunk, and the last-chunk finalize in
    messageReceived) runs on the Netty IO event-loop thread
    (ClientCnx.handleMessage calls consumer.messageReceived(...) directly);
  • the incomplete-chunk expiry path (removeExpireIncompleteChunkedMessages) runs on
    the client's internalPinnedExecutor — a separate single-thread pool
    (client.getInternalExecutorService()), not the eventLoopGroup.

When a late chunk for a uuid arrives while the expiry task is removing that same ctx,
the expiry thread can release() / recycle() a ChunkedMessageCtx and its buffer while
the receive thread is still writing into it. This races into:

  • use-after-free / double-free of chunkedMsgBuffer (IllegalReferenceCountException,
    or worse — writing into memory the allocator already handed to someone else);
  • double ChunkedMessageCtx.recycle(), which corrupts the Netty Recycler pool and can
    hand the same instance to two different chunked messages;
  • pendingChunkedMessageCount drift (non-atomic int mutated from both threads).

Incomplete-chunk expiry is enabled by default (expireTimeOfIncompleteChunkedMessageMillis
= 1 minute), so any consumer of chunked messages is exposed.

Separately, a redelivered first chunk (chunkId == 0 for a uuid that already has an
in-progress ctx) replaced the old ctx without decrementing pendingChunkedMessageCount
and re-enqueued the uuid, so the counter over-counted and pendingChunkedMessageUuidQueue
ended up with duplicate / mis-ordered entries (the queue is meant to be ordered oldest-first
by receivedTime, with one entry per in-progress uuid, since removeExpire/removeOldest
clean from the head).

Modifications

  • Add a dedicated lock chunkedMessageLock. processMessageChunk,
    removeOldestPendingChunkedMessage and removeExpireIncompleteChunkedMessages now acquire
    it (thin wrapper over an extracted body), and messageReceived holds it across the
    last-chunk assembly + finalize so the assemble→finalize window is closed. All chunk
    bookkeeping is serialized, and pendingChunkedMessageCount is mutated only under the lock
    (so it needs no volatile/Atomic). Heavy per-message work (newMessage, decryption of
    non-chunk payloads, executeNotifyCallback) stays outside the lock, and non-chunked
    messages never touch it.
    • The lock only ever guards chunk bookkeeping; the calls made under it
      (ConcurrentHashMap ops, doAcknowledge which is async/non-blocking) never acquire it
      in reverse, so there is no new lock-ordering / deadlock and no blocking call held across
      the lock.
  • Fix the redelivered-first-chunk path: decrement pendingChunkedMessageCount for the
    replaced ctx, remove(uuid) its stale entry from pendingChunkedMessageUuidQueue and
    re-add(uuid) at the tail — keeping the queue ordered oldest-first with exactly one entry
    per in-progress uuid (net 0 count change on replace, +1 on a genuinely new uuid).

This is an internal client-side locking change only; no public API, wire protocol, schema,
config defaults or metrics are changed.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • ConsumerImplTest.testChunkedMessageCountRaceBetweenReceiveAndExpiry — drives
    processMessageChunk on a "receiver" thread concurrently with the real
    removeExpireIncompleteChunkedMessages on an "expirer" thread, for the "a late chunk
    arrives for a uuid that is concurrently being expired" scenario. Verified to fail before
    the fix
    (IllegalReferenceCountException / corrupted pendingChunkedMessageCount) and
    pass after.
  • ConsumerImplTest.testDuplicateFirstChunkOvercountsPendingChunkedMessageCount
    deterministic; delivers a redelivered first chunk and asserts
    pendingChunkedMessageCount == chunkedMessagesMap.size() and that the uuid appears exactly
    once in pendingChunkedMessageUuidQueue. Verified to fail before and pass after.

Full pulsar-client unit suite (712 tests) passes with no regressions.

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc-required
  • doc-not-needed
    (internal bug fix; no user-facing behavior or config change)
  • doc
  • doc-complete

* that writes into the same {@code chunkedMsgBuffer} — exactly when the expiry task may release/recycle that ctx.
* Without serialization this races (use-after-free / double-recycle). With the fix, both paths take
* chunkedMessageLock, so it passes and pendingChunkedMessageCount stays equal to chunkedMessagesMap.size().
* See .claude/pulsar-client-review-findings.md (#1).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this reference

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

}

/**
* Deterministic reproducer for finding #2: a duplicated first chunk (chunkId == 0, redelivered) for a uuid that

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

referring to findings doesn't make sense in code comments

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed, thanks for your comments.

@SongOf SongOf force-pushed the fix/client-chunked-message-bookkeeping-race branch from 8b391e4 to 69fb6f6 Compare June 25, 2026 02:59
@SongOf SongOf force-pushed the fix/client-chunked-message-bookkeeping-race branch from 69fb6f6 to 910db42 Compare June 25, 2026 03:17
@SongOf SongOf requested a review from lhotari June 25, 2026 04:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants