BB-788: Log stuck task identities on rebalance timeout#2763
Conversation
|
LGTM — clean observability addition. Review by Claude Code |
ef318c7 to
272fc1c
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
... and 2 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.0 #2763 +/- ##
===================================================
- Coverage 74.30% 74.15% -0.16%
===================================================
Files 201 201
Lines 13485 13491 +6
===================================================
- Hits 10020 10004 -16
- Misses 3455 3477 +22
Partials 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
LGTM — clean, well-scoped change. The new |
94e8afe to
0126c5a
Compare
|
LGTM — clean, well-scoped observability improvement. The new |
|
LGTM — clean observability improvement. The new |
0126c5a to
92bf1da
Compare
|
LGTM |
92bf1da to
690a13b
Compare
|
LGTM — no issues found. The data shape accessed in |
cb398c5 to
11aa0bb
Compare
|
rebased |
690a13b to
1a70dd7
Compare
|
|
||
| this._log.error('rdkafka.rebalance timeout: consumer stuck, disconnecting'); | ||
| const queueLen = this._processingQueue?.length(); | ||
| const running = this._processingQueue?.running(); |
There was a problem hiding this comment.
Nit: inside the setTimeout callback at the drain timeout, this._processingQueue is guaranteed to be non-null — the early exit at line 741 (if (!this._processingQueue || this._processingQueue.idle())) already returns before the timeout is ever set. The optional chaining on this._processingQueue?.length() and this._processingQueue?.running() is harmless but unnecessary, unlike in getInFlightTasks() where it's correctly needed because the method is public.
— Claude Code
|
LGTM — clean, well-scoped change. The new |
Hello anurag4dsb,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Incorrect fix versionThe
Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:
Please check the |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
|
/create_integration_branches |
Integration data createdI have created the integration data for the additional destination branches.
The following branches will NOT be impacted:
You can set option The following options are set: create_integration_branches |
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
The following options are set: create_integration_branches |
|
/approve |
Build failedThe build for commit did not succeed in branch w/9.5/improvement/BB-788-stuck-consumer-logs The following options are set: approve, create_integration_branches |
|
/bypass_build_status |
|
I have successfully merged the changeset of this pull request
The following branches have NOT changed:
This pull request did not target the following hotfix branch(es) so they
Please check the status of the associated issue BB-788. Goodbye anurag4dsb. The following options are set: bypass_build_status, approve, create_integration_branches |
Note: 1 test failing on this branch is a known and not related to this PR
When the rebalance drain timeout fires, the stuck-consumer error line now says what the
consumer was stuck on: queue depth, running count, and the kafka coordinates + message
key of up to 10 in-flight tasks. Keys are logged whole (their size is already bounded
by arsenal's objectKeyByteLimit). The cap of 10 equals the default consumer
concurrency, so the line can list a full default in-flight set but never grows past
~11KB worst case; the list is oldest-first, so the wedged tasks always lead it, and
anything beyond ten would be the newest, least suspicious entries. Runs once, in the
drain-timeout path only — zero per-message work. Stacked on #2761.
A real line, captured from the functional suite (single line in the logs, wrapped here):
{"name":"BackbeatConsumer","time":1781182870950, "topic":"backbeat-consumer-spec-shutdown","groupId":"bucket-processor-0.274398999", "queueLen":0,"running":1, "stuckTasks":[{"topic":"backbeat-consumer-spec-shutdown","partition":0,"offset":142,"key":"key"}], "level":"error","message":"rdkafka.rebalance timeout: consumer stuck, disconnecting", "hostname":"...","pid":76103}On a CRR processor the same line carries
topic: backbeat-replication,groupId: backbeat-replication-group-<site>(configured groupId + site), andkey: <bucket>/<object-key>— the key the populator sets when publishing the entry(
ReplicationQueuePopulator.js). The key names the wedged object, partition/offset letyou fetch the exact message back from kafka, and grepping the processor log for the key
shows how far the task got.
Ten is also sufficient because nothing beyond it is lost:
runningminus the listedtasks says how many more were in flight, and the remainder is recoverable two ways --
dump the partition from the lowest listed offset with kafka tools (every message key
names its object), or just wait: the next partition owner re-runs every uncommitted
entry and logs each one as it goes.