BB-788: Log stuck task identities on rebalance timeout #2763

Merged
bert-e merged 1 commit into development/9.0 from improvement/BB-788-stuck-consumer-logs
Jun 12, 2026

Conversation

@anurag4DSB

@anurag4DSB anurag4DSB commented Jun 11, 2026

Contributor

Note: the 1 test failing on this branch is a known failure and is not related to this PR.

When the rebalance drain timeout fires, the stuck-consumer error line now says what the
consumer was stuck on: queue depth, running count, and the Kafka coordinates plus message
key of up to 10 in-flight tasks. Keys are logged whole (their size is already bounded
by arsenal's objectKeyByteLimit). The cap of 10 equals the default consumer
concurrency, so the line can list a full default in-flight set but never grows past
~11KB in the worst case. The list is oldest-first, so the wedged tasks always lead it, and
anything beyond ten would be the newest, least suspicious entries. This runs once, in the
drain-timeout path only, so it adds zero per-message work. Stacked on #2761.
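
For readers without the diff open, here is a minimal sketch of what a helper like getInFlightTasks() could look like. It is not the merged code: it is written as a stand-alone function taking the queue as an argument (the real method reads this._processingQueue), the constant name MAX_STUCK_TASKS is hypothetical, and it assumes the queue is an async.queue whose workersList() returns the in-flight items oldest-first, each wrapping the pushed Kafka message in a `data` field.

```javascript
// Hypothetical constant name; 10 matches the default consumer concurrency
// mentioned in the description.
const MAX_STUCK_TASKS = 10;

// Stand-alone sketch: `processingQueue` stands in for this._processingQueue.
function getInFlightTasks(processingQueue, limit = MAX_STUCK_TASKS) {
    // No queue yet (or already torn down): report nothing rather than throw.
    const workers = processingQueue?.workersList() ?? [];
    return workers.slice(0, limit).map(({ data }) => ({
        topic: data.topic,
        partition: data.partition,
        offset: data.offset,
        // Kafka keys may be Buffers or absent; stringify only when present.
        key: data.key?.toString(),
    }));
}

// Usage with a fake queue (no Kafka or async dependency needed):
const fakeQueue = {
    workersList: () => [
        { data: { topic: 'backbeat-replication', partition: 0, offset: 142, key: Buffer.from('bucket/obj') } },
        { data: { topic: 'backbeat-replication', partition: 1, offset: 7, key: null } },
    ],
};
console.log(JSON.stringify(getInFlightTasks(fakeQueue)));
```

Because the slice happens before the map, the cost is bounded by the cap regardless of how many tasks are running, which is consistent with the "runs once, bounded size" claim above.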

A real line, captured from the functional suite (single line in the logs, wrapped here):

{"name":"BackbeatConsumer","time":1781182870950,
 "topic":"backbeat-consumer-spec-shutdown","groupId":"bucket-processor-0.274398999",
 "queueLen":0,"running":1,
 "stuckTasks":[{"topic":"backbeat-consumer-spec-shutdown","partition":0,"offset":142,"key":"key"}],
 "level":"error","message":"rdkafka.rebalance timeout: consumer stuck, disconnecting",
 "hostname":"...","pid":76103}

On a CRR processor the same line carries topic: backbeat-replication,
groupId: backbeat-replication-group-<site> (configured groupId + site), and
key: <bucket>/<object-key>, which is the key the populator sets when publishing the entry
(ReplicationQueuePopulator.js). The key names the wedged object, partition/offset let
you fetch the exact message back from Kafka, and grepping the processor log for the key
shows how far the task got.
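
As an illustration of how an operator might consume that line, the sketch below parses one log line and turns each stuckTasks entry into Kafka coordinates plus a bucket/object split. The function name describeStuckTasks and the output shape are hypothetical. Splitting on the first slash assumes the <bucket>/<object-key> layout described above (bucket names cannot contain a slash); keys without a slash, like the functional-suite sample, are returned with a null bucket.

```javascript
// Hypothetical helper: parse one stuck-consumer log line (a JSON string).
function describeStuckTasks(logLine) {
    const entry = JSON.parse(logLine);
    return (entry.stuckTasks ?? []).map(task => {
        const key = task.key ?? '';
        // The first slash separates bucket from object key (assumed layout).
        const slash = key.indexOf('/');
        return {
            coordinates: `${task.topic}[${task.partition}]@${task.offset}`,
            bucket: slash === -1 ? null : key.slice(0, slash),
            objectKey: slash === -1 ? key : key.slice(slash + 1),
        };
    });
}

// Usage with an illustrative CRR-style entry (values are made up):
const sampleLine = JSON.stringify({
    queueLen: 0,
    running: 1,
    stuckTasks: [{ topic: 'backbeat-replication', partition: 3, offset: 42, key: 'my-bucket/photos/a.jpg' }],
});
console.log(describeStuckTasks(sampleLine));
```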

Ten is also sufficient because nothing beyond it is lost: running minus the number of listed
tasks says how many more were in flight, and the remainder is recoverable in two ways.
Either dump the partition from the lowest listed offset with Kafka tools (every message key
names its object), or just wait: the next partition owner re-runs every uncommitted
entry and logs each one as it goes.
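
The arithmetic in this paragraph can be sketched as follows. The helper name summarizeRemainder is hypothetical; it takes the already-parsed log entry and reports how many in-flight tasks were not listed, plus the lowest listed offset per topic and partition, which would be the starting point for a partition dump.

```javascript
// Hypothetical helper operating on a parsed stuck-consumer log entry.
function summarizeRemainder(entry) {
    const listed = entry.stuckTasks ?? [];
    // running minus the listed tasks = in-flight tasks not named in the line.
    const unlisted = Math.max(0, (entry.running ?? 0) - listed.length);
    const lowestOffsets = {};
    for (const { topic, partition, offset } of listed) {
        const id = `${topic}:${partition}`;
        if (!(id in lowestOffsets) || offset < lowestOffsets[id]) {
            lowestOffsets[id] = offset;
        }
    }
    return { unlisted, lowestOffsets };
}

// Usage with the values from the captured functional-suite line:
console.log(summarizeRemainder({
    running: 1,
    stuckTasks: [{ topic: 'backbeat-consumer-spec-shutdown', partition: 0, offset: 142, key: 'key' }],
}));
```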

@claude

claude Bot commented Jun 11, 2026

LGTM — clean observability addition. Review by Claude Code

@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from ef318c7 to 272fc1c Compare June 11, 2026 13:36
@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.15%. Comparing base (11aa0bb) to head (1a70dd7).
⚠️ Report is 1 commit behind head on development/9.0.

Additional details and impacted files

Files with missing lines Coverage Δ
lib/BackbeatConsumer.js 93.88% <100.00%> (+1.45%) ⬆️

... and 2 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.20% <ø> (ø)
Core Library 80.63% <100.00%> (-0.38%) ⬇️
Ingestion 70.30% <ø> (ø)
Lifecycle 78.63% <ø> (ø)
Oplog Populator 85.06% <ø> (ø)
Replication 58.61% <ø> (ø)
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.0    #2763      +/-   ##
===================================================
- Coverage            74.30%   74.15%   -0.16%     
===================================================
  Files                  201      201              
  Lines                13485    13491       +6     
===================================================
- Hits                 10020    10004      -16     
- Misses                3455     3477      +22     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.48% <0.00%> (-0.01%) ⬇️
api:routes 9.29% <0.00%> (-0.01%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 9.96% <14.28%> (-0.87%) ⬇️
ingestion 12.57% <14.28%> (+<0.01%) ⬆️
lib 7.57% <100.00%> (+0.07%) ⬆️
lifecycle 18.57% <14.28%> (-0.01%) ⬇️
notification 1.03% <0.00%> (-0.01%) ⬇️
replication 18.57% <14.28%> (-0.01%) ⬇️
unit 50.06% <42.85%> (-0.01%) ⬇️

@claude

claude Bot commented Jun 11, 2026

LGTM — clean, well-scoped change. The new getInFlightTasks() method is defensive (optional chaining, fallback to empty array, capped at 5, key truncation to 200 chars), the enriched log line in the drain-timeout path follows existing patterns for queueLen/running (consistent with lines 684–691), and tests cover the key scenarios (empty queue, key truncation, keyless entries, cap). No issues found.

Review by Claude Code

@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch 2 times, most recently from 94e8afe to 0126c5a Compare June 11, 2026 13:50
@claude

claude Bot commented Jun 11, 2026

LGTM — clean, well-scoped observability improvement. The new getInFlightTasks() method correctly uses optional chaining on the processing queue, caps output at 5 entries to bound log line size, and the drain-timeout error now carries actionable context (topic, groupId, queue depth, running count, stuck task coordinates). Unit tests cover the empty-queue, normal, and cap cases. No issues found.

Review by Claude Code

@claude

claude Bot commented Jun 11, 2026

LGTM — clean observability improvement. The new getInFlightTasks() method is simple and correctly handles edge cases (no queue, missing key, Buffer keys). The 5-task cap keeps log lines bounded, the drain-timeout path runs only once so there is zero per-message overhead, and the tests cover the important cases (empty queue, key stringification, cap enforcement). No async-safety, Kafka, or logging concerns.

Review by Claude Code

@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from 0126c5a to 92bf1da Compare June 11, 2026 14:10
@claude

claude Bot commented Jun 11, 2026

LGTM

Review by Claude Code

@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from 92bf1da to 690a13b Compare June 11, 2026 14:18
@claude

claude Bot commented Jun 11, 2026

LGTM — no issues found. The data shape accessed in getInFlightTasks() correctly matches what async.queue's workersList() returns and the Kafka message entries pushed at line 437. Cap at 10 entries, optional chaining on _processingQueue, key stringification pattern, and zero per-message overhead are all clean. Tests cover the three key cases (no queue, correct field mapping with Buffer keys and missing keys, and the cap). Well-scoped change with clear motivation and solid test coverage.

Review by Claude Code

@anurag4DSB anurag4DSB marked this pull request as ready for review June 11, 2026 14:23
@anurag4DSB anurag4DSB requested a review from a team June 11, 2026 15:01
@anurag4DSB anurag4DSB force-pushed the improvement/BB-787-exit-stuck-crr-consumer branch from cb398c5 to 11aa0bb Compare June 12, 2026 08:59
@anurag4DSB

Contributor Author

rebased

@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from 690a13b to 1a70dd7 Compare June 12, 2026 09:00
Comment thread lib/BackbeatConsumer.js

this._log.error('rdkafka.rebalance timeout: consumer stuck, disconnecting');
const queueLen = this._processingQueue?.length();
const running = this._processingQueue?.running();

Nit: inside the setTimeout callback at the drain timeout, this._processingQueue is guaranteed to be non-null — the early exit at line 741 (if (!this._processingQueue || this._processingQueue.idle())) already returns before the timeout is ever set. The optional chaining on this._processingQueue?.length() and this._processingQueue?.running() is harmless but unnecessary, unlike in getInFlightTasks() where it's correctly needed because the method is public.

— Claude Code
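
To make the nit concrete, here is a sketch of the control flow it describes. This is not the code in lib/BackbeatConsumer.js: the function armDrainTimeout, its parameters, and the injected `schedule` argument are hypothetical, and passing the fields as a second argument to the logger is an assumption about the logger's signature. The point is only that once the early return has run, the callback can call the queue without optional chaining.

```javascript
// Hypothetical stand-alone version of the drain-timeout arming logic.
// `schedule` defaults to setTimeout and is injectable to ease testing.
function armDrainTimeout(consumer, timeoutMs, schedule = setTimeout) {
    const queue = consumer._processingQueue;
    // Early exit: with no queue or an idle queue, no timer is ever armed.
    if (!queue || queue.idle()) {
        return null;
    }
    return schedule(() => {
        // `queue` was checked above, so plain method calls are safe here.
        consumer._log.error('rdkafka.rebalance timeout: consumer stuck, disconnecting', {
            queueLen: queue.length(),
            running: queue.running(),
        });
    }, timeoutMs);
}
```

One caveat on the design: this sketch captures the queue in a local variable, whereas code that re-reads this._processingQueue inside the callback would only be safe without optional chaining if nothing can clear that field before the timer fires.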

@claude

claude Bot commented Jun 12, 2026

LGTM — clean, well-scoped change. The new getInFlightTasks() method correctly caps output, handles missing keys, and the log enrichment in the drain timeout path gives operators exactly what they need (topic, groupId, queue depth, running count, stuck task coordinates). Tests cover all three cases (no queue, normal mapping with Buffer keys, cap enforcement). One minor nit left inline about unnecessary optional chaining in the timeout callback where _processingQueue is guaranteed non-null.

Review by Claude Code

Base automatically changed from improvement/BB-787-exit-stuck-crr-consumer to development/9.0 June 12, 2026 09:13
@bert-e

bert-e commented Jun 12, 2026

Contributor

Hello anurag4dsb,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e

bert-e commented Jun 12, 2026

Contributor

Incorrect fix version

The Fix Version/s in issue BB-788 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.28

  • 9.1.12

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Please check the Fix Version/s of BB-788, or the target
branch of this pull request.

@bert-e

bert-e commented Jun 12, 2026

Contributor

Incorrect fix version

The Fix Version/s in issue BB-788 contains:

  • 9.0.28

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.28

  • 9.1.12

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Please check the Fix Version/s of BB-788, or the target
branch of this pull request.

@bert-e

bert-e commented Jun 12, 2026

Contributor

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@anurag4DSB

Contributor Author

/create_integration_branches

@bert-e

bert-e commented Jun 12, 2026

Contributor

Integration data created

I have created the integration data for the additional destination branches.

The following branches will NOT be impacted:

  • development/7.10
  • development/7.4
  • development/7.70
  • development/8.6

You can set option create_pull_requests if you need me to create
integration pull requests in addition to integration branches, with:

@bert-e create_pull_requests

The following options are set: create_integration_branches

@bert-e

bert-e commented Jun 12, 2026

Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following options are set: create_integration_branches

@anurag4DSB

Contributor Author

/approve

@bert-e

bert-e commented Jun 12, 2026

Contributor

Build failed

The build for commit did not succeed in branch w/9.5/improvement/BB-788-stuck-consumer-logs

The following options are set: approve, create_integration_branches

@anurag4DSB

Contributor Author

/bypass_build_status

@bert-e

bert-e commented Jun 12, 2026

Contributor

I have successfully merged the changeset of this pull request
into targeted development branches:

  • ✔️ development/9.0

  • ✔️ development/9.1

  • ✔️ development/9.2

  • ✔️ development/9.3

  • ✔️ development/9.4

  • ✔️ development/9.5

The following branches have NOT changed:

  • development/7.10
  • development/7.4
  • development/7.70
  • development/8.6

This pull request did not target the following hotfix branch(es) so they
were left untouched:

  • hotfix/7.9.0
  • hotfix/7.10.2
  • hotfix/7.4.10
  • hotfix/7.10.4
  • hotfix/7.70.1
  • hotfix/7.4.8
  • hotfix/9.0.7
  • hotfix/7.4.0
  • hotfix/7.4.4
  • hotfix/7.2.0
  • hotfix/7.4.5
  • hotfix/7.4.9
  • hotfix/7.10.12
  • hotfix/7.10.1
  • hotfix/7.6.0
  • hotfix/7.8.0
  • hotfix/7.4.6
  • hotfix/7.4.3
  • hotfix/7.7.0
  • hotfix/9.0.4
  • hotfix/7.4.1
  • hotfix/7.4.2
  • hotfix/7.70.15
  • hotfix/8.2.12
  • hotfix/7.10.0
  • hotfix/7.4.7
  • hotfix/7.10.8
  • hotfix/7.70.12
  • hotfix/7.10.17
  • hotfix/7.10.3

Please check the status of the associated issue BB-788.

Goodbye anurag4dsb.

The following options are set: bypass_build_status, approve, create_integration_branches

@bert-e bert-e merged commit 1a70dd7 into development/9.0 Jun 12, 2026
22 checks passed
@bert-e bert-e deleted the improvement/BB-788-stuck-consumer-logs branch June 12, 2026 09:28

4 participants