BB-788: Log stuck task identities on rebalance timeout by anurag4DSB · Pull Request #2762 · scality/backbeat

anurag4DSB · 2026-06-11T06:17:26Z

Intent: why does this change exist?

When BB-787's self-exit fires (#2761), the logs say nothing about which
objects were stuck or whether the exit gate was even armed. Customer success debugging a
stuck CRR site from log files alone needs both: the identity of the wedged object, and the
difference between "this consumer will restart itself" and "this consumer is down until
someone acts".

System impact: what's affected, including downstream?

Replication (CRR) only — no other consumer changes behavior. Three structural
guarantees, each grep-verifiable on this diff:

- The rebalance.timeout event has exactly one listener in the repo (QueueProcessor.js); for lifecycle, GC, notification, the populator, the status processor and all of Zenko, the emit is a no-op by Node EventEmitter semantics.
- QueueProcessor.js is loaded only by the replication queue/replay processor entrypoint (extensions/replication/queueProcessor/task.js), so the logger naming and the new CRR log lines cannot execute in any other process.
- Zero per-message work is added: the hot path (millions of objects/day) is untouched; every addition runs on the stuck path, at most once per drain timeout.
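
The first guarantee rests on standard Node behavior, which a few lines can confirm. This is a minimal sketch, not backbeat code: the `consumer` object and the `{ tasks: [] }` payload shape are stand-ins chosen for illustration.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a consumer that emits 'rebalance.timeout'.
const consumer = new EventEmitter();

// No listener registered: emit() calls nothing and returns false.
const deliveredWithoutListener = consumer.emit('rebalance.timeout', { tasks: [] });

// With one listener registered, the payload is delivered synchronously
// and emit() returns true.
let received = null;
consumer.on('rebalance.timeout', payload => { received = payload; });
const deliveredWithListener = consumer.emit('rebalance.timeout', { tasks: [] });

// prints: false true 1
console.log(deliveredWithoutListener, deliveredWithListener,
    consumer.listenerCount('rebalance.timeout'));
```

A process that never registers a listener therefore pays only the cost of the `emit()` call itself, and only on the stuck path.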

The one shared surface, stated openly: the pre-existing 'consumer stuck, disconnecting'
error line in lib/BackbeatConsumer.js gains payload fields (topic, groupId, queue state,
stuck-task identities). Other consumers DO get those richer fields if they ever get
stuck — same line, same level, same moment, identical control flow; strictly additive JSON
on a rare incident-only error. That is an observability improvement for them, not a
behavior change.

Preserved behavior: what explicitly stays the same?

All hot-path code is untouched — zero per-entry work added (deliberately: millions of
objects flow through these consumers). The revoke/un-assign rebalance lines are unchanged;
drain timing remains derivable from their existing timestamps. The rebalance.timeout event
gains a payload, which its only listener uses for logging.
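
To illustrate how a single listener might consume that payload for logging only, here is a hedged sketch. The names `makeTimeoutHandler`, `selfRestartEnabled`, `exit` and `stuckTasks`, the site value, and the fatal message text are all assumptions made for the example; only the gate-disarmed message is quoted from this PR.

```javascript
const { EventEmitter } = require('events');

// Hypothetical factory; the real listener lives in QueueProcessor.js.
function makeTimeoutHandler({ log, selfRestartEnabled, site, exit }) {
    // The default parameter keeps the handler safe when the event is
    // emitted without a payload.
    return function onRebalanceTimeout(queueState = {}) {
        if (!selfRestartEnabled) {
            log.error('self-restart disabled; consumer stays disconnected ' +
                'until restarted externally', { site, ...queueState });
            return;
        }
        // Assumed wording for the fatal line; the PR does not quote it.
        log.fatal('consumer stuck, exiting', { site, ...queueState });
        exit(1);
    };
}

// In-memory logger and exit stub so the sketch runs without dependencies.
const lines = [];
const log = {
    error: (msg, fields) => lines.push({ level: 'error', msg, fields }),
    fatal: (msg, fields) => lines.push({ level: 'fatal', msg, fields }),
};
const exitCodes = [];
const consumer = new EventEmitter();
consumer.on('rebalance.timeout', makeTimeoutHandler({
    log,
    selfRestartEnabled: false,
    site: 'example-site',
    exit: code => exitCodes.push(code),
}));

consumer.emit('rebalance.timeout', { stuckTasks: [{ key: 'bucket/object' }] });
consumer.emit('rebalance.timeout'); // emitter that sends no payload

// prints: 2 0
console.log(lines.length, exitCodes.length);
```

With the gate disarmed, both emits produce an error line and nothing exits; control flow is the same with or without a payload.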

Intended change: what's different after this PR?

The stuck error line now names the in-flight tasks — topic, partition, offset and the
kafka key, which for replication entries is the object name, the one identity no existing
log line carries (slow task logs the offset but not the key). Capped at 5 tasks, keys
truncated to 200 chars. The queue processor logs an error when the timeout fires with the
gate disarmed ("self-restart disabled; consumer stays disconnected until restarted externally"), and the fatal exit
line carries queue state and site.
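
The cap and truncation described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's getInFlightTasks: the function name `snapshotInFlightTasks`, the constant names, the topic name, and the `{ data: entry }` worker shape (modeled on what async.queue's `workersList()` is assumed to return) are illustrative.

```javascript
// Limits stated in the PR description; the constant names are hypothetical.
const MAX_TASKS = 5;
const MAX_KEY_LENGTH = 200;

// `workers` is assumed to look like [{ data: kafkaEntry }, ...].
function snapshotInFlightTasks(workers) {
    if (!Array.isArray(workers)) {
        return []; // no queue yet, nothing in flight
    }
    return workers.slice(0, MAX_TASKS).map(({ data }) => {
        // Kafka keys may be Buffers, strings, or absent.
        const key = data.key === null || data.key === undefined
            ? null
            : data.key.toString().slice(0, MAX_KEY_LENGTH);
        return {
            topic: data.topic,
            partition: data.partition,
            offset: data.offset,
            key,
        };
    });
}

// Eight stuck entries with 307-character keys (topic name is made up).
const workers = Array.from({ length: 8 }, (_, i) => ({
    data: {
        topic: 'example-replication-topic',
        partition: 0,
        offset: 100 + i,
        key: Buffer.from(`bucket/${'x'.repeat(300)}`),
    },
}));
const snapshot = snapshotInFlightTasks(workers);

// prints: 5 200 100
console.log(snapshot.length, snapshot[0].key.length, snapshot[0].offset);
```

Bounding both the task count and the key length keeps the error line at a predictable size no matter how many entries are wedged.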

Verification: how do we know this worked, or how would we know if it didn't?

Unit: getInFlightTasks shape/truncation/cap tests; the stuck-exit spec asserts the fatal
payload, the gate-off error, and exit timing across all gate states. Functional: the
existing stuck-rebalance test now asserts the event payload carries the stuck-task array
with real offsets from a genuinely wedged consumer against the CI kafka image.

claude · 2026-06-11T06:18:16Z

LGTM — clean, observability-only change. The stuck-task snapshot is correctly capped and truncated, the event payload is backward-compatible (no existing listener relied on the absence of arguments), and the new error log for the gate-off path fills a real diagnostic gap. Tests cover shape, truncation, cap, and all gate states including the no-payload fallback.

Review by Claude Code

codecov · 2026-06-11T06:18:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.13%. Comparing base (8faf5cb) to head (a56e055).

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...sions/replication/queueProcessor/QueueProcessor.js | `72.96% <ø> (ø)` |
| lib/BackbeatConsumer.js | `93.33% <100.00%> (+0.09%)` ⬆️ |

... and 4 files with indirect coverage changes

| Components | Coverage Δ |
| --- | --- |
| Bucket Notification | `80.20% <ø> (ø)` |
| Core Library | `80.59% <100.00%> (+0.10%)` ⬆️ |
| Ingestion | `70.30% <ø> (ø)` |
| Lifecycle | `78.63% <ø> (ø)` |
| Oplog Populator | `85.06% <ø> (ø)` |
| Replication | `58.61% <ø> (+0.03%)` ⬆️ |
| Bucket Scanner | `85.76% <ø> (ø)` |

@@                              Coverage Diff                               @@
##           improvement/BB-787-exit-stuck-crr-consumer    #2762      +/-   ##
==============================================================================
+ Coverage                                       74.08%   74.13%   +0.05%     
==============================================================================
  Files                                             201      201              
  Lines                                           13485    13490       +5     
==============================================================================
+ Hits                                             9990    10001      +11     
+ Misses                                           3485     3479       -6     
  Partials                                           10       10

| Flag | Coverage Δ |
| --- | --- |
| api:retry | `9.48% <0.00%> (-0.01%)` ⬇️ |
| api:routes | `9.29% <0.00%> (-0.01%)` ⬇️ |
| bucket-scanner | `85.76% <ø> (ø)` |
| ft_test:queuepopulator | `9.95% <0.00%> (+1.01%)` ⬆️ |
| ingestion | `12.57% <0.00%> (-0.01%)` ⬇️ |
| lib | `7.54% <100.00%> (+0.01%)` ⬆️ |
| lifecycle | `18.57% <0.00%> (-0.01%)` ⬇️ |
| notification | `1.03% <0.00%> (-0.01%)` ⬇️ |
| replication | `18.56% <0.00%> (-0.01%)` ⬇️ |
| unit | `50.06% <33.33%> (-0.01%)` ⬇️ |

claude · 2026-06-11T06:30:46Z

LGTM — observability-only changes, no behavior change on the hot path. The stuck-task snapshot is well-bounded (5 tasks, 200-char key cap), the unarmed case now logs instead of silently returning, and both unit and functional tests cover the new event payload shape. No issues found.

Review by Claude Code

claude · 2026-06-11T07:02:00Z

extensions/replication/queueProcessor/QueueProcessor.js:462 — non-ASCII em dash in log message; rest of codebase uses ASCII. Minor consistency nit, not a bug.

Overall: clean, well-scoped observability improvement. The changes are logs-only with no hot-path impact, the cap/truncation in getInFlightTasks prevents unbounded log growth, the default parameter (queueState = {}) gracefully handles missing payloads, and the tests cover shape, truncation, cap, armed/disarmed gate, and the missing-payload edge case. LGTM with the minor nit above.

Review by Claude Code

claude · 2026-06-11T07:07:22Z

LGTM — clean, well-scoped observability improvement. The stuck-task snapshot in BackbeatConsumer is properly capped and truncated, the rebalance.timeout event payload is backward-compatible (default {}), and the QueueProcessor gate-disarmed path now surfaces actionable context. Test coverage is thorough: unit tests for shape/truncation/cap, gate-state assertions, missing-payload backward-compat, and the functional test validates real in-flight entries from a genuinely wedged consumer.

Review by Claude Code

claude · 2026-06-11T07:08:29Z

LGTM — observability-only additions on the rare stuck-consumer path. getInFlightTasks caps and truncates safely, the event payload is backwards-compatible (extra argument ignored by any listener that doesn't destructure it), the gate-disarmed error fills a real diagnostic gap, and the tests cover shape, truncation, cap, and all gate states including the missing-payload defensive path. No issues found.

Review by Claude Code

claude · 2026-06-11T09:34:31Z

LGTM — clean observability improvement, well scoped to the stuck-consumer path with no hot-path impact. getInFlightTasks is properly capped and truncated, the rebalance.timeout event payload is backwards-compatible (extra arg ignored by any listener that doesn't bind it), and test coverage is thorough (unit tests for shape/truncation/cap, functional test asserts real offsets from a wedged consumer, stuck-exit spec covers armed, disarmed, and missing-payload cases).

Review by Claude Code

When the exit added by BB-787 fires, the logs say nothing about which objects were stuck or whether the gate was armed. Customer success needs both to debug from log files alone.

Carry the in-flight task identities on the existing 'consumer stuck, disconnecting' error line and in the rebalance.timeout event payload: topic, partition, offset and the kafka key, which for replication entries names the object - the one identity no existing log line carries ('slow task' logs the offset but not the key). Cap at 5 tasks, with keys truncated to 200 chars.

Log an error when the timeout fires with the gate disarmed, so a disconnected-forever consumer is distinguishable from one about to self-heal, and carry queue state on the fatal exit line.

No per-entry work is added: everything runs on the stuck path only, at most once per drain timeout. The revoke/un-assign log lines are untouched; drain timing is derivable from their timestamps.

claude · 2026-06-11T12:40:55Z

LGTM — clean, well-scoped observability improvement. The getInFlightTasks() method correctly uses async.queue's workersList() API, handles the null-queue and null-key edge cases, and caps/truncates output for safe logging. The enriched error log on the stuck-consumer path adds the right diagnostic fields (topic, groupId, queue state, stuck task identities) without touching any hot-path code. Tests cover shape, truncation, and cap. No issues found.

Review by Claude Code

anurag4DSB · 2026-06-11T13:01:51Z

Folded into #2761 per review simplification — the logs change now rides there as its own commit (ef318c7, BB-788: Log stuck task identities on rebalance timeout). Closing.

anurag4DSB · 2026-06-11T13:11:26Z

Reopened — keeping the logs separate after all, so this PR has room to grow (the consumer logger naming should eventually cover the other backbeat services too, not just CRR).

anurag4DSB · 2026-06-11T13:12:19Z

Correction to the comment above: GitHub would not reopen this after the head branch moved, so the logs change continues as #2763 (same branch, same commit).

anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from b4e6135 to 62d2540 on June 11, 2026 06:28

anurag4DSB force-pushed the improvement/BB-787-exit-stuck-crr-consumer branch from 44f9dca to 8da164b on June 11, 2026 06:54

anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from 62d2540 to 144b19c on June 11, 2026 07:00

claude Bot reviewed Jun 11, 2026

Comment thread on extensions/replication/queueProcessor/QueueProcessor.js (outdated)

anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch 2 times, most recently from cb430fe to e48f4f4 on June 11, 2026 07:06

anurag4DSB marked this pull request as ready for review June 11, 2026 07:40

anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from e48f4f4 to 7cdd405 on June 11, 2026 09:31

anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from 7cdd405 to a56e055 on June 11, 2026 12:38

anurag4DSB closed this Jun 11, 2026

anurag4DSB changed the title ~~BB-788: Name stuck tasks and gate state in logs~~ BB-788: Log stuck task identities on rebalance timeout Jun 11, 2026

anurag4DSB wants to merge 1 commit into improvement/BB-787-exit-stuck-crr-consumer from improvement/BB-788-stuck-consumer-logs
