Skip to content

BB-788: Log stuck task identities on rebalance timeout#2762

Closed
anurag4DSB wants to merge 1 commit into
improvement/BB-787-exit-stuck-crr-consumerfrom
improvement/BB-788-stuck-consumer-logs
Closed

BB-788: Log stuck task identities on rebalance timeout#2762
anurag4DSB wants to merge 1 commit into
improvement/BB-787-exit-stuck-crr-consumerfrom
improvement/BB-788-stuck-consumer-logs

Conversation

@anurag4DSB

@anurag4DSB anurag4DSB commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Intent: why does this change exist?

When BB-787's self-exit fires (#2761), the logs say nothing about which
objects were stuck or whether the exit gate was even armed. Customer success debugging a
stuck CRR site from log files alone needs both: the identity of the wedged object, and the
difference between "this consumer will restart itself" and "this consumer is down until
someone acts".

System impact: what's affected, including downstream?

Replication (CRR) only — no other consumer changes behavior. Three structural
guarantees, each grep-verifiable on this diff:

  • The rebalance.timeout event has exactly one listener in the repo
    (QueueProcessor.js); for lifecycle, GC, notification, the populator, the status
    processor and all of Zenko, the emit is a no-op by Node EventEmitter semantics.
  • QueueProcessor.js is loaded only by the replication queue/replay processor
    entrypoint (extensions/replication/queueProcessor/task.js) — the logger naming and the
    new CRR log lines cannot execute in any other process.
  • Zero per-message work is added — the hot path (millions of objects/day) is untouched;
    every addition runs on the stuck path, at most once per drain timeout.

The one shared surface, stated openly: the pre-existing 'consumer stuck, disconnecting'
error line in lib/BackbeatConsumer.js gains payload fields (topic, groupId, queue state,
stuck-task identities). Other consumers DO get those richer fields if they ever get
stuck — same line, same level, same moment, identical control flow; strictly additive JSON
on a rare incident-only error. That is an observability improvement for them, not a
behavior change.

Preserved behavior: what explicitly stays the same?

All hot-path code is untouched — zero per-entry work added (deliberately: millions of
objects flow through these consumers). The revoke/un-assign rebalance lines are unchanged;
drain timing remains derivable from their existing timestamps. The rebalance.timeout event
gains a payload, which its only listener uses for logging.

Intended change: what's different after this PR?

The stuck error line now names the in-flight tasks — topic, partition, offset and the
kafka key, which for replication entries is the object name, the one identity no existing
log line carries (slow task logs the offset but not the key). Capped at 5 tasks, keys
truncated to 200 chars. The queue processor logs an error when the timeout fires with the
gate disarmed ("self-restart disabled; consumer stays disconnected until restarted externally"), and the fatal exit
line carries queue state and site.

Verification: how do we know this worked, or how would we know if it didn't?

Unit: getInFlightTasks shape/truncation/cap tests; the stuck-exit spec asserts the fatal
payload, the gate-off error, and exit timing across all gate states. Functional: the
existing stuck-rebalance test now asserts the event payload carries the stuck-task array
with real offsets from a genuinely wedged consumer against the CI kafka image.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — clean, observability-only change. The stuck-task snapshot is correctly capped and truncated, the event payload is backward-compatible (no existing listener relied on the absence of arguments), and the new error log for the gate-off path fills a real diagnostic gap. Tests cover shape, truncation, cap, and all gate states including the no-payload fallback.

Review by Claude Code

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.13%. Comparing base (8faf5cb) to head (a56e055).

Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
...sions/replication/queueProcessor/QueueProcessor.js 72.96% <ø> (ø)
lib/BackbeatConsumer.js 93.33% <100.00%> (+0.09%) ⬆️

... and 4 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.20% <ø> (ø)
Core Library 80.59% <100.00%> (+0.10%) ⬆️
Ingestion 70.30% <ø> (ø)
Lifecycle 78.63% <ø> (ø)
Oplog Populator 85.06% <ø> (ø)
Replication 58.61% <ø> (+0.03%) ⬆️
Bucket Scanner 85.76% <ø> (ø)
@@                              Coverage Diff                               @@
##           improvement/BB-787-exit-stuck-crr-consumer    #2762      +/-   ##
==============================================================================
+ Coverage                                       74.08%   74.13%   +0.05%     
==============================================================================
  Files                                             201      201              
  Lines                                           13485    13490       +5     
==============================================================================
+ Hits                                             9990    10001      +11     
+ Misses                                           3485     3479       -6     
  Partials                                           10       10              
Flag Coverage Δ
api:retry 9.48% <0.00%> (-0.01%) ⬇️
api:routes 9.29% <0.00%> (-0.01%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 9.95% <0.00%> (+1.01%) ⬆️
ingestion 12.57% <0.00%> (-0.01%) ⬇️
lib 7.54% <100.00%> (+0.01%) ⬆️
lifecycle 18.57% <0.00%> (-0.01%) ⬇️
notification 1.03% <0.00%> (-0.01%) ⬇️
replication 18.56% <0.00%> (-0.01%) ⬇️
unit 50.06% <33.33%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from b4e6135 to 62d2540 Compare June 11, 2026 06:28
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — observability-only changes, no behavior change on the hot path. The stuck-task snapshot is well-bounded (5 tasks, 200-char key cap), the unarmed case now logs instead of silently returning, and both unit and functional tests cover the new event payload shape. No issues found.

Review by Claude Code

@anurag4DSB anurag4DSB force-pushed the improvement/BB-787-exit-stuck-crr-consumer branch from 44f9dca to 8da164b Compare June 11, 2026 06:54
@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from 62d2540 to 144b19c Compare June 11, 2026 07:00
Comment thread extensions/replication/queueProcessor/QueueProcessor.js Outdated
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
  • extensions/replication/queueProcessor/QueueProcessor.js:462 — non-ASCII em dash in log message; rest of codebase uses ASCII. Minor consistency nit, not a bug.

Overall: clean, well-scoped observability improvement. The changes are logs-only with no hot-path impact, the cap/truncation in getInFlightTasks prevents unbounded log growth, the default parameter (queueState = {}) gracefully handles missing payloads, and the tests cover shape, truncation, cap, armed/disarmed gate, and the missing-payload edge case. LGTM with the minor nit above.

Review by Claude Code

@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch 2 times, most recently from cb430fe to e48f4f4 Compare June 11, 2026 07:06
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — clean, well-scoped observability improvement. The stuck-task snapshot in BackbeatConsumer is properly capped and truncated, the rebalance.timeout event payload is backward-compatible (default {}), and the QueueProcessor gate-disarmed path now surfaces actionable context. Test coverage is thorough: unit tests for shape/truncation/cap, gate-state assertions, missing-payload backward-compat, and the functional test validates real in-flight entries from a genuinely wedged consumer.

Review by Claude Code

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — observability-only additions on the rare stuck-consumer path. getInFlightTasks caps and truncates safely, the event payload is backwards-compatible (extra argument ignored by any listener that doesn't destructure it), the gate-disarmed error fills a real diagnostic gap, and the tests cover shape, truncation, cap, and all gate states including the missing-payload defensive path. No issues found.

Review by Claude Code

@anurag4DSB anurag4DSB marked this pull request as ready for review June 11, 2026 07:40
@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from e48f4f4 to 7cdd405 Compare June 11, 2026 09:31
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — clean observability improvement, well scoped to the stuck-consumer path with no hot-path impact. getInFlightTasks is properly capped and truncated, the rebalance.timeout event payload is backwards-compatible (extra arg ignored by any listener that doesn't bind it), and test coverage is thorough (unit tests for shape/truncation/cap, functional test asserts real offsets from a wedged consumer, stuck-exit spec covers armed, disarmed, and missing-payload cases).

Review by Claude Code

When the exit added by BB-787 fires, the logs said nothing about
which objects were stuck or whether the gate was armed. Customer
success needs both to debug from log files alone.

Carry the in-flight task identities on the existing 'consumer stuck,
disconnecting' error line and in the rebalance.timeout event payload:
topic, partition, offset and the kafka key, which for replication
entries names the object - the one identity no existing log line
carries ('slow task' logs offset but not key). Cap at 5 tasks, key
truncated to 200 chars. Log an error when the timeout fires with the
gate disarmed, so a disconnected-forever consumer is distinguishable
from one about to self-heal, and carry queue state on the fatal exit
line.

No per-entry work is added: everything runs on the stuck path only,
at most once per drain timeout. The revoke/un-assign log lines are
untouched; drain timing is derivable from their timestamps.
@anurag4DSB anurag4DSB force-pushed the improvement/BB-788-stuck-consumer-logs branch from 7cdd405 to a56e055 Compare June 11, 2026 12:38
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — clean, well-scoped observability improvement. The getInFlightTasks() method correctly uses async.queue's workersList() API, handles the null-queue and null-key edge cases, and caps/truncates output for safe logging. The enriched error log on the stuck-consumer path adds the right diagnostic fields (topic, groupId, queue state, stuck task identities) without touching any hot-path code. Tests cover shape, truncation, and cap. No issues found.

Review by Claude Code

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

Folded into #2761 per review simplification — the logs change now rides there as its own commit (ef318c7, BB-788: Log stuck task identities on rebalance timeout). Closing.

@anurag4DSB anurag4DSB closed this Jun 11, 2026
@anurag4DSB anurag4DSB changed the title BB-788: Name stuck tasks and gate state in logs BB-788: Log stuck task identities on rebalance timeout Jun 11, 2026
@anurag4DSB

Copy link
Copy Markdown
Contributor Author

Reopened — keeping the logs separate after all, so this PR has room to grow (the consumer logger naming should eventually cover the other backbeat services too, not just CRR).

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

Correction to the comment above: GitHub would not reopen this after the head branch moved, so the logs change continues as #2763 (same branch, same commit).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant