BB-788: Log stuck task identities on rebalance timeout#2762
Conversation
|
LGTM — clean, observability-only change. The stuck-task snapshot is correctly capped and truncated, the event payload is backward-compatible (no existing listener relied on the absence of arguments), and the new error log for the gate-off path fills a real diagnostic gap. Tests cover shape, truncation, cap, and all gate states including the no-payload fallback. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
... and 4 files with indirect coverage changes
@@ Coverage Diff @@
## improvement/BB-787-exit-stuck-crr-consumer #2762 +/- ##
==============================================================================
+ Coverage 74.08% 74.13% +0.05%
==============================================================================
Files 201 201
Lines 13485 13490 +5
==============================================================================
+ Hits 9990 10001 +11
+ Misses 3485 3479 -6
Partials 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
b4e6135 to
62d2540
Compare
|
LGTM — observability-only changes, no behavior change on the hot path. The stuck-task snapshot is well-bounded (5 tasks, 200-char key cap), the unarmed case now logs instead of silently returning, and both unit and functional tests cover the new event payload shape. No issues found. |
44f9dca to
8da164b
Compare
62d2540 to
144b19c
Compare
Overall: clean, well-scoped observability improvement. The changes are logs-only with no hot-path impact, the cap/truncation in Review by Claude Code |
cb430fe to
e48f4f4
Compare
|
LGTM — clean, well-scoped observability improvement. The stuck-task snapshot in BackbeatConsumer is properly capped and truncated, the rebalance.timeout event payload is backward-compatible (default |
|
LGTM — observability-only additions on the rare stuck-consumer path. getInFlightTasks caps and truncates safely, the event payload is backwards-compatible (extra argument ignored by any listener that doesn't destructure it), the gate-disarmed error fills a real diagnostic gap, and the tests cover shape, truncation, cap, and all gate states including the missing-payload defensive path. No issues found. |
e48f4f4 to
7cdd405
Compare
|
LGTM — clean observability improvement, well scoped to the stuck-consumer path with no hot-path impact. getInFlightTasks is properly capped and truncated, the rebalance.timeout event payload is backwards-compatible (extra arg ignored by any listener that doesn't bind it), and test coverage is thorough (unit tests for shape/truncation/cap, functional test asserts real offsets from a wedged consumer, stuck-exit spec covers armed, disarmed, and missing-payload cases). |
When the exit added by BB-787 fires, the logs said nothing about which objects were stuck or whether the gate was armed. Customer success needs both to debug from log files alone. Carry the in-flight task identities on the existing 'consumer stuck, disconnecting' error line and in the rebalance.timeout event payload: topic, partition, offset and the kafka key, which for replication entries names the object - the one identity no existing log line carries ('slow task' logs offset but not key). Cap at 5 tasks, key truncated to 200 chars. Log an error when the timeout fires with the gate disarmed, so a disconnected-forever consumer is distinguishable from one about to self-heal, and carry queue state on the fatal exit line. No per-entry work is added: everything runs on the stuck path only, at most once per drain timeout. The revoke/un-assign log lines are untouched; drain timing is derivable from their timestamps.
7cdd405 to
a56e055
Compare
|
LGTM — clean, well-scoped observability improvement. The getInFlightTasks() method correctly uses async.queue's workersList() API, handles the null-queue and null-key edge cases, and caps/truncates output for safe logging. The enriched error log on the stuck-consumer path adds the right diagnostic fields (topic, groupId, queue state, stuck task identities) without touching any hot-path code. Tests cover shape, truncation, and cap. No issues found. |
|
Reopened — keeping the logs separate after all, so this PR has room to grow (the consumer logger naming should eventually cover the other backbeat services too, not just CRR). |
|
Correction to the comment above: GitHub would not reopen this after the head branch moved, so the logs change continues as #2763 (same branch, same commit). |
Intent: why does this change exist?
When BB-787's self-exit fires (#2761), the logs say nothing about which
objects were stuck or whether the exit gate was even armed. Customer success debugging a
stuck CRR site from log files alone needs both: the identity of the wedged object, and the
difference between "this consumer will restart itself" and "this consumer is down until
someone acts".
System impact: what's affected, including downstream?
Replication (CRR) only — no other consumer changes behavior. Three structural
guarantees, each grep-verifiable on this diff:
rebalance.timeoutevent has exactly one listener in the repo(
QueueProcessor.js); for lifecycle, GC, notification, the populator, the statusprocessor and all of Zenko, the emit is a no-op by Node EventEmitter semantics.
QueueProcessor.jsis loaded only by the replication queue/replay processorentrypoint (
extensions/replication/queueProcessor/task.js) — the logger naming and thenew CRR log lines cannot execute in any other process.
every addition runs on the stuck path, at most once per drain timeout.
The one shared surface, stated openly: the pre-existing
'consumer stuck, disconnecting'error line in
lib/BackbeatConsumer.jsgains payload fields (topic, groupId, queue state,stuck-task identities). Other consumers DO get those richer fields if they ever get
stuck — same line, same level, same moment, identical control flow; strictly additive JSON
on a rare incident-only error. That is an observability improvement for them, not a
behavior change.
Preserved behavior: what explicitly stays the same?
All hot-path code is untouched — zero per-entry work added (deliberately: millions of
objects flow through these consumers). The revoke/un-assign rebalance lines are unchanged;
drain timing remains derivable from their existing timestamps. The rebalance.timeout event
gains a payload, which its only listener uses for logging.
Intended change: what's different after this PR?
The stuck error line now names the in-flight tasks — topic, partition, offset and the
kafka key, which for replication entries is the object name, the one identity no existing
log line carries (
slow tasklogs the offset but not the key). Capped at 5 tasks, keystruncated to 200 chars. The queue processor logs an error when the timeout fires with the
gate disarmed ("self-restart disabled; consumer stays disconnected until restarted externally"), and the fatal exit
line carries queue state and site.
Verification: how do we know this worked, or how would we know if it didn't?
Unit: getInFlightTasks shape/truncation/cap tests; the stuck-exit spec asserts the fatal
payload, the gate-off error, and exit timing across all gate states. Functional: the
existing stuck-rebalance test now asserts the event payload carries the stuck-task array
with real offsets from a genuinely wedged consumer against the CI kafka image.