BB-787: Exit consumer stuck past rebalance drain timeout #2761
Codecov Report
❌ Patch coverage is
... and 6 files with indirect coverage changes

@@                 Coverage Diff                  @@
##           development/9.0    #2761      +/-   ##
===================================================
+ Coverage            74.13%   74.30%   +0.16%
===================================================
  Files                  201      201
  Lines                13482    13485       +3
===================================================
+ Hits                  9995    10020      +25
+ Misses                3477     3455      -22
  Partials                10       10

Flags with carried forward coverage won't be shown.
LGTM — clean, minimal change. Event pattern is well-designed (generic signal from BackbeatConsumer, handler logic in QueueProcessor), env var gate makes this a no-op when unset, hard exit is justified over SIGTERM since the graceful-stop path would hang on the stuck tasks, and test coverage is thorough (unit tests for all three gate states with fake timers, functional test verifies the event fires in the real stuck-consumer scenario). No issues found. |
ping
Request integration branches
Waiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the
/create_integration_branches
Integration data created
I have created the integration data for the additional destination branches.
The following branches will NOT be impacted:
You can set option
The following options are set: create_integration_branches
Waiting for approval
The following approvals are needed before I can proceed with the merge:
The following options are set: create_integration_branches
Force-pushed: 44f9dca to 8da164b
History mismatch
Merge commit #44f9dca4b63bb5ead8c77e82d485d80cf629f374 on the integration branch It is likely due to a rebase of the branch Please use the
The following options are set: create_integration_branches
LGTM — clean, well-scoped change. The new rebalance.timeout event in BackbeatConsumer is additive and non-breaking. The QueueProcessor handler is properly gated behind the env var with strict equality check, the hard exit (vs SIGTERM) is well-justified given the stuck-task deadlock, and the 1s log-flush grace matches the pattern. Unit tests cover all three gate states with proper env backup/restore, and the functional test fix (clearInterval + try/catch) is a solid improvement. No issues found. Review by Claude Code |
Do not review, making changes.
    { topic: this._topic, groupId: this._groupId });
// grace period so the fatal line flushes to stdout
// before the exit
setTimeout(() => process.exit(1), 1000);
The setTimeout reference returned here is never stored, so it cannot be cleared if the consumer is closed or re-subscribed before the 1 s grace elapses. In the production code path this is moot (the process is about to die), but in tests or if this code is ever reused in a graceful-shutdown variant, the dangling timer could fire unexpectedly. Consider assigning it to an instance field and clearing it in close().
— Claude Code
So the handle is intentionally not stored: clearing it (e.g. from a close() racing the crash inside the 1s grace) would resurrect exactly the zombie state
Force-pushed: ef318c7 to 07e2dad
LGTM — clean, well-scoped change. The env-var gate, strict |
/reset
Reset complete
I have successfully deleted this pull request's integration branches.
The following options are set: create_integration_branches
Build failed
The build for commit did not succeed in branch improvement/BB-787-exit-stuck-crr-consumer
The following options are set: approve, create_integration_branches
/reset
Reset complete
I have successfully deleted this pull request's integration branches.
The following options are set: create_integration_branches
Conflict
A conflict has been raised during the creation of I have not created the integration branch. Here are the steps to resolve this conflict:
git fetch
git checkout -B w/9.1/improvement/BB-787-exit-stuck-crr-consumer origin/development/9.1
git merge origin/improvement/BB-787-exit-stuck-crr-consumer
# <intense conflict resolution>
git commit
git push -u origin w/9.1/improvement/BB-787-exit-stuck-crr-consumer
The following options are set: create_integration_branches
Added a new commit just to bump package.json.
LGTM — env-gated hard exit after rebalance drain timeout is well-scoped, follows the existing CRASH_ON_BATCH_TIMEOUT pattern, and both gate-on / gate-off paths are covered by functional tests with a properly stubbed process.exit. No issues found. |
When a task wedges past the rebalance drain timeout, the consumer disconnects itself so that, in the design's words, the healthcheck fails and the process gets restarted. On S3C nothing performs that restart: supervisord only restarts a process that exits, so the worker stays down while its stuck in-flight tasks can still complete late and write over the partition's new owner (ghost objects, orphaned parts). Emit a rebalance.timeout event at the disconnect point (no listener anywhere except the replication queue processor, so no behavior change for other consumers or Zenko) and, when REPLICATION_QUEUE_PROCESSOR_CRASH_ON_REBALANCE_TIMEOUT is exactly "true", log a fatal line and hard-exit so the supervisor restarts the process and it rejoins the group. Hard exit rather than SIGTERM: the graceful stop path waits on the very tasks that are stuck. Federation sets the env var per site-processor program.
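For illustration, the split described above (a generic signal from the consumer, the exit decision in its single listener) could look roughly like this. It is a minimal sketch, not the code from this PR: SketchConsumer, handleDrainTimeout, exitOnRebalanceTimeout, the deps parameter and the log text are hypothetical names introduced here. Only the rebalance.timeout event name, the env var name and the 1s grace come from the description.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for the shared consumer. It only signals that the
// drain timeout elapsed; it does not decide what to do about it.
class SketchConsumer extends EventEmitter {
    handleDrainTimeout() {
        // (the real consumer disconnects from the group at this point)
        // emit() returns false when nobody listens, so services that do
        // not subscribe see no behavior change.
        return this.emit('rebalance.timeout');
    }
}

// Hypothetical stand-in for the replication queue processor wiring.
// `deps` is an assumption of this sketch: it lets the handler be exercised
// without killing the process or waiting on a real timer.
function exitOnRebalanceTimeout(consumer, deps = {}) {
    const env = deps.env || process.env;
    const log = deps.log || console.error;
    const exit = deps.exit || (code => process.exit(code));
    const schedule = deps.schedule || setTimeout;

    consumer.on('rebalance.timeout', () => {
        // Strict comparison: only the exact string "true" arms the exit.
        if (env.REPLICATION_QUEUE_PROCESSOR_CRASH_ON_REBALANCE_TIMEOUT !== 'true') {
            return;
        }
        log('fatal: consumer stuck past rebalance drain timeout, exiting');
        // Hard exit after a short grace so the fatal line can flush.
        schedule(() => exit(1), 1000);
    });
}
```

With no listener attached, handleDrainTimeout() returns false and nothing else happens, which is the "no behavior change for other consumers" property the message relies on.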
Force-pushed: cb398c5 to 11aa0bb
History mismatch
Merge commit #8da164b2970190406592462c139024210dcd8d57 on the integration branch It is likely due to a rebase of the branch Please use the
The following options are set: create_integration_branches
/reset
squashed and pushed, no changes
Reset complete
I have successfully deleted this pull request's integration branches.
The following options are set: create_integration_branches
Conflict
A conflict has been raised during the creation of I have not created the integration branch. Here are the steps to resolve this conflict:
git fetch
git checkout -B w/9.1/improvement/BB-787-exit-stuck-crr-consumer origin/development/9.1
git merge origin/improvement/BB-787-exit-stuck-crr-consumer
# <intense conflict resolution>
git commit
git push -u origin w/9.1/improvement/BB-787-exit-stuck-crr-consumer
The following options are set: create_integration_branches
LGTM — clean, well-scoped change. The env-gated hard exit follows the established |
ping
Quick note, I am gonna add a bypass build status on this as discussed with @nicolas2bert for other 9.0 targeting PRs, and same for #2763
/approve
Build failed
The build for commit did not succeed in branch improvement/BB-787-exit-stuck-crr-consumer
The following options are set: approve, create_integration_branches
/bypass_build_status
I have successfully merged the changeset of this pull request
The following branches have NOT changed:
This pull request did not target the following hotfix branch(es) so they
Please check the status of the associated issue BB-787. Goodbye anurag4dsb.
The following options are set: bypass_build_status, approve, create_integration_branches
Note: 1 failing test on this branch is a known failure and is not related to this PR.
On S3C nothing restarts a backbeat process whose consumer wedged past the rebalance
drain timeout: the lib disconnects it so the healthcheck fails, but supervisord only
restarts on exit, so the worker stays down, and its stuck tasks can still complete
late and write over the partition's new owner (ghost objects). This adds an env-gated
hard exit right after that disconnect, in the shared consumer so every backbeat
service is covered: when CRASH_ON_REBALANCE_TIMEOUT === 'true', log a fatal and
process.exit(1) after a 1s log-flush grace. Same pattern as the existing
CRASH_ON_BATCH_TIMEOUT in LogReader.

The 1s grace leaves a window where a wedged task could wake and fire one last write.
That is benign: the next partition owner has not even finished the rebalance yet, so
its redo of the same uncommitted entry always lands later and supersedes the write.
The damaging ordering (a stale write landing after the redo) needs the stuck worker
alive minutes later, which is exactly what the exit removes.
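
The ordering argument above can be made concrete with a toy model. This is only a sketch under the stated assumption that the destination is last-writer-wins. finalWriter and every timing below are hypothetical values chosen for illustration, not measurements from a real deployment.

```javascript
// Assumption of this sketch: the store is last-writer-wins, so whichever
// write lands latest determines the final state of the object.
function finalWriter(writes) {
    return writes.reduce((latest, w) => (w.atMs >= latest.atMs ? w : latest)).by;
}

// Illustrative timings only, in milliseconds after the drain timeout fires.
// Hypothetical: the new owner redoes the entry once the rebalance completes.
const REDO_AT_MS = 30000;

// Gate armed: the stuck worker can write at most within the 1s grace,
// so its write always lands before the redo.
const withExit = finalWriter([
    { by: 'stuck-worker', atMs: 800 },
    { by: 'new-owner', atMs: REDO_AT_MS },
]);

// Gate unset: the worker stays alive and its task completes minutes later,
// after the redo, which is the damaging ordering.
const withoutExit = finalWriter([
    { by: 'new-owner', atMs: REDO_AT_MS },
    { by: 'stuck-worker', atMs: 180000 },
]);

console.log(withExit);    // new-owner
console.log(withoutExit); // stuck-worker
```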
Inert unless the env var is set: never set on Zenko/K8s (the liveness probe already
handles this state), set container-wide on S3C by scality/Federation#6929. Hard exit
rather than SIGTERM because the graceful stop path waits on the very tasks that are
stuck; strict === 'true' so the var can be explicitly set to false to disarm.

Both endings of the drain window are pinned by functional tests against a real Kafka
broker: gate unset → disconnect, process stays alive; gate armed → exit(1) fires after
the fatal. The tests stub process.exit before arming the gate, so the test process
can never actually die.
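
As a rough illustration of the "stub first, arm second" ordering, here is a dependency-free sketch. It is not the PR's test code: onRebalanceTimeout, withStubbedExit and the schedule parameter are hypothetical, and the real tests may well use a stubbing library and fake timers instead. Only the env var name, the strict comparison and the exit code come from the description.

```javascript
// Hypothetical stand-in for the gated exit in the shared consumer.
// `schedule` is an assumption of this sketch so the grace callback can be
// run synchronously instead of waiting 1s.
function onRebalanceTimeout(schedule = setTimeout) {
    if (process.env.CRASH_ON_REBALANCE_TIMEOUT !== 'true') {
        return false; // gate unset or disarmed: stay alive
    }
    console.error('fatal: consumer stuck past rebalance drain timeout');
    // process.exit is looked up when the callback runs, so a stub
    // installed beforehand is the one that gets called.
    schedule(() => process.exit(1), 1000);
    return true;
}

// Runs `run` with process.exit replaced by a recorder, then restores
// both the env var and the real process.exit.
function withStubbedExit(envValue, run) {
    const realExit = process.exit;
    const savedEnv = process.env.CRASH_ON_REBALANCE_TIMEOUT;
    const exitCodes = [];
    // Stub BEFORE arming: the gate is never armed while the real
    // process.exit is reachable.
    process.exit = code => { exitCodes.push(code); };
    try {
        if (envValue === undefined) {
            delete process.env.CRASH_ON_REBALANCE_TIMEOUT;
        } else {
            process.env.CRASH_ON_REBALANCE_TIMEOUT = envValue;
        }
        run();
    } finally {
        if (savedEnv === undefined) {
            delete process.env.CRASH_ON_REBALANCE_TIMEOUT;
        } else {
            process.env.CRASH_ON_REBALANCE_TIMEOUT = savedEnv;
        }
        process.exit = realExit;
    }
    return exitCodes;
}

// The callback is run synchronously so it fires while the stub is in place.
const runNow = fn => fn();
console.log(withStubbedExit('true', () => onRebalanceTimeout(runNow)));    // [ 1 ]
console.log(withStubbedExit(undefined, () => onRebalanceTimeout(runNow))); // []
```

Note that with the default setTimeout the callback would fire after the stub is restored, so a test built on this sketch needs either a synchronous scheduler as above or fake timers that are advanced before restoring.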