Skip to content

BB-787: Exit consumer stuck past rebalance drain timeout#2761

Merged
bert-e merged 2 commits into
development/9.0from
improvement/BB-787-exit-stuck-crr-consumer
Jun 12, 2026
Merged

BB-787: Exit consumer stuck past rebalance drain timeout#2761
bert-e merged 2 commits into
development/9.0from
improvement/BB-787-exit-stuck-crr-consumer

Conversation

@anurag4DSB

@anurag4DSB anurag4DSB commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Note: 1 test failing on this branch is a known and not related to this PR

On S3C nothing restarts a backbeat process whose consumer wedged past the rebalance
drain timeout: the lib disconnects it so the healthcheck fails, but supervisord only
restarts on exit — so the worker stays down, and its stuck tasks can still complete
late and write over the partition's new owner (ghost objects). This adds an env-gated
hard exit right after that disconnect, in the shared consumer so every backbeat
service is covered: when CRASH_ON_REBALANCE_TIMEOUT === 'true', log a fatal and
process.exit(1) after a 1s log-flush grace. Same pattern as the existing
CRASH_ON_BATCH_TIMEOUT in LogReader.

The 1s grace leaves a window where a wedged task could wake and fire one last write.
That is benign: the next partition owner has not even finished the rebalance yet, so
its redo of the same uncommitted entry always lands later and supersedes the write.
The damaging ordering (a stale write landing after the redo) needs the stuck worker
alive minutes later, which is exactly what the exit removes.

Inert unless the env var is set: never set on Zenko/K8s (the liveness probe already
handles this state), set container-wide on S3C by scality/Federation#6929. Hard exit
rather than SIGTERM because the graceful stop path waits on the very tasks that are
stuck; strict === 'true' so the var can be explicitly set to false to disarm.

Both endings of the drain window are pinned by functional tests against a real kafka
broker: gate unset → disconnect, process stays alive; gate armed → exit(1) fires after
the fatal. The tests stub process.exit before arming the gate, so the test process
can never actually die.

@bert-e

bert-e commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Hello anurag4dsb,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e

bert-e commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Incorrect fix version

The Fix Version/s in issue BB-787 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.28

  • 9.1.12

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Please check the Fix Version/s of BB-787, or the target
branch of this pull request.

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.30%. Comparing base (39caff4) to head (11aa0bb).
⚠️ Report is 2 commits behind head on development/9.0.

Files with missing lines Patch % Lines
lib/BackbeatConsumer.js 0.00% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
lib/BackbeatConsumer.js 92.43% <0.00%> (-2.94%) ⬇️

... and 6 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.20% <ø> (ø)
Core Library 81.01% <0.00%> (+0.40%) ⬆️
Ingestion 70.30% <ø> (ø)
Lifecycle 78.63% <ø> (ø)
Oplog Populator 85.06% <ø> (ø)
Replication 58.61% <ø> (ø)
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.0    #2761      +/-   ##
===================================================
+ Coverage            74.13%   74.30%   +0.16%     
===================================================
  Files                  201      201              
  Lines                13482    13485       +3     
===================================================
+ Hits                  9995    10020      +25     
+ Misses                3477     3455      -22     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.48% <0.00%> (-0.01%) ⬇️
api:routes 9.29% <0.00%> (-0.01%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 10.82% <0.00%> (+1.88%) ⬆️
ingestion 12.57% <0.00%> (-0.01%) ⬇️
lifecycle 18.58% <0.00%> (+<0.01%) ⬆️
notification 1.03% <0.00%> (-0.01%) ⬇️
replication 18.57% <0.00%> (-0.09%) ⬇️
unit 50.06% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — clean, minimal change. Event pattern is well-designed (generic signal from BackbeatConsumer, handler logic in QueueProcessor), env var gate makes this a no-op when unset, hard exit is justified over SIGTERM since the graceful-stop path would hang on the stuck tasks, and test coverage is thorough (unit tests for all three gate states with fake timers, functional test verifies the event fires in the real stuck-consumer scenario). No issues found.

Review by Claude Code

@anurag4DSB anurag4DSB marked this pull request as ready for review June 11, 2026 05:26
@anurag4DSB

Copy link
Copy Markdown
Contributor Author

ping

@bert-e

bert-e commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

/create_integration_branches

@bert-e

bert-e commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Integration data created

I have created the integration data for the additional destination branches.

The following branches will NOT be impacted:

  • development/7.10
  • development/7.4
  • development/7.70
  • development/8.6

You can set option create_pull_requests if you need me to create
integration pull requests in addition to integration branches, with:

@bert-e create_pull_requests

The following options are set: create_integration_branches

@bert-e

bert-e commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following options are set: create_integration_branches

Comment thread lib/BackbeatConsumer.js Outdated
Comment thread extensions/replication/queueProcessor/QueueProcessor.js Outdated
Comment thread lib/BackbeatConsumer.js Outdated
@anurag4DSB anurag4DSB force-pushed the improvement/BB-787-exit-stuck-crr-consumer branch from 44f9dca to 8da164b Compare June 11, 2026 06:54
@bert-e

bert-e commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

History mismatch

Merge commit #44f9dca4b63bb5ead8c77e82d485d80cf629f374 on the integration branch
w/9.1/improvement/BB-787-exit-stuck-crr-consumer is merging a branch which is neither the current
branch improvement/BB-787-exit-stuck-crr-consumer nor the development branch
development/9.1.

It is likely due to a rebase of the branch improvement/BB-787-exit-stuck-crr-consumer and the
merge is not possible until all related w/* branches are deleted or updated.

Please use the reset command to have me reinitialize these branches.

The following options are set: create_integration_branches

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — clean, well-scoped change. The new rebalance.timeout event in BackbeatConsumer is additive and non-breaking. The QueueProcessor handler is properly gated behind the env var with strict equality check, the hard exit (vs SIGTERM) is well-justified given the stuck-task deadlock, and the 1s log-flush grace matches the pattern. Unit tests cover all three gate states with proper env backup/restore, and the functional test fix (clearInterval + try/catch) is a solid improvement. No issues found.

Review by Claude Code

@anurag4DSB anurag4DSB requested review from a team and francoisferrand June 11, 2026 09:32
Comment thread extensions/replication/queueProcessor/QueueProcessor.js Outdated
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
  • QueueProcessor.js:456 — Both _consumer and _dataMoverConsumer go through _createConsumer, so _onRebalanceTimeout can fire twice (two fatal logs, two pending process.exit calls). Consider guarding with a flag to fire once.

    Overall: clean, well-scoped change. The env var gate, hard exit rationale, event plumbing, and test coverage are all solid.

    Review by Claude Code

@anurag4DSB anurag4DSB marked this pull request as draft June 11, 2026 09:47
@anurag4DSB

Copy link
Copy Markdown
Contributor Author

Do not review, making changes.

@anurag4DSB anurag4DSB changed the title BB-787: Exit CRR consumer stuck past drain timeout BB-787: Exit consumer stuck past rebalance drain timeout Jun 11, 2026
Comment thread tests/functional/lib/BackbeatConsumer.js
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
  • tests/functional/lib/BackbeatConsumer.js:358 — The original "should fail healthcheck on rebalance timeout" test was replaced rather than preserved. The base case (timeout → disconnect → healthcheck failure, without env var) is no longer tested. The PR description's claim of three gate-state tests (armed, unset, "false") is only partially fulfilled: only the armed state is covered, and no separate unit test file was added.

    - Note: The PR description references REPLICATION_QUEUE_PROCESSOR_CRASH_ON_REBALANCE_TIMEOUT and a rebalance.timeout event, but the code uses CRASH_ON_REBALANCE_TIMEOUT with no event emission. Consider updating the description to match the implementation.

    Review by Claude Code

Comment thread lib/BackbeatConsumer.js
{ topic: this._topic, groupId: this._groupId });
// grace period so the fatal line flushes to stdout
// before the exit
setTimeout(() => process.exit(1), 1000);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setTimeout reference returned here is never stored, so it cannot be cleared if the consumer is closed or re-subscribed before the 1 s grace elapses. In the production code path this is moot (the process is about to die), but in tests or if this code is ever reused in a graceful-shutdown variant, the dangling timer could fire unexpectedly. Consider assigning it to an instance field and clearing it in close().

— Claude Code

@anurag4DSB anurag4DSB Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the handle is intentionally not stored, clearing it (e.g. from a close() racing the crash inside the 1s grace) would resurrect exactly the zombie state

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
  • lib/BackbeatConsumer.js:775 — The setTimeout(() => process.exit(1), 1000) timer handle is never stored or cleared. Low risk since the process is exiting, but a dangling timer if the code path is ever reused outside a fatal context.

    Overall: clean, well-gated change. The env-var guard (=== 'true') is strict, the observability additions (queue depth, in-flight tasks, named logger) are proportionate, and both sides of the gate are functionally tested. LGTM with the minor nit above.

    Review by Claude Code

@anurag4DSB anurag4DSB force-pushed the improvement/BB-787-exit-stuck-crr-consumer branch from ef318c7 to 07e2dad Compare June 11, 2026 13:11
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

LGTM — clean, well-scoped change. The env-var gate, strict === 'true' check, hard exit (vs graceful stop that would wait on stuck tasks), and 1s log-flush grace period are all sound. Both gate-on and gate-off paths are covered by functional tests with a properly stubbed process.exit.

Review by Claude Code

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

/reset

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Reset complete

I have successfully deleted this pull request's integration branches.

The following options are set: create_integration_branches

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following options are set: create_integration_branches

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Build failed

The build for commit did not succeed in branch improvement/BB-787-exit-stuck-crr-consumer

The following options are set: approve, create_integration_branches

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following options are set: create_integration_branches

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

/reset

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Reset complete

I have successfully deleted this pull request's integration branches.

The following options are set: create_integration_branches

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Conflict

A conflict has been raised during the creation of
integration branch w/9.1/improvement/BB-787-exit-stuck-crr-consumer with contents from improvement/BB-787-exit-stuck-crr-consumer
and development/9.1.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 git fetch
 git checkout -B w/9.1/improvement/BB-787-exit-stuck-crr-consumer origin/development/9.1
 git merge origin/improvement/BB-787-exit-stuck-crr-consumer
 # <intense conflict resolution>
 git commit
 git push -u origin w/9.1/improvement/BB-787-exit-stuck-crr-consumer

The following options are set: create_integration_branches

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

added a new commit for just bumping package.json

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

LGTM — env-gated hard exit after rebalance drain timeout is well-scoped, follows the existing CRASH_ON_BATCH_TIMEOUT pattern, and both gate-on / gate-off paths are covered by functional tests with a properly stubbed process.exit. No issues found.

Review by Claude Code

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following options are set: create_integration_branches

When a task wedges past the rebalance drain timeout, the consumer
disconnects itself so that, in the design's words, the healthcheck
fails and the process gets restarted. On S3C nothing performs that
restart: supervisord only restarts a process that exits, so the
worker stays down while its stuck in-flight tasks can still complete
late and write over the partition's new owner (ghost objects,
orphaned parts).

Emit a rebalance.timeout event at the disconnect point (no listener
anywhere except the replication queue processor, so no behavior
change for other consumers or Zenko) and, when
REPLICATION_QUEUE_PROCESSOR_CRASH_ON_REBALANCE_TIMEOUT is exactly
"true", log a fatal line and hard-exit so the supervisor restarts
the process and it rejoins the group. Hard exit rather than SIGTERM:
the graceful stop path waits on the very tasks that are stuck.
Federation sets the env var per site-processor program.
@anurag4DSB anurag4DSB force-pushed the improvement/BB-787-exit-stuck-crr-consumer branch from cb398c5 to 11aa0bb Compare June 12, 2026 08:59
@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

History mismatch

Merge commit #8da164b2970190406592462c139024210dcd8d57 on the integration branch
w/9.1/improvement/BB-787-exit-stuck-crr-consumer is merging a branch which is neither the current
branch improvement/BB-787-exit-stuck-crr-consumer nor the development branch
development/9.1.

It is likely due to a rebase of the branch improvement/BB-787-exit-stuck-crr-consumer and the
merge is not possible until all related w/* branches are deleted or updated.

Please use the reset command to have me reinitialize these branches.

The following options are set: create_integration_branches

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

/reset

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

squashed and pushed, no changes

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Reset complete

I have successfully deleted this pull request's integration branches.

The following options are set: create_integration_branches

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Conflict

A conflict has been raised during the creation of
integration branch w/9.1/improvement/BB-787-exit-stuck-crr-consumer with contents from improvement/BB-787-exit-stuck-crr-consumer
and development/9.1.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 git fetch
 git checkout -B w/9.1/improvement/BB-787-exit-stuck-crr-consumer origin/development/9.1
 git merge origin/improvement/BB-787-exit-stuck-crr-consumer
 # <intense conflict resolution>
 git commit
 git push -u origin w/9.1/improvement/BB-787-exit-stuck-crr-consumer

The following options are set: create_integration_branches

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

LGTM — clean, well-scoped change. The env-gated hard exit follows the established CRASH_ON_BATCH_TIMEOUT pattern with a deliberate strictness improvement (=== 'true' vs truthy check). Code is inert unless explicitly armed, tests cover both gate-unset and gate-armed paths with properly stubbed process.exit, and cleanup in afterEach is correct.

Review by Claude Code

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

ping

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following options are set: create_integration_branches

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

Quick note, I am gonna add a bypass build status on this as discussed with @nicolas2bert for other 9.0 targetting PRs and same for #2763

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

/approve

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Build failed

The build for commit did not succeed in branch improvement/BB-787-exit-stuck-crr-consumer

The following options are set: approve, create_integration_branches

@anurag4DSB

Copy link
Copy Markdown
Contributor Author

/bypass_build_status

@bert-e

bert-e commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

I have successfully merged the changeset of this pull request
into targetted development branches:

  • ✔️ development/9.0

  • ✔️ development/9.1

  • ✔️ development/9.2

  • ✔️ development/9.3

  • ✔️ development/9.4

  • ✔️ development/9.5

The following branches have NOT changed:

  • development/7.10
  • development/7.4
  • development/7.70
  • development/8.6

This pull request did not target the following hotfix branch(es) so they
were left untouched:

  • hotfix/7.9.0
  • hotfix/7.10.2
  • hotfix/7.4.10
  • hotfix/7.10.4
  • hotfix/7.70.1
  • hotfix/7.4.8
  • hotfix/9.0.7
  • hotfix/7.4.0
  • hotfix/7.4.4
  • hotfix/7.2.0
  • hotfix/7.4.5
  • hotfix/7.4.9
  • hotfix/7.10.12
  • hotfix/7.10.1
  • hotfix/7.6.0
  • hotfix/7.8.0
  • hotfix/7.4.6
  • hotfix/7.4.3
  • hotfix/7.7.0
  • hotfix/9.0.4
  • hotfix/7.4.1
  • hotfix/7.4.2
  • hotfix/7.70.15
  • hotfix/8.2.12
  • hotfix/7.10.0
  • hotfix/7.4.7
  • hotfix/7.10.8
  • hotfix/7.70.12
  • hotfix/7.10.17
  • hotfix/7.10.3

Please check the status of the associated issue BB-787.

Goodbye anurag4dsb.

The following options are set: bypass_build_status, approve, create_integration_branches

@bert-e bert-e merged commit 11aa0bb into development/9.0 Jun 12, 2026
20 of 22 checks passed
@bert-e bert-e deleted the improvement/BB-787-exit-stuck-crr-consumer branch June 12, 2026 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants