feat(lifecycle): emit putBucketIndexesFailed metric on index creation errors #2750
delthas wants to merge 1 commit into
Incorrect Jira project
The Jira issue ZENKO-5286 specified in the source
Codecov Report
✅ All modified and coverable lines are covered by tests.
... and 3 files with indirect coverage changes

@@                 Coverage Diff                 @@
##           development/9.4    #2750     +/-   ##
===================================================
- Coverage            74.65%   74.60%   -0.06%
===================================================
  Files                  199      199
  Lines                13654    13655      +1
===================================================
- Hits                 10194    10187      -7
- Misses                3450     3458      +8
  Partials                10       10
LGTM: clean, well-scoped change. The new
… error

Today the conductor's putBucketIndexes-error path just logs a warning and falls back to v1, leaving no Prometheus signal for ZENKO-5285's alert to fire on.

Add a dedicated metric tag in the error callback so ops can alert on sustained index-creation failures regardless of the underlying cause (insufficient disk, server-side abort via indexBuildMinAvailableDiskSpaceMB in MongoDB 7.1+, duplicate-name conflict, transient infra, etc.).

The existing 'putBucketIndexes' tag continues to count attempts; the new 'putBucketIndexesFailed' tag counts failures. The ratio of failures to attempts is the failure rate.

Issue: BB-783
3f906c8 to c568664
LGTM: one-line metric addition in the existing error path, bounded label cardinality, proper test with spy cleanup via afterEach. No issues found.
Replaced by #2751 after the source branch was renamed.
Summary
Adds a dedicated metric tag (putBucketIndexesFailed) when the conductor's putBucketIndexes call returns an error. Provides the Prometheus signal that ZENKO-5285's "lifecycle index creation failing repeatedly" alert will hook into.

Why
Today the error callback at LifecycleConductor.js:339-345 just logs a warning and falls back to v1. There's no metric for the failure itself: only putBucketIndexes is emitted (before the call), and it counts attempts. There is no way to alert on a sustained failure rate.

Implementation
One-line change in the conductor's error path, plus a unit test that stubs client.putBucketIndexes to fail and asserts LifecycleMetrics.onLegacyTask is called with the new tag.

The new tag is cause-agnostic: operators can dig into the warn log to find the underlying reason (MongoDB 7.1's IndexBuildKilledByOutOfDiskSpace, duplicate-name conflict, transient infra, etc.). A future ticket can map specific Arsenal errors and emit more granular tags if needed.

Existing tag semantics (unchanged)
- putBucketIndexes: counts attempts
- putBucketIndexesFailed (NEW): counts failures

Related
- indexBuildMinAvailableDiskSpaceMB=10000 in Artesca (PR #5228)

Issue: BB-783
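
For readers who want to see the shape of the change described under Implementation, here is a minimal, self-contained sketch. It is not the actual LifecycleConductor.js code: the helper createBucketIndexes, the (log, tag) signature of onLegacyTask, the in-memory LifecycleMetrics stand-in, and the 'v2' success value are all assumptions made for illustration. Only the two tag names, the warn-and-fall-back-to-v1 behaviour, and the idea of stubbing client.putBucketIndexes to fail come from the description above.

```javascript
'use strict';

// Stand-in for LifecycleMetrics that records every tag it receives.
// Assumption: the real onLegacyTask signature is not shown in this PR;
// (log, tag) is used here purely for illustration.
const LifecycleMetrics = {
    calls: [],
    onLegacyTask(log, tag) {
        this.calls.push(tag);
    },
};

// Hypothetical helper mirroring the conductor's index-creation step.
function createBucketIndexes(client, log, bucketName, indexes, cb) {
    // Emitted before the call: counts attempts.
    LifecycleMetrics.onLegacyTask(log, 'putBucketIndexes');
    client.putBucketIndexes(bucketName, indexes, err => {
        if (err) {
            // The one-line addition: counts failures, whatever the cause.
            LifecycleMetrics.onLegacyTask(log, 'putBucketIndexesFailed');
            log.warn('unable to create indexes, falling back to v1', {
                bucketName,
                error: err.message,
            });
            return cb(null, 'v1');
        }
        // 'v2' is an illustrative success value, not taken from the PR.
        return cb(null, 'v2');
    });
}

// Stub client whose putBucketIndexes always fails, as the unit test does.
const failingClient = {
    putBucketIndexes(bucketName, indexes, cb) {
        cb(new Error('simulated index creation failure'));
    },
};

// Silent logger so the sketch produces no output.
const log = { warn() {} };

let fallbackVersion = null;
createBucketIndexes(failingClient, log, 'example-bucket', [], (err, version) => {
    fallbackVersion = version;
});
```

After this runs, the stand-in has recorded one attempt tag followed by one failure tag, which is the pair of counters the failure-rate alert would divide.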