feat(mongodb): alert when compaction is needed #2433
Add --collector.dbstatsfreestorage to mongodb_exporter's extraArgs so
the dbstats response's freeStorageSize / indexFreeStorageSize /
totalFreeStorageSize fields are surfaced as top-level Prometheus series
(mongodb_dbstats_freeStorageSize{database, rs_nm, ...} etc.).
These sub-collectors are not bundled into the catch-all options the
exporter exposes (--collect-all and similar shortcuts); they have to be
opted into explicitly. Since the chart no longer uses --collect-all
anyway (dropped in 8414833 for ZENKO-5281), each individual collector
we want has to be named in extraArgs — which is already how dbstats,
diagnosticdata, replicasetstatus, and topmetrics are wired up. This
just adds dbstatsfreestorage to that list.
Verified on a live Artesca cluster (exporter 0.40.0): without this flag
the freeStorageSize fields only appear as part of the per-host
mongodb_dbstats_raw_<host>_freeStorageSize series — clunky for alerting
queries. With the flag they appear cleanly as top-level series with
{database, rs_nm, ...} labels.
This unblocks the MongoDbCompactionNeeded alert added in the following
commit, which needs totalFreeStorageSize at top level to express the
compaction-pressure heuristic.
Issue: ZENKO-5293
Add MongoDbCompactionNeeded Prometheus rule that fires when a MongoDB
database has accumulated significant reclaimable storage AND the
underlying filesystem is under pressure. Per-(pod, database) granularity.
Heuristic (all three must hold; tunable via x-inputs):
    fsUsedSize / fsTotalSize > 0.7              (filesystem at >70%)
    AND totalFreeStorageSize / totalSize > 0.3  (>30% of on-disk footprint reclaimable)
    AND totalFreeStorageSize > 10 GiB           (absolute floor, avoids noise on small DBs)
    for: 1h                                     (fragmentation builds slowly)
The disk-usage floor exists because raw fragmentation ratio is noisy:
on a freshly-loaded cluster, a 28%-fragmented DB on a 13%-full disk is
not actionable. We only want to alert when reclaiming would relieve
real pressure.
Companion fixture covers a needs-compaction DB and a healthy DB sharing
the same pod (so the filesystem-pressure leg fires on both, but only
the high-free-storage one passes all three conditions).
Issue: ZENKO-5293
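The three-leg heuristic above can be sketched in Python to check how the defaults behave on a few cases. This is only an illustration, not the rule itself: the `DbStats` container, the helper names, and all sample sizes are hypothetical, and the `for: 1h` hold time is not modelled. Division is written to mimic PromQL float semantics, where 0/0 yields NaN and comparisons against NaN never match.

```python
import math
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class DbStats:
    """Hypothetical sample for one (pod, database); fields mirror dbstats names."""
    fs_used_size: float
    fs_total_size: float
    total_free_storage_size: float
    total_size: float


def promql_div(a: float, b: float) -> float:
    """Divide like PromQL floats: 0/0 is NaN, x/0 is +/-Inf, no exception."""
    if b == 0:
        return math.nan if a == 0 else math.copysign(math.inf, a)
    return a / b


def compaction_needed(
    s: DbStats,
    disk_usage_threshold: float = 0.7,
    free_ratio_threshold: float = 0.3,
    free_abs_threshold: float = 10 * GIB,
) -> bool:
    """True only when all three legs of the heuristic hold for this sample."""
    return (
        promql_div(s.fs_used_size, s.fs_total_size) > disk_usage_threshold
        and promql_div(s.total_free_storage_size, s.total_size) > free_ratio_threshold
        and s.total_free_storage_size > free_abs_threshold
    )


# Two databases sharing one pod (same filesystem), as in the fixture description.
fragmented = DbStats(80 * GIB, 100 * GIB, 20 * GIB, 50 * GIB)
healthy = DbStats(80 * GIB, 100 * GIB, 1 * GIB, 30 * GIB)
# Freshly loaded cluster: 28% reclaimable on a 13%-full disk.
fresh = DbStats(13 * GIB, 100 * GIB, 28 * GIB, 100 * GIB)
# Empty database: the ratio leg is 0/0 = NaN, which never compares as greater.
empty = DbStats(80 * GIB, 100 * GIB, 0, 0)

for name, sample in [("fragmented", fragmented), ("healthy", healthy),
                     ("fresh", fresh), ("empty", empty)]:
    print(name, compaction_needed(sample))
```

With these made-up numbers only the fragmented database passes all three legs; the healthy one fails the ratio and absolute legs even though the shared filesystem is under pressure.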
@francoisferrand Heuristics are up for discussion -- I chose a mix of:
francoisferrand left a comment
- not sure about thresholds/computation: are these relevant defaults?
- not sure we should merge in 2.15, or target 2.16 and get time to "preview" this and avoid shipping alerts which would lead to support calls...
- name: compactionDiskUsageThreshold
  type: config
  value: 0.7
- name: compactionFreeStorageRatioThreshold
  type: config
  value: 0.3
- name: compactionFreeStorageAbsoluteThreshold
  type: config
  value: 10 * 1024 * 1024 * 1024 # 10 GiB, in bytes
are we confident about these levels?
- good defaults everywhere
- alert when really needed
- not too noisy
these alerts need to be "low touch", and fire only when an action is required...
(note: since we add config, we should add matching options in ZKOP afterwards, to allow setting them...)
    mongodb_dbstats_fsUsedSize{namespace="${namespace}",pod=~"${service}.*"}
    / mongodb_dbstats_fsTotalSize{namespace="${namespace}",pod=~"${service}.*"}
  ) > ${compactionDiskUsageThreshold}
  and
  (
    mongodb_dbstats_totalFreeStorageSize{namespace="${namespace}",pod=~"${service}.*"}
    / mongodb_dbstats_totalSize{namespace="${namespace}",pod=~"${service}.*"}
  ) > ${compactionFreeStorageRatioThreshold}
not sure about this rule: percentage of disk (FS) and percentage of reclaimable (I gather totalFreeStorageSize is what we can reclaim?) could fire in many cases, but often when not relevant?
e.g. I guess we are interested when the reclaimable amount is significant vs the available mongodb (free/total) size?
- i.e. your last condition, or possibly a variable where the threshold is a % of fs(free/total)size?
- first condition seems fine, to avoid alerts when there is a lot of space (though that is debatable, not sure it is really useful or that we want to delay in that case)
- I don't see at all what the benefit of the % of reclaimable space is?
Summary
Follow-up to ZENKO-5285 (PR #2431), which bundled two alerts in its description and only shipped the first (createIndexes-failed). This adds the second: a fragmentation/compaction-needed signal.
Two commits, deliberately split:
- mongodb: enable dbstatsfreestorage collector in exporter. A one-line values.yaml change adding --collector.dbstatsfreestorage to metrics.extraArgs. The chart's exporter no longer uses --collect-all (dropped in 8414833 for ZENKO-5281); each sub-collector is opted into individually in extraArgs, which already lists dbstats, diagnosticdata, replicasetstatus, and topmetrics. We need dbstatsfreestorage on that list so the freeStorageSize / indexFreeStorageSize / totalFreeStorageSize fields are emitted as clean top-level mongodb_dbstats_* series instead of being buried in the per-host mongodb_dbstats_raw_<host>_* expansion.
- mongodb: alert when compaction is needed. The alert proper.
The alert
Per-(pod, database) granularity. Severity warning.
Heuristic — open for discussion
All three conditions and the for: are exposed as x-inputs (compactionDiskUsageThreshold, compactionFreeStorageRatioThreshold, compactionFreeStorageAbsoluteThreshold). The defaults are my best guesses and very much up for review:
- for: 1h: fragmentation builds slowly. Anything shorter would alert on transient writes.
Per-DB granularity means the alert tells you pod={{ "{{ $labels.pod }}" }} and database={{ "{{ $labels.database }}" }} but does not identify the specific collection. To find that, an operator runs collStats on the alerting DB. The exporter has a --collector.collstats flag we could enable to get per-collection visibility, but on Artesca clusters with thousands of buckets per DB that's a real cardinality cost, so it is deferred for now.
Safety against missing / zero values
- (pod, database) tuple → PromQL produces empty for that tuple → no alert. ✓
- totalFreeStorageSize = 0 (no fragmentation) → ratio is 0, fails both ratio and absolute checks. ✓
- totalSize = 0 (empty DB) → 0/0 = NaN, NaN comparisons are always false. And totalFreeStorageSize can't exceed totalSize, so the absolute check fails too. ✓ +Inf on the ratio leg.
Why this was split off from ZENKO-5285
The original PR shipped the simpler index-failure alert. This one needed more thought on the right signal, the per-(pod, database) aggregation, and the exporter flag dependency. Reviewer @DarkIsDude explicitly asked for a follow-up.
Related
Issue: ZENKO-5293