44 changes: 43 additions & 1 deletion monitoring/mongodb/alerts.test.yaml
@@ -670,4 +670,46 @@ tests:
MongoDB replica set `data-db-mongodb-sharded-shard-1` is not in the expected state.
It does not have the expected number of SECONDARY members. Please ensure that all
instances are running properly.
summary: MongoDB replica set out of sync
summary: MongoDB replica set out of sync

- name: MongoDbCompactionNeeded
interval: 5m
input_series:
# Compaction-needed DB: 80% disk usage, 40% free ratio, 12 GB free (>10 GiB)
- series: mongodb_dbstats_fsUsedSize{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-0", database="needs-compaction"}
values: 8e9x13
- series: mongodb_dbstats_fsTotalSize{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-0", database="needs-compaction"}
values: 1e10x13
- series: mongodb_dbstats_totalFreeStorageSize{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-0", database="needs-compaction"}
values: 12e9x13
- series: mongodb_dbstats_totalSize{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-0", database="needs-compaction"}
values: 30e9x13
# Healthy DB on same pod: same 80% disk usage but only 1 GB free (below both ratio and absolute thresholds)
- series: mongodb_dbstats_fsUsedSize{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-0", database="healthy-db"}
values: 8e9x13
- series: mongodb_dbstats_fsTotalSize{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-0", database="healthy-db"}
values: 1e10x13
- series: mongodb_dbstats_totalFreeStorageSize{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-0", database="healthy-db"}
values: 1e9x13
- series: mongodb_dbstats_totalSize{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-0", database="healthy-db"}
values: 30e9x13
alert_rule_test:
# for: 1h not yet satisfied
- alertname: MongoDbCompactionNeeded
eval_time: 10m
exp_alerts: []
- alertname: MongoDbCompactionNeeded
eval_time: 55m
exp_alerts: []
# for: 1h satisfied; only the needs-compaction DB fires
- alertname: MongoDbCompactionNeeded
eval_time: 65m
exp_alerts:
- exp_labels:
severity: warning
namespace: zenko
pod: data-db-mongodb-sharded-shard0-data-0
database: needs-compaction
exp_annotations:
description: "MongoDB pod `data-db-mongodb-sharded-shard0-data-0` database `needs-compaction` has accumulated significant reclaimable storage while the underlying filesystem is filling up. Consider running compaction to recover disk space."
summary: MongoDB compaction needed
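
The fixture arithmetic in this test can be checked by hand. The following is a minimal sketch, assuming the default thresholds from the `x-inputs` block of `alerts.yaml`; the function name `compaction_needed` and the `FIXTURES` dictionary are illustrative and not part of the PR. It mirrors the three conditions joined by `and`, evaluated per `database` label, which is why only one of the two databases on the same pod is expected to fire.

```python
# Default thresholds, copied from the x-inputs block in alerts.yaml.
DISK_USAGE_THRESHOLD = 0.7
FREE_RATIO_THRESHOLD = 0.3
FREE_ABSOLUTE_THRESHOLD = 10 * 1024 * 1024 * 1024  # 10 GiB, in bytes

# Constant sample values from alerts.test.yaml, keyed by the database label.
FIXTURES = {
    "needs-compaction": {
        "fsUsedSize": 8e9,
        "fsTotalSize": 1e10,
        "totalFreeStorageSize": 12e9,
        "totalSize": 30e9,
    },
    "healthy-db": {
        "fsUsedSize": 8e9,
        "fsTotalSize": 1e10,
        "totalFreeStorageSize": 1e9,
        "totalSize": 30e9,
    },
}


def compaction_needed(stats):
    """Mirror the three conditions that the alert expression joins with `and`."""
    disk_usage = stats["fsUsedSize"] / stats["fsTotalSize"]
    free_ratio = stats["totalFreeStorageSize"] / stats["totalSize"]
    return (
        disk_usage > DISK_USAGE_THRESHOLD
        and free_ratio > FREE_RATIO_THRESHOLD
        and stats["totalFreeStorageSize"] > FREE_ABSOLUTE_THRESHOLD
    )


# Each database is evaluated on its own, like PromQL matching on identical label sets.
firing = [db for db, stats in FIXTURES.items() if compaction_needed(stats)]
print(firing)  # ['needs-compaction']
```

Note that the sketch only covers the instantaneous condition; the `for: 1h` clause is what keeps the alert from firing at the 10m and 55m evaluation times.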
30 changes: 30 additions & 0 deletions monitoring/mongodb/alerts.yaml
@@ -27,6 +27,15 @@ x-inputs:
- name: replicationLagOplogSizeThreshold
type: config
value: 0.5
- name: compactionDiskUsageThreshold
type: config
value: 0.7
- name: compactionFreeStorageRatioThreshold
type: config
value: 0.3
- name: compactionFreeStorageAbsoluteThreshold
type: config
value: 10 * 1024 * 1024 * 1024 # 10 GiB, in bytes
Comment on lines +30 to +38

Are we confident about these levels? They should be:

  • good defaults everywhere
  • alerting only when really needed
  • not too noisy

These alerts need to be "low touch" and fire only when an action is required.

(Note: since we are adding config inputs, we should add matching options in ZKOP afterwards so that they can be set.)


groups:
- name: MongoDb
@@ -270,6 +279,27 @@ groups:
description: "MongoDB pod `{{ $labels.pod }}` has been in the 'STARTUP2' state for more than 1 hour. Please ensure that the instance is running properly."
summary: MongoDB node in STARTUP2 state for too long

- alert: MongoDbCompactionNeeded
expr: |
(
mongodb_dbstats_fsUsedSize{namespace="${namespace}",pod=~"${service}.*"}
/ mongodb_dbstats_fsTotalSize{namespace="${namespace}",pod=~"${service}.*"}
) > ${compactionDiskUsageThreshold}
and
(
mongodb_dbstats_totalFreeStorageSize{namespace="${namespace}",pod=~"${service}.*"}
/ mongodb_dbstats_totalSize{namespace="${namespace}",pod=~"${service}.*"}
) > ${compactionFreeStorageRatioThreshold}
Comment on lines +285 to +292

Not sure about this rule: combining the percentage of disk (FS) used with the percentage of reclaimable space (I gather totalFreeStorageSize is what we can reclaim?) could fire in many cases, but often when it is not relevant?

E.g. I guess we are interested in the cases where the reclaimable amount is significant compared with the available MongoDB (free/total) size?

  • i.e. your last condition, or possibly a variable threshold expressed as a % of the filesystem (free/total) size?
  • The first condition seems fine, to avoid alerts when there is a lot of space (though that is debatable; not sure it is really useful, or whether we want to delay in that case).
  • I don't see at all what the benefit of the % of reclaimable space condition is.

and
mongodb_dbstats_totalFreeStorageSize{namespace="${namespace}",pod=~"${service}.*"}
> ${compactionFreeStorageAbsoluteThreshold}
for: 1h
labels:
severity: warning
annotations:
description: "MongoDB pod `{{ $labels.pod }}` database `{{ $labels.database }}` has accumulated significant reclaimable storage while the underlying filesystem is filling up. Consider running compaction to recover disk space."
summary: MongoDB compaction needed
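
One possible reading of the review comment on the ratio condition is to compare the reclaimable bytes with the free space left on the filesystem, rather than with the database's own `totalSize`. The sketch below is only an illustration of that idea, not something the PR implements: the helper `reclaimable_vs_fs_free`, the threshold `RECLAIM_GAIN_THRESHOLD`, and the volume sizes are all hypothetical.

```python
GIB = 1024 ** 3


def reclaimable_vs_fs_free(total_free_storage, fs_total, fs_used):
    """Return reclaimable bytes divided by the filesystem's remaining free bytes."""
    fs_free = fs_total - fs_used
    if fs_free <= 0:
        # On a full filesystem, any reclaimable space is maximally relevant.
        return float("inf")
    return total_free_storage / fs_free


# Hypothetical threshold: compaction would grow the free space by at least 50%.
RECLAIM_GAIN_THRESHOLD = 0.5

# Illustrative volumes (not taken from the PR), both with 12 GiB reclaimable
# and both above the 70% disk usage default.
small_volume = reclaimable_vs_fs_free(12 * GIB, fs_total=100 * GIB, fs_used=80 * GIB)
large_volume = reclaimable_vs_fs_free(12 * GIB, fs_total=1000 * GIB, fs_used=710 * GIB)

print(small_volume > RECLAIM_GAIN_THRESHOLD)  # 12 GiB against 20 GiB free
print(large_volume > RECLAIM_GAIN_THRESHOLD)  # 12 GiB against 290 GiB free
```

Under these assumptions the same 12 GiB of reclaimable storage is significant on the small volume and negligible on the large one, which is the distinction the fixed 10 GiB absolute threshold cannot express on its own.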

- alert: MongoDbRSNotSynced
expr: |
count by (rs_nm, statefulset) (
2 changes: 1 addition & 1 deletion solution-base/mongodb/charts/mongodb-sharded/values.yaml
@@ -1815,7 +1815,7 @@ metrics:
## @param metrics.extraArgs String with extra arguments to the metrics exporter
## ref: https://github.com/percona/mongodb_exporter/blob/main/main.go
##
- extraArgs: "--collector.diagnosticdata --collector.replicasetstatus --collector.dbstats --collector.topmetrics --compatible-mode"
+ extraArgs: "--collector.diagnosticdata --collector.replicasetstatus --collector.dbstats --collector.dbstatsfreestorage --collector.topmetrics --compatible-mode"
## @param metrics.resourcesPreset Set container resources according to one common preset (allowed values: none, nano, micro, small, medium, large, xlarge, 2xlarge). This is ignored if metrics.resources is set (metrics.resources is recommended for production).
## More information: https://github.com/bitnami/charts/blob/main/bitnami/common/templates/_resources.tpl#L15
##