NOISSUE - Add CPU/memory Prometheus metrics and OTLP task lifecycle spans #233

Open
JeffMboya wants to merge 5 commits into absmach:main from JeffMboya:feat/proplet-prometheus-metrics
JeffMboya commented Jun 8, 2026

What does this do?

Surfaces proplet CPU and container memory data as Prometheus metrics and instruments the task execution lifecycle with OTLP trace spans.

Previously, CPU and container memory data from MetricsCollector was only published over MQTT to the manager. This PR makes that data scrapeable at :9092/metrics alongside the existing task counters by adding four new Gauges:

  • proplet_cpu_user_seconds — cumulative CPU time in user mode (sourced from /proc/self/stat)
  • proplet_cpu_system_seconds — cumulative CPU time in kernel mode
  • proplet_memory_container_usage_bytes — container memory usage in bytes (from cgroup)
  • proplet_memory_container_limit_bytes — container memory limit in bytes (0 = unlimited)
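
As a rough illustration of where these gauge values could come from, the sketch below parses user and system CPU time out of a /proc/self/stat line and maps a cgroup v2 memory.max value to the "0 = unlimited" convention. The function names, the fixed tick rate of 100, and the use of cgroup v2 are assumptions made for the example; they are not taken from MetricsCollector.

```rust
/// Clock ticks per second. Assumption: 100, the usual Linux value.
/// Production code would query sysconf(_SC_CLK_TCK) instead.
const CLK_TCK: f64 = 100.0;

/// Hypothetical helper: returns (user_seconds, system_seconds) parsed
/// from the contents of /proc/self/stat, or None if the line is malformed.
fn parse_cpu_seconds(stat: &str) -> Option<(f64, f64)> {
    // Field 2 (comm) is parenthesised and may contain spaces,
    // so only the text after the last ')' is split on whitespace.
    let rest = &stat[stat.rfind(')')? + 1..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // fields[0] is field 3 (state), so utime (field 14) and
    // stime (field 15) sit at indices 11 and 12.
    let utime: u64 = fields.get(11)?.parse().ok()?;
    let stime: u64 = fields.get(12)?.parse().ok()?;
    Some((utime as f64 / CLK_TCK, stime as f64 / CLK_TCK))
}

/// Hypothetical helper: maps the contents of a cgroup v2 memory.max file
/// to bytes, using 0 for "max" (no limit) as the metric description does.
fn parse_memory_limit(raw: &str) -> Option<u64> {
    match raw.trim() {
        "max" => Some(0),
        n => n.parse().ok(),
    }
}
```

Values parsed this way would be written to the gauges with set() on each collection cycle, which is why they are not modelled as counters.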

It also adds three OTLP trace spans so proplet execution appears in Jaeger alongside the manager:

  • task.start — covers MQTT decode through task dispatch, tagged with task_id and task_name
  • task.execute — wraps the full spawned task lifetime end-to-end
  • wasm.execute — child span around runtime.start_app so wasmtime execution time is visible separately

The Grafana dashboard gains two new panels in the Proplet System row: CPU Time (cumulative) and Container Memory.

Also fixes a double-count bug in the FML coordinator error path where tasks_failed was incremented even when the wasm task had already succeeded.
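
The intent of the double-count fix can be sketched as follows. This is a simplified model, not the code in service.rs: TaskCounters, record_outcome, and the error string are hypothetical names chosen for the example. The point is that each execution bumps exactly one of the two counters, and a missing coordinator URL is reported without touching tasks_failed again.

```rust
/// Hypothetical stand-in for the proplet task counters.
#[derive(Default, Debug, PartialEq)]
struct TaskCounters {
    completed: u64,
    failed: u64,
}

/// Records the outcome of one FML task execution.
/// `wasm_ok` is whether the wasm task itself returned Ok;
/// `coordinator_url` models MANAGER_COORDINATOR_URL being set or missing.
fn record_outcome(
    counters: &mut TaskCounters,
    wasm_ok: bool,
    coordinator_url: Option<&str>,
) -> Result<(), &'static str> {
    // Exactly one counter is incremented per execution,
    // decided solely by the wasm result.
    if wasm_ok {
        counters.completed += 1;
    } else {
        counters.failed += 1;
    }
    match coordinator_url {
        Some(_) => Ok(()),
        // In this model, the double count would correspond to also
        // incrementing `failed` here. The error is surfaced instead.
        None => Err("MANAGER_COORDINATOR_URL is not set"),
    }
}
```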

Which issue(s) does this PR fix/relate to?

No issue

List any changes that modify/break current functionality

No breaking changes. The four new metrics are additive. The task.stop span gains a task_id field for Jaeger correlation with task.start.

Have you included tests for your changes?

Yes — cargo test passes (128 tests). Three tests in telemetry.rs cover the new metrics:

  • test_metrics_new_all_zero — asserts all 12 metrics initialise to zero
  • test_metrics_registry_gathers_all_metrics — asserts all 12 metric names are present in the Prometheus registry
  • test_cpu_and_container_metrics_set — exercises the new gauge setters with concrete values
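
A registry-coverage check in the spirit of test_metrics_registry_gathers_all_metrics could look like the sketch below. It works on Prometheus text exposition output rather than on the registry API, and missing_metrics is a hypothetical helper; the 12 metric names are the ones listed in this PR.

```rust
/// The 12 metric names this PR expects to be exposed.
const EXPECTED: [&str; 12] = [
    "proplet_cpu_usage_ratio",
    "proplet_cpu_user_seconds",
    "proplet_cpu_system_seconds",
    "proplet_memory_rss_bytes",
    "proplet_memory_container_usage_bytes",
    "proplet_memory_container_limit_bytes",
    "proplet_tasks_started_total",
    "proplet_tasks_completed_total",
    "proplet_tasks_failed_total",
    "proplet_tasks_running",
    "proplet_wasm_fetch_bytes_total",
    "proplet_mqtt_reconnects_total",
];

/// Hypothetical helper: returns the expected names that have no sample
/// line in the given text exposition output.
fn missing_metrics<'a>(exposition: &str, expected: &[&'a str]) -> Vec<&'a str> {
    expected
        .iter()
        .copied()
        .filter(|name| {
            !exposition.lines().any(|line| {
                // Skip "# HELP" / "# TYPE" comments; a sample line is the
                // metric name followed by a space or a '{' label block.
                !line.starts_with('#')
                    && line
                        .strip_prefix(*name)
                        .map_or(false, |rest| rest.starts_with(' ') || rest.starts_with('{'))
            })
        })
        .collect()
}
```

Matching on the full name followed by a space or '{' avoids treating proplet_tasks_running as evidence for a shorter name that merely shares its prefix.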

Did you document any new/modified functionality?

The Grafana dashboard JSON (docker/addons/grafana/grafana/dashboards/propeller.json) is updated with the two new panels.

Notes

The CPU gauge names intentionally omit the _total suffix. Prometheus convention reserves _total for Counter types. These metrics are Gauge values set via set(), not monotonically incremented via inc().
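
To make the naming rule concrete, here is a small standalone sketch that checks a metric name against its type and renders a sample in the Prometheus text exposition format. Kind, name_matches_kind, and render are illustrative names only and are not part of the prometheus crate or of telemetry.rs; the HELP text is likewise made up.

```rust
/// Illustrative metric kinds; only the two used in this PR.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Counter,
    Gauge,
}

/// Returns true when the name follows the convention described above:
/// counters end in `_total`, gauges do not.
fn name_matches_kind(name: &str, kind: Kind) -> bool {
    name.ends_with("_total") == (kind == Kind::Counter)
}

/// Renders one unlabelled sample with its HELP and TYPE comment lines.
fn render(name: &str, help: &str, kind: Kind, value: f64) -> String {
    let ty = match kind {
        Kind::Counter => "counter",
        Kind::Gauge => "gauge",
    };
    format!(
        "# HELP {0} {1}\n# TYPE {0} {2}\n{0} {3}\n",
        name, help, ty, value
    )
}
```

With these helpers, proplet_cpu_user_seconds passes as a Gauge, while the earlier name proplet_cpu_user_seconds_total would only pass as a Counter.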

JeffMboya added 5 commits June 8, 2026 15:17
Add tracing spans to the three key operations so proplet appears
in Jaeger alongside manager:
- task.start: covers MQTT decode through task dispatch, tagged with
  task_id and task_name
- task.execute: wraps the spawned task lifetime end-to-end
- wasm.execute: child span around runtime.start_app so wasmtime
  execution time is visible separately

Uses #[tracing::instrument] for async-fn boundaries (Send-safe) and
.instrument() for the tokio::spawn future.

Add four new Prometheus gauges sourced from MetricsCollector data that
was previously only published over MQTT:

- proplet_cpu_user_seconds_total — cumulative user-mode CPU seconds
- proplet_cpu_system_seconds_total — cumulative kernel-mode CPU seconds
- proplet_memory_container_usage_bytes — cgroup memory usage
- proplet_memory_container_limit_bytes — cgroup memory limit (0 = unlimited)

These are scraped alongside the existing proplet metrics at :9092/metrics.
The Grafana dashboard gains two new panels: "CPU Time (cumulative)" and
"Container Memory" in the Proplet System row.

…d to stop span

- Rename proplet_cpu_user_seconds_total and proplet_cpu_system_seconds_total
  to drop the _total suffix: _total is reserved for Counter types in Prometheus
  convention; these are Gauges (values come from set() not inc())
- Update Grafana panel expressions and metric HELP text to match
- Add fields(task_id) to task.stop span and record it after decode so stop
  events can be correlated with start events in Jaeger
- Remove redundant task_id field from inner wasm.execute span (inherited
  from parent task.execute span)
- Extend test_metrics_new_all_zero to assert all gauge metrics start at 0.0

test_metrics_registry_gathers_all_metrics was missing assertions for
proplet_tasks_failed_total and proplet_wasm_fetch_bytes_total.

…ror path

When MANAGER_COORDINATOR_URL is missing for an FML task that already
completed (wasm returned Ok), tasks_completed and tasks_failed were
both incremented for the same execution. Now tasks_failed is only
incremented if the wasm task itself failed.
JeffMboya changed the title from "feat(proplet): expose CPU and container memory metrics to Prometheus" to "NOISSUE - feat(proplet): add CPU/memory Prometheus metrics and OTLP task lifecycle spans" on Jun 8, 2026
JeffMboya changed the title from "NOISSUE - feat(proplet): add CPU/memory Prometheus metrics and OTLP task lifecycle spans" to "NOISSUE - Add CPU/memory Prometheus metrics and OTLP task lifecycle spans" on Jun 8, 2026
JeffMboya force-pushed the feat/proplet-prometheus-metrics branch 2 times, most recently from 5fcfd0d to 85f10ea, on June 8, 2026 17:02
@JeffMboya

Grafana — Proplet metrics live

[Screenshot: Propeller Grafana dashboard showing proplet Prometheus metrics]

Panels in the Proplet System row:

  • Tasks Started / Completed / Failed: task counters firing in real time (2 started, 1 completed, 1 failed)
  • CPU Usage: proplet_cpu_usage_ratio
  • Memory RSS: proplet_memory_rss_bytes
  • CPU Time (cumulative): proplet_cpu_user_seconds + proplet_cpu_system_seconds (new)
  • Container Memory: proplet_memory_container_usage_bytes + proplet_memory_container_limit_bytes (new)

@JeffMboya

Metric coverage — all previously collected metrics are now Prometheus-scrapeable

System metrics (from MetricsCollector, previously MQTT-only)

| Metric | Type | Prometheus name |
|---|---|---|
| CPU usage ratio | Gauge | proplet_cpu_usage_ratio |
| CPU user-mode seconds | Gauge | proplet_cpu_user_seconds |
| CPU kernel-mode seconds | Gauge | proplet_cpu_system_seconds |
| Memory RSS | Gauge | proplet_memory_rss_bytes |
| Container memory usage | Gauge | proplet_memory_container_usage_bytes |
| Container memory limit | Gauge | proplet_memory_container_limit_bytes |

Task metrics (from service.rs)

| Metric | Type | Prometheus name |
|---|---|---|
| Tasks started | Counter | proplet_tasks_started_total |
| Tasks completed | Counter | proplet_tasks_completed_total |
| Tasks failed | Counter | proplet_tasks_failed_total |
| Tasks currently running | Gauge | proplet_tasks_running |
| WASM bytes fetched | Counter | proplet_wasm_fetch_bytes_total |
| MQTT reconnects | Counter | proplet_mqtt_reconnects_total |
