NOISSUE - Add CPU/memory Prometheus metrics and OTLP task lifecycle spans by JeffMboya · Pull Request #233 · absmach/propeller

JeffMboya · 2026-06-08T12:37:05Z

What does this do?

Surfaces proplet CPU and container memory data as Prometheus metrics and instruments the task execution lifecycle with OTLP trace spans.

Previously, CPU and container memory data from MetricsCollector was only published over MQTT to the manager. This PR makes that data scrapeable at :9092/metrics alongside the existing task counters by adding four new Gauges:

proplet_cpu_user_seconds — cumulative CPU time in user mode (sourced from /proc/self/stat)
proplet_cpu_system_seconds — cumulative CPU time in kernel mode
proplet_memory_container_usage_bytes — container memory usage in bytes (from cgroup)
proplet_memory_container_limit_bytes — container memory limit in bytes (0 = unlimited)
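As an illustration of where the two CPU gauges could come from, here is a minimal sketch of parsing `utime` and `stime` out of `/proc/self/stat`. It is not the PR's actual `MetricsCollector` code. The function name `parse_cpu_seconds` and the tick rate passed as a parameter are assumptions made for this example. Real code would typically query `sysconf(_SC_CLK_TCK)`, which is commonly 100.

```rust
/// Parse cumulative user and system CPU time (in seconds) from the contents
/// of /proc/self/stat. `ticks_per_sec` is the kernel clock tick rate; it is
/// passed in here as an assumption rather than queried from the OS.
fn parse_cpu_seconds(stat: &str, ticks_per_sec: f64) -> Option<(f64, f64)> {
    // Field 2 (comm) is wrapped in parentheses and may contain spaces or ')',
    // so only the text after the last ')' is split on whitespace.
    let rest = &stat[stat.rfind(')')? + 1..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // `rest` starts at field 3 (state), so utime (field 14) is index 11
    // and stime (field 15) is index 12.
    let utime: u64 = fields.get(11)?.parse().ok()?;
    let stime: u64 = fields.get(12)?.parse().ok()?;
    Some((utime as f64 / ticks_per_sec, stime as f64 / ticks_per_sec))
}

fn main() {
    // Fixed sample line so the output does not depend on the host.
    let sample = "1234 (my proplet) S 1 1234 1234 0 -1 4194304 500 0 0 0 250 75 0 0 20 0 4 0 100 1000000 200";
    if let Some((user, system)) = parse_cpu_seconds(sample, 100.0) {
        println!("proplet_cpu_user_seconds {}", user);
        println!("proplet_cpu_system_seconds {}", system);
    }
}
```

On a Linux host the same function can be fed the output of `std::fs::read_to_string("/proc/self/stat")`.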

It also adds three OTLP trace spans so proplet execution appears in Jaeger alongside the manager:

task.start — covers MQTT decode through task dispatch, tagged with task_id and task_name
task.execute — wraps the full spawned task lifetime end-to-end
wasm.execute — child span around runtime.start_app so wasmtime execution time is visible separately

The Grafana dashboard gains two new panels in the Proplet System row: CPU Time (cumulative) and Container Memory.

Also fixes a double-count bug in the FML coordinator error path where tasks_failed was incremented even when the wasm task had already succeeded.

Which issue(s) does this PR fix/relate to?

No issue

List any changes that modify/break current functionality

No breaking changes. The four new metrics are additive. The task.stop span gains a task_id field for Jaeger correlation with task.start.

Have you included tests for your changes?

Yes — cargo test passes (128 tests). Three tests in telemetry.rs cover the new metrics:

test_metrics_new_all_zero — asserts all 12 metrics initialise to zero
test_metrics_registry_gathers_all_metrics — asserts all 12 metric names are present in the Prometheus registry
test_cpu_and_container_metrics_set — exercises the new gauge setters with concrete values

Did you document any new/modified functionality?

The Grafana dashboard JSON (docker/addons/grafana/grafana/dashboards/propeller.json) is updated with the two new panels.

Notes

The CPU gauge names intentionally omit the _total suffix. Prometheus convention reserves _total for Counter types. These metrics are Gauge values set via set(), not monotonically incremented via inc().
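The distinction can be sketched with standard-library stand-ins. The `Gauge` and `Counter` types below are hypothetical and only mimic the `set()` / `inc()` semantics described above; they are not the Prometheus client types used in the proplet. The `render` helper is likewise an assumption for illustration and emits one sample in the Prometheus text exposition format.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for a gauge: each sample overwrites the previous value.
struct Gauge(AtomicU64);

impl Gauge {
    fn new() -> Self {
        Gauge(AtomicU64::new(0f64.to_bits()))
    }
    fn set(&self, v: f64) {
        // Store the f64 bit pattern so the value can be shared atomically.
        self.0.store(v.to_bits(), Ordering::Relaxed);
    }
    fn get(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Relaxed))
    }
}

/// Stand-in for a counter: the value only moves up, one increment at a time.
struct Counter(AtomicU64);

impl Counter {
    fn new() -> Self {
        Counter(AtomicU64::new(0))
    }
    fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Render one sample with its TYPE line in the text exposition format.
fn render(name: &str, kind: &str, value: f64) -> String {
    format!("# TYPE {0} {1}\n{0} {2}\n", name, kind, value)
}

fn main() {
    let cpu_user = Gauge::new();
    cpu_user.set(2.5); // value copied from the collector, hence no _total suffix

    let started = Counter::new();
    started.inc();
    started.inc(); // counted event by event, hence the _total suffix

    print!("{}", render("proplet_cpu_user_seconds", "gauge", cpu_user.get()));
    print!("{}", render("proplet_tasks_started_total", "counter", started.get() as f64));
}
```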

Add tracing spans to the three key operations so proplet appears in Jaeger alongside manager:

- task.start: covers MQTT decode through task dispatch, tagged with task_id and task_name
- task.execute: wraps the spawned task lifetime end-to-end
- wasm.execute: child span around runtime.start_app so wasmtime execution time is visible separately

Uses #[tracing::instrument] for async-fn boundaries (Send-safe) and .instrument() for the tokio::spawn future.

Add four new Prometheus gauges sourced from MetricsCollector data that was previously only published over MQTT:

- proplet_cpu_user_seconds_total: cumulative user-mode CPU seconds
- proplet_cpu_system_seconds_total: cumulative kernel-mode CPU seconds
- proplet_memory_container_usage_bytes: cgroup memory usage
- proplet_memory_container_limit_bytes: cgroup memory limit (0 = unlimited)

These are scraped alongside the existing proplet metrics at :9092/metrics. The Grafana dashboard gains two new panels: "CPU Time (cumulative)" and "Container Memory" in the Proplet System row.

…d to stop span

- Rename proplet_cpu_user_seconds_total and proplet_cpu_system_seconds_total to drop the _total suffix: _total is reserved for Counter types in Prometheus convention; these are Gauges (values come from set(), not inc())
- Update Grafana panel expressions and metric HELP text to match
- Add fields(task_id) to task.stop span and record it after decode so stop events can be correlated with start events in Jaeger
- Remove redundant task_id field from inner wasm.execute span (inherited from parent task.execute span)
- Extend test_metrics_new_all_zero to assert all gauge metrics start at 0.0

test_metrics_registry_gathers_all_metrics was missing assertions for proplet_tasks_failed_total and proplet_wasm_fetch_bytes_total.

…ror path

When MANAGER_COORDINATOR_URL is missing for an FML task that already completed (wasm returned Ok), tasks_completed and tasks_failed were both incremented for the same execution. Now tasks_failed is only incremented if the wasm task itself failed.
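The intended accounting after this fix can be sketched as follows. This is a hypothetical reduction, not the code in `service.rs`: `TaskCounters` and `record_outcome` are names invented for the example, and the handling of a missing coordinator URL (log only) is an assumption based on the commit message.

```rust
/// Hypothetical stand-in for the tasks_completed / tasks_failed counters.
#[derive(Debug, Default, PartialEq)]
struct TaskCounters {
    completed: u64,
    failed: u64,
}

/// Record the outcome of one task execution. Exactly one counter moves per
/// execution, and which one is decided by the wasm result alone.
fn record_outcome(c: &mut TaskCounters, wasm_ok: bool, coordinator_url: Option<&str>) {
    if wasm_ok {
        c.completed += 1;
    } else {
        c.failed += 1;
    }
    if coordinator_url.is_none() {
        // Assumed behaviour: report the configuration problem, but do not
        // touch `failed` again, since the wasm outcome is already counted.
        eprintln!("MANAGER_COORDINATOR_URL is not set; skipping coordinator update");
    }
}

fn main() {
    let mut counters = TaskCounters::default();
    // Wasm succeeded but the coordinator URL is missing: counts as completed only.
    record_outcome(&mut counters, true, None);
    println!("{:?}", counters);
}
```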

JeffMboya · 2026-06-08T17:05:35Z

Grafana — Proplet metrics live

Panels in the Proplet System row:

Tasks Started / Completed / Failed — task counters firing in real time (2 started, 1 completed, 1 failed)
CPU Usage — proplet_cpu_usage_ratio
Memory RSS — proplet_memory_rss_bytes
CPU Time (cumulative) — proplet_cpu_user_seconds + proplet_cpu_system_seconds (new)
Container Memory — proplet_memory_container_usage_bytes + proplet_memory_container_limit_bytes (new)

JeffMboya · 2026-06-08T17:08:04Z

Metric coverage — all previously collected metrics are now Prometheus-scrapeable

System metrics (from `MetricsCollector`, previously MQTT-only)

| Metric | Type | Prometheus name |
| --- | --- | --- |
| CPU usage ratio | Gauge | `proplet_cpu_usage_ratio` |
| CPU user-mode seconds | Gauge | `proplet_cpu_user_seconds` |
| CPU kernel-mode seconds | Gauge | `proplet_cpu_system_seconds` |
| Memory RSS | Gauge | `proplet_memory_rss_bytes` |
| Container memory usage | Gauge | `proplet_memory_container_usage_bytes` |
| Container memory limit | Gauge | `proplet_memory_container_limit_bytes` |
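Because the limit gauge reports 0 for "unlimited", consumers need to treat 0 as a sentinel rather than a real limit. The sketch below shows one way the value might be derived and used. It assumes the cgroup v2 `memory.max` file format (either the literal `max` or a byte count); the PR only says the data comes "from cgroup". The names `parse_memory_limit` and `usage_fraction` are hypothetical, and mapping unparsable input to 0 is an assumption of this example.

```rust
/// Convert the contents of a cgroup v2 `memory.max` file into a byte limit,
/// using 0 to mean "unlimited" as the gauge does.
fn parse_memory_limit(raw: &str) -> u64 {
    match raw.trim() {
        "max" => 0,
        // Assumption: unparsable input is also reported as 0 (no known limit).
        other => other.parse().unwrap_or(0),
    }
}

/// Fraction of the container limit in use, or None when there is no limit,
/// so callers never divide by the 0 sentinel.
fn usage_fraction(usage_bytes: u64, limit_bytes: u64) -> Option<f64> {
    if limit_bytes == 0 {
        None
    } else {
        Some(usage_bytes as f64 / limit_bytes as f64)
    }
}

fn main() {
    let limit = parse_memory_limit("536870912\n");
    println!("limit bytes: {}", limit);
    println!("usage fraction: {:?}", usage_fraction(268435456, limit));
    println!("unlimited: {:?}", usage_fraction(268435456, parse_memory_limit("max\n")));
}
```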

Task metrics (from `service.rs`)

| Metric | Type | Prometheus name |
| --- | --- | --- |
| Tasks started | Counter | `proplet_tasks_started_total` |
| Tasks completed | Counter | `proplet_tasks_completed_total` |
| Tasks failed | Counter | `proplet_tasks_failed_total` |
| Tasks currently running | Gauge | `proplet_tasks_running` |
| WASM bytes fetched | Counter | `proplet_wasm_fetch_bytes_total` |
| MQTT reconnects | Counter | `proplet_mqtt_reconnects_total` |

JeffMboya added 5 commits June 8, 2026 15:17

test(proplet): cover all 12 metric names in registry gather test

e9f83b1

test_metrics_registry_gathers_all_metrics was missing assertions for proplet_tasks_failed_total and proplet_wasm_fetch_bytes_total.

JeffMboya changed the title ~~feat(proplet): expose CPU and container memory metrics to Prometheus~~ NOISSUE - feat(proplet): add CPU/memory Prometheus metrics and OTLP task lifecycle spans Jun 8, 2026

JeffMboya changed the title ~~NOISSUE - feat(proplet): add CPU/memory Prometheus metrics and OTLP task lifecycle spans~~ NOISSUE - Add CPU/memory Prometheus metrics and OTLP task lifecycle spans Jun 8, 2026

JeffMboya force-pushed the feat/proplet-prometheus-metrics branch 2 times, most recently from 5fcfd0d to 85f10ea, on June 8, 2026 17:02

JeffMboya wants to merge 5 commits into absmach:main from JeffMboya:feat/proplet-prometheus-metrics