NOISSUE - Add CPU/memory Prometheus metrics and OTLP task lifecycle spans#233
Open
JeffMboya wants to merge 5 commits into
Add tracing spans to the three key operations so proplet appears in Jaeger alongside manager:
- task.start: covers MQTT decode through task dispatch, tagged with task_id and task_name
- task.execute: wraps the spawned task lifetime end-to-end
- wasm.execute: child span around runtime.start_app so wasmtime execution time is visible separately

Uses #[tracing::instrument] for async-fn boundaries (Send-safe) and .instrument() for the tokio::spawn future.
Add four new Prometheus gauges sourced from MetricsCollector data that was previously only published over MQTT:
- proplet_cpu_user_seconds_total: cumulative user-mode CPU seconds
- proplet_cpu_system_seconds_total: cumulative kernel-mode CPU seconds
- proplet_memory_container_usage_bytes: cgroup memory usage
- proplet_memory_container_limit_bytes: cgroup memory limit (0 = unlimited)

These are scraped alongside the existing proplet metrics at :9092/metrics. The Grafana dashboard gains two new panels: "CPU Time (cumulative)" and "Container Memory" in the Proplet System row.
…d to stop span
- Rename proplet_cpu_user_seconds_total and proplet_cpu_system_seconds_total to drop the _total suffix: _total is reserved for Counter types in Prometheus convention; these are Gauges (values come from set(), not inc())
- Update Grafana panel expressions and metric HELP text to match
- Add fields(task_id) to task.stop span and record it after decode so stop events can be correlated with start events in Jaeger
- Remove redundant task_id field from inner wasm.execute span (inherited from parent task.execute span)
- Extend test_metrics_new_all_zero to assert all gauge metrics start at 0.0
test_metrics_registry_gathers_all_metrics was missing assertions for proplet_tasks_failed_total and proplet_wasm_fetch_bytes_total.
…ror path
When MANAGER_COORDINATOR_URL is missing for an FML task that already completed (wasm returned Ok), tasks_completed and tasks_failed were both incremented for the same execution. Now tasks_failed is only incremented if the wasm task itself failed.
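The counting rule described in that commit can be sketched with plain integers. This is only an illustration under stated assumptions: `TaskCounters` and `record_fml_outcome` are hypothetical names, and the real proplet code presumably updates Prometheus counters rather than struct fields.

```rust
/// Hypothetical stand-in for the proplet task counters (not the real types).
#[derive(Debug, Default, PartialEq)]
struct TaskCounters {
    completed: u64,
    failed: u64,
}

/// Records one FML task execution.
///
/// Assumption: the wasm result alone decides which counter moves. A missing
/// coordinator URL is still returned to the caller as an error, but it does
/// not increment `failed` a second time for the same execution.
fn record_fml_outcome(
    counters: &mut TaskCounters,
    wasm_result: &Result<(), String>,
    coordinator_url: Option<&str>,
) -> Result<(), String> {
    // Exactly one counter is incremented per execution.
    match wasm_result {
        Ok(()) => counters.completed += 1,
        Err(_) => counters.failed += 1,
    }

    // The coordinator check affects only the returned result, not the counters.
    match coordinator_url {
        Some(_) => wasm_result.clone(),
        None => Err("MANAGER_COORDINATOR_URL is not set".to_string()),
    }
}
```

With this shape, a successful wasm run with no coordinator URL yields `completed = 1, failed = 0` while the caller still sees an error.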
5fcfd0d to 85f10ea
Grafana: Proplet metrics live
Panels in the Proplet System row:
Metric coverage: all previously collected metrics are now Prometheus-scrapeable
System metrics (from
| Metric | Type | Prometheus name |
|---|---|---|
| CPU usage ratio | Gauge | proplet_cpu_usage_ratio |
| CPU user-mode seconds | Gauge | proplet_cpu_user_seconds |
| CPU kernel-mode seconds | Gauge | proplet_cpu_system_seconds |
| Memory RSS | Gauge | proplet_memory_rss_bytes |
| Container memory usage | Gauge | proplet_memory_container_usage_bytes |
| Container memory limit | Gauge | proplet_memory_container_limit_bytes |
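The "0 = unlimited" convention for `proplet_memory_container_limit_bytes` could be derived as in the sketch below. It assumes the limit is read from a cgroup v2 `memory.max` file, where the literal `max` means no limit; the PR only says the value comes "from cgroup", and `container_limit_bytes` is a hypothetical helper, not a function from the proplet source.

```rust
/// Converts the text of a cgroup v2 `memory.max` file into a gauge value.
///
/// Assumption: "max" (no limit) is reported as 0.0, matching the
/// "0 = unlimited" convention; unparsable content yields `None`.
fn container_limit_bytes(memory_max: &str) -> Option<f64> {
    match memory_max.trim() {
        "max" => Some(0.0),
        value => value.parse::<u64>().ok().map(|bytes| bytes as f64),
    }
}
```

Returning `Option` leaves it to the caller to decide whether an unreadable file should leave the gauge untouched or reset it.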
Task metrics (from service.rs)
| Metric | Type | Prometheus name |
|---|---|---|
| Tasks started | Counter | proplet_tasks_started_total |
| Tasks completed | Counter | proplet_tasks_completed_total |
| Tasks failed | Counter | proplet_tasks_failed_total |
| Tasks currently running | Gauge | proplet_tasks_running |
| WASM bytes fetched | Counter | proplet_wasm_fetch_bytes_total |
| MQTT reconnects | Counter | proplet_mqtt_reconnects_total |

What does this do?
Surfaces proplet CPU and container memory data as Prometheus metrics and instruments the task execution lifecycle with OTLP trace spans.
Previously, CPU and container memory data from `MetricsCollector` was only published over MQTT to the manager. This PR makes that data scrapeable at `:9092/metrics` alongside the existing task counters by adding four new Gauges:
- `proplet_cpu_user_seconds`: cumulative CPU time in user mode (sourced from `/proc/self/stat`)
- `proplet_cpu_system_seconds`: cumulative CPU time in kernel mode
- `proplet_memory_container_usage_bytes`: container memory usage in bytes (from cgroup)
- `proplet_memory_container_limit_bytes`: container memory limit in bytes (0 = unlimited)

It also adds three OTLP trace spans so proplet execution appears in Jaeger alongside the manager:
- `task.start`: covers MQTT decode through task dispatch, tagged with `task_id` and `task_name`
- `task.execute`: wraps the full spawned task lifetime end-to-end
- `wasm.execute`: child span around `runtime.start_app` so wasmtime execution time is visible separately

The Grafana dashboard gains two new panels in the Proplet System row: CPU Time (cumulative) and Container Memory.
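For readers unfamiliar with `/proc/self/stat`, the following sketch shows one way the two cumulative CPU values could be extracted. It is not taken from `MetricsCollector`: the function name is hypothetical, and the clock-tick rate is passed in as a parameter (100 is common on Linux, but that is an assumption here).

```rust
/// Extracts cumulative (user, system) CPU seconds from one `/proc/<pid>/stat` line.
///
/// `ticks_per_second` is the kernel clock-tick rate; it is a caller-supplied
/// assumption in this sketch rather than something queried from the system.
fn cpu_seconds_from_stat(stat: &str, ticks_per_second: f64) -> Option<(f64, f64)> {
    // Field 2 (comm) is parenthesised and may contain spaces,
    // so parsing resumes after the last ')'.
    let after_comm = &stat[stat.rfind(')')? + 1..];
    let mut fields = after_comm.split_whitespace();

    // `after_comm` starts at field 3 (state); utime is field 14, so offset 11.
    let utime: u64 = fields.nth(11)?.parse().ok()?;
    // stime is field 15, immediately after utime.
    let stime: u64 = fields.next()?.parse().ok()?;

    Some((
        utime as f64 / ticks_per_second,
        stime as f64 / ticks_per_second,
    ))
}
```

Both values only ever grow for a live process, but they are sampled and written as absolute readings rather than accumulated by the exporter.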
Also fixes a double-count bug in the FML coordinator error path where `tasks_failed` was incremented even when the wasm task had already succeeded.

Which issue(s) does this PR fix/relate to?
No issue

List any changes that modify/break current functionality
No breaking changes. The four new metrics are additive. The `task.stop` span gains a `task_id` field for Jaeger correlation with `task.start`.

Have you included tests for your changes?
Yes: `cargo test` passes (128 tests). Three tests in `telemetry.rs` cover the new metrics:
- `test_metrics_new_all_zero`: asserts all 12 metrics initialise to zero
- `test_metrics_registry_gathers_all_metrics`: asserts all 12 metric names are present in the Prometheus registry
- `test_cpu_and_container_metrics_set`: exercises the new gauge setters with concrete values

Did you document any new/modified functionality?
The Grafana dashboard JSON (`docker/addons/grafana/grafana/dashboards/propeller.json`) is updated with the two new panels.

Notes
The CPU gauge names intentionally omit the `_total` suffix. Prometheus convention reserves `_total` for `Counter` types. These metrics are `Gauge` values set via `set()`, not monotonically incremented via `inc()`.
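The distinction can be illustrated with two minimal, standard-library-only types. These are hypothetical stand-ins written for this note, not the Prometheus client's `Gauge` and `Counter`; they only mirror the `set()` versus `inc()` semantics described above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative gauge: each update overwrites the stored reading.
struct Gauge(AtomicU64);

impl Gauge {
    fn new() -> Self {
        Gauge(AtomicU64::new(0f64.to_bits()))
    }
    /// Stores an absolute value, e.g. the latest sampled CPU seconds.
    fn set(&self, value: f64) {
        self.0.store(value.to_bits(), Ordering::Relaxed);
    }
    fn get(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Relaxed))
    }
}

/// Illustrative counter: the only mutation is an increment.
struct Counter(AtomicU64);

impl Counter {
    fn new() -> Self {
        Counter(AtomicU64::new(0))
    }
    /// Adds one; there is deliberately no way to assign a value.
    fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}
```

Because the exporter copies a sampled reading into the metric rather than counting events itself, the gauge shape fits the CPU-seconds values even though the underlying quantity is cumulative.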