
refactor: move uptime tracking from system_status_server(HTTP) to DRT level #2587

Merged
keivenchang merged 2 commits into main from
keivenchang/DIS-482__refactor-the-dynamo-uptime-from-system-status
Aug 25, 2025

Conversation

@keivenchang
Contributor

@keivenchang keivenchang commented Aug 21, 2025

Overview:

Refactors the system metrics and uptime tracking architecture to improve separation of concerns (e.g. uptime should NOT depend on enabling the system_status_server/HTTP).

Details:

  • Move uptime tracking from SystemStatusState to DRT's SystemHealth
  • Improve test isolation and reduce component coupling
  • Consolidate metrics initialization in DistributedRuntime
  • Simplify SystemStatusState by removing uptime responsibilities

Where should the reviewer start?

  • lib/runtime/src/lib.rs - New SystemHealth uptime implementation
  • lib/runtime/src/system_status_server.rs - Simplified SystemStatusState
  • lib/runtime/src/distributed.rs - Metrics initialization

Related Issues:

Relates to DIS-482

Summary by CodeRabbit

  • New Features

    • Added automatic uptime tracking with an exported uptime metric initialized at startup.
    • Uptime is continuously updated and reported in seconds.
  • Bug Fixes

    • Uptime metrics now work even when system status endpoints are disabled.
    • Duplicate metric registration is handled gracefully without failures.
  • Refactor

    • Centralized uptime logic in the system health component for consistent reporting across health and metrics endpoints.
    • Simplified status server to rely on centralized health data.

@coderabbitai
Contributor

coderabbitai bot commented Aug 21, 2025

Walkthrough

Centralizes uptime tracking in SystemHealth and initializes the uptime gauge during DistributedRuntime startup. Removes local uptime handling from SystemStatusState; handlers now read uptime/health from SystemHealth and update the shared gauge. Public API adds uptime methods on SystemHealth and drops uptime-related methods from SystemStatusState.
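
For orientation, a minimal sketch of the resulting SystemHealth surface, assuming the OnceLock-backed gauge described in the final commit message (the exact signatures in the PR differ; error handling and the health/path fields are omitted here):

use std::sync::OnceLock;
use std::time::{Duration, Instant};

use prometheus::Gauge;

// Sketch only: the real struct also carries health status and endpoint paths.
pub struct SystemHealth {
    start_time: Instant,
    uptime_gauge: OnceLock<Gauge>,
}

impl SystemHealth {
    pub fn new() -> Self {
        Self {
            start_time: Instant::now(),
            uptime_gauge: OnceLock::new(),
        }
    }

    /// Elapsed time since the runtime was constructed.
    pub fn uptime(&self) -> Duration {
        self.start_time.elapsed()
    }

    /// Store a registered gauge once; later calls are no-ops.
    pub fn initialize_uptime_gauge(&self, gauge: Gauge) {
        let _ = self.uptime_gauge.set(gauge);
    }

    /// Refresh the gauge with the current uptime in seconds, if one was registered.
    pub fn update_uptime_gauge(&self) {
        if let Some(gauge) = self.uptime_gauge.get() {
            gauge.set(self.uptime().as_secs_f64());
        }
    }
}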

Changes

  • SystemHealth uptime support (lib/runtime/src/lib.rs): Added start_time and an optional prometheus Gauge to SystemHealth; new methods to initialize the gauge, compute uptime, and update the gauge.
  • Early uptime gauge init (lib/runtime/src/distributed.rs): Initializes SystemHealth’s uptime gauge during DistributedRuntime::new; removes the fallback SystemStatusState-based uptime when HTTP endpoints are disabled.
  • SystemStatus server refactor (lib/runtime/src/system_status_server.rs): Removed local uptime fields/methods from SystemStatusState; handlers now obtain uptime and health via DistributedRuntime.system_health; the metrics handler updates the shared uptime gauge. Tests and imports adjusted accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App as Application
  participant DRT as DistributedRuntime
  participant SH as SystemHealth
  participant Prom as MetricsRegistry

  rect rgb(235, 245, 255)
  note over App,DRT: Startup
  App->>DRT: new(...)
  DRT->>SH: lock() and initialize_uptime_gauge(Prom)
  SH-->>DRT: Gauge registered or no-op (duplicate)
  end

  rect rgb(240, 255, 240)
  note over App,DRT: HTTP request handling
  App->>DRT: /health or /metrics
  DRT->>SH: uptime(), get_health_status()
  SH-->>DRT: Duration, HealthStatus
  DRT->>SH: update_uptime_gauge()
  DRT-->>App: Response
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

A rabbit ticks the seconds so fleet,
Gauges hum with a metered beat.
Health now speaks with one clear voice,
Uptime shared—a single choice.
Burrow boots, the servers sing,
Metrics bloom in early spring. 🐇⏱️


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/runtime/src/system_status_server.rs (1)

174-189: Serialize uptime as a number; Duration in json! is ambiguous/fragile.

std::time::Duration isn’t guaranteed to serialize as desired in JSON across serde versions/configs. Emit a numeric field (seconds) to keep the API stable and language-agnostic.

-    let system_health = state.drt().system_health.lock().unwrap();
-    let (healthy, endpoints) = system_health.get_health_status();
-    let uptime = Some(system_health.uptime());
+    let (healthy, endpoints, uptime_seconds) = {
+        let sh = state.drt().system_health.lock().expect("SystemHealth lock poisoned");
+        let (healthy, endpoints) = sh.get_health_status();
+        (healthy, endpoints, sh.uptime().as_secs_f64())
+    };
@@
-    let response = json!({
-        "status": healthy_string,
-        "uptime": uptime,
-        "endpoints": endpoints
-    });
+    let response = json!({
+        "status": healthy_string,
+        "uptime_seconds": uptime_seconds,
+        "endpoints": endpoints
+    });

Follow-up: If clients rely on the old uptime key, consider adding it as a deprecated alias during a transition window.
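
A possible shape for that transition period (illustrative only; keeping the old uptime key as a temporary alias is an assumption, not something this PR does):

let response = json!({
    "status": healthy_string,
    // New, language-agnostic numeric field.
    "uptime_seconds": uptime_seconds,
    // Deprecated alias for existing clients; remove once the transition window closes.
    "uptime": uptime_seconds,
    "endpoints": endpoints
});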

🧹 Nitpick comments (7)
lib/runtime/src/distributed.rs (2)

120-126: Early uptime gauge initialization is good; make lock/registration non-fatal and avoid unwrap.

  • Initializing the uptime gauge during DRT startup is the right place and keeps ownership clear.
  • However, a poisoned mutex would panic due to unwrap(), and a metrics registration error will currently abort DRT construction via ?. Consider treating both as non-fatal so runtime startup isn’t blocked by observability issues.

Apply this diff to minimize panic risk and keep startup resilient:

-        distributed_runtime
-            .system_health
-            .lock()
-            .unwrap()
-            .initialize_uptime_gauge(&distributed_runtime)?;
+        match distributed_runtime.system_health.lock() {
+            Ok(mut sh) => {
+                if let Err(e) = sh.initialize_uptime_gauge(&distributed_runtime) {
+                    tracing::warn!(
+                        "Failed to initialize uptime_seconds gauge; continuing without it: {e}"
+                    );
+                }
+            }
+            Err(poison) => {
+                tracing::warn!(
+                    "SystemHealth lock poisoned during uptime gauge initialization: {poison}"
+                );
+            }
+        }

163-165: Uptime still tracked without HTTP, but no updater is called — add a metrics callback.

When the HTTP server is disabled, nothing triggers update_uptime_gauge(). Register a runtime metrics callback (like the NATS one) to update uptime on scrape for this hierarchy so any prometheus_metrics_fmt() caller gets a fresh value.

Example insertion after the uptime gauge initialization (recompute hierarchies since the NATS path consumes its Vec):

+        // Ensure uptime gauge updates on scrapes (even without HTTP server)
+        let uptime_hierarchies = {
+            let mut hs = distributed_runtime.parent_hierarchy();
+            hs.push(distributed_runtime.hierarchy());
+            hs
+        };
+        let system_health_for_cb = distributed_runtime.system_health.clone();
+        let uptime_cb = Arc::new(move || {
+            if let Ok(sh) = system_health_for_cb.lock() {
+                sh.update_uptime_gauge();
+            }
+            Ok(())
+        });
+        distributed_runtime.register_metrics_callback(uptime_hierarchies, uptime_cb);
lib/runtime/src/lib.rs (1)

154-176: Avoid brittle string-based duplicate detection when registering the uptime gauge

The current check

Err(e) if e.to_string().contains("Duplicate metrics") => {}

relies on the wording of an underlying anyhow::Error message and may break silently if that message changes.

Consider one of the following more robust approaches:

• Extend the MetricsRegistry API to expose a structured “already registered” error variant (e.g. Error::AlreadyRegistered) so callers can match on the error type instead of its text.
• Introduce a get_or_create_gauge(name, desc, labels) -> Gauge helper on MetricsRegistry that returns the existing gauge if present, or creates and returns a new one otherwise. Document its idempotent semantics.
• If adding a helper isn’t feasible right away, provide a way on MetricsRegistry to query for the existence of a metric by name (e.g. has_metric("uptime_seconds")) and guard create_gauge calls accordingly.

By adopting a structured or idempotent registration API, we eliminate the need for fragile string matching and ensure that future changes to underlying error messages won’t disable uptime reporting.
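
A rough sketch of the helper approach, built directly on the prometheus crate (the MetricsRegistry wiring, metric prefixes, and labels used in this repo are not shown; this is an assumption about shape, not the repo's actual API):

use prometheus::{Gauge, Opts, Registry};

/// Register a gauge if it doesn't exist yet, matching the structured
/// AlreadyReg error variant instead of the error's message text.
fn get_or_create_gauge(registry: &Registry, name: &str, help: &str) -> prometheus::Result<Gauge> {
    let gauge = Gauge::with_opts(Opts::new(name, help))?;
    match registry.register(Box::new(gauge.clone())) {
        Ok(()) => Ok(gauge),
        // Duplicate registration: return the fresh (unregistered) handle so updates
        // become harmless no-ops, mirroring today's behavior without string matching.
        Err(prometheus::Error::AlreadyReg) => Ok(gauge),
        Err(e) => Err(e),
    }
}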

lib/runtime/src/system_status_server.rs (4)

93-108: Avoid taking two locks for paths; read both under a single guard.

Two consecutive lock().unwrap() calls are unnecessary and slightly increase contention. Read both health_path and live_path while holding one guard.

-    let health_path = server_state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .health_path
-        .clone();
-    let live_path = server_state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .live_path
-        .clone();
+    let (health_path, live_path) = {
+        let sh = server_state.drt().system_health.lock().expect("SystemHealth lock poisoned");
+        (sh.health_path.clone(), sh.live_path.clone())
+    };

200-205: Update-before-scrape is correct; minimize panic risk.

Good place to refresh the gauge. As a small hardening, avoid unwrap() to prevent a poisoned lock from taking down the endpoint.

-    state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .update_uptime_gauge();
+    if let Ok(sh) = state.drt().system_health.lock() {
+        sh.update_uptime_gauge();
+    } else {
+        tracing::warn!("SystemHealth lock poisoned; skipping uptime gauge update");
+    }

320-336: Nice sanity checks for uptime monotonicity.

The test validates that SystemHealth::uptime() exists and increases. Consider asserting the JSON shape in /health includes uptime_seconds once you switch serialization, to prevent regressions.

I can add a small HTTP test asserting the numeric uptime field and monotonic increase across two requests. Want me to draft it?
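
For reference, the core assertion could be as small as this (uptime_seconds is the proposed field name; fetching and parsing the two /health responses is left to the existing test harness):

use serde_json::Value;

/// Given two parsed /health bodies taken in order, check that the numeric
/// uptime field exists and does not decrease between the requests.
fn assert_uptime_monotonic(first: &Value, second: &Value) {
    let t1 = first["uptime_seconds"]
        .as_f64()
        .expect("first /health response missing numeric uptime_seconds");
    let t2 = second["uptime_seconds"]
        .as_f64()
        .expect("second /health response missing numeric uptime_seconds");
    assert!(t2 >= t1, "uptime went backwards: {t1} -> {t2}");
}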


381-405: Gauge update test exercises duration, not the metric — consider asserting the metric value.

This test proves uptime increases but doesn’t validate the gauge was actually updated. If feasible, scrape the registry (via prometheus_metrics_fmt()) twice and assert dynamo_component_uptime_seconds increases.

I can provide a test snippet to parse the metrics output and compare the gauge value across two scrapes.
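
A starting point for that snippet, operating on the Prometheus text exposition output (the metric name comes from the comment above; whether prometheus_metrics_fmt() is the right scrape source here is an assumption):

/// Pull the value of a gauge (with or without labels) out of Prometheus text output.
fn gauge_value(metrics_text: &str, metric_name: &str) -> Option<f64> {
    metrics_text
        .lines()
        .filter(|line| !line.starts_with('#'))
        .find_map(|line| {
            let (name, value) = line.rsplit_once(' ')?;
            if name == metric_name || name.starts_with(&format!("{metric_name}{{")) {
                value.trim().parse::<f64>().ok()
            } else {
                None
            }
        })
}

// Usage sketch: scrape twice, then assert the gauge moved forward.
// let first = gauge_value(&scrape_1, "dynamo_component_uptime_seconds").unwrap();
// let second = gauge_value(&scrape_2, "dynamo_component_uptime_seconds").unwrap();
// assert!(second >= first);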

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 4eb2563 and f2ec27d.

📒 Files selected for processing (3)
  • lib/runtime/src/distributed.rs (2 hunks)
  • lib/runtime/src/lib.rs (4 hunks)
  • lib/runtime/src/system_status_server.rs (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
lib/runtime/src/lib.rs (2)
lib/llm/src/block_manager/metrics.rs (1)
  • gauge (71-75)
lib/runtime/src/metrics.rs (8)
  • std (295-295)
  • std (296-296)
  • std (310-310)
  • std (310-310)
  • std (325-325)
  • std (325-325)
  • std (340-340)
  • std (340-340)
lib/runtime/src/system_status_server.rs (3)
lib/runtime/src/distributed.rs (1)
  • new (53-168)
lib/runtime/src/lib.rs (3)
  • new (99-118)
  • new (211-216)
  • uptime (179-181)
lib/runtime/src/metrics.rs (10)
  • new (1473-1475)
  • create_test_drt_async (583-588)
  • std (295-295)
  • std (296-296)
  • std (310-310)
  • std (310-310)
  • std (325-325)
  • std (325-325)
  • std (340-340)
  • std (340-340)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
lib/runtime/src/lib.rs (2)

94-117: Centralizing uptime in SystemHealth looks solid.

The addition of start_time and an optional prometheus::Gauge is clean, keeps concerns together, and avoids coupling to the HTTP server. Constructor initialization with Instant::now() is appropriate.


183-188: Gauge updates in seconds are correct and cheap.

as_secs_f64() is an appropriate unit for Prometheus gauges and the method is a no-op when the gauge is absent. Looks good.

lib/runtime/src/system_status_server.rs (8)

67-71: State simplification is good.

Removing local uptime tracking from SystemStatusState and delegating to SystemHealth reduces duplication and coupling.


75-77: Constructor is minimal and clear.

Returning Self { root_drt: drt } keeps SystemStatusState narrowly scoped. 👍


444-517: Endpoints tests look good and reflect path customization.

Coverage for default/custom paths and 404s is solid. Once uptime_seconds is emitted, you might add a quick JSON parse to ensure the field exists (no need to check exact value).


576-688: Health transition test is robust.

The “200 of 200” loop is a good stress test of readiness transitions. No issues.


21-25: Minor: use of Arc and handle wrapping in SystemStatusServerInfo is tidy.

Storing the spawned handle behind Arc allows sharing without ownership confusion. LGTM.

Also applies to: 234-241


16-29: Trace layer usage is consistent with existing patterns.

No concerns on the axum/trace setup.


1-15: Headers/license remain intact.

No action needed.


1-741: No stale references to removed SystemStatusState APIs detected

The searches confirm that none of the removed SystemStatusState methods (initialize_start_time, uptime, update_uptime_gauge) are still being referenced. All remaining calls to uptime() and update_uptime_gauge() are invoked on system_health, not on SystemStatusState.

Contributor

@rmccorm4 rmccorm4 left a comment


one comment, LGTM otherwise - please fix the merge conflict too

@keivenchang keivenchang force-pushed the keivenchang/DIS-482__refactor-the-dynamo-uptime-from-system-status branch from f2ec27d to 0741ad2 on August 23, 2025 00:02
@keivenchang
Contributor Author

Graham did a Rust version update, so I rebased onto main and used push -f (instead of merge), just to keep the history cleaner.

…ead safety

- Move uptime tracking from SystemStatusState to SystemHealth
- Replace Option<prometheus::Gauge> with OnceLock<prometheus::Gauge> for better thread safety
- Add tests for uptime functionality with system enabled/disabled
- Fix clippy warning by removing unnecessary ref in pattern matching
@keivenchang keivenchang merged commit 68fb3d9 into main Aug 25, 2025
13 of 14 checks passed
@keivenchang keivenchang deleted the keivenchang/DIS-482__refactor-the-dynamo-uptime-from-system-status branch August 25, 2025 22:56
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
nv-anants pushed a commit that referenced this pull request Aug 28, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
jasonqinzhou pushed a commit that referenced this pull request Aug 30, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>
KrishnanPrash pushed a commit that referenced this pull request Sep 2, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>