
refactor: move uptime tracking from system_status_server(HTTP) to DRT level #2587

Merged
keivenchang merged 2 commits into main from
keivenchang/DIS-482__refactor-the-dynamo-uptime-from-system-status
Aug 25, 2025

Conversation

@keivenchang
Contributor

@keivenchang keivenchang commented Aug 21, 2025

Overview:

Refactors the system metrics and uptime tracking architecture to improve separation of concerns (e.g. uptime should NOT depend on enabling the system_status_server/HTTP).

Details:

  • Move uptime tracking from SystemStatusState to DRT's SystemHealth
  • Improve test isolation and reduce component coupling
  • Consolidate metrics initialization in DistributedRuntime
  • Simplify SystemStatusState by removing uptime responsibilities

Where should the reviewer start?

  • lib/runtime/src/lib.rs - New SystemHealth uptime implementation
  • lib/runtime/src/system_status_server.rs - Simplified SystemStatusState
  • lib/runtime/src/distributed.rs - Metrics initialization

Related Issues:

Relates to DIS-482

Summary by CodeRabbit

  • New Features

    • Added automatic uptime tracking with an exported uptime metric initialized at startup.
    • Uptime is continuously updated and reported in seconds.
  • Bug Fixes

    • Uptime metrics now work even when system status endpoints are disabled.
    • Duplicate metric registration is handled gracefully without failures.
  • Refactor

    • Centralized uptime logic in the system health component for consistent reporting across health and metrics endpoints.
    • Simplified status server to rely on centralized health data.

@coderabbitai
Contributor

coderabbitai bot commented Aug 21, 2025

Walkthrough

Centralizes uptime tracking in SystemHealth and initializes the uptime gauge during DistributedRuntime startup. Removes local uptime handling from SystemStatusState; handlers now read uptime/health from SystemHealth and update the shared gauge. Public API adds uptime methods on SystemHealth and drops uptime-related methods from SystemStatusState.
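
For orientation, a minimal sketch of the resulting SystemHealth surface, assuming the OnceLock-backed gauge described in the final commit message (the exact signatures in the PR differ; error handling and the health/path fields are omitted here):

use std::sync::OnceLock;
use std::time::{Duration, Instant};

use prometheus::Gauge;

// Sketch only: the real struct also carries health status and endpoint paths.
pub struct SystemHealth {
    start_time: Instant,
    uptime_gauge: OnceLock<Gauge>,
}

impl SystemHealth {
    pub fn new() -> Self {
        Self {
            start_time: Instant::now(),
            uptime_gauge: OnceLock::new(),
        }
    }

    /// Elapsed time since the runtime was constructed.
    pub fn uptime(&self) -> Duration {
        self.start_time.elapsed()
    }

    /// Store a registered gauge once; later calls are no-ops.
    pub fn initialize_uptime_gauge(&self, gauge: Gauge) {
        let _ = self.uptime_gauge.set(gauge);
    }

    /// Refresh the gauge with the current uptime in seconds, if one was registered.
    pub fn update_uptime_gauge(&self) {
        if let Some(gauge) = self.uptime_gauge.get() {
            gauge.set(self.uptime().as_secs_f64());
        }
    }
}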

Changes

  • SystemHealth uptime support (lib/runtime/src/lib.rs): Added start_time and an optional prometheus Gauge to SystemHealth; new methods to initialize the gauge, compute uptime, and update the gauge.
  • Early uptime gauge init (lib/runtime/src/distributed.rs): Initializes SystemHealth’s uptime gauge during DistributedRuntime::new; removes the fallback SystemStatusState-based uptime when HTTP endpoints are disabled.
  • SystemStatus server refactor (lib/runtime/src/system_status_server.rs): Removed local uptime fields/methods from SystemStatusState; handlers now obtain uptime and health via DistributedRuntime.system_health; the metrics handler updates the shared uptime gauge. Tests and imports adjusted accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App as Application
  participant DRT as DistributedRuntime
  participant SH as SystemHealth
  participant Prom as MetricsRegistry

  rect rgb(235, 245, 255)
  note over App,DRT: Startup
  App->>DRT: new(...)
  DRT->>SH: lock() and initialize_uptime_gauge(Prom)
  SH-->>DRT: Gauge registered or no-op (duplicate)
  end

  rect rgb(240, 255, 240)
  note over App,DRT: HTTP request handling
  App->>DRT: /health or /metrics
  DRT->>SH: uptime(), get_health_status()
  SH-->>DRT: Duration, HealthStatus
  DRT->>SH: update_uptime_gauge()
  DRT-->>App: Response
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

A rabbit ticks the seconds so fleet,
Gauges hum with a metered beat.
Health now speaks with one clear voice,
Uptime shared—a single choice.
Burrow boots, the servers sing,
Metrics bloom in early spring. 🐇⏱️


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/runtime/src/system_status_server.rs (1)

174-189: Serialize uptime as a number; Duration in json! is ambiguous/fragile.

std::time::Duration isn’t guaranteed to serialize as desired in JSON across serde versions/configs. Emit a numeric field (seconds) to keep the API stable and language-agnostic.

-    let system_health = state.drt().system_health.lock().unwrap();
-    let (healthy, endpoints) = system_health.get_health_status();
-    let uptime = Some(system_health.uptime());
+    let (healthy, endpoints, uptime_seconds) = {
+        let sh = state.drt().system_health.lock().expect("SystemHealth lock poisoned");
+        let (healthy, endpoints) = sh.get_health_status();
+        (healthy, endpoints, sh.uptime().as_secs_f64())
+    };
@@
-    let response = json!({
-        "status": healthy_string,
-        "uptime": uptime,
-        "endpoints": endpoints
-    });
+    let response = json!({
+        "status": healthy_string,
+        "uptime_seconds": uptime_seconds,
+        "endpoints": endpoints
+    });

Follow-up: If clients rely on the old uptime key, consider adding it as a deprecated alias during a transition window.
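
A possible shape for that transition period (illustrative only; keeping the old uptime key as a temporary alias is an assumption, not something this PR does):

let response = json!({
    "status": healthy_string,
    // New, language-agnostic numeric field.
    "uptime_seconds": uptime_seconds,
    // Deprecated alias for existing clients; remove once the transition window closes.
    "uptime": uptime_seconds,
    "endpoints": endpoints
});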

🧹 Nitpick comments (7)
lib/runtime/src/distributed.rs (2)

120-126: Early uptime gauge initialization is good; make lock/registration non-fatal and avoid unwrap.

  • Initializing the uptime gauge during DRT startup is the right place and keeps ownership clear.
  • However, a poisoned mutex would panic due to unwrap(), and a metrics registration error will currently abort DRT construction via ?. Consider treating both as non-fatal so runtime startup isn’t blocked by observability issues.

Apply this diff to minimize panic risk and keep startup resilient:

-        distributed_runtime
-            .system_health
-            .lock()
-            .unwrap()
-            .initialize_uptime_gauge(&distributed_runtime)?;
+        match distributed_runtime.system_health.lock() {
+            Ok(mut sh) => {
+                if let Err(e) = sh.initialize_uptime_gauge(&distributed_runtime) {
+                    tracing::warn!(
+                        "Failed to initialize uptime_seconds gauge; continuing without it: {e}"
+                    );
+                }
+            }
+            Err(poison) => {
+                tracing::warn!(
+                    "SystemHealth lock poisoned during uptime gauge initialization: {poison}"
+                );
+            }
+        }

163-165: Uptime still tracked without HTTP, but no updater is called — add a metrics callback.

When the HTTP server is disabled, nothing triggers update_uptime_gauge(). Register a runtime metrics callback (like the NATS one) to update uptime on scrape for this hierarchy so any prometheus_metrics_fmt() caller gets a fresh value.

Example insertion after the uptime gauge initialization (recompute hierarchies since the NATS path consumes its Vec):

+        // Ensure uptime gauge updates on scrapes (even without HTTP server)
+        let uptime_hierarchies = {
+            let mut hs = distributed_runtime.parent_hierarchy();
+            hs.push(distributed_runtime.hierarchy());
+            hs
+        };
+        let system_health_for_cb = distributed_runtime.system_health.clone();
+        let uptime_cb = Arc::new(move || {
+            if let Ok(sh) = system_health_for_cb.lock() {
+                sh.update_uptime_gauge();
+            }
+            Ok(())
+        });
+        distributed_runtime.register_metrics_callback(uptime_hierarchies, uptime_cb);
lib/runtime/src/lib.rs (1)

154-176: Avoid brittle string-based duplicate detection when registering the uptime gauge

The current check

Err(e) if e.to_string().contains("Duplicate metrics") => {}

relies on the wording of an underlying anyhow::Error message and may break silently if that message changes.

Consider one of the following more robust approaches:

• Extend the MetricsRegistry API to expose a structured “already registered” error variant (e.g. Error::AlreadyRegistered) so callers can match on the error type instead of its text.
• Introduce a get_or_create_gauge(name, desc, labels) -> Gauge helper on MetricsRegistry that returns the existing gauge if present, or creates and returns a new one otherwise. Document its idempotent semantics.
• If adding a helper isn’t feasible right away, provide a way on MetricsRegistry to query for the existence of a metric by name (e.g. has_metric("uptime_seconds")) and guard create_gauge calls accordingly.

By adopting a structured or idempotent registration API, we eliminate the need for fragile string matching and ensure that future changes to underlying error messages won’t disable uptime reporting.
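
A rough sketch of the helper approach, built directly on the prometheus crate (the MetricsRegistry wiring, metric prefixes, and labels used in this repo are not shown; this is an assumption about shape, not the repo's actual API):

use prometheus::{Gauge, Opts, Registry};

/// Register a gauge if it doesn't exist yet, matching the structured
/// AlreadyReg error variant instead of the error's message text.
fn get_or_create_gauge(registry: &Registry, name: &str, help: &str) -> prometheus::Result<Gauge> {
    let gauge = Gauge::with_opts(Opts::new(name, help))?;
    match registry.register(Box::new(gauge.clone())) {
        Ok(()) => Ok(gauge),
        // Duplicate registration: return the fresh (unregistered) handle so updates
        // become harmless no-ops, mirroring today's behavior without string matching.
        Err(prometheus::Error::AlreadyReg) => Ok(gauge),
        Err(e) => Err(e),
    }
}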

lib/runtime/src/system_status_server.rs (4)

93-108: Avoid taking two locks for paths; read both under a single guard.

Two consecutive lock().unwrap() calls are unnecessary and slightly increase contention. Read both health_path and live_path while holding one guard.

-    let health_path = server_state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .health_path
-        .clone();
-    let live_path = server_state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .live_path
-        .clone();
+    let (health_path, live_path) = {
+        let sh = server_state.drt().system_health.lock().expect("SystemHealth lock poisoned");
+        (sh.health_path.clone(), sh.live_path.clone())
+    };

200-205: Update-before-scrape is correct; minimize panic risk.

Good place to refresh the gauge. As a small hardening, avoid unwrap() to prevent a poisoned lock from taking down the endpoint.

-    state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .update_uptime_gauge();
+    if let Ok(sh) = state.drt().system_health.lock() {
+        sh.update_uptime_gauge();
+    } else {
+        tracing::warn!("SystemHealth lock poisoned; skipping uptime gauge update");
+    }

320-336: Nice sanity checks for uptime monotonicity.

The test validates that SystemHealth::uptime() exists and increases. Consider asserting the JSON shape in /health includes uptime_seconds once you switch serialization, to prevent regressions.

I can add a small HTTP test asserting the numeric uptime field and monotonic increase across two requests. Want me to draft it?
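
For reference, the core assertion could be as small as this (uptime_seconds is the proposed field name; fetching and parsing the two /health responses is left to the existing test harness):

use serde_json::Value;

/// Given two parsed /health bodies taken in order, check that the numeric
/// uptime field exists and does not decrease between the requests.
fn assert_uptime_monotonic(first: &Value, second: &Value) {
    let t1 = first["uptime_seconds"]
        .as_f64()
        .expect("first /health response missing numeric uptime_seconds");
    let t2 = second["uptime_seconds"]
        .as_f64()
        .expect("second /health response missing numeric uptime_seconds");
    assert!(t2 >= t1, "uptime went backwards: {t1} -> {t2}");
}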


381-405: Gauge update test exercises duration, not the metric — consider asserting the metric value.

This test proves uptime increases but doesn’t validate the gauge was actually updated. If feasible, scrape the registry (via prometheus_metrics_fmt()) twice and assert dynamo_component_uptime_seconds increases.

I can provide a test snippet to parse the metrics output and compare the gauge value across two scrapes.
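
A starting point for that snippet, operating on the Prometheus text exposition output (the metric name comes from the comment above; whether prometheus_metrics_fmt() is the right scrape source here is an assumption):

/// Pull the value of a gauge (with or without labels) out of Prometheus text output.
fn gauge_value(metrics_text: &str, metric_name: &str) -> Option<f64> {
    metrics_text
        .lines()
        .filter(|line| !line.starts_with('#'))
        .find_map(|line| {
            let (name, value) = line.rsplit_once(' ')?;
            if name == metric_name || name.starts_with(&format!("{metric_name}{{")) {
                value.trim().parse::<f64>().ok()
            } else {
                None
            }
        })
}

// Usage sketch: scrape twice, then assert the gauge moved forward.
// let first = gauge_value(&scrape_1, "dynamo_component_uptime_seconds").unwrap();
// let second = gauge_value(&scrape_2, "dynamo_component_uptime_seconds").unwrap();
// assert!(second >= first);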

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 4eb2563 and f2ec27d.

📒 Files selected for processing (3)
  • lib/runtime/src/distributed.rs (2 hunks)
  • lib/runtime/src/lib.rs (4 hunks)
  • lib/runtime/src/system_status_server.rs (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
lib/runtime/src/lib.rs (2)
lib/llm/src/block_manager/metrics.rs (1)
  • gauge (71-75)
lib/runtime/src/metrics.rs (8)
  • std (295-295)
  • std (296-296)
  • std (310-310)
  • std (310-310)
  • std (325-325)
  • std (325-325)
  • std (340-340)
  • std (340-340)
lib/runtime/src/system_status_server.rs (3)
lib/runtime/src/distributed.rs (1)
  • new (53-168)
lib/runtime/src/lib.rs (3)
  • new (99-118)
  • new (211-216)
  • uptime (179-181)
lib/runtime/src/metrics.rs (10)
  • new (1473-1475)
  • create_test_drt_async (583-588)
  • std (295-295)
  • std (296-296)
  • std (310-310)
  • std (310-310)
  • std (325-325)
  • std (325-325)
  • std (340-340)
  • std (340-340)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
lib/runtime/src/lib.rs (2)

94-117: Centralizing uptime in SystemHealth looks solid.

The addition of start_time and an optional prometheus::Gauge is clean, keeps concerns together, and avoids coupling to the HTTP server. Constructor initialization with Instant::now() is appropriate.


183-188: Gauge updates in seconds are correct and cheap.

as_secs_f64() is an appropriate unit for Prometheus gauges and the method is a no-op when the gauge is absent. Looks good.

lib/runtime/src/system_status_server.rs (8)

67-71: State simplification is good.

Removing local uptime tracking from SystemStatusState and delegating to SystemHealth reduces duplication and coupling.


75-77: Constructor is minimal and clear.

Returning Self { root_drt: drt } keeps SystemStatusState narrowly scoped. 👍


444-517: Endpoints tests look good and reflect path customization.

Coverage for default/custom paths and 404s is solid. Once uptime_seconds is emitted, you might add a quick JSON parse to ensure the field exists (no need to check exact value).


576-688: Health transition test is robust.

The “200 of 200” loop is a good stress test of readiness transitions. No issues.


21-25: Minor: use of Arc and handle wrapping in SystemStatusServerInfo is tidy.

Storing the spawned handle behind Arc allows sharing without ownership confusion. LGTM.

Also applies to: 234-241


16-29: Trace layer usage is consistent with existing patterns.

No concerns on the axum/trace setup.


1-15: Headers/license remain intact.

No action needed.


1-741: No stale references to removed SystemStatusState APIs detected

The searches confirm that none of the removed SystemStatusState methods (initialize_start_time, uptime, update_uptime_gauge) are still being referenced. All remaining calls to uptime() and update_uptime_gauge() are invoked on system_health, not on SystemStatusState.

Contributor

@rmccorm4 rmccorm4 left a comment


one comment, LGTM otherwise - please fix the merge conflict too

@keivenchang keivenchang force-pushed the keivenchang/DIS-482__refactor-the-dynamo-uptime-from-system-status branch from f2ec27d to 0741ad2 on August 23, 2025 00:02
@keivenchang
Contributor Author

Graham did a Rust version update, so I rebased onto main and used push -f (instead of merge), just to keep the history cleaner.

…ead safety

- Move uptime tracking from SystemStatusState to SystemHealth
- Replace Option<prometheus::Gauge> with OnceLock<prometheus::Gauge> for better thread safety
- Add tests for uptime functionality with system enabled/disabled
- Fix clippy warning by removing unnecessary ref in pattern matching
@keivenchang keivenchang merged commit 68fb3d9 into main Aug 25, 2025
13 of 14 checks passed
@keivenchang keivenchang deleted the keivenchang/DIS-482__refactor-the-dynamo-uptime-from-system-status branch August 25, 2025 22:56
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
nv-anants pushed a commit that referenced this pull request Aug 28, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
jasonqinzhou pushed a commit that referenced this pull request Aug 30, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>
KrishnanPrash pushed a commit that referenced this pull request Sep 2, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
… level (#2587)

Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>