refactor: move uptime tracking from system_status_server (HTTP) to DRT level #2587
Conversation
Walkthrough

Centralizes uptime tracking in SystemHealth and initializes the uptime gauge during DistributedRuntime startup. Removes local uptime handling from SystemStatusState; handlers now read uptime/health from SystemHealth and update the shared gauge. The public API adds uptime methods on SystemHealth and drops uptime-related methods from SystemStatusState.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant App as Application
    participant DRT as DistributedRuntime
    participant SH as SystemHealth
    participant Prom as MetricsRegistry
    rect rgb(235, 245, 255)
        note over App,DRT: Startup
        App->>DRT: new(...)
        DRT->>SH: lock() and initialize_uptime_gauge(Prom)
        SH-->>DRT: Gauge registered or no-op (duplicate)
    end
    rect rgb(240, 255, 240)
        note over App,DRT: HTTP request handling
        App->>DRT: /health or /metrics
        DRT->>SH: uptime(), get_health_status()
        SH-->>DRT: Duration, HealthStatus
        DRT->>SH: update_uptime_gauge()
        DRT-->>App: Response
    end
```
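For orientation, a minimal sketch of the uptime surface described in the walkthrough and diagram. Field names are assumptions inferred from the review comments below (`start_time`, an optional `prometheus::Gauge`, seconds via `as_secs_f64()`); the real SystemHealth carries additional health and endpoint state.

```rust
use std::time::{Duration, Instant};

// Hypothetical reduced shape of SystemHealth; only the uptime-related
// pieces from the review are shown here.
pub struct SystemHealth {
    start_time: Instant,
    uptime_gauge: Option<prometheus::Gauge>,
}

impl SystemHealth {
    pub fn new() -> Self {
        Self {
            start_time: Instant::now(), // uptime starts at DRT construction
            uptime_gauge: None,         // registered later, if metrics are enabled
        }
    }

    /// Elapsed time since the DistributedRuntime was constructed.
    pub fn uptime(&self) -> Duration {
        self.start_time.elapsed()
    }

    /// Refreshes the gauge in seconds; a no-op when no gauge is registered.
    pub fn update_uptime_gauge(&self) {
        if let Some(gauge) = &self.uptime_gauge {
            gauge.set(self.uptime().as_secs_f64());
        }
    }
}
```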
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/runtime/src/system_status_server.rs (1)
174-189: Serialize uptime as a number; `Duration` in `json!` is ambiguous/fragile.

`std::time::Duration` isn't guaranteed to serialize as desired in JSON across serde versions/configs. Emit a numeric field (seconds) to keep the API stable and language-agnostic.

```diff
-    let system_health = state.drt().system_health.lock().unwrap();
-    let (healthy, endpoints) = system_health.get_health_status();
-    let uptime = Some(system_health.uptime());
+    let (healthy, endpoints, uptime_seconds) = {
+        let sh = state.drt().system_health.lock().expect("SystemHealth lock poisoned");
+        let (healthy, endpoints) = sh.get_health_status();
+        (healthy, endpoints, sh.uptime().as_secs_f64())
+    };
@@
-    let response = json!({
-        "status": healthy_string,
-        "uptime": uptime,
-        "endpoints": endpoints
-    });
+    let response = json!({
+        "status": healthy_string,
+        "uptime_seconds": uptime_seconds,
+        "endpoints": endpoints
+    });
```

Follow-up: if clients rely on the old `uptime` key, consider adding it as a deprecated alias during a transition window.
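One possible shape for that transition window (hypothetical; `uptime_seconds` as computed in the diff above):

```rust
let response = json!({
    "status": healthy_string,
    "uptime_seconds": uptime_seconds,
    "uptime": uptime_seconds, // deprecated alias; drop after the transition window
    "endpoints": endpoints
});
```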
🧹 Nitpick comments (7)
lib/runtime/src/distributed.rs (2)
120-126: Early uptime gauge initialization is good; make lock/registration non-fatal and avoid `unwrap()`.

- Initializing the uptime gauge during DRT startup is the right place and keeps ownership clear.
- However, a poisoned mutex would panic due to `unwrap()`, and a metrics registration error will currently abort DRT construction via `?`. Consider treating both as non-fatal so runtime startup isn't blocked by observability issues.

Apply this diff to minimize panic risk and keep startup resilient:

```diff
-        distributed_runtime
-            .system_health
-            .lock()
-            .unwrap()
-            .initialize_uptime_gauge(&distributed_runtime)?;
+        match distributed_runtime.system_health.lock() {
+            Ok(mut sh) => {
+                if let Err(e) = sh.initialize_uptime_gauge(&distributed_runtime) {
+                    tracing::warn!(
+                        "Failed to initialize uptime_seconds gauge; continuing without it: {e}"
+                    );
+                }
+            }
+            Err(poison) => {
+                tracing::warn!(
+                    "SystemHealth lock poisoned during uptime gauge initialization: {poison}"
+                );
+            }
+        }
```
163-165: Uptime still tracked without HTTP, but no updater is called; add a metrics callback.

When the HTTP server is disabled, nothing triggers `update_uptime_gauge()`. Register a runtime metrics callback (like the NATS one) to update uptime on scrape for this hierarchy, so any `prometheus_metrics_fmt()` caller gets a fresh value.

Example insertion after the uptime gauge initialization (recompute hierarchies, since the NATS path consumes its Vec):

```diff
+        // Ensure uptime gauge updates on scrapes (even without HTTP server)
+        let uptime_hierarchies = {
+            let mut hs = distributed_runtime.parent_hierarchy();
+            hs.push(distributed_runtime.hierarchy());
+            hs
+        };
+        let system_health_for_cb = distributed_runtime.system_health.clone();
+        let uptime_cb = Arc::new(move || {
+            if let Ok(sh) = system_health_for_cb.lock() {
+                sh.update_uptime_gauge();
+            }
+            Ok(())
+        });
+        distributed_runtime.register_metrics_callback(uptime_hierarchies, uptime_cb);
```

lib/runtime/src/lib.rs (1)
154-176: Avoid brittle string-based duplicate detection when registering the uptime gauge.

The current check

```rust
Err(e) if e.to_string().contains("Duplicate metrics") => { … }
```

relies on the wording of an underlying `anyhow::Error` message and may break silently if that message changes. Consider one of the following more robust approaches (the first is sketched after this list):

- Extend the `MetricsRegistry` API to expose a structured "already registered" error variant (e.g. `Error::AlreadyRegistered`) so callers can match on the error type instead of its text.
- Introduce a `get_or_create_gauge(name, desc, labels) -> Gauge` helper on `MetricsRegistry` that returns the existing gauge if present, or creates and returns a new one otherwise. Document its idempotent semantics.
- If adding a helper isn't feasible right away, provide a way on `MetricsRegistry` to query for the existence of a metric by name (e.g. `has_metric("uptime_seconds")`) and guard `create_gauge` calls accordingly.

By adopting a structured or idempotent registration API, we eliminate the need for fragile string matching and ensure that future changes to underlying error messages won't disable uptime reporting.
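A minimal sketch of the structured-error option, assuming the underlying registry is (or wraps) a `prometheus::Registry`, whose typed `Error::AlreadyReg` variant can stand in for a structured "already registered" signal; `register_uptime_gauge` is a hypothetical name:

```rust
use prometheus::{Gauge, Opts, Registry};

// Hypothetical helper: match prometheus's typed `AlreadyReg` variant
// instead of inspecting the error message text.
fn register_uptime_gauge(registry: &Registry) -> anyhow::Result<Option<Gauge>> {
    let gauge = Gauge::with_opts(Opts::new("uptime_seconds", "Process uptime in seconds"))?;
    match registry.register(Box::new(gauge.clone())) {
        Ok(()) => Ok(Some(gauge)),
        // Benign duplicate: signal the caller to keep the handle it got
        // from the first successful registration.
        Err(prometheus::Error::AlreadyReg) => Ok(None),
        Err(e) => Err(e.into()),
    }
}
```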
lib/runtime/src/system_status_server.rs (4)
93-108: Avoid taking two locks for paths; read both under a single guard.

Two consecutive `lock().unwrap()` calls are unnecessary and slightly increase contention. Read both `health_path` and `live_path` while holding one guard.

```diff
-    let health_path = server_state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .health_path
-        .clone();
-    let live_path = server_state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .live_path
-        .clone();
+    let (health_path, live_path) = {
+        let sh = server_state.drt().system_health.lock().expect("SystemHealth lock poisoned");
+        (sh.health_path.clone(), sh.live_path.clone())
+    };
```
200-205: Update-before-scrape is correct; minimize panic risk.

Good place to refresh the gauge. As a small hardening, avoid `unwrap()` to prevent a poisoned lock from taking down the endpoint.

```diff
-    state
-        .drt()
-        .system_health
-        .lock()
-        .unwrap()
-        .update_uptime_gauge();
+    if let Ok(sh) = state.drt().system_health.lock() {
+        sh.update_uptime_gauge();
+    } else {
+        tracing::warn!("SystemHealth lock poisoned; skipping uptime gauge update");
+    }
```
320-336: Nice sanity checks for uptime monotonicity.

The test validates that `SystemHealth::uptime()` exists and increases. Consider asserting that the JSON shape in `/health` includes `uptime_seconds` once you switch serialization, to prevent regressions.

I can add a small HTTP test asserting the numeric uptime field and a monotonic increase across two requests. Want me to draft it?
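For reference, a minimal sketch of such an assertion (hypothetical variable names; assumes the `/health` response body has already been fetched into `body`):

```rust
// Hypothetical: `body` holds the /health response text.
let v: serde_json::Value = serde_json::from_str(&body).expect("valid JSON from /health");
assert!(
    v["uptime_seconds"].as_f64().is_some(),
    "/health should expose a numeric uptime_seconds field"
);
```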
381-405: Gauge update test exercises the duration, not the metric; consider asserting the metric value.

This test proves uptime increases but doesn't validate that the gauge was actually updated. If feasible, scrape the registry (via `prometheus_metrics_fmt()`) twice and assert that `dynamo_component_uptime_seconds` increases.

I can provide a test snippet to parse the metrics output and compare the gauge value across two scrapes.
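A sketch of that comparison, assuming `prometheus_metrics_fmt()` yields Prometheus text exposition and the gauge is exported as `dynamo_component_uptime_seconds` (both names taken from the comments above, not verified):

```rust
// Hypothetical helper: extract a gauge value from an exposition line
// such as `dynamo_component_uptime_seconds 1.23`.
fn parse_gauge(scrape: &str, name: &str) -> Option<f64> {
    scrape
        .lines()
        .find(|line| line.starts_with(name))
        .and_then(|line| line.rsplit(' ').next())
        .and_then(|value| value.parse().ok())
}

// Two scrapes with a refresh in between should show an increase.
let first = parse_gauge(&drt.prometheus_metrics_fmt().unwrap(), "dynamo_component_uptime_seconds")
    .expect("gauge present in first scrape");
std::thread::sleep(std::time::Duration::from_millis(50));
drt.system_health.lock().unwrap().update_uptime_gauge();
let second = parse_gauge(&drt.prometheus_metrics_fmt().unwrap(), "dynamo_component_uptime_seconds")
    .expect("gauge present in second scrape");
assert!(second > first, "uptime gauge should increase between scrapes");
```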
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- lib/runtime/src/distributed.rs (2 hunks)
- lib/runtime/src/lib.rs (4 hunks)
- lib/runtime/src/system_status_server.rs (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
lib/runtime/src/lib.rs (2)
- lib/llm/src/block_manager/metrics.rs (1): gauge (71-75)
- lib/runtime/src/metrics.rs (8): std (295-295), std (296-296), std (310-310), std (310-310), std (325-325), std (325-325), std (340-340), std (340-340)

lib/runtime/src/system_status_server.rs (3)
- lib/runtime/src/distributed.rs (1): new (53-168)
- lib/runtime/src/lib.rs (3): new (99-118), new (211-216), uptime (179-181)
- lib/runtime/src/metrics.rs (10): new (1473-1475), create_test_drt_async (583-588), std (295-295), std (296-296), std (310-310), std (310-310), std (325-325), std (325-325), std (340-340), std (340-340)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
lib/runtime/src/lib.rs (2)
94-117: Centralizing uptime in SystemHealth looks solid.

The addition of `start_time` and an optional `prometheus::Gauge` is clean, keeps concerns together, and avoids coupling to the HTTP server. Constructor initialization with `Instant::now()` is appropriate.

183-188: Gauge updates in seconds are correct and cheap.

`as_secs_f64()` is an appropriate unit for Prometheus gauges, and the method is a no-op when the gauge is absent. Looks good.

lib/runtime/src/system_status_server.rs (8)

67-71: State simplification is good.

Removing local uptime tracking from `SystemStatusState` and delegating to `SystemHealth` reduces duplication and coupling.

75-77: Constructor is minimal and clear.

Returning `Self { root_drt: drt }` keeps `SystemStatusState` narrowly scoped. 👍

444-517: Endpoint tests look good and reflect path customization.

Coverage for default/custom paths and 404s is solid. Once `uptime_seconds` is emitted, you might add a quick JSON parse to ensure the field exists (no need to check the exact value).

576-688: Health transition test is robust.

The "200 of 200" loop is a good stress test of readiness transitions. No issues.

21-25: Minor: use of `Arc` and handle wrapping in `SystemStatusServerInfo` is tidy.

Storing the spawned handle behind `Arc` allows sharing without ownership confusion. LGTM.

Also applies to: 234-241

16-29: Trace-layer usage is consistent with existing patterns.

No concerns on the axum/trace setup.

1-15: Headers/license remain intact.

No action needed.

1-741: No stale references to removed SystemStatusState APIs detected.

The searches confirm that none of the removed `SystemStatusState` methods (`initialize_start_time`, `uptime`, `update_uptime_gauge`) are still referenced. All remaining calls to `uptime()` and `update_uptime_gauge()` are invoked on `system_health`, not on `SystemStatusState`.
rmccorm4 left a comment
one comment, LGTM otherwise - please fix the merge conflict too
Force-pushed f2ec27d to 0741ad2 (Compare)
Graham did a Rust version update, so I rebased onto main and force-pushed (instead of merging), just to look cleaner.
…ead safety

- Move uptime tracking from SystemStatusState to SystemHealth
- Replace Option<prometheus::Gauge> with OnceLock<prometheus::Gauge> for better thread safety (see sketch below)
- Add tests for uptime functionality with system enabled/disabled
- Fix clippy warning by removing unnecessary ref in pattern matching
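A minimal illustration of the Option → OnceLock swap from the commit above (hypothetical reduced shape; the real SystemHealth carries more state): OnceLock allows set-once initialization through a shared reference, so the gauge can be installed without exclusive access.

```rust
use std::sync::OnceLock;

struct SystemHealth {
    uptime_gauge: OnceLock<prometheus::Gauge>,
}

impl SystemHealth {
    // Installs the gauge exactly once; later calls are ignored, which
    // makes initialization idempotent and safe through a shared `&self`.
    fn install_uptime_gauge(&self, gauge: prometheus::Gauge) {
        let _ = self.uptime_gauge.set(gauge);
    }

    // No-op until a gauge has been installed.
    fn update_uptime_gauge(&self, uptime_seconds: f64) {
        if let Some(gauge) = self.uptime_gauge.get() {
            gauge.set(uptime_seconds);
        }
    }
}
```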
… level (#2587) Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com> Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
… level (#2587) Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
… level (#2587) Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com> Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>
… level (#2587) Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com> Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
… level (#2587) Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com> Signed-off-by: nnshah1 <neelays@nvidia.com>
Overview:
Refactors system metrics and uptime tracking architecture to improve separation of concerns (e.g. the uptime should NOT be dependent on enabling the system_status_server/HTTP).
Details:
Where should the reviewer start?
- lib/runtime/src/lib.rs - New SystemHealth uptime implementation
- lib/runtime/src/system_status_server.rs - Simplified SystemStatusState
- lib/runtime/src/distributed.rs - Metrics initialization

Related Issues:
Relates to DIS-482