fix: fixed failing health probes to enable state transition between s… by GavinZhu-GMI · Pull Request #2920 · ai-dynamo/dynamo

GavinZhu-GMI · 2025-09-07T05:18:04Z

Overview:

This fix solves deployment where health probes were failing because services never transitioned to ready state.

Details:

Files Modified:

/code/dynamo/lib/runtime/src/config.rs

Added HealthTransitionPolicy enum with 4 options:
- Manual - No auto-transition (original behavior)
- TimeBasedReady { after_seconds } - Auto-ready after uptime
- EndpointBasedReady - Ready when endpoints are registered
- Custom { auto_ready_after_seconds, require_endpoints_ready } - Flexible logic
Added health_transition_policy field to RuntimeConfig struct
Added environment variable mapping for:
- DYN_SYSTEM_HEALTH_TRANSITION_POLICY
- DYN_SYSTEM_AUTO_READY_AFTER_SECONDS
Default policy: TimeBasedReady { after_seconds: 30 } (much better than Manual!)

/code/dynamo/lib/runtime/src/lib.rs

Updated SystemHealth struct to include health_transition_policy field
Added check_and_update_health_status() method with transition logic for all policies
Added get_health_status_with_transition_check() method for the health handler to use
Updated constructor to accept health_transition_policy parameter

/code/dynamo/lib/runtime/src/system_status_server.rs

Updated health_handler() to use mutable lock and call the transition check method
Fixed the core issue: Health status now automatically transitions from "notready" to "ready"

/code/dynamo/lib/runtime/src/distributed.rs

Updated SystemHealth::new() call to include the health_transition_policy parameter

Where should the reviewer start?

from health_handler in system_status_server.rs I have state transition check.
check that how to config should be in dynamo/lib/runtime/src/config.rs and dynamo/lib/runtime/src/lib.rs

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: [BUG]: Health Status Transition Issue: Services Stuck in "notready" State #2916

Summary by CodeRabbit

New Features
- Configurable health transition policy with manual, time-based, endpoint-based, and custom options.
- Automatic readiness transitions based on uptime and/or endpoint status.
- Health checks now evaluate and apply transitions during requests for more accurate status.
- Uptime reporting: expose current uptime and a gauge for metrics backends.
- New configuration via environment variables: DYN_SYSTEM_HEALTH_TRANSITION_POLICY, HEALTH_TRANSITION_POLICY, AUTO_READY_AFTER_SECONDS.

copy-pr-bot · 2025-09-07T05:18:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2025-09-07T05:18:14Z

👋 Hi GavinZhu-GMI! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

coderabbitai · 2025-09-07T05:25:22Z

Walkthrough

Introduces HealthTransitionPolicy to runtime config; plumbs it through DistributedRuntime into SystemHealth; extends SystemHealth with policy-aware auto-transition logic, uptime tracking/gauge APIs, and a transition-checking health read path; updates the HTTP health handler to use the new transition-aware method.

Changes

Cohort / File(s)	Summary
Config: HealthTransitionPolicy introduction `lib/runtime/src/config.rs`	Adds public enum HealthTransitionPolicy (Manual, TimeBasedReady { after_seconds }, EndpointBasedReady, Custom { auto_ready_after_seconds, require_endpoints_ready }); adds RuntimeConfig.health_transition_policy with defaults; maps env/config keys (DYN_SYSTEM_HEALTH_TRANSITION_POLICY, HEALTH_TRANSITION_POLICY, AUTO_READY_AFTER_SECONDS).
SystemHealth core changes `lib/runtime/src/lib.rs`	Extends SystemHealth::new to accept HealthTransitionPolicy; implements policy-driven auto transitions via check_and_update_health_status and get_health_status_with_transition_check; adds uptime gauge support (initialize_uptime_gauge, uptime, update_uptime_gauge); imports and stores policy.
Distributed runtime plumbing `lib/runtime/src/distributed.rs`	Reads health_transition_policy from config and passes it to SystemHealth::new (new param between use_endpoint_health_status and health_endpoint_path).
HTTP health handler `lib/runtime/src/system_status_server.rs`	Changes health request path to acquire mutable SystemHealth and call get_health_status_with_transition_check instead of get_health_status.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant S as SystemStatusServer
  participant H as SystemHealth
  Note over S,H: Health endpoint request

  C->>S: GET /health
  S->>H: get_health_status_with_transition_check()
  activate H
  rect rgba(230,245,255,0.5)
    note right of H: Evaluate HealthTransitionPolicy
    alt Manual
      H-->>H: No auto transition
    else TimeBasedReady
      H-->>H: if uptime >= after_seconds -> set Ready
    else EndpointBasedReady
      H-->>H: if all endpoints Ready -> set Ready
    else Custom
      H-->>H: Optional time check and/or endpoints check
    end
  end
  H-->>S: (healthy, details)
  deactivate H
  S-->>C: 200/503 with status payload

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat: health check changes based on endpoint served #1996 — Extends the same SystemHealth/config pathways with HealthTransitionPolicy and related plumbing.
feat: Parameterize health and live HTTP endpoint paths #2230 — Adjusts SystemHealth::new and RuntimeConfig integration similarly to pass new config-driven parameters.
refactor: move uptime tracking from system_status_server(HTTP) to DRT level #2587 — Touches SystemHealth uptime/gauge methods and related initialization paths akin to this change.

Poem

I twitch my ears at policies new,
From seed to Ready, I hop on queue.
Timed or endpoints—follow the trail—
Custom burrows never fail.
Uptime ticks; my whiskers sway—
Health says “Ready!” I leap away. 🐇

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/runtime/src/config.rs (1)

239-244: Post-process env to support AUTO_READY_AFTER_SECONDS safely.

Derive TimeBasedReady from DYN_SYSTEM_AUTO_READY_AFTER_SECONDS after extracting the config to avoid serde issues and keep simple env UX.

Apply this diff:

-    pub fn from_settings() -> Result<RuntimeConfig> {
-        let config: RuntimeConfig = Self::figment().extract()?;
-        config.validate()?;
-        Ok(config)
-    }
+    pub fn from_settings() -> Result<RuntimeConfig> {
+        let mut config: RuntimeConfig = Self::figment().extract()?;
+        // Optional override via simple env for convenience:
+        if let Ok(s) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") {
+            if let Ok(secs) = s.parse::<u64>() {
+                config.health_transition_policy =
+                    HealthTransitionPolicy::TimeBasedReady { after_seconds: secs };
+            } else if !s.is_empty() {
+                tracing::warn!("Invalid DYN_SYSTEM_AUTO_READY_AFTER_SECONDS={s}, ignoring");
+            }
+        }
+        config.validate()?;
+        Ok(config)
+    }

Note: tracing is already used elsewhere; if unavailable here, substitute with eprintln! or remove the warn.

🧹 Nitpick comments (4)

lib/runtime/src/config.rs (2)
141-147: Field addition is fine; consider documenting env usage nuances.

Since associated-data variants require nested env keys, add a short doc example (see fix below) to prevent misconfiguration.

167-183: Include policy in Display for easier diagnostics.

Printing the active policy helps when debugging deployments.

Apply this diff:
         write!(
             f,
             "starting_health_status={:?}",
             self.starting_health_status
         )?;
+        write!(
+            f,
+            ", health_transition_policy={:?}",
+            self.health_transition_policy
+        )?;
         write!(f, ", system_health_path={}", self.system_health_path)?;
lib/runtime/src/system_status_server.rs (1)
174-176: Shorten lock scope in handler.

Minimize time holding the std::sync::Mutex in an async context to reduce contention.

Apply this diff:
-    let mut system_health = state.drt().system_health.lock().unwrap();
-    let (healthy, endpoints) = system_health.get_health_status_with_transition_check();
-    let uptime = Some(system_health.uptime());
+    let (healthy, endpoints, uptime) = {
+        let mut sh = state.drt().system_health.lock().unwrap();
+        let (healthy, endpoints) = sh.get_health_status_with_transition_check();
+        let uptime = Some(sh.uptime());
+        (healthy, endpoints, uptime)
+    };
lib/runtime/src/lib.rs (1)

135-196: Transition logic: confirm intent with endpoint gating.

With use_endpoint_health_status non-empty, TimeBasedReady won’t flip overall readiness until endpoints are ready (by design). Verify this matches operational expectations; otherwise consider an option to let time-based policy override endpoint gating.

If an override is desired, add a flag in Custom (e.g., ignore_endpoint_gating: bool) and honor it here.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1477f6e and 64bfff8.

📒 Files selected for processing (4)

lib/runtime/src/config.rs (5 hunks)
lib/runtime/src/distributed.rs (1 hunks)
lib/runtime/src/lib.rs (5 hunks)
lib/runtime/src/system_status_server.rs (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-07-16T12:41:12.543Z

Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Applied to files:

lib/runtime/src/system_status_server.rs

🧬 Code graph analysis (1)

lib/runtime/src/distributed.rs (1)

lib/runtime/src/lib.rs (2)

new (105-126)

new (284-289)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2920/merge) by GavinZhu-GMI.

lib/runtime/src/config.rs

[error] 1-1: Trailing whitespace detected by pre-commit; file was auto-fixed.

lib/runtime/src/lib.rs

[error] 1-1: Trailing whitespace detected by pre-commit; file was auto-fixed.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Build and Test - dynamo
GitHub Check: pre-merge-rust (.)
GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (lib/bindings/python)

🔇 Additional comments (8)

lib/runtime/src/config.rs (3)

62-85: Enum and default look solid.

The policy variants and a sane default (TimeBasedReady { after_seconds: 30 }) are appropriate.

251-261: Default/single_threaded wiring is correct.

290-291: Default includes the new field correctly.

lib/runtime/src/distributed.rs (1)

78-87: Plumbing policy into SystemHealth::new is correct.

lib/runtime/src/lib.rs (4)

68-68: Import of HealthTransitionPolicy is appropriate.

93-103: Struct extension looks good.

New field and uptime gauge are placed appropriately.

198-207: Transition-aware getter is fine.

Clear separation between read-only and transition-check paths.

104-126: Constructor signature change verified: only one call site remains in lib/runtime/src/distributed.rs, and it already passes all updated parameters to SystemHealth::new. No stale calls found.

coderabbitai · 2025-09-07T05:25:25Z

lib/runtime/src/config.rs

+                            "HEALTH_TRANSITION_POLICY" => "health_transition_policy",
+                            "AUTO_READY_AFTER_SECONDS" => "health_transition_policy",
                            "HEALTH_PATH" => "system_health_path",


⚠️ Potential issue

AUTO_READY_AFTER_SECONDS mapping will break deserialization.

Mapping DYN_SYSTEM_AUTO_READY_AFTER_SECONDS directly to health_transition_policy makes Figment try to deserialize a u64 into HealthTransitionPolicy, causing extraction failures. Remove this mapping and derive the policy post-extraction.

Apply this diff to drop the problematic mapping:

let mapped_key = match k.as_str() { "HOST" => "system_host", "PORT" => "system_port", "ENABLED" => "system_enabled", "USE_ENDPOINT_HEALTH_STATUS" => "use_endpoint_health_status", "STARTING_HEALTH_STATUS" => "starting_health_status", "HEALTH_TRANSITION_POLICY" => "health_transition_policy", - "AUTO_READY_AFTER_SECONDS" => "health_transition_policy", "HEALTH_PATH" => "system_health_path", "LIVE_PATH" => "system_live_path", _ => k.as_str(), };

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"HEALTH_TRANSITION_POLICY" => "health_transition_policy",

"AUTO_READY_AFTER_SECONDS" => "health_transition_policy",

"HEALTH_PATH" => "system_health_path",

let mapped_key = match k.as_str() {

"HOST" => "system_host",

"PORT" => "system_port",

"ENABLED" => "system_enabled",

"USE_ENDPOINT_HEALTH_STATUS" => "use_endpoint_health_status",

"STARTING_HEALTH_STATUS" => "starting_health_status",

"HEALTH_TRANSITION_POLICY" => "health_transition_policy",

"HEALTH_PATH" => "system_health_path",

"LIVE_PATH" => "system_live_path",

_ => k.as_str(),

};

🤖 Prompt for AI Agents

In lib/runtime/src/config.rs around lines 218 to 220, the env-var mapping that maps "AUTO_READY_AFTER_SECONDS" to "health_transition_policy" causes Figment to attempt deserializing a numeric u64 into the HealthTransitionPolicy enum; remove that mapping line so DYN_SYSTEM_AUTO_READY_AFTER_SECONDS is no longer mapped into health_transition_policy, and instead read the numeric env var into a separate optional field (or into its own config key) and derive/convert it into HealthTransitionPolicy after Figment extraction completes.

…tates Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>

GavinZhu-GMI requested a review from a team as a code owner September 7, 2025 05:18

pull-request-size bot added the size/L label Sep 7, 2025

github-actions bot added fix external-contribution Pull request is from an external contributor labels Sep 7, 2025

coderabbitai bot reviewed Sep 7, 2025

View reviewed changes

GavinZhu-GMI force-pushed the fix_probe branch from d502795 to 53e0774 Compare September 8, 2025 02:02

GavinZhu-GMI requested review from a team, alec-flowers, biswapanda, hhzhang16, indrajit96, ishandhanani, krishung5, nnshah1, ptarasiewiczNV, richardhuo-nv, rmccorm4, tanmayv25 and whoisj as code owners September 8, 2025 02:02

pull-request-size bot added size/XL size/L and removed size/L size/XL labels Sep 8, 2025

GavinZhu-GMI force-pushed the fix_probe branch from 1038668 to a59ddec Compare September 8, 2025 03:08

pull-request-size bot added size/XL and removed size/L labels Sep 8, 2025

fix: fixed failing health probes to enable state transition between s…

0bb2c8d

…tates Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>

GavinZhu-GMI added 5 commits September 8, 2025 11:12

fix: fixed format related to prefommit

90e7373

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>

fix: removed previouse mapping to avoid deserialize issue

9e19c49

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>

fix: fixed space .. again

2dff50f

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>

chore: replace pynvml with nvidia-ml-py as suggested

8f1ac9a

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>

fix: fix dep seq as pre commit

b3f80d6

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>

GavinZhu-GMI force-pushed the fix_probe branch from a59ddec to b3f80d6 Compare September 8, 2025 03:12

pull-request-size bot added size/L and removed size/XL labels Sep 8, 2025

GavinZhu-GMI closed this Sep 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: fixed failing health probes to enable state transition between s…#2920

fix: fixed failing health probes to enable state transition between s…#2920
GavinZhu-GMI wants to merge 6 commits intoai-dynamo:mainfrom
GavinZhu-GMI:fix_probe

GavinZhu-GMI commented Sep 7, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

copy-pr-bot bot commented Sep 7, 2025

Uh oh!

github-actions bot commented Sep 7, 2025

Uh oh!

coderabbitai bot commented Sep 7, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Sep 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-                            "HEALTH_TRANSITION_POLICY" => "health_transition_policy",
-                            "AUTO_READY_AFTER_SECONDS" => "health_transition_policy",
-                            "HEALTH_PATH" => "system_health_path",
+let mapped_key = match k.as_str() {
+    "HOST" => "system_host",
+    "PORT" => "system_port",
+    "ENABLED" => "system_enabled",
+    "USE_ENDPOINT_HEALTH_STATUS" => "use_endpoint_health_status",
+    "STARTING_HEALTH_STATUS" => "starting_health_status",
+    "HEALTH_TRANSITION_POLICY" => "health_transition_policy",
+    "HEALTH_PATH" => "system_health_path",
+    "LIVE_PATH" => "system_live_path",
+    _ => k.as_str(),
+};

Conversation

GavinZhu-GMI commented Sep 7, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Sep 7, 2025

Uh oh!

github-actions bot commented Sep 7, 2025

Uh oh!

coderabbitai bot commented Sep 7, 2025

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 7, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

GavinZhu-GMI commented Sep 7, 2025 •

edited by coderabbitai bot

Loading