Skip to content

fix: fixed failing health probes to enable state transition between s…#2920

Closed
GavinZhu-GMI wants to merge 6 commits intoai-dynamo:mainfrom
GavinZhu-GMI:fix_probe
Closed

fix: fixed failing health probes to enable state transition between s…#2920
GavinZhu-GMI wants to merge 6 commits intoai-dynamo:mainfrom
GavinZhu-GMI:fix_probe

Conversation

@GavinZhu-GMI
Copy link
Contributor

@GavinZhu-GMI GavinZhu-GMI commented Sep 7, 2025

Overview:

This fix solves deployment where health probes were failing because services never transitioned to ready state.

Details:

Files Modified:

  1. /code/dynamo/lib/runtime/src/config.rs
  • Added HealthTransitionPolicy enum with 4 options:
    • Manual - No auto-transition (original behavior)
    • TimeBasedReady { after_seconds } - Auto-ready after uptime
    • EndpointBasedReady - Ready when endpoints are registered
    • Custom { auto_ready_after_seconds, require_endpoints_ready } - Flexible logic
  • Added health_transition_policy field to RuntimeConfig struct
  • Added environment variable mapping for:
    • DYN_SYSTEM_HEALTH_TRANSITION_POLICY
    • DYN_SYSTEM_AUTO_READY_AFTER_SECONDS
  • Default policy: TimeBasedReady { after_seconds: 30 } (much better than Manual!)
  1. /code/dynamo/lib/runtime/src/lib.rs
  • Updated SystemHealth struct to include health_transition_policy field
  • Added check_and_update_health_status() method with transition logic for all policies
  • Added get_health_status_with_transition_check() method for the health handler to use
  • Updated constructor to accept health_transition_policy parameter
  1. /code/dynamo/lib/runtime/src/system_status_server.rs
  • Updated health_handler() to use mutable lock and call the transition check method
  • Fixed the core issue: Health status now automatically transitions from "notready" to "ready"
  1. /code/dynamo/lib/runtime/src/distributed.rs
  • Updated SystemHealth::new() call to include the health_transition_policy parameter

Where should the reviewer start?

  1. from health_handler in system_status_server.rs I have state transition check.
  2. check that how to config should be in dynamo/lib/runtime/src/config.rs and dynamo/lib/runtime/src/lib.rs

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features
    • Configurable health transition policy with manual, time-based, endpoint-based, and custom options.
    • Automatic readiness transitions based on uptime and/or endpoint status.
    • Health checks now evaluate and apply transitions during requests for more accurate status.
    • Uptime reporting: expose current uptime and a gauge for metrics backends.
    • New configuration via environment variables: DYN_SYSTEM_HEALTH_TRANSITION_POLICY, HEALTH_TRANSITION_POLICY, AUTO_READY_AFTER_SECONDS.

@GavinZhu-GMI GavinZhu-GMI requested a review from a team as a code owner September 7, 2025 05:18
@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link

github-actions bot commented Sep 7, 2025

👋 Hi GavinZhu-GMI! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added fix external-contribution Pull request is from an external contributor labels Sep 7, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 7, 2025

Walkthrough

Introduces HealthTransitionPolicy to runtime config; plumbs it through DistributedRuntime into SystemHealth; extends SystemHealth with policy-aware auto-transition logic, uptime tracking/gauge APIs, and a transition-checking health read path; updates the HTTP health handler to use the new transition-aware method.

Changes

Cohort / File(s) Summary
Config: HealthTransitionPolicy introduction
lib/runtime/src/config.rs
Adds public enum HealthTransitionPolicy (Manual, TimeBasedReady { after_seconds }, EndpointBasedReady, Custom { auto_ready_after_seconds, require_endpoints_ready }); adds RuntimeConfig.health_transition_policy with defaults; maps env/config keys (DYN_SYSTEM_HEALTH_TRANSITION_POLICY, HEALTH_TRANSITION_POLICY, AUTO_READY_AFTER_SECONDS).
SystemHealth core changes
lib/runtime/src/lib.rs
Extends SystemHealth::new to accept HealthTransitionPolicy; implements policy-driven auto transitions via check_and_update_health_status and get_health_status_with_transition_check; adds uptime gauge support (initialize_uptime_gauge, uptime, update_uptime_gauge); imports and stores policy.
Distributed runtime plumbing
lib/runtime/src/distributed.rs
Reads health_transition_policy from config and passes it to SystemHealth::new (new param between use_endpoint_health_status and health_endpoint_path).
HTTP health handler
lib/runtime/src/system_status_server.rs
Changes health request path to acquire mutable SystemHealth and call get_health_status_with_transition_check instead of get_health_status.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant S as SystemStatusServer
  participant H as SystemHealth
  Note over S,H: Health endpoint request

  C->>S: GET /health
  S->>H: get_health_status_with_transition_check()
  activate H
  rect rgba(230,245,255,0.5)
    note right of H: Evaluate HealthTransitionPolicy
    alt Manual
      H-->>H: No auto transition
    else TimeBasedReady
      H-->>H: if uptime >= after_seconds -> set Ready
    else EndpointBasedReady
      H-->>H: if all endpoints Ready -> set Ready
    else Custom
      H-->>H: Optional time check and/or endpoints check
    end
  end
  H-->>S: (healthy, details)
  deactivate H
  S-->>C: 200/503 with status payload
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

I twitch my ears at policies new,
From seed to Ready, I hop on queue.
Timed or endpoints—follow the trail—
Custom burrows never fail.
Uptime ticks; my whiskers sway—
Health says “Ready!” I leap away. 🐇


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/runtime/src/config.rs (1)

239-244: Post-process env to support AUTO_READY_AFTER_SECONDS safely.

Derive TimeBasedReady from DYN_SYSTEM_AUTO_READY_AFTER_SECONDS after extracting the config to avoid serde issues and keep simple env UX.

Apply this diff:

-    pub fn from_settings() -> Result<RuntimeConfig> {
-        let config: RuntimeConfig = Self::figment().extract()?;
-        config.validate()?;
-        Ok(config)
-    }
+    pub fn from_settings() -> Result<RuntimeConfig> {
+        let mut config: RuntimeConfig = Self::figment().extract()?;
+        // Optional override via simple env for convenience:
+        if let Ok(s) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") {
+            if let Ok(secs) = s.parse::<u64>() {
+                config.health_transition_policy =
+                    HealthTransitionPolicy::TimeBasedReady { after_seconds: secs };
+            } else if !s.is_empty() {
+                tracing::warn!("Invalid DYN_SYSTEM_AUTO_READY_AFTER_SECONDS={s}, ignoring");
+            }
+        }
+        config.validate()?;
+        Ok(config)
+    }

Note: tracing is already used elsewhere; if unavailable here, substitute with eprintln! or remove the warn.

🧹 Nitpick comments (4)
lib/runtime/src/config.rs (2)

141-147: Field addition is fine; consider documenting env usage nuances.

Since associated-data variants require nested env keys, add a short doc example (see fix below) to prevent misconfiguration.


167-183: Include policy in Display for easier diagnostics.

Printing the active policy helps when debugging deployments.

Apply this diff:

         write!(
             f,
             "starting_health_status={:?}",
             self.starting_health_status
         )?;
+        write!(
+            f,
+            ", health_transition_policy={:?}",
+            self.health_transition_policy
+        )?;
         write!(f, ", system_health_path={}", self.system_health_path)?;
lib/runtime/src/system_status_server.rs (1)

174-176: Shorten lock scope in handler.

Minimize time holding the std::sync::Mutex in an async context to reduce contention.

Apply this diff:

-    let mut system_health = state.drt().system_health.lock().unwrap();
-    let (healthy, endpoints) = system_health.get_health_status_with_transition_check();
-    let uptime = Some(system_health.uptime());
+    let (healthy, endpoints, uptime) = {
+        let mut sh = state.drt().system_health.lock().unwrap();
+        let (healthy, endpoints) = sh.get_health_status_with_transition_check();
+        let uptime = Some(sh.uptime());
+        (healthy, endpoints, uptime)
+    };
lib/runtime/src/lib.rs (1)

135-196: Transition logic: confirm intent with endpoint gating.

With use_endpoint_health_status non-empty, TimeBasedReady won’t flip overall readiness until endpoints are ready (by design). Verify this matches operational expectations; otherwise consider an option to let time-based policy override endpoint gating.

If an override is desired, add a flag in Custom (e.g., ignore_endpoint_gating: bool) and honor it here.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1477f6e and 64bfff8.

📒 Files selected for processing (4)
  • lib/runtime/src/config.rs (5 hunks)
  • lib/runtime/src/distributed.rs (1 hunks)
  • lib/runtime/src/lib.rs (5 hunks)
  • lib/runtime/src/system_status_server.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.

Applied to files:

  • lib/runtime/src/system_status_server.rs
🧬 Code graph analysis (1)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/lib.rs (2)
  • new (105-126)
  • new (284-289)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2920/merge) by GavinZhu-GMI.
lib/runtime/src/config.rs

[error] 1-1: Trailing whitespace detected by pre-commit; file was auto-fixed.

lib/runtime/src/lib.rs

[error] 1-1: Trailing whitespace detected by pre-commit; file was auto-fixed.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (8)
lib/runtime/src/config.rs (3)

62-85: Enum and default look solid.

The policy variants and a sane default (TimeBasedReady { after_seconds: 30 }) are appropriate.


251-261: Default/single_threaded wiring is correct.


290-291: Default includes the new field correctly.

lib/runtime/src/distributed.rs (1)

78-87: Plumbing policy into SystemHealth::new is correct.

lib/runtime/src/lib.rs (4)

68-68: Import of HealthTransitionPolicy is appropriate.


93-103: Struct extension looks good.

New field and uptime gauge are placed appropriately.


198-207: Transition-aware getter is fine.

Clear separation between read-only and transition-check paths.


104-126: Constructor signature change verified: only one call site remains in lib/runtime/src/distributed.rs, and it already passes all updated parameters to SystemHealth::new. No stale calls found.

Comment on lines 218 to 219
"HEALTH_TRANSITION_POLICY" => "health_transition_policy",
"AUTO_READY_AFTER_SECONDS" => "health_transition_policy",
"HEALTH_PATH" => "system_health_path",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

AUTO_READY_AFTER_SECONDS mapping will break deserialization.

Mapping DYN_SYSTEM_AUTO_READY_AFTER_SECONDS directly to health_transition_policy makes Figment try to deserialize a u64 into HealthTransitionPolicy, causing extraction failures. Remove this mapping and derive the policy post-extraction.

Apply this diff to drop the problematic mapping:

                 let mapped_key = match k.as_str() {
                     "HOST" => "system_host",
                     "PORT" => "system_port",
                     "ENABLED" => "system_enabled",
                     "USE_ENDPOINT_HEALTH_STATUS" => "use_endpoint_health_status",
                     "STARTING_HEALTH_STATUS" => "starting_health_status",
                     "HEALTH_TRANSITION_POLICY" => "health_transition_policy",
-                    "AUTO_READY_AFTER_SECONDS" => "health_transition_policy",
                     "HEALTH_PATH" => "system_health_path",
                     "LIVE_PATH" => "system_live_path",
                     _ => k.as_str(),
                 };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"HEALTH_TRANSITION_POLICY" => "health_transition_policy",
"AUTO_READY_AFTER_SECONDS" => "health_transition_policy",
"HEALTH_PATH" => "system_health_path",
let mapped_key = match k.as_str() {
"HOST" => "system_host",
"PORT" => "system_port",
"ENABLED" => "system_enabled",
"USE_ENDPOINT_HEALTH_STATUS" => "use_endpoint_health_status",
"STARTING_HEALTH_STATUS" => "starting_health_status",
"HEALTH_TRANSITION_POLICY" => "health_transition_policy",
"HEALTH_PATH" => "system_health_path",
"LIVE_PATH" => "system_live_path",
_ => k.as_str(),
};
🤖 Prompt for AI Agents
In lib/runtime/src/config.rs around lines 218 to 220, the env-var mapping that
maps "AUTO_READY_AFTER_SECONDS" to "health_transition_policy" causes Figment to
attempt deserializing a numeric u64 into the HealthTransitionPolicy enum; remove
that mapping line so DYN_SYSTEM_AUTO_READY_AFTER_SECONDS is no longer mapped
into health_transition_policy, and instead read the numeric env var into a
separate optional field (or into its own config key) and derive/convert it into
HealthTransitionPolicy after Figment extraction completes.

…tates

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contribution Pull request is from an external contributor fix size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant