fix: fixed failing health probes to enable state transition between s…#2920
fix: fixed failing health probes to enable state transition between s…#2920GavinZhu-GMI wants to merge 6 commits intoai-dynamo:mainfrom
Conversation
|
👋 Hi GavinZhu-GMI! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
WalkthroughIntroduces HealthTransitionPolicy to runtime config; plumbs it through DistributedRuntime into SystemHealth; extends SystemHealth with policy-aware auto-transition logic, uptime tracking/gauge APIs, and a transition-checking health read path; updates the HTTP health handler to use the new transition-aware method. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant C as Client
participant S as SystemStatusServer
participant H as SystemHealth
Note over S,H: Health endpoint request
C->>S: GET /health
S->>H: get_health_status_with_transition_check()
activate H
rect rgba(230,245,255,0.5)
note right of H: Evaluate HealthTransitionPolicy
alt Manual
H-->>H: No auto transition
else TimeBasedReady
H-->>H: if uptime >= after_seconds -> set Ready
else EndpointBasedReady
H-->>H: if all endpoints Ready -> set Ready
else Custom
H-->>H: Optional time check and/or endpoints check
end
end
H-->>S: (healthy, details)
deactivate H
S-->>C: 200/503 with status payload
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/runtime/src/config.rs (1)
239-244: Post-process env to support AUTO_READY_AFTER_SECONDS safely.Derive TimeBasedReady from DYN_SYSTEM_AUTO_READY_AFTER_SECONDS after extracting the config to avoid serde issues and keep simple env UX.
Apply this diff:
- pub fn from_settings() -> Result<RuntimeConfig> { - let config: RuntimeConfig = Self::figment().extract()?; - config.validate()?; - Ok(config) - } + pub fn from_settings() -> Result<RuntimeConfig> { + let mut config: RuntimeConfig = Self::figment().extract()?; + // Optional override via simple env for convenience: + if let Ok(s) = std::env::var("DYN_SYSTEM_AUTO_READY_AFTER_SECONDS") { + if let Ok(secs) = s.parse::<u64>() { + config.health_transition_policy = + HealthTransitionPolicy::TimeBasedReady { after_seconds: secs }; + } else if !s.is_empty() { + tracing::warn!("Invalid DYN_SYSTEM_AUTO_READY_AFTER_SECONDS={s}, ignoring"); + } + } + config.validate()?; + Ok(config) + }Note: tracing is already used elsewhere; if unavailable here, substitute with eprintln! or remove the warn.
🧹 Nitpick comments (4)
lib/runtime/src/config.rs (2)
141-147: Field addition is fine; consider documenting env usage nuances.Since associated-data variants require nested env keys, add a short doc example (see fix below) to prevent misconfiguration.
167-183: Include policy in Display for easier diagnostics.Printing the active policy helps when debugging deployments.
Apply this diff:
write!( f, "starting_health_status={:?}", self.starting_health_status )?; + write!( + f, + ", health_transition_policy={:?}", + self.health_transition_policy + )?; write!(f, ", system_health_path={}", self.system_health_path)?;lib/runtime/src/system_status_server.rs (1)
174-176: Shorten lock scope in handler.Minimize time holding the std::sync::Mutex in an async context to reduce contention.
Apply this diff:
- let mut system_health = state.drt().system_health.lock().unwrap(); - let (healthy, endpoints) = system_health.get_health_status_with_transition_check(); - let uptime = Some(system_health.uptime()); + let (healthy, endpoints, uptime) = { + let mut sh = state.drt().system_health.lock().unwrap(); + let (healthy, endpoints) = sh.get_health_status_with_transition_check(); + let uptime = Some(sh.uptime()); + (healthy, endpoints, uptime) + };lib/runtime/src/lib.rs (1)
135-196: Transition logic: confirm intent with endpoint gating.With use_endpoint_health_status non-empty, TimeBasedReady won’t flip overall readiness until endpoints are ready (by design). Verify this matches operational expectations; otherwise consider an option to let time-based policy override endpoint gating.
If an override is desired, add a flag in Custom (e.g., ignore_endpoint_gating: bool) and honor it here.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
lib/runtime/src/config.rs(5 hunks)lib/runtime/src/distributed.rs(1 hunks)lib/runtime/src/lib.rs(5 hunks)lib/runtime/src/system_status_server.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Applied to files:
lib/runtime/src/system_status_server.rs
🧬 Code graph analysis (1)
lib/runtime/src/distributed.rs (1)
lib/runtime/src/lib.rs (2)
new(105-126)new(284-289)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2920/merge) by GavinZhu-GMI.
lib/runtime/src/config.rs
[error] 1-1: Trailing whitespace detected by pre-commit; file was auto-fixed.
lib/runtime/src/lib.rs
[error] 1-1: Trailing whitespace detected by pre-commit; file was auto-fixed.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (8)
lib/runtime/src/config.rs (3)
62-85: Enum and default look solid.The policy variants and a sane default (TimeBasedReady { after_seconds: 30 }) are appropriate.
251-261: Default/single_threaded wiring is correct.
290-291: Default includes the new field correctly.lib/runtime/src/distributed.rs (1)
78-87: Plumbing policy into SystemHealth::new is correct.lib/runtime/src/lib.rs (4)
68-68: Import of HealthTransitionPolicy is appropriate.
93-103: Struct extension looks good.New field and uptime gauge are placed appropriately.
198-207: Transition-aware getter is fine.Clear separation between read-only and transition-check paths.
104-126: Constructor signature change verified: only one call site remains inlib/runtime/src/distributed.rs, and it already passes all updated parameters toSystemHealth::new. No stale calls found.
| "HEALTH_TRANSITION_POLICY" => "health_transition_policy", | ||
| "AUTO_READY_AFTER_SECONDS" => "health_transition_policy", | ||
| "HEALTH_PATH" => "system_health_path", |
There was a problem hiding this comment.
AUTO_READY_AFTER_SECONDS mapping will break deserialization.
Mapping DYN_SYSTEM_AUTO_READY_AFTER_SECONDS directly to health_transition_policy makes Figment try to deserialize a u64 into HealthTransitionPolicy, causing extraction failures. Remove this mapping and derive the policy post-extraction.
Apply this diff to drop the problematic mapping:
let mapped_key = match k.as_str() {
"HOST" => "system_host",
"PORT" => "system_port",
"ENABLED" => "system_enabled",
"USE_ENDPOINT_HEALTH_STATUS" => "use_endpoint_health_status",
"STARTING_HEALTH_STATUS" => "starting_health_status",
"HEALTH_TRANSITION_POLICY" => "health_transition_policy",
- "AUTO_READY_AFTER_SECONDS" => "health_transition_policy",
"HEALTH_PATH" => "system_health_path",
"LIVE_PATH" => "system_live_path",
_ => k.as_str(),
};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "HEALTH_TRANSITION_POLICY" => "health_transition_policy", | |
| "AUTO_READY_AFTER_SECONDS" => "health_transition_policy", | |
| "HEALTH_PATH" => "system_health_path", | |
| let mapped_key = match k.as_str() { | |
| "HOST" => "system_host", | |
| "PORT" => "system_port", | |
| "ENABLED" => "system_enabled", | |
| "USE_ENDPOINT_HEALTH_STATUS" => "use_endpoint_health_status", | |
| "STARTING_HEALTH_STATUS" => "starting_health_status", | |
| "HEALTH_TRANSITION_POLICY" => "health_transition_policy", | |
| "HEALTH_PATH" => "system_health_path", | |
| "LIVE_PATH" => "system_live_path", | |
| _ => k.as_str(), | |
| }; |
🤖 Prompt for AI Agents
In lib/runtime/src/config.rs around lines 218 to 220, the env-var mapping that
maps "AUTO_READY_AFTER_SECONDS" to "health_transition_policy" causes Figment to
attempt deserializing a numeric u64 into the HealthTransitionPolicy enum; remove
that mapping line so DYN_SYSTEM_AUTO_READY_AFTER_SECONDS is no longer mapped
into health_transition_policy, and instead read the numeric env var into a
separate optional field (or into its own config key) and derive/convert it into
HealthTransitionPolicy after Figment extraction completes.
d502795 to
53e0774
Compare
1038668 to
a59ddec
Compare
…tates Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
a59ddec to
b3f80d6
Compare
Overview:
This fix solves deployment where health probes were failing because services never transitioned to ready state.
Details:
Files Modified:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit