@heisenberglit heisenberglit commented Jul 18, 2025

Overview:

This PR enhances the /health endpoint by including a list of currently registered model instances retrieved from etcd.

Details:

  • Updated the /health route handler to fetch and return model instances using the etcd_client if available.
  • Extended the State struct to store an optional etcd::Client.
  • Integrated the etcd_client into HttpServiceConfigBuilder and passed it through the shared state.
  • Ensures backward compatibility by only including instances if etcd_client is configured.
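
A minimal sketch of the optional-client state described above, assuming a stand-in `EtcdClient` type. The struct fields and the `new_with_etcd` constructor name are hypothetical and may not match the PR's actual code:

```rust
// Hypothetical stand-in for the real etcd client type; not the Dynamo API.
#[derive(Clone, Debug, PartialEq)]
pub struct EtcdClient {
    pub endpoint: String,
}

// Shared HTTP state that may or may not carry an etcd client.
#[derive(Default)]
pub struct State {
    etcd_client: Option<EtcdClient>,
}

impl State {
    // Existing-style constructor: no etcd client, so behavior is unchanged.
    pub fn new() -> Self {
        Self { etcd_client: None }
    }

    // Constructor for deployments where etcd may be available
    // (the name is an assumption made for this sketch).
    pub fn new_with_etcd(etcd_client: Option<EtcdClient>) -> Self {
        Self { etcd_client }
    }

    // Borrow the client if one was configured.
    pub fn etcd_client(&self) -> Option<&EtcdClient> {
        self.etcd_client.as_ref()
    }
}
```

Callers that never configure etcd keep using `State::new()` and see `None` from the accessor, which is what lets the health handler skip instance lookup.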

Where should the reviewer start?

  • src/http/service/health.rs – to review the changes to the /health endpoint logic.
  • src/http/service/service_v2.rs – to review the integration of etcd_client into the State.
  • runtime/src/instances.rs – to review the logic that fetches all instances.

Related Issues:

  • Closes GitHub issue: #1312

Summary by CodeRabbit

  • New Features

    • Health check responses now include information about service instances, providing greater visibility into running services.
    • Added functionality to retrieve and display all registered service instances from the distributed runtime.
  • Improvements

    • Enhanced service configuration to support optional integration with distributed instance management, improving flexibility and observability.

@copy-pr-bot

copy-pr-bot bot commented Jul 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi heisenberglit! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the feat and external-contribution labels Jul 18, 2025
@coderabbitai
Contributor

coderabbitai bot commented Jul 18, 2025

Walkthrough

The changes introduce centralized handling of the etcd client within the HTTP service, allowing it to be injected into service state and configuration. A new module for listing service instances from etcd is added, and the health check endpoint now reports discovered instances. Public APIs are updated to support these enhancements.

Changes

File(s) | Summary
lib/llm/src/entrypoint/input/http.rs | Refactored to initialize and pass etcd client once at startup; avoids redundant DistributedRuntime init.
lib/llm/src/http/service/health.rs | Health handler now includes etcd-based instance info in its response.
lib/llm/src/http/service/service_v2.rs | State and HttpServiceConfig structs updated for optional etcd client; new constructors and accessors.
lib/runtime/src/instances.rs | New module with async function to list all instances from etcd.
lib/runtime/src/lib.rs | Publicly exposes new instances module.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTP_Service
    participant State
    participant etcdClient
    participant etcd

    Client->>HTTP_Service: GET /health
    HTTP_Service->>State: etcd_client()
    alt etcd_client exists
        State->>etcdClient: list_all_instances()
        etcdClient->>etcd: get instances with prefix
        etcd-->>etcdClient: instance data
        etcdClient-->>HTTP_Service: Vec<Instance>
        HTTP_Service-->>Client: JSON (status, endpoints, instances)
    else no etcd_client
        HTTP_Service-->>Client: JSON (status, endpoints, empty instances)
    end
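
The two branches of the diagram can be sketched roughly as follows. This is a synchronous, std-only illustration: the `Instance` fields and the `instances_for_health` helper are hypothetical, and the real handler is async and calls into etcd rather than a closure.

```rust
// Hypothetical instance record; the real type lives in the runtime crate.
#[derive(Clone, Debug, PartialEq)]
pub struct Instance {
    pub endpoint: String,
    pub instance_id: u64,
}

// Returns the instances to embed in the /health response.
// `lister` stands in for "an etcd client is configured"; `None` models
// a deployment without etcd.
pub fn instances_for_health(
    lister: Option<&dyn Fn() -> Result<Vec<Instance>, String>>,
) -> Vec<Instance> {
    match lister {
        // etcd configured: try to list, but never fail the health check.
        Some(list) => match list() {
            Ok(found) => found,
            Err(err) => {
                eprintln!("warning: failed to list instances: {err}");
                Vec::new()
            }
        },
        // No etcd client: report an empty instance list.
        None => Vec::new(),
    }
}
```

In both the missing-client and failing-client cases the response still carries an `instances` field, just empty, so existing consumers of `/health` are not broken.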

Poem

A hop, a skip, with paws so light,
Etcd brings all instances in sight.
Health checks now with extra flair,
Service state with clients to spare.
New modules bloom, the code expands—
Distributed dreams in bunny hands! 🐇✨


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6f392c and 8fec10b.

📒 Files selected for processing (5)
  • lib/llm/src/entrypoint/input/http.rs (1 hunks)
  • lib/llm/src/http/service/health.rs (3 hunks)
  • lib/llm/src/http/service/service_v2.rs (5 hunks)
  • lib/runtime/src/instances.rs (1 hunks)
  • lib/runtime/src/lib.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
lib/runtime/src/lib.rs (1)
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
lib/llm/src/http/service/health.rs (1)
Same learning as listed above (grahamking, PR: ai-dynamo/dynamo#1962).
lib/runtime/src/instances.rs (1)
Same learning as listed above (grahamking, PR: ai-dynamo/dynamo#1962).
lib/llm/src/entrypoint/input/http.rs (2)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Learnt from: grahamking
PR: ai-dynamo/dynamo#1962
Same learning as listed above for lib/runtime/src/lib.rs.
🧬 Code Graph Analysis (3)
lib/runtime/src/lib.rs (1)
lib/runtime/src/component/client.rs (1)
  • instances (121-123)
lib/llm/src/http/service/health.rs (2)
lib/runtime/src/instances.rs (1)
  • list_all_instances (13-34)
lib/llm/src/http/service/service_v2.rs (1)
  • etcd_client (57-59)
lib/runtime/src/instances.rs (1)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • Client (250-273)
  • EtcdClient (66-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (15)
lib/runtime/src/lib.rs (1)

53-53: LGTM! Module declaration follows established patterns.

The public module declaration is correctly placed and necessary for exposing the instance management functionality to other parts of the codebase.

lib/llm/src/http/service/health.rs (3)

33-33: LGTM! Import is necessary and correctly formatted.

The import for list_all_instances follows Rust conventions and is required for the health endpoint enhancement.


83-93: LGTM! Excellent error handling and backward compatibility.

The instance fetching logic properly handles:

  • Optional etcd client availability check
  • Async operation with proper error handling
  • Graceful fallback to empty vector on failure
  • Appropriate warning logging for debugging

This maintains backward compatibility while adding the new functionality.


102-102: LGTM! Consistent JSON response structure.

The instances field is correctly added to both healthy and unhealthy response paths, maintaining consistency in the API response format.

Also applies to: 115-115

lib/runtime/src/instances.rs (1)

13-34: LGTM! Robust implementation with excellent error handling.

The function demonstrates good practices:

  • Proper async Result return type for error propagation
  • Resilient error handling that logs parse failures without stopping the operation
  • Detailed logging including problematic keys and values for debugging
  • Appropriate use of etcd client's prefix-based key retrieval

The approach of continuing operation despite individual parse failures is well-suited for distributed systems where partial data corruption shouldn't break the entire health check.
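
To illustrate the skip-and-log behavior described here, a rough sketch follows. It assumes an in-memory map in place of etcd and a made-up `endpoint:id` value format; the real function is async, returns a Result, and parses whatever format the runtime actually stores.

```rust
use std::collections::BTreeMap;

// Hypothetical instance record for this sketch.
#[derive(Debug, PartialEq)]
pub struct Instance {
    pub endpoint: String,
    pub instance_id: u64,
}

// Parse one stored value. The "endpoint:id" format is an assumption
// made for illustration only.
fn parse_instance(value: &str) -> Result<Instance, String> {
    let (endpoint, id) = value
        .split_once(':')
        .ok_or_else(|| format!("missing ':' in {value:?}"))?;
    let instance_id = id.parse::<u64>().map_err(|e| e.to_string())?;
    Ok(Instance { endpoint: endpoint.to_string(), instance_id })
}

// List every instance stored under `prefix`, skipping entries that
// fail to parse instead of aborting the whole listing.
pub fn list_all_instances(store: &BTreeMap<String, String>, prefix: &str) -> Vec<Instance> {
    let mut instances = Vec::new();
    for (key, value) in store.iter().filter(|(key, _)| key.starts_with(prefix)) {
        match parse_instance(value) {
            Ok(instance) => instances.push(instance),
            // Log the offending key and value, then keep going.
            Err(err) => eprintln!("warning: skipping {key}: {err} (value: {value})"),
        }
    }
    instances
}
```

One malformed entry therefore costs a warning line rather than the whole health response.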

lib/llm/src/entrypoint/input/http.rs (3)

25-26: LGTM! Efficient runtime initialization.

Extracting the DistributedRuntime and etcd_client once at the beginning eliminates redundant initialization and improves efficiency.


34-34: LGTM! Clean integration with HTTP service builder.

The etcd client is properly passed to the HTTP service builder, enabling the health endpoint to access instance information.


39-40: LGTM! Consistent refactoring of etcd client usage.

The match arm now uses the pre-extracted etcd_client, maintaining the same functionality while being more efficient and consistent with the initialization pattern.

lib/llm/src/http/service/service_v2.rs (7)

24-24: LGTM! Proper optional field design.

The etcd_client field uses Option<etcd::Client> which correctly models the optional dependency on etcd functionality.


32-34: LGTM! Backward compatibility maintained.

The existing constructor properly initializes the new field to None, ensuring backward compatibility for existing code.


36-42: LGTM! Clean constructor for etcd-enabled scenarios.

The new constructor allows proper initialization with an etcd client while maintaining the same structure and patterns as the existing constructor.


57-59: LGTM! Appropriate accessor method.

The accessor method returns Option<&etcd::Client>, which is the correct type for optional borrowing and matches the usage pattern in the health handler.


104-105: LGTM! Consistent config field design.

The etcd_client field follows the same pattern as other optional configuration fields with appropriate derive attributes and default value.


177-177: LGTM! Clean integration with State constructor.

The State creation now properly uses the etcd_client from the configuration, enabling the optional etcd functionality.


248-251: LGTM! Proper builder pattern implementation.

The with_etcd_client method follows the established builder pattern, correctly setting the field and returning self for method chaining.
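
For readers unfamiliar with the pattern, a chaining builder of this kind might look like the sketch below. The `port` field, its default of 8080, and the hand-written builder are assumptions for illustration; the PR's builder may be generated by a derive macro and have different fields.

```rust
// Hypothetical stand-in for the real etcd client type.
#[derive(Clone, Debug, PartialEq)]
pub struct EtcdClient {
    pub endpoint: String,
}

#[derive(Debug, PartialEq)]
pub struct HttpServiceConfig {
    pub port: u16,
    pub etcd_client: Option<EtcdClient>,
}

#[derive(Default)]
pub struct HttpServiceConfigBuilder {
    port: Option<u16>,
    etcd_client: Option<EtcdClient>,
}

impl HttpServiceConfigBuilder {
    // Each setter takes and returns `self` so calls can be chained.
    pub fn port(mut self, port: u16) -> Self {
        self.port = Some(port);
        self
    }

    // Optional dependency: passing `None` leaves etcd disabled.
    pub fn with_etcd_client(mut self, etcd_client: Option<EtcdClient>) -> Self {
        self.etcd_client = etcd_client;
        self
    }

    pub fn build(self) -> HttpServiceConfig {
        HttpServiceConfig {
            // Default port chosen arbitrarily for this sketch.
            port: self.port.unwrap_or(8080),
            etcd_client: self.etcd_client,
        }
    }
}
```

Because the client defaults to `None`, callers that never mention etcd get the same configuration they had before the change.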


@heisenberglit heisenberglit requested a review from a team as a code owner July 21, 2025 06:20
@grahamking grahamking merged commit b48d4c3 into ai-dynamo:main Aug 5, 2025
10 checks passed
@grahamking
Contributor

Thanks @heisenberglit !

Labels

external-contribution, fault tolerance, feat, size/M
