
Conversation


@kylehh (Contributor) commented Aug 28, 2025

Overview:

Create a distributed serving example for a single-node-sized model.

Details:

The first example has the following features (see the manifest sketch after this list):

  • aggregated serving
  • KV routing
  • vLLM backend
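
For orientation, here is a heavily trimmed sketch of the shape of the example's `agg_router.yaml`, as summarized in the review below. The service names, `--router-mode kv` flag, ports, replica count, model, `gpu` resource key, and worker command come from that summary; the `apiVersion`, the frontend command line, and the `extraPodSpec`/`mainContainer` field names are assumptions to be checked against the Dynamo operator's CRD:

```yaml
apiVersion: nvidia.com/v1alpha1      # assumption: use the version your Dynamo operator installs
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-router              # illustrative name
spec:
  services:
    Frontend:                        # OpenAI-compatible HTTP entrypoint on port 8000
      replicas: 1
      extraPodSpec:                  # assumption: pod-spec field names per the Dynamo CRD
        mainContainer:
          args:
            # assumption: frontend command modeled on the worker command quoted below
            - python3 -m dynamo.frontend --router-mode kv
    VllmDecodeWorker:                # aggregated (prefill + decode) vLLM workers on port 9090
      replicas: 4
      resources:
        limits:
          gpu: "1"                   # CRD-level key; see the review note on nvidia.com/gpu mapping
      extraPodSpec:
        mainContainer:
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct 2>&1 | tee /tmp/vllm.log
```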

Summary by CodeRabbit

  • New Features

    • Added a distributed inference example with a frontend router and multiple vLLM workers, supporting Qwen/Qwen2.5-1.5B-Instruct, KV-cache routing, health probes, GPU/resource configs, local cache mounting, and scaling to 4 replicas.
  • Documentation

    • Introduced a step-by-step guide to install prerequisites, configure the deployment, set a Hugging Face token, apply the router, and test via port-forward and curl. Includes notes on observability and a link to GenAI-Perf for benchmarking.


copy-pr-bot bot commented Aug 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Aug 28, 2025

Walkthrough

Adds a new distributed inference example under examples/deployments/Distributed_Inference with a README and a DynamoGraphDeployment manifest. The README documents setup and testing steps. The YAML defines a frontend with KV router mode and multiple vLLM decode workers running a Qwen model, including probes, resources, caching, and environment configuration.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: Distributed Inference Guide<br>`examples/deployments/Distributed_Inference/README.md` | New guide detailing installation, namespace setup, HF token secret creation, applying the agg router config, port-forwarding, and curl-based testing against `/v1/chat/completions`; references benchmarking with GenAI-Perf. |
| K8s Deployment: vLLM Aggregated Router<br>`examples/deployments/Distributed_Inference/agg_router.yaml` | Adds DynamoGraphDeployment with two services: Frontend (vllm-runtime, port 8000, `--router-mode kv`, health/readiness probes) and VllmDecodeWorker (Qwen2.5-1.5B-Instruct, port 9090, probes, startupProbe, env flags, 4 replicas, CPU/GPU resources, hostPath cache volume). |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant FE as Frontend (KV Router)
    participant W1 as vLLM Worker 1
    participant W2 as vLLM Worker 2
    participant Wn as vLLM Worker N

    Note over FE: Liveness: GET /health<br/>Readiness: probes ensure availability

    User->>FE: POST /v1/chat/completions (prompt)
    FE->>FE: Select one worker via KV-cache routing policy
    FE->>W2: Generate(full request)
    W2-->>FE: Tokens / streamed partial results
    FE-->>User: Completion (stream or final)
    Note over W1,Wn: KV routing sends each request to a single worker;<br/>replicas spread load across requests, not shards of one request

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Poem

A rabbit routes the tokens’ flow,
Through KV lanes where workers go.
Frontend hums, the shards align,
Qwen whispers answers, crisp and fine.
Pods awake, health lights green—
Hop, deploy, infer, serene. 🐇✨



@coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/deployments/Distributed_Inference/README.md (1)

57-58: Trailing whitespace broke pre-commit; run hooks and trim.

The CI failure indicates trailing whitespace in this file.

Run:

pre-commit run --all-files

and ensure your editor trims trailing spaces on save.

🧹 Nitpick comments (6)
examples/deployments/Distributed_Inference/agg_router.yaml (2)

87-101: hostPath cache breaks multi-node scheduling; prefer PVC or node affinity.

With replicas potentially scheduled across nodes, /raid/models must exist on every node; otherwise pods fail to start and caching is inconsistent.

Options:

  • Use a ReadWriteMany PVC (NFS/FSx/NetApp) and mount at /root/.cache.
  • If sticking to hostPath, add node affinity to constrain workers to nodes with that path and a DaemonSet pre-provisioner. I can draft a PVC-based patch if you share your storage class.
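
A minimal sketch of the PVC alternative, assuming the cluster offers an RWX-capable storage class (claim name, class name, and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache               # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany               # lets all worker replicas share one cache across nodes
  storageClassName: nfs-client    # placeholder; substitute your RWX class (NFS/FSx/NetApp)
  resources:
    requests:
      storage: 100Gi              # illustrative size for model artifacts
```

Each VllmDecodeWorker would then mount this claim at `/root/.cache` in place of the hostPath volume.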

107-107: Minor: cleanup command spacing and ensure pipefail.

Double space before redirection; also consider pipefail so exit codes propagate.

-            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct  2>&1 | tee /tmp/vllm.log
+            - set -o pipefail; python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct 2>&1 | tee /tmp/vllm.log
examples/deployments/Distributed_Inference/README.md (4)

1-3: Tighten title and section grammar.

-# Distributed Inferences with Dynamo
-## 1. Single-Node-Sized Models hosting on multiple Nodes
-For SNS (Single-Node-Sized) Model, we can use Dynamo aggregated serving to deploy multiple replicas of the model and create a frontend with different routing strategies
+# Distributed Inference with Dynamo
+## 1. Single-Node-Sized models hosted on multiple nodes
+For a Single-Node-Sized (SNS) model, use Dynamo aggregated serving to deploy multiple replicas and a frontend with different routing strategies.

11-14: Grammar and naming fixes.

-Create a K8S namespace for your Dynamo application and install the Dynamo platform. It will install following pods:
-- ETCD
-- NATs
-- Dynamo Operator Controller
+Create a K8s namespace for your Dynamo application and install the Dynamo platform. It installs the following pods:
+- etcd
+- NATS
+- Dynamo Operator Controller

21-28: Typo and list intro punctuation; mention model path consistency.

-This `agg_router.yaml` is adpated from vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has following customizations
+This `agg_router.yaml` is adapted from the vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has the following customizations:
@@
-- Mounted a local cache folder `/YOUR/LOCAL/CACHE/FOLDER` for model artifacts reuse
+- Mounted a local cache folder for reusing model artifacts (update the hostPath in the YAML; default is `/raid/models`)

Also call out that the YAML uses hostPath; provide a PVC alternative if available.


43-55: Minor: request polish and typo fixes in the sample prompt.

  • Add -sS for cleaner output; fix typos “ests”→“suggests”, “familt”→“family”.
-curl localhost:8000/v1/chat/completions \
+curl -sS localhost:8000/v1/chat/completions \
@@
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
+        "content": "In the heart of Eldoria lies the long-forgotten city of Aeloria. An ancient map suggests that Aeloria holds a secret so profound it could reshape reality. Your Task: Character Background — describe your explorer’s motivations, skills, weaknesses, and any personal connection to Aeloria’s legends or family history."
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4e995 and 9d58acb.

📒 Files selected for processing (2)
  • examples/deployments/Distributed_Inference/README.md (1 hunks)
  • examples/deployments/Distributed_Inference/agg_router.yaml (1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/deployments/Distributed_Inference/README.md

[grammar] ~2-~2: There might be a mistake here.
Context: ...e-Sized Models hosting on multiple Nodes For SNS (Single-Node-Sized) Model, we ca...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...nd install the Dynamo platform. It will install following pods: - ETCD - NATs - Dynamo ...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...latform. It will install following pods: - ETCD - NATs - Dynamo Operator Controller...

(QB_NEW_EN)


[grammar] ~21-~21: There might be a mistake here.
Context: ...} ``` 3. Model hosting with vLLM backend This agg_router.yaml is adpated from v...

(QB_NEW_EN)


[grammar] ~22-~22: There might be a mistake here.
Context: ...ckends/vllm/deploy/agg_router.yaml). It has following customizations - Deployed `Qw...

(QB_NEW_EN)


[grammar] ~36-~36: There might be a mistake here.
Context: ...esting the deployment and run benchmarks After deployment, forward the frontend s...

(QB_NEW_EN)


[grammar] ~41-~41: There might be a mistake here.
Context: ...ntend- 8000:8000 ``` and use following request to test the deployed ...

(QB_NEW_EN)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2773/merge) by kylehh.
examples/deployments/Distributed_Inference/README.md

[error] 1-1: Pre-commit hook 'trailing-whitespace' failed (exit code 1) during 'pre-commit run --show-diff-on-failure --color=always --all-files'. Trailing whitespace detected and the file was updated: examples/deployments/Distributed_Inference/README.md.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
examples/deployments/Distributed_Inference/agg_router.yaml (3)

66-76: Confirm GPU resource key mapping.

This CRD uses gpu: "1". Verify the operator maps this to the correct extended resource (e.g., nvidia.com/gpu). If not, scheduling will fail.

Run a quick check against your cluster CRD docs/operator config and confirm whether gpu or nvidia.com/gpu is required. I can adapt the manifest accordingly.
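
For reference, if the operator does not perform that mapping itself, the standard Kubernetes form on the pod spec would use the device-plugin extended resource (shown as an assumption about the fallback, not documented operator behavior):

```yaml
resources:
  limits:
    nvidia.com/gpu: "1"   # extended resource exposed by the NVIDIA device plugin
```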


49-50: HF token env key: verify the name consumed by vLLM.

You create hf-token-secret with HF_TOKEN. Many stacks expect HUGGINGFACE_HUB_TOKEN or HUGGING_FACE_HUB_TOKEN.

If vLLM reads a different key, either rename the secret key or add an explicit env mapping:

       envs:
+        - name: HUGGINGFACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token-secret
+              key: HF_TOKEN

Also applies to: 76-84


58-64: Probe thresholds are too forgiving; failures may take 10+ minutes to surface.

Readiness and startup probes with failureThreshold=60 and periodSeconds=10 delay failure detection by up to 60 × 10 s = 600 s (about 10 minutes). Consider tighter bounds; also make liveness less aggressive than failureThreshold=1.

       readinessProbe:
         httpGet:
           path: /health
           port: 9090
-        periodSeconds: 10
-        timeoutSeconds: 30
-        failureThreshold: 60
+        periodSeconds: 10
+        timeoutSeconds: 5
+        failureThreshold: 6
@@
         startupProbe:
           httpGet:
             path: /health
             port: 9090
-          periodSeconds: 10
-          failureThreshold: 60
+          periodSeconds: 5
+          failureThreshold: 24   # ~2 minutes
@@
       livenessProbe:
         httpGet:
           path: /live
           port: 9090
-        periodSeconds: 5
-        timeoutSeconds: 30
-        failureThreshold: 1
+        periodSeconds: 10
+        timeoutSeconds: 5
+        failureThreshold: 3

Also applies to: 92-98

⛔ Skipped due to learnings
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
examples/deployments/Distributed_Inference/README.md (1)

31-35: Secret key name may not match vLLM expectations.

Tie this to the YAML note.

If vLLM expects HUGGINGFACE_HUB_TOKEN, rename the key:

-    --from-literal=HF_TOKEN=${HF_TOKEN} \
+    --from-literal=HUGGINGFACE_HUB_TOKEN=${HF_TOKEN} \

Or keep HF_TOKEN and map via env as suggested in the YAML comment.
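
For reference, the secret the README creates with kubectl can equivalently be written as a manifest; the namespace below is a placeholder and the key name follows the current README:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: dynamo           # placeholder; use the namespace created earlier
type: Opaque
stringData:
  HF_TOKEN: <your-hf-token>   # or HUGGINGFACE_HUB_TOKEN if vLLM expects that key
```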

@nealvaidya changed the title from "SNS agg distributed serving example" to "docs: SNS agg k8s example" on Sep 2, 2025
@github-actions bot added the docs label on Sep 2, 2025
kylehh and others added 4 commits September 2, 2025 18:46
Signed-off-by: Neal Vaidya <[email protected]>
Signed-off-by: Neal Vaidya <[email protected]>
Signed-off-by: Neal Vaidya <[email protected]>
@nealvaidya enabled auto-merge (squash) September 2, 2025 19:01
@nealvaidya merged commit 42669ba into main Sep 2, 2025
10 of 11 checks passed
@nealvaidya deleted the khuang-dist branch September 2, 2025 19:30
dillon-cullinan pushed a commit that referenced this pull request Sep 5, 2025
Signed-off-by: Neal Vaidya <[email protected]>
Co-authored-by: Neal Vaidya <[email protected]>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
Signed-off-by: Neal Vaidya <[email protected]>
Co-authored-by: Neal Vaidya <[email protected]>
Signed-off-by: nnshah1 <[email protected]>