
Conversation


@kylehh (Contributor) commented Aug 28, 2025

Overview:

Create a distributed serving example for a single-node-sized model.

Details:

The first example has the following features (see the manifest sketch after this list):

  • aggregated serving
  • KV routing
  • vLLM backend
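
For orientation, here is a heavily trimmed sketch of the shape of the example's `agg_router.yaml`, as summarized in the review below. The service names, `--router-mode kv` flag, ports, replica count, model, `gpu` resource key, and worker command come from that summary; the `apiVersion`, the frontend command line, and the `extraPodSpec`/`mainContainer` field names are assumptions to be checked against the Dynamo operator's CRD:

```yaml
apiVersion: nvidia.com/v1alpha1      # assumption: use the version your Dynamo operator installs
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-router              # illustrative name
spec:
  services:
    Frontend:                        # OpenAI-compatible HTTP entrypoint on port 8000
      replicas: 1
      extraPodSpec:                  # assumption: pod-spec field names per the Dynamo CRD
        mainContainer:
          args:
            # assumption: frontend command modeled on the worker command quoted below
            - python3 -m dynamo.frontend --router-mode kv
    VllmDecodeWorker:                # aggregated (prefill + decode) vLLM workers on port 9090
      replicas: 4
      resources:
        limits:
          gpu: "1"                   # CRD-level key; see the review note on nvidia.com/gpu mapping
      extraPodSpec:
        mainContainer:
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct 2>&1 | tee /tmp/vllm.log
```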

Summary by CodeRabbit

  • New Features

    • Added a distributed inference example with a frontend router and multiple vLLM workers, supporting Qwen/Qwen2.5-1.5B-Instruct, KV-cache routing, health probes, GPU/resource configs, local cache mounting, and scaling to 4 replicas.
  • Documentation

    • Introduced a step-by-step guide to install prerequisites, configure the deployment, set a Hugging Face token, apply the router, and test via port-forward and curl. Includes notes on observability and a link to GenAI-Perf for benchmarking.


copy-pr-bot bot commented Aug 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Aug 28, 2025

Walkthrough

Adds a new distributed inference example under examples/deployments/Distributed_Inference with a README and a DynamoGraphDeployment manifest. The README documents setup and testing steps. The YAML defines a frontend with KV router mode and multiple vLLM decode workers running a Qwen model, including probes, resources, caching, and environment configuration.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: Distributed Inference Guide<br>`examples/deployments/Distributed_Inference/README.md` | New guide detailing installation, namespace setup, HF token secret creation, applying the agg router config, port-forwarding, and curl-based testing against `/v1/chat/completions`; references benchmarking with GenAI-Perf. |
| K8s Deployment: vLLM Aggregated Router<br>`examples/deployments/Distributed_Inference/agg_router.yaml` | Adds DynamoGraphDeployment with two services: Frontend (vllm-runtime, port 8000, `--router-mode kv`, health/readiness probes) and VllmDecodeWorker (Qwen2.5-1.5B-Instruct, port 9090, probes, startupProbe, env flags, 4 replicas, CPU/GPU resources, hostPath cache volume). |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant FE as Frontend (KV Router)
    participant W1 as vLLM Worker 1
    participant W2 as vLLM Worker 2
    participant Wn as vLLM Worker N

    Note over FE: Liveness: GET /health<br/>Readiness: probes ensure availability

    User->>FE: POST /v1/chat/completions (prompt)
    FE->>FE: Select one worker via KV-cache routing policy
    FE->>W2: Generate(full request)
    W2-->>FE: Tokens / streamed partial results
    FE-->>User: Completion (stream or final)
    Note over W1,Wn: KV routing sends each request to a single worker;<br/>replicas spread load across requests, not shards of one request

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Poem

A rabbit routes the tokens’ flow,
Through KV lanes where workers go.
Frontend hums, the shards align,
Qwen whispers answers, crisp and fine.
Pods awake, health lights green—
Hop, deploy, infer, serene. 🐇✨



@coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/deployments/Distributed_Inference/README.md (1)

57-58: Trailing whitespace broke pre-commit; run hooks and trim.

The CI failure indicates trailing whitespace in this file.

Run:

pre-commit run --all-files

and ensure your editor trims trailing spaces on save.

🧹 Nitpick comments (6)
examples/deployments/Distributed_Inference/agg_router.yaml (2)

87-101: hostPath cache breaks multi-node scheduling; prefer PVC or node affinity.

With replicas potentially scheduled across nodes, /raid/models must exist on every node; otherwise pods fail to start and caching is inconsistent.

Options:

  • Use a ReadWriteMany PVC (NFS/FSx/NetApp) and mount at /root/.cache.
  • If sticking to hostPath, add node affinity to constrain workers to nodes with that path and a DaemonSet pre-provisioner. I can draft a PVC-based patch if you share your storage class.
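
A minimal sketch of the PVC alternative, assuming the cluster offers an RWX-capable storage class (claim name, class name, and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache               # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany               # lets all worker replicas share one cache across nodes
  storageClassName: nfs-client    # placeholder; substitute your RWX class (NFS/FSx/NetApp)
  resources:
    requests:
      storage: 100Gi              # illustrative size for model artifacts
```

Each VllmDecodeWorker would then mount this claim at `/root/.cache` in place of the hostPath volume.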

107-107: Minor: cleanup command spacing and ensure pipefail.

Double space before redirection; also consider pipefail so exit codes propagate.

-            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct  2>&1 | tee /tmp/vllm.log
+            - set -o pipefail; python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct 2>&1 | tee /tmp/vllm.log
examples/deployments/Distributed_Inference/README.md (4)

1-3: Tighten title and section grammar.

-# Distributed Inferences with Dynamo
-## 1. Single-Node-Sized Models hosting on multiple Nodes
-For SNS (Single-Node-Sized) Model, we can use Dynamo aggregated serving to deploy multiple replicas of the model and create a frontend with different routing strategies
+# Distributed Inference with Dynamo
+## 1. Single-Node-Sized models hosted on multiple nodes
+For a Single-Node-Sized (SNS) model, use Dynamo aggregated serving to deploy multiple replicas and a frontend with different routing strategies.

11-14: Grammar and naming fixes.

-Create a K8S namespace for your Dynamo application and install the Dynamo platform. It will install following pods:
-- ETCD
-- NATs
-- Dynamo Operator Controller
+Create a K8s namespace for your Dynamo application and install the Dynamo platform. It installs the following pods:
+- etcd
+- NATS
+- Dynamo Operator Controller

21-28: Typo and list intro punctuation; mention model path consistency.

-This `agg_router.yaml` is adpated from vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has following customizations
+This `agg_router.yaml` is adapted from the vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has the following customizations:
@@
-- Mounted a local cache folder `/YOUR/LOCAL/CACHE/FOLDER` for model artifacts reuse
+- Mounted a local cache folder for reusing model artifacts (update the hostPath in the YAML; default is `/raid/models`)

Also call out that the YAML uses hostPath; provide a PVC alternative if available.


43-55: Minor: request polish and typo fixes in the sample prompt.

  • Add -sS for cleaner output; fix typos “ests”→“suggests”, “familt”→“family”.
-curl localhost:8000/v1/chat/completions \
+curl -sS localhost:8000/v1/chat/completions \
@@
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
+        "content": "In the heart of Eldoria lies the long-forgotten city of Aeloria. An ancient map suggests that Aeloria holds a secret so profound it could reshape reality. Your Task: Character Background — describe your explorer’s motivations, skills, weaknesses, and any personal connection to Aeloria’s legends or family history."
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4e995 and 9d58acb.

📒 Files selected for processing (2)
  • examples/deployments/Distributed_Inference/README.md (1 hunks)
  • examples/deployments/Distributed_Inference/agg_router.yaml (1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/deployments/Distributed_Inference/README.md

[grammar] ~2-~2: There might be a mistake here.
Context: ...e-Sized Models hosting on multiple Nodes For SNS (Single-Node-Sized) Model, we ca...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...nd install the Dynamo platform. It will install following pods: - ETCD - NATs - Dynamo ...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...latform. It will install following pods: - ETCD - NATs - Dynamo Operator Controller...

(QB_NEW_EN)


[grammar] ~21-~21: There might be a mistake here.
Context: ...} ``` 3. Model hosting with vLLM backend This agg_router.yaml is adpated from v...

(QB_NEW_EN)


[grammar] ~22-~22: There might be a mistake here.
Context: ...ckends/vllm/deploy/agg_router.yaml). It has following customizations - Deployed `Qw...

(QB_NEW_EN)


[grammar] ~36-~36: There might be a mistake here.
Context: ...esting the deployment and run benchmarks After deployment, forward the frontend s...

(QB_NEW_EN)


[grammar] ~41-~41: There might be a mistake here.
Context: ...ntend- 8000:8000 ``` and use following request to test the deployed ...

(QB_NEW_EN)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2773/merge) by kylehh.
examples/deployments/Distributed_Inference/README.md

[error] 1-1: Pre-commit hook 'trailing-whitespace' failed (exit code 1) during 'pre-commit run --show-diff-on-failure --color=always --all-files'. Trailing whitespace detected and the file was updated: examples/deployments/Distributed_Inference/README.md.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
examples/deployments/Distributed_Inference/agg_router.yaml (3)

66-76: Confirm GPU resource key mapping.

This CRD uses gpu: "1". Verify the operator maps this to the correct extended resource (e.g., nvidia.com/gpu). If not, scheduling will fail.

Run a quick check against your cluster CRD docs/operator config and confirm whether gpu or nvidia.com/gpu is required. I can adapt the manifest accordingly.
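
For reference, if the operator does not perform that mapping itself, the standard Kubernetes form on the pod spec would use the device-plugin extended resource (shown as an assumption about the fallback, not documented operator behavior):

```yaml
resources:
  limits:
    nvidia.com/gpu: "1"   # extended resource exposed by the NVIDIA device plugin
```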


49-50: HF token env key: verify the name consumed by vLLM.

You create hf-token-secret with HF_TOKEN. Many stacks expect HUGGINGFACE_HUB_TOKEN or HUGGING_FACE_HUB_TOKEN.

If vLLM reads a different key, either rename the secret key or add an explicit env mapping:

       envs:
+        - name: HUGGINGFACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token-secret
+              key: HF_TOKEN

Also applies to: 76-84


58-64: Probe thresholds are too forgiving; failures may take 10+ minutes to surface.

Readiness and startup probes with failureThreshold=60 and periodSeconds=10 delay failure detection by up to 60 × 10 s = 600 s (about 10 minutes). Consider tighter bounds; also make liveness less aggressive than failureThreshold=1.

       readinessProbe:
         httpGet:
           path: /health
           port: 9090
-        periodSeconds: 10
-        timeoutSeconds: 30
-        failureThreshold: 60
+        periodSeconds: 10
+        timeoutSeconds: 5
+        failureThreshold: 6
@@
         startupProbe:
           httpGet:
             path: /health
             port: 9090
-          periodSeconds: 10
-          failureThreshold: 60
+          periodSeconds: 5
+          failureThreshold: 24   # ~2 minutes
@@
       livenessProbe:
         httpGet:
           path: /live
           port: 9090
-        periodSeconds: 5
-        timeoutSeconds: 30
-        failureThreshold: 1
+        periodSeconds: 10
+        timeoutSeconds: 5
+        failureThreshold: 3

Also applies to: 92-98

⛔ Skipped due to learnings
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
examples/deployments/Distributed_Inference/README.md (1)

31-35: Secret key name may not match vLLM expectations.

Tie this to the YAML note.

If vLLM expects HUGGINGFACE_HUB_TOKEN, rename the key:

-    --from-literal=HF_TOKEN=${HF_TOKEN} \
+    --from-literal=HUGGINGFACE_HUB_TOKEN=${HF_TOKEN} \

Or keep HF_TOKEN and map via env as suggested in the YAML comment.
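
For reference, the secret the README creates with kubectl can equivalently be written as a manifest; the namespace below is a placeholder and the key name follows the current README:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: dynamo           # placeholder; use the namespace created earlier
type: Opaque
stringData:
  HF_TOKEN: <your-hf-token>   # or HUGGINGFACE_HUB_TOKEN if vLLM expects that key
```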

@nealvaidya changed the title from "SNS agg distributed serving example" to "docs: SNS agg k8s example" on Sep 2, 2025
@github-actions bot added the docs label on Sep 2, 2025
kylehh and others added 4 commits September 2, 2025 18:46
Signed-off-by: Neal Vaidya <[email protected]>
Signed-off-by: Neal Vaidya <[email protected]>
Signed-off-by: Neal Vaidya <[email protected]>
@nealvaidya enabled auto-merge (squash) September 2, 2025 19:01
@nealvaidya merged commit 42669ba into main Sep 2, 2025
10 of 11 checks passed
@nealvaidya deleted the khuang-dist branch September 2, 2025 19:30
dillon-cullinan pushed a commit that referenced this pull request Sep 5, 2025
Signed-off-by: Neal Vaidya <[email protected]>
Co-authored-by: Neal Vaidya <[email protected]>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
Signed-off-by: Neal Vaidya <[email protected]>
Co-authored-by: Neal Vaidya <[email protected]>
Signed-off-by: nnshah1 <[email protected]>