docs: SNS agg k8s example #2773
Conversation
Walkthrough

Adds a new distributed inference example under examples/deployments/Distributed_Inference with a README and a DynamoGraphDeployment manifest. The README documents setup and testing steps. The YAML defines a frontend with KV router mode and multiple vLLM decode workers running a Qwen model, including probes, resources, caching, and environment configuration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant FE as Frontend (KV Router)
    participant W1 as vLLM Worker 1
    participant W2 as vLLM Worker 2
    participant Wn as vLLM Worker N
    Note over FE: Liveness: GET /health<br/>Readiness: probes ensure availability
    User->>FE: POST /v1/chat/completions (prompt)
    FE->>FE: Route via KV policy
    par Dispatch to available workers
        FE->>W1: Generate(request shard)
        FE->>W2: Generate(request shard)
        FE->>Wn: Generate(request shard)
    end
    W1-->>FE: Tokens/partial result
    W2-->>FE: Tokens/partial result
    Wn-->>FE: Tokens/partial result
    FE->>FE: Aggregate/stream response
    FE-->>User: Completion (stream or final)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/deployments/Distributed_Inference/README.md (1)
57-58: Trailing whitespace broke pre-commit; run hooks and trim.

The CI failure indicates trailing whitespace in this file.
Run:

```shell
pre-commit run --all-files
```

and ensure your editor trims trailing spaces on save.
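The check is easy to reproduce locally; a small sketch of detecting and trimming trailing whitespace (the file path is illustrative):

```shell
# Create a sample file with trailing spaces and a trailing tab.
printf 'line one   \nline two\t\nclean line\n' > /tmp/ws_demo.md

# Detect: grep exits 0 and lists offending lines when trailing blanks exist.
grep -nE '[[:blank:]]+$' /tmp/ws_demo.md

# Strip the trailing blanks in place, as the pre-commit hook does (GNU sed).
sed -i 's/[[:blank:]]*$//' /tmp/ws_demo.md

# Re-check: no matches remain, so the negated grep succeeds.
! grep -qE '[[:blank:]]+$' /tmp/ws_demo.md && echo clean
```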
🧹 Nitpick comments (6)
examples/deployments/Distributed_Inference/agg_router.yaml (2)
87-101: hostPath cache breaks multi-node scheduling; prefer PVC or node affinity.

With replicas potentially spread across nodes, /raid/models must exist on every node; otherwise pods fail and caching is inconsistent.
Options:
- Use a ReadWriteMany PVC (NFS/FSx/NetApp) and mount at /root/.cache.
- If sticking to hostPath, add node affinity to constrain workers to nodes with that path and a DaemonSet pre-provisioner. I can draft a PVC-based patch if you share your storage class.
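A minimal sketch of the PVC route (the claim name and the `nfs-rwx` storage class are illustrative; any ReadWriteMany-capable provisioner such as NFS or FSx works):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany          # shared across workers on different nodes
  storageClassName: nfs-rwx  # assumption: an RWX-capable class exists
  resources:
    requests:
      storage: 100Gi
```

Workers would then mount this claim at /root/.cache in place of the hostPath volume.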
107-107: Minor: cleanup command spacing and ensure pipefail.

Double space before redirection; also consider pipefail so exit codes propagate.

```diff
- - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct 2>&1  | tee /tmp/vllm.log
+ - set -o pipefail; python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct 2>&1 | tee /tmp/vllm.log
```

examples/deployments/Distributed_Inference/README.md (4)
1-3: Tighten title and section grammar.

```diff
-# Distributed Inferences with Dynamo
-## 1. Single-Node-Sized Models hosting on multiple Nodes
-For SNS (Single-Node-Sized) Model, we can use Dynamo aggregated serving to deploy multiple replicas of the model and create a frontend with different routing strategies
+# Distributed Inference with Dynamo
+## 1. Single-Node-Sized models hosted on multiple nodes
+For a Single-Node-Sized (SNS) model, use Dynamo aggregated serving to deploy multiple replicas and a frontend with different routing strategies.
```
11-14: Grammar and naming fixes.

```diff
-Create a K8S namespace for your Dynamo application and install the Dynamo platform. It will install following pods:
-- ETCD
-- NATs
-- Dynamo Operator Controller
+Create a K8s namespace for your Dynamo application and install the Dynamo platform. It installs the following pods:
+- etcd
+- NATS
+- Dynamo Operator Controller
```
21-28: Typo and list intro punctuation; mention model path consistency.

```diff
-This `agg_router.yaml` is adpated from vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has following customizations
+This `agg_router.yaml` is adapted from the vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has the following customizations:
@@
-- Mounted a local cache folder `/YOUR/LOCAL/CACHE/FOLDER` for model artifacts reuse
+- Mounted a local cache folder for reusing model artifacts (update the hostPath in the YAML; default is `/raid/models`)
```

Also call out that the YAML uses hostPath; provide a PVC alternative if available.
43-55: Minor: request polish and typo fixes in the sample prompt.
- Add -sS for cleaner output; fix typos “ests”→“suggests”, “familt”→“family”.
```diff
-curl localhost:8000/v1/chat/completions \
+curl -sS localhost:8000/v1/chat/completions \
@@
-  "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
+  "content": "In the heart of Eldoria lies the long-forgotten city of Aeloria. An ancient map suggests that Aeloria holds a secret so profound it could reshape reality. Your Task: Character Background — describe your explorer’s motivations, skills, weaknesses, and any personal connection to Aeloria’s legends or family history."
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- examples/deployments/Distributed_Inference/README.md (1 hunks)
- examples/deployments/Distributed_Inference/agg_router.yaml (1 hunks)
🧰 Additional context used
🪛 LanguageTool

examples/deployments/Distributed_Inference/README.md

- [grammar] ~2: "…Models hosting on multiple Nodes For SNS (Single-Node-Sized) Model, we ca…" (QB_NEW_EN)
- [grammar] ~11: "It will install following pods: - ETCD - NATs - Dynamo …" (QB_NEW_EN, two hits)
- [grammar] ~21: "This agg_router.yaml is adpated from v…" (QB_NEW_EN)
- [grammar] ~22: "…agg_router.yaml). It has following customizations - Deployed Qw…" (QB_NEW_EN)
- [grammar] ~36: "…esting the deployment and run benchmarks After deployment, forward the frontend s…" (QB_NEW_EN)
- [grammar] ~41: "…and use following request to test the deployed…" (QB_NEW_EN)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2773/merge) by kylehh.
examples/deployments/Distributed_Inference/README.md
[error] 1-1: Pre-commit hook 'trailing-whitespace' failed (exit code 1) during 'pre-commit run --show-diff-on-failure --color=always --all-files'. Trailing whitespace detected and the file was updated: examples/deployments/Distributed_Inference/README.md.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
examples/deployments/Distributed_Inference/agg_router.yaml (3)
66-76: Confirm GPU resource key mapping.

This CRD uses gpu: "1". Verify the operator maps this to the correct extended resource (e.g., nvidia.com/gpu). If not, scheduling will fail.
Run a quick check against your cluster CRD docs/operator config and confirm whether gpu or nvidia.com/gpu is required. I can adapt the manifest accordingly.
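For reference, the plain Kubernetes spelling uses the device plugin's extended resource name; whether the operator translates `gpu: "1"` into this is exactly what needs verifying (sketch):

```yaml
resources:
  limits:
    nvidia.com/gpu: "1"  # extended resource advertised by the NVIDIA device plugin
```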
49-50: HF token env key: verify the name consumed by vLLM.

You create hf-token-secret with HF_TOKEN. Many stacks expect HUGGINGFACE_HUB_TOKEN or HUGGING_FACE_HUB_TOKEN.
If vLLM reads a different key, either rename the secret key or add an explicit env mapping:
```diff
 envs:
+  - name: HUGGINGFACE_HUB_TOKEN
+    valueFrom:
+      secretKeyRef:
+        name: hf-token-secret
+        key: HF_TOKEN
```

Also applies to: 76-84
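The lookup order below is an assumption (modern huggingface_hub prefers HF_TOKEN, with the legacy names as fallbacks); verify it against the huggingface_hub version in your vLLM image. A quick sketch of the resolution logic:

```shell
# Report which of the candidate token variables a fallback-style reader
# would pick up first ("none" if no candidate is set).
resolve_hf_token_source() {
  for key in HF_TOKEN HUGGING_FACE_HUB_TOKEN HUGGINGFACE_HUB_TOKEN; do
    if [ -n "$(printenv "$key")" ]; then
      echo "$key"
      return 0
    fi
  done
  echo "none"
  return 1
}

# With only the legacy name exported, that is the one found:
unset HF_TOKEN
export HUGGING_FACE_HUB_TOKEN=hf_demo
resolve_hf_token_source
```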
58-64: Probe thresholds are too forgiving; failures may take 10+ minutes to surface.

Readiness and startup failureThreshold=60 with a 10s period delays failure detection by ~10 minutes. Consider tighter bounds; also make liveness less aggressive than failureThreshold: 1.
```diff
 readinessProbe:
   httpGet:
     path: /health
     port: 9090
-  periodSeconds: 10
-  timeoutSeconds: 30
-  failureThreshold: 60
+  periodSeconds: 10
+  timeoutSeconds: 5
+  failureThreshold: 6
@@
 startupProbe:
   httpGet:
     path: /health
     port: 9090
-  periodSeconds: 10
-  failureThreshold: 60
+  periodSeconds: 5
+  failureThreshold: 24  # ~2 minutes
@@
 livenessProbe:
   httpGet:
     path: /live
     port: 9090
-  periodSeconds: 5
-  timeoutSeconds: 30
-  failureThreshold: 1
+  periodSeconds: 10
+  timeoutSeconds: 5
+  failureThreshold: 3
```

Also applies to: 92-98
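Worst-case detection time is simply periodSeconds × failureThreshold; a quick sanity check of the configured versus proposed bounds:

```shell
# Seconds until a pod is marked unready/failed = periodSeconds * failureThreshold
echo $((10 * 60))  # configured readiness: 600s (~10 minutes)
echo $((10 * 6))   # proposed readiness: 60s
echo $((5 * 24))   # proposed startup budget: 120s (~2 minutes)
```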
⛔ Skipped due to learnings
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

examples/deployments/Distributed_Inference/README.md (1)
31-35: Secret key name may not match vLLM expectations.

Tie this to the YAML note.
If vLLM expects HUGGINGFACE_HUB_TOKEN, rename the key:
```diff
-  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  --from-literal=HUGGINGFACE_HUB_TOKEN=${HF_TOKEN} \
```

Or keep HF_TOKEN and map via env as suggested in the YAML comment.
Signed-off-by: Neal Vaidya <[email protected]>
Force-pushed from 5a3eb5c to 3b6a760
Signed-off-by: Neal Vaidya <[email protected]>
Co-authored-by: Neal Vaidya <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Overview:
Create a distributed serving example for a single-node-sized model.
Details:
The first example has the following features:
Summary by CodeRabbit

- New Features
- Documentation