feat: add vllm aggregated multinode deployment example #2541
Conversation
Signed-off-by: Eric Liu <[email protected]>
👋 Hi eric-liu-nvidia! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

Introduces a new DynamoGraphDeployment YAML (vllm-mul) defining two services: a single-replica Frontend HTTP server on port 8000 and a multinode VllmDecodeWorker (2 nodes, tensor-parallel size 2) using a shared runtime image, an hf-token secret, and module-based startup commands.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant Frontend as Frontend (HTTP :8000)
    participant Worker1 as VllmDecodeWorker [Node 1]
    participant Worker2 as VllmDecodeWorker [Node 2]
    Note over Frontend,Worker2: Deployment: DynamoGraphDeployment "vllm-mul"<br/>Tensor Parallel = 2 • Nodes = 2 • Model: Qwen/Qwen3-0.6B
    Client->>Frontend: POST /generate
    Frontend->>Worker1: Dispatch decode shard (tp rank 0)
    Frontend->>Worker2: Dispatch decode shard (tp rank 1)
    par Parallel decode
        Worker1-->>Frontend: Partial results / logits
        Worker2-->>Frontend: Partial results / logits
    end
    Frontend-->>Client: Aggregated response
```
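For orientation, the manifest under review looks roughly like the sketch below. It is reconstructed from the walkthrough and the review comments that follow; the exact field names, indentation, and extra fields in the actual agg-multinode.yaml may differ.

```yaml
# Approximate reconstruction of components/backends/vllm/deploy/agg-multinode.yaml,
# pieced together from the review comments below; treat field layout as an assumption.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-mul
spec:
  services:
    Frontend:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - "python3 -m dynamo.frontend --http-port 8000"
    VllmDecodeWorker:
      replicas: 1
      multinode:
        nodeCount: 2
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --tensor-parallel-size 2 --no-kv-transfer-config
```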
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 0
🧹 Nitpick comments (7)
components/backends/vllm/deploy/agg-multinode.yaml (7)
7-7: Naming nit: align resource name with filename/purpose

The filename suggests "agg-multinode," but the resource is named `vllm-mul`. Consider renaming to something like `vllm-agg-multinode` for clarity and discoverability. Apply this minimal rename if desired:

```diff
-metadata:
-  name: vllm-mul
+metadata:
+  name: vllm-agg-multinode
```

Also update `dynamoNamespace` occurrences accordingly.
11-22: Frontend: prefer exec-form command, add containerPort and basic probes

- Using `/bin/sh -c` is brittle when no shell features are needed. The exec form is safer and avoids shell quoting gotchas.
- Expose container port 8000 explicitly; many controllers auto-wire Services/ingress based on this.
- Add basic readiness/liveness probes (TCP is the safest if an HTTP health path isn't guaranteed).

Apply this refactor:

```diff
 extraPodSpec:
   mainContainer:
     image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
     workingDir: /workspace/components/backends/vllm
-    command:
-      - /bin/sh
-      - -c
-    args:
-      - "python3 -m dynamo.frontend --http-port 8000"
+    command:
+      - python3
+      - -m
+      - dynamo.frontend
+      - --http-port
+      - "8000"
+    ports:
+      - name: http
+        containerPort: 8000
+    readinessProbe:
+      tcpSocket:
+        port: 8000
+      initialDelaySeconds: 5
+      periodSeconds: 10
+    livenessProbe:
+      tcpSocket:
+        port: 8000
+      initialDelaySeconds: 15
+      periodSeconds: 20
```

Please confirm whether the controller auto-creates a Service for the frontend or if an explicit Service manifest is required.
26-27: Secret wiring: validate key names and existence

`envFromSecret: hf-token-secret` assumes a Secret exists and that the operator maps it into env vars correctly (e.g., `HF_TOKEN`). Please verify:

- The Secret name is correct and created in the same namespace.
- The expected key names match what vLLM/dynamo code reads.

If helpful, I can add a sample Secret manifest and a short setup README for this example.
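For example, something along these lines; the `HF_TOKEN` key name is an assumption about what the operator maps into the container environment and should be checked against the code:

```yaml
# Hypothetical Secret for this example; the key name HF_TOKEN is an assumption.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  HF_TOKEN: <your-hugging-face-token>
```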
31-38: Worker: add GPU resource requests/limits and harden container; also prefer exec-form command

- Without `resources` requests/limits for `nvidia.com/gpu`, the pods may not get scheduled on GPU nodes (unless the operator injects defaults).
- Use exec-form command for consistency.
- Optional hardening using a basic `securityContext`.

Apply this refactor:

```diff
 mainContainer:
   image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
   workingDir: /workspace/components/backends/vllm
-  command:
-    - /bin/sh
-    - -c
-  args:
-    - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --tensor-parallel-size 2 --no-kv-transfer-config
+  command:
+    - python3
+    - -m
+    - dynamo.vllm
+    - --model
+    - Qwen/Qwen3-0.6B
+    - --tensor-parallel-size
+    - "2"
+    - --no-kv-transfer-config
+  resources:
+    requests:
+      nvidia.com/gpu: 1
+    limits:
+      nvidia.com/gpu: 1
+  securityContext:
+    runAsNonRoot: true
+    allowPrivilegeEscalation: false
+    readOnlyRootFilesystem: true
```

If 2 GPUs per node are desired, adjust requests/limits and tensor-parallel accordingly.
16-16: Pin images to digests for reproducibility and supply imagePullSecrets if needed

- Consider pinning the image to a digest (e.g., `@sha256:...`) to avoid drift when tags are re-pushed.
- If the nvcr.io registry requires auth in your cluster, ensure `imagePullSecrets` are configured (either cluster-wide or via `extraPodSpec` if supported).

I can add a small doc section or Kustomize patch for `imagePullSecrets` if your operator supports injecting them via `extraPodSpec`.

Also applies to: 32-32
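If `extraPodSpec` does forward pod-level fields (an assumption to verify against the CRD), the patch could look roughly like this:

```yaml
# Sketch only: assumes extraPodSpec passes imagePullSecrets through to the generated Pod spec.
extraPodSpec:
  imagePullSecrets:
    - name: nvcr-pull-secret  # hypothetical Secret of type kubernetes.io/dockerconfigjson
  mainContainer:
    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime@sha256:<digest>  # pin by digest instead of tag
```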
11-22: Health, observability, and operability

- Consider adding basic logging stdout/stderr aggregation (likely already fine), but also `terminationGracePeriodSeconds` and a `preStop` hook for graceful shutdown if the operator doesn't handle it.
- If the operator surfaces metrics, we might need to expose a metrics port and annotations for scraping.

Happy to add a minimal operability pass once we confirm what the operator auto-injects vs. what needs to be set here.

Also applies to: 31-38
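As a starting point, a hedged sketch of the graceful-shutdown fields mentioned above; whether `extraPodSpec` and `mainContainer` forward these fields is an assumption to confirm against the operator:

```yaml
# Sketch only: assumes the operator forwards these fields to the generated Pod spec.
extraPodSpec:
  terminationGracePeriodSeconds: 60
  mainContainer:
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # placeholder; replace with a real drain step
```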
4-8: End-to-end usability: service exposure and docs
- How is the frontend exposed to users (ClusterIP/NodePort/Ingress/Gateway)? If the operator doesn’t auto-create a Service and Ingress, we should add them so users can try the example.
- The PR description is a placeholder. For reviewers and users, please add:
  - Prereqs (GPU nodes, CRDs installed, Secret creation, nvcr access).
  - Deploy steps and how to validate (curl against port 8000, sample prompt).
  - Expected output and teardown.
I can draft a README.md and the Service/Ingress manifests to land alongside this YAML. Want me to include those in this PR?
Also applies to: 10-23, 24-38
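If the operator does not create a Service automatically, a companion manifest could look roughly like the sketch below; the selector labels are assumptions and must match whatever labels the operator actually applies to the Frontend pods:

```yaml
# Sketch only: selector labels are hypothetical and must be aligned with operator-generated pod labels.
apiVersion: v1
kind: Service
metadata:
  name: vllm-mul-frontend
spec:
  selector:
    nvidia.com/dynamo-component: Frontend         # assumed label key/value
    nvidia.com/dynamo-graph-deployment: vllm-mul  # assumed label key/value
  ports:
    - name: http
      port: 8000
      targetPort: 8000
```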
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/backends/vllm/deploy/agg-multinode.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (3)
components/backends/vllm/deploy/agg-multinode.yaml (3)
4-8: Confirm CRD apiVersion and set Kubernetes namespace

- Verify that `apiVersion: nvidia.com/v1alpha1` and `kind: DynamoGraphDeployment` match the installed CRD version.
- Consider setting `metadata.namespace` to avoid deploying into the default namespace unintentionally.

Would you like me to add a namespace and a companion Kustomize overlay to keep environments separated?
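For illustration, a minimal overlay of the kind offered above; the namespace and file layout are hypothetical and assume the kustomization sits next to the example manifest:

```yaml
# kustomization.yaml (hypothetical): pins the namespace without editing the example manifest.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: dynamo-examples   # placeholder namespace
resources:
  - agg-multinode.yaml
```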
24-29: Multinode semantics and scheduling: confirm anti-affinity and replica semantics

- With `multinode.nodeCount: 2` and `replicas: 1`, confirm that the operator distributes workers across nodes as intended and prevents co-location (anti-affinity). Otherwise, the two processes may land on the same node.
- Ensure `tensor-parallel-size 2` matches the physical GPUs available (one per node if that's the intention).

I can add an explicit podAntiAffinity or topology spread (if supported by `extraPodSpec`) once you confirm the operator's defaults.
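If `extraPodSpec` accepts scheduling fields (an assumption to verify), a topology spread sketch could look like this:

```yaml
# Sketch only: assumes extraPodSpec forwards topologySpreadConstraints and that the operator
# labels worker pods with a selectable label (the key/value below are assumptions).
extraPodSpec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          nvidia.com/dynamo-component: VllmDecodeWorker  # assumed label
```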
38-38: Double-check flag intent: `--no-kv-transfer-config`

For an "aggregated multinode" example, confirm that disabling kv transfer is intentional. If the goal is to demonstrate cross-node KV exchange, this may need to be removed or replaced with the desired config.
I can adjust the startup args once you confirm the desired behavior.
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information. |