
Conversation


@eric-liu-nvidia eric-liu-nvidia commented Aug 19, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Introduces a multi-node inference deployment with a user-facing HTTP frontend and distributed decode workers.
    • Enables serving the Qwen/Qwen3-0.6B model with 2-way tensor parallelism for improved scalability.
    • Frontend exposes an HTTP server on port 8000 for requests.
    • Worker nodes load credentials from a configured secret for model access.


copy-pr-bot bot commented Aug 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi eric-liu-nvidia! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to catch errors quickly. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added feat external-contribution Pull request is from an external contributor labels Aug 19, 2025

coderabbitai bot commented Aug 19, 2025

Walkthrough

Introduces a new DynamoGraphDeployment YAML (vllm-mul) defining two services: a single-replica Frontend HTTP server on port 8000 and a multinode VllmDecodeWorker (2 nodes, tensor-parallel size 2) using shared runtime image, hf-token secret, and module-based startup commands.

Changes

Cohort / File(s): vLLM multinode deployment (components/backends/vllm/deploy/agg-multinode.yaml)
Summary: Adds a DynamoGraphDeployment named vllm-mul with a Frontend (Python-module HTTP server on port 8000) and a VllmDecodeWorker (2 nodes, TP=2) serving Qwen/Qwen3-0.6B, env from hf-token-secret, and /bin/sh -c commands for both containers, all within namespace vllm-mul.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Client
    participant Frontend as Frontend (HTTP :8000)
    participant Worker1 as VllmDecodeWorker [Node 1]
    participant Worker2 as VllmDecodeWorker [Node 2]

    Note over Frontend,Worker2: Deployment: DynamoGraphDeployment "vllm-mul"<br/>Tensor Parallel = 2 • Nodes = 2 • Model: Qwen/Qwen3-0.6B

    Client->>Frontend: POST /generate
    Frontend->>Worker1: Dispatch decode shard (tp rank 0)
    Frontend->>Worker2: Dispatch decode shard (tp rank 1)
    par Parallel decode
        Worker1-->>Frontend: Partial results / logits
        Worker2-->>Frontend: Partial results / logits
    end
    Frontend-->>Client: Aggregated response

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit taps deploy with gentle might,
Two workers hum, a frontend lights the night.
Qwen wakes up, tensors pair in tune,
Secrets whispered, pods in steady swoon.
Hop, hop—requests arrive, results take flight. 🐇✨



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
components/backends/vllm/deploy/agg-multinode.yaml (7)

7-7: Naming nit: align resource name with filename/purpose

The filename suggests “agg-multinode,” but the resource is named vllm-mul. Consider renaming to something like vllm-agg-multinode for clarity and discoverability.

Apply this minimal rename if desired:

-metadata:
-  name: vllm-mul
+metadata:
+  name: vllm-agg-multinode

Also update dynamoNamespace occurrences accordingly.


11-22: Frontend: prefer exec-form command, add containerPort and basic probes

  • Using /bin/sh -c is brittle when no shell features are needed. The exec form is safer and avoids shell quoting gotchas.
  • Expose container port 8000 explicitly; many controllers auto-wire Services/ingress based on this.
  • Add basic readiness/liveness probes (TCP is the safest if an HTTP health path isn’t guaranteed).

Apply this refactor:

       extraPodSpec:
         mainContainer:
           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
           workingDir: /workspace/components/backends/vllm
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - "python3 -m dynamo.frontend --http-port 8000"
+          command:
+            - python3
+            - -m
+            - dynamo.frontend
+            - --http-port
+            - "8000"
+          ports:
+            - name: http
+              containerPort: 8000
+          readinessProbe:
+            tcpSocket:
+              port: 8000
+            initialDelaySeconds: 5
+            periodSeconds: 10
+          livenessProbe:
+            tcpSocket:
+              port: 8000
+            initialDelaySeconds: 15
+            periodSeconds: 20

Please confirm whether the controller auto-creates a Service for the frontend or if an explicit Service manifest is required.
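If a manual Service does turn out to be needed, here is a minimal sketch (the selector labels are hypothetical; substitute whatever pod labels the operator actually applies):

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-agg-multinode-frontend
    spec:
      selector:
        app: vllm-agg-multinode-frontend  # hypothetical label; match the operator's pod labels
      ports:
        - name: http
          port: 8000
          targetPort: 8000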


26-27: Secret wiring: validate key names and existence

envFromSecret: hf-token-secret assumes a Secret exists and that the operator maps it into env vars correctly (e.g., HF_TOKEN). Please verify:

  • The Secret name is correct and created in the same namespace.
  • The expected key names match what vLLM/dynamo code reads.

If helpful, I can add a sample Secret manifest and a short setup README for this example.
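For illustration, a minimal Secret sketch, assuming the operator surfaces each key as an env var and that the expected key is HF_TOKEN (verify the key name against what the code actually reads):

    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
    type: Opaque
    stringData:
      HF_TOKEN: <your-huggingface-token>  # assumed key name; confirm before use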


31-38: Worker: add GPU resource requests/limits and harden container; also prefer exec-form command

  • Without resource requests/limits for nvidia.com/gpu, the pods may not be scheduled onto GPU nodes (unless the operator injects defaults).
  • Use exec-form command for consistency.
  • Optional hardening using a basic securityContext.

Apply this refactor:

         mainContainer:
           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
           workingDir: /workspace/components/backends/vllm
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --tensor-parallel-size 2 --no-kv-transfer-config
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+            - --model
+            - Qwen/Qwen3-0.6B
+            - --tensor-parallel-size
+            - "2"
+            - --no-kv-transfer-config
+          resources:
+            requests:
+              nvidia.com/gpu: 1
+            limits:
+              nvidia.com/gpu: 1
+          securityContext:
+            runAsNonRoot: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true

If 2 GPUs per node are desired, adjust requests/limits and tensor-parallel accordingly.
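For instance, a sketch of the 2-GPU-per-node variant, assuming the operator forwards resources unchanged:

    resources:
      requests:
        nvidia.com/gpu: 2
      limits:
        nvidia.com/gpu: 2

The tensor-parallel size would then need to match the total GPU count (e.g., --tensor-parallel-size 4 for 2 nodes x 2 GPUs), depending on how the operator maps ranks across nodes.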


16-16: Pin images to digests for reproducibility and supply imagePullSecrets if needed

  • Consider pinning the image to a digest (e.g., @sha256:...) to avoid drift when tags are re-pushed.
  • If the nvcr.io registry requires auth in your cluster, ensure imagePullSecrets are configured (either cluster-wide or via extraPodSpec if supported).

I can add a small doc section or Kustomize patch for imagePullSecrets if your operator supports injecting them via extraPodSpec.

Also applies to: 32-32
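For reference, a sketch combining both, using placeholders rather than real values (resolve the actual digest before use; the pull-secret name is hypothetical):

    extraPodSpec:
      imagePullSecrets:             # only if the operator forwards pod-level fields
        - name: nvcr-pull-secret    # hypothetical; create via kubectl create secret docker-registry
      mainContainer:
        image: nvcr.io/nvidia/ai-dynamo/vllm-runtime@sha256:<digest>  # placeholder digest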


11-22: Health, observability, and operability

  • stdout/stderr log aggregation is likely already covered, but consider setting terminationGracePeriodSeconds and a preStop hook for graceful shutdown if the operator doesn't handle it.
  • If the operator surfaces metrics, we might need to expose a metrics port and annotations for scraping.

Happy to add a minimal operability pass once we confirm what the operator auto-injects vs. what needs to be set here.

Also applies to: 31-38
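As a starting point, a sketch of the shutdown settings, assuming extraPodSpec forwards these fields to the pod spec:

    extraPodSpec:
      terminationGracePeriodSeconds: 60
      mainContainer:
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]  # placeholder drain delay for in-flight requests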


4-8: End-to-end usability: service exposure and docs

  • How is the frontend exposed to users (ClusterIP/NodePort/Ingress/Gateway)? If the operator doesn’t auto-create a Service and Ingress, we should add them so users can try the example.
  • The PR description is still the placeholder template. For reviewers and users, please add:
    • Prereqs (GPU nodes, CRDs installed, Secret creation, nvcr access).
    • Deploy steps and how to validate (curl against port 8000, sample prompt).
    • Expected output and teardown.

I can draft a README.md and the Service/Ingress manifests to land alongside this YAML. Want me to include those in this PR?

Also applies to: 10-23, 24-38
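If the operator does not create them, a minimal Ingress sketch to pair with the Service above (host, Service name, and ingress class are hypothetical):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: vllm-agg-multinode
    spec:
      rules:
        - host: vllm.example.com  # hypothetical host
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: vllm-agg-multinode-frontend  # hypothetical; see the Service sketch above
                    port:
                      number: 8000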

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 86a4a58 and 2fb6ffb.

📒 Files selected for processing (1)
  • components/backends/vllm/deploy/agg-multinode.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (3)
components/backends/vllm/deploy/agg-multinode.yaml (3)

4-8: Confirm CRD apiVersion and set Kubernetes namespace

  • Verify that apiVersion: nvidia.com/v1alpha1 and kind: DynamoGraphDeployment match the installed CRD version.
  • Consider setting metadata.namespace to avoid deploying into the default namespace unintentionally.

Would you like me to add a namespace and a companion Kustomize overlay to keep environments separated?
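For example (the namespace name is a suggestion only):

    metadata:
      name: vllm-mul
      namespace: dynamo-examples  # suggested non-default namespace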


24-29: Multinode semantics and scheduling: confirm anti-affinity and replica semantics

  • With multinode.nodeCount: 2 and replicas: 1, confirm that the operator distributes workers across nodes as intended and prevents co-location (anti-affinity). Otherwise, the two processes may land on the same node.
  • Ensure tensor-parallel-size 2 matches the physical GPUs available (one per node if that’s the intention).

I can add an explicit podAntiAffinity or topology spread (if supported by extraPodSpec) once you confirm the operator’s defaults.
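If extraPodSpec accepts pod-level scheduling fields, a sketch of a hard anti-affinity rule (the label key/value is hypothetical; use the labels the operator actually applies):

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: vllm-decode-worker  # hypothetical label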


38-38: Double-check flag intent: --no-kv-transfer-config

For an “aggregated multinode” example, confirm that disabling kv transfer is intentional. If the goal is to demonstrate cross-node KV exchange, this may need to be removed or replaced with the desired config.

I can adjust the startup args once you confirm the desired behavior.

@github-actions

This PR is stale because it has been open for 30 days with no activity. Remove the Stale label or comment, or it will be closed in 5 days.

@github-actions github-actions bot added the Stale label Sep 19, 2025
@github-actions

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.

@github-actions github-actions bot closed this Sep 24, 2025