
Conversation


@eric-liu-nvidia eric-liu-nvidia commented Aug 19, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Introduces a multi-node inference deployment with a user-facing HTTP frontend and distributed decode workers.
    • Enables serving the Qwen/Qwen3-0.6B model with 2-way tensor parallelism for improved scalability.
    • Frontend exposes an HTTP server on port 8000 for requests.
    • Worker nodes load credentials from a configured secret for model access.


copy-pr-bot bot commented Aug 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi eric-liu-nvidia! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to catch errors quickly. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added feat external-contribution Pull request is from an external contributor labels Aug 19, 2025

coderabbitai bot commented Aug 19, 2025

Walkthrough

Introduces a new DynamoGraphDeployment YAML (vllm-mul) defining two services: a single-replica Frontend HTTP server on port 8000 and a multinode VllmDecodeWorker (2 nodes, tensor-parallel size 2) using shared runtime image, hf-token secret, and module-based startup commands.

Changes

Cohort / File(s): vLLM multinode deployment (components/backends/vllm/deploy/agg-multinode.yaml)
Summary: Adds a DynamoGraphDeployment named vllm-mul with a Frontend (Python-module HTTP server on port 8000) and a VllmDecodeWorker (2 nodes, TP=2) serving Qwen/Qwen3-0.6B, env from hf-token-secret, and /bin/sh -c commands for both containers, all within namespace vllm-mul.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Client
    participant Frontend as Frontend (HTTP :8000)
    participant Worker1 as VllmDecodeWorker [Node 1]
    participant Worker2 as VllmDecodeWorker [Node 2]

    Note over Frontend,Worker2: Deployment: DynamoGraphDeployment "vllm-mul"<br/>Tensor Parallel = 2 • Nodes = 2 • Model: Qwen/Qwen3-0.6B

    Client->>Frontend: POST /generate
    Frontend->>Worker1: Dispatch decode shard (tp rank 0)
    Frontend->>Worker2: Dispatch decode shard (tp rank 1)
    par Parallel decode
        Worker1-->>Frontend: Partial results / logits
        Worker2-->>Frontend: Partial results / logits
    end
    Frontend-->>Client: Aggregated response

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit taps deploy with gentle might,
Two workers hum, a frontend lights the night.
Qwen wakes up, tensors pair in tune,
Secrets whispered, pods in steady swoon.
Hop, hop—requests arrive, results take flight. 🐇✨



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
components/backends/vllm/deploy/agg-multinode.yaml (7)

7-7: Naming nit: align resource name with filename/purpose

The filename suggests “agg-multinode,” but the resource is named vllm-mul. Consider renaming to something like vllm-agg-multinode for clarity and discoverability.

Apply this minimal rename if desired:

-metadata:
-  name: vllm-mul
+metadata:
+  name: vllm-agg-multinode

Also update dynamoNamespace occurrences accordingly.


11-22: Frontend: prefer exec-form command, add containerPort and basic probes

  • Using /bin/sh -c is brittle when no shell features are needed. The exec form is safer and avoids shell quoting gotchas.
  • Expose container port 8000 explicitly; many controllers auto-wire Services/ingress based on this.
  • Add basic readiness/liveness probes (TCP is the safest if an HTTP health path isn’t guaranteed).

Apply this refactor:

       extraPodSpec:
         mainContainer:
           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
           workingDir: /workspace/components/backends/vllm
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - "python3 -m dynamo.frontend --http-port 8000"
+          command:
+            - python3
+            - -m
+            - dynamo.frontend
+            - --http-port
+            - "8000"
+          ports:
+            - name: http
+              containerPort: 8000
+          readinessProbe:
+            tcpSocket:
+              port: 8000
+            initialDelaySeconds: 5
+            periodSeconds: 10
+          livenessProbe:
+            tcpSocket:
+              port: 8000
+            initialDelaySeconds: 15
+            periodSeconds: 20

Please confirm whether the controller auto-creates a Service for the frontend or if an explicit Service manifest is required.
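If a manual Service does turn out to be needed, here is a minimal sketch (the selector labels are hypothetical; substitute whatever pod labels the operator actually applies):

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-agg-multinode-frontend
    spec:
      selector:
        app: vllm-agg-multinode-frontend  # hypothetical label; match the operator's pod labels
      ports:
        - name: http
          port: 8000
          targetPort: 8000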


26-27: Secret wiring: validate key names and existence

envFromSecret: hf-token-secret assumes a Secret exists and that the operator maps it into env vars correctly (e.g., HF_TOKEN). Please verify:

  • The Secret name is correct and created in the same namespace.
  • The expected key names match what vLLM/dynamo code reads.

If helpful, I can add a sample Secret manifest and a short setup README for this example.
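For illustration, a minimal Secret sketch, assuming the operator surfaces each key as an env var and that the expected key is HF_TOKEN (verify the key name against what the code actually reads):

    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
    type: Opaque
    stringData:
      HF_TOKEN: <your-huggingface-token>  # assumed key name; confirm before use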


31-38: Worker: add GPU resource requests/limits and harden container; also prefer exec-form command

  • Without resource requests/limits for nvidia.com/gpu, the pods may not be scheduled onto GPU nodes (unless the operator injects defaults).
  • Use exec-form command for consistency.
  • Optional hardening using a basic securityContext.

Apply this refactor:

         mainContainer:
           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
           workingDir: /workspace/components/backends/vllm
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --tensor-parallel-size 2 --no-kv-transfer-config
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+            - --model
+            - Qwen/Qwen3-0.6B
+            - --tensor-parallel-size
+            - "2"
+            - --no-kv-transfer-config
+          resources:
+            requests:
+              nvidia.com/gpu: 1
+            limits:
+              nvidia.com/gpu: 1
+          securityContext:
+            runAsNonRoot: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true

If 2 GPUs per node are desired, adjust requests/limits and tensor-parallel accordingly.
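For instance, a sketch of the 2-GPU-per-node variant, assuming the operator forwards resources unchanged:

    resources:
      requests:
        nvidia.com/gpu: 2
      limits:
        nvidia.com/gpu: 2

The tensor-parallel size would then need to match the total GPU count (e.g., --tensor-parallel-size 4 for 2 nodes x 2 GPUs), depending on how the operator maps ranks across nodes.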


16-16: Pin images to digests for reproducibility and supply imagePullSecrets if needed

  • Consider pinning the image to a digest (e.g., @sha256:...) to avoid drift when tags are re-pushed.
  • If the nvcr.io registry requires auth in your cluster, ensure imagePullSecrets are configured (either cluster-wide or via extraPodSpec if supported).

I can add a small doc section or Kustomize patch for imagePullSecrets if your operator supports injecting them via extraPodSpec.

Also applies to: 32-32
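For reference, a sketch combining both, using placeholders rather than real values (resolve the actual digest before use; the pull-secret name is hypothetical):

    extraPodSpec:
      imagePullSecrets:             # only if the operator forwards pod-level fields
        - name: nvcr-pull-secret    # hypothetical; create via kubectl create secret docker-registry
      mainContainer:
        image: nvcr.io/nvidia/ai-dynamo/vllm-runtime@sha256:<digest>  # placeholder digest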


11-22: Health, observability, and operability

  • stdout/stderr log aggregation is likely already covered, but consider setting terminationGracePeriodSeconds and a preStop hook for graceful shutdown if the operator doesn't handle it.
  • If the operator surfaces metrics, we might need to expose a metrics port and annotations for scraping.

Happy to add a minimal operability pass once we confirm what the operator auto-injects vs. what needs to be set here.

Also applies to: 31-38
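As a starting point, a sketch of the shutdown settings, assuming extraPodSpec forwards these fields to the pod spec:

    extraPodSpec:
      terminationGracePeriodSeconds: 60
      mainContainer:
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]  # placeholder drain delay for in-flight requests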


4-8: End-to-end usability: service exposure and docs

  • How is the frontend exposed to users (ClusterIP/NodePort/Ingress/Gateway)? If the operator doesn’t auto-create a Service and Ingress, we should add them so users can try the example.
  • The PR description is still the placeholder template. For reviewers and users, please add:
    • Prereqs (GPU nodes, CRDs installed, Secret creation, nvcr access).
    • Deploy steps and how to validate (curl against port 8000, sample prompt).
    • Expected output and teardown.

I can draft a README.md and the Service/Ingress manifests to land alongside this YAML. Want me to include those in this PR?

Also applies to: 10-23, 24-38
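If the operator does not create them, a minimal Ingress sketch to pair with the Service above (host, Service name, and ingress class are hypothetical):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: vllm-agg-multinode
    spec:
      rules:
        - host: vllm.example.com  # hypothetical host
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: vllm-agg-multinode-frontend  # hypothetical; see the Service sketch above
                    port:
                      number: 8000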

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 86a4a58 and 2fb6ffb.

📒 Files selected for processing (1)
  • components/backends/vllm/deploy/agg-multinode.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (3)
components/backends/vllm/deploy/agg-multinode.yaml (3)

4-8: Confirm CRD apiVersion and set Kubernetes namespace

  • Verify that apiVersion: nvidia.com/v1alpha1 and kind: DynamoGraphDeployment match the installed CRD version.
  • Consider setting metadata.namespace to avoid deploying into the default namespace unintentionally.

Would you like me to add a namespace and a companion Kustomize overlay to keep environments separated?
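For example (the namespace name is a suggestion only):

    metadata:
      name: vllm-mul
      namespace: dynamo-examples  # suggested non-default namespace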


24-29: Multinode semantics and scheduling: confirm anti-affinity and replica semantics

  • With multinode.nodeCount: 2 and replicas: 1, confirm that the operator distributes workers across nodes as intended and prevents co-location (anti-affinity). Otherwise, the two processes may land on the same node.
  • Ensure tensor-parallel-size 2 matches the physical GPUs available (one per node if that’s the intention).

I can add an explicit podAntiAffinity or topology spread (if supported by extraPodSpec) once you confirm the operator’s defaults.
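If extraPodSpec accepts pod-level scheduling fields, a sketch of a hard anti-affinity rule (the label key/value is hypothetical; use the labels the operator actually applies):

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: vllm-decode-worker  # hypothetical label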


38-38: Double-check flag intent: --no-kv-transfer-config

For an “aggregated multinode” example, confirm that disabling kv transfer is intentional. If the goal is to demonstrate cross-node KV exchange, this may need to be removed or replaced with the desired config.

I can adjust the startup args once you confirm the desired behavior.

@github-actions

This PR is stale because it has been open for 30 days with no activity. Remove the Stale label or comment, or it will be closed in 5 days.

@github-actions github-actions bot added the Stale label Sep 19, 2025
@github-actions

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.

@github-actions github-actions bot closed this Sep 24, 2025