
Conversation

biswapanda (Contributor) commented Aug 29, 2025

Overview:

Dynamo model serving recipes

| Model family | Backend | Mode | Deployment | Benchmark |
|--------------|---------|------|------------|-----------|
| llama-3-70b  | vllm    | agg  |            |           |
| llama-3-70b  | vllm    | disagg-multi-node  |  |  |
| llama-3-70b  | vllm    | disagg-single-node |  |  |
| oss-gpt      | trtllm  | aggregated         |  |  |
| DeepSeek-R1  | sglang  | disaggregated      | 🚧 | 🚧 |

closes: DEP-369, DEP-365, DEP-361, DYN-974, DYN-916

Summary by CodeRabbit

  • New Features

    • Added a Llama‑3‑70B aggregate deployment and benchmarking recipe.
    • Included Kubernetes manifests to provision model cache storage, download the model, deploy services, and run performance benchmarks (aggregate and single-endpoint).
    • Added a one-command script to automate setup, deployment, and benchmark execution with logs.
  • Documentation

    • Introduced a README outlining prerequisites (shared storage class, Hugging Face token secret) and step-by-step usage instructions.

biswapanda self-assigned this on Aug 29, 2025
github-actions bot added the feat label on Aug 29, 2025
coderabbitai bot (Contributor) commented Aug 29, 2025

Walkthrough

Adds a new Llama-3-70B recipe: PVC and model download Job, an aggregation deployment (Frontend + vLLM prefill worker), benchmark Jobs (agg and single-endpoint), a run script to orchestrate apply/wait/log workflow, and a README with prerequisites and usage.

Changes

| Cohort / File(s) | Summary of Changes |
|------------------|--------------------|
| **Documentation**<br>`recipies/llama-3-70b/README.md` | New README describing prerequisites (shared storage class, HF token secret) and run instructions (`./run.sh`). |
| **Aggregation Deployment**<br>`recipies/llama-3-70b/agg/llama3-70b-agg.yaml` | Adds a DynamoGraphDeployment for Llama-3-70B on vLLM with Frontend and VllmPrefillWorker services, PVC mounts, hf-token secret, resource specs, and a command to run `python -m dynamo.vllm`. |
| **Benchmark Jobs**<br>`recipies/llama-3-70b/agg/benchmark-job.yaml`, `recipies/llama-3-70b/benchmark/benchmark-job.yaml` | Adds two Kubernetes Jobs that wait for model readiness and run genai-perf profiling against the agg and single-node endpoints; includes resources, model ID, polling, artifact export, and PVC mount. |
| **Model Assets**<br>`recipies/llama-3-70b/model/model-cache.yaml`, `recipies/llama-3-70b/model/model-download.yaml` | Introduces a PVC (ReadWriteMany, 100Gi, configurable storage class) and a Job to pre-download the HF model to the cache using huggingface-cli, with hf-token secret and resource requests/limits. |
| **Orchestration Script**<br>`recipies/llama-3-70b/run.sh` | New script to apply the PVC and download Job, wait for completion, deploy the aggregation graph, run the benchmark Job, wait, and fetch logs in a target namespace. |
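
For orientation, a minimal sketch of the orchestration flow just described (namespace and job names are taken from review comments below or assumed; this is not the exact script):

```bash
#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="${NAMESPACE:-test-bis}"           # assumed default namespace
CUR_DIR="$(cd "$(dirname "$0")" && pwd)"

# Provision the model cache and pre-download the weights.
kubectl apply -n "$NAMESPACE" -f "$CUR_DIR/model/model-cache.yaml"
kubectl apply -n "$NAMESPACE" -f "$CUR_DIR/model/model-download.yaml"
kubectl wait --for=condition=Complete job/model-download \
  -n "$NAMESPACE" --timeout=6000s            # job name is an assumption

# Deploy the aggregation graph, then run the benchmark and collect logs.
kubectl apply -n "$NAMESPACE" -f "$CUR_DIR/agg/llama3-70b-agg.yaml"
kubectl apply -n "$NAMESPACE" -f "$CUR_DIR/agg/benchmark-job.yaml"
kubectl wait --for=condition=Complete job/llama-benchmark-job \
  -n "$NAMESPACE" --timeout=6000s
kubectl logs job/llama-benchmark-job -n "$NAMESPACE"
```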

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant S as run.sh
  participant K as Kubernetes API
  participant PVC as model-cache PVC
  participant DL as Model Download Job
  participant HF as Hugging Face
  participant D as DynamoGraphDeployment
  participant F as Frontend (vLLM)
  participant W as VllmPrefillWorker
  participant B as Benchmark Job
  participant E as /v1 endpoint

  U->>S: Execute ./run.sh
  S->>K: apply model-cache.yaml
  K-->>PVC: Create PVC
  S->>K: apply model-download.yaml
  K-->>DL: Create Job/Pod
  DL->>HF: Download model artifacts
  HF-->>DL: Model files
  DL->>PVC: Write cache
  S->>K: wait Job Complete
  S->>K: apply llama3-70b-agg.yaml
  K-->>D: Create deployment
  D->>F: Start frontend
  D->>W: Start prefill worker
  S->>K: apply benchmark-job.yaml
  K-->>B: Create Job/Pod
  B->>E: Poll /v1/models until model ready
  B->>E: Run genai-perf requests (streaming)
  E-->>B: Responses
  B->>B: Export CSV artifacts
  S->>K: wait Job Complete
  S->>K: fetch logs
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A whisk of pods, a cache to fill,
I hop through YAML, calm and still.
Prefill hums, the frontend sings,
Benchmarks flutter on streaming wings.
With secrets snug and PVCs tight—
Llama awakens, swift and bright.
Thump-thump! The graphs take flight. 🐇🚀


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 15

🧹 Nitpick comments (14)
recipies/llama-3-70b/model/model-cache.yaml (2)

10-10: Double-check capacity for 70B model artifacts.

100Gi may be tight for weights + tokenizer + caches; the FP8 checkpoint alone is roughly 70 GB (70B parameters at ~1 byte each). Consider 200–300Gi depending on snapshots and HF cache behavior.


11-11: Add trailing newline.

YAML lint flagged missing newline at EOF.

recipies/llama-3-70b/README.md (3)

1-1: Remove trailing period from heading.

Complies with MD026.

```diff
-# This recipe is used to deploy and benchmark llama-3-70b model in aggregate mode.
+# This recipe is used to deploy and benchmark llama-3-70b model in aggregate mode
```

5-7: Minor wording and completeness improvements to prerequisites.

Capitalize “Hugging Face” and add two actionable prerequisites users commonly miss.

```diff
-- A shared storage class
-- A secret with the huggingface token
+- A shared storage class (RWX) and its name configured in model-cache.yaml
+- A secret with the Hugging Face token (name: hf-token-secret)
+- A Kubernetes namespace (matches NAMESPACE in run.sh; default: test-bis)
```

If you want, I can add example commands to create the namespace and secret.
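
For example (the secret key name `HF_TOKEN` is an assumption; match whatever key the manifests actually reference):

```bash
kubectl create namespace test-bis
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-huggingface-token> \
  -n test-bis
```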


10-12: Mention namespace and storage class requirements right where users execute.

Prevents apply failures.

```diff
-./run.sh
+NAMESPACE=test-bis ./run.sh
+# Ensure model-cache.yaml has a valid storageClassName before running.
```

recipies/llama-3-70b/run.sh (2)

11-11: Consider shorter, bounded wait.

6000s is excessive; if download stalls, the pipeline hangs. Suggest 1800–3600s with retries/logs.
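
One hedged option, assuming the download Job is named model-download: loop over shorter waits so progress is surfaced and the pipeline still fails within a bound:

```bash
# Six 600s waits (~1h total) with pod status between attempts; job name is an assumption.
for attempt in 1 2 3 4 5 6; do
  kubectl wait --for=condition=Complete job/model-download -n "$NAMESPACE" --timeout=600s && break
  kubectl get pods -n "$NAMESPACE" -l job-name=model-download
  [ "$attempt" -eq 6 ] && { echo "model download timed out"; exit 1; }
done
```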

16-21: Optionally surface benchmark exit code and fail fast.

Propagate job failure by checking status before logs.


```diff
 kubectl apply -n $NAMESPACE -f $CUR_DIR/agg/benchmark-job.yaml
-kubectl wait --for=condition=Complete job/llama-benchmark-job -n $NAMESPACE --timeout=6000s
+kubectl wait --for=condition=Complete job/llama-benchmark-job -n $NAMESPACE --timeout=6000s || {
+  echo "Benchmark job did not complete successfully."
+  kubectl get pods -n $NAMESPACE -l job-name=llama-benchmark-job -o wide || true
+  kubectl logs job/llama-benchmark-job -n $NAMESPACE --all-containers || true
+  exit 1
+}
```

recipies/llama-3-70b/agg/llama3-70b-agg.yaml (2)

23-41: Consider adding a non-root security context (if image supports it).

Reduces risk and satisfies common policy checks.

```diff
       extraPodSpec:
         mainContainer:
+          securityContext:
+            runAsNonRoot: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
           args:
           - "python3 -m dynamo.vllm --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
           command:
           - /bin/sh
           - -c
           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
           workingDir: /workspace/components/backends/vllm
```

48-48: Add trailing newline.

Fix YAML lint warning.

recipies/llama-3-70b/agg/benchmark-job.yaml (3)

24-35: Avoid a hard dependency on jq, or ensure it's present.

The image may not include jq. Either switch to Python's stdlib JSON parsing, install jq, or document the requirement.

Python-based readiness (no jq):

```diff
-          while ! curl -s "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+          while ! python3 - "$TARGET_MODEL" "$ENDPOINT" >/dev/null 2>&1 <<'PY'
+import json,sys,urllib.request
+model,ep=sys.argv[1],sys.argv[2]
+data=json.load(urllib.request.urlopen(f"http://{ep}/v1/models"))
+sys.exit(0 if any(m.get("id")==model for m in data.get("data", [])) else 1)
+PY
+          do
```

43-53: Deduplicate ignore_eos parameters.

Both plain and nvext JSON flags are set; keep only the one your stack honors to avoid ambiguity.

```diff
-            --extra-inputs ignore_eos:true \
-            --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
+            --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
```

(or vice versa)
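
For reference, a minimal genai-perf invocation of the shape this Job presumably runs (flag names are standard genai-perf options, but the exact argument set and values used here are assumptions):

```bash
genai-perf profile \
  --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
  --url "http://$ENDPOINT" \
  --endpoint-type chat \
  --streaming \
  --concurrency 8 \
  --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
  --artifact-dir /artifacts   # exported CSV artifacts land here
```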


61-67: Add pod securityContext to satisfy baseline policies.

Only if the image supports non-root.

```diff
       resources:
         limits:
           cpu: "64"
           memory: 80Gi
         requests:
           cpu: "64"
           memory: 80Gi
+      securityContext:
+        runAsNonRoot: true
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
```
recipies/llama-3-70b/benchmark/benchmark-job.yaml (1)

59-65: Consider security implications of high resource allocation.

The job requests 64 CPUs and 80Gi memory, which is substantial. Additionally, the static analysis flags potential security concerns with privilege escalation and root containers.

Consider adding security context and resource monitoring:

```diff
         resources:
           limits:
             cpu: "64"
             memory: 80Gi
           requests:
             cpu: "64"
             memory: 80Gi
+        securityContext:
+          runAsNonRoot: true
+          runAsUser: 1000
+          allowPrivilegeEscalation: false
+          readOnlyRootFilesystem: false
+          capabilities:
+            drop:
+            - ALL
```
recipies/llama-3-70b/model/model-download.yaml (1)

21-27: Consider security hardening for the container.

Similar to the benchmark job, this container runs as root and could benefit from security hardening, especially since it's downloading external content.

Add security context to follow least privilege principle:

```diff
           resources:
             requests:
               cpu: "10"
               memory: "5Gi"
             limits:
               cpu: "10"
               memory: "5Gi"
+          securityContext:
+            runAsNonRoot: true
+            runAsUser: 1000
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: false
+            capabilities:
+              drop:
+              - ALL
```
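
For context, the download step likely reduces to something like this (model ID taken from the deployment manifest; the cache path and local-dir layout are assumptions):

```bash
# HF_TOKEN is injected from the hf-token secret; HF_HOME points at the PVC mount.
export HF_HOME=/model-cache
huggingface-cli download RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
  --local-dir /model-cache/Llama-3.3-70B-Instruct-FP8-dynamic
```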
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 3c4adde and 143e9f8.

📒 Files selected for processing (7)
  • recipies/llama-3-70b/README.md (1 hunks)
  • recipies/llama-3-70b/agg/benchmark-job.yaml (1 hunks)
  • recipies/llama-3-70b/agg/llama3-70b-agg.yaml (1 hunks)
  • recipies/llama-3-70b/benchmark/benchmark-job.yaml (1 hunks)
  • recipies/llama-3-70b/model/model-cache.yaml (1 hunks)
  • recipies/llama-3-70b/model/model-download.yaml (1 hunks)
  • recipies/llama-3-70b/run.sh (1 hunks)
🧰 Additional context used
🪛 LanguageTool
recipies/llama-3-70b/README.md

[grammar] ~5-~5: There might be a mistake here.
Context: ... Prerequisites - A shared storage class - A secret with the huggingface token # R...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
recipies/llama-3-70b/README.md

1-1: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)

🪛 Checkov (3.2.334)
recipies/llama-3-70b/agg/benchmark-job.yaml

[MEDIUM] 1-74: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[MEDIUM] 1-74: Minimize the admission of root containers

(CKV_K8S_23)

recipies/llama-3-70b/benchmark/benchmark-job.yaml

[MEDIUM] 1-72: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[MEDIUM] 1-72: Minimize the admission of root containers

(CKV_K8S_23)

recipies/llama-3-70b/model/model-download.yaml

[MEDIUM] 3-51: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[MEDIUM] 3-51: Minimize the admission of root containers

(CKV_K8S_23)

🪛 GitHub Actions: Copyright Checks
recipies/llama-3-70b/agg/benchmark-job.yaml

[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/agg/benchmark-job.yaml. (Script: copyright-check.ps1)

recipies/llama-3-70b/model/model-cache.yaml

[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/model/model-cache.yaml. (Script: copyright-check.ps1)

recipies/llama-3-70b/benchmark/benchmark-job.yaml

[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/benchmark/benchmark-job.yaml. (Script: copyright-check.ps1)

recipies/llama-3-70b/run.sh

[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/run.sh. (Script: copyright-check.ps1)

🪛 YAMLlint (1.37.1)
recipies/llama-3-70b/model/model-cache.yaml

[error] 11-11: no new line character at the end of file

(new-line-at-end-of-file)

recipies/llama-3-70b/model/model-download.yaml

[warning] 49-49: wrong indentation: expected 8 but found 6

(indentation)


[error] 51-51: no new line character at the end of file

(new-line-at-end-of-file)

recipies/llama-3-70b/agg/llama3-70b-agg.yaml

[error] 48-48: no new line character at the end of file

(new-line-at-end-of-file)

⏰ Checks skipped due to the 90000ms timeout (2):
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
recipies/llama-3-70b/model/model-cache.yaml (1)

6-7: Confirm RWX support.

PVC requires ReadWriteMany; ensure the chosen StorageClass provides RWX (e.g., NFS/FSx/Portworx/Ceph). Otherwise mounts will fail at runtime.
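
A quick way to check before applying (provisioner names vary by cluster):

```bash
# List storage classes, then inspect the one referenced by model-cache.yaml.
kubectl get storageclass
kubectl describe storageclass <your-class> | grep -i provisioner
```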

recipies/llama-3-70b/agg/llama3-70b-agg.yaml (2)

9-21: LGTM for Frontend block and PVC wiring.

Image, workingDir, and shared model cache are consistent with the rest of the recipe.


31-45: Resource requests look coherent; verify cluster capacity.

Replicas=1 with gpu: "8", memory 160Gi, shm 80Gi; ensure nodes satisfy these to avoid Pending pods.
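
To verify node capacity up front (the GPU resource key assumes the NVIDIA device plugin):

```bash
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,MEM:.status.allocatable.memory'
```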

recipies/llama-3-70b/agg/benchmark-job.yaml (2)

62-67: Check cluster capacity for 64 CPU/80Gi.

Large CPU/memory requests can block scheduling; tune for your node shape.


25-25: Verify service DNS for ENDPOINT
In recipies/llama-3-70b/agg/benchmark-job.yaml, the line

`ENDPOINT=llama3-70b-agg-0:8000`

assumes a Service named `llama3-70b-agg-0`. Confirm the CRD or a Service exposes this DNS name, or update ENDPOINT to the actual service name and port.
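
To confirm, list Services and probe from inside the cluster (throwaway curl pod; the image choice is arbitrary):

```bash
kubectl get svc -n "$NAMESPACE"
kubectl run curl-check --rm -it --restart=Never -n "$NAMESPACE" \
  --image=curlimages/curl -- curl -s http://llama3-70b-agg-0:8000/v1/models
```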

recipies/llama-3-70b/model/model-download.yaml (1)

1-2: LGTM on copyright header.

The copyright header is properly formatted and follows the required SPDX format.

biswapanda changed the title from "feat: add deployment and benchmarking job for llama3, agg, vllm" to "feat: add deployment and benchmarking recipies for llama3-70b models" on Aug 31, 2025
biswapanda changed the title from "feat: add deployment and benchmarking recipies for llama3-70b models" to "feat: add single-liner deployment and benchmarking recipe for llama3-70b models" on Sep 2, 2025
biswapanda marked this pull request as draft on September 3, 2025 01:48
copy-pr-bot bot commented Sep 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

biswapanda force-pushed the bis/dep-369-every-single-example-is-deployablebenchmarkable-in-k8s-with branch from fc40ae7 to e643ec1 on September 8, 2025 19:58
biswapanda (Contributor, Author) commented:

/ok to test ec86028

biswapanda changed the title from "feat: add single-liner deployment and benchmarking recipe for llama3-70b models" to "feat: Dynamo deployment and benchmarking recipe for llama3-70b and oss-gpt-120b" on Sep 15, 2025
biswapanda (Contributor, Author) commented:

/ok to test e279496

biswapanda force-pushed the bis/dep-369-every-single-example-is-deployablebenchmarkable-in-k8s-with branch 3 times, most recently from 4be47a3 to ba9f3c4 on September 16, 2025 06:12
Signed-off-by: Biswa Panda <[email protected]>
biswapanda force-pushed the bis/dep-369-every-single-example-is-deployablebenchmarkable-in-k8s-with branch from ba9f3c4 to 4f16a11 on September 16, 2025 06:13
biswapanda merged commit 2303313 into main on Sep 16, 2025
12 of 13 checks passed
biswapanda deleted the bis/dep-369-every-single-example-is-deployablebenchmarkable-in-k8s-with branch on September 16, 2025 07:02
kmkelle-nv pushed a commit that referenced this pull request Sep 17, 2025
…70b models (#2792)

Signed-off-by: Biswa Panda <[email protected]>
Signed-off-by: Kristen Kelleher <[email protected]>
