
Conversation

biswapanda (Contributor) commented Aug 29, 2025

Overview:

Dynamo model serving recipes

| Model family | Backend | Mode | Deployment | Benchmark |
|--------------|---------|------|------------|-----------|
| llama-3-70b  | vllm    | agg  |            |           |
| llama-3-70b  | vllm    | disagg-multi-node  |  |  |
| llama-3-70b  | vllm    | disagg-single-node |  |  |
| oss-gpt      | trtllm  | aggregated         |  |  |
| DeepSeek-R1  | sglang  | disaggregated      | 🚧 | 🚧 |

closes: DEP-369, DEP-365, DEP-361, DYN-974, DYN-916

Summary by CodeRabbit

  • New Features

    • Added a Llama‑3‑70B aggregate deployment and benchmarking recipe.
    • Included Kubernetes manifests to provision model cache storage, download the model, deploy services, and run performance benchmarks (aggregate and single-endpoint).
    • Added a one-command script to automate setup, deployment, and benchmark execution with logs.
  • Documentation

    • Introduced a README outlining prerequisites (shared storage class, Hugging Face token secret) and step-by-step usage instructions.

biswapanda self-assigned this on Aug 29, 2025
github-actions bot added the feat label on Aug 29, 2025
coderabbitai bot (Contributor) commented Aug 29, 2025

Walkthrough

Adds a new Llama-3-70B recipe: PVC and model download Job, an aggregation deployment (Frontend + vLLM prefill worker), benchmark Jobs (agg and single-endpoint), a run script to orchestrate apply/wait/log workflow, and a README with prerequisites and usage.

Changes

| Cohort / File(s) | Summary of Changes |
|------------------|--------------------|
| **Documentation**<br>`recipies/llama-3-70b/README.md` | New README describing prerequisites (shared storage class, HF token secret) and run instructions (`./run.sh`). |
| **Aggregation Deployment**<br>`recipies/llama-3-70b/agg/llama3-70b-agg.yaml` | Adds a DynamoGraphDeployment for Llama-3-70B on vLLM with Frontend and VllmPrefillWorker services, PVC mounts, hf-token secret, resource specs, and a command to run `python -m dynamo.vllm`. |
| **Benchmark Jobs**<br>`recipies/llama-3-70b/agg/benchmark-job.yaml`, `recipies/llama-3-70b/benchmark/benchmark-job.yaml` | Adds two Kubernetes Jobs that wait for model readiness and run genai-perf profiling against the agg and single-node endpoints; includes resources, model ID, polling, artifact export, and PVC mount. |
| **Model Assets**<br>`recipies/llama-3-70b/model/model-cache.yaml`, `recipies/llama-3-70b/model/model-download.yaml` | Introduces a PVC (ReadWriteMany, 100Gi, configurable storage class) and a Job to pre-download the HF model to the cache using huggingface-cli, with hf-token secret and resource requests/limits. |
| **Orchestration Script**<br>`recipies/llama-3-70b/run.sh` | New script to apply the PVC and download Job, wait for completion, deploy the aggregation graph, run the benchmark Job, wait, and fetch logs in a target namespace. |
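
For orientation, a minimal sketch of the orchestration flow just described (namespace and job names are taken from review comments below or assumed; this is not the exact script):

```bash
#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="${NAMESPACE:-test-bis}"           # assumed default namespace
CUR_DIR="$(cd "$(dirname "$0")" && pwd)"

# Provision the model cache and pre-download the weights.
kubectl apply -n "$NAMESPACE" -f "$CUR_DIR/model/model-cache.yaml"
kubectl apply -n "$NAMESPACE" -f "$CUR_DIR/model/model-download.yaml"
kubectl wait --for=condition=Complete job/model-download \
  -n "$NAMESPACE" --timeout=6000s            # job name is an assumption

# Deploy the aggregation graph, then run the benchmark and collect logs.
kubectl apply -n "$NAMESPACE" -f "$CUR_DIR/agg/llama3-70b-agg.yaml"
kubectl apply -n "$NAMESPACE" -f "$CUR_DIR/agg/benchmark-job.yaml"
kubectl wait --for=condition=Complete job/llama-benchmark-job \
  -n "$NAMESPACE" --timeout=6000s
kubectl logs job/llama-benchmark-job -n "$NAMESPACE"
```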

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant S as run.sh
  participant K as Kubernetes API
  participant PVC as model-cache PVC
  participant DL as Model Download Job
  participant HF as Hugging Face
  participant D as DynamoGraphDeployment
  participant F as Frontend (vLLM)
  participant W as VllmPrefillWorker
  participant B as Benchmark Job
  participant E as /v1 endpoint

  U->>S: Execute ./run.sh
  S->>K: apply model-cache.yaml
  K-->>PVC: Create PVC
  S->>K: apply model-download.yaml
  K-->>DL: Create Job/Pod
  DL->>HF: Download model artifacts
  HF-->>DL: Model files
  DL->>PVC: Write cache
  S->>K: wait Job Complete
  S->>K: apply llama3-70b-agg.yaml
  K-->>D: Create deployment
  D->>F: Start frontend
  D->>W: Start prefill worker
  S->>K: apply benchmark-job.yaml
  K-->>B: Create Job/Pod
  B->>E: Poll /v1/models until model ready
  B->>E: Run genai-perf requests (streaming)
  E-->>B: Responses
  B->>B: Export CSV artifacts
  S->>K: wait Job Complete
  S->>K: fetch logs
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A whisk of pods, a cache to fill,
I hop through YAML, calm and still.
Prefill hums, the frontend sings,
Benchmarks flutter on streaming wings.
With secrets snug and PVCs tight—
Llama awakens, swift and bright.
Thump-thump! The graphs take flight. 🐇🚀


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 15

🧹 Nitpick comments (14)
recipies/llama-3-70b/model/model-cache.yaml (2)

10-10: Double-check capacity for 70B model artifacts.

100Gi may be tight for weights + tokenizer + caches; the FP8 checkpoint alone is roughly 70 GB (70B parameters at ~1 byte each). Consider 200–300Gi depending on snapshots and HF cache behavior.


11-11: Add trailing newline.

YAML lint flagged missing newline at EOF.

recipies/llama-3-70b/README.md (3)

1-1: Remove trailing period from heading.

Complies with MD026.

```diff
-# This recipe is used to deploy and benchmark llama-3-70b model in aggregate mode.
+# This recipe is used to deploy and benchmark llama-3-70b model in aggregate mode
```

5-7: Minor wording and completeness improvements to prerequisites.

Capitalize “Hugging Face” and add two actionable prerequisites users commonly miss.

```diff
-- A shared storage class
-- A secret with the huggingface token
+- A shared storage class (RWX) and its name configured in model-cache.yaml
+- A secret with the Hugging Face token (name: hf-token-secret)
+- A Kubernetes namespace (matches NAMESPACE in run.sh; default: test-bis)
```

If you want, I can add example commands to create the namespace and secret.
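
For example (the secret key name `HF_TOKEN` is an assumption; match whatever key the manifests actually reference):

```bash
kubectl create namespace test-bis
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-huggingface-token> \
  -n test-bis
```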


10-12: Mention namespace and storage class requirements right where users execute.

Prevents apply failures.

```diff
-./run.sh
+NAMESPACE=test-bis ./run.sh
+# Ensure model-cache.yaml has a valid storageClassName before running.
```

recipies/llama-3-70b/run.sh (2)

11-11: Consider shorter, bounded wait.

6000s is excessive; if download stalls, the pipeline hangs. Suggest 1800–3600s with retries/logs.
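
One hedged option, assuming the download Job is named model-download: loop over shorter waits so progress is surfaced and the pipeline still fails within a bound:

```bash
# Six 600s waits (~1h total) with pod status between attempts; job name is an assumption.
for attempt in 1 2 3 4 5 6; do
  kubectl wait --for=condition=Complete job/model-download -n "$NAMESPACE" --timeout=600s && break
  kubectl get pods -n "$NAMESPACE" -l job-name=model-download
  [ "$attempt" -eq 6 ] && { echo "model download timed out"; exit 1; }
done
```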

16-21: Optionally surface benchmark exit code and fail fast.

Propagate job failure by checking status before logs.


```diff
 kubectl apply -n $NAMESPACE -f $CUR_DIR/agg/benchmark-job.yaml
-kubectl wait --for=condition=Complete job/llama-benchmark-job -n $NAMESPACE --timeout=6000s
+kubectl wait --for=condition=Complete job/llama-benchmark-job -n $NAMESPACE --timeout=6000s || {
+  echo "Benchmark job did not complete successfully."
+  kubectl get pods -n $NAMESPACE -l job-name=llama-benchmark-job -o wide || true
+  kubectl logs job/llama-benchmark-job -n $NAMESPACE --all-containers || true
+  exit 1
+}
```

recipies/llama-3-70b/agg/llama3-70b-agg.yaml (2)

23-41: Consider adding a non-root security context (if image supports it).

Reduces risk and satisfies common policy checks.

```diff
       extraPodSpec:
         mainContainer:
+          securityContext:
+            runAsNonRoot: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
           args:
           - "python3 -m dynamo.vllm --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
           command:
           - /bin/sh
           - -c
           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
           workingDir: /workspace/components/backends/vllm
```

48-48: Add trailing newline.

Fix YAML lint warning.

recipies/llama-3-70b/agg/benchmark-job.yaml (3)

24-35: Avoid a hard dependency on jq, or ensure it's present.

The image may not include jq. Either switch to Python's stdlib JSON parsing, install jq, or document the requirement.

Python-based readiness (no jq):

```diff
-          while ! curl -s "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+          while ! python3 - "$TARGET_MODEL" "$ENDPOINT" >/dev/null 2>&1 <<'PY'
+import json,sys,urllib.request
+model,ep=sys.argv[1],sys.argv[2]
+data=json.load(urllib.request.urlopen(f"http://{ep}/v1/models"))
+sys.exit(0 if any(m.get("id")==model for m in data.get("data", [])) else 1)
+PY
+          do
```

43-53: Deduplicate ignore_eos parameters.

Both plain and nvext JSON flags are set; keep only the one your stack honors to avoid ambiguity.

```diff
-            --extra-inputs ignore_eos:true \
-            --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
+            --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
```

(or vice versa)
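
For reference, a minimal genai-perf invocation of the shape this Job presumably runs (flag names are standard genai-perf options, but the exact argument set and values used here are assumptions):

```bash
genai-perf profile \
  --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
  --url "http://$ENDPOINT" \
  --endpoint-type chat \
  --streaming \
  --concurrency 8 \
  --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
  --artifact-dir /artifacts   # exported CSV artifacts land here
```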


61-67: Add pod securityContext to satisfy baseline policies.

Only if the image supports non-root.

```diff
       resources:
         limits:
           cpu: "64"
           memory: 80Gi
         requests:
           cpu: "64"
           memory: 80Gi
+      securityContext:
+        runAsNonRoot: true
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
```
recipies/llama-3-70b/benchmark/benchmark-job.yaml (1)

59-65: Consider security implications of high resource allocation.

The job requests 64 CPUs and 80Gi memory, which is substantial. Additionally, the static analysis flags potential security concerns with privilege escalation and root containers.

Consider adding security context and resource monitoring:

```diff
         resources:
           limits:
             cpu: "64"
             memory: 80Gi
           requests:
             cpu: "64"
             memory: 80Gi
+        securityContext:
+          runAsNonRoot: true
+          runAsUser: 1000
+          allowPrivilegeEscalation: false
+          readOnlyRootFilesystem: false
+          capabilities:
+            drop:
+            - ALL
```
recipies/llama-3-70b/model/model-download.yaml (1)

21-27: Consider security hardening for the container.

Similar to the benchmark job, this container runs as root and could benefit from security hardening, especially since it's downloading external content.

Add security context to follow least privilege principle:

```diff
           resources:
             requests:
               cpu: "10"
               memory: "5Gi"
             limits:
               cpu: "10"
               memory: "5Gi"
+          securityContext:
+            runAsNonRoot: true
+            runAsUser: 1000
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: false
+            capabilities:
+              drop:
+              - ALL
```
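
For context, the download step likely reduces to something like this (model ID taken from the deployment manifest; the cache path and local-dir layout are assumptions):

```bash
# HF_TOKEN is injected from the hf-token secret; HF_HOME points at the PVC mount.
export HF_HOME=/model-cache
huggingface-cli download RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
  --local-dir /model-cache/Llama-3.3-70B-Instruct-FP8-dynamic
```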
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 3c4adde and 143e9f8.

📒 Files selected for processing (7)
  • recipies/llama-3-70b/README.md (1 hunks)
  • recipies/llama-3-70b/agg/benchmark-job.yaml (1 hunks)
  • recipies/llama-3-70b/agg/llama3-70b-agg.yaml (1 hunks)
  • recipies/llama-3-70b/benchmark/benchmark-job.yaml (1 hunks)
  • recipies/llama-3-70b/model/model-cache.yaml (1 hunks)
  • recipies/llama-3-70b/model/model-download.yaml (1 hunks)
  • recipies/llama-3-70b/run.sh (1 hunks)
🧰 Additional context used
🪛 LanguageTool
recipies/llama-3-70b/README.md

[grammar] ~5-~5: There might be a mistake here.
Context: ... Prerequisites - A shared storage class - A secret with the huggingface token # R...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
recipies/llama-3-70b/README.md

1-1: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)

🪛 Checkov (3.2.334)
recipies/llama-3-70b/agg/benchmark-job.yaml

[MEDIUM] 1-74: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[MEDIUM] 1-74: Minimize the admission of root containers

(CKV_K8S_23)

recipies/llama-3-70b/benchmark/benchmark-job.yaml

[MEDIUM] 1-72: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[MEDIUM] 1-72: Minimize the admission of root containers

(CKV_K8S_23)

recipies/llama-3-70b/model/model-download.yaml

[MEDIUM] 3-51: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[MEDIUM] 3-51: Minimize the admission of root containers

(CKV_K8S_23)

🪛 GitHub Actions: Copyright Checks
recipies/llama-3-70b/agg/benchmark-job.yaml

[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/agg/benchmark-job.yaml. (Script: copyright-check.ps1)

recipies/llama-3-70b/model/model-cache.yaml

[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/model/model-cache.yaml. (Script: copyright-check.ps1)

recipies/llama-3-70b/benchmark/benchmark-job.yaml

[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/benchmark/benchmark-job.yaml. (Script: copyright-check.ps1)

recipies/llama-3-70b/run.sh

[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/run.sh. (Script: copyright-check.ps1)

🪛 YAMLlint (1.37.1)
recipies/llama-3-70b/model/model-cache.yaml

[error] 11-11: no new line character at the end of file

(new-line-at-end-of-file)

recipies/llama-3-70b/model/model-download.yaml

[warning] 49-49: wrong indentation: expected 8 but found 6

(indentation)


[error] 51-51: no new line character at the end of file

(new-line-at-end-of-file)

recipies/llama-3-70b/agg/llama3-70b-agg.yaml

[error] 48-48: no new line character at the end of file

(new-line-at-end-of-file)

⏰ Checks skipped due to the 90000ms timeout (2):
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
recipies/llama-3-70b/model/model-cache.yaml (1)

6-7: Confirm RWX support.

PVC requires ReadWriteMany; ensure the chosen StorageClass provides RWX (e.g., NFS/FSx/Portworx/Ceph). Otherwise mounts will fail at runtime.
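
A quick way to check before applying (provisioner names vary by cluster):

```bash
# List storage classes, then inspect the one referenced by model-cache.yaml.
kubectl get storageclass
kubectl describe storageclass <your-class> | grep -i provisioner
```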

recipies/llama-3-70b/agg/llama3-70b-agg.yaml (2)

9-21: LGTM for Frontend block and PVC wiring.

Image, workingDir, and shared model cache are consistent with the rest of the recipe.


31-45: Resource requests look coherent; verify cluster capacity.

Replicas=1 with gpu: "8", memory 160Gi, shm 80Gi; ensure nodes satisfy these to avoid Pending pods.
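
To verify node capacity up front (the GPU resource key assumes the NVIDIA device plugin):

```bash
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,MEM:.status.allocatable.memory'
```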

recipies/llama-3-70b/agg/benchmark-job.yaml (2)

62-67: Check cluster capacity for 64 CPU/80Gi.

Large CPU/memory requests can block scheduling; tune for your node shape.


25-25: Verify service DNS for ENDPOINT
In recipies/llama-3-70b/agg/benchmark-job.yaml, the line

`ENDPOINT=llama3-70b-agg-0:8000`

assumes a Service named `llama3-70b-agg-0`. Confirm the CRD or a Service exposes this DNS name, or update ENDPOINT to the actual service name and port.
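
To confirm, list Services and probe from inside the cluster (throwaway curl pod; the image choice is arbitrary):

```bash
kubectl get svc -n "$NAMESPACE"
kubectl run curl-check --rm -it --restart=Never -n "$NAMESPACE" \
  --image=curlimages/curl -- curl -s http://llama3-70b-agg-0:8000/v1/models
```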

recipies/llama-3-70b/model/model-download.yaml (1)

1-2: LGTM on copyright header.

The copyright header is properly formatted and follows the required SPDX format.

biswapanda changed the title from "feat: add deployment and benchmarking job for llama3, agg, vllm" to "feat: add deployment and benchmarking recipies for llama3-70b models" on Aug 31, 2025
biswapanda changed the title from "feat: add deployment and benchmarking recipies for llama3-70b models" to "feat: add single-liner deployment and benchmarking recipe for llama3-70b models" on Sep 2, 2025
biswapanda marked this pull request as draft on September 3, 2025 01:48
copy-pr-bot bot commented Sep 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

biswapanda force-pushed the bis/dep-369-every-single-example-is-deployablebenchmarkable-in-k8s-with branch from fc40ae7 to e643ec1 on September 8, 2025 19:58
biswapanda (Contributor, Author) commented:

/ok to test ec86028

biswapanda changed the title from "feat: add single-liner deployment and benchmarking recipe for llama3-70b models" to "feat: Dynamo deployment and benchmarking recipe for llama3-70b and oss-gpt-120b" on Sep 15, 2025
biswapanda (Contributor, Author) commented:

/ok to test e279496

biswapanda force-pushed the bis/dep-369-every-single-example-is-deployablebenchmarkable-in-k8s-with branch 3 times, most recently from 4be47a3 to ba9f3c4 on September 16, 2025 06:12
Signed-off-by: Biswa Panda <[email protected]>
biswapanda force-pushed the bis/dep-369-every-single-example-is-deployablebenchmarkable-in-k8s-with branch from ba9f3c4 to 4f16a11 on September 16, 2025 06:13
biswapanda merged commit 2303313 into main on Sep 16, 2025
12 of 13 checks passed
biswapanda deleted the bis/dep-369-every-single-example-is-deployablebenchmarkable-in-k8s-with branch on September 16, 2025 07:02
kmkelle-nv pushed a commit that referenced this pull request Sep 17, 2025
…70b models (#2792)

Signed-off-by: Biswa Panda <[email protected]>
Signed-off-by: Kristen Kelleher <[email protected]>
