feat: Dynamo deployment and benchmarking recipe for llama3-70b and oss-gpt-120b #2792
Conversation
Walkthrough

Adds a new Llama-3-70B recipe: a PVC and model download Job, an aggregation deployment (Frontend + vLLM prefill worker), benchmark Jobs (agg and single-endpoint), a run script to orchestrate the apply/wait/log workflow, and a README with prerequisites and usage.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor U as User
    participant S as run.sh
    participant K as Kubernetes API
    participant PVC as model-cache PVC
    participant DL as Model Download Job
    participant HF as Hugging Face
    participant D as DynamoGraphDeployment
    participant F as Frontend (vLLM)
    participant W as VllmPrefillWorker
    participant B as Benchmark Job
    participant E as /v1 endpoint
    U->>S: Execute ./run.sh
    S->>K: apply model-cache.yaml
    K-->>PVC: Create PVC
    S->>K: apply model-download.yaml
    K-->>DL: Create Job/Pod
    DL->>HF: Download model artifacts
    HF-->>DL: Model files
    DL->>PVC: Write cache
    S->>K: wait Job Complete
    S->>K: apply llama3-70b-agg.yaml
    K-->>D: Create deployment
    D->>F: Start frontend
    D->>W: Start prefill worker
    S->>K: apply benchmark-job.yaml
    K-->>B: Create Job/Pod
    B->>E: Poll /v1/models until model ready
    B->>E: Run genai-perf requests (streaming)
    E-->>B: Responses
    B->>B: Export CSV artifacts
    S->>K: wait Job Complete
    S->>K: fetch logs
```
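The apply/wait/log sequence above can be condensed into a small shell function. This is a hypothetical sketch, not the PR's actual run.sh: the Job name `model-download`, the default namespace `test-bis`, and the file paths are assumptions taken from the recipe layout, and `KUBECTL` is overridable so the flow can be dry-run (e.g. `KUBECTL=echo deploy_recipe`).

```shell
# Hypothetical condensed version of the run.sh orchestration flow.
# Assumed names: job/model-download, job/llama-benchmark-job, namespace test-bis.
deploy_recipe() {
  KUBECTL="${KUBECTL:-kubectl}"
  NAMESPACE="${NAMESPACE:-test-bis}"
  CUR_DIR="${CUR_DIR:-.}"

  "$KUBECTL" apply -n "$NAMESPACE" -f "$CUR_DIR/model/model-cache.yaml" &&
  "$KUBECTL" apply -n "$NAMESPACE" -f "$CUR_DIR/model/model-download.yaml" &&
  "$KUBECTL" wait --for=condition=Complete job/model-download \
      -n "$NAMESPACE" --timeout=3600s &&
  "$KUBECTL" apply -n "$NAMESPACE" -f "$CUR_DIR/agg/llama3-70b-agg.yaml" &&
  "$KUBECTL" apply -n "$NAMESPACE" -f "$CUR_DIR/agg/benchmark-job.yaml" &&
  "$KUBECTL" wait --for=condition=Complete job/llama-benchmark-job \
      -n "$NAMESPACE" --timeout=3600s &&
  "$KUBECTL" logs job/llama-benchmark-job -n "$NAMESPACE"
}
```

Because each step is chained with `&&`, a failed apply or wait stops the sequence instead of fetching logs from a job that never ran.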
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Actionable comments posted: 15
🧹 Nitpick comments (14)
recipies/llama-3-70b/model/model-cache.yaml (2)
10-10: **Double-check capacity for 70B model artifacts.** 100Gi may be tight for weights + tokenizer + caches. Consider 200–300Gi depending on snapshots and HF cache behavior.

11-11: **Add trailing newline.** YAML lint flagged a missing newline at EOF.
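To put the capacity concern in perspective, quick arithmetic on the weight footprint alone (assuming roughly 70e9 parameters stored at one byte each for FP8; checkpoint layout and Hugging Face cache duplication are not counted, and can transiently require substantially more space):

```shell
# Rough weight-footprint estimate: ~70e9 params x 1 byte (FP8), in GiB.
# HF downloads may briefly hold both blobs and resolved snapshot files,
# so real PVC usage can be noticeably higher than this floor.
params=70000000000
bytes_per_param=1
weights_gib=$(( params * bytes_per_param / 1073741824 ))
echo "${weights_gib} GiB for weights alone"
```

With ~65 GiB consumed by weights before any cache overhead, 100Gi leaves little headroom, which supports the 200–300Gi suggestion.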
recipies/llama-3-70b/README.md (3)
1-1: **Remove trailing period from heading.** Complies with MD026.

```diff
-# This recipe is used to deploy and benchmark llama-3-70b model in aggregate mode.
+# This recipe is used to deploy and benchmark llama-3-70b model in aggregate mode
```
5-7: **Minor wording and completeness improvements to prerequisites.** Capitalize "Hugging Face" and add two actionable prerequisites users commonly miss.

```diff
-- A shared storage class
-- A secret with the huggingface token
+- A shared storage class (RWX) and its name configured in model-cache.yaml
+- A secret with the Hugging Face token (name: hf-token-secret)
+- A Kubernetes namespace (matches NAMESPACE in run.sh; default: test-bis)
```

If you want, I can add example commands to create the namespace and secret.
10-12: **Mention namespace and storage class requirements right where users execute.** Prevents apply failures.

```diff
-./run.sh
+NAMESPACE=test-bis ./run.sh
+# Ensure model-cache.yaml has a valid storageClassName before running.
```

recipies/llama-3-70b/run.sh (2)

11-11: **Consider a shorter, bounded wait.** 6000s is excessive; if the download stalls, the pipeline hangs. Suggest 1800–3600s with retries/logs.

16-21: **Optionally surface the benchmark exit code and fail fast.** Propagate job failure by checking status before fetching logs.

```diff
 kubectl apply -n $NAMESPACE -f $CUR_DIR/agg/benchmark-job.yaml
-kubectl wait --for=condition=Complete job/llama-benchmark-job -n $NAMESPACE --timeout=6000s
+kubectl wait --for=condition=Complete job/llama-benchmark-job -n $NAMESPACE --timeout=6000s || {
+  echo "Benchmark job did not complete successfully."
+  kubectl get pods -n $NAMESPACE -l job-name=llama-benchmark-job -o wide || true
+  kubectl logs job/llama-benchmark-job -n $NAMESPACE --all-containers || true
+  exit 1
+}
```

recipies/llama-3-70b/agg/llama3-70b-agg.yaml (2)
23-41: **Consider adding a non-root security context (if the image supports it).** Reduces risk and satisfies common policy checks.

```diff
 extraPodSpec:
   mainContainer:
+    securityContext:
+      runAsNonRoot: true
+      allowPrivilegeEscalation: false
+      readOnlyRootFilesystem: true
     args:
       - "python3 -m dynamo.vllm --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
     command:
       - /bin/sh
       - -c
     image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
     workingDir: /workspace/components/backends/vllm
```
48-48: **Add trailing newline.** Fixes the YAML lint warning.
recipies/llama-3-70b/agg/benchmark-job.yaml (3)
24-35: **Avoid a hard dependency on jq, or ensure it's present.** The image may not include jq. Either switch to Python JSON parsing, install jq, or document the requirement.

Python-based readiness check (no jq):

```diff
-          while ! curl -s "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+          while ! python3 - "$TARGET_MODEL" "$ENDPOINT" >/dev/null 2>&1 <<'PY'
+import json,sys,urllib.request
+model,ep=sys.argv[1],sys.argv[2]
+data=json.load(urllib.request.urlopen(f"http://{ep}/v1/models"))
+sys.exit(0 if any(m.get("id")==model for m in data.get("data", [])) else 1)
+PY
```
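The readiness predicate in that suggestion can also be factored into a standalone helper that takes the payload as an argument, so the parsing logic can be exercised without a running server. A sketch (the function name `model_listed` is hypothetical, not part of the PR):

```shell
# model_listed PAYLOAD_JSON TARGET_MODEL
# Exits 0 iff TARGET_MODEL appears under .data[].id in an OpenAI-style
# /v1/models payload. Pure parsing: testable offline, no jq required.
model_listed() {
  python3 -c '
import json, sys
data = json.loads(sys.argv[1])
sys.exit(0 if any(m.get("id") == sys.argv[2] for m in data.get("data", [])) else 1)
' "$1" "$2"
}
```

In the Job, the payload argument would come from `curl -s "http://$ENDPOINT/v1/models"`, with the loop retrying until the helper exits 0.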
43-53: **Deduplicate ignore_eos parameters.** Both the plain and nvext JSON flags are set; keep only the one your stack honors to avoid ambiguity.

```diff
-  --extra-inputs ignore_eos:true \
-  --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
+  --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
```

(or keep the plain flag instead)
61-67: **Add a pod securityContext to satisfy baseline policies.** Only if the image supports non-root.

```diff
 resources:
   limits:
     cpu: "64"
     memory: 80Gi
   requests:
     cpu: "64"
     memory: 80Gi
+securityContext:
+  runAsNonRoot: true
+  allowPrivilegeEscalation: false
+  readOnlyRootFilesystem: true
```

recipies/llama-3-70b/benchmark/benchmark-job.yaml (1)
59-65: **Consider security implications of the high resource allocation.** The job requests 64 CPUs and 80Gi memory, which is substantial. Static analysis also flags potential privilege-escalation and root-container concerns.

Consider adding a security context and resource monitoring:

```diff
 resources:
   limits:
     cpu: "64"
     memory: 80Gi
   requests:
     cpu: "64"
     memory: 80Gi
+securityContext:
+  runAsNonRoot: true
+  runAsUser: 1000
+  allowPrivilegeEscalation: false
+  readOnlyRootFilesystem: false
+  capabilities:
+    drop:
+      - ALL
```

recipies/llama-3-70b/model/model-download.yaml (1)
21-27: **Consider security hardening for the container.** Like the benchmark job, this container runs as root and could benefit from hardening, especially since it downloads external content.

Add a security context to follow the least-privilege principle:

```diff
 resources:
   requests:
     cpu: "10"
     memory: "5Gi"
   limits:
     cpu: "10"
     memory: "5Gi"
+securityContext:
+  runAsNonRoot: true
+  runAsUser: 1000
+  allowPrivilegeEscalation: false
+  readOnlyRootFilesystem: false
+  capabilities:
+    drop:
+      - ALL
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- recipies/llama-3-70b/README.md (1 hunks)
- recipies/llama-3-70b/agg/benchmark-job.yaml (1 hunks)
- recipies/llama-3-70b/agg/llama3-70b-agg.yaml (1 hunks)
- recipies/llama-3-70b/benchmark/benchmark-job.yaml (1 hunks)
- recipies/llama-3-70b/model/model-cache.yaml (1 hunks)
- recipies/llama-3-70b/model/model-download.yaml (1 hunks)
- recipies/llama-3-70b/run.sh (1 hunks)
🧰 Additional context used
🪛 LanguageTool
recipies/llama-3-70b/README.md
[grammar] ~5-~5: There might be a mistake here.
Context: ... Prerequisites - A shared storage class - A secret with the huggingface token # R...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
recipies/llama-3-70b/README.md
1-1: Trailing punctuation in heading
Punctuation: '.'
(MD026, no-trailing-punctuation)
🪛 Checkov (3.2.334)
recipies/llama-3-70b/agg/benchmark-job.yaml
[MEDIUM] 1-74: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[MEDIUM] 1-74: Minimize the admission of root containers
(CKV_K8S_23)
recipies/llama-3-70b/benchmark/benchmark-job.yaml
[MEDIUM] 1-72: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[MEDIUM] 1-72: Minimize the admission of root containers
(CKV_K8S_23)
recipies/llama-3-70b/model/model-download.yaml
[MEDIUM] 3-51: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[MEDIUM] 3-51: Minimize the admission of root containers
(CKV_K8S_23)
🪛 GitHub Actions: Copyright Checks
recipies/llama-3-70b/agg/benchmark-job.yaml
[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/agg/benchmark-job.yaml. (Script: copyright-check.ps1)
recipies/llama-3-70b/model/model-cache.yaml
[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/model/model-cache.yaml. (Script: copyright-check.ps1)
recipies/llama-3-70b/benchmark/benchmark-job.yaml
[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/benchmark/benchmark-job.yaml. (Script: copyright-check.ps1)
recipies/llama-3-70b/run.sh
[error] 1-1: Copyright header check failed. Invalid/missing header in recipies/llama-3-70b/run.sh. (Script: copyright-check.ps1)
🪛 YAMLlint (1.37.1)
recipies/llama-3-70b/model/model-cache.yaml
[error] 11-11: no new line character at the end of file
(new-line-at-end-of-file)
recipies/llama-3-70b/model/model-download.yaml
[warning] 49-49: wrong indentation: expected 8 but found 6
(indentation)
[error] 51-51: no new line character at the end of file
(new-line-at-end-of-file)
recipies/llama-3-70b/agg/llama3-70b-agg.yaml
[error] 48-48: no new line character at the end of file
(new-line-at-end-of-file)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
recipies/llama-3-70b/model/model-cache.yaml (1)
6-7: **Confirm RWX support.** The PVC requires ReadWriteMany; ensure the chosen StorageClass provides RWX (e.g., NFS/FSx/Portworx/Ceph). Otherwise mounts will fail at runtime.
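For reference, a PVC shape satisfying this requirement would look roughly as follows. This is illustrative only: the `storageClassName` and the 200Gi size are assumptions, not the recipe's actual values.

```yaml
# Illustrative sketch; class name and size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany            # RWX: frontend, worker, and jobs share the cache
  storageClassName: nfs-client # must map to an RWX-capable provisioner
  resources:
    requests:
      storage: 200Gi
```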
recipies/llama-3-70b/agg/llama3-70b-agg.yaml (2)
9-21: **LGTM for Frontend block and PVC wiring.** Image, workingDir, and shared model cache are consistent with the rest of the recipe.

31-45: **Resource requests look coherent; verify cluster capacity.** Replicas=1 with gpu: "8", memory 160Gi, shm 80Gi; ensure nodes satisfy these to avoid Pending pods.
recipies/llama-3-70b/agg/benchmark-job.yaml (2)
62-67: **Check cluster capacity for 64 CPU/80Gi.** Large CPU/memory requests can block scheduling; tune for your node shape.

25-25: **Verify service DNS for ENDPOINT.** In recipies/llama-3-70b/agg/benchmark-job.yaml, the line `ENDPOINT=llama3-70b-agg-0:8000` assumes a Service named `llama3-70b-agg-0`. Confirm the CRD or Service exposes this DNS name, or update ENDPOINT to the actual service name and port.

recipies/llama-3-70b/model/model-download.yaml (1)
1-2: **LGTM on copyright header.** The copyright header is properly formatted and follows the required SPDX format.
/ok to test ec86028

/ok to test e279496
Signed-off-by: Biswa Panda <[email protected]>
…70b models (#2792) Signed-off-by: Biswa Panda <[email protected]> Signed-off-by: Kristen Kelleher <[email protected]>
Overview:
Dynamo model serving recipes
closes: DEP-369, DEP-365, DEP-361, DYN-974, DYN-916
Summary by CodeRabbit
New Features
Documentation