
Commit 2fe5753

biswapanda and kmkelle-nv authored and committed
feat: add single-liner deployment and benchmarking recipe for llama3-70b models (#2792)
Signed-off-by: Biswa Panda <[email protected]>
Signed-off-by: Kristen Kelleher <[email protected]>
1 parent 5d703c6 commit 2fe5753

File tree

15 files changed: +1039 −0 lines changed

recipes/README.md

Lines changed: 88 additions & 0 deletions
# Dynamo model serving recipes

| Model family | Backend | Mode               | Deployment | Benchmark |
|--------------|---------|--------------------|------------|-----------|
| llama-3-70b  | vllm    | agg                | ✅         | ✅        |
| llama-3-70b  | vllm    | disagg-multi-node  | ✅         | ✅        |
| llama-3-70b  | vllm    | disagg-single-node | ✅         | ✅        |
| oss-gpt      | trtllm  | aggregated         | ✅         | ✅        |
| DeepSeek-R1  | sglang  | disaggregated      | 🚧         | 🚧        |
## Prerequisites

1. Create a namespace and populate the `NAMESPACE` environment variable.

   This environment variable is used in later steps to deploy and perf-test the model.

   ```bash
   export NAMESPACE=your-namespace
   kubectl create namespace ${NAMESPACE}
   ```

2. **Dynamo Cloud Platform installed** - Follow the [Quickstart Guide](../docs/guides/dynamo_deploy/README.md)

3. **Kubernetes cluster with GPU support**

4. **Container registry access** for vLLM runtime images

5. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)

   Update the `hf_hub_secret/hf_hub_secret.yaml` file with your HuggingFace token, then apply it:

   ```bash
   kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
   ```

6. (Optional) Create a shared model cache PVC to store the model weights.

   Choose a storage class for the model cache PVC; you'll need its name to update the `storageClassName` field in the `model-cache/model-cache.yaml` file. List the available storage classes, then apply the claim as shown after this list:

   ```bash
   kubectl get storageclass
   ```
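If you created the model cache PVC, apply it before running a recipe. A minimal sketch, assuming the `storageClassName` in `model-cache/model-cache.yaml` has already been set (the manifest itself appears later in this commit):

```bash
# Apply the model cache PVC and confirm it binds
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl get pvc model-cache -n ${NAMESPACE}
```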
## Running the recipes

Run the recipe to deploy a model:

```bash
./run.sh --model <model> --framework <framework> <deployment-type>
```

```text
Arguments:
  <deployment-type>     Deployment type (e.g., agg, disagg-single-node, disagg-multi-node)

Required Options:
  --model <model>       Model name (e.g., llama-3-70b)
  --framework <fw>      Framework, one of VLLM, TRTLLM, SGLANG (default: VLLM)

Optional:
  --skip-model-cache    Skip model downloading (assumes the model cache already exists)
  -h, --help            Show this help message

Environment Variables:
  NAMESPACE             Kubernetes namespace (default: dynamo)

Examples:
  ./run.sh --model llama-3-70b --framework vllm agg
  ./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg
  ./run.sh --model llama-3-70b --framework trtllm disagg-single-node
```

Example:

```bash
./run.sh --model llama-3-70b --framework vllm agg
```
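To verify what the script deployed, you can query the created resources. A sketch, assuming the `DynamoGraphDeployment` CRD's plural name follows the usual Kubernetes convention:

```bash
# Check the deployed graph and its pods (resource plural is an assumption)
kubectl get dynamographdeployments -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE}
```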
## Dry run mode

To dry run the recipe, add the `--dry-run` flag:

```bash
./run.sh --dry-run --model llama-3-70b --framework vllm agg
```
## (Optional) Running the recipes with model cache

You can cache the model weights on a PVC to avoid repeated downloads; see the [Prerequisites](#prerequisites) section for details. Once the cache is populated, pass `--skip-model-cache` so the recipe skips the download step:

```bash
./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg
```
Lines changed: 124 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: oss-gpt120b-bench
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: oss-gpt120b
    spec:
      restartPolicy: Never
      containers:
        - name: perf
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:aiperf-0637181
          workingDir: /workspace/components/backends/vllm
          env:
            - name: TARGET_MODEL
              value: openai/gpt-oss-120b
            - name: ENDPOINT
              value: gpt-oss-agg-trtllmworker:8000
            - name: CONCURRENCIES
              value: "13000 13500 1400"
            - name: ISL
              value: "16"
            - name: OSL
              value: "1000"
            - name: DEPLOYMENT_MODE
              value: "agg"
            - name: DEPLOYMENT_GPU_COUNT
              value: "32"
            - name: JOB_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['job-name']
            - name: ROOT_ARTIFACT_DIR
              value: /root/.cache/huggingface/hub/perf
          command:
            - /bin/sh
            - -c
            - |
              # TODO: this can be baked into the aiperf image
              apt-get update && apt-get install -y curl jq
              export COLUMNS=200
              EPOCH=$(date +%s)
              ## Utility functions -- can be moved to a bash script / configmap
              wait_for_model_ready() {
                echo "Waiting for model '$TARGET_MODEL' at $ENDPOINT/v1/models (checking every 5s)..."
                while ! curl -s "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
                  echo "[$(date '+%H:%M:%S')] Model not ready yet, waiting 5s..."
                  sleep 5
                done
                echo "✅ Model '$TARGET_MODEL' is now available!"
                curl -s "http://$ENDPOINT/v1/models" | jq .
              }
              run_perf() {
                local concurrency=$1
                local isl=$2
                local osl=$3
                key=concurrency_${concurrency}
                export ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${key}"
                mkdir -p "$ARTIFACT_DIR"
                aiperf profile --artifact-dir $ARTIFACT_DIR \
                  --model $TARGET_MODEL \
                  --tokenizer ~/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a \
                  --endpoint-type chat \
                  --endpoint /v1/chat/completions \
                  --streaming \
                  --url http://$ENDPOINT \
                  --synthetic-input-tokens-mean $isl \
                  --synthetic-input-tokens-stddev 0 \
                  --output-tokens-mean $osl \
                  --output-tokens-stddev 0 \
                  --extra-inputs "{\"max_tokens\":$osl}" \
                  --extra-inputs "{\"min_tokens\":$osl}" \
                  --extra-inputs "{\"ignore_eos\":true}" \
                  --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
                  --concurrency $concurrency \
                  --request-count $((3*concurrency)) \
                  --warmup-request-count $concurrency \
                  --conversation-num 1 \
                  --random-seed 100 \
                  --request-rate 100000 \
                  --workers-max 128 \
                  -H 'Authorization: Bearer NOT USED' \
                  -H 'Accept: text/event-stream' \
                  --record-processors 32 \
                  --ui simple
                echo "ARTIFACT_DIR: $ARTIFACT_DIR"
                ls -la $ARTIFACT_DIR
              }
              #### Actual execution ####
              wait_for_model_ready
              mkdir -p "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}"
              # Write input_config.json
              cat > "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/input_config.json" <<EOF
              {
                "gpu_count": $DEPLOYMENT_GPU_COUNT,
                "mode": "$DEPLOYMENT_MODE",
                "isl": $ISL,
                "osl": $OSL,
                "endpoint": "$ENDPOINT",
                "model": "$TARGET_MODEL"
              }
              EOF
              # Run perf for each concurrency
              for concurrency in $CONCURRENCIES; do
                run_perf $concurrency $ISL $OSL
                sleep 10
              done
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      imagePullSecrets:
        - name: nvcrimagepullsecret
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```
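To run this benchmark once the model is serving, apply the Job and follow its logs; aiperf writes results under `ROOT_ARTIFACT_DIR` on the shared cache volume. A sketch, where `perf.yaml` is a hypothetical file name for this manifest:

```bash
# Launch the benchmark Job and stream its output (file name is hypothetical)
kubectl apply -f perf.yaml -n ${NAMESPACE}
kubectl logs -f job/oss-gpt120b-bench -n ${NAMESPACE}
# Wait for all concurrency sweeps to finish
kubectl wait --for=condition=complete job/oss-gpt120b-bench -n ${NAMESPACE} --timeout=2h
```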
Lines changed: 25 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  config.yaml: |
    tensor_parallel_size: 4
    moe_expert_parallel_size: 4
    enable_attention_dp: true
    build_config:
      max_batch_size: 640
      max_num_tokens: 20000
    moe_config:
      backend: CUTLASS
    cuda_graph_config:
      max_batch_size: 640
      enable_padding: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.9
      enable_block_reuse: false
    print_iter_log: false
    stream_interval: 50
    use_torch_sampler: true
```
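This engine config mirrors the limits passed to the worker below (max batch size 640, max 20000 tokens, 90% free-GPU-memory fraction for KV cache). To apply and inspect it (the manifest file name here is hypothetical):

```bash
# Apply the engine config and view the rendered config.yaml
kubectl apply -f llm-config.yaml -n ${NAMESPACE}
kubectl get configmap llm-config -n ${NAMESPACE} -o yaml
```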
Lines changed: 66 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: gpt-oss-agg-shm
spec:
  backendFramework: trtllm
  services:
    TrtllmWorker:
      componentType: main
      dynamoNamespace: gpt-oss-agg-shm
      envFromSecret: hf-token-secret
      pvc:
        create: false
        name: model-cache-oss-gpt120b
        mountPoint: /root/.cache/huggingface
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "user-workload"
            effect: "NoSchedule"
          - key: "dedicated"
            operator: "Equal"
            value: "user-workload"
            effect: "NoExecute"
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          args:
            - |
              export TRTLLM_ENABLE_PDL=1
              export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True
              export ENGINE_ARGS=${AGG_ENGINE_ARGS:-"/root/.cache/huggingface/gpt-oss-120b/config.yaml"}
              export MODEL_PATH=${MODEL_PATH:-"/root/.cache/huggingface/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"}
              export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"openai/gpt-oss-120b"}
              trap 'echo Cleaning up...; kill 0' EXIT
              python3 -m dynamo.frontend --router-mode round-robin --http-port 8000 &
              python3 -m dynamo.trtllm \
                --model-path "$MODEL_PATH" \
                --served-model-name "$SERVED_MODEL_NAME" \
                --extra-engine-args "$ENGINE_ARGS" \
                --max-num-tokens 20000 \
                --max-batch-size 640 \
                --free-gpu-memory-fraction 0.9
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:gpt-oss-dynamo-nvl72-debug-trtllm-tot
          workingDir: /workspace/components/backends/trtllm
      replicas: 1
      resources:
        limits:
          gpu: "4"
        requests:
          gpu: "4"
```
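After applying this deployment, the frontend listens on port 8000 inside the worker pod. A sketch for applying it and watching the rollout, with a hypothetical manifest file name:

```bash
kubectl apply -f gpt-oss-agg.yaml -n ${NAMESPACE}   # file name is hypothetical
kubectl get dynamographdeployments -n ${NAMESPACE}  # plural name is an assumption
kubectl get pods -n ${NAMESPACE} -w
```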
Lines changed: 13 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Service
metadata:
  name: gpt-oss-agg-trtllmworker
spec:
  selector:
    nvidia.com/selector: gpt-oss-agg-trtllmworker
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```
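The benchmark Job reaches the frontend through this Service (`ENDPOINT=gpt-oss-agg-trtllmworker:8000`). To smoke-test it from your workstation:

```bash
# Port-forward the Service locally and list the served models
kubectl port-forward svc/gpt-oss-agg-trtllmworker 8000:8000 -n ${NAMESPACE} &
curl -s http://localhost:8000/v1/models | jq .
```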
Lines changed: 9 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  HF_TOKEN: "<Huggingface token with access to the model>"
```
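Rather than editing the manifest, you can create the same Secret directly from the command line:

```bash
# Equivalent imperative form; substitute your actual token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-hf-token> -n ${NAMESPACE}
```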
Lines changed: 13 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: "your-storage-class-name"
```
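`ReadWriteMany` matters here because the download Job and the serving workers mount the same claim. After applying it, a quick sanity check:

```bash
# Verify the claim bound and supports concurrent mounts
kubectl get pvc model-cache -n ${NAMESPACE} \
  -o jsonpath='{.status.phase} {.spec.accessModes}'
```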
Lines changed: 46 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            # NOTE: This is the model name for the llama-3-70b recipe.
            # Update it to the name of the model you are downloading.
            - name: MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              export HF_HUB_ENABLE_HF_TRANSFER=1
              huggingface-cli download $MODEL_NAME
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface/hub
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```
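To populate the cache, apply this Job and wait for completion; the weights land under `/root/.cache/huggingface/hub` on the PVC, where the serving recipes expect them. A sketch, where `model-download.yaml` is a hypothetical file name for this manifest:

```bash
kubectl apply -f model-download.yaml -n ${NAMESPACE}   # file name is hypothetical
kubectl wait --for=condition=complete job/model-download -n ${NAMESPACE} --timeout=60m
kubectl logs job/model-download -n ${NAMESPACE} | tail -5
```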
