
Commit 2fe5753

biswapanda and kmkelle-nv authored and committed
feat: add single-liner deployment and benchmarking recipe for llama3-70b models (#2792)
Signed-off-by: Biswa Panda <[email protected]>
Signed-off-by: Kristen Kelleher <[email protected]>
1 parent 5d703c6 commit 2fe5753

File tree

15 files changed: +1039 −0 lines changed

recipes/README.md

Lines changed: 88 additions & 0 deletions
# Dynamo model serving recipes

| Model family | Backend | Mode               | Deployment | Benchmark |
|--------------|---------|--------------------|------------|-----------|
| llama-3-70b  | vllm    | agg                | ✅         | ✅        |
| llama-3-70b  | vllm    | disagg-multi-node  | ✅         | ✅        |
| llama-3-70b  | vllm    | disagg-single-node | ✅         | ✅        |
| oss-gpt      | trtllm  | aggregated         | ✅         | ✅        |
| DeepSeek-R1  | sglang  | disaggregated      | 🚧         | 🚧        |
## Prerequisites

1. Create a namespace and populate the `NAMESPACE` environment variable.

   This environment variable is used in later steps to deploy and perf-test the model.

   ```bash
   export NAMESPACE=your-namespace
   kubectl create namespace ${NAMESPACE}
   ```

2. **Dynamo Cloud Platform installed** - Follow the [Quickstart Guide](../docs/guides/dynamo_deploy/README.md)

3. **Kubernetes cluster with GPU support**

4. **Container registry access** for vLLM runtime images

5. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)

   Update the `hf_hub_secret/hf_hub_secret.yaml` file with your HuggingFace token, then apply it:

   ```bash
   kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
   ```

6. (Optional) Create a shared model cache PVC to store the model weights.

   Choose a storage class for the model cache PVC; you'll need its name to update the `storageClassName` field in the `model-cache/model-cache.yaml` file. List the available storage classes, then apply the claim as shown after this list:

   ```bash
   kubectl get storageclass
   ```
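If you created the model cache PVC, apply it before running a recipe. A minimal sketch, assuming the `storageClassName` in `model-cache/model-cache.yaml` has already been set (the manifest itself appears later in this commit):

```bash
# Apply the model cache PVC and confirm it binds
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl get pvc model-cache -n ${NAMESPACE}
```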
## Running the recipes

Run the recipe to deploy a model:

```bash
./run.sh --model <model> --framework <framework> <deployment-type>
```

```text
Arguments:
  <deployment-type>     Deployment type (e.g., agg, disagg-single-node, disagg-multi-node)

Required Options:
  --model <model>       Model name (e.g., llama-3-70b)
  --framework <fw>      Framework, one of VLLM, TRTLLM, SGLANG (default: VLLM)

Optional:
  --skip-model-cache    Skip model downloading (assumes the model cache already exists)
  -h, --help            Show this help message

Environment Variables:
  NAMESPACE             Kubernetes namespace (default: dynamo)

Examples:
  ./run.sh --model llama-3-70b --framework vllm agg
  ./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg
  ./run.sh --model llama-3-70b --framework trtllm disagg-single-node
```

Example:

```bash
./run.sh --model llama-3-70b --framework vllm agg
```
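To verify what the script deployed, you can query the created resources. A sketch, assuming the `DynamoGraphDeployment` CRD's plural name follows the usual Kubernetes convention:

```bash
# Check the deployed graph and its pods (resource plural is an assumption)
kubectl get dynamographdeployments -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE}
```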
## Dry run mode

To dry run the recipe, add the `--dry-run` flag:

```bash
./run.sh --dry-run --model llama-3-70b --framework vllm agg
```
## (Optional) Running the recipes with model cache

You can cache the model weights on a PVC to avoid repeated downloads; see the [Prerequisites](#prerequisites) section for details. Once the cache is populated, pass `--skip-model-cache` so the recipe skips the download step:

```bash
./run.sh --skip-model-cache --model llama-3-70b --framework vllm agg
```
Lines changed: 124 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: oss-gpt120b-bench
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: oss-gpt120b
    spec:
      restartPolicy: Never
      containers:
        - name: perf
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:aiperf-0637181
          workingDir: /workspace/components/backends/vllm
          env:
            - name: TARGET_MODEL
              value: openai/gpt-oss-120b
            - name: ENDPOINT
              value: gpt-oss-agg-trtllmworker:8000
            - name: CONCURRENCIES
              value: "13000 13500 1400"
            - name: ISL
              value: "16"
            - name: OSL
              value: "1000"
            - name: DEPLOYMENT_MODE
              value: "agg"
            - name: DEPLOYMENT_GPU_COUNT
              value: "32"
            - name: JOB_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['job-name']
            - name: ROOT_ARTIFACT_DIR
              value: /root/.cache/huggingface/hub/perf
          command:
            - /bin/sh
            - -c
            - |
              # TODO: this can be baked into the aiperf image
              apt-get update && apt-get install -y curl jq
              export COLUMNS=200
              EPOCH=$(date +%s)
              ## Utility functions -- can be moved to a bash script / configmap
              wait_for_model_ready() {
                echo "Waiting for model '$TARGET_MODEL' at $ENDPOINT/v1/models (checking every 5s)..."
                while ! curl -s "http://$ENDPOINT/v1/models" | jq -e --arg model "$TARGET_MODEL" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
                  echo "[$(date '+%H:%M:%S')] Model not ready yet, waiting 5s..."
                  sleep 5
                done
                echo "✅ Model '$TARGET_MODEL' is now available!"
                curl -s "http://$ENDPOINT/v1/models" | jq .
              }
              run_perf() {
                local concurrency=$1
                local isl=$2
                local osl=$3
                key=concurrency_${concurrency}
                export ARTIFACT_DIR="${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/${key}"
                mkdir -p "$ARTIFACT_DIR"
                aiperf profile --artifact-dir $ARTIFACT_DIR \
                  --model $TARGET_MODEL \
                  --tokenizer ~/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a \
                  --endpoint-type chat \
                  --endpoint /v1/chat/completions \
                  --streaming \
                  --url http://$ENDPOINT \
                  --synthetic-input-tokens-mean $isl \
                  --synthetic-input-tokens-stddev 0 \
                  --output-tokens-mean $osl \
                  --output-tokens-stddev 0 \
                  --extra-inputs "{\"max_tokens\":$osl}" \
                  --extra-inputs "{\"min_tokens\":$osl}" \
                  --extra-inputs "{\"ignore_eos\":true}" \
                  --extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
                  --concurrency $concurrency \
                  --request-count $((3*concurrency)) \
                  --warmup-request-count $concurrency \
                  --conversation-num 1 \
                  --random-seed 100 \
                  --request-rate 100000 \
                  --workers-max 128 \
                  -H 'Authorization: Bearer NOT USED' \
                  -H 'Accept: text/event-stream' \
                  --record-processors 32 \
                  --ui simple
                echo "ARTIFACT_DIR: $ARTIFACT_DIR"
                ls -la $ARTIFACT_DIR
              }
              #### Actual execution ####
              wait_for_model_ready
              mkdir -p "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}"
              # Write input_config.json
              cat > "${ROOT_ARTIFACT_DIR}/${EPOCH}_${JOB_NAME}/input_config.json" <<EOF
              {
                "gpu_count": $DEPLOYMENT_GPU_COUNT,
                "mode": "$DEPLOYMENT_MODE",
                "isl": $ISL,
                "osl": $OSL,
                "endpoint": "$ENDPOINT",
                "model": "$TARGET_MODEL"
              }
              EOF
              # Run perf for each concurrency
              for concurrency in $CONCURRENCIES; do
                run_perf $concurrency $ISL $OSL
                sleep 10
              done
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      imagePullSecrets:
        - name: nvcrimagepullsecret
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```
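To run this benchmark once the model is serving, apply the Job and follow its logs; aiperf writes results under `ROOT_ARTIFACT_DIR` on the shared cache volume. A sketch, where `perf.yaml` is a hypothetical file name for this manifest:

```bash
# Launch the benchmark Job and stream its output (file name is hypothetical)
kubectl apply -f perf.yaml -n ${NAMESPACE}
kubectl logs -f job/oss-gpt120b-bench -n ${NAMESPACE}
# Wait for all concurrency sweeps to finish
kubectl wait --for=condition=complete job/oss-gpt120b-bench -n ${NAMESPACE} --timeout=2h
```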
Lines changed: 25 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  config.yaml: |
    tensor_parallel_size: 4
    moe_expert_parallel_size: 4
    enable_attention_dp: true
    build_config:
      max_batch_size: 640
      max_num_tokens: 20000
    moe_config:
      backend: CUTLASS
    cuda_graph_config:
      max_batch_size: 640
      enable_padding: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.9
      enable_block_reuse: false
    print_iter_log: false
    stream_interval: 50
    use_torch_sampler: true
```
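This engine config mirrors the limits passed to the worker below (max batch size 640, max 20000 tokens, 90% free-GPU-memory fraction for KV cache). To apply and inspect it (the manifest file name here is hypothetical):

```bash
# Apply the engine config and view the rendered config.yaml
kubectl apply -f llm-config.yaml -n ${NAMESPACE}
kubectl get configmap llm-config -n ${NAMESPACE} -o yaml
```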
Lines changed: 66 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: gpt-oss-agg-shm
spec:
  backendFramework: trtllm
  services:
    TrtllmWorker:
      componentType: main
      dynamoNamespace: gpt-oss-agg-shm
      envFromSecret: hf-token-secret
      pvc:
        create: false
        name: model-cache-oss-gpt120b
        mountPoint: /root/.cache/huggingface
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "user-workload"
            effect: "NoSchedule"
          - key: "dedicated"
            operator: "Equal"
            value: "user-workload"
            effect: "NoExecute"
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          args:
            - |
              export TRTLLM_ENABLE_PDL=1
              export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True
              export ENGINE_ARGS=${AGG_ENGINE_ARGS:-"/root/.cache/huggingface/gpt-oss-120b/config.yaml"}
              export MODEL_PATH=${MODEL_PATH:-"/root/.cache/huggingface/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"}
              export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"openai/gpt-oss-120b"}
              trap 'echo Cleaning up...; kill 0' EXIT
              python3 -m dynamo.frontend --router-mode round-robin --http-port 8000 &
              python3 -m dynamo.trtllm \
                --model-path "$MODEL_PATH" \
                --served-model-name "$SERVED_MODEL_NAME" \
                --extra-engine-args "$ENGINE_ARGS" \
                --max-num-tokens 20000 \
                --max-batch-size 640 \
                --free-gpu-memory-fraction 0.9
          command:
            - /bin/sh
            - -c
          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:gpt-oss-dynamo-nvl72-debug-trtllm-tot
          workingDir: /workspace/components/backends/trtllm
      replicas: 1
      resources:
        limits:
          gpu: "4"
        requests:
          gpu: "4"
```
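After applying this deployment, the frontend listens on port 8000 inside the worker pod. A sketch for applying it and watching the rollout, with a hypothetical manifest file name:

```bash
kubectl apply -f gpt-oss-agg.yaml -n ${NAMESPACE}   # file name is hypothetical
kubectl get dynamographdeployments -n ${NAMESPACE}  # plural name is an assumption
kubectl get pods -n ${NAMESPACE} -w
```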
Lines changed: 13 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Service
metadata:
  name: gpt-oss-agg-trtllmworker
spec:
  selector:
    nvidia.com/selector: gpt-oss-agg-trtllmworker
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```
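The benchmark Job reaches the frontend through this Service (`ENDPOINT=gpt-oss-agg-trtllmworker:8000`). To smoke-test it from your workstation:

```bash
# Port-forward the Service locally and list the served models
kubectl port-forward svc/gpt-oss-agg-trtllmworker 8000:8000 -n ${NAMESPACE} &
curl -s http://localhost:8000/v1/models | jq .
```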
Lines changed: 9 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  HF_TOKEN: "<Huggingface token with access to the model>"
```
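Rather than editing the manifest, you can create the same Secret directly from the command line:

```bash
# Equivalent imperative form; substitute your actual token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-hf-token> -n ${NAMESPACE}
```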
Lines changed: 13 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: "your-storage-class-name"
```
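`ReadWriteMany` matters here because the download Job and the serving workers mount the same claim. After applying it, a quick sanity check:

```bash
# Verify the claim bound and supports concurrent mounts
kubectl get pvc model-cache -n ${NAMESPACE} \
  -o jsonpath='{.status.phase} {.spec.accessModes}'
```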
Lines changed: 46 additions & 0 deletions
```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            # NOTE: This is the model name for the llama-3-70b recipe.
            # Update it to the name of the model you are downloading.
            - name: MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              export HF_HUB_ENABLE_HF_TRANSFER=1
              huggingface-cli download $MODEL_NAME
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface/hub
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```
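To populate the cache, apply this Job and wait for completion; the weights land under `/root/.cache/huggingface/hub` on the PVC, where the serving recipes expect them. A sketch, where `model-download.yaml` is a hypothetical file name for this manifest:

```bash
kubectl apply -f model-download.yaml -n ${NAMESPACE}   # file name is hypothetical
kubectl wait --for=condition=complete job/model-download -n ${NAMESPACE} --timeout=60m
kubectl logs job/model-download -n ${NAMESPACE} | tail -5
```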
