Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 52 additions & 0 deletions recipes/gpt-oss-120b/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# GPT-OSS-120B Recipe Guide

This guide will help you run the GPT-OSS-120B language model using Dynamo's optimized setup.

## Prerequisites

Follow the instructions in recipe [README.md](../README.md) to create a namespace and kubernetes secret for huggingface token.

## Quick Start

To run the model, simply execute this command in your terminal:

```bash
cd recipe
./run.sh --model gpt-oss-120b --framework trtllm agg
```

## (Alternative) Step by Step Guide

### 1. Download the Model

```bash
cd recipes/gpt-oss-120b
kubectl apply -n $NAMESPACE -f ./model-cache
```

### 2. Deploy and Benchmark the Model

```bash
cd recipes/gpt-oss-120b
kubectl apply -n $NAMESPACE -f ./trtllm/agg
```

### Container Image
This recipe was tested with dynamo trtllm runtime container for ARM64 processors.

**Important Note:**

Before dynamo v0.5.1 release, following container image is supported:
```
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3
```

After dynamo v0.5.1 release, following container image will be supported:
```
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
```

## Notes
1. The benchmark container image uses a specific commit of aiperf to ensure reproducible results and compatibility with the benchmarking setup.

2. storage class is not specified in the recipe, you need to specify it in the `deploy.yaml` file.
13 changes: 13 additions & 0 deletions recipes/gpt-oss-120b/model-cache/model-cache.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 100Gi
storageClassName: "your-storage-class-name"
44 changes: 44 additions & 0 deletions recipes/gpt-oss-120b/model-cache/model-download.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
name: model-download
spec:
backoffLimit: 3
completions: 1
parallelism: 1
template:
metadata:
labels:
app: model-download
spec:
restartPolicy: Never
containers:
- name: model-download
image: python:3.10-slim
command: ["sh", "-c"]
envFrom:
- secretRef:
name: hf-token-secret
env:
- name: MODEL_NAME
value: openai/gpt-oss-120b
- name: HF_HOME
value: /model-store
- name: HF_HUB_ENABLE_HF_TRANSFER
value: "1"
- name: MODEL_REVISION
value: b5c939de8f754692c1647ca79fbf85e8c1e70f8a
args:
- |
set -eux
pip install --no-cache-dir huggingface_hub hf_transfer
hf download $MODEL_NAME --revision $MODEL_REVISION --exclude "original/*" --exclude "metal/*"
volumeMounts:
- name: model-cache
mountPath: /model-store
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
16 changes: 4 additions & 12 deletions recipes/gpt-oss-120b/trtllm/agg/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,20 +6,12 @@ metadata:
name: llm-config
data:
config.yaml: |
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: true
build_config:
max_batch_size: 640
max_num_tokens: 20000
moe_config:
backend: CUTLASS
cuda_graph_config:
max_batch_size: 640
max_batch_size: 800
enable_padding: true
kv_cache_config:
free_gpu_memory_fraction: 0.9
enable_block_reuse: false
print_iter_log: false
stream_interval: 50
use_torch_sampler: true
stream_interval: 20
moe_config:
backend: CUTLASS
96 changes: 63 additions & 33 deletions recipes/gpt-oss-120b/trtllm/agg/deploy.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,61 +3,91 @@
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: gpt-oss-agg-shm
name: gpt-oss-agg
spec:
backendFramework: trtllm
services:
Frontend:
componentType: frontend
dynamoNamespace: gpt-oss-agg
extraPodSpec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: nvidia.com/dynamo-graph-deployment-name
operator: In
values:
- gpt-oss-agg-frontend
topologyKey: kubernetes.io/hostname
mainContainer:
args:
- python3 -m dynamo.frontend --router-mode round-robin --http-port 8000
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
pvc:
create: false
mountPoint: /model-store
name: model-cache
replicas: 1
TrtllmWorker:
componentType: main
dynamoNamespace: gpt-oss-agg-shm
dynamoNamespace: gpt-oss-agg
envFromSecret: hf-token-secret
pvc:
create: false
name: model-cache-oss-gpt120b
mountPoint: /root/.cache/huggingface
sharedMemory:
size: 80Gi
extraPodSpec:
tolerations:
- key: "dedicated"
operator: "Equal"
value: "user-workload"
effect: "NoSchedule"
- key: "dedicated"
operator: "Equal"
value: "user-workload"
effect: "NoExecute"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.present
operator: In
values:
- "true"
- matchExpressions:
- key: nvidia.com/gpu.present
operator: In
values:
- "true"
mainContainer:
args:
- |
export TRTLLM_ENABLE_PDL=1
export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True
export ENGINE_ARGS=${AGG_ENGINE_ARGS:-"/root/.cache/huggingface/gpt-oss-120b/config.yaml"}
export MODEL_PATH=${MODEL_PATH:-"/root/.cache/huggingface/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"}
export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"openai/gpt-oss-120b"}
trap 'echo Cleaning up...; kill 0' EXIT
python3 -m dynamo.frontend --router-mode round-robin --http-port 8000 &
python3 -m dynamo.trtllm \
--model-path "$MODEL_PATH" \
--served-model-name "$SERVED_MODEL_NAME" \
--extra-engine-args "$ENGINE_ARGS" \
--max-num-tokens 20000 \
--max-batch-size 640 \
--model-path "${MODEL_PATH}" \
--served-model-name "openai/gpt-oss-120b" \
--extra-engine-args "${ENGINE_ARGS}" \
--tensor-parallel-size 4 \
--expert-parallel-size 4 \
--max-batch-size 800 \
--free-gpu-memory-fraction 0.9
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
env:
- name: TRTLLM_ENABLE_PDL
value: "1"
- name: TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL
value: "True"
- name: SERVED_MODEL_NAME
value: "openai/gpt-oss-120b"
- name: ENGINE_ARGS
value: "/opt/dynamo/configs/config.yaml"
- name: MODEL_PATH
value: "/model-store/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a"
volumeMounts:
- mountPath: /opt/dynamo/configs
name: llm-config
readOnly: true
image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
workingDir: /workspace/components/backends/trtllm
volumes:
- configMap:
name: llm-config
name: llm-config
pvc:
create: false
mountPoint: /model-store
name: model-cache
replicas: 1
resources:
limits:
Expand Down
Loading
Loading