
Commit f156221

[None][doc] add GPT OSS Eagle3 blog (#7140)
Signed-off-by: Izzy Putterman <[email protected]>
## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT-LLM)

This guide sets up a production endpoint that serves GPT-OSS-120B with Eagle3 speculative decoding, targeting NVIDIA GB200 or B200 GPUs only. It replaces the low-latency flow from the previous guide and intentionally omits max-throughput, Hopper, and benchmarking content.
### Prerequisites

- NVIDIA GB200 or B200 GPUs (example below assumes 8 GPUs; adjust flags for your setup)
- Fast SSD storage for model weights
- Base model weights available under a directory named `gpt-oss-120b` (example path)
- Eagle3 speculative model assets available under a directory named `eagle`

Expected directory layout on the host (example):

```
/path/to/models/
├─ gpt-oss-120b/   # base model directory
└─ eagle/          # Eagle3 speculative decoding assets
```
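Before pulling the container, it can be worth confirming the host actually satisfies these prerequisites. The check below is optional and assumes `nvidia-smi` and standard coreutils are available on the host; `/path/to/models` is the same placeholder used throughout this guide.

```bash
# List the visible GPUs (expect GB200 or B200 entries; this guide's example assumes 8)
nvidia-smi -L

# Check free space on the filesystem that will hold the model weights
df -h /path/to/models
```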
### Get the TensorRT-LLM Container (1.1.0rc0)

If required by your environment, log into NGC and pull the image:

```bash
# Create an API key at https://ngc.nvidia.com (if you don't have one)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>

docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
```
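Optionally, confirm that the image landed locally before starting it:

```bash
docker image ls nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
```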
### Start the TensorRT-LLM Container

Run the container and bind-mount your models directory to `/config/models` inside the container:

```bash
docker run --rm --ipc=host -it \
  --ulimit stack=67108864 \
  --ulimit memlock=-1 \
  --gpus all \
  -p 8000:8000 \
  -v /path/to/models:/config/models:rw \
  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
  /bin/bash
```
Replace `/path/to/models` with the absolute path on your host.
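Once the shell inside the container comes up, a quick look at the mount point confirms the bind mount and GPU passthrough before downloading anything. This optional check assumes the directory names from the layout above:

```bash
# Inside the container: the mount should be visible
ls /config/models
# Expected after the download step below: eagle  gpt-oss-120b

# The GPUs should be visible through the NVIDIA runtime
nvidia-smi -L
```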
### Download the Models (Base + Eagle3)

Inside the container, download the base model and the Eagle3 speculative model to the expected directories under `/config/models/`:

```bash
# Optional: authenticate if the repository requires it
# export HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# huggingface-cli login --token "$HF_TOKEN" --add-to-git-credential

pip install -q "huggingface_hub[cli]"

# Base model: openai/gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b \
  --local-dir /config/models/gpt-oss-120b \
  --repo-type model

# Eagle3 model assets
mkdir -p /config/models/eagle
huggingface-cli download nvidia/gpt-oss-120b-Eagle3 \
  --local-dir /config/models/eagle \
  --repo-type model
```

References: [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) and [nvidia/gpt-oss-120b-Eagle3](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3)
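It can help to verify that both directories are populated before wiring up the configuration. The file names mentioned in the comment reflect the typical Hugging Face layout and are not guaranteed:

```bash
# Each directory should contain a config.json plus weight files
ls /config/models/gpt-oss-120b | head
ls /config/models/eagle | head
```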
### Create the Eagle3 Configuration

Inside the container, create the YAML file at `/config/models/eagle/eagle.yaml` with the following content:

```bash
mkdir -p /config/models/eagle
cat > /config/models/eagle/eagle.yaml << 'EOF'
trust_remote_code: true
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /config/models/eagle/
cuda_graph_config:
  max_batch_size: 10
use_torch_sampler: true
moe_config:
  backend: TRTLLM
EOF
```
Notes:

- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running top-of-tree TensorRT-LLM (built from the main branch) rather than the 1.1.0rc0 release, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
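If you want to double-check the YAML before launching, a quick parse catches indentation mistakes early. This assumes PyYAML is available in the container (otherwise run `pip install pyyaml` first):

```bash
# Parse the config and print it back; a traceback here means the YAML is malformed
python3 -c "import yaml; print(yaml.safe_load(open('/config/models/eagle/eagle.yaml')))"
```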
### Launch the Server (Eagle3 Speculative Decoding)

Run the following command inside the container to start the endpoint:

```bash
TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b \
  --host 0.0.0.0 --port 8000 \
  --max_batch_size 10 \
  --tp_size 8 --ep_size 4 \
  --trust_remote_code \
  --extra_llm_api_options /config/models/eagle/eagle.yaml \
  --max_num_tokens 131072 --max_seq_len 131072
```
The server initializes, loading and optimizing the models. Once it is ready, it listens on port 8000.
### Quick Health Check

From another terminal on the host, verify that the server is healthy:

```bash
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```
When `Status: 200` is returned, the endpoint is ready to serve requests.
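Startup can take a while as the models are loaded and optimized. If you prefer to block until the endpoint is up, a simple polling loop such as the following works (optional sketch):

```bash
# Poll the health endpoint until it returns HTTP 200
until curl -sf "http://localhost:8000/health" > /dev/null; do
  echo "Waiting for the server to become ready..."
  sleep 10
done
echo "Server is ready."
```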
### Sample Chat Completions Request

Note: This Eagle3 + TensorRT-LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.

Send a simple OpenAI-compatible Chat Completions request to the running server:
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."}
    ],
    "max_tokens": 128,
    "stream": false
  }'
```
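Two optional variations on the same request are sketched below: extracting just the reply text (this assumes `jq` is installed on the host) and streaming tokens as they are generated via the OpenAI-style `stream` flag.

```bash
# Print only the assistant's reply (requires jq)
curl -s "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}' \
  | jq -r '.choices[0].message.content'

# Streaming variant: tokens arrive as server-sent events
curl -N -s "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64, "stream": true}'
```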
