
Commit 01a70ef

Update

Signed-off-by: Chenfei Zhang <[email protected]>

1 parent 2bff3ab commit 01a70ef

File tree

1 file changed (+33, -33 lines)

examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

Lines changed: 33 additions & 33 deletions
@@ -1,6 +1,6 @@
# Deployment Guide for TensorRT-LLM Llama4 Scout 17B FP8 and NVFP4

-##
+##

# Introduction
@@ -14,15 +14,15 @@ To use Llama4 Scout 17B, you must first agree to Meta’s Llama 4 Community License

# Prerequisites

-GPU: NVIDIA Blackwell or Hopper Architecture
-OS: Linux
-Drivers: CUDA Driver 575 or Later
-Docker with NVIDIA Container Toolkit installed
+GPU: NVIDIA Blackwell or Hopper Architecture
+OS: Linux
+Drivers: CUDA Driver 575 or Later
+Docker with NVIDIA Container Toolkit installed
Python3 and python3-pip (Optional, for accuracy evaluation only)

# Models

-* FP8 model: [Llama-4-Scout-17B-16E-Instruct-FP8](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP8)
+* FP8 model: [Llama-4-Scout-17B-16E-Instruct-FP8](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP8)
* NVFP4 model: [Llama-4-Scout-17B-16E-Instruct-FP4](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP4)
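If you want to fetch the checkpoints ahead of time (for example into the ~/.cache mount described in the container notes below) rather than letting the server download them on first start, a minimal sketch using the Hugging Face CLI is shown here; logging in is only needed if the repository is gated for your account:

```shell
# Optional pre-download sketch; the server can also pull the weights on first launch.
# huggingface-cli login    # only needed if the checkpoint repo is gated for your account
huggingface-cli download nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
huggingface-cli download nvidia/Llama-4-Scout-17B-16E-Instruct-FP4
```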
@@ -45,14 +45,14 @@ nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4 \
/bin/bash
```

-Note:
+Note:

-* You can mount additional directories and paths using the \-v \<local\_path\>:\<path\> flag if needed, such as mounting the downloaded weight paths.
-* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to \~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the \~/.cache directory doesn’t exist please create it using mkdir \~/.cache
-* The command also maps port 8000 from the container to your host so you can access the LLM API endpoint from your host
+* You can mount additional directories and paths using the \-v \<local\_path\>:\<path\> flag if needed, such as mounting the downloaded weight paths.
+* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to \~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the \~/.cache directory doesn’t exist please create it using mkdir \~/.cache
+* The command also maps port 8000 from the container to your host so you can access the LLM API endpoint from your host
* See the [https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for all the available containers. The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. Use the rc release to get the latest model and feature support.
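The docker run command these notes describe is truncated by the hunk above; as orientation, a minimal sketch of an invocation consistent with the notes follows (the --gpus/--ipc flags and the cache mount target are assumptions, not the guide's exact command):

```shell
# Sketch only: an invocation consistent with the notes above, not the guide's exact command.
# Adjust the cache mount target and add extra -v mounts (e.g. local weight paths) as needed.
mkdir -p ~/.cache
docker run --rm -it \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache:/root/.cache \
  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4 \
  /bin/bash
```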

-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
+If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)

## Creating the TRT-LLM Server config
@@ -66,7 +66,7 @@ enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
-kv_cache_config:
+kv_cache_config:
  dtype: fp8
EOF
```
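The heredoc in this hunk is the tail of the extra-options YAML the guide builds. As a hedged sketch of how such a file is typically written out and handed to the server, the block below assumes the trtllm-serve flags discussed later in this guide (--max_batch_size, --max_num_tokens) plus --extra_llm_api_options; the file name and the --max_num_tokens value are illustrative placeholders, so take the exact launch command from the full guide:

```shell
# Illustrative sketch, not the guide's exact command. The YAML path and the
# --max_num_tokens value are placeholders; verify flag names with `trtllm-serve --help`.
cat > /tmp/extra-llm-api-config.yml <<EOF
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF

trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 \
  --host 0.0.0.0 --port 8000 \
  --max_batch_size 1024 \
  --max_num_tokens 8192 \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml
```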
@@ -117,7 +117,7 @@ These options are used directly on the command line when you start the `trtllm-serve`

&emsp;**Description:** The maximum number of user requests that can be grouped into a single batch for processing.

-#### `--max_num_of_tokens`
+#### `--max_num_tokens`

&emsp;**Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.
@@ -138,7 +138,7 @@ These options provide finer control over performance and are set within a YAML file

&emsp;**Description**: A section for configuring the Key-Value (KV) cache.

-&emsp;**Options**:
+&emsp;**Options**:

&emsp;&emsp;dtype: Sets the data type for the KV cache.
@@ -186,7 +186,7 @@ See the [https://github.com/nvidia/TensorRT-LLM/blob/main/tensorrt\_llm/llmapi/l

## Basic Test

-Start a new terminal on the host to test the TensorRT-LLM server you just launched.
+Start a new terminal on the host to test the TensorRT-LLM server you just launched.

You can query the health/readiness of the server using:
@@ -196,7 +196,7 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.

```shell
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
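The request body above is cut off by the hunk boundary; for orientation, a minimal self-contained completion request could look like the sketch below, where the model name, prompt, and sampling fields are illustrative rather than the guide's exact payload:

```shell
# Minimal sketch of a complete /v1/completions request; field values are illustrative.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8",
    "prompt": "In which US state is New York City located?",
    "max_tokens": 16,
    "temperature": 0
  }'
```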
@@ -215,10 +215,10 @@ Here is an example response, showing that the TRT-LLM server returns “New York

## Troubleshooting Tips

-* If you encounter CUDA out-of-memory errors, try reducing max\_batch\_size or max\_seq\_len
-* Ensure your model checkpoints are compatible with the expected format
-* For performance issues, check GPU utilization with nvidia-smi while the server is running
-* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
+* If you encounter CUDA out-of-memory errors, try reducing max\_batch\_size or max\_seq\_len
+* Ensure your model checkpoints are compatible with the expected format
+* For performance issues, check GPU utilization with nvidia-smi while the server is running
+* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
* For connection issues, make sure port 8000 is not being used by another application

## Running Evaluations to Verify Accuracy (Optional)
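Python3 and pip were listed as optional prerequisites for exactly this step; before the lm_eval commands below can drive the server, the evaluation harness has to be installed on the host. A hedged sketch follows (the package extra is an assumption about lm-evaluation-harness packaging, so check the harness docs for your version):

```shell
# Assumption: lm-evaluation-harness packaging; the "api" extra covers the
# local-completions backend used by the commands below.
pip3 install "lm_eval[api]"
```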
@@ -241,7 +241,7 @@ MODEL_PATH=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0 --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp8.gsm8k
```

-Sample result in Blackwell.
+Sample result in Blackwell.

```shell
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
@@ -253,7 +253,7 @@ Sample result in Blackwell.
FP4 command for GSM8K

```shell
-MODEL_PATH=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
+MODEL_PATH=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4

lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0 --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp4.gsm8k
```
@@ -309,7 +309,7 @@ If you want to save the results to a file add the following options.
--result-filename "concurrency_${concurrency}.json"
```

-For more benchmarking options see [https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
+For more benchmarking options see [https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)

Run bench.sh to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above bench.sh script.
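Since bench.sh itself is not part of this diff, the loop below is only a hypothetical outline of what such a script might look like; every path and flag passed to benchmark_serving.py here is a placeholder to verify against the script's --help output, not the guide's actual invocation:

```shell
# Hypothetical outline only; bench.sh is not shown in this diff. The script path,
# model name, and flags are placeholders to check against `benchmark_serving.py --help`.
MODEL=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
for concurrency in 1 4 16 64 256 1024; do
  python benchmark_serving.py \
    --model "${MODEL}" \
    --max-concurrency "${concurrency}" \
    --result-filename "concurrency_${concurrency}.json"
done
```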
@@ -350,13 +350,13 @@ P99 E2EL (ms): [result]

## Key Metrics

-* Median Time to First Token (TTFT)
-  * The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-  * The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-  * The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
-  * The typical total time from when a request is submitted until the final token of the response is received.
-* Total Token Throughput
-  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
+* Median Time to First Token (TTFT)
+  * The typical time elapsed from when a request is sent until the first output token is generated.
+* Median Time Per Output Token (TPOT)
+  * The typical time required to generate each token *after* the first one.
+* Median Inter-Token Latency (ITL)
+  * The typical time delay between the completion of one token and the completion of the next.
+* Median End-to-End Latency (E2EL)
+  * The typical total time from when a request is submitted until the final token of the response is received.
+* Total Token Throughput
+  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
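As a quick sanity check on how these metrics relate (illustrative arithmetic, not taken from the guide), single-stream end-to-end latency is roughly the time to first token plus one TPOT for each remaining output token:

```shell
# Illustrative arithmetic only; the numbers are made up.
# E2EL ≈ TTFT + TPOT * (output_tokens - 1)
ttft_ms=80; tpot_ms=10; output_tokens=200
echo "approx E2EL (ms): $(( ttft_ms + tpot_ms * (output_tokens - 1) ))"   # prints 2070
```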
