
Commit 01a70ef

Update

Signed-off-by: Chenfei Zhang <[email protected]>

1 parent 2bff3ab commit 01a70ef

File tree

1 file changed (+33, -33 lines)

examples/models/core/llama4/Deployment Guide for TRT-LLM + Llama4 Scout.md

Lines changed: 33 additions & 33 deletions
@@ -1,6 +1,6 @@
# Deployment Guide for TensorRT-LLM Llama4 Scout 17B FP8 and NVFP4

-##
+##

# Introduction
@@ -14,15 +14,15 @@ To use Llama4 Scout 17B, you must first agree to Meta’s Llama 4 Community License

# Prerequisites

-GPU: NVIDIA Blackwell or Hopper Architecture
-OS: Linux
-Drivers: CUDA Driver 575 or Later
-Docker with NVIDIA Container Toolkit installed
+GPU: NVIDIA Blackwell or Hopper Architecture
+OS: Linux
+Drivers: CUDA Driver 575 or Later
+Docker with NVIDIA Container Toolkit installed
Python3 and python3-pip (Optional, for accuracy evaluation only)

# Models

-* FP8 model: [Llama-4-Scout-17B-16E-Instruct-FP8](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP8)
+* FP8 model: [Llama-4-Scout-17B-16E-Instruct-FP8](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP8)
* NVFP4 model: [Llama-4-Scout-17B-16E-Instruct-FP4](https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP4)
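If you want to fetch the checkpoints ahead of time (for example into the ~/.cache mount described in the container notes below) rather than letting the server download them on first start, a minimal sketch using the Hugging Face CLI is shown here; logging in is only needed if the repository is gated for your account:

```shell
# Optional pre-download sketch; the server can also pull the weights on first launch.
# huggingface-cli login    # only needed if the checkpoint repo is gated for your account
huggingface-cli download nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
huggingface-cli download nvidia/Llama-4-Scout-17B-16E-Instruct-FP4
```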
@@ -45,14 +45,14 @@ nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4 \
/bin/bash
```

-Note:
+Note:

-* You can mount additional directories and paths using the \-v \<local\_path\>:\<path\> flag if needed, such as mounting the downloaded weight paths.
-* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to \~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the \~/.cache directory doesn’t exist please create it using mkdir \~/.cache
-* The command also maps port 8000 from the container to your host so you can access the LLM API endpoint from your host
+* You can mount additional directories and paths using the \-v \<local\_path\>:\<path\> flag if needed, such as mounting the downloaded weight paths.
+* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to \~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the \~/.cache directory doesn’t exist please create it using mkdir \~/.cache
+* The command also maps port 8000 from the container to your host so you can access the LLM API endpoint from your host
* See the [https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for all the available containers. The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. Use the rc release to get the latest model and feature support.
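The docker run command these notes describe is truncated by the hunk above; as orientation, a minimal sketch of an invocation consistent with the notes follows (the --gpus/--ipc flags and the cache mount target are assumptions, not the guide's exact command):

```shell
# Sketch only: an invocation consistent with the notes above, not the guide's exact command.
# Adjust the cache mount target and add extra -v mounts (e.g. local weight paths) as needed.
mkdir -p ~/.cache
docker run --rm -it \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache:/root/.cache \
  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4 \
  /bin/bash
```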

-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
+If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)

## Creating the TRT-LLM Server config
@@ -66,7 +66,7 @@ enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
-kv_cache_config:
+kv_cache_config:
  dtype: fp8
EOF
```
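The heredoc in this hunk is the tail of the extra-options YAML the guide builds. As a hedged sketch of how such a file is typically written out and handed to the server, the block below assumes the trtllm-serve flags discussed later in this guide (--max_batch_size, --max_num_tokens) plus --extra_llm_api_options; the file name and the --max_num_tokens value are illustrative placeholders, so take the exact launch command from the full guide:

```shell
# Illustrative sketch, not the guide's exact command. The YAML path and the
# --max_num_tokens value are placeholders; verify flag names with `trtllm-serve --help`.
cat > /tmp/extra-llm-api-config.yml <<EOF
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF

trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 \
  --host 0.0.0.0 --port 8000 \
  --max_batch_size 1024 \
  --max_num_tokens 8192 \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml
```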
@@ -117,7 +117,7 @@ These options are used directly on the command line when you start the `trtllm-serve`

&emsp;**Description:** The maximum number of user requests that can be grouped into a single batch for processing.

-#### `--max_num_of_tokens`
+#### `--max_num_tokens`

&emsp;**Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.
@@ -138,7 +138,7 @@ These options provide finer control over performance and are set within a YAML file

&emsp;**Description**: A section for configuring the Key-Value (KV) cache.

-&emsp;**Options**:
+&emsp;**Options**:

&emsp;&emsp;dtype: Sets the data type for the KV cache.
@@ -186,7 +186,7 @@ See the [https://github.com/nvidia/TensorRT-LLM/blob/main/tensorrt\_llm/llmapi/l

## Basic Test

-Start a new terminal on the host to test the TensorRT-LLM server you just launched.
+Start a new terminal on the host to test the TensorRT-LLM server you just launched.

You can query the health/readiness of the server using:
@@ -196,7 +196,7 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.

```shell
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
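The request body above is cut off by the hunk boundary; for orientation, a minimal self-contained completion request could look like the sketch below, where the model name, prompt, and sampling fields are illustrative rather than the guide's exact payload:

```shell
# Minimal sketch of a complete /v1/completions request; field values are illustrative.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8",
    "prompt": "In which US state is New York City located?",
    "max_tokens": 16,
    "temperature": 0
  }'
```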
@@ -215,10 +215,10 @@ Here is an example response, showing that the TRT-LLM server returns “New York

## Troubleshooting Tips

-* If you encounter CUDA out-of-memory errors, try reducing max\_batch\_size or max\_seq\_len
-* Ensure your model checkpoints are compatible with the expected format
-* For performance issues, check GPU utilization with nvidia-smi while the server is running
-* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
+* If you encounter CUDA out-of-memory errors, try reducing max\_batch\_size or max\_seq\_len
+* Ensure your model checkpoints are compatible with the expected format
+* For performance issues, check GPU utilization with nvidia-smi while the server is running
+* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
* For connection issues, make sure port 8000 is not being used by another application

## Running Evaluations to Verify Accuracy (Optional)
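Python3 and pip were listed as optional prerequisites for exactly this step; before the lm_eval commands below can drive the server, the evaluation harness has to be installed on the host. A hedged sketch follows (the package extra is an assumption about lm-evaluation-harness packaging, so check the harness docs for your version):

```shell
# Assumption: lm-evaluation-harness packaging; the "api" extra covers the
# local-completions backend used by the commands below.
pip3 install "lm_eval[api]"
```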
@@ -241,7 +241,7 @@ MODEL_PATH=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0 --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp8.gsm8k
```

-Sample result in Blackwell.
+Sample result in Blackwell.

```shell
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
@@ -253,7 +253,7 @@ Sample result in Blackwell.
FP4 command for GSM8K

```shell
-MODEL_PATH=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
+MODEL_PATH=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4

lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0 --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp4.gsm8k
```
@@ -309,7 +309,7 @@ If you want to save the results to a file add the following options.
--result-filename "concurrency_${concurrency}.json"
```

-For more benchmarking options see [https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)
+For more benchmarking options see [https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)

Run bench.sh to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above bench.sh script.
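Since bench.sh itself is not part of this diff, the loop below is only a hypothetical outline of what such a script might look like; every path and flag passed to benchmark_serving.py here is a placeholder to verify against the script's --help output, not the guide's actual invocation:

```shell
# Hypothetical outline only; bench.sh is not shown in this diff. The script path,
# model name, and flags are placeholders to check against `benchmark_serving.py --help`.
MODEL=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
for concurrency in 1 4 16 64 256 1024; do
  python benchmark_serving.py \
    --model "${MODEL}" \
    --max-concurrency "${concurrency}" \
    --result-filename "concurrency_${concurrency}.json"
done
```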
@@ -350,13 +350,13 @@ P99 E2EL (ms): [result]

## Key Metrics

-* Median Time to First Token (TTFT)
-  * The typical time elapsed from when a request is sent until the first output token is generated.
-* Median Time Per Output Token (TPOT)
-  * The typical time required to generate each token *after* the first one.
-* Median Inter-Token Latency (ITL)
-  * The typical time delay between the completion of one token and the completion of the next.
-* Median End-to-End Latency (E2EL)
-  * The typical total time from when a request is submitted until the final token of the response is received.
-* Total Token Throughput
-  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
+* Median Time to First Token (TTFT)
+  * The typical time elapsed from when a request is sent until the first output token is generated.
+* Median Time Per Output Token (TPOT)
+  * The typical time required to generate each token *after* the first one.
+* Median Inter-Token Latency (ITL)
+  * The typical time delay between the completion of one token and the completion of the next.
+* Median End-to-End Latency (E2EL)
+  * The typical total time from when a request is submitted until the final token of the response is received.
+* Total Token Throughput
+  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
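As a quick sanity check on how these metrics relate (illustrative arithmetic, not taken from the guide), single-stream end-to-end latency is roughly the time to first token plus one TPOT for each remaining output token:

```shell
# Illustrative arithmetic only; the numbers are made up.
# E2EL ≈ TTFT + TPOT * (output_tokens - 1)
ttft_ms=80; tpot_ms=10; output_tokens=200
echo "approx E2EL (ms): $(( ttft_ms + tpot_ms * (output_tokens - 1) ))"   # prints 2070
```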
