* You can mount additional directories and paths with the `-v <local_path>:<path>` flag if needed, for example to mount pre-downloaded weight paths (see the sketch after this list).
* The command mounts your user `.cache` directory so the downloaded model checkpoints, which are stored under `~/.cache/huggingface/hub/` by default, are preserved between runs. This avoids re-downloading the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist, create it with `mkdir ~/.cache`.
* The command also maps port 8000 from the container to your host so you can access the LLM API endpoint from your host.
* See [https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for all available containers. Containers published weekly from the main branch carry an “rcN” suffix, while the monthly release that goes through QA testing has no “rcN” suffix. Use an rc release to get the latest model and feature support.
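
For reference, here is a sketch of how these flags fit together on a single `docker run` line; the image tag, GPU options, and mount targets are illustrative placeholders rather than values taken from this guide.

```
# Illustrative sketch only: substitute the container tag you chose from the NGC
# catalog and adjust the GPU and mount options for your system.
docker run --rm -it --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache:/root/.cache \
  -v <local_path>:<path> \
  nvcr.io/nvidia/tensorrt-llm/release:<tag>
```
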
If you want to use the latest main branch, you can instead build TensorRT-LLM from source; the steps are described at [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html).
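
As a rough, hedged sketch of what that source build typically involves (the exact targets and prerequisites may change, so treat the linked page as authoritative):

```
# Typical flow only; consult the linked build-from-source page for the
# current prerequisites (Docker, git-lfs, CUDA toolkit, etc.) and steps.
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
# Build a release container with TensorRT-LLM compiled from source.
make -C docker release_build
```
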
## Creating the TRT-LLM Server config

```
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF
```
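
This options file is consumed when the server is launched. As a hedged sketch (the model identifier and the YAML path below are placeholders, not values from this guide), it is typically passed to `trtllm-serve` via `--extra_llm_api_options`:

```
# Placeholders: <model_or_checkpoint_path> and the YAML path depend on your setup
# and on where the heredoc above actually wrote the options file.
trtllm-serve <model_or_checkpoint_path> \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml \
  --port 8000
```
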
These options are used directly on the command line when you start the `trtllm-serve` process.

#### `--max_batch_size`

**Description:** The maximum number of user requests that can be grouped into a single batch for processing.

#### `--max_num_tokens`

**Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.
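
For illustration, a hedged example of how these two flags appear on the command line (the model placeholder and the numeric values are not taken from this guide):

```
# Illustrative values; size them to your model, sequence lengths, and GPU memory.
trtllm-serve <model_or_checkpoint_path> \
  --max_batch_size 1024 \
  --max_num_tokens 8192
```
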
These options provide finer control over performance and are set within a YAML file.

#### `kv_cache_config`

**Description**: A section for configuring the Key-Value (KV) cache.

**Options**:

* `dtype`: Sets the data type for the KV cache.

## Basic Test

Start a new terminal on the host to test the TensorRT-LLM server you just launched.

You can query the health/readiness of the server using:
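
For example, a minimal probe along these lines (assuming the default port 8000 and the server's `/health` route):

```
# Prints "Status: 200" once the server is up and ready to accept requests.
curl -s -o /dev/null -w "Status: %{http_code}\n" http://localhost:8000/health
```
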
When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

After the TRT-LLM server is set up and shows `Application startup complete`, you can send requests to the server.
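
For example, a request against the OpenAI-compatible chat completions endpoint could look like the following; the model name is a placeholder and should match the model you are serving (you can list it via the `/v1/models` route):

```
# <served_model_name> is a placeholder; query http://localhost:8000/v1/models to
# see the exact id the server reports.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served_model_name>",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
      }'
```
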

For more benchmarking options, see [https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py).
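
Since the available flags can change between releases, a quick way to see what your checkout supports (assuming you run this from the repository root) is:

```
# Print the script's full list of supported command-line options.
python3 tensorrt_llm/serve/scripts/benchmark_serving.py --help
```
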
Run bench.sh to begin a serving benchmark. This can take a long time if you run all of the concurrency levels listed in the bench.sh script above.

## Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received.
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.