docs/source/commands/trtllm-serve/trtllm-serve.rst (28 changes: 22 additions & 6 deletions)

@@ -41,13 +41,13 @@ Chat API

You can query the Chat API with any HTTP client; a typical example uses the OpenAI Python client:

-.. literalinclude:: ../../../examples/serve/openai_chat_client.py
+.. literalinclude:: ../../../../examples/serve/openai_chat_client.py
   :language: python
   :linenos:
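
The included client script is not reproduced in this diff, but a minimal sketch of such a client might look as follows, assuming the server runs on the default ``localhost:8000`` and ``<model_name>`` stands in for the served model:

.. code-block:: python

   from openai import OpenAI

   # trtllm-serve exposes an OpenAI-compatible endpoint, so the stock client
   # works; the API key is required by the constructor but otherwise unused.
   client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

   response = client.chat.completions.create(
       model="<model_name>",  # placeholder: the model passed to trtllm-serve
       messages=[{"role": "user", "content": "Where is New York?"}],
       max_tokens=32,
   )
   print(response.choices[0].message.content)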

Another example uses ``curl``:

-.. literalinclude:: ../../../examples/serve/curl_chat_client.sh
+.. literalinclude:: ../../../../examples/serve/curl_chat_client.sh
   :language: bash
   :linenos:

@@ -56,13 +56,13 @@ Completions API

You can query the Completions API with any HTTP client; a typical example uses the OpenAI Python client:

-.. literalinclude:: ../../../examples/serve/openai_completion_client.py
+.. literalinclude:: ../../../../examples/serve/openai_completion_client.py
   :language: python
   :linenos:

Another example uses ``curl``:

-.. literalinclude:: ../../../examples/serve/curl_completion_client.sh
+.. literalinclude:: ../../../../examples/serve/curl_completion_client.sh
   :language: bash
   :linenos:
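
The script itself is likewise not shown in this diff; a rough sketch of such a request, assuming the default host and port and a placeholder model name:

.. code-block:: bash

   curl http://localhost:8000/v1/completions \
       -H "Content-Type: application/json" \
       -d '{
           "model": "<model_name>",
           "prompt": "Where is New York?",
           "max_tokens": 16
       }'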

@@ -97,13 +97,13 @@ Multimodal Chat API

You can query the Multimodal Chat API with any HTTP client; a typical example uses the OpenAI Python client:

-.. literalinclude:: ../../../examples/serve/openai_completion_client_for_multimodal.py
+.. literalinclude:: ../../../../examples/serve/openai_completion_client_for_multimodal.py
   :language: python
   :linenos:

Another example uses ``curl``:

-.. literalinclude:: ../../../examples/serve/curl_chat_client_for_multimodal.sh
+.. literalinclude:: ../../../../examples/serve/curl_chat_client_for_multimodal.sh
   :language: bash
   :linenos:
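
The multimodal client script is not reproduced here either; a sketch of one plausible request, assuming the server accepts OpenAI-style ``image_url`` content parts and ``<multimodal_model>`` is a placeholder:

.. code-block:: python

   from openai import OpenAI

   client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

   # An OpenAI-style multimodal message: one text part plus one image part.
   response = client.chat.completions.create(
       model="<multimodal_model>",  # placeholder
       messages=[{
           "role": "user",
           "content": [
               {"type": "text", "text": "Describe this image."},
               {"type": "image_url",
                "image_url": {"url": "https://example.com/image.png"}},
           ],
       }],
       max_tokens=64,
   )
   print(response.choices[0].message.content)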

@@ -254,7 +254,23 @@ Example output:
}
]

Configuring with YAML Files
----------------------------

You can configure various options of ``trtllm-serve`` with a YAML file by setting the ``--extra_llm_api_options`` option to the path of the file; the arguments given in the file override the corresponding command-line arguments.

The YAML file maps onto `tensorrt_llm.llmapi.LlmArgs <https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs>`_. The class has multiple levels of hierarchy; to configure a top-level argument such as ``max_batch_size``, set the key directly:

.. code-block:: yaml

   max_batch_size: 8

To configure a nested argument such as ``moe_config.backend``, use nested keys:

.. code-block:: yaml

   moe_config:
     backend: CUTLASS
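
Putting the two together, a launch might look like the following sketch, where ``<model_name_or_path>`` is a placeholder:

.. code-block:: bash

   cat > extra_config.yaml <<EOF
   max_batch_size: 8
   moe_config:
     backend: CUTLASS
   EOF

   trtllm-serve <model_name_or_path> --extra_llm_api_options extra_config.yaml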

Syntax
------
docs/source/features/parallel-strategy.md (2 changes: 2 additions & 0 deletions)

@@ -80,6 +80,8 @@ enable_attention_dp: true
EOF
```

Then set `--extra_llm_api_options parallel_config.yaml` when launching `trtllm-serve` or `trtllm-bench`, as in the sketch below.
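
A minimal sketch, where `<model_name_or_path>` and the parallel sizes are placeholders:

```bash
trtllm-serve <model_name_or_path> \
    --tp_size 8 \
    --extra_llm_api_options parallel_config.yaml
```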

### FFN Module

#### Dense Models
examples/serve/deepseek_r1_reasoning_parser.sh (17 changes: 15 additions & 2 deletions)

@@ -1,10 +1,23 @@
#! /usr/bin/env bash

cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 512
enable_attention_dp: true
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: DEEPGEMM
EOF

trtllm-serve \
    deepseek-ai/DeepSeek-R1 \
    --host localhost --port 8000 \
-   --max_batch_size 161 --max_num_tokens 1160 \
    --trust_remote_code \
+   --max_batch_size 1024 --max_num_tokens 8192 \
    --tp_size 8 --ep_size 8 --pp_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.95 \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    --reasoning_parser deepseek-r1
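
Once the server is up, a quick smoke test might look like the sketch below; with `--reasoning_parser deepseek-r1` the response separates the model's reasoning from the final answer, though the exact response fields are not part of this diff:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "What is 2 + 2?"}], "max_tokens": 256}'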