docs/source/commands/trtllm-serve/trtllm-serve.rst (28 changes: 22 additions & 6 deletions)

@@ -41,13 +41,13 @@ Chat API

You can query the Chat API with any HTTP client; a typical example uses the OpenAI Python client:

-.. literalinclude:: ../../../examples/serve/openai_chat_client.py
+.. literalinclude:: ../../../../examples/serve/openai_chat_client.py
   :language: python
   :linenos:
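
The included client script is not reproduced in this diff, but a minimal sketch of such a client might look as follows, assuming the server runs on the default ``localhost:8000`` and ``<model_name>`` stands in for the served model:

.. code-block:: python

   from openai import OpenAI

   # trtllm-serve exposes an OpenAI-compatible endpoint, so the stock client
   # works; the API key is required by the constructor but otherwise unused.
   client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

   response = client.chat.completions.create(
       model="<model_name>",  # placeholder: the model passed to trtllm-serve
       messages=[{"role": "user", "content": "Where is New York?"}],
       max_tokens=32,
   )
   print(response.choices[0].message.content)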

Another example uses ``curl``:

-.. literalinclude:: ../../../examples/serve/curl_chat_client.sh
+.. literalinclude:: ../../../../examples/serve/curl_chat_client.sh
   :language: bash
   :linenos:

@@ -56,13 +56,13 @@ Completions API

You can query the Completions API with any HTTP client; a typical example uses the OpenAI Python client:

-.. literalinclude:: ../../../examples/serve/openai_completion_client.py
+.. literalinclude:: ../../../../examples/serve/openai_completion_client.py
   :language: python
   :linenos:

Another example uses ``curl``:

-.. literalinclude:: ../../../examples/serve/curl_completion_client.sh
+.. literalinclude:: ../../../../examples/serve/curl_completion_client.sh
   :language: bash
   :linenos:
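
The script itself is likewise not shown in this diff; a rough sketch of such a request, assuming the default host and port and a placeholder model name:

.. code-block:: bash

   curl http://localhost:8000/v1/completions \
       -H "Content-Type: application/json" \
       -d '{
           "model": "<model_name>",
           "prompt": "Where is New York?",
           "max_tokens": 16
       }'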

@@ -97,13 +97,13 @@ Multimodal Chat API

You can query the Multimodal Chat API with any HTTP client; a typical example uses the OpenAI Python client:

-.. literalinclude:: ../../../examples/serve/openai_completion_client_for_multimodal.py
+.. literalinclude:: ../../../../examples/serve/openai_completion_client_for_multimodal.py
   :language: python
   :linenos:

Another example uses ``curl``:

-.. literalinclude:: ../../../examples/serve/curl_chat_client_for_multimodal.sh
+.. literalinclude:: ../../../../examples/serve/curl_chat_client_for_multimodal.sh
   :language: bash
   :linenos:
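
The multimodal client script is not reproduced here either; a sketch of one plausible request, assuming the server accepts OpenAI-style ``image_url`` content parts and ``<multimodal_model>`` is a placeholder:

.. code-block:: python

   from openai import OpenAI

   client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

   # An OpenAI-style multimodal message: one text part plus one image part.
   response = client.chat.completions.create(
       model="<multimodal_model>",  # placeholder
       messages=[{
           "role": "user",
           "content": [
               {"type": "text", "text": "Describe this image."},
               {"type": "image_url",
                "image_url": {"url": "https://example.com/image.png"}},
           ],
       }],
       max_tokens=64,
   )
   print(response.choices[0].message.content)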

@@ -254,7 +254,23 @@ Example output:
}
]

Configuring with YAML Files
----------------------------

You can configure various options of ``trtllm-serve`` with a YAML file by setting the ``--extra_llm_api_options`` option to the path of the file; the arguments given in the file override the corresponding command-line arguments.

The YAML file maps onto `tensorrt_llm.llmapi.LlmArgs <https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs>`_. The class has multiple levels of hierarchy; to configure a top-level argument such as ``max_batch_size``, set the key directly:

.. code-block:: yaml

   max_batch_size: 8

To configure a nested argument such as ``moe_config.backend``, use nested keys:

.. code-block:: yaml

   moe_config:
     backend: CUTLASS
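
Putting the two together, a launch might look like the following sketch, where ``<model_name_or_path>`` is a placeholder:

.. code-block:: bash

   cat > extra_config.yaml <<EOF
   max_batch_size: 8
   moe_config:
     backend: CUTLASS
   EOF

   trtllm-serve <model_name_or_path> --extra_llm_api_options extra_config.yaml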

Syntax
------
docs/source/features/parallel-strategy.md (2 changes: 2 additions & 0 deletions)

@@ -80,6 +80,8 @@ enable_attention_dp: true
EOF
```

Then set `--extra_llm_api_options parallel_config.yaml` when launching `trtllm-serve` or `trtllm-bench`, as in the sketch below.
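
A minimal sketch, where `<model_name_or_path>` and the parallel sizes are placeholders:

```bash
trtllm-serve <model_name_or_path> \
    --tp_size 8 \
    --extra_llm_api_options parallel_config.yaml
```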

### FFN Module

#### Dense Models
examples/serve/deepseek_r1_reasoning_parser.sh (17 changes: 15 additions & 2 deletions)

@@ -1,10 +1,23 @@
#! /usr/bin/env bash

cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 512
enable_attention_dp: true
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: DEEPGEMM
EOF

trtllm-serve \
    deepseek-ai/DeepSeek-R1 \
    --host localhost --port 8000 \
-   --max_batch_size 161 --max_num_tokens 1160 \
    --trust_remote_code \
+   --max_batch_size 1024 --max_num_tokens 8192 \
    --tp_size 8 --ep_size 8 --pp_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.95 \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    --reasoning_parser deepseek-r1
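
Once the server is up, a quick smoke test might look like the sketch below; with `--reasoning_parser deepseek-r1` the response separates the model's reasoning from the final answer, though the exact response fields are not part of this diff:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "What is 2 + 2?"}], "max_tokens": 256}'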