Commit 22e7fcb

chore: [Breaking Change] Rename the cuda_graph_config field padding_enabled to enable_padding and optimize the TorchLlmArgs class by adding a moe_config field.
Signed-off-by: nv-guomingz <[email protected]>
1 parent: f225f5c · commit: 22e7fcb

27 files changed: +152 additions, -135 deletions
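
For anyone migrating an existing extra-llm-api-config.yml, the sketch below summarizes the renamed options as they appear in the documentation updates of this commit; the values are illustrative only.

```yaml
# Before this commit
cuda_graph_config:
  padding_enabled: true
moe_backend: TRTLLM

# After this commit
cuda_graph_config:
  enable_padding: true
moe_config:
  backend: TRTLLM
```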

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 4 additions & 3 deletions
@@ -138,7 +138,8 @@ YOUR_DATA_PATH=<your dataset file following the format>
 
 cat >./extra-llm-api-config.yml<<EOF
 cuda_graph_config: {}
-moe_backend: TRTLLM
+moe_config:
+  backend: TRTLLM
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3
@@ -196,7 +197,7 @@ We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 896
   - 512
@@ -263,7 +264,7 @@ YOUR_DATA_PATH=./dataset.txt
 
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2

docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md

Lines changed: 4 additions & 2 deletions
@@ -124,7 +124,8 @@ YOUR_DATA_PATH=<your dataset file following the format>
 
 cat >./extra-llm-api-config.yml<<EOF
 cuda_graph_config: {}
-moe_backend: TRTLLM
+moe_config:
+  backend: TRTLLM
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3
@@ -179,7 +180,8 @@ YOUR_DATA_PATH=<your dataset file following the format>
 
 cat >./extra-llm-api-config.yml<<EOF
 cuda_graph_config: {}
-moe_backend: TRTLLM
+moe_config:
+  backend: TRTLLM
 speculative_config:
   decoding_type: MTP
   num_nextn_predict_layers: 3

docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ These optimizations target the overall execution flow, scheduling, and resource
 
 There is a feature called CUDA Graph padding in TensorRT-LLM, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio; it tries to pad a batch to the nearest one with a captured CUDA Graph. Normally you should enable the CUDA Graph padding feature to increase the CUDA Graph hit rate, but the padding itself has some overhead due to wasted tokens computation.
 
-Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n padding_enabled: False`, see API here [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41)
+Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n enable_padding: False`, see API here [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41)
 
 * Overlap Scheduler:
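
With the renamed field, the opt-out described in the diff above is spelled as follows in an extra-llm-api-config.yml file; this is a minimal sketch of just the relevant key.

```yaml
cuda_graph_config:
  enable_padding: false   # opt out of CUDA Graph padding (formerly padding_enabled)
```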

docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md

Lines changed: 2 additions & 1 deletion
@@ -623,7 +623,8 @@ Run 36-way expert parallelism inference with the EPLB configuration incorporated:
 cat > ./extra_llm_api_options_eplb.yaml <<EOF
 enable_attention_dp: true
 cuda_graph_config: {}
-moe_load_balancer: ./moe_load_balancer.yaml
+moe_config:
+  load_balancer: ./moe_load_balancer.yaml
 EOF
 
 trtllm-llmapi-launch \

docs/source/performance/perf-overview.md

Lines changed: 1 addition & 1 deletion
@@ -201,7 +201,7 @@ trtllm-bench --model $model_name throughput --dataset $dataset_file --backend py
 `llm_options.yml`
 ```yaml
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2

docs/source/scripts/disaggregated/gen_yaml.py

Lines changed: 4 additions & 2 deletions
@@ -190,12 +190,14 @@ def gen_config_file(config_path: str,
         'max_seq_len': 8576,
         'free_gpu_memory_fraction': gen_gpu_memory_fraction,
         'cuda_graph_config': {
-            'padding_enabled': True,
+            'enable_padding': True,
             'batch_sizes': gen_cuda_graph_batch_sizes,
         },
         'print_iter_log': True,
         'kv_cache_dtype': 'fp8',
-        'moe_backend': 'TRTLLM',
+        'moe_config': {
+            'backend': 'TRTLLM',
+        },
         'cache_transceiver_config': {
             'max_num_tokens': 8320,
         },

examples/llm-api/quickstart_advanced.py

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@
 
 from tensorrt_llm import LLM, SamplingParams
 from tensorrt_llm.llmapi import (CudaGraphConfig, DraftTargetDecodingConfig,
-                                 EagleDecodingConfig, KvCacheConfig,
+                                 EagleDecodingConfig, KvCacheConfig, MoeConfig,
                                  MTPDecodingConfig, NGramDecodingConfig,
                                  TorchCompileConfig)
 
@@ -188,7 +188,7 @@ def setup_llm(args):
 
     cuda_graph_config = CudaGraphConfig(
         batch_sizes=args.cuda_graph_batch_sizes,
-        padding_enabled=args.cuda_graph_padding_enabled,
+        enable_padding=args.cuda_graph_padding_enabled,
     ) if args.use_cuda_graph else None
     llm = LLM(
         model=args.model_dir,
@@ -207,7 +207,7 @@ def setup_llm(args):
             enable_piecewise_cuda_graph= \
                 args.use_piecewise_cuda_graph)
         if args.use_torch_compile else None,
-        moe_backend=args.moe_backend,
+        moe_config=MoeConfig(backend=args.moe_backend),
         enable_trtllm_sampler=args.enable_trtllm_sampler,
         max_seq_len=args.max_seq_len,
         max_batch_size=args.max_batch_size,

examples/models/core/deepseek_v3/README.md

Lines changed: 7 additions & 6 deletions
@@ -142,7 +142,7 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes: [1, 4, 8, 12]
 EOF
 
@@ -169,9 +169,10 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes: [1, 2]
-moe_max_num_tokens: 16384
+moe_config:
+  max_num_tokens: 16384
 EOF
 
 trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
@@ -237,7 +238,7 @@ To serve the model using `trtllm-serve`:
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2
@@ -316,7 +317,7 @@ export TRTLLM_USE_UCX_KVCACHE=1
 
 cat >./gen-extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2
@@ -539,7 +540,7 @@ python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
 
 cat >/path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2
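
Taken together with the other doc changes in this commit, the options previously spelled moe_backend, moe_max_num_tokens, and moe_load_balancer now nest under a single moe_config block. The sketch below simply combines the values used in the examples this commit touches and is illustrative rather than a required configuration.

```yaml
moe_config:
  backend: TRTLLM                           # formerly moe_backend
  max_num_tokens: 16384                     # formerly moe_max_num_tokens
  load_balancer: ./moe_load_balancer.yaml   # formerly moe_load_balancer
```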

examples/models/core/llama4/README.md

Lines changed: 4 additions & 4 deletions
@@ -29,15 +29,15 @@ enable_attention_dp: true
 stream_interval: 2
 cuda_graph_config:
   max_batch_size: 512
-  padding_enabled: true
+  enable_padding: true
 EOF
 ```
 Explanation:
 - `enable_attention_dp`: Enable attention Data Parallel which is recommend to enable in high concurrency.
 - `stream_interval`: The iteration interval to create responses under the streaming mode.
 - `cuda_graph_config`: CUDA Graph config.
   - `max_batch_size`: Max CUDA graph batch size to capture.
-  - `padding_enabled`: Whether to enable CUDA graph padding.
+  - `enable_padding`: Whether to enable CUDA graph padding.
 
 
 #### 2. Launch trtllm-serve OpenAI-compatible API server
@@ -81,7 +81,7 @@ enable_min_latency: true
 stream_interval: 2
 cuda_graph_config:
   max_batch_size: 8
-  padding_enabled: true
+  enable_padding: true
 EOF
 ```
 Explanation:
@@ -90,7 +90,7 @@ Explanation:
 - `stream_interval`: The iteration interval to create responses under the streaming mode.
 - `cuda_graph_config`: CUDA Graph config.
   - `max_batch_size`: Max CUDA graph batch size to capture.
-  - `padding_enabled`: Whether to enable CUDA graph padding.
+  - `enable_padding`: Whether to enable CUDA graph padding.
 
 
 #### 2. Launch trtllm-serve OpenAI-compatible API server

examples/models/core/qwen/README.md

Lines changed: 2 additions & 2 deletions
@@ -745,7 +745,7 @@ To serve the model using `trtllm-serve`:
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2
@@ -821,7 +821,7 @@ export TRTLLM_USE_UCX_KVCACHE=1
 
 cat >./gen-extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2
