[Bugfix] Update TrtLLM MoE routing methods by wzhao18 · Pull Request #44347 · vllm-project/vllm

wzhao18 · 2026-06-02T18:34:04Z

Purpose

This PR introduces several fixes to the TrtLLM MoE routing methods:

- Revert the _supports_router_logits_dtype change from "[Model] Support Step-3.7-Flash" (#43859), which caused a regression in nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 (see the CI failure):

  ValueError: FP8 MoE backend FLASHINFER_TRTLLM does not support the deployment configuration since kernel does not support quantization scheme QuantKey(f8e4m3fn,scale(f32,static,per_tensor),symmetric)xQuantKey(f8e4m3fn,scale(f32,static,per_tensor),symmetric).

- Update RoutingMethodType to match flashinfer.
- Refine get_routing_method_type so that stepfun-ai/Step-3.7-Flash is no longer categorized as using Minimax-M2 routing.
- Remove the cast of e_score_correction_bias, since the underlying issue has been fixed in flashinfer.
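The misclassification described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the RoutingKind enum, the RouterConfig fields, and both predicates are illustrative stand-ins, not vLLM's get_routing_method_type or flashinfer's RoutingMethodType, and the two example configs do not describe any real model. It shows only the general failure mode, where an overly broad predicate claims a router that needs a different path, and how extra conditions avoid that.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RoutingKind(Enum):
    # Hypothetical categories; these are NOT flashinfer's RoutingMethodType values.
    SIGMOID_BIAS_RENORM = auto()
    UNSPECIFIED = auto()


@dataclass(frozen=True)
class RouterConfig:
    # Hypothetical fields describing a model's router (assumed for illustration).
    scoring_func: str
    has_correction_bias: bool
    renormalize: bool
    num_expert_groups: int = 1


def classify_loose(cfg: RouterConfig) -> RoutingKind:
    # Too broad: any sigmoid router with a correction bias is claimed.
    if cfg.scoring_func == "sigmoid" and cfg.has_correction_bias:
        return RoutingKind.SIGMOID_BIAS_RENORM
    return RoutingKind.UNSPECIFIED


def classify_strict(cfg: RouterConfig) -> RoutingKind:
    # Refined: every property the specialized kernel path relies on must hold.
    if (
        cfg.scoring_func == "sigmoid"
        and cfg.has_correction_bias
        and cfg.renormalize
        and cfg.num_expert_groups == 1
    ):
        return RoutingKind.SIGMOID_BIAS_RENORM
    # Anything else falls through so the caller can pick another path.
    return RoutingKind.UNSPECIFIED


# Two made-up router configs that differ only in renormalization.
model_a = RouterConfig("sigmoid", True, True)
model_b = RouterConfig("sigmoid", True, False)

print(classify_loose(model_b), classify_strict(model_b))
```

With the loose predicate both configs land in the same category; the strict one keeps model_a there and returns UNSPECIFIED for model_b.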

Test Plan

GSM8k eval:
- DeepSeek R1 (fp8 and nvfp4)
- Minimax M2.7 (fp8 and nvfp4)
- Nemotron-3-Nano (fp8)
- Step-3.7-Flash (bf16, fp8 and nvfp4)

Test Result

# Nemotron-3-Nano
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 --enforce-eager --max-model-len 8192 --tensor-parallel-size 2 --moe-backend=flashinfer_trtllm --trust-remote-code

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5648|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.8514|±  |0.0098|

# DeepSeek R1 FP8
vllm serve deepseek-ai/DeepSeek-R1 --trust-remote-code --tensor-parallel-size 8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9575|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9560|±  |0.0056|

# DeepSeek R1 NVFP4
vllm serve nvidia/DeepSeek-R1-NVFP4 --trust-remote-code --tensor-parallel-size 8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9409|±  |0.0065|
|     |       |strict-match    |     5|exact_match|↑  |0.9386|±  |0.0066|

# Minimax M2 FP8
vllm serve MiniMaxAI/MiniMax-M2.5 --trust-remote-code --tensor-parallel-size 4

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9204|±  |0.0075|
|     |       |strict-match    |     5|exact_match|↑  |0.9174|±  |0.0076|

# Minimax M2 NVFP4
vllm serve nvidia/MiniMax-M2.5-NVFP4 --trust-remote-code --tensor-parallel-size 4

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9318|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.9280|±  |0.0071|

# Step-3.7-Flash BF16
vllm serve stepfun-ai/Step-3.7-Flash \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --trust-remote-code

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8901|±  |0.0086|
|     |       |strict-match    |     5|exact_match|↑  |0.8901|±  |0.0086|

# Step-3.7-Flash FP8
vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --trust-remote-code

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8855|±  |0.0088|
|     |       |strict-match    |     5|exact_match|↑  |0.8840|±  |0.0088|

# Step-3.7-Flash NVFP4
python3 -m vllm.entrypoints.openai.api_server \
  --model stepfun-ai/Step-3.7-Flash-NVFP4 \
  --served-model-name step3p7 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --trust-remote-code \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --async-scheduling

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8817|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8817|±  |0.0089|
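All result tables above share the same layout, so the scores can be pulled out and compared programmatically. The following is a minimal sketch, not part of the PR: parse_rows and z_score are hypothetical helper names, and the z-score assumes the two runs are independent, which is only a rough approximation when both are evaluated on the same GSM8k questions.

```python
import math

# The DeepSeek R1 FP8 table, copied from the results above.
FP8_TABLE = """
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9575|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9560|±  |0.0056|
"""


def parse_rows(table: str) -> dict[str, tuple[float, float]]:
    """Map each filter name to its (value, stderr) pair."""
    results = {}
    for line in table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 9:
            continue
        try:
            value, stderr = float(cells[6]), float(cells[8])
        except ValueError:
            # Header and separator rows have no numeric value column.
            continue
        results[cells[2]] = (value, stderr)
    return results


def z_score(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Difference of two scores in units of their combined standard error."""
    return (a[0] - b[0]) / math.sqrt(a[1] ** 2 + b[1] ** 2)


fp8 = parse_rows(FP8_TABLE)
# Flexible-extract score of the DeepSeek R1 NVFP4 run, from the table above.
nvfp4_flexible = (0.9409, 0.0065)
print(fp8)
print(z_score(fp8["flexible-extract"], nvfp4_flexible))
```

A small absolute z-score suggests the gap between two runs is within evaluation noise; a large one is a reason to look closer before attributing the difference to the quantization format.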

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>

Update trtllm routing methods - fix fp8 and nvfp4 backends

6960252

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>

wzhao18 force-pushed the wzhao/update-trtllm-routing-methods branch from ca30cc7 to 6960252 June 2, 2026 19:27

wzhao18 changed the title ~~Update TrtLLM MoE routing methods~~ [Bugfix] Update TrtLLM MoE routing methods Jun 2, 2026

mergify Bot added the nvidia and bug labels Jun 2, 2026

github-project-automation Bot added this to NVIDIA Jun 2, 2026

wzhao18 marked this pull request as ready for review June 3, 2026 01:42

wzhao18 requested review from mgoin, pavanimajety and zyongye as code owners June 3, 2026 01:42

mgoin added the ready label Jun 3, 2026

jeejeelee approved these changes Jun 3, 2026

github-project-automation Bot moved this to Ready in NVIDIA Jun 3, 2026

Merge branch 'main' into wzhao/update-trtllm-routing-methods

796173d

vllm-bot merged commit ace95c9 into vllm-project:main Jun 3, 2026
74 of 76 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 3, 2026

vllm-bot merged 2 commits into vllm-project:main from wzhao18:wzhao/update-trtllm-routing-methods
