[Bugfix] Update TrtLLM MoE routing methods #44347

Merged
vllm-bot merged 2 commits into vllm-project:main from wzhao18:wzhao/update-trtllm-routing-methods on Jun 3, 2026

wzhao18 (Contributor) commented Jun 2, 2026

Purpose

This PR fixes several issues in the TrtLLM MoE routing methods:

  • Fix the following error:
    ValueError: FP8 MoE backend FLASHINFER_TRTLLM does not support the deployment configuration since kernel does not support quantization scheme QuantKey(f8e4m3fn,scale(f32,static,per_tensor),symmetric)xQuantKey(f8e4m3fn,scale(f32,static,per_tensor),symmetric).
  • Update RoutingMethodType to match flashinfer.
  • Refine get_routing_method_type so that stepfun-ai/Step-3.7-Flash is no longer classified as using MiniMax-M2 routing.
  • Remove the cast of e_score_correction_bias, which is no longer needed now that the issue is fixed in flashinfer.

Test Plan

  • GSM8k eval:
    • DeepSeek R1 (fp8 and nvfp4)
    • MiniMax M2.7 (fp8 and nvfp4)
    • Nemotron-3-Nano (fp8)
    • Step-3.7-Flash (bf16, fp8, and nvfp4)
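
The PR does not show the evaluation command itself. The result tables are in the format printed by lm-evaluation-harness, so one plausible way to reproduce a 5-shot GSM8k run against a server started with one of the serve commands is sketched here. The endpoint URL, the concurrency value, and the helper names build_model_args and run_gsm8k are assumptions for illustration and are not taken from the PR.

```python
def build_model_args(model: str, base_url: str, num_concurrent: int = 32) -> str:
    """Build the comma-separated model_args string for the local-completions backend."""
    return (
        f"model={model},"
        f"base_url={base_url},"
        f"num_concurrent={num_concurrent},"
        "tokenized_requests=False"
    )


def run_gsm8k(model: str, base_url: str = "http://localhost:8000/v1/completions"):
    """Run 5-shot GSM8k against an already running OpenAI-compatible server.

    The default base_url is an assumption (vLLM's default port on localhost).
    """
    # Imported lazily so build_model_args can be used without lm-eval installed.
    import lm_eval

    return lm_eval.simple_evaluate(
        model="local-completions",
        model_args=build_model_args(model, base_url),
        tasks=["gsm8k"],
        num_fewshot=5,
    )
```

Calling run_gsm8k("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8") requires lm-evaluation-harness to be installed and the server to be reachable; the per-task metrics should then be available under the "results" key of the returned dictionary.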

Test Result

# Nemotron-3-Nano
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 --enforce-eager --max-model-len 8192 --tensor-parallel-size 2 --moe-backend=flashinfer_trtllm --trust-remote-code

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5648|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.8514|±  |0.0098|

# DeepSeek R1 FP8
vllm serve deepseek-ai/DeepSeek-R1 --trust-remote-code --tensor-parallel-size 8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9575|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9560|±  |0.0056|

# DeepSeek R1 NVFP4
vllm serve nvidia/DeepSeek-R1-NVFP4 --trust-remote-code --tensor-parallel-size 8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9409|±  |0.0065|
|     |       |strict-match    |     5|exact_match|↑  |0.9386|±  |0.0066|

# Minimax M2 FP8
vllm serve MiniMaxAI/MiniMax-M2.5 --trust-remote-code --tensor-parallel-size 4

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9204|±  |0.0075|
|     |       |strict-match    |     5|exact_match|↑  |0.9174|±  |0.0076|

# Minimax M2 NVFP4
vllm serve nvidia/MiniMax-M2.5-NVFP4 --trust-remote-code --tensor-parallel-size 4

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9318|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.9280|±  |0.0071|

# Step-3.7-Flash BF16
vllm serve stepfun-ai/Step-3.7-Flash \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --trust-remote-code

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8901|±  |0.0086|
|     |       |strict-match    |     5|exact_match|↑  |0.8901|±  |0.0086|

# Step-3.7-Flash FP8
vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --trust-remote-code

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8855|±  |0.0088|
|     |       |strict-match    |     5|exact_match|↑  |0.8840|±  |0.0088|

# Step-3.7-Flash NVFP4
python3 -m vllm.entrypoints.openai.api_server \
  --model stepfun-ai/Step-3.7-Flash-NVFP4 \
  --served-model-name step3p7 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --trust-remote-code \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --async-scheduling

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8817|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8817|±  |0.0089|
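
One rough way to read the three Step-3.7-Flash tables above is to ask whether the quantized scores differ from BF16 by more than the reported standard errors. The sketch below divides the difference in flexible-extract exact_match by the combined standard error. It treats the runs as independent, which is an assumption: all runs share the same GSM8k test set, so this is only a rough guide. The function name z_score is illustrative.

```python
import math


def z_score(value_a: float, stderr_a: float, value_b: float, stderr_b: float) -> float:
    """Difference between two scores in units of their combined standard error."""
    return (value_a - value_b) / math.sqrt(stderr_a**2 + stderr_b**2)


# (exact_match, stderr) for the flexible-extract filter, copied from the tables above.
bf16 = (0.8901, 0.0086)
fp8 = (0.8855, 0.0088)
nvfp4 = (0.8817, 0.0089)

for name, (value, stderr) in {"FP8": fp8, "NVFP4": nvfp4}.items():
    z = z_score(bf16[0], bf16[1], value, stderr)
    print(f"BF16 vs {name}: diff={bf16[0] - value:.4f}, z={z:.2f}")
```

Both ratios come out below 1, so under this simplified reading each quantized score lies within one combined standard error of the BF16 score.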

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as a before/after comparison of results, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
wzhao18 force-pushed the wzhao/update-trtllm-routing-methods branch from ca30cc7 to 6960252 June 2, 2026 19:27
wzhao18 changed the title from "Update TrtLLM MoE routing methods" to "[Bugfix] Update TrtLLM MoE routing methods" Jun 2, 2026
mergify bot added the nvidia and bug labels Jun 2, 2026
wzhao18 marked this pull request as ready for review June 3, 2026 01:42
mgoin added the ready label Jun 3, 2026
github-project-automation bot moved this to Ready in NVIDIA Jun 3, 2026
vllm-bot merged commit ace95c9 into vllm-project:main Jun 3, 2026
74 of 76 checks passed
github-project-automation bot moved this from Ready to Done in NVIDIA Jun 3, 2026

Labels
bug, nvidia, ready

Projects
Status: Done

4 participants