
[Perf] Add TRTLLM FP8 MoE Modular Kernel#36307

Merged
vllm-bot merged 11 commits into vllm-project:main from wzhao18:wzhao/fp8-trtllm-modular-moe
Mar 12, 2026

Conversation


@wzhao18 wzhao18 commented Mar 7, 2026

Purpose

Add a TRTLLM FP8 MoE modular kernel. The trtllm monolithic kernel has restrictions on the model's routing method. This PR enables more models (e.g., MiniMax M2) to use the trtllm MoE backend by adding the modular version.

Test Plan

Tested with MiniMax-M2.5 (which cannot use the trtllm MoE monolithic backend due to its routing-method restriction):

vllm serve MiniMaxAI/MiniMax-M2.5 \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --enable-expert-parallel

Note: to get MiniMax to run with trtllm MoE, MiniMax's router_logits_dtype must be set to bfloat16, as the trtllm backend only supports bfloat16. This restriction should go away soon with a flashinfer PR. (Edit 03/17: this is not true for the modular kernel — it does not enforce the router logits dtype, since routing is done externally. The constraint should be removed.)
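The local workaround described in the note can be sketched as a config override. This is illustrative only — the field name `router_logits_dtype` comes from the discussion above, but the function and its call site are hypothetical, not vLLM's actual config API:

```python
# Hypothetical sketch of the bfloat16 workaround for the monolithic kernel.
# The trtllm monolithic MoE kernel only accepts bfloat16 router logits; the
# modular kernel routes externally, so it has no such constraint.
def patch_router_logits_dtype(hf_config: dict, backend: str) -> dict:
    """Coerce float32 router logits to bfloat16 for the trtllm backend."""
    if backend == "trtllm" and hf_config.get("router_logits_dtype") == "float32":
        # Return a copy so the original model config is left untouched.
        hf_config = dict(hf_config, router_logits_dtype="bfloat16")
    return hf_config
```

Other backends would pass through unchanged, which is why the edit note argues the constraint should be dropped for the modular path entirely.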

Test Result

1K/1K TP=2 Benchmark on B200:
isl1024_osl1024


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the nvidia label Mar 7, 2026
@wzhao18 wzhao18 force-pushed the wzhao/fp8-trtllm-modular-moe branch from 6014cf2 to 4efbb10 on March 7, 2026 04:24

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a modular FP8 MoE kernel for the TRTLLM backend, which is a valuable addition for improving model compatibility. The refactoring of the existing monolithic kernel into a shared base class is well-executed and enhances the code structure. My review identifies two potential issues: a possible null pointer exception from an unchecked optional parameter and a hardcoded value that could restrict the new kernel's flexibility. Addressing these points will help ensure the implementation is robust and fully achieves its intended goal.
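The first issue the bot flags — dereferencing an unchecked optional parameter — follows a familiar pattern. A defensive sketch (the function and parameter names here are hypothetical, chosen to echo the kernel's `routed_scaling_factor` argument, not the PR's actual code):

```python
from typing import Optional

def finalize_weights(topk_weights: list[float],
                     routed_scaling_factor: Optional[float]) -> list[float]:
    """Scale top-k routing weights, tolerating a missing scaling factor."""
    # Guard the optional before use: an unchecked
    # `routed_scaling_factor * w` would raise TypeError when it is None.
    scale = 1.0 if routed_scaling_factor is None else routed_scaling_factor
    return [w * scale for w in topk_weights]
```

The fix the bot asks for is exactly this kind of None check (or an explicit assertion) before the value is consumed.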


wzhao18 commented Mar 7, 2026

@robertgshaw2-redhat Could you help review this PR when you get a chance?

@wzhao18 wzhao18 changed the title Add TRTLLM FP8 MoE Modular Kernel [Perf] Add TRTLLM FP8 MoE Modular Kernel Mar 7, 2026

mergify bot commented Mar 10, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wzhao18.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 10, 2026
wzhao18 added 6 commits March 10, 2026 17:51
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 force-pushed the wzhao/fp8-trtllm-modular-moe branch from 4efbb10 to e454522 on March 10, 2026 17:55
@mergify mergify bot removed the needs-rebase label Mar 10, 2026

mergify bot commented Mar 10, 2026

Hi @wzhao18, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
mergify bot commented Mar 10, 2026

(Same pre-commit failure notice as above.)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 10, 2026

mgoin commented Mar 10, 2026

Note: to get Minimax to run with trtllm moe, it is required to set minimax's router_logits_dtype to bfloat16, as trtllm backend only supports this. This restriction should go away soon with flashinfer-ai/flashinfer#2534.

@wzhao18 can you please check the accuracy of the model with this change? I originally opened the issue because of this bfloat16 issue I think flashinfer-ai/flashinfer#2469


wzhao18 commented Mar 10, 2026

@mgoin I did not actually change the router logits dtype for MiniMax in this PR. I only changed it locally for testing, as support for float32 router logits has not yet been merged. Once it is supported in FlashInfer, we need to update _supports_router_logits_dtype.
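A minimal sketch of what such a capability hook could look like — the method name `_supports_router_logits_dtype` appears in the comment above, but the class name, set contents, and body here are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch of a dtype-capability hook for the trtllm FP8 MoE
# backend. Once FlashInfer merges float32 router-logits support, "float32"
# would simply be added to the supported set.
class TrtllmFp8MoeBackend:
    _SUPPORTED_ROUTER_LOGITS_DTYPES = {"bfloat16"}

    @classmethod
    def _supports_router_logits_dtype(cls, dtype: str) -> bool:
        """Report whether this backend can consume router logits of `dtype`."""
        return dtype in cls._SUPPORTED_ROUTER_LOGITS_DTYPES
```

Keeping the check in one place means the MiniMax restriction can be lifted with a one-line change when the FlashInfer support lands.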

wzhao18 added 2 commits March 10, 2026 22:54
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 requested a review from tlrmchlsmth as a code owner March 10, 2026 22:55

@mgoin mgoin left a comment


LGTM as an interim step then, thanks!

topk_ids=packed_topk_ids,
routing_bias=None,
hidden_states=hidden_states,
hidden_states_scale=a1q_scale.t().contiguous(), # type: ignore[union-attr]
Member

Could we fuse this from the activation quant before? This could be quite slow. Worth considering when fixing the output issue
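The `.t().contiguous()` call quoted above materializes a full transposed copy of the activation scales on every forward pass, which is the cost the reviewer suggests fusing into the preceding activation-quant step. A pure-Python sketch of the equivalent copy (illustrative only; the real operation is a PyTorch tensor op):

```python
def transpose_copy(scale: list[list[float]]) -> list[list[float]]:
    """Equivalent of `a1q_scale.t().contiguous()` on a 2-D nested list."""
    # Allocate a fresh buffer and copy every element into transposed order.
    # Emitting the scales already transposed from the quant kernel would
    # eliminate this extra pass over the data entirely.
    rows, cols = len(scale), len(scale[0])
    return [[scale[r][c] for r in range(rows)] for c in range(cols)]
```
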

Contributor Author

Will keep a note about it. Thanks for pointing this out.

local_expert_offset=self.ep_rank * self.local_num_experts,
local_num_experts=self.local_num_experts,
routed_scaling_factor=None,
routing_method_type=1,
Member

Can you note that this is ignored in this case?

Contributor Author

Will do in a future PR, then.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 12, 2026
@vllm-bot vllm-bot merged commit 2e693f4 into vllm-project:main Mar 12, 2026
53 of 55 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 12, 2026
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 16, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

3 participants