
[Perf] Add TRTLLM FP8 MoE Modular Kernel#36307

Merged
vllm-bot merged 11 commits into vllm-project:main from wzhao18:wzhao/fp8-trtllm-modular-moe
Mar 12, 2026

Conversation


@wzhao18 wzhao18 commented Mar 7, 2026

Purpose

Add a TRTLLM FP8 MoE modular kernel. The trtllm monolithic kernel has restrictions on the model's routing method. This PR enables more models (e.g., MiniMax M2) to use the trtllm MoE backend by adding the modular version.

Test Plan

Tested with MiniMax-M2.5 (which cannot use the trtllm MoE monolithic backend due to its routing-method restriction):

vllm serve MiniMaxAI/MiniMax-M2.5 \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --enable-expert-parallel

Note: to get MiniMax to run with trtllm MoE, MiniMax's router_logits_dtype must be set to bfloat16, as the trtllm backend only supports bfloat16. This restriction should go away soon with a flashinfer PR. (Edit 03/17: this is not true for the modular kernel — it does not enforce the router logits dtype, since routing is done externally. The constraint should be removed.)
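The local workaround described in the note can be sketched as a config override. This is illustrative only — the field name `router_logits_dtype` comes from the discussion above, but the function and its call site are hypothetical, not vLLM's actual config API:

```python
# Hypothetical sketch of the bfloat16 workaround for the monolithic kernel.
# The trtllm monolithic MoE kernel only accepts bfloat16 router logits; the
# modular kernel routes externally, so it has no such constraint.
def patch_router_logits_dtype(hf_config: dict, backend: str) -> dict:
    """Coerce float32 router logits to bfloat16 for the trtllm backend."""
    if backend == "trtllm" and hf_config.get("router_logits_dtype") == "float32":
        # Return a copy so the original model config is left untouched.
        hf_config = dict(hf_config, router_logits_dtype="bfloat16")
    return hf_config
```

Other backends would pass through unchanged, which is why the edit note argues the constraint should be dropped for the modular path entirely.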

Test Result

1K/1K TP=2 Benchmark on B200:
isl1024_osl1024


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the nvidia label Mar 7, 2026
@wzhao18 wzhao18 force-pushed the wzhao/fp8-trtllm-modular-moe branch from 6014cf2 to 4efbb10 on March 7, 2026 04:24

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a modular FP8 MoE kernel for the TRTLLM backend, which is a valuable addition for improving model compatibility. The refactoring of the existing monolithic kernel into a shared base class is well-executed and enhances the code structure. My review identifies two potential issues: a possible null pointer exception from an unchecked optional parameter and a hardcoded value that could restrict the new kernel's flexibility. Addressing these points will help ensure the implementation is robust and fully achieves its intended goal.
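The first issue the bot flags — dereferencing an unchecked optional parameter — follows a familiar pattern. A defensive sketch (the function and parameter names here are hypothetical, chosen to echo the kernel's `routed_scaling_factor` argument, not the PR's actual code):

```python
from typing import Optional

def finalize_weights(topk_weights: list[float],
                     routed_scaling_factor: Optional[float]) -> list[float]:
    """Scale top-k routing weights, tolerating a missing scaling factor."""
    # Guard the optional before use: an unchecked
    # `routed_scaling_factor * w` would raise TypeError when it is None.
    scale = 1.0 if routed_scaling_factor is None else routed_scaling_factor
    return [w * scale for w in topk_weights]
```

The fix the bot asks for is exactly this kind of None check (or an explicit assertion) before the value is consumed.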


wzhao18 commented Mar 7, 2026

@robertgshaw2-redhat Could you help review this PR when you get a chance?

@wzhao18 wzhao18 changed the title Add TRTLLM FP8 MoE Modular Kernel [Perf] Add TRTLLM FP8 MoE Modular Kernel Mar 7, 2026

mergify bot commented Mar 10, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wzhao18.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 10, 2026
wzhao18 added 6 commits March 10, 2026 17:51
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 force-pushed the wzhao/fp8-trtllm-modular-moe branch from 4efbb10 to e454522 on March 10, 2026 17:55
@mergify mergify bot removed the needs-rebase label Mar 10, 2026

mergify bot commented Mar 10, 2026

Hi @wzhao18, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
mergify bot commented Mar 10, 2026

(Same pre-commit failure notice as above.)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 10, 2026

mgoin commented Mar 10, 2026

Note: to get Minimax to run with trtllm moe, it is required to set minimax's router_logits_dtype to bfloat16, as trtllm backend only supports this. This restriction should go away soon with flashinfer-ai/flashinfer#2534.

@wzhao18 can you please check the accuracy of the model with this change? I originally opened the issue because of this bfloat16 issue I think flashinfer-ai/flashinfer#2469


wzhao18 commented Mar 10, 2026

@mgoin I did not actually change the router logits dtype for MiniMax in this PR. I only changed it locally for testing, as support for float32 router logits has not yet been merged. Once it is supported in FlashInfer, we need to update _supports_router_logits_dtype.
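A minimal sketch of what such a capability hook could look like — the method name `_supports_router_logits_dtype` appears in the comment above, but the class name, set contents, and body here are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch of a dtype-capability hook for the trtllm FP8 MoE
# backend. Once FlashInfer merges float32 router-logits support, "float32"
# would simply be added to the supported set.
class TrtllmFp8MoeBackend:
    _SUPPORTED_ROUTER_LOGITS_DTYPES = {"bfloat16"}

    @classmethod
    def _supports_router_logits_dtype(cls, dtype: str) -> bool:
        """Report whether this backend can consume router logits of `dtype`."""
        return dtype in cls._SUPPORTED_ROUTER_LOGITS_DTYPES
```

Keeping the check in one place means the MiniMax restriction can be lifted with a one-line change when the FlashInfer support lands.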

wzhao18 added 2 commits March 10, 2026 22:54
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 requested a review from tlrmchlsmth as a code owner March 10, 2026 22:55

@mgoin mgoin left a comment


LGTM as an interim step then, thanks!

topk_ids=packed_topk_ids,
routing_bias=None,
hidden_states=hidden_states,
hidden_states_scale=a1q_scale.t().contiguous(), # type: ignore[union-attr]
Member

Could we fuse this from the activation quant before? This could be quite slow. Worth considering when fixing the output issue
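The `.t().contiguous()` call quoted above materializes a full transposed copy of the activation scales on every forward pass, which is the cost the reviewer suggests fusing into the preceding activation-quant step. A pure-Python sketch of the equivalent copy (illustrative only; the real operation is a PyTorch tensor op):

```python
def transpose_copy(scale: list[list[float]]) -> list[list[float]]:
    """Equivalent of `a1q_scale.t().contiguous()` on a 2-D nested list."""
    # Allocate a fresh buffer and copy every element into transposed order.
    # Emitting the scales already transposed from the quant kernel would
    # eliminate this extra pass over the data entirely.
    rows, cols = len(scale), len(scale[0])
    return [[scale[r][c] for r in range(rows)] for c in range(cols)]
```
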

Contributor Author

Will keep a note about it. Thanks for pointing this out.

local_expert_offset=self.ep_rank * self.local_num_experts,
local_num_experts=self.local_num_experts,
routed_scaling_factor=None,
routing_method_type=1,
Member

Can you note that this is ignored in this case?

Contributor Author

Will do in a future PR, then.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 12, 2026
@vllm-bot vllm-bot merged commit 2e693f4 into vllm-project:main Mar 12, 2026
53 of 55 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 12, 2026
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 16, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

3 participants