[Perf] Add TRTLLM FP8 MoE Modular Kernel#36307
Conversation
Force-pushed from 6014cf2 to 4efbb10
Code Review
This pull request introduces a modular FP8 MoE kernel for the TRTLLM backend, which is a valuable addition for improving model compatibility. The refactoring of the existing monolithic kernel into a shared base class is well-executed and enhances the code structure. My review identifies two potential issues: a possible null pointer exception from an unchecked optional parameter and a hardcoded value that could restrict the new kernel's flexibility. Addressing these points will help ensure the implementation is robust and fully achieves its intended goal.
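As a hedged illustration of the review's first point, an unchecked `Optional` parameter can be guarded before use. Names here (`apply_fp8_moe`, `a1q_scale`) are illustrative stand-ins, not the actual vLLM signatures:

```python
from typing import Optional

def apply_fp8_moe(hidden_states: list, a1q_scale: Optional[list] = None) -> list:
    # Fail fast instead of dereferencing a possibly-None optional later on.
    # Function and parameter names are hypothetical, for illustration only.
    if a1q_scale is None:
        raise ValueError("a1q_scale must be provided for the FP8 MoE path")
    return [h * s for h, s in zip(hidden_states, a1q_scale)]
```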
@robertgshaw2-redhat Could you help review this PR when you get a chance?
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Force-pushed from 4efbb10 to e454522
Hi @wzhao18, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
@wzhao18 can you please check the accuracy of the model with this change? I originally opened the issue because of this bfloat16 issue, I think: flashinfer-ai/flashinfer#2469
@mgoin I did not actually change the router logits dtype for Minimax in this PR. I only changed it locally for testing, as support for float32 router logits has not yet been merged. After this is supported in FI, we need to change
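A minimal sketch of the local-only workaround described above: coerce the router logits dtype to the one the backend currently accepts. The supported set and helper name are assumptions for illustration, not vLLM or FlashInfer API:

```python
# Assumption for illustration: TRTLLM currently accepts only bfloat16
# router logits; float32 support is pending in FlashInfer.
SUPPORTED_ROUTER_LOGITS_DTYPES = {"bfloat16"}

def coerce_router_logits_dtype(dtype: str) -> str:
    # Force an unsupported router-logits dtype down to bfloat16.
    # Hypothetical helper; dtype is a plain string stand-in for torch.dtype.
    if dtype not in SUPPORTED_ROUTER_LOGITS_DTYPES:
        return "bfloat16"
    return dtype
```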
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
mgoin left a comment:
LGTM as an interim step then, thanks!
topk_ids=packed_topk_ids,
routing_bias=None,
hidden_states=hidden_states,
hidden_states_scale=a1q_scale.t().contiguous(),  # type: ignore[union-attr]
Could we fuse this from the activation quant before? This could be quite slow. Worth considering when fixing the output issue
Will keep a note about it. Thanks for pointing this out.
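The cost the reviewer is flagging: `.t().contiguous()` materializes a full transposed copy of the scale tensor on every call, a pass that could be avoided if the activation-quant step emitted the scales already in the transposed layout. A plain-Python stand-in (illustrative only, not the torch implementation) makes the extra copy explicit:

```python
def transpose_copy(matrix):
    # Materialize a transposed copy -- the extra O(rows * cols) pass that
    # .t().contiguous() performs on a non-contiguous view. Emitting scales
    # pre-transposed from the quant kernel would skip this copy entirely.
    # Plain nested lists stand in for tensors (illustrative sketch).
    return [list(col) for col in zip(*matrix)]
```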
local_expert_offset=self.ep_rank * self.local_num_experts,
local_num_experts=self.local_num_experts,
routed_scaling_factor=None,
routing_method_type=1,
Can you note that this is ignored in this case?
Will do in a future PR then.
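For context on the `local_expert_offset` argument above: under expert parallelism, each EP rank owns a contiguous slice of the global expert ids. A minimal sketch, assuming this contiguous layout (helper name is hypothetical):

```python
def local_expert_range(ep_rank: int, local_num_experts: int) -> range:
    # Each EP rank owns experts [offset, offset + local_num_experts),
    # which is exactly how local_expert_offset is computed in the call
    # above: ep_rank * local_num_experts. Illustrative helper only.
    offset = ep_rank * local_num_experts
    return range(offset, offset + local_num_experts)
```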
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
Purpose
Add TRTLLM FP8 MoE Modular Kernel. The TRTLLM monolithic kernel restricts which routing methods a model may use. This PR adds a modular version of the kernel, enabling more models (e.g., MiniMax M2) to use the TRTLLM MoE backend.
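To illustrate why the modular split lifts the routing restriction: routing is computed outside the kernel, so any top-k scheme can feed the expert compute. A toy sketch, with scalars standing in for tensors (not the actual kernel):

```python
def modular_moe(hidden, logits, experts, top_k=2):
    # Routing happens here, outside the "kernel": any scheme producing
    # top-k ids and weights works, unlike the monolithic path which bakes
    # the routing method in. Scalar toy sketch, not the TRTLLM kernel.
    ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    topk = ranked[:top_k]
    weights = [logits[e] for e in topk]
    total = sum(weights)
    # Only the expert compute is delegated to the (stand-in) experts.
    return sum((w / total) * experts[e](hidden) for e, w in zip(topk, weights))
```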
Test Plan
Tested with Minimax-m2.5 (which cannot use the TRTLLM MoE monolithic backend due to the routing method restriction).
Note: to get Minimax to run with TRTLLM MoE, minimax's router_logits_dtype must be set to bfloat16, as the TRTLLM backend only supports this. This restriction should go away soon with a flashinfer PR. (Edit 03/17: this is not true; the modular kernel does not enforce the router logits dtype, as routing is done externally. The constraint should be removed.)
Test Result
1K/1K TP=2 Benchmark on B200:

Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.