Add Flashinfer trtllm moe to compressed tensor FP4 path #28090
Victor49152 wants to merge 290 commits into vllm-project:main from
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
# Determine routing method type
use_llama4_routing = (
    custom_routing_function is Llama4MoE.custom_routing_function
)
routing_method_type = flashinfer.RoutingMethodType.Renormalize
if use_llama4_routing:
    routing_method_type = flashinfer.RoutingMethodType.Llama4
```
Preserve DeepSeek routing mode in TRT-LLM helper
The new apply_flashinfer_trtllm_fp4_moe helper sets routing_method_type to flashinfer.RoutingMethodType.Renormalize by default and only switches to Llama4 when the custom routing function matches that model. Prior to this refactor the TRT‑LLM path in ModelOptNvFp4FusedMoE.apply defaulted to RoutingMethodType.DeepSeekV3, which is required for DeepSeek‑style routing. Because the helper is now used from the original code path, DeepSeek models will fall back to the generic renormalize kernel and route tokens incorrectly. Please keep the default routing type as DeepSeekV3 (or make it configurable) so existing DeepSeekV3 checkpoints continue to work.
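One way to keep the pre-refactor behavior is to make the default routing type a parameter of the helper rather than a hardcoded value. The sketch below is illustrative only: it uses a local stand-in enum and a stub routing function because the real `flashinfer.RoutingMethodType` and `Llama4MoE.custom_routing_function` are not reproduced here; `select_routing_method_type` is a hypothetical name.

```python
from enum import Enum


class RoutingMethodType(Enum):
    # Local stand-in for flashinfer.RoutingMethodType; only the members
    # named in this review thread are modeled.
    Renormalize = 1
    Llama4 = 2
    DeepSeekV3 = 3


def llama4_routing_stub(scores):
    # Placeholder for Llama4MoE.custom_routing_function (identity here).
    return scores


def select_routing_method_type(custom_routing_function,
                               default=RoutingMethodType.DeepSeekV3):
    # Keep the pre-refactor default (DeepSeekV3) and only switch when the
    # caller explicitly passes the Llama4-style routing function.
    if custom_routing_function is llama4_routing_stub:
        return RoutingMethodType.Llama4
    return default
```

With this shape, DeepSeek-style checkpoints keep their required routing mode by default, while callers that need a different kernel can pass `default=` explicitly.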
Code Review
This pull request adds support for Flashinfer's TensorRT-LLM FP4 MoE kernels to the compressed tensor path. This involves adding a new backend option and the corresponding logic for weight preparation and kernel invocation. The changes include refactoring the MoE weight processing to handle different backends (TensorRT-LLM, Marlin, Cutlass) and moving some shared logic into utility functions. My review focuses on the maintainability and robustness of the integration with the flashinfer library. I've identified a couple of areas where the implementation relies on internal details of the flashinfer library, which could lead to future breakages.
```python
from flashinfer.fused_moe.core import (
    _maybe_get_cached_w3_w1_permute_indices,
    get_w2_permute_indices_with_cache,
)
```
The code imports _maybe_get_cached_w3_w1_permute_indices from flashinfer.fused_moe.core. Functions prefixed with an underscore are considered internal implementation details of a library and are not part of its public API. Relying on such functions is risky because they can be changed or removed in future versions of flashinfer without any warning, which would break this code. It is strongly recommended to use public APIs if they exist for this functionality. If no public API is available, consider vendoring the function (copying its source code into this repository, if the license permits) to ensure stability and avoid unexpected breakages from dependency updates.
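If vendoring is not an option, the private import can at least be guarded so that a future flashinfer release that moves or removes the helper fails loudly with an actionable message instead of an opaque traceback. A minimal, generic sketch (the helper name `import_private_helper` is hypothetical, not part of any library):

```python
import importlib


def import_private_helper(module_name: str, attr: str, hint: str):
    """Fetch an attribute from a third-party module, raising an
    ImportError with an actionable message if the internal API moved."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError) as exc:
        raise ImportError(
            f"{module_name}.{attr} is unavailable: {hint}"
        ) from exc
```

Usage would look like `import_private_helper("flashinfer.fused_moe.core", "_maybe_get_cached_w3_w1_permute_indices", "pin flashinfer or vendor the helper")`, turning a silent dependency break into a clear failure at import time.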
| ) | ||
|
|
||
| """Prepare quantized weights for kernel (done offline with weights).""" | ||
| epilogue_tile_m = 128 # FIXME: this depends on the kernel internals |
There was a problem hiding this comment.
The variable epilogue_tile_m is hardcoded to 128. The FIXME comment correctly indicates that this value depends on the kernel's internal implementation. This creates a tight coupling, and if the FlashInfer kernel's internal tile size changes in a future update, this code could fail or produce incorrect results silently. This value should be obtained from the flashinfer library through a public API or a constant if one is available. If not, this magic number should be defined as a well-documented constant with a clear explanation of its origin and dependency on the kernel version, to make it easier to identify and update in the future.
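A lightweight way to follow this suggestion is to centralize the magic number as a documented constant and probe for a library accessor before falling back to it. Note the accessor name `get_epilogue_tile_m` below is an assumption for illustration; flashinfer may not expose such an API, in which case only the documented fallback applies.

```python
# Tile size of the TRT-LLM FP4 MoE epilogue in the currently pinned
# flashinfer build; revisit on every flashinfer version bump.
DEFAULT_EPILOGUE_TILE_M = 128


def resolve_epilogue_tile_m(kernel_module=None) -> int:
    # Prefer a public accessor if the kernel library ever provides one
    # (the name probed here is hypothetical), otherwise fall back to the
    # documented constant.
    accessor = getattr(kernel_module, "get_epilogue_tile_m", None)
    if callable(accessor):
        return accessor()
    return DEFAULT_EPILOGUE_TILE_M
```

This keeps the coupling to the kernel internals in one named, commented place instead of an inline literal.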
…` by `get_last_useful_token` (vllm-project#25431) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…ct#28083) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
…EAM` (vllm-project#28157) Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
…2E performance improvement (vllm-project#28164) Signed-off-by: yewentao256 <zhyanwentao@126.com>
… from HF. Skip it in tests (vllm-project#28170) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
…ject#28141) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Jacob <cmpute@qq.com>
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Xiaozhu <mxz297@gmail.com>
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Seungduk Kim <seungduk.kim@yanolja.com> Signed-off-by: Biswa Panda <biswa.panda@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Biswa Panda <biswa.panda@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
…e empty (vllm-project#28181) Signed-off-by: courage17340 <courage17340@163.com>
…harmony (vllm-project#26874) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Documentation preview: https://vllm--28090.org.readthedocs.build/en/28090/
Purpose
Allow the compressed tensors W4A4 method to use the flashinfer trtllm_moe kernel.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.