Add Flashinfer trtllm moe to compressed tensor FP4 path #28090
Victor49152 wants to merge 290 commits into vllm-project:main from
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
# Determine routing method type
use_llama4_routing = (
    custom_routing_function is Llama4MoE.custom_routing_function
)
routing_method_type = flashinfer.RoutingMethodType.Renormalize
if use_llama4_routing:
    routing_method_type = flashinfer.RoutingMethodType.Llama4
```
Preserve DeepSeek routing mode in TRT-LLM helper
The new apply_flashinfer_trtllm_fp4_moe helper sets routing_method_type to flashinfer.RoutingMethodType.Renormalize by default and only switches to Llama4 when the custom routing function matches that model. Prior to this refactor the TRT‑LLM path in ModelOptNvFp4FusedMoE.apply defaulted to RoutingMethodType.DeepSeekV3, which is required for DeepSeek‑style routing. Because the helper is now used from the original code path, DeepSeek models will fall back to the generic renormalize kernel and route tokens incorrectly. Please keep the default routing type as DeepSeekV3 (or make it configurable) so existing DeepSeekV3 checkpoints continue to work.
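One way to keep the pre-refactor behavior is to make the default routing type a parameter of the helper rather than a hardcoded value. The sketch below is illustrative only: it uses a local stand-in enum and a stub routing function because the real `flashinfer.RoutingMethodType` and `Llama4MoE.custom_routing_function` are not reproduced here; `select_routing_method_type` is a hypothetical name.

```python
from enum import Enum


class RoutingMethodType(Enum):
    # Local stand-in for flashinfer.RoutingMethodType; only the members
    # named in this review thread are modeled.
    Renormalize = 1
    Llama4 = 2
    DeepSeekV3 = 3


def llama4_routing_stub(scores):
    # Placeholder for Llama4MoE.custom_routing_function (identity here).
    return scores


def select_routing_method_type(custom_routing_function,
                               default=RoutingMethodType.DeepSeekV3):
    # Keep the pre-refactor default (DeepSeekV3) and only switch when the
    # caller explicitly passes the Llama4-style routing function.
    if custom_routing_function is llama4_routing_stub:
        return RoutingMethodType.Llama4
    return default
```

With this shape, DeepSeek-style checkpoints keep their required routing mode by default, while callers that need a different kernel can pass `default=` explicitly.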
Code Review
This pull request adds support for Flashinfer's TensorRT-LLM FP4 MoE kernels to the compressed tensor path. This involves adding a new backend option and the corresponding logic for weight preparation and kernel invocation. The changes include refactoring the MoE weight processing to handle different backends (TensorRT-LLM, Marlin, Cutlass) and moving some shared logic into utility functions. My review focuses on the maintainability and robustness of the integration with the flashinfer library. I've identified a couple of areas where the implementation relies on internal details of the flashinfer library, which could lead to future breakages.
```python
from flashinfer.fused_moe.core import (
    _maybe_get_cached_w3_w1_permute_indices,
    get_w2_permute_indices_with_cache,
)
```
The code imports _maybe_get_cached_w3_w1_permute_indices from flashinfer.fused_moe.core. Functions prefixed with an underscore are considered internal implementation details of a library and are not part of its public API. Relying on such functions is risky because they can be changed or removed in future versions of flashinfer without any warning, which would break this code. It is strongly recommended to use public APIs if they exist for this functionality. If no public API is available, consider vendoring the function (copying its source code into this repository, if the license permits) to ensure stability and avoid unexpected breakages from dependency updates.
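If vendoring is not an option, the private import can at least be guarded so that a future flashinfer release that moves or removes the helper fails loudly with an actionable message instead of an opaque traceback. A minimal, generic sketch (the helper name `import_private_helper` is hypothetical, not part of any library):

```python
import importlib


def import_private_helper(module_name: str, attr: str, hint: str):
    """Fetch an attribute from a third-party module, raising an
    ImportError with an actionable message if the internal API moved."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError) as exc:
        raise ImportError(
            f"{module_name}.{attr} is unavailable: {hint}"
        ) from exc
```

Usage would look like `import_private_helper("flashinfer.fused_moe.core", "_maybe_get_cached_w3_w1_permute_indices", "pin flashinfer or vendor the helper")`, turning a silent dependency break into a clear failure at import time.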
| ) | ||
|
|
||
| """Prepare quantized weights for kernel (done offline with weights).""" | ||
| epilogue_tile_m = 128 # FIXME: this depends on the kernel internals |
There was a problem hiding this comment.
The variable epilogue_tile_m is hardcoded to 128. The FIXME comment correctly indicates that this value depends on the kernel's internal implementation. This creates a tight coupling, and if the FlashInfer kernel's internal tile size changes in a future update, this code could fail or produce incorrect results silently. This value should be obtained from the flashinfer library through a public API or a constant if one is available. If not, this magic number should be defined as a well-documented constant with a clear explanation of its origin and dependency on the kernel version, to make it easier to identify and update in the future.
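A lightweight way to follow this suggestion is to centralize the magic number as a documented constant and probe for a library accessor before falling back to it. Note the accessor name `get_epilogue_tile_m` below is an assumption for illustration; flashinfer may not expose such an API, in which case only the documented fallback applies.

```python
# Tile size of the TRT-LLM FP4 MoE epilogue in the currently pinned
# flashinfer build; revisit on every flashinfer version bump.
DEFAULT_EPILOGUE_TILE_M = 128


def resolve_epilogue_tile_m(kernel_module=None) -> int:
    # Prefer a public accessor if the kernel library ever provides one
    # (the name probed here is hypothetical), otherwise fall back to the
    # documented constant.
    accessor = getattr(kernel_module, "get_epilogue_tile_m", None)
    if callable(accessor):
        return accessor()
    return DEFAULT_EPILOGUE_TILE_M
```

This keeps the coupling to the kernel internals in one named, commented place instead of an inline literal.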
…` by `get_last_useful_token` (vllm-project#25431) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…ct#28083) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
…EAM` (vllm-project#28157) Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
…2E performance improvement (vllm-project#28164) Signed-off-by: yewentao256 <zhyanwentao@126.com>
… from HF. Skip it in tests (vllm-project#28170) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
…ject#28141) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Jacob <cmpute@qq.com>
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Xiaozhu <mxz297@gmail.com>
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Seungduk Kim <seungduk.kim@yanolja.com> Signed-off-by: Biswa Panda <biswa.panda@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Biswa Panda <biswa.panda@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
…e empty (vllm-project#28181) Signed-off-by: courage17340 <courage17340@163.com>
…harmony (vllm-project#26874) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Documentation preview: https://vllm--28090.org.readthedocs.build/en/28090/
Purpose
Allow the compressed tensors W4A4 method to use the flashinfer trtllm_moe kernel.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.