
Add Flashinfer trtllm moe to compressed tensor FP4 path#28090

Closed
Victor49152 wants to merge 290 commits into vllm-project:main from
Victor49152:add_trtllm_moe_to_compressed_tensor_path

Conversation

@Victor49152

@Victor49152 Victor49152 commented Nov 5, 2025

Purpose

Allow the compressed-tensors W4A4 method to use the flashinfer.trtllm_moe kernel.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify

mergify bot commented Nov 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Victor49152.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +138 to +144
# Determine routing method type
use_llama4_routing = (
custom_routing_function is Llama4MoE.custom_routing_function
)
routing_method_type = flashinfer.RoutingMethodType.Renormalize
if use_llama4_routing:
routing_method_type = flashinfer.RoutingMethodType.Llama4


P1: Preserve DeepSeek routing mode in TRT-LLM helper

The new apply_flashinfer_trtllm_fp4_moe helper sets routing_method_type to flashinfer.RoutingMethodType.Renormalize by default and only switches to Llama4 when the custom routing function matches that model. Prior to this refactor the TRT‑LLM path in ModelOptNvFp4FusedMoE.apply defaulted to RoutingMethodType.DeepSeekV3, which is required for DeepSeek‑style routing. Because the helper is now used from the original code path, DeepSeek models will fall back to the generic renormalize kernel and route tokens incorrectly. Please keep the default routing type as DeepSeekV3 (or make it configurable) so existing DeepSeekV3 checkpoints continue to work.
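One way to address this review finding is to make the routing method an explicit, caller-supplied parameter whose default preserves the DeepSeekV3 behavior. The sketch below is illustrative only: the enum is a stand-in for `flashinfer.RoutingMethodType`, and the helper name and signature are hypothetical, not the PR's actual code.

```python
from enum import Enum, auto


# Stand-in for flashinfer.RoutingMethodType; the real values live in flashinfer.
class RoutingMethodType(Enum):
    Renormalize = auto()
    Llama4 = auto()
    DeepSeekV3 = auto()


def select_routing_method(
    custom_routing_function=None,
    llama4_routing_function=None,
    default=RoutingMethodType.DeepSeekV3,
):
    """Pick the TRT-LLM routing method.

    Llama4 routing is only chosen when the caller's custom routing function
    is the Llama4 one; everything else falls back to the caller's default,
    which stays DeepSeekV3 unless explicitly overridden.
    """
    if (
        custom_routing_function is not None
        and custom_routing_function is llama4_routing_function
    ):
        return RoutingMethodType.Llama4
    return default
```

With this shape, existing DeepSeekV3 checkpoints keep their routing mode by default, while the compressed-tensors path can pass `default=RoutingMethodType.Renormalize` where that is the correct behavior.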


@Victor49152 Victor49152 marked this pull request as draft November 5, 2025 02:53

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Flashinfer's TensorRT-LLM FP4 MoE kernels to the compressed tensor path. This involves adding a new backend option and the corresponding logic for weight preparation and kernel invocation. The changes include refactoring the MoE weight processing to handle different backends (TensorRT-LLM, Marlin, Cutlass) and moving some shared logic into utility functions. My review focuses on the maintainability and robustness of the integration with the flashinfer library. I've identified a couple of areas where the implementation relies on internal details of the flashinfer library, which could lead to future breakages.

Comment on lines +341 to +343
_maybe_get_cached_w3_w1_permute_indices,
get_w2_permute_indices_with_cache,
)


Severity: high

The code imports _maybe_get_cached_w3_w1_permute_indices from flashinfer.fused_moe.core. Functions prefixed with an underscore are considered internal implementation details of a library and are not part of its public API. Relying on such functions is risky because they can be changed or removed in future versions of flashinfer without any warning, which would break this code. It is strongly recommended to use public APIs if they exist for this functionality. If no public API is available, consider vendoring the function (copying its source code into this repository, if the license permits) to ensure stability and avoid unexpected breakages from dependency updates.
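A common way to hedge against a private helper disappearing from a dependency is a small import utility that tries the library module first and falls back to a vendored copy. This is a generic sketch, not vLLM's actual code; the vendored module path in the usage note below is hypothetical.

```python
import importlib


def import_with_fallback(primary: str, fallback: str, names: tuple):
    """Resolve `names` from `primary`, falling back to `fallback`.

    Returns the resolved attributes in order. Tries each module in turn
    and skips it if the module is missing or lacks any requested name.
    """
    for module_name in (primary, fallback):
        try:
            mod = importlib.import_module(module_name)
            return tuple(getattr(mod, n) for n in names)
        except (ImportError, AttributeError):
            continue
    raise ImportError(f"none of {primary!r}, {fallback!r} provide {names!r}")
```

Usage would then look like `import_with_fallback("flashinfer.fused_moe.core", "vllm._vendored.flashinfer_moe", ("_maybe_get_cached_w3_w1_permute_indices", "get_w2_permute_indices_with_cache"))`, where the vendored module is something the repo would have to add (license permitting), so a flashinfer upgrade that renames the private helper fails over cleanly instead of crashing at import time.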

)

"""Prepare quantized weights for kernel (done offline with weights)."""
epilogue_tile_m = 128 # FIXME: this depends on the kernel internals

Severity: high

The variable epilogue_tile_m is hardcoded to 128. The FIXME comment correctly indicates that this value depends on the kernel's internal implementation. This creates a tight coupling, and if the FlashInfer kernel's internal tile size changes in a future update, this code could fail or produce incorrect results silently. This value should be obtained from the flashinfer library through a public API or a constant if one is available. If not, this magic number should be defined as a well-documented constant with a clear explanation of its origin and dependency on the kernel version, to make it easier to identify and update in the future.
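The reviewer's suggestion can be sketched as a small resolver that prefers a kernel-provided constant and only then falls back to a well-documented default. This is illustrative only: the attribute name `EPILOGUE_TILE_M` on a flashinfer module is an assumption, not a known flashinfer API.

```python
def resolve_epilogue_tile_m(kernel_module=None, default: int = 128) -> int:
    """Return the epilogue tile size used when shuffling FP4 MoE weights.

    Prefers a constant exposed by the kernel module (attribute name
    "EPILOGUE_TILE_M" here is hypothetical); otherwise falls back to
    `default`. 128 matches the TRT-LLM FP4 MoE kernel internals at the
    time of this PR and must be revisited when upgrading flashinfer.
    """
    if kernel_module is not None:
        value = getattr(kernel_module, "EPILOGUE_TILE_M", None)
        if isinstance(value, int):
            return value
    return default
```

Even if no public constant exists today, routing every use through one named, documented function makes the kernel-version dependency easy to find and update.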

KuntaiDu and others added 19 commits November 6, 2025 00:12
…` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…ct#28083)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
…EAM` (vllm-project#28157)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
…2E performance improvement (vllm-project#28164)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
… from HF. Skip it in tests (vllm-project#28170)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
…ject#28141)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Jacob <cmpute@qq.com>
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Seungduk Kim <seungduk.kim@yanolja.com>
Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
…e empty (vllm-project#28181)

Signed-off-by: courage17340 <courage17340@163.com>
…harmony (vllm-project#26874)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@mergify

mergify bot commented Nov 13, 2025

Documentation preview: https://vllm--28090.org.readthedocs.build/en/28090/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models labels Nov 13, 2025
@mergify mergify bot added v1 tpu Related to Google TPUs tool-calling labels Nov 13, 2025
@mergify mergify bot added the kv-connector label Nov 13, 2025
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Nov 13, 2025
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Nov 13, 2025
