[Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4#33506

Merged
vllm-bot merged 15 commits into vllm-project:main from amitz-nv:support-fi-fused-moe-non-gated-fp8-nvfp4
Feb 12, 2026

Conversation


@amitz-nv amitz-nv commented Feb 1, 2026

Purpose

Add support for Flashinfer trtllm fused MoE non-gated activation for FP8 and for NVFP4.

Changes:

  • Pass activation_type argument to FlashInfer trtllm fused MoE FP8 and NVFP4.
  • Add DeepSeek routing to the supported routing list of Flashinfer trtllm fused MoE FP8.
  • Add support for the non-gated flow in Flashinfer trtllm fused MoE NVFP4.
  • Use min_alignment=128 (padding) for non-gated activation in Flashinfer trtllm fused MoE.
  • Fix tests/kernels/moe/test_flashinfer.py and expand it to also test relu2_no_mul activation for both cutlass and trtllm kernels.
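
For context, a gated MoE FFN (e.g. SwiGLU) splits the fused w13 projection into gate and up halves and multiplies them, while a non-gated one such as relu2_no_mul applies a squared ReLU directly. A minimal NumPy sketch of the two expert-FFN variants (illustrative shapes and names only, not vLLM's actual kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inter = 8, 16

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_ffn(x, w13, w2):
    # w13 fuses gate and up projections: shape (2 * inter, hidden)
    gate, up = np.split(x @ w13.T, 2, axis=-1)
    return (silu(gate) * up) @ w2.T          # SwiGLU: silu(gate) * up

def non_gated_ffn(x, w13, w2):
    # w13 is a single up projection: shape (inter, hidden), no gate half
    h = x @ w13.T
    return np.maximum(h, 0.0) ** 2 @ w2.T    # relu2_no_mul: relu(h) ** 2

x = rng.standard_normal((2, hidden))
y_gated = gated_ffn(x, rng.standard_normal((2 * inter, hidden)),
                    rng.standard_normal((hidden, inter)))
y_plain = non_gated_ffn(x, rng.standard_normal((inter, hidden)),
                        rng.standard_normal((hidden, inter)))
assert y_gated.shape == y_plain.shape == (2, hidden)
```

Note the weight-shape difference: for the same config `intermediate_size`, the fused w13 of a gated model is twice as wide, which is why the kernels below must distinguish the two flows.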

lm_eval on Nemotron 3 Nano FP8:

export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_FLASHINFER_MOE_BACKEND=latency

lm_eval --model vllm \
  --model_args pretrained=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8,tensor_parallel_size=1,max_model_len=2048,kv_cache_dtype=auto \
  --gen_kwargs temperature=0.0 --limit 500 --trust_remote_code \
  --tasks gsm8k --num_fewshot 5 --batch_size 200

Outputs:

vllm ({'pretrained': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8', 'tensor_parallel_size': 1, 'max_model_len': 2048, 'kv_cache_dtype': 'auto'}), gen_kwargs: ({'temperature': 0.0}), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.568|±  |0.0222|
|     |       |strict-match    |     5|exact_match|↑  |0.848|±  |0.0161|

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the nvidia label Feb 1, 2026
@mergify

mergify bot commented Feb 1, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @amitz-nv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for non-gated Mixture of Experts (MoE) models using FlashInfer with FP8 and NVFP4 quantization. The changes are comprehensive, including updates to tests, support checks, activation handling, and weight preparation logic. Overall, the changes are well-aligned with the PR's objective. However, I've identified a critical bug in the FP4 MoE weight preparation logic that incorrectly calculates shapes for gated activations, which could lead to runtime errors or incorrect results. I have provided specific suggestions to address this issue.

Comment on lines +198 to +206

  gemm2_weights_fp4 = gemm2_weights.view(torch.float8_e4m3fn).reshape(
-     num_experts, hidden_size, intermediate_size // 2
+     num_experts, hidden_size, actual_intermediate_size // 2
  )  # packed fp4
  gemm2_scales_linear_fp4 = gemm2_scales_linear_fp4_bytes.view(
      torch.float8_e4m3fn
- ).reshape(num_experts, hidden_size, intermediate_size // 16)  # fp8 scaling factors
+ ).reshape(
+     num_experts, hidden_size, actual_intermediate_size // 16
+ )  # fp8 scaling factors

critical

The calculation for gemm2_weights_fp4 and gemm2_scales_linear_fp4 shapes is incorrect for gated activations. actual_intermediate_size is derived from w13's shape, which differs for gated and non-gated models. However, the down-projection (gemm2) should have a consistent intermediate dimension. This change introduces mlp_ffn_dim to correctly calculate the shapes for both gated and non-gated cases, fixing a bug for gated activations.

Suggested change
- gemm2_weights_fp4 = gemm2_weights.view(torch.float8_e4m3fn).reshape(
-     num_experts, hidden_size, actual_intermediate_size // 2
- )  # packed fp4
- gemm2_scales_linear_fp4 = gemm2_scales_linear_fp4_bytes.view(
-     torch.float8_e4m3fn
- ).reshape(
-     num_experts, hidden_size, actual_intermediate_size // 16
- )  # fp8 scaling factors
+ mlp_ffn_dim = intermediate_size if is_gated_activation else 2 * intermediate_size
+ gemm2_weights_fp4 = gemm2_weights.view(torch.float8_e4m3fn).reshape(
+     num_experts, hidden_size, mlp_ffn_dim // 2
+ )  # packed fp4
+ gemm2_scales_linear_fp4 = gemm2_scales_linear_fp4_bytes.view(
+     torch.float8_e4m3fn
+ ).reshape(
+     num_experts, hidden_size, mlp_ffn_dim // 16
+ )  # fp8 scaling factors
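
To make the shape bookkeeping concrete, here is a hedged sketch of the suggested arithmetic (the helper name gemm2_reshape_dims is hypothetical; the // 2 and // 16 divisors come from fp4 packing, two values per byte, and one fp8 scale per 16 values):

```python
def gemm2_reshape_dims(num_experts, hidden_size, intermediate_size, is_gated_activation):
    # Hypothetical helper mirroring the suggestion above: gemm2's input
    # dimension depends only on whether the activation is gated, never on
    # w13's activation-dependent fan-out.
    mlp_ffn_dim = intermediate_size if is_gated_activation else 2 * intermediate_size
    weights_shape = (num_experts, hidden_size, mlp_ffn_dim // 2)   # 2 packed fp4 values per byte
    scales_shape = (num_experts, hidden_size, mlp_ffn_dim // 16)   # one fp8 scale per 16 values
    return weights_shape, scales_shape

# e.g. 8 experts, hidden_size=1024, intermediate_size=2048, gated activation:
print(gemm2_reshape_dims(8, 1024, 2048, True))   # → ((8, 1024, 1024), (8, 1024, 128))
```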

Comment on lines 282 to 286
  gemm2_scales_fp4_shuffled = (
      torch.stack(gemm2_scales_fp4_shuffled)
      .view(torch.float8_e4m3fn)
-     .reshape(num_experts, hidden_size, intermediate_size // 16)
+     .reshape(num_experts, hidden_size, actual_intermediate_size // 16)
  )

critical

Similar to the previous comment, the reshape dimension for gemm2_scales_fp4_shuffled is incorrect for gated activations. It should use the mlp_ffn_dim variable (defined in the suggested fix for the previous issue) to ensure the correct shape.

Suggested change
- gemm2_scales_fp4_shuffled = (
-     torch.stack(gemm2_scales_fp4_shuffled)
-     .view(torch.float8_e4m3fn)
-     .reshape(num_experts, hidden_size, actual_intermediate_size // 16)
- )
+ gemm2_scales_fp4_shuffled = (
+     torch.stack(gemm2_scales_fp4_shuffled)
+     .view(torch.float8_e4m3fn)
+     .reshape(num_experts, hidden_size, mlp_ffn_dim // 16)
+ )

@amitz-nv amitz-nv force-pushed the support-fi-fused-moe-non-gated-fp8-nvfp4 branch from 2d08ea2 to 5e81d21 Compare February 1, 2026 16:51
@mergify mergify bot removed the needs-rebase label Feb 1, 2026
@amitz-nv amitz-nv force-pushed the support-fi-fused-moe-non-gated-fp8-nvfp4 branch from 500f8e3 to e1b1314 Compare February 2, 2026 15:31
@mergify
Copy link

mergify bot commented Feb 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @amitz-nv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 8, 2026
@amitz-nv amitz-nv force-pushed the support-fi-fused-moe-non-gated-fp8-nvfp4 branch 2 times, most recently from 5956b26 to 0e40b63 Compare February 10, 2026 12:00
@mergify mergify bot removed the needs-rebase label Feb 10, 2026
@amitz-nv amitz-nv force-pushed the support-fi-fused-moe-non-gated-fp8-nvfp4 branch from 3e18b3d to 4550510 Compare February 10, 2026 12:35
@amitz-nv amitz-nv changed the title Support FI fused MoE non gated FP8 & NVFP4 [Kernel] Support Flashinfer fused MoE non gated FP8 & NVFP4 Feb 10, 2026
@amitz-nv amitz-nv changed the title [Kernel] Support Flashinfer fused MoE non gated FP8 & NVFP4 [Kernel] Support Flashinfer trtllm-gen fused MoE non gated FP8 & NVFP4 Feb 10, 2026
@amitz-nv amitz-nv changed the title [Kernel] Support Flashinfer trtllm-gen fused MoE non gated FP8 & NVFP4 [Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 Feb 10, 2026
@amitz-nv amitz-nv marked this pull request as ready for review February 10, 2026 15:21
    use_routing_scales_on_input: bool,
    routing_method_type: int,
    routed_scaling_factor: float = 1.0,
    activation_type: int = 3,  # Swiglu

Let's remove the default value to always be explicit

Comment on lines +21 to +36
def is_gated_activation(activation: str) -> bool:
    return not activation.lower().endswith("_no_mul")


def activation_str_to_int(activation: str) -> int:
    from flashinfer.fused_moe.core import ActivationType

    # silu and gelu are mapped to their gated versions SwiGLU and GeGLU respectively
    ACTIVATION_TO_FI_ACTIVATION = {
        "silu_no_mul": ActivationType.Silu,
        "gelu_no_mul": ActivationType.Gelu,
        "silu": ActivationType.Swiglu,
        "gelu": ActivationType.Geglu,
        "relu2_no_mul": ActivationType.Relu2,
    }
    return ACTIVATION_TO_FI_ACTIVATION[activation.lower()].value
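
As a quick illustration of the naming convention the mapping relies on, is_gated_activation is re-declared below without the flashinfer import so the snippet runs standalone:

```python
def is_gated_activation(activation: str) -> bool:
    # Non-gated variants carry a "_no_mul" suffix; everything else
    # ("silu", "gelu") is treated as its gated form (SwiGLU / GeGLU).
    return not activation.lower().endswith("_no_mul")

print(is_gated_activation("silu"))          # → True  (mapped to SwiGLU)
print(is_gated_activation("relu2_no_mul"))  # → False (squared ReLU, non-gated)
```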

Would be nice if we could have this use the MoEActivation refactor, hopefully landing soon #33843

@amitz-nv amitz-nv (Contributor Author) Feb 11, 2026

Nice, I definitely agree that refactor is necessary!
Regarding the order, I think it depends on when the refactor PR is merged


Done

# for the gate-up proj. Pad the weights to respect this.
is_gated = is_gated_activation(layer.activation)
if not block_quant:
    min_alignment = 16 if is_gated else 128
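
The padding itself is just rounding the gate-up intermediate dimension up to a multiple of min_alignment; a minimal sketch (the helper name pad_to_alignment is hypothetical):

```python
def pad_to_alignment(dim: int, min_alignment: int) -> int:
    # Round dim up to the nearest multiple of min_alignment; the
    # double-negation performs integer ceiling division.
    return -(-dim // min_alignment) * min_alignment

print(pad_to_alignment(3000, 128))  # → 3072 (non-gated: 128-element alignment)
print(pad_to_alignment(3000, 16))   # → 3008 (gated: 16-element alignment)
```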

Is there some justification for 128 we can reference?


That's what the current Flashinfer kernels require, otherwise it doesn't find a suitable kernel.

For example, Nemotron 3 Nano TP=1 would fail unless it's set to 128 here:

(EngineCore_DP0 pid=3184059)   File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 2258, in trtllm_fp8_per_tensor_scale_moe
(EngineCore_DP0 pid=3184059)     return get_trtllm_moe_sm100_module().trtllm_fp8_per_tensor_scale_moe(
(EngineCore_DP0 pid=3184059)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3184059)   File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 1488, in trtllm_fp8_per_tensor_scale_moe_op
(EngineCore_DP0 pid=3184059)     result = moe_op.trtllm_fp8_per_tensor_scale_moe(
(EngineCore_DP0 pid=3184059)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3184059)   File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
(EngineCore_DP0 pid=3184059) RuntimeError: Error in function 'getValidConfigIndices' at /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/trtllm_batched_gemm_runner.cu:416: No valid config found for the given problem shape

Comment on lines +511 to +513
block_quant = (
    hasattr(layer, "weight_block_size") and layer.weight_block_size is not None
)

If we are in NVFP4, why would we expect weight_block_size in any case?


It was copied from the FP8 flow; removing it.

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>

…ted MoE and rel2_no_mul activation, support DeepSeek routing in FP8 per-tensor, fix prepare_static_weights_for_trtllm_fp4_moe

…m/utils/flashinfer.py

…gated, otherwise use min_alignment=16
@amitz-nv amitz-nv force-pushed the support-fi-fused-moe-non-gated-fp8-nvfp4 branch from 5d43e07 to ea22768 Compare February 12, 2026 10:11
@mergify

mergify bot commented Feb 12, 2026

Hi @amitz-nv, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed performance Performance-related issues labels Feb 12, 2026
@mgoin mgoin (Member) left a comment

LGTM nice work! Will manually trigger MoE refactor tests

Comment on lines 939 to -943
  # time in the oracle rather than here.
- assert layer.activation == MoEActivation.SILU, (
-     f"Expected 'silu' activation but got {layer.activation}"
+ SUPPORTED_ACTIVATIONS = [MoEActivation.SILU, MoEActivation.RELU2_NO_MUL]
+ assert layer.activation in SUPPORTED_ACTIVATIONS, (
+     f"Only {SUPPORTED_ACTIVATIONS} activations are supported for FlashInfer "
+     f"TRTLLM FP4 MoE, {layer.activation} found instead."
  )
  assert not layer.renormalize

Note: we need to update the compressed tensors side too, can do in followup PR

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 12, 2026
@vllm-bot vllm-bot merged commit f120bd4 into vllm-project:main Feb 12, 2026
61 of 67 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 12, 2026
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
…llm-project#33506)

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…llm-project#33506)

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…llm-project#33506)

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>

Labels

nvidia performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done


3 participants