Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe#35088

Merged
vllm-bot merged 1 commit into vllm-project:main from de-inf:fix_trtllm_moe_autotune on Feb 24, 2026

Conversation

Contributor

@danisereb danisereb commented Feb 23, 2026

Purpose

The flashinfer autotuner expects the first dimension of the MoE input tensors to be num_tokens.

The relevant code can be found in the flashinfer function get_trtllm_moe_sm100_module:

```python
        # their first dimension is num_tokens which will be tuned
        tuning_config_with_hidden_states_scales = TuningConfig(
            dynamic_tensor_specs=(
                DynamicTensorSpec(
                    (0, 1, 2, 3, 4, 5),
                    (0, 0, 0, 0, 0, 0),
                    get_last_power_of_2_num_tokens_buckets(8192, 1),
                    lambda x: min(last_positive_power_of_2(x), 8192),
                    dynamic_tensor_initializers,
                ),
            )
        )
```
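The bucketing lambda in the snippet above can be sketched in plain Python. The names last_positive_power_of_2 and the 8192 cap come from the quoted code; this standalone implementation is only an assumption about flashinfer's internals:

```python
# Hedged sketch of how the autotuner likely buckets num_tokens
# (assumption based on: lambda x: min(last_positive_power_of_2(x), 8192)).
def last_positive_power_of_2(x: int) -> int:
    # Largest power of two <= x, for x >= 1.
    return 1 << (x.bit_length() - 1)

def bucket(num_tokens: int) -> int:
    # Round the dynamic num_tokens dimension down to a power-of-2 bucket,
    # capped at 8192, so tuned tactics can be reused across batch sizes.
    return min(last_positive_power_of_2(num_tokens), 8192)

print(bucket(300))    # → 256
print(bucket(20000))  # → 8192
```

Under this reading, tactics tuned for each power-of-2 bucket are looked up by the leading (num_tokens) dimension at runtime, which is why that dimension must survive into the kernel inputs.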

When running vLLM main with the env var FLASHINFER_LOGGING_LEVEL=DEBUG, we can see the following:

```
flashinfer.jit: [AutoTunner]: Using fallback tactic for flashinfer::trtllm_fp4_block_scale_moe with input shapes ...
```

Using reshape(...) instead of flatten() solves the issue (the fallback-tactic messages no longer appear).

This change is aligned with the function Mxfp4MoEMethod.apply_monolithic:

```python
x_scale = x_scale.view(torch.float8_e4m3fn).reshape(*x.shape[:-1], -1)
```
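A minimal shape-level sketch (NumPy as a stand-in for the torch tensors; the sizes are hypothetical) shows why flatten() hides num_tokens from the tuner while reshape(*x.shape[:-1], -1) preserves it as the leading dimension:

```python
import numpy as np

num_tokens, hidden = 4, 64
x = np.zeros((num_tokens, hidden), dtype=np.uint8)              # stand-in for hidden_states_fp4
x_scale = np.zeros((num_tokens, hidden // 16), dtype=np.uint8)  # one scale per 16-element block

flat = x_scale.flatten()                    # 1-D: leading dim is no longer num_tokens
fixed = x_scale.reshape(*x.shape[:-1], -1)  # leading dim stays num_tokens

print(flat.shape)   # → (16,)
print(fixed.shape)  # → (4, 4)
```

With the flattened scale tensor, the autotuner cannot match dimension 0 against its num_tokens buckets and falls back to the default tactic; the reshaped form keeps dimension 0 equal to num_tokens as the TuningConfig expects.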

Test Plan

Verify that there is no drop in accuracy (lm_eval).

Test Result

Used the following MoE NVFP4 model:
https://huggingface.co/RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4

Use the following env vars to select the 'FLASHINFER_TRTLLM' NvFp4 MoE backend:

```shell
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_FLASHINFER_MOE_BACKEND=latency
```

lm_eval gsm8k:

```shell
lm_eval \
  --model vllm \
  --model_args pretrained="/my_home/hf_models/RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks gsm8k_llama \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

### vLLM main commit: 
|   Tasks   |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_llama|      3|flexible_extract|     8|exact_match|↑  |0.9371|±  |0.0067|
|           |       |strict_match    |     8|exact_match|↑  |0.9371|±  |0.0067|


### This PR:
|   Tasks   |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_llama|      3|flexible_extract|     8|exact_match|↑  |0.9371|±  |0.0067|
|           |       |strict_match    |     8|exact_match|↑  |0.9371|±  |0.0067|

There was no change in vllm bench serve performance (tok/sec).


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a performance issue with the flashinfer MoE kernels. The original code used .flatten() on the hidden_states_scale tensor, which collapsed its dimensions and prevented the flashinfer autotuner from identifying the optimal tactic, causing a fallback to a less efficient default. The fix replaces .flatten() with .reshape(*hidden_states_fp4.shape[:-1], -1), which correctly preserves the tensor's leading dimensions (including num_tokens) as expected by the kernel. This change is applied consistently in both flashinfer_trtllm_fp4_moe and flashinfer_trtllm_fp4_routed_moe functions, resolving the performance degradation. The change is correct and well-justified.

@danisereb danisereb changed the title Fix fallback to default flashinfer tactic with trtllm_fp4_block_scale_moe Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe Feb 23, 2026
@danisereb danisereb force-pushed the fix_trtllm_moe_autotune branch 2 times, most recently from e882c41 to 19ec584 Compare February 23, 2026 13:37
@danisereb danisereb marked this pull request as ready for review February 23, 2026 14:59
@mgoin mgoin left a comment (Member)

Makes sense, LGTM

Comment on lines 349 to +351

```diff
 hidden_states_scale=hidden_states_scale_linear_fp4.view(
     torch.float8_e4m3fn
-).flatten(),
+).reshape(*hidden_states_fp4.shape[:-1], -1),
```
Member
Is reshape required or could we use view here? I just ask so we avoid the implicit .contiguous() that reshape has

@danisereb danisereb (Contributor, Author) commented Feb 23, 2026

As I mentioned in the PR description, I followed the reshape from Mxfp4MoEMethod.apply_monolithic.

As far as I understand, reshape() returns a view when possible and makes an implicit contiguous copy only when required, whereas view() may throw an error if the underlying storage/strides do not match the requested shape.
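The contiguity distinction can be illustrated with NumPy as a stand-in (torch's view/reshape behave analogously): reshape silently copies when a zero-copy view is impossible, while the strict view path fails outright on a non-contiguous array.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
t = a.T                       # transposed: non-contiguous in memory

copied = t.reshape(-1)        # reshape succeeds by making a contiguous copy
print(copied.tolist())        # → [0, 3, 1, 4, 2, 5]

try:
    t.shape = (6,)            # strict in-place view: rejected for non-contiguous data
    raised = False
except AttributeError:
    raised = True
print(raised)                 # → True
```

In the MoE path the scale tensor is expected to be contiguous anyway, so reshape should usually take the zero-copy view branch; following the existing Mxfp4MoEMethod.apply_monolithic pattern just keeps the safe fallback.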

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 23, 2026
@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed performance Performance-related issues labels Feb 23, 2026
…_block_scale_moe

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
@danisereb danisereb force-pushed the fix_trtllm_moe_autotune branch from 19ec584 to fd87a3a Compare February 23, 2026 17:08
Contributor Author

danisereb commented Feb 23, 2026

Rebased to pick up #34874, which seems to be related to the CI failures I see.

There's still a bug in certain cases (num tokens not power of 2), but the fix is in flashinfer - see this PR:
flashinfer-ai/flashinfer#2617

@pavanimajety pavanimajety left a comment (Collaborator)

Seems like the test failures are unrelated (model architectures in the failing tests likely won't invoke trtllm FP4 moe kernels)

> There's still a bug in certain cases (num tokens not power of 2), but the fix is in flashinfer - see this PR: flashinfer-ai/flashinfer#2617

With smaller batch sizes that would be enabled through #34936 in --performance-mode latency, we may see an issue. However, since this is a perf improvement, we should be okay to merge.

@pavanimajety pavanimajety enabled auto-merge (squash) February 23, 2026 20:07
@vllm-bot vllm-bot merged commit a0c7081 into vllm-project:main Feb 24, 2026
60 of 62 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 24, 2026
tom-zju pushed a commit to tom-zju/vllm that referenced this pull request Feb 26, 2026
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
Labels

nvidia, performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

4 participants