[Kernel] Add FlashInfer MoE A2A Kernel #36022
Conversation
Code Review
This pull request introduces the FlashInfer MoE A2A kernel, which is a welcome addition for improving performance in large batch size scenarios. The integration of the new kernel is well-executed across the codebase, including configuration, communicator management, and kernel selection logic. I've identified one high-severity issue related to determining the number of GPUs per node, which could lead to suboptimal performance. My detailed feedback and a suggested fix are in the review comment.
Force-pushed from 7c6aef4 to 0b13478.
Documentation preview: https://vllm--36022.org.readthedocs.build/en/36022/
@wzhao18 Was able to get past the blocking trtllm scales issue now and got a good lm_eval on gsm8k R1 NVFP4. This is still necessary on my end:
Hi @elvircrn, thanks for sharing your progress. Would you mind sharing what the cause of the trtllm scales issue was and how you fixed it? As for the other issue, the fix for this would probably just be specifying a return value for
Signed-off-by: Leo Tian <lctian@nvidia.com>
Hi @leo-cf-tian, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
@leo-cf-tian re-running with your latest commit and without my
tlrmchlsmth left a comment
This looks good to me, assuming we see correctness and are past the issue @elvircrn was running into
The trtllm scales issue appears for: and switching to made it go away. Can confirm the int32/int64 index issue went away in both cases.
Thanks @elvircrn. I don't expect many people to set those variables so high, but it could be nice to add a warning in case
tlrmchlsmth left a comment
I'd like to get this into v0.18.0, which cuts tomorrow. Could you please fix the pre-commit issues? It looks like they are caused by divergence from main.
I can help take a look tonight.
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@tylertitsworth I fixed the merge conflicts. Can you start CI for this PR?
@wzhao18 could you hook up this kernel to CI? The test needs to be added to .buildkite/test_areas/kernels.yaml.
Sorry, I thought I had posted the following response, but for some reason it was not submitted.
@tlrmchlsmth I re-examined the test and concluded that it may not be very meaningful to add here: it checks the result of _supports_parallel_config against an expectation derived from the function itself, which seems redundant. Thus I removed the test from the PR.
I think test_modular_kernel_combinations_multigpu should be a unified test that ensures both that (1) _supports_parallel_config is set correctly and (2) the combination actually works in practice. However, as far as I can tell, this test is not in the CI pipeline, and I am having some problems running it even on the current main branch. I will look into this in more detail and potentially improve it in a future PR.
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Squashed from vllm-project#36022. Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Leo Tian <lctian@nvidia.com> Co-authored-by: wzhao18 <wzhao18.sz@gmail.com> Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com> Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com>
Purpose
This PR is a port of PR #32217 to the vLLM top-of-tree after the modular kernel refactors in #32564. It adds the latest TRT-LLM gen A2A kernel from FlashInfer's MoE-A2A API (one-sided all-to-all), as added in flashinfer-ai/flashinfer#2102. This should perform better than the older A2A kernel from #21003 (formerly flashinfer_all2allv) at large batch sizes.
The new kernel can be enabled by specifying `--all2all-backend flashinfer_nvlink_one_sided`. It is only available for nvfp4.

This PR also renames `flashinfer_all2allv` to `flashinfer_nvlink_two_sided`, as per suggestion, since it is more descriptive and matches the new implementation.

We conducted benchmarks and found a noticeable increase in throughput at high concurrency, up to a 14% increase in throughput at 512 concurrency.
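A rename like `flashinfer_all2allv` to `flashinfer_nvlink_two_sided` typically keeps the old name working with a deprecation warning for a transition period. A minimal illustrative sketch of that pattern (hypothetical helper and mapping, not vLLM's actual code):

```python
import warnings

# Hypothetical mapping from deprecated backend names to replacements.
_DEPRECATED_BACKENDS = {
    "flashinfer_all2allv": "flashinfer_nvlink_two_sided",
}

# Illustrative set of accepted canonical names.
_VALID_BACKENDS = {
    "flashinfer_nvlink_one_sided",  # new one-sided A2A kernel (nvfp4 only)
    "flashinfer_nvlink_two_sided",  # formerly flashinfer_all2allv
}


def resolve_all2all_backend(name: str) -> str:
    """Map a user-supplied backend name to its canonical form."""
    if name in _DEPRECATED_BACKENDS:
        new_name = _DEPRECATED_BACKENDS[name]
        warnings.warn(
            f"all2all backend '{name}' has been renamed to '{new_name}'",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    if name not in _VALID_BACKENDS:
        raise ValueError(f"unknown all2all backend: {name}")
    return name
```

With this, `resolve_all2all_backend("flashinfer_all2allv")` returns `"flashinfer_nvlink_two_sided"` while warning, so existing launch scripts keep working during the rename.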
Testing
The PR also adds test coverage from @stecasta:

- Registers `FlashInferMoeA2APrepareAndFinalize` in the modular kernel combinatorial test framework (mk_objects.py), enabling automatic multi-GPU testing against all compatible Expert backends with nvfp4 quantization
- Registers `TrtLlmNvFp4ExpertsModular` in the same framework (previously missing from the test registry)
- Adds a `_supports_parallel_config` incompatibility matrix for the new `flashinfer_moe_a2a` backend across 7 Expert types
- Verifies that `flashinfer_moe_a2a` and `flashinfer_all2allv` share the same incompatibility matrix, catching drift if one is updated without the other

Test plan

- `test_supports_parallel_config_flashinfer_moe_a2a` — CPU only, 7 parametrized cases
- `test_supports_parallel_config_parity_with_all2allv` — CPU only, 7 parametrized cases
- `test_modular_kernel_combinations_multigpu` — multi-GPU, auto-generated from mk_objects.py registrations

Notes
The incompatibility matrix tests do not require a GPU and can run in any CI environment. The combinatorial multi-GPU tests require 2x Blackwell GPUs with FlashInfer trtllm_moe_alltoall support.
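The parity test described above boils down to a structural assertion: two backends that are supposed to share constraints must report the same incompatibility matrix. A self-contained sketch of the idea (expert names and values are placeholders, not vLLM's real compatibility data):

```python
# Illustrative sketch of the parity-test idea: the two A2A backends are
# expected to share one incompatibility matrix across Expert types, so the
# test compares the mappings entry by entry. Placeholder data only.
EXPERT_TYPES = [f"expert_backend_{i}" for i in range(7)]


def incompat_matrix(backend: str) -> dict:
    # Stand-in for querying _supports_parallel_config per Expert type;
    # a real test would call into the kernel registry instead.
    return {expert: True for expert in EXPERT_TYPES}


def test_parity():
    a = incompat_matrix("flashinfer_moe_a2a")
    b = incompat_matrix("flashinfer_all2allv")
    # Catches drift if one backend's matrix is updated without the other.
    assert a == b, f"matrices diverged: {a} != {b}"


test_parity()
```

Because the check only compares Python mappings, it runs CPU-only in any CI environment, which is what lets the 7 parametrized cases execute without a GPU.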
Reproduction
To reproduce our results, the server can be launched with the following configuration:
To verify correctness, you can run gsm8k: