
[Kernel] Add FlashInfer MoE A2A Kernel #36022

Merged
ywang96 merged 13 commits into vllm-project:main from CentML:wzhao/moe-a2a
Mar 16, 2026

Conversation


@leo-cf-tian leo-cf-tian commented Mar 4, 2026

Purpose

This PR is a port of PR #32217 to the vLLM top of tree, following the modular kernel refactors in #32564. It adds the latest TRT-LLM gen A2A kernel from FlashInfer's MoE-A2A API (one-sided all-to-all), introduced in flashinfer-ai/flashinfer#2102. This should perform better than the older A2A kernel from #21003 (formerly flashinfer_all2allv) at large batch sizes.

The new kernel can be enabled by specifying --all2all-backend flashinfer_nvlink_one_sided. It is only available for nvfp4.

This PR also renames flashinfer_all2allv to flashinfer_nvlink_two_sided, as suggested in review, since the new name is more descriptive and matches the naming of the new implementation.

We benchmarked this change and observed a noticeable throughput improvement at high concurrency, up to 14% at a concurrency of 512.

[Benchmark figure]

Testing

The PR also adds test coverage from @stecasta.

  • Register FlashInferMoeA2APrepareAndFinalize in the modular kernel combinatorial test framework (mk_objects.py), enabling automatic multi-GPU testing against all compatible Expert backends with nvfp4 quantization
  • Register TrtLlmNvFp4ExpertsModular in the same framework (previously missing from the test registry)
  • Add parametrized tests validating the _supports_parallel_config incompatibility matrix for the new flashinfer_moe_a2a backend across 7 Expert types
  • Add a parity test ensuring flashinfer_moe_a2a and flashinfer_all2allv share the same incompatibility matrix, catching drift if one is updated without the other
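The parity test described above can be sketched in pure Python as follows. The config fields and the support rule are illustrative assumptions, not vLLM's actual _supports_parallel_config logic; the point is only the shape of the check, which flags any config where the two backends disagree.

```python
# Hypothetical sketch of the parity test between the two FlashInfer A2A
# backends' incompatibility matrices. Fields and rules are illustrative.

PARALLEL_CONFIGS = [
    {"dp_size": dp, "use_ep": ep}
    for dp in (1, 2, 4, 8)
    for ep in (False, True)
]

def supports_moe_a2a(cfg):
    # Assumed rule for the one-sided kernel: requires expert parallelism
    # and more than one data-parallel rank.
    return cfg["use_ep"] and cfg["dp_size"] > 1

def supports_all2allv(cfg):
    # The two-sided kernel is assumed to share the same constraints; the
    # parity test exists precisely to catch drift between the two.
    return cfg["use_ep"] and cfg["dp_size"] > 1

def check_parity():
    # Returns the configs where the two backends disagree (empty = parity).
    return [cfg for cfg in PARALLEL_CONFIGS
            if supports_moe_a2a(cfg) != supports_all2allv(cfg)]
```

In the real test each case would be a pytest parametrization, so a single drifted entry shows up as one failing case rather than a monolithic assertion.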

Test plan

  • test_supports_parallel_config_flashinfer_moe_a2a — CPU only, 7 parametrized cases
  • test_supports_parallel_config_parity_with_all2allv — CPU only, 7 parametrized cases
  • Combinatorial coverage via test_modular_kernel_combinations_multigpu — multi-GPU, auto-generated from mk_objects.py registrations
  • Future: dedicated multi-GPU test with broader shape coverage once the A2A manager supports standalone initialization

Notes

The incompatibility matrix tests do not require a GPU and can run in any CI environment. The combinatorial multi-GPU tests require 2x Blackwell GPUs with FlashInfer trtllm_moe_alltoall support.

Reproduction

To reproduce our results, the server can be launched with the following configuration:

vllm serve nvidia/DeepSeek-R1-NVFP4 \
    --trust-remote-code \
    --max-num-seqs 1024 \
    --max-num-batched-tokens 2048 \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --max-cudagraph-capture-size 2048 \
    --data-parallel-size 8 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.8 \
    --all2all-backend (None / flashinfer_all2allv / flashinfer_moe_a2a)

To verify correctness, you can run gsm8k:

lm_eval --model local-chat-completions \
    --tasks gsm8k \
    --model_args base_url=http://0.0.0.0:8000/v1/chat/completions,max_gen_toks=16384,num_concurrent=64 \
    --batch_size auto \
    --fewshot_as_multiturn \
    --apply_chat_template

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the FlashInfer MoE A2A kernel, which is a welcome addition for improving performance in large batch size scenarios. The integration of the new kernel is well-executed across the codebase, including configuration, communicator management, and kernel selection logic. I've identified one high-severity issue related to determining the number of GPUs per node, which could lead to suboptimal performance. My detailed feedback and a suggested fix are in the review comment.


mergify bot commented Mar 11, 2026

Documentation preview: https://vllm--36022.org.readthedocs.build/en/36022/


elvircrn commented Mar 11, 2026

@wzhao18 I was able to get past the blocking trtllm scales issue and got a good lm_eval result on gsm8k with R1 NVFP4. This is still necessary on my end:

sed -i 's/token_selected_experts=topk_ids,/token_selected_experts=topk_ids.to(torch.int32),/' \
    vllm/model_executor/layers/fused_moe/prepare_finalize/flashinfer_moe_a2a.py

@leo-cf-tian (Contributor, Author)

Hi @elvircrn,

Thanks for sharing your progress, would you mind sharing what the cause of the trtllm scales issue was and how you fixed it?

As for the topk_id dtype error: most routers construct the tensor with a default dtype of int32, which explains why I was not able to reproduce the crash. Simulated routing, however, appears to create topk_id tensors of int64 by default, which may have caused the crash if you were using that implementation.

The fix for this would probably be to specify a return value for topk_indices_dtype inside the prepare_finalize class, so that the expected dtype is enforced at routing time.
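The idea can be sketched roughly as below. The class and router here are simplified stand-ins (with a string standing in for torch.int32), not vLLM's actual implementation: the prepare/finalize class declares the index dtype it requires, and the router honors that declaration when building topk_ids, so no int64 tensors reach the kernel.

```python
# Hypothetical sketch of declaring the required index dtype on the
# prepare/finalize class instead of casting at the kernel call site.

class FlashInferMoeA2APrepareAndFinalizeSketch:
    def topk_indices_dtype(self):
        # The kernel only accepts 32-bit indices, so enforce it here.
        # "int32" stands in for torch.int32 in this sketch.
        return "int32"

def simulated_route(num_tokens, topk, indices_dtype):
    # A simulated router that honors the declared dtype when building
    # topk_ids; a real one would allocate with dtype=indices_dtype.
    return {"shape": (num_tokens, topk), "dtype": indices_dtype}

pf = FlashInferMoeA2APrepareAndFinalizeSketch()
topk_ids = simulated_route(num_tokens=4, topk=2,
                           indices_dtype=pf.topk_indices_dtype())
```

With the dtype fixed at the source, the sed-based cast inside the kernel call becomes unnecessary.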

Signed-off-by: Leo Tian <lctian@nvidia.com>

mergify bot commented Mar 12, 2026

Hi @leo-cf-tian, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10


mergify bot commented Mar 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @leo-cf-tian.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 13, 2026

elvircrn commented Mar 13, 2026

@leo-cf-tian re-running with your latest commit and without my sed.

@tlrmchlsmth tlrmchlsmth left a comment

This looks good to me, assuming we see correctness and are past the issue @elvircrn was running into.

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Mar 14, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 14, 2026
@elvircrn

@tlrmchlsmth @leo-cf-tian

The trtllm scales issue appears with:

    --max-cudagraph-capture-size 32768 \
    --max-num-batched-tokens 32768 \

and switching to

    --max-cudagraph-capture-size 8192 \
    --max-num-batched-tokens 8192 \

made it go away.

Can confirm the int32/int64 index issue went away in both cases.

@tlrmchlsmth

Thanks @elvircrn. I don't expect many people to set those variables so high, but it could be nice to add a warning just in case.
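A warning along those lines could look something like the sketch below. The function name and the 8192 threshold are assumptions taken only from the observations in this thread (issue seen at 32768, gone at 8192); the actual safe limit has not been established.

```python
import logging

logger = logging.getLogger("vllm.flashinfer_moe_a2a")

# Empirical threshold from this thread: the trtllm scales issue appeared
# at 32768 tokens and went away at 8192. The exact limit is an assumption.
MAX_SAFE_NUM_BATCHED_TOKENS = 8192

def warn_if_batch_too_large(max_num_batched_tokens: int) -> bool:
    """Return True if the setting is within the observed-safe range,
    logging a warning otherwise."""
    if max_num_batched_tokens > MAX_SAFE_NUM_BATCHED_TOKENS:
        logger.warning(
            "max-num-batched-tokens=%d exceeds %d; the FlashInfer MoE A2A "
            "trtllm scales issue has been observed above this size.",
            max_num_batched_tokens, MAX_SAFE_NUM_BATCHED_TOKENS,
        )
        return False
    return True
```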

@tlrmchlsmth tlrmchlsmth left a comment
I'd like to get this into v0.18.0, which cuts tomorrow. Could you please fix the pre-commit issues? It looks like they are caused by divergence from main.


wzhao18 commented Mar 15, 2026

I can help take a look tonight.

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>

wzhao18 commented Mar 16, 2026

@tylertitsworth I fixed the merge conflicts. Can you start CI for this PR?

@mergify mergify bot removed the needs-rebase label Mar 16, 2026
@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 16, 2026
@tlrmchlsmth tlrmchlsmth left a comment
Could we get


@wzhao18 could you hook up this kernel to CI?

needs to be added to .buildkite/test_areas/kernels.yaml

@wzhao18 wzhao18 commented Mar 16, 2026

Sorry, I thought I had posted the following response, but for some reason it was not submitted.

@tlrmchlsmth I re-examined the test and concluded it may not be very meaningful to add here. It checks the result of _supports_parallel_config against an expectation derived from the function itself, which seems redundant, so I removed the test from the PR.

I think test_modular_kernel_combinations_multigpu should be a unified test that ensures both that (1) _supports_parallel_config is set correctly and (2) the combination actually works in practice. However, as far as I can tell, this test is not in the CI pipeline, and I am having some problems running it even on the current main branch. I will look into this in more detail and potentially improve it in a future PR.

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@ywang96 ywang96 merged commit 2754231 into vllm-project:main Mar 16, 2026
64 of 65 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 16, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 16, 2026
elvircrn added a commit to elvircrn/vllm that referenced this pull request Mar 16, 2026
Squashed from vllm-project#36022.

Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Leo Tian <lctian@nvidia.com>
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Leo Tian <lctian@nvidia.com>
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Leo Tian <lctian@nvidia.com>
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com>

Labels

ci/build, documentation, frontend, gpt-oss, multi-modality, new-model, nvidia, performance, qwen, ready, rocm, structured-output, v1

Projects

Status: Done

10 participants