[Kernel] Add FlashInfer MoE A2A Kernel #36022
Conversation
Code Review
This pull request introduces the FlashInfer MoE A2A kernel, which is a welcome addition for improving performance in large batch size scenarios. The integration of the new kernel is well-executed across the codebase, including configuration, communicator management, and kernel selection logic. I've identified one high-severity issue related to determining the number of GPUs per node, which could lead to suboptimal performance. My detailed feedback and a suggested fix are in the review comment.
Force-pushed from 7c6aef4 to 0b13478.
Documentation preview: https://vllm--36022.org.readthedocs.build/en/36022/
@wzhao18 Was able to get past the blocking trtllm scales issue now and got a good lm_eval on gsm8k R1 NVFP4. This is still necessary on my end:
Hi @elvircrn, thanks for sharing your progress. Would you mind sharing what the cause of the trtllm scales issue was and how you fixed it? As for the other issue, the fix for this would probably just be specifying a return value for
Signed-off-by: Leo Tian <lctian@nvidia.com>
Hi @leo-cf-tian, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
@leo-cf-tian re-running with your latest commit and without my
tlrmchlsmth left a comment
This looks good to me, assuming we see correctness and are past the issue @elvircrn was running into
The trtllm scales issue appears for: and switching to made it go away. Can confirm the int32/int64 index issue went away in both cases.
Thanks @elvircrn. I don't expect many people to set those variables so high, but it could be nice to add a warning in case
tlrmchlsmth left a comment
I'd like to get this into v0.18.0, which cuts tomorrow. Could you please fix the pre-commit issues? It looks like they are caused by divergence from main.
I can help take a look tonight.
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@tylertitsworth I fixed the merge conflicts. Can you start CI for this PR?
@wzhao18 could you hook up this kernel to CI? The test needs to be added to .buildkite/test_areas/kernels.yaml.
Sorry, I thought I had posted the following response, but for some reason it was not submitted.
@tlrmchlsmth I re-examined the test and concluded that it may not be very meaningful to add here: it checks the result of _supports_parallel_config against an expectation derived from the function itself, which seems redundant. Thus I removed the test from the PR.
I think test_modular_kernel_combinations_multigpu should be a unified test that ensures both that (1) _supports_parallel_config is set correctly and (2) the combination actually works in practice. However, as far as I can tell, this test is not in the CI pipeline, and I am having some problems running it even on the current main branch. I will look into this in more detail and potentially improve it in a future PR.
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Squashed from vllm-project#36022. Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Leo Tian <lctian@nvidia.com> Co-authored-by: wzhao18 <wzhao18.sz@gmail.com> Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com> Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com>
Purpose
This PR is a port of PR #32217 to the vLLM top-of-tree after the modular kernel refactors in #32564. It adds the latest TRT-LLM gen A2A kernel from FlashInfer's MoE-A2A API (one-sided all-to-all), as added in flashinfer-ai/flashinfer#2102. This should perform better than the older A2A kernel from #21003 (formerly flashinfer_all2allv) at large batch sizes.
The new kernel can be enabled by specifying `--all2all-backend flashinfer_nvlink_one_sided`. It is only available for nvfp4.

This PR also renames `flashinfer_all2allv` to `flashinfer_nvlink_two_sided`, as per suggestion, since it is more descriptive and matches the new implementation.

We conducted benchmarks and found a noticeable increase in throughput at high concurrency, up to a 14% increase in throughput at 512 concurrency.
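A rename like `flashinfer_all2allv` to `flashinfer_nvlink_two_sided` typically keeps the old name working with a deprecation warning for a transition period. A minimal illustrative sketch of that pattern (hypothetical helper and mapping, not vLLM's actual code):

```python
import warnings

# Hypothetical mapping from deprecated backend names to replacements.
_DEPRECATED_BACKENDS = {
    "flashinfer_all2allv": "flashinfer_nvlink_two_sided",
}

# Illustrative set of accepted canonical names.
_VALID_BACKENDS = {
    "flashinfer_nvlink_one_sided",  # new one-sided A2A kernel (nvfp4 only)
    "flashinfer_nvlink_two_sided",  # formerly flashinfer_all2allv
}


def resolve_all2all_backend(name: str) -> str:
    """Map a user-supplied backend name to its canonical form."""
    if name in _DEPRECATED_BACKENDS:
        new_name = _DEPRECATED_BACKENDS[name]
        warnings.warn(
            f"all2all backend '{name}' has been renamed to '{new_name}'",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    if name not in _VALID_BACKENDS:
        raise ValueError(f"unknown all2all backend: {name}")
    return name
```

With this, `resolve_all2all_backend("flashinfer_all2allv")` returns `"flashinfer_nvlink_two_sided"` while warning, so existing launch scripts keep working during the rename.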
Testing
The PR also adds test coverage from @stecasta:

- Registers `FlashInferMoeA2APrepareAndFinalize` in the modular kernel combinatorial test framework (mk_objects.py), enabling automatic multi-GPU testing against all compatible Expert backends with nvfp4 quantization
- Registers `TrtLlmNvFp4ExpertsModular` in the same framework (previously missing from the test registry)
- Adds a `_supports_parallel_config` incompatibility matrix for the new `flashinfer_moe_a2a` backend across 7 Expert types
- Verifies that `flashinfer_moe_a2a` and `flashinfer_all2allv` share the same incompatibility matrix, catching drift if one is updated without the other

Test plan

- `test_supports_parallel_config_flashinfer_moe_a2a` — CPU only, 7 parametrized cases
- `test_supports_parallel_config_parity_with_all2allv` — CPU only, 7 parametrized cases
- `test_modular_kernel_combinations_multigpu` — multi-GPU, auto-generated from mk_objects.py registrations

Notes
The incompatibility matrix tests do not require a GPU and can run in any CI environment. The combinatorial multi-GPU tests require 2x Blackwell GPUs with FlashInfer trtllm_moe_alltoall support.
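The parity test described above boils down to a structural assertion: two backends that are supposed to share constraints must report the same incompatibility matrix. A self-contained sketch of the idea (expert names and values are placeholders, not vLLM's real compatibility data):

```python
# Illustrative sketch of the parity-test idea: the two A2A backends are
# expected to share one incompatibility matrix across Expert types, so the
# test compares the mappings entry by entry. Placeholder data only.
EXPERT_TYPES = [f"expert_backend_{i}" for i in range(7)]


def incompat_matrix(backend: str) -> dict:
    # Stand-in for querying _supports_parallel_config per Expert type;
    # a real test would call into the kernel registry instead.
    return {expert: True for expert in EXPERT_TYPES}


def test_parity():
    a = incompat_matrix("flashinfer_moe_a2a")
    b = incompat_matrix("flashinfer_all2allv")
    # Catches drift if one backend's matrix is updated without the other.
    assert a == b, f"matrices diverged: {a} != {b}"


test_parity()
```

Because the check only compares Python mappings, it runs CPU-only in any CI environment, which is what lets the 7 parametrized cases execute without a GPU.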
Reproduction
To reproduce our results, the server can be launched with the following configuration:
To verify correctness, you can run gsm8k: