[Bugfix] Fix double reduce in flashinfer_nvlink_two_sided and flashinfer_nvlink_one_sided backends by amitz-nv · Pull Request #41382 · vllm-project/vllm

amitz-nv · 2026-04-30T15:15:58Z

Purpose

Fix accuracy degradation when using --all2all-backend flashinfer_nvlink_two_sided or --all2all-backend flashinfer_nvlink_one_sided.

The reduce is already performed in flashinfer (see https://github.com/flashinfer-ai/flashinfer/blob/v0.6.8.post1/flashinfer/comm/trtllm_alltoall.py#L663), so there is no need to perform it again in vLLM. This double reduce appears to have caused the accuracy degradation, which was measured as follows (see Test Plan and Test Result).

Test Plan

For each backend, measure the GSM8K score with and without the fix:

For `--all2all-backend flashinfer_nvlink_two_sided`:

vLLM command line:

python3 -m vllm.entrypoints.openai.api_server --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 -tp 2 -dp 2 --enable-expert-parallel --all2all-backend flashinfer_nvlink_two_sided --async-scheduling --kv-cache-dtype auto --trust-remote-code --no-enable-prefix-caching --max-num-seqs 64

lm_eval command line:

lm_eval --model local-completions --model_args "base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5

For `--all2all-backend flashinfer_nvlink_one_sided`:

vLLM command line:

python3 -m vllm.entrypoints.openai.api_server --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 -tp 2 -dp 2 --enable-expert-parallel --all2all-backend flashinfer_nvlink_one_sided --async-scheduling --kv-cache-dtype auto --trust-remote-code --no-enable-prefix-caching --max-num-seqs 64

lm_eval command line:

lm_eval --model local-completions --model_args "base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5

Test Result

For `--all2all-backend flashinfer_nvlink_two_sided`:

GSM8K on main without this fix (output_is_reduced returning False):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8089|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.6437|±  |0.0132|

GSM8K on main with this fix (output_is_reduced returning True):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9303|±  |0.0070|
|     |       |strict-match    |     5|exact_match|↑  |0.9196|±  |0.0075|

For `--all2all-backend flashinfer_nvlink_one_sided`:

GSM8K on main without this fix (output_is_reduced returning False):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8029|±  |0.0110|
|     |       |strict-match    |     5|exact_match|↑  |0.6376|±  |0.0132|

GSM8K on main with this fix (output_is_reduced returning True):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9265|±  |0.0072|
|     |       |strict-match    |     5|exact_match|↑  |0.9181|±  |0.0076|
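As a rough reading of the four tables above, the gap between the "without fix" and "with fix" runs can be compared with the reported standard errors. The sketch below only re-uses the numbers from the tables; the z-score formula treats the two runs as independent, which is an assumption (both runs use the same GSM8K questions), so the result is an indication rather than a formal test.

```python
import math

# (value, stderr) pairs copied from the tables above: (without fix, with fix).
results = {
    ("two_sided", "flexible-extract"): ((0.8089, 0.0108), (0.9303, 0.0070)),
    ("two_sided", "strict-match"): ((0.6437, 0.0132), (0.9196, 0.0075)),
    ("one_sided", "flexible-extract"): ((0.8029, 0.0110), (0.9265, 0.0072)),
    ("one_sided", "strict-match"): ((0.6376, 0.0132), (0.9181, 0.0076)),
}


def compare(before, after):
    """Return the score gain and a rough z-score (independence assumed)."""
    (v0, se0), (v1, se1) = before, after
    delta = v1 - v0
    z = delta / math.sqrt(se0 ** 2 + se1 ** 2)
    return delta, z


summary = {key: compare(*pair) for key, pair in results.items()}

for (backend, filt), (delta, z) in summary.items():
    print(f"{backend:10s} {filt:16s} delta={delta:+.4f} z={z:.1f}")
```

In every row the gain is many standard errors wide, so the difference is unlikely to be run-to-run noise.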




gemini-code-assist

Code Review

This pull request updates the output_is_reduced method to return True in both the one-sided and two-sided FlashInfer NVLink MoE implementations. These changes ensure that the output is correctly flagged as reduced within the model executor layers. I have no feedback to provide.
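To illustrate why the value of this flag matters, here is a minimal toy sketch in plain Python. It is not vLLM or FlashInfer code: `ToyPrepareFinalize`, `reduce_sum`, and `moe_layer_output` are hypothetical names, and the way the caller consumes the flag is an assumption made for illustration. It shows only the general failure mode: if a finalize step already returns a reduced result but reports that it does not, the caller reduces a second time.

```python
def reduce_sum(partials):
    """Element-wise sum of equally sized vectors (stand-in for a reduce)."""
    return [sum(column) for column in zip(*partials)]


class ToyPrepareFinalize:
    """Hypothetical stand-in for a prepare/finalize adapter; not the vLLM class."""

    def __init__(self, reports_reduced):
        self._reports_reduced = reports_reduced

    def output_is_reduced(self):
        return self._reports_reduced

    def finalize(self, partials):
        # Assumption: the communication library already sums the partial
        # outputs, so every rank receives the same fully reduced vector.
        reduced = reduce_sum(partials)
        return [list(reduced) for _ in partials]


def moe_layer_output(finalizer, partials):
    """Toy caller that trusts the flag to decide whether to reduce again."""
    per_rank = finalizer.finalize(partials)
    if finalizer.output_is_reduced():
        return per_rank[0]
    # The caller believes the outputs are still partial and reduces again.
    return reduce_sum(per_rank)


partials = [[1.0, 2.0], [3.0, 4.0]]  # one partial output vector per rank
correct = moe_layer_output(ToyPrepareFinalize(True), partials)
doubled = moe_layer_output(ToyPrepareFinalize(False), partials)
print(correct)  # [4.0, 6.0]
print(doubled)  # [8.0, 12.0]
```

With two toy ranks the mis-flagged path returns twice the intended values, which is the kind of silent scaling error that can lower accuracy without crashing.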

robertgshaw2-redhat · 2026-04-30T15:21:48Z

did a change in flashinfer happen?

amitz-nv · 2026-04-30T15:25:30Z

> did a change in flashinfer happen?

I'm not really familiar with those areas of the flashinfer code, but from the blame on that file it looks like the last change was ~10 months ago. I can see a call to moe_comm there, followed by torch.sum over the experts dim.
See https://github.com/flashinfer-ai/flashinfer/blame/v0.6.8.post1/flashinfer/comm/trtllm_alltoall.py#L663
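For readers unfamiliar with the pattern, summing over the experts dimension can be sketched in plain Python as follows. The `[token][expert][hidden]` layout and the name `sum_over_experts` are assumptions for illustration, not taken from the flashinfer source; for a tensor with that layout, `torch.sum(x, dim=1)` would be the equivalent call.

```python
def sum_over_experts(expert_outputs):
    """Collapse expert_outputs[token][expert][hidden] to out[token][hidden]."""
    return [
        [sum(values) for values in zip(*per_token)]
        for per_token in expert_outputs
    ]


# Two tokens, two selected experts per token, hidden size 3 (toy values).
expert_outputs = [
    [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
    [[0.5, 0.5, 0.5], [1.5, 2.5, 3.5]],
]

combined = sum_over_experts(expert_outputs)
print(combined)  # [[11.0, 22.0, 33.0], [2.0, 3.0, 4.0]]
```

After this step each token has a single hidden vector, so a caller has nothing left to reduce over the experts dimension.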

robertgshaw2-redhat · 2026-04-30T15:45:22Z

> I'm not really familiar with those areas in the code in flashinfer, but from blame on that file in flashinfer it looks like the last change was ~10 months ago. I can see there a call to moe_comm and then torch.sum over the experts dim.

hm, I'm wondering if there was something we changed on the vllm side for this

amitz-nv · 2026-04-30T19:13:28Z

Maybe related to #36022 ?
@leo-cf-tian

leo-cf-tian · 2026-04-30T20:09:45Z

Hi @amitz-nv,

The implementation of the adapter for the --all2all-backend flashinfer_nvlink_one_sided backend was modeled directly after the --all2all-backend flashinfer_nvlink_two_sided backend. Looking at the history, it seems that output_is_reduced returning False was the established behaviour for --all2all-backend flashinfer_nvlink_two_sided prior to my PR.

Although I do not have the exact metrics anymore, I remember testing both implementations against AGRS using gsm8k and receiving the standard ~91-95% that was expected of the model (DSR1) prior to submitting the PR. I did not post exact metrics at the time but I did include accuracy testing methodology in the PR description.

Do you have any details surrounding the exact nature of the accuracy drop? I recall that at one point the Inferact team was facing accuracy drops with this backend that my team could not reproduce. You may need to probe into the exact response of the model but the cause of that issue was repetitive output. I am not sure if we ever found a root cause so I wonder if it is related.

Hope this helps.

amitz-nv · 2026-05-03T08:40:14Z

I think its origin in flashinfer_nvlink_two_sided is from #32567 (merged Jan 27th, 2026):

It added flashinfer_a2a_prepare_finalize.py with output_is_reduced returning False, see https://github.com/vllm-project/vllm/pull/32567/changes#diff-3f1a3c1c33881ba34e61e926feec7951f164f43732b406c9006856fb2bed3bc7R46
I assume it was copied from the deleted flashinfer_cutlass_prepare_finalize.py, see https://github.com/vllm-project/vllm/pull/32567/changes#diff-4c97733106874eec0a167e7b9e4bb0ebb024f20d644d4385bee718beca848f2fL58

Later, PR #36022 (merged March 16th, 2026):

Renamed flashinfer_a2a_prepare_finalize.py to flashinfer_nvlink_two_sided_prepare_finalize.py
Added flashinfer_nvlink_one_sided_prepare_finalize.py also with output_is_reduced returning False, probably because that's how flashinfer_a2a_prepare_finalize.py implemented it.

Regardless of the origin, am I correct that flashinfer already does the reduce? I want to make sure I understand the current status correctly and that my fix is correct.

@robertgshaw2-redhat

amitz-nv · 2026-05-03T18:13:22Z

> Hi @amitz-nv,
>
> The implementation of the adapter for the --all2all-backend flashinfer_nvlink_one_sided backend was modeled directly after the --all2all-backend flashinfer_nvlink_two_sided backend. Looking at the history, it seems that output_is_reduced returning False was the established behaviour for --all2all-backend flashinfer_nvlink_two_sided prior to my PR.
>
> Although I do not have the exact metrics anymore, I remember testing both implementations against AGRS using gsm8k and receiving the standard ~91-95% that was expected of the model (DSR1) prior to submitting the PR. I did not post exact metrics at the time but I did include accuracy testing methodology in the PR description.
>
> Do you have any details surrounding the exact nature of the accuracy drop? I recall that at one point the Inferact team was facing accuracy drops with this backend that my team could not reproduce. You may need to probe into the exact response of the model but the cause of that issue was repetitive output. I am not sure if we ever found a root cause so I wonder if it is related.
>
> Hope this helps.

Hi @leo-cf-tian, and thanks for your response. I haven't inspected the responses interactively to see what the degradation looks like; I relied only on GSM8K. See my previous comment for how it seems to have been introduced.
Regardless of its history, if we focus only on the change in this PR, do you think it's correct, or am I misunderstanding something?

zyongye · 2026-05-11T22:12:31Z

Can we defer the reduction in a2a if flashinfer supports that? Would that be more performant? I assume it fuses moe_combine with the a2a combine, right?


Change 'output_is_reduced' to return True in flashinfer_nvlink_two_si…

30ad820

…ded.py and in flashinfer_nvlink_one_sided.py Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>

amitz-nv requested review from mgoin and pavanimajety as code owners April 30, 2026 15:15

claude Bot reviewed Apr 30, 2026


mergify Bot added the nvidia and bug labels Apr 30, 2026

github-project-automation Bot added this to NVIDIA Apr 30, 2026

gemini-code-assist Bot reviewed Apr 30, 2026


bnellnm approved these changes Apr 30, 2026


mgoin approved these changes May 7, 2026


github-project-automation Bot moved this to Ready in NVIDIA May 7, 2026

mgoin added the ready label May 7, 2026

mgoin enabled auto-merge (squash) May 7, 2026 21:15

Merge branch 'main' into fix-accuracy-fi-nvlink-one-two-sided

0d99149

Kevin-XiongC mentioned this pull request May 8, 2026

[Fix]Fix one-sided MoE padding sentinel for local expert maps #42034

Open

Merge branch 'main' into fix-accuracy-fi-nvlink-one-two-sided

039c635

robertgshaw2-redhat requested a review from zyongye as a code owner May 11, 2026 18:52

mgoin merged commit ef34592 into vllm-project:main May 12, 2026
70 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA May 12, 2026

mgoin merged 3 commits into vllm-project:main from amitz-nv:fix-accuracy-fi-nvlink-one-two-sided
