
[Bugfix] Fix double reduce in flashinfer_nvlink_two_sided and flashinfer_nvlink_one_sided backends#41382

Merged: mgoin merged 3 commits into vllm-project:main from amitz-nv:fix-accuracy-fi-nvlink-one-two-sided on May 12, 2026

@amitz-nv
Contributor

@amitz-nv amitz-nv commented Apr 30, 2026

Purpose

Fix accuracy degradation when using --all2all-backend flashinfer_nvlink_two_sided or --all2all-backend flashinfer_nvlink_one_sided.

The reduce is already performed inside flashinfer (see https://github.com/flashinfer-ai/flashinfer/blob/v0.6.8.post1/flashinfer/comm/trtllm_alltoall.py#L663), so vLLM does not need to perform it again. This double reduce appears to have caused the accuracy degradation measured on GSM8K below (see Test Plan and Test Result).
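As a rough illustration of why a wrong `output_is_reduced` value can matter, here is a toy sketch in plain Python. Only the method name `output_is_reduced` comes from this PR; `toy_combine`, `ToyPrepareFinalize`, and `toy_layer_forward` are hypothetical, and the sketch assumes the caller's second reduction behaves like a sum all-reduce over ranks that all hold the same already-reduced tensor. It is not the vLLM or flashinfer implementation.

```python
from typing import List


def toy_combine(per_rank_partials: List[List[float]]) -> List[float]:
    """Hypothetical combine step that already sums the partial outputs."""
    return [sum(values) for values in zip(*per_rank_partials)]


class ToyPrepareFinalize:
    """Hypothetical adapter; only the name output_is_reduced comes from the PR."""

    def __init__(self, reports_reduced: bool) -> None:
        self._reports_reduced = reports_reduced

    def output_is_reduced(self) -> bool:
        return self._reports_reduced

    def finalize(self, per_rank_partials: List[List[float]]) -> List[float]:
        # The combine step has already summed the partial outputs here.
        return toy_combine(per_rank_partials)


def toy_layer_forward(
    pf: ToyPrepareFinalize,
    per_rank_partials: List[List[float]],
    world_size: int,
) -> List[float]:
    out = pf.finalize(per_rank_partials)
    if not pf.output_is_reduced():
        # Assumption: the caller now applies a sum all-reduce. Because every
        # rank holds the same reduced tensor, the sum scales it by world_size.
        out = [value * world_size for value in out]
    return out


partials = [[1.0, 2.0], [3.0, 4.0]]
correct = toy_layer_forward(ToyPrepareFinalize(True), partials, world_size=2)
doubled = toy_layer_forward(ToyPrepareFinalize(False), partials, world_size=2)
print(correct, doubled)
```

In this toy model the flag is the only thing that decides whether the second reduction runs, which is why a one-line change to its return value can alter the numerics.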

Test Plan

For each backend, measure the GSM8K score with and without the fix:

For --all2all-backend flashinfer_nvlink_two_sided:

vLLM command line:

python3 -m vllm.entrypoints.openai.api_server --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 -tp 2 -dp 2 --enable-expert-parallel --all2all-backend flashinfer_nvlink_two_sided --async-scheduling --kv-cache-dtype auto --trust-remote-code --no-enable-prefix-caching --max-num-seqs 64

lm_eval command line:

lm_eval --model local-completions --model_args "base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5

For --all2all-backend flashinfer_nvlink_one_sided:

vLLM command line:

python3 -m vllm.entrypoints.openai.api_server --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 -tp 2 -dp 2 --enable-expert-parallel --all2all-backend flashinfer_nvlink_one_sided --async-scheduling --kv-cache-dtype auto --trust-remote-code --no-enable-prefix-caching --max-num-seqs 64

lm_eval command line:

lm_eval --model local-completions --model_args "base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5
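The two server commands above differ only in the `--all2all-backend` value, so they can be generated from one template. The sketch below only assembles and prints the command strings; every flag and value is copied from the test plan, while the helper names `vllm_server_command` and `lm_eval_command` are made up for illustration.

```python
import shlex
from typing import List

BACKENDS = ["flashinfer_nvlink_two_sided", "flashinfer_nvlink_one_sided"]


def vllm_server_command(backend: str) -> List[str]:
    """Server command from the test plan, parameterized by backend."""
    return [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
        "-tp", "2", "-dp", "2",
        "--enable-expert-parallel",
        "--all2all-backend", backend,
        "--async-scheduling",
        "--kv-cache-dtype", "auto",
        "--trust-remote-code",
        "--no-enable-prefix-caching",
        "--max-num-seqs", "64",
    ]


def lm_eval_command() -> List[str]:
    """lm_eval command from the test plan (identical for both backends)."""
    model_args = ",".join([
        "base_url=http://0.0.0.0:8000/v1/completions",
        "max_length=8192",
        "tokenized_requests=False",
        "tokenizer_backend=None",
        "num_concurrent=32",
    ])
    return [
        "lm_eval", "--model", "local-completions",
        "--model_args", model_args,
        "--tasks", "gsm8k", "--num_fewshot", "5",
    ]


server_commands = {b: shlex.join(vllm_server_command(b)) for b in BACKENDS}
eval_command = shlex.join(lm_eval_command())

for backend, command in server_commands.items():
    print(f"# {backend}\n{command}\n{eval_command}\n")
```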

Test Result

For --all2all-backend flashinfer_nvlink_two_sided:

GSM8K on main without this fix (output_is_reduced returning False):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8089|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.6437|±  |0.0132|

GSM8K on main with this fix (output_is_reduced returning True):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9303|±  |0.0070|
|     |       |strict-match    |     5|exact_match|↑  |0.9196|±  |0.0075|

For --all2all-backend flashinfer_nvlink_one_sided:

GSM8K on main without this fix (output_is_reduced returning False):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8029|±  |0.0110|
|     |       |strict-match    |     5|exact_match|↑  |0.6376|±  |0.0132|

GSM8K on main with this fix (output_is_reduced returning True):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9265|±  |0.0072|
|     |       |strict-match    |     5|exact_match|↑  |0.9181|±  |0.0076|
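To put the four before/after pairs side by side, the sketch below computes the score difference for each pair and divides it by the combined standard error. The values are copied from the tables above; treating the two runs as independent (so that the standard errors combine in quadrature) is an assumption of this sketch, not something stated in the PR.

```python
import math
from typing import Dict, Tuple

Score = Tuple[float, float]  # (exact_match value, stderr)

# (backend, filter) -> (without fix, with fix), copied from the tables above.
RESULTS: Dict[Tuple[str, str], Tuple[Score, Score]] = {
    ("two_sided", "flexible-extract"): ((0.8089, 0.0108), (0.9303, 0.0070)),
    ("two_sided", "strict-match"): ((0.6437, 0.0132), (0.9196, 0.0075)),
    ("one_sided", "flexible-extract"): ((0.8029, 0.0110), (0.9265, 0.0072)),
    ("one_sided", "strict-match"): ((0.6376, 0.0132), (0.9181, 0.0076)),
}


def compare(before: Score, after: Score) -> Tuple[float, float]:
    """Return (delta, delta / combined stderr) for one before/after pair."""
    delta = after[0] - before[0]
    combined_stderr = math.hypot(before[1], after[1])
    return delta, delta / combined_stderr


summary = {key: compare(*pair) for key, pair in RESULTS.items()}

for (backend, filt), (delta, ratio) in summary.items():
    print(f"{backend:10s} {filt:17s} delta={delta:+.4f} ratio={ratio:.1f}")
```

Under that independence assumption, each improvement is many combined standard errors wide, so the gap is unlikely to be sampling noise.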


…ded.py and in flashinfer_nvlink_one_sided.py

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>

@mergify mergify Bot added nvidia bug Something isn't working labels Apr 30, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the output_is_reduced method to return True in both the one-sided and two-sided FlashInfer NVLink MoE implementations. These changes ensure that the output is correctly flagged as reduced within the model executor layers. I have no feedback to provide.

@robertgshaw2-redhat
Collaborator

did a change in flashinfer happen?

@amitz-nv
Contributor Author

> did a change in flashinfer happen?

I'm not really familiar with those areas in the code in flashinfer, but from blame on that file in flashinfer it looks like the last change was ~10 months ago. I can see there a call to moe_comm and then torch.sum over the experts dim.
See https://github.com/flashinfer-ai/flashinfer/blame/v0.6.8.post1/flashinfer/comm/trtllm_alltoall.py#L663
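For readers unfamiliar with the "sum over the experts dim" step mentioned above, here is a minimal plain-Python sketch. It assumes the combined tensor is laid out as [num_tokens, top_k, hidden], which is a guess made for illustration and not confirmed by the linked code; under that assumption the equivalent PyTorch call would be `torch.sum(combined, dim=1)`. The function name `sum_over_experts` is hypothetical.

```python
from typing import List


def sum_over_experts(combined: List[List[List[float]]]) -> List[List[float]]:
    """Collapse a [num_tokens, top_k, hidden] nested list to [num_tokens, hidden]."""
    return [
        [sum(column) for column in zip(*per_expert_rows)]
        for per_expert_rows in combined
    ]


# Two tokens, top_k = 2 experts per token, hidden size 2.
combined = [
    [[1.0, 2.0], [10.0, 20.0]],
    [[0.5, 0.5], [1.5, 2.5]],
]
reduced = sum_over_experts(combined)
print(reduced)
```

After this step each token has a single hidden vector, so a caller that sums again would be adding contributions that were already accumulated.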

@robertgshaw2-redhat
Collaborator

> I'm not really familiar with those areas in the code in flashinfer, but from blame on that file in flashinfer it looks like the last change was ~10 months ago. I can see there a call to moe_comm and then torch.sum over the experts dim.

hm, I'm wondering if there was something we changed on the vllm side for this

@amitz-nv
Contributor Author

Maybe related to #36022 ?
@leo-cf-tian

@leo-cf-tian
Contributor

Hi @amitz-nv,

The implementation of the adapter for the --all2all-backend flashinfer_nvlink_one_sided backend was modeled directly after the --all2all-backend flashinfer_nvlink_two_sided backend. Looking at the history, it seems that output_is_reduced returning False was the established behaviour for --all2all-backend flashinfer_nvlink_two_sided prior to my PR.

Although I do not have the exact metrics anymore, I remember testing both implementations against AGRS using gsm8k and receiving the standard ~91-95% that was expected of the model (DSR1) prior to submitting the PR. I did not post exact metrics at the time but I did include accuracy testing methodology in the PR description.

Do you have any details surrounding the exact nature of the accuracy drop? I recall that at one point the Inferact team was facing accuracy drops with this backend that my team could not reproduce. You may need to probe into the exact response of the model but the cause of that issue was repetitive output. I am not sure if we ever found a root cause so I wonder if it is related.

Hope this helps.
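One simple way to probe responses for the repetitive output described above is to measure how many word n-grams in a completion occur more than once. The sketch below is a hypothetical helper (`repeated_ngram_fraction` is not part of vLLM or lm_eval), and what counts as a suspicious value is left to the reader.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams in `text` that occur more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)


looping = "the answer is 4 " * 10
normal = "Natalia sold 48 clips in April and half as many in May"
looping_score = repeated_ngram_fraction(looping)
normal_score = repeated_ngram_fraction(normal)
print(looping_score, normal_score)
```

Running it over the completions from a run with and without the fix would show whether the degraded run loops more often.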

@amitz-nv
Contributor Author

amitz-nv commented May 3, 2026

I think its origin in flashinfer_nvlink_two_sided is from #32567 (merged Jan 27th, 2026):

Later, PR #36022 (merged March 16th, 2026):

  • Renamed flashinfer_a2a_prepare_finalize.py to flashinfer_nvlink_two_sided_prepare_finalize.py
  • Added flashinfer_nvlink_one_sided_prepare_finalize.py also with output_is_reduced returning False, probably because that's how flashinfer_a2a_prepare_finalize.py implemented it.

Regardless of the origin - am I correct that flashinfer already does the reduce? I want to make sure I understand the current status correctly and that my fix is correct

@robertgshaw2-redhat

@amitz-nv
Contributor Author

amitz-nv commented May 3, 2026

Hi @leo-cf-tian, and thanks for your response. I haven't inspected the responses interactively to see what the degradation looks like; I relied only on GSM8K. See my previous comment for how it seems to have been introduced.
Regardless of its history, if we focus only on the change in this PR, do you think it's correct, or am I misunderstanding something?

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA May 7, 2026
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label May 7, 2026
@mgoin mgoin enabled auto-merge (squash) May 7, 2026 21:15
@zyongye
Member

zyongye commented May 11, 2026

Can we defer the reduction to a2a if flashinfer supports that? Would that be more performant? I assume it fuses moe_combine with the a2a combine, right?

@mgoin mgoin merged commit ef34592 into vllm-project:main May 12, 2026
70 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA May 12, 2026