[Bugfix] Fix double reduce in flashinfer_nvlink_two_sided and flashinfer_nvlink_one_sided backends#41382
Conversation
…ded.py and in flashinfer_nvlink_one_sided.py Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request updates the output_is_reduced method to return True in both the one-sided and two-sided FlashInfer NVLink MoE implementations. These changes ensure that the output is correctly flagged as reduced within the model executor layers. I have no feedback to provide.
|
did a change in flashinfer happen? |
I'm not really familiar with those areas in the code in flashinfer, but from blame on that file in flashinfer it looks like the last change was ~10 months ago. I can see there a call to |
hm, I'm wondering if there was something we changed on the vllm side for this |
|
Maybe related to #36022 ? |
|
Hi @amitz-nv, The implementation of the adapter for the Although I do not have the exact metrics anymore, I remember testing both implementations against AGRS using gsm8k and receiving the standard ~91-95% that was expected of the model (DSR1) prior to submitting the PR. I did not post exact metrics at the time but I did include accuracy testing methodology in the PR description. Do you have any details surrounding the exact nature of the accuracy drop? I recall that at one point the Inferact team was facing accuracy drops with this backend that my team could not reproduce. You may need to probe into the exact response of the model but the cause of that issue was repetitive output. I am not sure if we ever found a root cause so I wonder if it is related. Hope this helps. |
|
I think its origin in
Later, PR #36022 (merged March 16th, 2026):
Regardless of the origin - am I correct that flashinfer already does the reduce? I want to make sure I understand the current status correctly and that my fix is correct |
Hi @leo-cf-tian! and thanks for your response. I haven't checked it interactively as of how the degradation looks like in the responses, I only relied on GSM8K. See my previous comment as of how it seems it was introduced. |
|
Can we defer reduction in a2a if flashinfer supports that? Would that be more performant? I assume it fuse moe_combine with a2a combine right? |
…fer_nvlink_one_sided backends (vllm-project#41382) Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…fer_nvlink_one_sided backends (vllm-project#41382) Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…fer_nvlink_one_sided backends (vllm-project#41382) Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…fer_nvlink_one_sided backends (vllm-project#41382) Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Purpose
Fix accuracy degradation when using
--all2all-backend flashinfer_nvlink_two_sidedor--all2all-backend flashinfer_nvlink_one_sided.Apparently reduce is already performed in flashinfer, see https://github.com/flashinfer-ai/flashinfer/blob/v0.6.8.post1/flashinfer/comm/trtllm_alltoall.py#L663 , so no need to perform it again in vLLM. This double reduce seemed to have caused the accuracy degradation, that was measured as following (see Test Plan & Test Result):
Test Plan
On each of the backends, get GSM8K score with the fix and without the fix:
For
--all2all-backend flashinfer_nvlink_two_sided:vLLM command line:
lm_evalcommand line:For
--all2all-backend flashinfer_nvlink_one_sided:vLLM command line:
lm_evalcommand line:Test Result
For
--all2all-backend flashinfer_nvlink_two_sided:GSM8K on
mainwithout this fix (output_is_reducedreturningFalse):GSM8K on
mainwith this fix (output_is_reducedreturningTrue):For
--all2all-backend flashinfer_nvlink_one_sided:GSM8K on
mainwithout this fix (output_is_reducedreturningFalse):GSM8K on
mainwith this fix (output_is_reducedreturningTrue):Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.