[Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations#35468

Open
haosdent wants to merge 1 commit into vllm-project:main from haosdent:fix-35426

Conversation

@haosdent
Contributor

@haosdent haosdent commented Feb 27, 2026

Purpose

Fix the NCCL "Duplicate GPU detected" error when using -tp 2 -pp 2 with -cc.fuse_allreduce_rms=True.
FIX #35426

PR #34109 introduced a regression by using TorchDistBackend for multi-group TP configurations, which has upstream issues with bcast(), IPC socket collisions, and device assignment. Tracked at flashinfer-ai/flashinfer#2647.

This PR gracefully disables AllReduceFusionPass when PP > 1 or DP > 1 until the upstream flashinfer fixes land.
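The guard described above can be sketched roughly as follows. This is a simplified illustration, not the actual vLLM implementation; the function name and the tp_size/pp_size/dp_size parameters are assumptions for the sake of the example.

```python
# Hypothetical sketch of the guard this PR adds; the real vLLM code reads
# these sizes from its parallel config rather than taking them as arguments.
def should_enable_allreduce_fusion(tp_size: int, pp_size: int, dp_size: int) -> bool:
    """Decide whether the AllReduce+RMSNorm fusion pass can run safely.

    flashinfer's distributed backend currently misbehaves when more than one
    TP group exists (PP > 1 or DP > 1), which surfaces as the NCCL
    "Duplicate GPU detected" error. Tracked at flashinfer-ai/flashinfer#2647.
    """
    if tp_size <= 1:
        return False  # nothing to fuse without tensor parallelism
    if pp_size > 1 or dp_size > 1:
        return False  # multiple TP groups: disable until upstream fixes land
    return True
```

With a guard like this, -tp 2 alone keeps the fusion enabled, while -tp 2 -pp 2 falls back to the unfused path instead of crashing.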

Test Plan

Test Result

@haosdent haosdent marked this pull request as draft February 27, 2026 03:07
@mergify mergify bot added the nvidia and bug (Something isn't working) labels Feb 27, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a crash in AllReduceFusionPass when using both pipeline parallelism (PP) and tensor parallelism (TP). The fix involves two main parts: first, adding guards to gracefully disable the pass for PP+TP and DP+TP configurations, which is a sensible workaround for an upstream limitation in flashinfer; second, re-introducing a patched _TPCommBackend to fix communication issues in flashinfer's distributed backend when multiple TP groups are active.

A new regression test is also added to verify that the pass is correctly disabled in a PP+TP setup, ensuring the fix is effective and preventing future regressions. The changes are well-structured and address the described issue effectively. I have one suggestion to improve maintainability: use a standard torch.distributed API instead of a custom one.

@haosdent haosdent marked this pull request as ready for review February 27, 2026 03:28
@haosdent haosdent changed the title [WIP] [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations Feb 27, 2026
@haosdent
Contributor Author

@ProExpertProg @hjjq Could you take a look when you are available? Many thanks!

@ilmarkov
Contributor

Thanks for the PR!
When flashinfer fixes the issue with SymmMemBackend, they may also fix the issue with broadcast_object_list and IPC socket opIds so we wouldn't have to create custom _TPCommBackend. Do we actually need this _TPCommBackend at the moment given that we disable the fusion for DP>1 and PP>1?

@haosdent
Contributor Author

haosdent commented Feb 27, 2026

When flashinfer fixes the issue with SymmMemBackend, they may also fix the issue with broadcast_object_list and IPC socket opIds so we wouldn't have to create custom _TPCommBackend.

Thanks @ilmarkov, I see. If flashinfer fixes these together, then we don't need the custom backend and can just disable the pass for now. Let me update the PR later.

BTW, do you have the issue links? I could put them in the comments as references.

@ilmarkov
Contributor

@haosdent I have filed an issue in Flashinfer: flashinfer-ai/flashinfer#2647

Gracefully disable AllReduceFusionPass when PP > 1 or DP > 1 to avoid
NCCL "Duplicate GPU detected" error caused by upstream flashinfer issues
with multi-group TP configurations.

Upstream tracking: flashinfer-ai/flashinfer#2647

Fixes vllm-project#35426

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent haosdent changed the title [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations [WIP] [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations Feb 27, 2026
@haosdent haosdent changed the title [WIP] [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations Feb 27, 2026
@haosdent
Contributor Author

Thanks @ilmarkov, I have updated the PR. Can you take a look when you are available?

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 27, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: AllReduceRMSFusionPass crashes with PP