[Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations #35468

haosdent wants to merge 1 commit into vllm-project:main from
Conversation
Code Review
This pull request addresses a crash in AllReduceFusionPass when using both pipeline parallelism (PP) and tensor parallelism (TP). The fix involves two main parts: first, adding guards to gracefully disable the pass for PP+TP and DP+TP configurations, which is a sensible workaround for an upstream limitation in flashinfer. Second, it re-introduces a patched _TPCommBackend to fix communication issues in flashinfer's distributed backend when multiple TP groups are active. A new regression test is also added to verify that the pass is correctly disabled in a PP+TP setup, ensuring the fix is effective and preventing future regressions. The changes are well-structured and address the described issue effectively. I have one suggestion to improve maintainability by using a standard torch.distributed API instead of a custom one.
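The guard described in this review can be illustrated with a minimal sketch. The names below (ParallelConfig, should_enable_allreduce_fusion) are illustrative stand-ins, not vLLM's actual internal API; the real pass lives in vLLM's compilation machinery and reads its parallel configuration differently:

```python
# Sketch of the guard pattern: disable the fusion pass for multi-group TP
# setups (TP > 1 combined with PP > 1 or DP > 1) until the upstream
# flashinfer fix lands. All class/function names here are hypothetical.
import logging

logger = logging.getLogger(__name__)


class ParallelConfig:
    """Illustrative stand-in for vLLM's parallel configuration."""

    def __init__(self, tp_size: int, pp_size: int, dp_size: int):
        self.tensor_parallel_size = tp_size
        self.pipeline_parallel_size = pp_size
        self.data_parallel_size = dp_size


def should_enable_allreduce_fusion(cfg: ParallelConfig) -> bool:
    # Multiple TP groups exist when PP or DP replicates the TP group;
    # flashinfer's TorchDistBackend currently breaks in that case
    # (flashinfer-ai/flashinfer#2647), so the pass must bail out.
    if cfg.tensor_parallel_size > 1 and (
        cfg.pipeline_parallel_size > 1 or cfg.data_parallel_size > 1
    ):
        logger.warning(
            "AllReduceFusionPass disabled for PP/DP + TP configurations; "
            "see flashinfer-ai/flashinfer#2647"
        )
        return False
    return True
```

Checking the config up front and logging a warning (rather than crashing later inside NCCL with "Duplicate GPU detected") is the graceful-degradation behavior this PR aims for.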
@ProExpertProg @hjjq Can you take a look when you are available? Many thanks!
Thanks for the PR!
Thanks @ilmarkov, I see. If flashinfer will fix this upstream, then we don't need the patched backend and can just disable the pass for now. Let me update the PR later. BTW, do you have the issue links? I could put them in the comments as references.
@haosdent I have filed an issue in Flashinfer: flashinfer-ai/flashinfer#2647
Gracefully disable AllReduceFusionPass when PP > 1 or DP > 1 to avoid the NCCL "Duplicate GPU detected" error caused by upstream flashinfer issues with multi-group TP configurations.

Upstream tracking: flashinfer-ai/flashinfer#2647
Fixes vllm-project#35426

Signed-off-by: haosdent <haosdent@gmail.com>
Thanks @ilmarkov, I have updated the PR; can you take a look when you are available?
Purpose
Fix NCCL "Duplicate GPU detected" error when using -tp 2 -pp 2 -cc.fuse_allreduce_rms=True.

FIX #35426

PR #34109 introduced a regression by using TorchDistBackend for multi-group TP configurations, which has upstream issues with bcast(), IPC socket collisions, and device assignment. Tracked at flashinfer-ai/flashinfer#2647.

This PR gracefully disables AllReduceFusionPass when PP > 1 or DP > 1 until the upstream flashinfer fixes land.

Test Plan
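The regression test mentioned in the review could follow a pattern like the sketch below. The fusion_pass_enabled helper is a hypothetical, self-contained stand-in for the pass's enable check, not the PR's actual test code:

```python
# Sketch of a regression-style test for the PP+TP guard. The predicate
# below is a hypothetical stand-in for the real pass's enable check:
# multi-group TP (PP > 1 or DP > 1 alongside TP > 1) must disable fusion.
def fusion_pass_enabled(tp: int, pp: int, dp: int) -> bool:
    return not (tp > 1 and (pp > 1 or dp > 1))


def test_fusion_disabled_for_pp_tp():
    # The configuration from the bug report: -tp 2 -pp 2.
    assert not fusion_pass_enabled(tp=2, pp=2, dp=1)


def test_fusion_disabled_for_dp_tp():
    assert not fusion_pass_enabled(tp=2, pp=1, dp=2)


def test_fusion_enabled_for_plain_tp():
    # Single TP group: the pass should stay active.
    assert fusion_pass_enabled(tp=2, pp=1, dp=1)


if __name__ == "__main__":
    test_fusion_disabled_for_pp_tp()
    test_fusion_disabled_for_dp_tp()
    test_fusion_enabled_for_plain_tp()
```

Testing the predicate directly, rather than spinning up a multi-GPU run, keeps the regression check cheap; an end-to-end check with -tp 2 -pp 2 would still be needed to confirm the NCCL error is gone.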
Test Result