[Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations #35468

haosdent wants to merge 1 commit into vllm-project:main from
Conversation
Code Review
This pull request addresses a crash in AllReduceFusionPass when using both pipeline parallelism (PP) and tensor parallelism (TP). The fix involves two main parts: first, adding guards to gracefully disable the pass for PP+TP and DP+TP configurations, which is a sensible workaround for an upstream limitation in flashinfer. Second, it re-introduces a patched _TPCommBackend to fix communication issues in flashinfer's distributed backend when multiple TP groups are active. A new regression test is also added to verify that the pass is correctly disabled in a PP+TP setup, ensuring the fix is effective and preventing future regressions. The changes are well-structured and address the described issue effectively. I have one suggestion to improve maintainability by using a standard torch.distributed API instead of a custom one.
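The guard described in this review can be illustrated with a minimal sketch. The names below (ParallelConfig, should_enable_allreduce_fusion) are illustrative stand-ins, not vLLM's actual internal API; the real pass lives in vLLM's compilation machinery and reads its parallel configuration differently:

```python
# Sketch of the guard pattern: disable the fusion pass for multi-group TP
# setups (TP > 1 combined with PP > 1 or DP > 1) until the upstream
# flashinfer fix lands. All class/function names here are hypothetical.
import logging

logger = logging.getLogger(__name__)


class ParallelConfig:
    """Illustrative stand-in for vLLM's parallel configuration."""

    def __init__(self, tp_size: int, pp_size: int, dp_size: int):
        self.tensor_parallel_size = tp_size
        self.pipeline_parallel_size = pp_size
        self.data_parallel_size = dp_size


def should_enable_allreduce_fusion(cfg: ParallelConfig) -> bool:
    # Multiple TP groups exist when PP or DP replicates the TP group;
    # flashinfer's TorchDistBackend currently breaks in that case
    # (flashinfer-ai/flashinfer#2647), so the pass must bail out.
    if cfg.tensor_parallel_size > 1 and (
        cfg.pipeline_parallel_size > 1 or cfg.data_parallel_size > 1
    ):
        logger.warning(
            "AllReduceFusionPass disabled for PP/DP + TP configurations; "
            "see flashinfer-ai/flashinfer#2647"
        )
        return False
    return True
```

Checking the config up front and logging a warning (rather than crashing later inside NCCL with "Duplicate GPU detected") is the graceful-degradation behavior this PR aims for.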
@ProExpertProg @hjjq Can you take a look when you are available? Many thanks!
Thanks for the PR!
Thanks @ilmarkov, I see. If flashinfer will fix this upstream, then we don't need the patched backend and can just disable the pass for now. Let me update the PR later. BTW, do you have the issue links? I could put them in the comments as references.
@haosdent I have filed an issue in Flashinfer: flashinfer-ai/flashinfer#2647
Gracefully disable AllReduceFusionPass when PP > 1 or DP > 1 to avoid the NCCL "Duplicate GPU detected" error caused by upstream flashinfer issues with multi-group TP configurations.

Upstream tracking: flashinfer-ai/flashinfer#2647
Fixes vllm-project#35426

Signed-off-by: haosdent <haosdent@gmail.com>
Thanks @ilmarkov, I have updated the PR; can you take a look when you are available?
Purpose
Fix NCCL "Duplicate GPU detected" error when using -tp 2 -pp 2 -cc.fuse_allreduce_rms=True.

FIX #35426

PR #34109 introduced a regression by using TorchDistBackend for multi-group TP configurations, which has upstream issues with bcast(), IPC socket collisions, and device assignment. Tracked at flashinfer-ai/flashinfer#2647.

This PR gracefully disables AllReduceFusionPass when PP > 1 or DP > 1 until the upstream flashinfer fixes land.

Test Plan
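The regression test mentioned in the review could follow a pattern like the sketch below. The fusion_pass_enabled helper is a hypothetical, self-contained stand-in for the pass's enable check, not the PR's actual test code:

```python
# Sketch of a regression-style test for the PP+TP guard. The predicate
# below is a hypothetical stand-in for the real pass's enable check:
# multi-group TP (PP > 1 or DP > 1 alongside TP > 1) must disable fusion.
def fusion_pass_enabled(tp: int, pp: int, dp: int) -> bool:
    return not (tp > 1 and (pp > 1 or dp > 1))


def test_fusion_disabled_for_pp_tp():
    # The configuration from the bug report: -tp 2 -pp 2.
    assert not fusion_pass_enabled(tp=2, pp=2, dp=1)


def test_fusion_disabled_for_dp_tp():
    assert not fusion_pass_enabled(tp=2, pp=1, dp=2)


def test_fusion_enabled_for_plain_tp():
    # Single TP group: the pass should stay active.
    assert fusion_pass_enabled(tp=2, pp=1, dp=1)


if __name__ == "__main__":
    test_fusion_disabled_for_pp_tp()
    test_fusion_disabled_for_dp_tp()
    test_fusion_enabled_for_plain_tp()
```

Testing the predicate directly, rather than spinning up a multi-GPU run, keeps the regression check cheap; an end-to-end check with -tp 2 -pp 2 would still be needed to confirm the NCCL error is gone.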
Test Result