
[Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations#34511

Closed
haosdent wants to merge 1 commit into vllm-project:main from haosdent:fix/allreduce-fusion-dp-tp-process-group

Conversation

@haosdent
Contributor

@haosdent haosdent commented Feb 13, 2026

Purpose

Fixes #34401
Fixes #34458

Four issues caused AllReduceFusionPass to fail in DP+TP setups (e.g. dp=2, tp=2 on 4 GPUs):

  1. Missing comm_backend: create_allreduce_fusion_workspace was called without a comm_backend, so flashinfer defaulted to the global process group instead of the TP group. Independent DP groups collided during IPC handle exchange.

  2. flashinfer bcast bug: TorchDistBackend.bcast passes a group-local root directly as src to broadcast_object_list, which expects a global rank. Fixed by using the group_src parameter instead.

  3. flashinfer device_idx bug: SymmDeviceMemory uses device_idx=tp_rank which maps to the wrong GPU for DP groups > 0. Disable the pass for DP+TP until this is fixed upstream in flashinfer.

  4. IPC socket opId collision: opIds collide across TP groups because random.randint produces identical values under vLLM's deterministic seeding. Fixed by offsetting the opId with the global root rank in _TPCommBackend.bcast.
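To make issue 2 concrete: a group-local root only coincides with a global rank in the first TP group. The sketch below is illustrative, not flashinfer's actual code; the dp=2, tp=2 rank layout and the global_src helper are assumptions.

```python
# Hypothetical rank layout for dp=2, tp=2 on 4 GPUs: each TP group
# lists its members as global ranks; position i in a group is the
# "group-local" rank i.
TP_GROUPS = [[0, 1], [2, 3]]


def global_src(group: int, group_root: int = 0) -> int:
    # broadcast_object_list's ``src`` expects a *global* rank, so a
    # group-local root must be translated first. (Recent PyTorch
    # versions accept a ``group_src`` parameter that does this
    # translation internally, which is what the fix relies on.)
    return TP_GROUPS[group][group_root]


assert global_src(0) == 0  # first group: local root 0 happens to equal global rank 0
assert global_src(1) == 2  # second group: passing the local root 0 as src would hit the wrong rank
```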

Test Plan

Add test_all_reduce_fusion_pass_dp_tp, but keep it disabled for now.

Test Result

The test cases are expected to pass, but I don't have machines to verify them.



@mergify mergify bot added the bug Something isn't working label Feb 13, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug in AllReduceFusionPass that caused crashes in Data Parallel + Tensor Parallel (DP+TP) configurations. The fix involves two main changes:

  1. A new _TPCommBackend class is introduced to work around a bug in flashinfer's broadcast implementation, ensuring that global ranks are used for communication as expected.
  2. The AllReduceFusionPass is updated to explicitly use a TP-scoped communication backend when creating the flashinfer workspace, preventing collisions between different data parallel groups.

The changes are well-implemented and directly solve the described issues. Additionally, a new multi-GPU test case (test_all_reduce_fusion_pass_dp_tp) has been added, which effectively reproduces the problematic DP+TP setup and serves as a valuable regression test. The code is clear, and the fix appears robust. Overall, this is a high-quality contribution that improves the stability of vLLM in complex distributed setups.

@ilmarkov
Contributor

ilmarkov commented Feb 13, 2026

I think there was also an issue with the same opId in Socket initialization on root ranks in 2 TP groups.

@ilmarkov
Contributor

Also there is incorrect use of device id at symm memory init. It will use the same devices for two different TP groups.

@haosdent
Contributor Author

Hi @ilmarkov, thanks a lot for your review.

I think there was also an issue with the same opId in Socket initialization on root ranks in 2 TP groups.

Do you mean the opIds are the same here: https://github.com/flashinfer-ai/flashinfer/blob/292f9be3f5f6d76248d4d3577c167fd178d7952d/flashinfer/comm/mnnvl.py#L804-L810 ?

    def _init_ipc_socket(self) -> IpcSocket:
        if self.rank == 0:
            opId = random.randint(0, 2**64 - 1)
        else:
            opId = None
        opId = self.comm.bcast(opId, root=0)
        return IpcSocket(self.rank, opId)
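The failure mode is easy to reproduce outside flashinfer. Under deterministic seeding, every TP-group root seeds its process RNG with the same value and therefore draws the same opId. A minimal sketch (no flashinfer or vLLM code involved; draw_op_id is hypothetical):

```python
import random


def draw_op_id(seed: int) -> int:
    # Each TP-group root seeds its RNG identically under deterministic
    # seeding, then draws an opId exactly as _init_ipc_socket does.
    rng = random.Random(seed)
    return rng.randint(0, 2**64 - 1)


# Roots of two different TP groups, same seed -> identical opIds,
# so their IPC sockets collide.
assert draw_op_id(42) == draw_op_id(42)
```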

Also there is incorrect use of device id at symm memory init. It will use the same devices for two different TP groups.

For this one, should we fix it in flashinfer instead of vLLM?

@ilmarkov
Contributor

ilmarkov commented Feb 13, 2026

@haosdent

Do you mean the opIds are the same here

Yes, probably we need to do something like:

    opId = hash((random.randint(0, 2**64 - 1), self.comm.group_id()))

For this one, should we fix it in flashinfer instead of vLLM?

I'm afraid so. And until the fix lands in a new FlashInfer release, we will need to disable this fusion for the DP>1, TP>1 case.
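The hash-based suggestion above can be sketched in isolation; the group_id argument here is a hypothetical stand-in for whatever group identifier the comm backend exposes:

```python
import random


def unique_op_id(group_id: int) -> int:
    # Mix a seeded random draw with a group identifier so that two
    # TP-group roots sharing the same RNG state still diverge.
    # (hash() of a tuple of ints is deterministic across processes,
    # unlike hash() of strings.)
    return hash((random.randint(0, 2**64 - 1), group_id)) % 2**64


random.seed(0)
a = unique_op_id(group_id=0)
random.seed(0)  # second TP-group root, identical deterministic seed
b = unique_op_id(group_id=1)
assert a != b  # distinct opIds despite identical RNG state
```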

@haosdent force-pushed the fix/allreduce-fusion-dp-tp-process-group branch from 0eec0da to f57a4f9 on February 13, 2026 14:39
Collaborator

@ProExpertProg ProExpertProg left a comment


Putting a block on this until we fully resolve it. How long do we need to wait for a flashinfer upgrade? Could we monkeypatch IPCSocket in the meantime?

@ilmarkov
Contributor

@ProExpertProg

It's not only an IPCSocket thing. Another problem is in the SymmMemory init:

    symm_mem = SymmDeviceMemory(
        aligned_size,
        tp_size,
        tp_rank,
        torch.device("cuda", tp_rank).index,
        comm_backend,
        enable_multicast=False,
        allocate_signal_pads=False,
    )

There we need to use the local rank derived from the global group. So I think we need to fix it in FlashInfer, and we can only enable this with the next FlashInfer release.
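The device_idx problem reduces to plain arithmetic. A minimal sketch, assuming a single 4-GPU node for dp=2, tp=2 (the layout and correct_device_idx helper are illustrative, not flashinfer's code):

```python
GPUS_PER_NODE = 4  # assumed: dp=2, tp=2 on one 4-GPU node


def correct_device_idx(global_rank: int) -> int:
    # The device a rank should bind to is its node-local rank,
    # not its group-local tp_rank.
    return global_rank % GPUS_PER_NODE


# DP group 1 holds global ranks 2 and 3, but its group-local tp_ranks
# are 0 and 1. Passing tp_rank as device_idx would bind them to GPUs
# 0/1, colliding with DP group 0; the node-local rank gives GPUs 2/3.
for tp_rank, global_rank in [(0, 2), (1, 3)]:
    assert correct_device_idx(global_rank) != tp_rank
    assert correct_device_idx(global_rank) == global_rank
```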

Three issues caused AllReduceFusionPass to fail in DP+TP setups
(e.g. dp=2, tp=2 on 4 GPUs):

1. Missing comm_backend: create_allreduce_fusion_workspace was called
   without a comm_backend, so flashinfer defaulted to the global
   process group instead of the TP group. Independent DP groups
   collided during IPC handle exchange.

2. flashinfer bcast bug: TorchDistBackend.bcast passes a group-local
   root directly as src to broadcast_object_list, which expects a
   global rank. Fixed by using the group_src parameter instead.

3. flashinfer device_idx bug: SymmDeviceMemory uses device_idx=tp_rank
   which maps to the wrong GPU for DP groups > 0. Disable the pass
   for DP+TP until this is fixed upstream in flashinfer.

Additionally, IPC socket opIds collide across TP groups because
random.randint produces identical values under vllm's deterministic
seeding. Fixed by offsetting the opId with the global root rank in
_TPCommBackend.bcast.

Fixes vllm-project#34401
Fixes vllm-project#34458

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent force-pushed the fix/allreduce-fusion-dp-tp-process-group branch from f57a4f9 to 0f01bca on February 13, 2026 16:46
@haosdent
Contributor Author

Putting a block on this until we fully resolve it. How long do we need to wait for a flashinfer upgrade? Could we monkeypatch IPCSocket in the meantime?

Thanks @ProExpertProg and @ilmarkov, I have updated the pull request to add the block. I also monkeypatched _TPCommBackend.bcast so that IpcSocket gets a unique opId.

@haosdent haosdent changed the title [WIP][Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations [Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations Feb 15, 2026
@mergify

mergify bot commented Feb 25, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @haosdent.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@haosdent
Contributor Author

It's not only an IPCSocket thing. Another problem is in the SymmMemory init:

BTW @ilmarkov, is there a known issue link for this? Or has the issue not been created yet?

@haosdent
Contributor Author

Closing; this will continue to be tracked in #35468.


Labels: bug (Something isn't working), needs-rebase

Successfully merging this pull request may close these issues:
  • [Bug]: AR+rms broken for TP=2 DP=2
  • [CI Failure]: Distributed Tests (8 GPUs)(H100)

3 participants