
[Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations#34511

Closed
haosdent wants to merge 1 commit into vllm-project:main from haosdent:fix/allreduce-fusion-dp-tp-process-group

Conversation

@haosdent
Contributor

@haosdent haosdent commented Feb 13, 2026

Purpose

Fixes #34401
Fixes #34458

Four issues caused AllReduceFusionPass to fail in DP+TP setups (e.g. dp=2, tp=2 on 4 GPUs):

  1. Missing comm_backend: create_allreduce_fusion_workspace was called without a comm_backend, so flashinfer defaulted to the global process group instead of the TP group. Independent DP groups collided during IPC handle exchange.

  2. flashinfer bcast bug: TorchDistBackend.bcast passes a group-local root directly as src to broadcast_object_list, which expects a global rank. Fixed by using the group_src parameter instead.

  3. flashinfer device_idx bug: SymmDeviceMemory uses device_idx=tp_rank which maps to the wrong GPU for DP groups > 0. Disable the pass for DP+TP until this is fixed upstream in flashinfer.

  4. IPC socket opId collision: opIds collide across TP groups because random.randint produces identical values under vLLM's deterministic seeding. Fixed by offsetting the opId with the global root rank in _TPCommBackend.bcast.
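To make issue 2 concrete: a group-local root only coincides with a global rank in the first TP group. The sketch below is illustrative, not flashinfer's actual code; the dp=2, tp=2 rank layout and the global_src helper are assumptions.

```python
# Hypothetical rank layout for dp=2, tp=2 on 4 GPUs: each TP group
# lists its members as global ranks; position i in a group is the
# "group-local" rank i.
TP_GROUPS = [[0, 1], [2, 3]]


def global_src(group: int, group_root: int = 0) -> int:
    # broadcast_object_list's ``src`` expects a *global* rank, so a
    # group-local root must be translated first. (Recent PyTorch
    # versions accept a ``group_src`` parameter that does this
    # translation internally, which is what the fix relies on.)
    return TP_GROUPS[group][group_root]


assert global_src(0) == 0  # first group: local root 0 happens to equal global rank 0
assert global_src(1) == 2  # second group: passing the local root 0 as src would hit the wrong rank
```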

Test Plan

Add test_all_reduce_fusion_pass_dp_tp, but keep it disabled for now.

Test Result

The test cases are expected to pass, but I don't have machines to verify them.



@mergify mergify bot added the bug Something isn't working label Feb 13, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug in AllReduceFusionPass that caused crashes in Data Parallel + Tensor Parallel (DP+TP) configurations. The fix involves two main changes:

  1. A new _TPCommBackend class is introduced to work around a bug in flashinfer's broadcast implementation, ensuring that global ranks are used for communication as expected.
  2. The AllReduceFusionPass is updated to explicitly use a TP-scoped communication backend when creating the flashinfer workspace, preventing collisions between different data parallel groups.

The changes are well-implemented and directly solve the described issues. Additionally, a new multi-GPU test case (test_all_reduce_fusion_pass_dp_tp) has been added, which effectively reproduces the problematic DP+TP setup and serves as a valuable regression test. The code is clear, and the fix appears robust. Overall, this is a high-quality contribution that improves the stability of vLLM in complex distributed setups.

@ilmarkov
Contributor

ilmarkov commented Feb 13, 2026

I think there was also an issue with the same opId in Socket initialization on root ranks in 2 TP groups.

@ilmarkov
Contributor

Also there is incorrect use of device id at symm memory init. It will use the same devices for two different TP groups.

@haosdent
Contributor Author

Hi @ilmarkov, thanks a lot for your review.

I think there was also an issue with the same opId in Socket initialization on root ranks in 2 TP groups.

Do you mean the opIds are the same here: https://github.com/flashinfer-ai/flashinfer/blob/292f9be3f5f6d76248d4d3577c167fd178d7952d/flashinfer/comm/mnnvl.py#L804-L810 ?

    def _init_ipc_socket(self) -> IpcSocket:
        if self.rank == 0:
            opId = random.randint(0, 2**64 - 1)
        else:
            opId = None
        opId = self.comm.bcast(opId, root=0)
        return IpcSocket(self.rank, opId)
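The failure mode is easy to reproduce outside flashinfer. Under deterministic seeding, every TP-group root seeds its process RNG with the same value and therefore draws the same opId. A minimal sketch (no flashinfer or vLLM code involved; draw_op_id is hypothetical):

```python
import random


def draw_op_id(seed: int) -> int:
    # Each TP-group root seeds its RNG identically under deterministic
    # seeding, then draws an opId exactly as _init_ipc_socket does.
    rng = random.Random(seed)
    return rng.randint(0, 2**64 - 1)


# Roots of two different TP groups, same seed -> identical opIds,
# so their IPC sockets collide.
assert draw_op_id(42) == draw_op_id(42)
```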

Also there is incorrect use of device id at symm memory init. It will use the same devices for two different TP groups.

For this one, should we fix it in flashinfer instead of vLLM?

@ilmarkov
Contributor

ilmarkov commented Feb 13, 2026

@haosdent

Do you mean the opIds are the same here

Yes, probably we need to do something like:

    opId = hash((random.randint(0, 2**64 - 1), self.comm.group_id()))

For this one, should we fix it in flashinfer instead of vLLM?

I'm afraid so. And until the fix lands in a new FlashInfer release, we will need to disable this fusion for the DP>1, TP>1 case.
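The hash-based suggestion above can be sketched in isolation; the group_id argument here is a hypothetical stand-in for whatever group identifier the comm backend exposes:

```python
import random


def unique_op_id(group_id: int) -> int:
    # Mix a seeded random draw with a group identifier so that two
    # TP-group roots sharing the same RNG state still diverge.
    # (hash() of a tuple of ints is deterministic across processes,
    # unlike hash() of strings.)
    return hash((random.randint(0, 2**64 - 1), group_id)) % 2**64


random.seed(0)
a = unique_op_id(group_id=0)
random.seed(0)  # second TP-group root, identical deterministic seed
b = unique_op_id(group_id=1)
assert a != b  # distinct opIds despite identical RNG state
```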

@haosdent force-pushed the fix/allreduce-fusion-dp-tp-process-group branch from 0eec0da to f57a4f9 on February 13, 2026 14:39
Collaborator

@ProExpertProg ProExpertProg left a comment


Putting a block on this until we fully resolve it. How long do we need to wait for a flashinfer upgrade? Could we monkeypatch IPCSocket in the meantime?

@ilmarkov
Contributor

@ProExpertProg

It's not only an IPCSocket thing. Another problem is in the SymmMemory init:

    symm_mem = SymmDeviceMemory(
        aligned_size,
        tp_size,
        tp_rank,
        torch.device("cuda", tp_rank).index,
        comm_backend,
        enable_multicast=False,
        allocate_signal_pads=False,
    )

There we need to use the local rank derived from the global group. So I think we need to fix it in FlashInfer, and we can only enable this with the next FlashInfer release.
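The device_idx problem reduces to plain arithmetic. A minimal sketch, assuming a single 4-GPU node for dp=2, tp=2 (the layout and correct_device_idx helper are illustrative, not flashinfer's code):

```python
GPUS_PER_NODE = 4  # assumed: dp=2, tp=2 on one 4-GPU node


def correct_device_idx(global_rank: int) -> int:
    # The device a rank should bind to is its node-local rank,
    # not its group-local tp_rank.
    return global_rank % GPUS_PER_NODE


# DP group 1 holds global ranks 2 and 3, but its group-local tp_ranks
# are 0 and 1. Passing tp_rank as device_idx would bind them to GPUs
# 0/1, colliding with DP group 0; the node-local rank gives GPUs 2/3.
for tp_rank, global_rank in [(0, 2), (1, 3)]:
    assert correct_device_idx(global_rank) != tp_rank
    assert correct_device_idx(global_rank) == global_rank
```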

Three issues caused AllReduceFusionPass to fail in DP+TP setups
(e.g. dp=2, tp=2 on 4 GPUs):

1. Missing comm_backend: create_allreduce_fusion_workspace was called
   without a comm_backend, so flashinfer defaulted to the global
   process group instead of the TP group. Independent DP groups
   collided during IPC handle exchange.

2. flashinfer bcast bug: TorchDistBackend.bcast passes a group-local
   root directly as src to broadcast_object_list, which expects a
   global rank. Fixed by using the group_src parameter instead.

3. flashinfer device_idx bug: SymmDeviceMemory uses device_idx=tp_rank
   which maps to the wrong GPU for DP groups > 0. Disable the pass
   for DP+TP until this is fixed upstream in flashinfer.

Additionally, IPC socket opIds collide across TP groups because
random.randint produces identical values under vllm's deterministic
seeding. Fixed by offsetting the opId with the global root rank in
_TPCommBackend.bcast.

Fixes vllm-project#34401
Fixes vllm-project#34458

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent force-pushed the fix/allreduce-fusion-dp-tp-process-group branch from f57a4f9 to 0f01bca on February 13, 2026 16:46
@haosdent
Contributor Author

Putting a block on this until we fully resolve it. How long do we need to wait for a flashinfer upgrade? Could we monkeypatch IPCSocket in the meantime?

Thanks @ProExpertProg and @ilmarkov, I have updated the pull request to add the block. I also monkeypatched _TPCommBackend.bcast so that IpcSocket gets a unique opId.

@haosdent haosdent changed the title [WIP][Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations [Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations Feb 15, 2026
@mergify

mergify bot commented Feb 25, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @haosdent.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@haosdent
Contributor Author

It's not only an IPCSocket thing. Another problem is in the SymmMemory init:

BTW @ilmarkov, is there a known issue link for this? Or has the issue not been created yet?

@haosdent
Contributor Author

Closing; this will continue to be tracked in #35468.


Labels: bug (Something isn't working), needs-rebase

Successfully merging this pull request may close these issues:
  • [Bug]: AR+rms broken for TP=2 DP=2
  • [CI Failure]: Distributed Tests (8 GPUs)(H100)

3 participants