[Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations #34511
haosdent wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request addresses a critical bug in AllReduceFusionPass that caused crashes in Data Parallel + Tensor Parallel (DP+TP) configurations. The fix involves two main changes:
- A new _TPCommBackend class is introduced to work around a bug in flashinfer's broadcast implementation, ensuring that global ranks are used for communication as expected.
- The AllReduceFusionPass is updated to explicitly use a TP-scoped communication backend when creating the flashinfer workspace, preventing collisions between different data parallel groups.
The changes are well-implemented and directly solve the described issues. Additionally, a new multi-GPU test case (test_all_reduce_fusion_pass_dp_tp) has been added, which effectively reproduces the problematic DP+TP setup and serves as a valuable regression test. The code is clear, and the fix appears robust. Overall, this is a high-quality contribution that improves the stability of vLLM in complex distributed setups.
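For intuition, here is a hypothetical single-process sketch (names and layout assumed for illustration, not vLLM's actual code) of why a group-local root must be translated to a global rank in a dp=2, tp=2 layout:

```python
# Hypothetical sketch of the DP+TP rank layout; not vLLM's actual code.
# With dp=2, tp=2 on 4 GPUs, global ranks split into TP groups [0, 1] and [2, 3].

def tp_group_ranks(dp_rank: int, tp_size: int) -> list[int]:
    """Global ranks belonging to the TP group of one DP replica."""
    return [dp_rank * tp_size + t for t in range(tp_size)]

def group_root_to_global(dp_rank: int, tp_size: int, group_root: int = 0) -> int:
    """Translate a group-local root into the global rank a broadcast expects."""
    return tp_group_ranks(dp_rank, tp_size)[group_root]

# DP replica 0's TP root is global rank 0, but replica 1's is global rank 2:
# passing the group-local root (0) as a global rank would hit the wrong process.
print(group_root_to_global(0, 2))  # 0
print(group_root_to_global(1, 2))  # 2
```

This is why a TP-scoped backend has to remap before handing ranks to collectives that expect global numbering.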
I think there was also an issue with the same
Also there is an incorrect use of the device id at symm memory init. It will use the same devices for two different TP groups.
Hi @ilmarkov, thanks a lot for your review.
Do you mean the opIds are the same here: https://github.com/flashinfer-ai/flashinfer/blob/292f9be3f5f6d76248d4d3577c167fd178d7952d/flashinfer/comm/mnnvl.py#L804-L810 ?

```python
def _init_ipc_socket(self) -> IpcSocket:
    if self.rank == 0:
        opId = random.randint(0, 2**64 - 1)
    else:
        opId = None
    opId = self.comm.bcast(opId, root=0)
    return IpcSocket(self.rank, opId)
```

For this one, should we fix it in flashinfer instead of vLLM?
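A minimal, single-process sketch of the collision the snippet above can hit under deterministic seeding, together with the rank-offset workaround this PR describes (the explicit seeding here is assumed for illustration):

```python
import random

def draw_op_id(seed: int = 0) -> int:
    """Each TP group's rank-0 draws an opId; under identical per-process
    seeding, every group draws the same value."""
    rng = random.Random(seed)  # stands in for vLLM seeding every process alike
    return rng.randint(0, 2**64 - 1)

def offset_op_id(seed: int, global_root_rank: int) -> int:
    """Workaround sketch: offset the opId by the group's global root rank
    so distinct TP groups cannot draw colliding socket names."""
    return draw_op_id(seed) + global_root_rank

# Two TP groups (global roots 0 and 2) seeded identically:
print(draw_op_id(0) == draw_op_id(0))            # True: collision
print(offset_op_id(0, 0) == offset_op_id(0, 2))  # False: disambiguated
```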
Yes, probably we need to do something like
I'm afraid so, yes. And before we get the fix into a new version of FlashInfer, we will need to disable this fusion for the DP>1, TP>1 case.
Force-pushed from 0eec0da to f57a4f9
ProExpertProg
left a comment
Putting a block on until we fully resolve this. How long do we need to wait for a flashinfer upgrade? Could we monkeypatch IPCSocket in the meantime?
It's not only an IPCSocket thing. Another problem will be in the SymmMemory init: there we need to use a local_rank from the global group. So I think we need to fix it in flashinfer, and we can only use it with the next flashinfer release.
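To make the SymmMemory problem concrete, a hypothetical illustration (function names assumed, not flashinfer's actual code) of the device-index mismatch:

```python
# Hypothetical illustration of the device_idx bug described above.
# Using the TP-group-local rank as a CUDA device index only works for DP group 0.

def device_idx_buggy(tp_rank: int) -> int:
    """What SymmDeviceMemory effectively does: device_idx = tp_rank."""
    return tp_rank

def device_idx_expected(global_rank: int, gpus_per_node: int) -> int:
    """What DP groups > 0 need: the node-local slot of the global rank."""
    return global_rank % gpus_per_node

# dp=2, tp=2 on one 4-GPU node: global rank 2 is tp_rank 0 in DP group 1.
print(device_idx_buggy(0))        # 0 -- DP group 1 collides with rank 0's GPU
print(device_idx_expected(2, 4))  # 2 -- the GPU global rank 2 should use
```

Both DP groups' tp_rank-0 processes would claim device 0 under the buggy mapping, which is why this cannot be patched from the vLLM side alone.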
Three issues caused AllReduceFusionPass to fail in DP+TP setups (e.g. dp=2, tp=2 on 4 GPUs):

1. Missing comm_backend: create_allreduce_fusion_workspace was called without a comm_backend, so flashinfer defaulted to the global process group instead of the TP group. Independent DP groups collided during IPC handle exchange.
2. flashinfer bcast bug: TorchDistBackend.bcast passes a group-local root directly as src to broadcast_object_list, which expects a global rank. Fixed by using the group_src parameter instead.
3. flashinfer device_idx bug: SymmDeviceMemory uses device_idx=tp_rank, which maps to the wrong GPU for DP groups > 0. Disable the pass for DP+TP until this is fixed upstream in flashinfer.

Additionally, IPC socket opIds collide across TP groups because random.randint produces identical values under vLLM's deterministic seeding. Fixed by offsetting the opId with the global root rank in _TPCommBackend.bcast.

Fixes vllm-project#34401
Fixes vllm-project#34458

Signed-off-by: haosdent <haosdent@gmail.com>
Force-pushed from f57a4f9 to 0f01bca
Thanks @ProExpertProg and @ilmarkov, I have updated the pull request to add the block. Also monkeypatch
This pull request has merge conflicts that must be resolved before it can be merged.
Closing; will continue to track in #35468
Purpose
Fixes #34401
Fixes #34458
Four issues caused AllReduceFusionPass to fail in DP+TP setups (e.g. dp=2, tp=2 on 4 GPUs):
1. Missing comm_backend: create_allreduce_fusion_workspace was called without a comm_backend, so flashinfer defaulted to the global process group instead of the TP group. Independent DP groups collided during IPC handle exchange.
2. flashinfer bcast bug: TorchDistBackend.bcast passes a group-local root directly as src to broadcast_object_list, which expects a global rank. Fixed by using the group_src parameter instead.
3. flashinfer device_idx bug: SymmDeviceMemory uses device_idx=tp_rank, which maps to the wrong GPU for DP groups > 0. Disable the pass for DP+TP until this is fixed upstream in flashinfer.
4. IPC socket opIds collide: IPC socket opIds collide across TP groups because random.randint produces identical values under vLLM's deterministic seeding. Fixed by offsetting the opId with the global root rank in _TPCommBackend.bcast.
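Since issue 3 is only fixable upstream, the pass stays gated off for the combined case until a flashinfer release ships the fix. A hedged sketch of that gate (the function name is assumed, not vLLM's actual API):

```python
def allreduce_fusion_supported(dp_size: int, tp_size: int) -> bool:
    """Sketch of the gating described above (name assumed, not vLLM's code):
    disable flashinfer allreduce fusion whenever both DP > 1 and TP > 1,
    until the device_idx fix lands upstream in flashinfer."""
    return not (dp_size > 1 and tp_size > 1)

print(allreduce_fusion_supported(1, 4))  # True: pure TP is unaffected
print(allreduce_fusion_supported(2, 1))  # True: pure DP is unaffected
print(allreduce_fusion_supported(2, 2))  # False: DP+TP stays disabled
```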
Test Plan
Add test_all_reduce_fusion_pass_dp_tp but keep it disabled.
Test Result
Expect the test cases to pass, but I don't have machines to test on.