[Bugfix] Disable allreduce_rms_fusion when pipeline_parallel_size > 1 #43616
Open
zixi-qi wants to merge 1 commit into
Signed-off-by: zixi-qi <zixi@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>
Code Review
This pull request updates the enable_allreduce_rms_fusion configuration logic in vllm/config/vllm.py to only enable the fusion when pipeline parallelism (PP) is equal to 1. This change prevents potential deadlocks caused by divergent FlashInfer launch configurations during concurrent warmup of multiple tensor parallel subgroups when PP > 1. There are no review comments to address.
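The gate described in this review can be sketched as follows. This is a minimal illustration only: the helper name `should_enable_allreduce_rms_fusion` and its parameters are hypothetical and are not taken from vllm/config/vllm.py, whose actual logic may consider further conditions.

```python
def should_enable_allreduce_rms_fusion(
    fusion_requested: bool, pipeline_parallel_size: int
) -> bool:
    """Hypothetical helper: allow the fusion only when PP == 1.

    Assumption: `fusion_requested` stands in for whatever other
    conditions the real config applies before enabling the fusion.
    """
    return fusion_requested and pipeline_parallel_size == 1


# PP=1 keeps the fusion; PP=2 (the hanging setup in this PR) disables it.
print(should_enable_allreduce_rms_fusion(True, 1))  # True
print(should_enable_allreduce_rms_fusion(True, 2))  # False
```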
@claude review
Summary
Re-gate the FlashInfer allreduce+RMSNorm fusion to pipeline_parallel_size == 1. Verified hang on GB200 with meta-llama/Llama-3.1-70B-Instruct, PP=2 TP=2, FlashInfer 0.6.11.post2; disabling the fusion makes startup complete and inference correct.
Why this isn't a duplicate
The same gate originally landed in #35424, was removed in #41458 (which claimed FlashInfer 0.6.7's #2662 fixed PP/DP), and the auto-generated revert #41503 was closed without merging. Open PR #35960 only adds a regression test — it does not restore the gate. Empirically, PP > 1 + TP > 1 + fusion still deadlocks the FlashInfer fused-AR peer-signal kernel during cudagraph capture / kernel warmup on FlashInfer 0.6.11.post2 because divergent per-rank launch configs are still possible.
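Why divergent per-rank launch configs are fatal for a peer-signal barrier can be illustrated with a toy model. This is a sketch under stated assumptions, not the FlashInfer protocol: it assumes each CTA signals the slot with its own index on every peer and then waits until that slot on its own rank has heard from every rank. The function name `stuck_ctas` is hypothetical.

```python
def stuck_ctas(grid_dims: list[int]) -> list[tuple[int, int]]:
    """Toy model: return (rank, cta) pairs that would spin forever.

    grid_dims[r] is the number of CTAs rank r launches. Assumption:
    CTA c on each rank signals slot c on all ranks, then waits until
    slot c on its own rank has received one signal per rank.
    """
    world_size = len(grid_dims)
    max_ctas = max(grid_dims)
    # arrivals[rank][slot] counts the signals received in that slot.
    arrivals = [[0] * max_ctas for _ in range(world_size)]
    for num_ctas in grid_dims:
        for cta in range(num_ctas):
            for dst in range(world_size):
                arrivals[dst][cta] += 1
    # A CTA is stuck if its slot never hears from every rank.
    return [
        (rank, cta)
        for rank, num_ctas in enumerate(grid_dims)
        for cta in range(num_ctas)
        if arrivals[rank][cta] < world_size
    ]


print(stuck_ctas([4, 4]))  # [] : identical launch configs, barrier completes
print(stuck_ctas([4, 3]))  # [(0, 3)] : rank 0's extra CTA never gets a peer signal
```

In this model a single-CTA mismatch is enough to leave one rank spinning while its peer has already finished.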
Root cause (short)
The fused op synchronizes TP peers via a GPU-side peer-signal spin-wait (trtllm_allreduce_fusion.cuh:902-916) that assumes byte-identical gridDim.x across peers. With PP > 1, the two TP subgroups warm up concurrently (no PP send/recv in _dummy_run), and the resulting cross-PP contention is enough to flip near-tied per-rank launch-config decisions inside a single TP subgroup, leading to mismatched CTA counts and an infinite spin on barrier flags.
Test commands run
PP=2 TP=2 vllm serve meta-llama/Llama-3.1-70B-Instruct ... with fusion on (current main): hangs at Capturing CUDA graphs (decode, FULL) 0/51 with one GPU at 100% util and the rest at 0%.
With the fusion disabled: startup completes and curl /v1/completions produces coherent output ("Once upon a time" → ", in a small village nestled in the rolling hills of Tuscany,...").
Note
AI assistance was used to investigate the spin-wait mechanism and draft this PR; the change itself is a one-line gate restoration matching the previously-merged #35424.