[Bugfix] Disable allreduce_rms_fusion when pipeline_parallel_size > 1 by zixi-qi · Pull Request #43616 · vllm-project/vllm

zixi-qi · 2026-05-25T19:06:58Z

Summary

Re-gate the FlashInfer allreduce+RMSNorm fusion to pipeline_parallel_size == 1. With the fusion enabled, startup hangs on GB200 with meta-llama/Llama-3.1-70B-Instruct at PP=2 TP=2 on FlashInfer 0.6.11.post2; disabling the fusion lets startup complete and inference produce correct output.
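The gate described above can be sketched roughly as follows. This is an illustration only, not the code in vllm/config/vllm.py: the `ParallelSizes` dataclass and `allow_allreduce_rms_fusion` helper are hypothetical names, and the `tensor_parallel_size > 1` condition is an assumption (the real config logic may check other conditions that are not shown here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelSizes:
    """Hypothetical stand-in for the parallelism settings the gate reads."""
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1


def allow_allreduce_rms_fusion(sizes: ParallelSizes) -> bool:
    """Return True only for the configuration this PR keeps the fusion on for.

    Assumption: the fusion is only useful when there is an allreduce to fuse
    (TP > 1). The PP == 1 condition is the gate this PR restores.
    """
    has_tp_allreduce = sizes.tensor_parallel_size > 1
    single_pipeline_stage = sizes.pipeline_parallel_size == 1
    return has_tp_allreduce and single_pipeline_stage
```

Under these assumptions, PP=1 TP=4 keeps the fusion on, while PP=2 TP=2 (the hanging configuration) turns it off.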

Why this isn't a duplicate

The same gate originally landed in #35424 and was removed in #41458, which claimed that FlashInfer 0.6.7's #2662 fixed PP/DP. The auto-generated revert #41503 was closed without merging, and open PR #35960 only adds a regression test; it does not restore the gate. Empirically, on FlashInfer 0.6.11.post2, PP > 1 + TP > 1 with the fusion enabled still deadlocks the FlashInfer fused-AR peer-signal kernel during cudagraph capture / kernel warmup, because divergent per-rank launch configs are still possible.

Root cause (short)

The fused op synchronizes TP peers via a GPU-side peer-signal spin-wait (trtllm_allreduce_fusion.cuh:902-916) that assumes byte-identical gridDim.x across peers. With PP > 1, the two TP subgroups warm up concurrently (there is no PP send/recv in _dummy_run), and the resulting cross-PP contention is enough to flip near-tied per-rank launch-config decisions inside a single TP subgroup. The peers then launch with mismatched CTA counts, and the kernel spins forever on the barrier flags.
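To make the failure mode concrete, here is a toy host-side model of a per-block peer barrier. It is not the FlashInfer kernel: the function name and the flag layout (block `b` on each rank waits for block `b` on every peer) are assumptions chosen for illustration, and the real layout in trtllm_allreduce_fusion.cuh may differ.

```python
def find_stuck_blocks(grid_sizes: list[int]) -> list[tuple[int, int]]:
    """Return (rank, block) pairs that would wait forever in the toy barrier.

    grid_sizes[r] is the number of CTAs rank r launched. In this model each
    launched block signals once, then waits for the same-index block on every
    peer. A block is stuck if some peer never launched that block index.
    """
    # Blocks that actually exist on each rank and therefore raise their flag.
    signalled = [set(range(grid)) for grid in grid_sizes]

    stuck = []
    for rank, grid in enumerate(grid_sizes):
        for block in range(grid):
            # The wait completes only if every peer has a block at this index.
            if any(block not in peer_blocks for peer_blocks in signalled):
                stuck.append((rank, block))
    return stuck
```

With identical launch configs, `find_stuck_blocks([4, 4])` reports no stuck blocks. With a one-CTA mismatch, `find_stuck_blocks([4, 3])` reports that block 3 on rank 0 never sees its peer flag, which in a real spin-wait would mean an unbounded loop on that GPU.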

Test commands run

Env: GB200, FlashInfer 0.6.11.post2.
Repro: PP=2 TP=2 vllm serve meta-llama/Llama-3.1-70B-Instruct ... with fusion on (current main): hangs at Capturing CUDA graphs (decode, FULL) 0/51 with one GPU at 100% util and the rest at 0%.
With this gate: startup completes; curl /v1/completions produces coherent output ("Once upon a time" → ", in a small village nestled in the rolling hills of Tuscany,...").
PP=1 TP=4 regression: still works correctly (fusion still on for the supported path).
GSM8K accuracy check:

local-completions ({'model': 'meta-llama/Llama-3.1-70B-Instruct', 'base_url': 'http://127.0.0.1:8000/v1/completions', 'num_concurrent': 32, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9272|±  |0.0072|
|     |       |strict-match    |     5|exact_match|↑  |0.8688|±  |0.0093|
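The curl /v1/completions smoke test listed under the test commands could be scripted along these lines. This is a sketch, assuming the server listens at the base URL shown in the lm-eval output (http://127.0.0.1:8000/v1/completions) and accepts the standard OpenAI-style completions payload; the helper names and the `max_tokens` value are illustrative choices, not taken from the PR.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_completion_text(request: urllib.request.Request,
                          timeout: float = 60.0) -> str:
    """Send the request and return the first choice's text.

    Requires a running server; a hang during startup would surface here as a
    connection error or timeout rather than a response.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

For example, `fetch_completion_text(build_completion_request("http://127.0.0.1:8000/v1/completions", "meta-llama/Llama-3.1-70B-Instruct", "Once upon a time"))` would return the continuation once the server has finished startup.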

Note

AI assistance was used to investigate the spin-wait mechanism and draft this PR; the change itself is a one-line gate restoration matching the previously-merged #35424.

Signed-off-by: zixi-qi <zixi@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the enable_allreduce_rms_fusion configuration logic in vllm/config/vllm.py to only enable the fusion when pipeline parallelism (PP) is equal to 1. This change prevents potential deadlocks caused by divergent FlashInfer launch configurations during concurrent warmup of multiple tensor parallel subgroups when PP > 1. There are no review comments to address.

zixi-qi · 2026-05-25T19:19:29Z

@claude review

[Bugfix] Disable allreduce_rms_fusion when pipeline_parallel_size > 1

b2340be

Signed-off-by: zixi-qi <zixi@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>

gemini-code-assist Bot reviewed May 25, 2026

mergify Bot added the bug (Something isn't working) label May 25, 2026

zixi-qi marked this pull request as ready for review May 25, 2026 19:14

zixi-qi requested review from ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners May 25, 2026 19:14

zixi-qi mentioned this pull request May 25, 2026

Re-enable allreduce rms fusion for DP / PP #41458

Merged

zixi-qi added the ready (ONLY add when PR is ready to merge/full CI is needed) label May 25, 2026

zixi-qi wants to merge 1 commit into vllm-project:main from zixi-qi:bugfix-disable-allreduce-rms-fusion-pp
