
[Bugfix] Disable allreduce_rms_fusion when pipeline_parallel_size > 1 #43616

Open
zixi-qi wants to merge 1 commit into vllm-project:main from zixi-qi:bugfix-disable-allreduce-rms-fusion-pp

zixi-qi (Collaborator) commented May 25, 2026

Summary

Re-gate the FlashInfer allreduce+RMSNorm fusion to pipeline_parallel_size == 1. With the fusion enabled, startup hangs on GB200 with meta-llama/Llama-3.1-70B-Instruct, PP=2 TP=2, FlashInfer 0.6.11.post2; disabling the fusion lets startup complete and inference produce correct output.
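
For readers skimming the change, the gate described above amounts to something like the following minimal sketch. The function name and the `other_preconditions_met` flag are hypothetical and stand in for whatever checks the real config code in vllm/config/vllm.py already performs; only the `pipeline_parallel_size == 1` condition comes from this PR.

```python
def allow_allreduce_rms_fusion(pipeline_parallel_size: int,
                               other_preconditions_met: bool = True) -> bool:
    """Hypothetical helper, not the actual vLLM code.

    Returns True only when the fusion's existing preconditions hold AND
    pipeline parallelism is disabled (pipeline_parallel_size == 1).
    """
    if pipeline_parallel_size < 1:
        raise ValueError("pipeline_parallel_size must be >= 1")
    # The restored gate: fusion stays on only for the PP == 1 path.
    return other_preconditions_met and pipeline_parallel_size == 1


# PP=1 TP=4 keeps the fusion; PP=2 TP=2 falls back to the unfused path.
print(allow_allreduce_rms_fusion(pipeline_parallel_size=1))
print(allow_allreduce_rms_fusion(pipeline_parallel_size=2))
```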

Why this isn't a duplicate

The same gate originally landed in #35424 and was removed in #41458, which claimed that FlashInfer 0.6.7's #2662 fixed PP/DP. The auto-generated revert #41503 was closed without merging, and open PR #35960 only adds a regression test; it does not restore the gate. Empirically, PP > 1 + TP > 1 + fusion still deadlocks the FlashInfer fused-AR peer-signal kernel during cudagraph capture / kernel warmup on FlashInfer 0.6.11.post2, because divergent per-rank launch configs are still possible.

Root cause (short)

The fused op synchronizes TP peers via a GPU-side peer-signal spin-wait (trtllm_allreduce_fusion.cuh:902-916) that assumes byte-identical gridDim.x across peers. With PP > 1, the two TP subgroups warm up concurrently (_dummy_run performs no PP send/recv), and the resulting cross-PP contention is enough to flip near-tied per-rank launch-config decisions inside a single TP subgroup. The peers then launch mismatched CTA counts, and the extra CTAs spin forever on barrier flags that no peer will ever set.
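
To make the failure mode concrete, here is a toy host-side model of a per-block peer-signal barrier. It is a simplification for illustration only, not the FlashInfer kernel: it assumes that CTA `b` on each rank waits for a signal from CTA `b` on every peer, and the function name `stuck_ctas` is hypothetical.

```python
def stuck_ctas(grid_dims: list[int]) -> dict[int, list[int]]:
    """Toy model: which CTAs would spin forever for the given per-rank gridDim.x.

    Assumption (not taken from the kernel source): CTA b on a rank is released
    only after every rank in the TP group has launched a CTA with index b.
    """
    if not grid_dims or any(g < 1 for g in grid_dims):
        raise ValueError("need at least one rank, each with gridDim.x >= 1")
    world_size = len(grid_dims)
    # arrivals[b] = number of ranks that actually launched a CTA with index b.
    arrivals = [sum(1 for g in grid_dims if b < g)
                for b in range(max(grid_dims))]
    # A CTA is stuck if some peer never launched its counterpart.
    return {rank: [b for b in range(g) if arrivals[b] < world_size]
            for rank, g in enumerate(grid_dims)}


print(stuck_ctas([4, 4]))  # matching launch configs: nothing is stuck
print(stuck_ctas([4, 3]))  # rank 0's CTA 3 has no counterpart on rank 1
```

In this model a single-CTA disagreement is enough: every CTA with a counterpart completes, while the unmatched one never observes its peer's signal.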

Test commands run

  • Env: GB200, FlashInfer 0.6.11.post2.
  • Repro: PP=2 TP=2 vllm serve meta-llama/Llama-3.1-70B-Instruct ... with fusion on (current main): hangs at Capturing CUDA graphs (decode, FULL) 0/51 with one GPU at 100% util and the rest at 0%.
  • With this gate: startup completes; curl /v1/completions produces coherent output ("Once upon a time" → ", in a small village nestled in the rolling hills of Tuscany,...").
  • PP=1 TP=4 regression: still works correctly (fusion still on for the supported path).
  • Verified GSM8K
local-completions ({'model': 'meta-llama/Llama-3.1-70B-Instruct', 'base_url': 'http://127.0.0.1:8000/v1/completions', 'num_concurrent': 32, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9272|±  |0.0072|
|     |       |strict-match    |     5|exact_match|↑  |0.8688|±  |0.0093|
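
The curl smoke check listed above could be scripted roughly as follows. This is a sketch under stated assumptions: the URL and model name are the ones shown in the eval config, while `max_tokens`, `temperature`, and the timeout are illustrative values rather than the ones used for the run reported here. The timeout matters because a hang, not an error, is the failure being checked for.

```python
import json
import urllib.request

# Taken from the eval config above.
BASE_URL = "http://127.0.0.1:8000/v1/completions"
MODEL = "meta-llama/Llama-3.1-70B-Instruct"


def build_completion_request(prompt: str,
                             max_tokens: int = 32,
                             base_url: str = BASE_URL,
                             model: str = MODEL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,  # illustrative value (assumption)
        "temperature": 0.0,        # illustrative value (assumption)
    }
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the request; raises on timeout instead of waiting on a hung server."""
    req = build_completion_request(prompt)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With a server running, `complete("Once upon a time")` returns the generated continuation as a string.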

Note

AI assistance was used to investigate the spin-wait mechanism and draft this PR; the change itself is a one-line gate restoration matching the previously-merged #35424.

Signed-off-by: zixi-qi <zixi@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request updates the enable_allreduce_rms_fusion configuration logic in vllm/config/vllm.py to only enable the fusion when pipeline parallelism (PP) is equal to 1. This change prevents potential deadlocks caused by divergent FlashInfer launch configurations during concurrent warmup of multiple tensor parallel subgroups when PP > 1. There are no review comments to address.

mergify Bot added the bug label (Something isn't working) May 25, 2026
zixi-qi marked this pull request as ready for review May 25, 2026 19:14

zixi-qi (Collaborator, Author) commented May 25, 2026

@claude review

zixi-qi added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 25, 2026

Labels: bug, ready
