fix hang in allreduce comms in SGL by b8zhong · Pull Request #3247 · flashinfer-ai/flashinfer

b8zhong · 2026-05-06T18:16:04Z

📌 Description

Caused by #2955
This currently causes a bug in SGLang. Because the `group=` parameter is missing, in a scenario with 4 devices and world size = 2 the rendezvous expects all 4 to respond, which causes a hang during warmup.
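
As a rough illustration of the hang described above, the toy model below uses a `threading.Barrier` to stand in for the rendezvous. It is only a sketch under stated assumptions: `rendezvous`, `expected`, and `arriving` are hypothetical names, and nothing here is FlashInfer or PyTorch code. A barrier sized for all 4 devices never completes when only the 2 ranks of the actual group arrive, so the timeout stands in for the hang.

```python
import threading


def rendezvous(expected: int, arriving: int, timeout: float = 0.2) -> bool:
    """Toy rendezvous: True if `arriving` ranks pass a barrier sized for `expected`."""
    barrier = threading.Barrier(expected)
    results = []

    def participant() -> None:
        try:
            # Blocks until `expected` participants have arrived, or times out.
            barrier.wait(timeout=timeout)
            results.append(True)
        except threading.BrokenBarrierError:
            # In the real scenario there is no timeout, so this would be a hang.
            results.append(False)

    threads = [threading.Thread(target=participant) for _ in range(arriving)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)


# Rendezvous scoped to the 2-rank group: completes.
print(rendezvous(expected=2, arriving=2))
# Rendezvous sized for all 4 devices while only 2 ranks take part: never completes.
print(rendezvous(expected=4, arriving=2))
```

Sizing the rendezvous from the caller's group rather than from every visible device is the behaviour the `group=` parameter is meant to restore.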

🔍 Related Issues

#2955

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

If you are unsure about how to set up `pre-commit`, see the pre-commit documentation (https://pre-commit.com/).

🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes

sgl-project/sglang#24452

Summary by CodeRabbit

Release Notes

New Features
- Added optional process group parameter for AllReduce fusion on TRTLLM backends, enabling users to configure symmetric memory rendezvous behavior.
Documentation
- Updated documentation to describe the new parameter and its default behavior.

coderabbitai · 2026-05-06T18:16:24Z

📝 Walkthrough

The PR adds optional ProcessGroup support to TRTLLM-based AllReduce fusion by introducing a group parameter to the workspace initialization and factory function, enabling symmetric memory rendezvous through the underlying TRTLLM backend.

Changes

AllReduce ProcessGroup Parameter Wiring

Layer / File(s)	Summary
Import & Type Support `flashinfer/comm/allreduce.py` (line 59)	ProcessGroup is imported from torch.distributed to type the new parameter.
Workspace Class Signature `flashinfer/comm/allreduce.py` (lines 109, 121–122)	`TRTLLMAllReduceFusionWorkspace.__init__` adds optional `group: Optional[ProcessGroup] = None` parameter with docstring describing its purpose.
Backend Integration `flashinfer/comm/allreduce.py` (line 131)	The `group` parameter is forwarded to `trtllm_create_ipc_workspace_for_all_reduce_fusion` call.
Factory Function Signature `flashinfer/comm/allreduce.py` (lines 296, 331)	`create_allreduce_fusion_workspace` adds optional `group` parameter with docstring noting TRTLLM backend support.
Factory Wiring `flashinfer/comm/allreduce.py` (lines 417–421)	When backend is "trtllm", the factory passes `group=group` to the workspace constructor.
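
The wiring summarized in the table above can be sketched as follows. This is a minimal stand-in, not the code in `flashinfer/comm/allreduce.py`: `backend_create_workspace`, `WorkspaceSketch`, `create_workspace_sketch`, and the `"WORLD"` placeholder are all hypothetical names chosen for illustration. It only shows the pattern of an optional `group` flowing from the factory, through the workspace constructor, down to the backend call, with `None` falling back to the default group.

```python
from typing import Any, Optional

WORLD = "WORLD"  # placeholder for the default process group (assumption)


def backend_create_workspace(tp_size: int, group: Optional[Any] = None) -> dict:
    """Hypothetical stand-in for the backend call that performs the rendezvous."""
    rendezvous_group = WORLD if group is None else group
    return {"tp_size": tp_size, "rendezvous_group": rendezvous_group}


class WorkspaceSketch:
    """Hypothetical workspace: accepts an optional group and forwards it unchanged."""

    def __init__(self, tp_size: int, group: Optional[Any] = None) -> None:
        self.handle = backend_create_workspace(tp_size, group=group)


def create_workspace_sketch(
    backend: str, tp_size: int, group: Optional[Any] = None
) -> WorkspaceSketch:
    """Hypothetical factory: only the "trtllm" path is modeled here."""
    if backend == "trtllm":
        return WorkspaceSketch(tp_size, group=group)
    raise NotImplementedError(f"backend {backend!r} is not modeled in this sketch")


default_ws = create_workspace_sketch("trtllm", tp_size=2)
scoped_ws = create_workspace_sketch("trtllm", tp_size=2, group="tp_group")
print(default_ws.handle["rendezvous_group"], scoped_ws.handle["rendezvous_group"])
```

Keeping the parameter optional with a `None` default is what lets existing callers continue to work without changes.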

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A ProcessGroup joins the dance,
Optional, for symmetric chance,
Through factory and workspace it flows,
No breaking changes, as each one knows!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main issue being fixed: addressing a hang in allreduce communication specific to SGL.
Description check	✅ Passed	The description provides context about the regression, the specific issue scenario, related PR references, and confirms pre-commit checks and testing were completed.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

gemini-code-assist

Code Review

This pull request adds a group parameter to the AllReduceFusionWorkspace and its factory function to support symmetric memory rendezvous, primarily for the TRTLLM backend. The review feedback recommends extending this group support to the MNNVL backend to ensure consistency and reliability across different communication backends, which would also require updating the associated docstrings.

flashinfer/comm/allreduce.py (331)

If support for the group parameter is extended to the MNNVL backend for consistency, the "(trtllm backend only)" restriction in the docstring should be removed.

        group: Process group for symmetric memory rendezvous. Defaults to torch.distributed.group.WORLD.

flashinfer/comm/allreduce.py (420-423)

The group parameter is currently ignored when the mnnvl backend is selected. To ensure consistency across backends and prevent potential hangs in MNNVL (which also performs symmetric memory rendezvous), the group should be used to initialize the comm_backend if it's not explicitly provided. Note that TorchDistBackend is imported locally to avoid potential circular dependencies.

            group=group,
        )

    elif actual_backend == "mnnvl":
        if comm_backend is None and group is not None:
            from .mnnvl import TorchDistBackend
            comm_backend = TorchDistBackend(group=group)

coderabbitai

🧹 Nitpick comments (1)

flashinfer/comm/allreduce.py (1)
286-296: ⚡ Quick win

group is silently ignored for the mnnvl backend.

group is accepted at the public API level but only forwarded down the trtllm path (Line 420). For actual_backend == "mnnvl" and especially for backend="auto" where the heuristic prefers mnnvl when available (Line 274-275), a caller who explicitly passed a group to work around the SGLang hang will see it silently dropped. The docstring at Line 331 mentions “trtllm backend only”, but a runtime warning would make this much harder to misuse — particularly relevant since the motivation for this PR is exactly the kind of subtle hang that arises when group semantics are wrong.

Consider warning when the user explicitly passes a non-default group but the selected backend won't honor it.
♻️ Suggested change
     elif actual_backend == "mnnvl":
+        if group is not None:
+            logger.warning(
+                "`group` is only honored by the trtllm backend; ignoring for mnnvl backend."
+            )
         mapping = Mapping(
Also applies to: 411-451
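
A self-contained version of the suggested check might look like the sketch below. The helper name `group_is_honored` and the logger name are hypothetical, and the real factory resolves `actual_backend` itself. The sketch only assumes, as the review states, that the trtllm path is the one that forwards `group`.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("allreduce_sketch")  # hypothetical logger name


def group_is_honored(actual_backend: str, group: Optional[Any]) -> bool:
    """Return False (and warn) when a caller-supplied group would be dropped."""
    if group is None:
        return True  # nothing was supplied, so nothing is silently ignored
    if actual_backend == "trtllm":
        return True  # the trtllm path forwards the group
    logger.warning(
        "`group` is only honored by the trtllm backend; ignoring for %s backend.",
        actual_backend,
    )
    return False


# A non-default group combined with a resolved "mnnvl" backend triggers the warning.
print(group_is_honored("mnnvl", object()))
```

Running the check after backend resolution, rather than on the requested `backend` string, covers the `backend="auto"` case as well.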
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flashinfer/comm/allreduce.py` around lines 286 - 296, The function
create_allreduce_fusion_workspace currently accepts a group parameter but
silently ignores it when actual_backend == "mnnvl" (and when backend="auto" that
chooses mnnvl), which can break callers relying on group semantics; update
create_allreduce_fusion_workspace to detect when a non-default/non-None group is
passed and the chosen backend is not "trtllm" (e.g., actual_backend == "mnnvl"),
and emit a runtime warning (or logger.warn) informing the user that the provided
group will be ignored by the selected backend and that group semantics are only
honored for the trtllm path; reference the symbols actual_backend, backend,
group, and the trtllm/mnnvl branches to locate where to add the check and
warning (ensure the warning runs both when backend=="mnnvl" and when
backend=="auto" resolves to mnnvl).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b6ab1b15-6565-4798-ad55-f2c5040bb6ed

📥 Commits

Reviewing files that changed from the base of the PR and between d885b71 and cac7578.

📒 Files selected for processing (1)

flashinfer/comm/allreduce.py

aleozlx · 2026-05-07T00:11:20Z

/bot run

flashinfer-bot · 2026-05-07T00:12:20Z

GitLab MR !641 has been created, and the CI pipeline #50488199 is currently running. I'll report back once the pipeline job completes.

aleozlx · 2026-05-07T06:20:22Z

Seeing a weird error on the 5090 that seems unrelated, but I need to take another look tomorrow.

============================= test session starts ==============================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0
rootdir: /tmp/flashinfer
configfile: pytest.ini
collected 23 items
tests/comm/test_trtllm_alltoall.py ....................F..               [100%]
=================================== FAILURES ===================================
E   assert 0 == ((7 * 128) + 0)
     +  where 0 = int(tensor(0, dtype=torch.int32))
/tmp/flashinfer/tests/comm/test_trtllm_alltoall.py:846: assert 0 == ((7 * 128) + 0)
=============================== warnings summary ===============================
<frozen importlib._bootstrap_external>:1301
  <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301
  <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
- generated xml file: /tmp/junit/tests_comm_test_trtllm_alltoall.py.2385559793.xml -
=========================== short test summary info ============================
FAILED tests/comm/test_trtllm_alltoall.py::test_moe_alltoall_prepare[3-8-128-144-8-1]
================== 1 failed, 22 passed, 2 warnings in 31.32s ===================
❌ FAILED: tests/comm/test_trtllm_alltoall.py

cac7578

b8zhong requested review from aleozlx, bkryu, jimmyzho, nv-yunzheq and yzh119 as code owners May 6, 2026 18:16

flashinfer-bot added the op: comm label May 6, 2026

gemini-code-assist Bot reviewed May 6, 2026

coderabbitai Bot reviewed May 6, 2026

aleozlx approved these changes May 7, 2026

aleozlx added the v0.6.10 (release blocker label for 0.6.10), v0.6.11 (release blocker label for 0.6.11), and run-ci labels May 7, 2026

aleozlx merged commit 42bf79e into flashinfer-ai:main May 7, 2026
69 of 91 checks passed

coderabbitai Bot mentioned this pull request May 10, 2026

comm: avoid torch symmetric memory by default for TRTLLM allreduce workspaces #3277

Open

5 tasks

aleozlx merged 1 commit into flashinfer-ai:main from bzhng-development:brayden/fix-0610-cp

3 participants
