
fix hang in allreduce comms in SGL #3247

Merged
aleozlx merged 1 commit into
flashinfer-ai:main from
bzhng-development:brayden/fix-0610-cp
May 7, 2026

Conversation

@b8zhong
Contributor

@b8zhong b8zhong commented May 6, 2026

📌 Description

This is a regression caused by #2955.
It currently causes a bug in SGLang. Because the group= parameter is missing, in a scenario with 4 devices and world size = 2 the rendezvous expects all 4 to respond, which causes a hang during warmup.
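The hang described above can be illustrated with a small standard-library sketch. It does not use FlashInfer or torch.distributed: `threading.Barrier` stands in for the symmetric memory rendezvous, and the `rendezvous` helper and its parameters are hypothetical names. It assumes the default group spans all 4 processes while only 2 ranks actually enter the rendezvous; a timeout is added so the example terminates instead of hanging.

```python
import threading


def rendezvous(expected: int, present: int, timeout: float = 0.5) -> bool:
    """Return True if `present` ranks complete a rendezvous sized for `expected`.

    Hypothetical stand-in: a Barrier models the rendezvous, threads model ranks.
    """
    barrier = threading.Barrier(expected)
    results = []

    def rank() -> None:
        try:
            barrier.wait(timeout=timeout)
            results.append(True)
        except threading.BrokenBarrierError:
            # Without a timeout, this wait would simply block forever.
            results.append(False)

    threads = [threading.Thread(target=rank) for _ in range(present)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)


if __name__ == "__main__":
    # Rendezvous sized for the 2-rank subgroup: both ranks arrive and it completes.
    print(rendezvous(expected=2, present=2))
    # Rendezvous sized for all 4 devices: only 2 ranks arrive, so it never completes.
    print(rendezvous(expected=4, present=2))
```

Passing the correct group corresponds to sizing the barrier for the ranks that will actually participate.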

🔍 Related Issues

#2955

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

sgl-project/sglang#24452

Summary by CodeRabbit

Release Notes

  • New Features

    • Added optional process group parameter for AllReduce fusion on TRTLLM backends, enabling users to configure symmetric memory rendezvous behavior.
  • Documentation

    • Updated documentation to describe the new parameter and its default behavior.

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

📝 Walkthrough


The PR adds optional ProcessGroup support to TRTLLM-based AllReduce fusion by introducing a group parameter to the workspace initialization and factory function, enabling symmetric memory rendezvous through the underlying TRTLLM backend.

Changes

AllReduce ProcessGroup Parameter Wiring

| Layer / File(s) | Summary |
| --- | --- |
| Import & Type Support: flashinfer/comm/allreduce.py (line 59) | ProcessGroup is imported from torch.distributed to type the new parameter. |
| Workspace Class Signature: flashinfer/comm/allreduce.py (lines 109, 121–122) | TRTLLMAllReduceFusionWorkspace.__init__ adds optional group: Optional[ProcessGroup] = None parameter with docstring describing its purpose. |
| Backend Integration: flashinfer/comm/allreduce.py (line 131) | The group parameter is forwarded to trtllm_create_ipc_workspace_for_all_reduce_fusion call. |
| Factory Function Signature: flashinfer/comm/allreduce.py (lines 296, 331) | create_allreduce_fusion_workspace adds optional group parameter with docstring noting TRTLLM backend support. |
| Factory Wiring: flashinfer/comm/allreduce.py (lines 417–421) | When backend is "trtllm", the factory passes group=group to the workspace constructor. |
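The wiring summarized above can be sketched roughly as follows. This is not the FlashInfer source: `FakeProcessGroup`, `WORLD`, `Workspace`, and `create_workspace` are hypothetical stand-ins for `ProcessGroup`, the default world group, `TRTLLMAllReduceFusionWorkspace`, and `create_allreduce_fusion_workspace`, so that the example runs without torch. The fallback to the world group when `group` is None is an assumption based on the default mentioned in this conversation.

```python
from typing import Iterable, Optional


class FakeProcessGroup:
    """Hypothetical stand-in for torch.distributed.ProcessGroup."""

    def __init__(self, ranks: Iterable[int]) -> None:
        self.ranks = list(ranks)


# Stand-in for the default (world) group covering all 4 devices.
WORLD = FakeProcessGroup(range(4))


class Workspace:
    """Hypothetical stand-in for the TRTLLM workspace class."""

    def __init__(self, group: Optional[FakeProcessGroup] = None) -> None:
        # Assumption: None falls back to the world group.
        self.group = group if group is not None else WORLD


def create_workspace(
    backend: str = "trtllm", group: Optional[FakeProcessGroup] = None
) -> Workspace:
    """Factory that forwards `group` to the workspace constructor."""
    if backend == "trtllm":
        return Workspace(group=group)
    raise NotImplementedError(f"backend {backend!r} is not part of this sketch")


if __name__ == "__main__":
    tp_group = FakeProcessGroup([0, 1])
    print(create_workspace(group=tp_group).group.ranks)
    print(create_workspace().group.ranks)
```

Keeping the parameter optional with a None default means existing callers that never pass a group see the same behavior as before.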

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A ProcessGroup joins the dance,
Optional, for symmetric chance,
Through factory and workspace it flows,
No breaking changes, as each one knows!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main issue being fixed: addressing a hang in allreduce communication specific to SGL. |
| Description check | ✅ Passed | The description provides context about the regression, the specific issue scenario, related PR references, and confirms pre-commit checks and testing were completed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds a group parameter to the AllReduceFusionWorkspace and its factory function to support symmetric memory rendezvous, primarily for the TRTLLM backend. The review feedback recommends extending this group support to the MNNVL backend to ensure consistency and reliability across different communication backends, which would also require updating the associated docstrings.

I am having trouble creating individual review comments. Click here to see my feedback.

flashinfer/comm/allreduce.py (331)

medium

If support for the group parameter is extended to the MNNVL backend for consistency, the "(trtllm backend only)" restriction in the docstring should be removed.

        group: Process group for symmetric memory rendezvous. Defaults to torch.distributed.group.WORLD.

flashinfer/comm/allreduce.py (420-423)

medium

The group parameter is currently ignored when the mnnvl backend is selected. To ensure consistency across backends and prevent potential hangs in MNNVL (which also performs symmetric memory rendezvous), the group should be used to initialize the comm_backend if it's not explicitly provided. Note that TorchDistBackend is imported locally to avoid potential circular dependencies.

            group=group,
        )

    elif actual_backend == "mnnvl":
        if comm_backend is None and group is not None:
            from .mnnvl import TorchDistBackend
            comm_backend = TorchDistBackend(group=group)

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
flashinfer/comm/allreduce.py (1)

286-296: ⚡ Quick win

group is silently ignored for the mnnvl backend.

group is accepted at the public API level but only forwarded down the trtllm path (Line 420). For actual_backend == "mnnvl" and especially for backend="auto" where the heuristic prefers mnnvl when available (Line 274-275), a caller who explicitly passed a group to work around the SGLang hang will see it silently dropped. The docstring at Line 331 mentions “trtllm backend only”, but a runtime warning would make this much harder to misuse — particularly relevant since the motivation for this PR is exactly the kind of subtle hang that arises when group semantics are wrong.

Consider warning when the user explicitly passes a non-default group but the selected backend won't honor it.

♻️ Suggested change
     elif actual_backend == "mnnvl":
+        if group is not None:
+            logger.warning(
+                "`group` is only honored by the trtllm backend; ignoring for mnnvl backend."
+            )
         mapping = Mapping(

Also applies to: 411-451
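A standalone sketch of the warning idea, runnable without FlashInfer, might look like the following. The `select_group` helper and the logger name are hypothetical and exist only for illustration; the real factory would perform this check inline in its backend branches.

```python
import logging
from typing import Optional

# Hypothetical logger name used only for this sketch.
logger = logging.getLogger("allreduce_sketch")


def select_group(actual_backend: str, group: Optional[object]) -> Optional[object]:
    """Return the group the backend will actually use, warning if it is dropped."""
    if actual_backend == "trtllm":
        # The trtllm path forwards the group unchanged.
        return group
    if group is not None:
        # The caller passed a group explicitly, but this backend will not honor it.
        logger.warning(
            "`group` is only honored by the trtllm backend; ignoring for %s backend.",
            actual_backend,
        )
    return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    select_group("mnnvl", object())  # emits the warning
    select_group("mnnvl", None)      # silent: nothing was dropped
```

Because the check runs on the resolved backend name, it would also cover the case where backend="auto" resolves to mnnvl.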

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flashinfer/comm/allreduce.py` around lines 286 - 296, The function
create_allreduce_fusion_workspace currently accepts a group parameter but
silently ignores it when actual_backend == "mnnvl" (and when backend="auto" that
chooses mnnvl), which can break callers relying on group semantics; update
create_allreduce_fusion_workspace to detect when a non-default/non-None group is
passed and the chosen backend is not "trtllm" (e.g., actual_backend == "mnnvl"),
and emit a runtime warning (or logger.warn) informing the user that the provided
group will be ignored by the selected backend and that group semantics are only
honored for the trtllm path; reference the symbols actual_backend, backend,
group, and the trtllm/mnnvl branches to locate where to add the check and
warning (ensure the warning runs both when backend=="mnnvl" and when
backend=="auto" resolves to mnnvl).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b6ab1b15-6565-4798-ad55-f2c5040bb6ed

📥 Commits

Reviewing files that changed from the base of the PR and between d885b71 and cac7578.

📒 Files selected for processing (1)
  • flashinfer/comm/allreduce.py

@aleozlx added the v0.6.10 (release blocker label for 0.6.10), v0.6.11 (release blocker label for 0.6.11), and run-ci labels May 7, 2026
@aleozlx
Collaborator

aleozlx commented May 7, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !641 has been created, and the CI pipeline #50488199 is currently running. I'll report back once the pipeline job completes.

@aleozlx
Collaborator

aleozlx commented May 7, 2026

Seeing a weird error on the 5090 that seems unrelated, but I need to take another look tomorrow.

============================= test session starts ==============================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0
rootdir: /tmp/flashinfer
configfile: pytest.ini
collected 23 items
tests/comm/test_trtllm_alltoall.py ....................F..               [100%]
=================================== FAILURES ===================================
E   assert 0 == ((7 * 128) + 0)
     +  where 0 = int(tensor(0, dtype=torch.int32))
/tmp/flashinfer/tests/comm/test_trtllm_alltoall.py:846: assert 0 == ((7 * 128) + 0)
=============================== warnings summary ===============================
<frozen importlib._bootstrap_external>:1301
  <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301
  <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
- generated xml file: /tmp/junit/tests_comm_test_trtllm_alltoall.py.2385559793.xml -
=========================== short test summary info ============================
FAILED tests/comm/test_trtllm_alltoall.py::test_moe_alltoall_prepare[3-8-128-144-8-1]
================== 1 failed, 22 passed, 2 warnings in 31.32s ===================
❌ FAILED: tests/comm/test_trtllm_alltoall.py

@aleozlx aleozlx merged commit 42bf79e into flashinfer-ai:main May 7, 2026
69 of 91 checks passed
aleozlx pushed a commit that referenced this pull request May 7, 2026

Labels

op: comm, run-ci, v0.6.10 (release blocker label for 0.6.10), v0.6.11 (release blocker label for 0.6.11)


3 participants