fix hang in allreduce comms in SGL#3247
Conversation
📝 WalkthroughWalkthroughThe PR adds optional ProcessGroup support to TRTLLM-based AllReduce fusion by introducing a ChangesAllReduce ProcessGroup Parameter Wiring
Estimated Code Review Effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request adds a group parameter to the AllReduceFusionWorkspace and its factory function to support symmetric memory rendezvous, primarily for the TRTLLM backend. The review feedback recommends extending this group support to the MNNVL backend to ensure consistency and reliability across different communication backends, which would also require updating the associated docstrings.
I am having trouble creating individual review comments. Click here to see my feedback.
flashinfer/comm/allreduce.py (331)
If support for the group parameter is extended to the MNNVL backend for consistency, the "(trtllm backend only)" restriction in the docstring should be removed.
group: Process group for symmetric memory rendezvous. Defaults to torch.distributed.group.WORLD.
flashinfer/comm/allreduce.py (420-423)
The group parameter is currently ignored when the mnnvl backend is selected. To ensure consistency across backends and prevent potential hangs in MNNVL (which also performs symmetric memory rendezvous), the group should be used to initialize the comm_backend if it's not explicitly provided. Note that TorchDistBackend is imported locally to avoid potential circular dependencies.
group=group,
)
elif actual_backend == "mnnvl":
if comm_backend is None and group is not None:
from .mnnvl import TorchDistBackend
comm_backend = TorchDistBackend(group=group)There was a problem hiding this comment.
🧹 Nitpick comments (1)
flashinfer/comm/allreduce.py (1)
286-296: ⚡ Quick win
groupis silently ignored for the mnnvl backend.
groupis accepted at the public API level but only forwarded down thetrtllmpath (Line 420). Foractual_backend == "mnnvl"and especially forbackend="auto"where the heuristic prefers mnnvl when available (Line 274-275), a caller who explicitly passed agroupto work around the SGLang hang will see it silently dropped. The docstring at Line 331 mentions “trtllm backend only”, but a runtime warning would make this much harder to misuse — particularly relevant since the motivation for this PR is exactly the kind of subtle hang that arises when group semantics are wrong.Consider warning when the user explicitly passes a non-default
groupbut the selected backend won't honor it.♻️ Suggested change
elif actual_backend == "mnnvl": + if group is not None: + logger.warning( + "`group` is only honored by the trtllm backend; ignoring for mnnvl backend." + ) mapping = Mapping(Also applies to: 411-451
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@flashinfer/comm/allreduce.py` around lines 286 - 296, The function create_allreduce_fusion_workspace currently accepts a group parameter but silently ignores it when actual_backend == "mnnvl" (and when backend="auto" that chooses mnnvl), which can break callers relying on group semantics; update create_allreduce_fusion_workspace to detect when a non-default/non-None group is passed and the chosen backend is not "trtllm" (e.g., actual_backend == "mnnvl"), and emit a runtime warning (or logger.warn) informing the user that the provided group will be ignored by the selected backend and that group semantics are only honored for the trtllm path; reference the symbols actual_backend, backend, group, and the trtllm/mnnvl branches to locate where to add the check and warning (ensure the warning runs both when backend=="mnnvl" and when backend=="auto" resolves to mnnvl).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@flashinfer/comm/allreduce.py`:
- Around line 286-296: The function create_allreduce_fusion_workspace currently
accepts a group parameter but silently ignores it when actual_backend == "mnnvl"
(and when backend="auto" that chooses mnnvl), which can break callers relying on
group semantics; update create_allreduce_fusion_workspace to detect when a
non-default/non-None group is passed and the chosen backend is not "trtllm"
(e.g., actual_backend == "mnnvl"), and emit a runtime warning (or logger.warn)
informing the user that the provided group will be ignored by the selected
backend and that group semantics are only honored for the trtllm path; reference
the symbols actual_backend, backend, group, and the trtllm/mnnvl branches to
locate where to add the check and warning (ensure the warning runs both when
backend=="mnnvl" and when backend=="auto" resolves to mnnvl).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b6ab1b15-6565-4798-ad55-f2c5040bb6ed
📒 Files selected for processing (1)
flashinfer/comm/allreduce.py
|
/bot run |
|
seeing a weird error on 5090 that seems irrelevant but need to take another look tmr |
<!-- .github/pull_request_template.md --> ## 📌 Description Caused by #2955 Currently, it's causing a bug in SGLang. in missing `group=` parameter, (with scenario of 4 devices and world size = 2), the rendezvous will expect all 4 to respond, and cause a hang in warmup. <!-- What does this PR do? Briefly describe the changes and why they’re needed. --> ## 🔍 Related Issues #2955 <!-- Link any related issues here --> ## 🚀 Pull Request Checklist Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete. ### ✅ Pre-commit Checks - [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method). - [x] I have installed the hooks with `pre-commit install`. - [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues. > If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/). ## 🧪 Tests - [x] Tests have been added or updated as needed. - [x] All tests are passing (`unittest`, etc.). ## Reviewer Notes sgl-project/sglang#24452 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added optional process group parameter for AllReduce fusion on TRTLLM backends, enabling users to configure symmetric memory rendezvous behavior. * **Documentation** * Updated documentation to describe the new parameter and its default behavior. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
📌 Description
Caused by #2955
Currently, it's causing a bug in SGLang. in missing
group=parameter, (with scenario of 4 devices and world size = 2), the rendezvous will expect all 4 to respond, and cause a hang in warmup.🔍 Related Issues
#2955
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commitby runningpip install pre-commit(or used your preferred method).pre-commit install.pre-commit run --all-filesand fixed any reported issues.🧪 Tests
unittest, etc.).Reviewer Notes
sgl-project/sglang#24452
Summary by CodeRabbit
Release Notes
New Features
Documentation