[parallel state Refactor 3/n] Unify communicator interface #21098
DarkSharpness wants to merge 6 commits into sgl-project:main from
Conversation
Summary of Changes: This pull request introduces a significant refactoring of the distributed communication stack to enhance modularity, extensibility, and maintainability.
Code Review
This pull request introduces a significant and well-executed refactoring of the distributed communication stack. The new BaseCommunicator abstraction and CommunicatorImpl dispatcher create a unified, data-driven, and more extensible system for handling different communication backends. The changes are clean, modular, and align perfectly with the stated goals. I have one suggestion regarding a potential issue with torch.compile compatibility in the new custom op registration.
Force-pushed from 60a1018 to 874c014
/gemini review
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
/tag-and-rerun-ci try again
```python
_GROUPS: Dict[str, Callable[[], Optional[CommunicatorImpl]]] = {}


def _register_group(group: CommunicatorImpl) -> None:
```
Why do we use a weakref here?
python/sglang/srt/distributed/device_communicators/custom_all_reduce_v2.py (outdated comment, resolved)
```python
if mode is not None and _is_mode_supported(mode, inplace):
if not comm.should_use_custom_op():
    return comm.all_reduce(input_, inplace=inplace)
use_inplace = _can_use_inplace(mode) if inplace is None else inplace
```
nit: I wonder whether users can tell that in-place operations are being used here, and whether there is a risk of misuse.
If `inplace` is specified (not None), it is strictly followed. Otherwise, it depends on the backend, which chooses the fastest available implementation.
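That rule could be sketched as follows (a toy stand-in: `FakeComm` and its doubling "reduction" are purely illustrative, not the PR's code):

```python
from typing import List, Optional


class FakeComm:
    """Toy backend that prefers in-place all-reduce when given a choice."""

    def _can_use_inplace(self) -> bool:
        return True  # pretend in-place is fastest for this backend

    def all_reduce(self, data: List[int], inplace: Optional[bool] = None) -> List[int]:
        # Explicit inplace=True/False is honored strictly; None defers to
        # the backend's own preference.
        use_inplace = self._can_use_inplace() if inplace is None else inplace
        if use_inplace:
            data[:] = [v * 2 for v in data]  # mutate the input buffer
            return data
        return [v * 2 for v in data]  # fresh output buffer


comm = FakeComm()
buf = [1, 2, 3]
out = comm.all_reduce(buf, inplace=False)
assert out is not buf and buf == [1, 2, 3]  # out-of-place: input untouched
out2 = comm.all_reduce(buf)  # inplace=None: backend chooses (in-place here)
assert out2 is buf and buf == [2, 4, 6]
```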
BBuf
left a comment
This is a large change, so before merging we should make sure all CI checks pass.
Motivation
#20866, #20871
Modifications
This PR refactors the distributed communication stack in `srt` to make communicator behavior explicit, composable, and easier to extend across backends.

The main change is introducing a unified `BaseCommunicator` abstraction and moving backend selection logic out of the large ad hoc branches in `parallel_state.py` into a reusable `CommunicatorImpl` dispatcher. This keeps the platform-specific implementations local to each communicator while preserving the existing runtime behavior at the group level.

Previously, communicator behavior was spread across backend-specific helpers and `parallel_state.py`, with different backends exposing slightly different methods such as `custom_all_reduce`, backend-specific `should_*` predicates, or implicit in-place assumptions.

With `BaseCommunicator`, each backend now exposes the same public contract:

- `can_all_reduce()` reports whether a communicator can handle the input, and if so whether it supports in-place, out-of-place, or both modes.
- `all_reduce(..., inplace=...)` is the single all-reduce entry point for all backends.
- `should_use_custom_op()` tells the dispatcher whether the communicator should run through registered custom ops for `torch.compile` / piecewise-cuda-graph-friendly execution.
- `graph_capture_context()` lets each communicator define any graph-capture-specific setup in one place.

This makes communicator selection data-driven instead of backend-name-driven, and removes a large amount of backend-specific branching from the group coordinator.
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci