[plugin][distributed] use active platform's backend in get_default_distributed_backend #23969
Open
AgainstEntropy wants to merge 11 commits into
…ckend for matching devices, with fallback handling for exceptions.
Five cases via real SRTPlatform subclasses: override on active device, override on non-active device, override raising, in-tree-style default, and unknown device. Registered for stage-a-test-cpu.
Collaborator
Author
/tag-and-rerun-ci, try again
alexnails
reviewed
Apr 29, 2026
…D_BACKEND to device_mixin.py and updating get_default_distributed_backend to utilize platforms.current_platform for device type checks. This change enhances the flexibility of backend resolution and prepares for future platform refactoring.
alexnails
approved these changes
Apr 30, 2026
Collaborator
Author
/rerun-failed-ci, try again, and again, and again
ch-wan
reviewed
May 7, 2026
Motivation
Follow-up to #21388.
get_default_distributed_backend(device) currently maps a device to its torch.distributed backend via a hard-coded _DEVICE_TO_DISTRIBUTED_BACKEND dict (cuda → nccl, cpu → gloo, …). Out-of-tree platform plugins therefore have to patch the dict to add new backends.
The platform interface already declares get_torch_distributed_backend_str() as the source of truth for a platform's torch.distributed backend, and the multimodal_gen runtime already uses the same pattern when initializing process groups (multimodal_gen/runtime/distributed/parallel_state.py).
This PR aligns the SRT distributed init with that pattern while keeping the dict as a safe fallback.
Modifications
python/sglang/srt/distributed/parallel_state.py: get_default_distributed_backend(device) now calls current_platform.get_torch_distributed_backend_str() first, but only when device == current_platform.device_type. If that call raises NotImplementedError, the function silently falls through to the dict. This preserves today's in-tree behavior: the base method in device_mixin.py raises NotImplementedError and no in-tree platform overrides it yet.
The device == current_platform.device_type guard preserves correctness for callers that ask for a non-active device's backend (e.g. a cpu auxiliary gloo group on a CUDA process); those keep going through the dict.
current_platform is imported lazily inside the function to avoid a circular import with sglang.srt.platforms.
Tests
test/registered/unit/distributed/test_parallel_state.py covers all paths through get_default_distributed_backend using real SRTPlatform subclasses.
Registered for stage-a-test-cpu.
I also ran some manual TP/DP tests to confirm correctness.
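For readers who want the resolution order in one place, here is a minimal, self-contained sketch of the lookup described under Modifications. It is not the code from this PR. The Platform and PluginPlatform classes, the mydev device, and the mccl backend name are hypothetical stand-ins. The platform is passed as a parameter here, whereas the PR imports current_platform lazily inside the function. Two behaviors are assumptions rather than facts from the description: that exceptions other than NotImplementedError are logged and then fall through to the dict, and that unknown devices default to gloo.

```python
import logging

logger = logging.getLogger(__name__)

# Stand-in for the hard-coded mapping. The PR description elides the
# remaining entries, so only the two it names are listed here.
_DEVICE_TO_DISTRIBUTED_BACKEND = {"cuda": "nccl", "cpu": "gloo"}


class Platform:
    """Hypothetical stand-in for the platform base class."""

    device_type = "cuda"

    def get_torch_distributed_backend_str(self) -> str:
        # Base behavior described in the PR: no override by default.
        raise NotImplementedError


class PluginPlatform(Platform):
    """Hypothetical out-of-tree platform that declares its own backend."""

    device_type = "mydev"

    def get_torch_distributed_backend_str(self) -> str:
        return "mccl"  # hypothetical backend name


def get_default_distributed_backend(device: str, current_platform: Platform) -> str:
    # Consult the platform only when the caller asks about the active device.
    if device == current_platform.device_type:
        try:
            return current_platform.get_torch_distributed_backend_str()
        except NotImplementedError:
            pass  # no override: silently fall through to the dict
        except Exception:
            # Assumption: any other error is logged and also falls through.
            logger.warning("platform backend lookup failed", exc_info=True)
    # Assumption: devices missing from the dict default to "gloo".
    return _DEVICE_TO_DISTRIBUTED_BACKEND.get(device, "gloo")


plugin = PluginPlatform()
on_active = get_default_distributed_backend("mydev", plugin)    # "mccl", from the override
on_other = get_default_distributed_backend("cpu", plugin)       # "gloo", guard skips the platform
in_tree = get_default_distributed_backend("cuda", Platform())   # "nccl", base raises, dict is used
```

The second call illustrates why the device guard matters: a process whose active device is mydev can still ask for the cpu backend and get the dict's answer, not the plugin's.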