[plugin][distributed] use active platform's backend in get_default_distributed_backend#23969

Open
AgainstEntropy wants to merge 11 commits into
sgl-project:mainfrom
AgainstEntropy:feat/distributed-backend
@AgainstEntropy
Collaborator

@AgainstEntropy AgainstEntropy commented Apr 28, 2026

Motivation

Follow-up to #21388.

get_default_distributed_backend(device) currently maps a device to its torch.distributed backend via a hard-coded _DEVICE_TO_DISTRIBUTED_BACKEND dict (cuda → nccl, cpu → gloo, …). Out-of-tree platform plugins therefore have to patch the dict to register a new backend.

The platform interface already declares get_torch_distributed_backend_str() as the source of truth for a platform's torch.distributed backend, and the multimodal_gen runtime already uses the same pattern when initializing process groups (multimodal_gen/runtime/distributed/parallel_state.py).

This PR aligns the SRT distributed init with that pattern while keeping the dict as a safe fallback.

Modifications

python/sglang/srt/distributed/parallel_state.py: get_default_distributed_backend(device) now calls current_platform.get_torch_distributed_backend_str() first, but only when device == current_platform.device_type:

  • success → return the platform-supplied backend
  • NotImplementedError → silently fall through to the dict (today's in-tree behavior is preserved: the base method in device_mixin.py simply raises NotImplementedError, and no in-tree platform overrides it yet)
  • other exception → log a warning, fall through

The device == current_platform.device_type guard preserves correctness for callers that ask for a non-active device's backend (e.g. a cpu auxiliary gloo group on a CUDA process); those keep going through the dict.

current_platform is imported lazily inside the function to avoid a circular import with sglang.srt.platforms.
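
The resolution order described above can be sketched as follows. This is a minimal, self-contained illustration and not the PR's actual code: InTreePlatform, PluginPlatform, the "mydev" device, and the "mydev_ccl" backend are hypothetical names; the fallback dict shows only the two entries named in the description; current_platform is passed as a parameter instead of being imported lazily, so the sketch runs without sglang; and the "gloo" default for unknown devices is an assumption.

```python
import logging

logger = logging.getLogger(__name__)

# Fallback table. Only the two entries named in the PR description are shown.
_DEVICE_TO_DISTRIBUTED_BACKEND = {"cuda": "nccl", "cpu": "gloo"}


class InTreePlatform:
    """Hypothetical stand-in for a platform that does not override the hook."""

    device_type = "cuda"

    def get_torch_distributed_backend_str(self) -> str:
        raise NotImplementedError


class PluginPlatform(InTreePlatform):
    """Hypothetical out-of-tree platform that supplies its own backend."""

    device_type = "mydev"

    def get_torch_distributed_backend_str(self) -> str:
        return "mydev_ccl"


def get_default_distributed_backend(device: str, current_platform) -> str:
    # The real function takes only `device` and imports current_platform
    # lazily; it is a parameter here so the sketch has no sglang dependency.
    if device == current_platform.device_type:
        try:
            return current_platform.get_torch_distributed_backend_str()
        except NotImplementedError:
            pass  # no override: fall through to the dict silently
        except Exception as exc:
            logger.warning("Platform backend lookup failed (%s); using the dict.", exc)
    # Assumption: devices missing from the dict default to "gloo".
    return _DEVICE_TO_DISTRIBUTED_BACKEND.get(device, "gloo")


# Active device with an override: the platform decides.
assert get_default_distributed_backend("mydev", PluginPlatform()) == "mydev_ccl"
# Non-active device (e.g. an auxiliary cpu group): the dict decides.
assert get_default_distributed_backend("cpu", PluginPlatform()) == "gloo"
# In-tree style platform with no override: the dict decides.
assert get_default_distributed_backend("cuda", InTreePlatform()) == "nccl"
```

Catching NotImplementedError separately from other exceptions is what keeps the in-tree path quiet while still surfacing a misbehaving plugin in the logs.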

Tests

test/registered/unit/distributed/test_parallel_state.py covers all paths through get_default_distributed_backend using real SRTPlatform subclasses.
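
As a rough illustration of what covering every path looks like, here is a standalone unittest sketch with one test per case. It does not reproduce the PR's test file: resolve_backend is a simplified stand-in for get_default_distributed_backend, the platform classes are hypothetical rather than real SRTPlatform subclasses, and the expected "gloo" result for an unknown device is an assumption.

```python
import unittest

_FALLBACK = {"cuda": "nccl", "cpu": "gloo"}  # two entries only, for brevity


def resolve_backend(device, platform):
    """Simplified stand-in for the function under test (hypothetical)."""
    if device == platform.device_type:
        try:
            return platform.get_torch_distributed_backend_str()
        except Exception:
            # Covers NotImplementedError and unexpected errors alike;
            # the real code additionally logs a warning for the latter.
            pass
    return _FALLBACK.get(device, "gloo")  # assumed default for unknown devices


class DefaultPlatform:
    """In-tree style: the hook is not overridden."""
    device_type = "cuda"

    def get_torch_distributed_backend_str(self):
        raise NotImplementedError


class OverridePlatform(DefaultPlatform):
    """Plugin style: the hook returns a custom backend."""
    device_type = "fake"

    def get_torch_distributed_backend_str(self):
        return "fake_ccl"


class BrokenPlatform(DefaultPlatform):
    """The hook fails with an unexpected error."""

    def get_torch_distributed_backend_str(self):
        raise RuntimeError("boom")


class TestResolveBackend(unittest.TestCase):
    def test_override_on_active_device(self):
        self.assertEqual(resolve_backend("fake", OverridePlatform()), "fake_ccl")

    def test_override_on_non_active_device(self):
        self.assertEqual(resolve_backend("cpu", OverridePlatform()), "gloo")

    def test_override_raising(self):
        self.assertEqual(resolve_backend("cuda", BrokenPlatform()), "nccl")

    def test_in_tree_style_default(self):
        self.assertEqual(resolve_backend("cuda", DefaultPlatform()), "nccl")

    def test_unknown_device(self):
        self.assertEqual(resolve_backend("unknown", DefaultPlatform()), "gloo")
```

Saved to a file, the sketch can be run with python -m unittest; none of the cases needs an accelerator.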

Registered for stage-a-test-cpu:

python test/registered/unit/distributed/test_parallel_state.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.001s

OK

I also ran some manual TP/DP tests to confirm correctness.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

…ckend for matching devices, with fallback handling for exceptions.
Five cases via real SRTPlatform subclasses: override on active device,
override on non-active device, override raising, in-tree-style default,
and unknown device. Registered for stage-a-test-cpu.
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@AgainstEntropy
Collaborator Author

AgainstEntropy commented Apr 29, 2026

/tag-and-rerun-ci , try again

Comment thread python/sglang/srt/distributed/parallel_state.py Outdated
Comment thread python/sglang/srt/distributed/parallel_state.py Outdated
…D_BACKEND to device_mixin.py and updating get_default_distributed_backend to utilize platforms.current_platform for device type checks. This change enhances the flexibility of backend resolution and prepares for future platform refactoring.
@AgainstEntropy
Collaborator Author

AgainstEntropy commented Apr 30, 2026

/rerun-failed-ci, try again, and again, and again

Comment thread python/sglang/srt/platforms/__init__.py
3 participants