[Platform] allow platform to init dp group by wangxiyuan · Pull Request #22243 · vllm-project/vllm

wangxiyuan · 2025-08-05T08:09:35Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

When the DP group is initialized, the backend is always set to "gloo". Platforms should be allowed to initialize the DP group themselves. This PR lets the platform initialize the DP group and falls back to "gloo" if that fails.

For the CUDA platform, stateless_init_device_torch_dist_pg is not used anywhere, so let's remove it now.
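To illustrate the "try the platform first, fall back to gloo" flow described above, here is a minimal sketch. It is not the PR's actual code: `Platform`, `DevicePlatform`, `init_dp_group`, `init_gloo_group` and `create_dp_group` are hypothetical names, and plain strings stand in for real process groups so the example runs without torch.

```python
class Platform:
    """Hypothetical stand-in for a platform class (not vLLM's real interface)."""

    def init_dp_group(self, rank: int, world_size: int) -> str:
        # Assumption: a platform without its own backend signals that by raising.
        raise NotImplementedError("no platform-specific DP backend")


class DevicePlatform(Platform):
    """Hypothetical platform that provides its own DP backend."""

    def init_dp_group(self, rank: int, world_size: int) -> str:
        # A string stands in for a real process group object.
        return f"device-backend(rank={rank}, world_size={world_size})"


def init_gloo_group(rank: int, world_size: int) -> str:
    # Stand-in for creating a gloo-backed process group.
    return f"gloo(rank={rank}, world_size={world_size})"


def create_dp_group(platform: Platform, rank: int, world_size: int) -> str:
    """Let the platform build the DP group; use gloo if that fails."""
    try:
        return platform.init_dp_group(rank, world_size)
    except Exception:
        # Broad fallback, mirroring "fallback to gloo if failed".
        return init_gloo_group(rank, world_size)


if __name__ == "__main__":
    print(create_dp_group(Platform(), 0, 2))
    print(create_dp_group(DevicePlatform(), 1, 2))
```

In this sketch the base platform ends up on the gloo stand-in, while a platform that overrides the hook gets its own backend.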

Test Plan

Test Result

(Optional) Documentation Update

gemini-code-assist

Code Review

This pull request refactors the data parallel process group initialization to be platform-specific, allowing different backends like nccl on GPUs. The change is generally good, but I've identified a critical issue in the error handling. The current implementation could silently fall back to the gloo backend if a platform-specific initialization (e.g., nccl) fails, potentially masking errors and causing severe performance degradation. I've suggested a more robust error handling mechanism to prevent this.
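One way to avoid the silent fallback raised in this review could look like the sketch below. It is an assumption about what "more robust" might mean, not the reviewer's actual suggestion or the merged code: the fallback to gloo happens only when the platform declares that it has no backend of its own, and genuine initialization errors propagate. All class and function names are hypothetical, and tuples stand in for real process groups.

```python
import logging

logger = logging.getLogger("dp_group_sketch")


class NoDeviceBackendPlatform:
    """Hypothetical platform that offers no DP backend of its own."""

    def init_dp_group(self, rank: int, world_size: int):
        raise NotImplementedError


class BrokenDevicePlatform:
    """Hypothetical platform whose backend fails while initializing."""

    def init_dp_group(self, rank: int, world_size: int):
        raise RuntimeError("device backend failed to initialize")


def init_gloo_group(rank: int, world_size: int):
    # A tuple stands in for a real gloo process group.
    return ("gloo", rank, world_size)


def create_dp_group_strict(platform, rank: int, world_size: int):
    """Fall back to gloo only when the platform opts out explicitly."""
    try:
        return platform.init_dp_group(rank, world_size)
    except NotImplementedError:
        # The fallback is logged, so it is visible rather than silent.
        logger.warning("Platform has no DP backend; falling back to gloo.")
        return init_gloo_group(rank, world_size)
    # Any other exception (for example RuntimeError) is not caught here,
    # so a real backend failure surfaces instead of degrading to gloo.
```

With this shape, a misconfigured device backend fails loudly at startup rather than running on a slower backend unnoticed.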

vllm/distributed/utils.py

github-actions · 2025-08-05T08:47:39Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

DarkLight1337 · 2025-08-05T10:37:19Z

PTAL at the test failure

wangxiyuan · 2025-08-06T02:25:40Z

@DarkLight1337 Thanks for the quick reply. The problem seems to be caused by the change from gloo to nccl. I'll take a look ASAP.

mergify · 2025-08-11T02:39:08Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wangxiyuan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

hmellor · 2025-08-11T06:56:20Z

FYI the ParallelConfig has moved to vllm/config/parallel.py

wangxiyuan · 2025-08-11T08:42:55Z

@hmellor Thanks for the reminder. We've figured out the CI problem; @zhaowei1936 will rebase and keep working on this PR.

DarkLight1337 · 2025-08-11T16:37:41Z

Please merge from main to fix merge conflict

hmellor · 2025-10-15T09:24:53Z

Failing test is the flaky tool choice one

wangxiyuan requested review from WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, simon-mo, tlrmchlsmth and youkaichao as code owners August 5, 2025 08:09

gemini-code-assist bot reviewed Aug 5, 2025

wangxiyuan force-pushed the fix_dp_group branch from 7e024c0 to 356c148 Compare August 5, 2025 08:55

DarkLight1337 approved these changes Aug 5, 2025

DarkLight1337 enabled auto-merge (squash) August 5, 2025 09:47

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 5, 2025

wangxiyuan mentioned this pull request Aug 7, 2025

fix gllo allreduce to hccl vllm-project/vllm-ascend#2102

Closed

mergify bot added the needs-rebase label Aug 11, 2025

auto-merge was automatically disabled August 11, 2025 16:04
Head branch was pushed to by a user without write access

zhaowei1936 force-pushed the fix_dp_group branch 2 times, most recently from b49b611 to 093847e Compare August 12, 2025 03:04

mergify bot removed the needs-rebase label Aug 12, 2025

zhaowei1936 force-pushed the fix_dp_group branch 4 times, most recently from 95d4513 to c1d0e92 Compare August 13, 2025 04:41

zhaowei1936 requested review from ProExpertProg and yewentao256 as code owners August 13, 2025 04:41

wangxiyuan mentioned this pull request Aug 19, 2025

Br gloo fix hccl vllm-project/vllm-ascend#2058

Closed

enable stateless_init_device_torch_dist_pg

67ea857

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

wangxiyuan force-pushed the fix_dp_group branch from c1d0e92 to 67ea857 Compare October 14, 2025 03:00

mergify bot added the rocm label (Related to AMD ROCm) Oct 14, 2025

vllm-bot merged commit db1764e into vllm-project:main Oct 15, 2025
50 of 52 checks passed

bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025

[Platform] allow platform to init dp group (vllm-project#22243)

f022dc9

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: bbartels <benjamin@bartels.dev>

albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 16, 2025

[Platform] allow platform to init dp group (vllm-project#22243)

9a7846d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>

libertyeagle mentioned this pull request Oct 20, 2025

[1/N] Elastic EP Milestone 2 #26278

Closed

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

[Platform] allow platform to init dp group (vllm-project#22243)

523d35d

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025

[Platform] allow platform to init dp group (vllm-project#22243)

896fc37

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

[Platform] allow platform to init dp group (vllm-project#22243)

ab312a9

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

[Platform] allow platform to init dp group (vllm-project#22243)

4864b9a

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
