Skip to content

[Platform] allow platform to init dp group#22243

Merged
vllm-bot merged 1 commit intovllm-project:mainfrom
wangxiyuan:fix_dp_group
Oct 15, 2025
Merged

[Platform] allow platform to init dp group#22243
vllm-bot merged 1 commit intovllm-project:mainfrom
wangxiyuan:fix_dp_group

Conversation

@wangxiyuan
Copy link
Copy Markdown
Contributor

@wangxiyuan wangxiyuan commented Aug 5, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

When init dp group, it's always set to "gloo", we should allow platform to init dp group by itself. This PR allow platform init dp group and fallback to "gloo" if failed.

For cuda platfrom, stateless_init_device_torch_dist_pg is not used anywhere, let's remove it now.

Test Plan

Test Result

(Optional) Documentation Update

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the data parallel process group initialization to be platform-specific, allowing different backends like nccl on GPUs. The change is generally good, but I've identified a critical issue in the error handling. The current implementation could silently fall back to the gloo backend if a platform-specific initialization (e.g., nccl) fails, potentially masking errors and causing severe performance degradation. I've suggested a more robust error handling mechanism to prevent this.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 5, 2025 09:47
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 5, 2025
@DarkLight1337
Copy link
Copy Markdown
Member

PTAL at the test failure

@wangxiyuan
Copy link
Copy Markdown
Contributor Author

@DarkLight1337 Thanks for the quick reply. the problem seems caused by gloo to nccl chaning. I'll take a look ASAP

@mergify
Copy link
Copy Markdown

mergify bot commented Aug 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wangxiyuan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 11, 2025
@hmellor
Copy link
Copy Markdown
Member

hmellor commented Aug 11, 2025

FYI the ParallelConfig has moved to vllm/config/parallel.py

@wangxiyuan
Copy link
Copy Markdown
Contributor Author

@hmellor thanks for the reminding. And we've figured out the CI problem @zhaowei1936 will rebase and keep working on this PR.

auto-merge was automatically disabled August 11, 2025 16:04

Head branch was pushed to by a user without write access

@DarkLight1337
Copy link
Copy Markdown
Member

Please merge from main to fix merge conflict

@zhaowei1936 zhaowei1936 force-pushed the fix_dp_group branch 2 times, most recently from b49b611 to 093847e Compare August 12, 2025 03:04
@mergify mergify bot removed the needs-rebase label Aug 12, 2025
@zhaowei1936 zhaowei1936 force-pushed the fix_dp_group branch 4 times, most recently from 95d4513 to c1d0e92 Compare August 13, 2025 04:41
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
@mergify mergify bot added the rocm Related to AMD ROCm label Oct 14, 2025
@hmellor
Copy link
Copy Markdown
Member

hmellor commented Oct 15, 2025

Failing test is the flaky tool choice one

@vllm-bot vllm-bot merged commit db1764e into vllm-project:main Oct 15, 2025
50 of 52 checks passed
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 16, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants