[Platform] allow platform to init dp group #22243
Code Review
This pull request refactors the data parallel process group initialization to be platform-specific, allowing different backends like nccl on GPUs. The change is generally good, but I've identified a critical issue in the error handling. The current implementation could silently fall back to the gloo backend if a platform-specific initialization (e.g., nccl) fails, potentially masking errors and causing severe performance degradation. I've suggested a more robust error handling mechanism to prevent this.
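One way to avoid the silent fallback is to distinguish "the platform has no custom DP group" from "the platform tried and failed". The sketch below is illustrative only: `init_dp_group`, `PlatformInitError`, `platform_init`, and `gloo_init` are hypothetical names, not vLLM's API, and it assumes the platform hook signals "not supported" by raising `NotImplementedError`.

```python
import logging

logger = logging.getLogger(__name__)


class PlatformInitError(RuntimeError):
    """Hypothetical error: the platform attempted its own backend and failed."""


def init_dp_group(platform_init, gloo_init):
    """Return (source, group), where source is "platform" or "gloo".

    platform_init and gloo_init are hypothetical zero-argument callables that
    stand in for the platform hook and the default gloo path.
    """
    try:
        group = platform_init()
    except NotImplementedError:
        # The platform has no custom DP group, so falling back is expected.
        logger.info("No platform-specific DP group; using gloo.")
        return "gloo", gloo_init()
    except Exception as exc:
        # The platform tried and failed: surface the error instead of
        # quietly switching to a slower backend.
        raise PlatformInitError("platform-specific DP group init failed") from exc
    return "platform", group
```

With this shape, a broken nccl setup would fail loudly at startup, while platforms without a custom hook would keep the existing gloo behaviour.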
|
PTAL at the test failure |
|
@DarkLight1337 Thanks for the quick reply. The problem seems to be caused by |
|
FYI the |
|
@hmellor Thanks for the reminder. We've figured out the CI problem; @zhaowei1936 will rebase and keep working on this PR. |
|
Please merge from main to fix merge conflict |
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
Failing test is the flaky tool choice one |
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: bbartels <benjamin@bartels.dev>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
When the DP group is initialized, the backend is always set to "gloo". Platforms should be able to initialize the DP group themselves. This PR lets the platform initialize the DP group and falls back to "gloo" if that fails.
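The idea of letting the platform choose can be sketched as a hook on the platform class with a default that keeps the old behaviour. This is a minimal illustration under stated assumptions: `Platform`, `NcclPlatform`, `dp_group_backend`, and `select_dp_backend` are hypothetical names and do not reflect the actual vLLM interface, and here "fall back" is modelled as the hook returning `None`.

```python
from typing import Optional


class Platform:
    """Hypothetical base class; names are illustrative, not vLLM's API."""

    def dp_group_backend(self) -> Optional[str]:
        # Returning None means the platform has no preference.
        return None


class NcclPlatform(Platform):
    """Hypothetical platform that prefers a device-side backend."""

    def dp_group_backend(self) -> Optional[str]:
        return "nccl"


def select_dp_backend(platform: Platform, default: str = "gloo") -> str:
    # Ask the platform first; keep the old default when it declines.
    backend = platform.dp_group_backend()
    return backend if backend is not None else default
```

For example, `select_dp_backend(Platform())` keeps "gloo", while `select_dp_backend(NcclPlatform())` picks the platform's own backend.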
For the CUDA platform, stateless_init_device_torch_dist_pg is not used anywhere, so this PR removes it.
Test Plan
Test Result
(Optional) Documentation Update