[BugFix][Core] Fix error when enable async-scheduling in multi-node env #25887
njhill merged 3 commits into vllm-project:main
Code Review
This pull request aims to fix a launch failure for async scheduling in a multi-node environment by adjusting when the distributed executor backend is configured. The change correctly removes the premature default setting of the backend to mp. However, the new validation logic for supported backends with async scheduling seems to have some inconsistencies. I've added a comment with a suggestion to clarify this logic and make it consistent with the information provided in the pull request description.
@WoosukKwon @benchislett Hello, could you please review this MR?
@WoosukKwon @benchislett Hi, could you take a look at this PR?
benchislett left a comment
LGTM, one grammar nit
…he default selection. Signed-off-by: Lehua Ding <lehuading@tencent.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Lehua Ding <lehuading@qq.com>
Signed-off-by: Lehua Ding <lehuading@tencent.com>
…nv (vllm-project#25887) Signed-off-by: Lehua Ding <lehuading@tencent.com> Signed-off-by: Lehua Ding <lehuading@qq.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
…nv (vllm-project#25887) Signed-off-by: Lehua Ding <lehuading@tencent.com> Signed-off-by: Lehua Ding <lehuading@qq.com> Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
When launching in a multi-node environment (e.g., TP16), the ParallelConfig automatically selects ray as the distributed_executor_backend. However, when async scheduling is enabled, the config prematurely sets the default value of distributed_executor_backend to mp, causing a launch failure. This fix moves the async scheduling check so that it runs after the backend has been auto-selected.

Currently, async scheduling (primarily the full overlap feature) does not support Ray as a backend, and attempting to use it results in an error. Support for this can be added in a future PR.
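
The ordering issue described above can be illustrated with a minimal sketch. This is not vLLM's actual code: the class `SketchParallelConfig`, its `multi_node` flag, and the `resolve_backend` method are hypothetical names, and the rule "multi-node selects ray, otherwise mp" is a simplification assumed for illustration. The only point it shows is that the async scheduling check runs after the backend has been resolved, rather than forcing a default before auto-selection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchParallelConfig:
    """Hypothetical stand-in for illustration, not vLLM's ParallelConfig."""

    multi_node: bool = False
    async_scheduling: bool = False
    distributed_executor_backend: Optional[str] = None

    def resolve_backend(self) -> str:
        # Step 1: auto-select a backend only if the user did not set one.
        # Assumed rule for this sketch: multi-node -> "ray", otherwise "mp".
        if self.distributed_executor_backend is None:
            self.distributed_executor_backend = "ray" if self.multi_node else "mp"

        # Step 2: validate async scheduling against the *final* backend.
        # Running this check before step 1 is the ordering problem the PR
        # description reports: it would pin "mp" before auto-selection.
        if self.async_scheduling and self.distributed_executor_backend == "ray":
            raise ValueError(
                "async scheduling does not support the ray backend (sketch)"
            )
        return self.distributed_executor_backend
```

With this ordering, a single-node config with async scheduling resolves to mp, while a multi-node config with async scheduling fails with an explicit message instead of silently being given a backend it cannot launch with.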