Fix multi-node WorkerProc init ordering and compilation_time None#36200
Fix multi-node WorkerProc init ordering and compilation_time None#36200ananyakgarg wants to merge 1 commit intovllm-project:mainfrom
Conversation
Summary: vllm-project#34861 moved `init_device()` after `_init_message_queues()` which breaks the multi-node TP as `_init_message_queues` needs `_INNER_DP_WORLD` which is set in `init_device()`. This swaps the order back. vllm-project#35503 also added `max(compilation_times)` but remote workers return None in multi-node, and this filters them out. Test Plan: OSS Differential Revision: D95475427
There was a problem hiding this comment.
Code Review
This pull request introduces two important fixes. First, it addresses a potential TypeError in abstract.py by filtering out None values from compilation_times before calculating the maximum. This is crucial for multi-node setups where remote workers might not report a compilation time. Second, it corrects the initialization order in multiproc_executor.py by ensuring init_device() is called before _init_message_queues(). This resolves a dependency issue where message queue initialization required distributed groups to be set up first. The changes are correct and well-targeted to fix the described issues.
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Thanks @ananyakgarg ... there is already #36186 and #35892 |
|
@ananyakgarg do you want to update this or open a separate PR for the other fix here? And I'm wondering why that isn't triggered by our existing multi-node DP tests? |
Summary:
#34861 moved
init_device()after_init_message_queues()which breaks the multi-node TP as_init_message_queuesneeds_INNER_DP_WORLDwhich is set ininit_device(). This swaps the order back.#35503 also added
max(compilation_times)but remote workers return None in multi-node, and this filters them out.Test Plan: OSS
Differential Revision: D95475427