Fix multi-node WorkerProc init ordering and compilation_time None by ananyakgarg · Pull Request #36200 · vllm-project/vllm

ananyakgarg · 2026-03-06T02:42:19Z

Summary:
#34861 moved `init_device()` after `_init_message_queues()`, which breaks multi-node TP because `_init_message_queues()` needs `_INNER_DP_WORLD`, which is set in `init_device()`. This PR swaps the order back.
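The ordering dependency can be illustrated with a toy model. This is a minimal sketch, not vLLM's actual code: `ToyWorkerProc`, `start_worker`, and the string used as the group handle are hypothetical stand-ins, and only the names `init_device`, `_init_message_queues`, and `_INNER_DP_WORLD` come from the description above.

```python
# Toy stand-in for the ordering dependency; not vLLM's actual implementation.
_INNER_DP_WORLD = None  # hypothetical module-level group handle


class ToyWorkerProc:
    def init_device(self) -> None:
        # In this sketch, device init is the step that creates the group.
        global _INNER_DP_WORLD
        _INNER_DP_WORLD = "inner-dp-group"

    def _init_message_queues(self) -> str:
        # Queue setup reads the group, so the group must already exist.
        if _INNER_DP_WORLD is None:
            raise RuntimeError("_INNER_DP_WORLD is not initialized")
        return f"queues bound to {_INNER_DP_WORLD}"


def start_worker(device_first: bool) -> str:
    """Run the two init steps in the chosen order and return the queue info."""
    global _INNER_DP_WORLD
    _INNER_DP_WORLD = None  # reset so each call starts from a clean state
    worker = ToyWorkerProc()
    if device_first:
        # Order restored by this PR: device init, then message queues.
        worker.init_device()
        return worker._init_message_queues()
    # Regressed order: message queues before device init.
    result = worker._init_message_queues()
    worker.init_device()
    return result
```

In this sketch, `start_worker(device_first=True)` succeeds, while `start_worker(device_first=False)` raises `RuntimeError` because the queues are set up before the group exists.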

#35503 also added `max(compilation_times)`, but remote workers return `None` in multi-node setups, so this PR filters those values out before taking the maximum.
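The `None` filtering can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: the helper name `max_compilation_time` is hypothetical, and falling back to `0.0` when no worker reports a time is an assumption that the description does not address.

```python
from typing import Optional, Sequence


def max_compilation_time(compilation_times: Sequence[Optional[float]]) -> float:
    """Return the largest reported compilation time, ignoring None entries."""
    # Calling max() directly on a mix of floats and None raises TypeError
    # in Python 3, so drop the None entries first.
    reported = [t for t in compilation_times if t is not None]
    # Assumption: fall back to 0.0 when no worker reported a time.
    return max(reported, default=0.0)
```

For example, `max_compilation_time([1.5, None, 3.0])` returns `3.0`, where a bare `max([1.5, None, 3.0])` would raise `TypeError`.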

Test Plan: OSS

Differential Revision: D95475427


gemini-code-assist

Code Review

This pull request introduces two important fixes. First, it addresses a potential TypeError in abstract.py by filtering out None values from compilation_times before calculating the maximum. This is crucial for multi-node setups where remote workers might not report a compilation time. Second, it corrects the initialization order in multiproc_executor.py by ensuring init_device() is called before _init_message_queues(). This resolves a dependency issue where message queue initialization required distributed groups to be set up first. The changes are correct and well-targeted to fix the described issues.

github-actions · 2026-03-06T02:45:13Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

njhill · 2026-03-06T05:10:48Z

Thanks @ananyakgarg ... there is already #36186 and #35892

njhill · 2026-03-16T19:06:11Z

@ananyakgarg do you want to update this or open a separate PR for the other fix here? And I'm wondering why that isn't triggered by our existing multi-node DP tests?

ananyakgarg requested a review from njhill as a code owner March 6, 2026 02:42

meta-codesync bot added the fb-exported and meta-exported labels Mar 6, 2026

mergify bot added the v1 label Mar 6, 2026

gemini-code-assist bot reviewed Mar 6, 2026

ananyakgarg wants to merge 1 commit into vllm-project:main from ananyakgarg:export-D95475427
