Skip to content

Fix multi-node WorkerProc init ordering and compilation_time None#36200

Open
ananyakgarg wants to merge 1 commit intovllm-project:mainfrom
ananyakgarg:export-D95475427
Open

Fix multi-node WorkerProc init ordering and compilation_time None#36200
ananyakgarg wants to merge 1 commit intovllm-project:mainfrom
ananyakgarg:export-D95475427

Conversation

@ananyakgarg
Copy link

Summary:
#34861 moved init_device() after _init_message_queues() which breaks the multi-node TP as _init_message_queues needs _INNER_DP_WORLD which is set in init_device(). This swaps the order back.

#35503 also added max(compilation_times) but remote workers return None in multi-node, and this filters them out.

Test Plan: OSS

Differential Revision: D95475427

Summary:
vllm-project#34861  moved `init_device()` after `_init_message_queues()` which breaks the multi-node TP as  `_init_message_queues` needs `_INNER_DP_WORLD` which is set in `init_device()`. This swaps the order back.

vllm-project#35503 also added `max(compilation_times)` but remote workers return None in multi-node, and this filters them out.

Test Plan: OSS

Differential Revision: D95475427
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two important fixes. First, it addresses a potential TypeError in abstract.py by filtering out None values from compilation_times before calculating the maximum. This is crucial for multi-node setups where remote workers might not report a compilation time. Second, it corrects the initialization order in multiproc_executor.py by ensuring init_device() is called before _init_message_queues(). This resolves a dependency issue where message queue initialization required distributed groups to be set up first. The changes are correct and well-targeted to fix the described issues.

@github-actions
Copy link

github-actions bot commented Mar 6, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@njhill
Copy link
Member

njhill commented Mar 6, 2026

Thanks @ananyakgarg ... there is already #36186 and #35892

@njhill
Copy link
Member

njhill commented Mar 16, 2026

@ananyakgarg do you want to update this or open a separate PR for the other fix here? And I'm wondering why that isn't triggered by our existing multi-node DP tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants