Skip to content

[Bugfix] Fix ZMQ port TOCTOU race in shm_broadcast.py#44495

Open
RTCartist wants to merge 1 commit into
vllm-project:mainfrom
RTCartist:fix/shm-broadcast-port-toctou
Open

[Bugfix] Fix ZMQ port TOCTOU race in shm_broadcast.py#44495
RTCartist wants to merge 1 commit into
vllm-project:mainfrom
RTCartist:fix/shm-broadcast-port-toctou

Conversation

@RTCartist

@RTCartist RTCartist commented Jun 4, 2026

Copy link
Copy Markdown

Summary

  • Fix TOCTOU (time-of-check-time-of-use) race condition in MessageQueue.__init__ where get_open_port() discovers a port, but another process (e.g., Ray) can claim it
    before the ZMQ XPUB socket binds — causing zmq.error.ZMQError: Address already in use
  • Replace with late binding (port=0), letting the OS assign the port atomically at bind() time, then read back the real address via zmq.LAST_ENDPOINT
  • Follows the same pattern already merged in the DP coordinator (PR Fix DP coordinator ZMQ port TOCTOU #37452)

Motivation

In multi-node deployments (e.g., 2P 4D with DeepSeek-V3), engine startup takes ~5 minutes after port selection. During this window, other services can claim the
pre-selected port. This fix eliminates that race window entirely.

Changes

  • vllm/distributed/device_communicators/shm_broadcast.py:
    • Removed get_open_port() call for remote subscriber socket
    • Bind to tcp://{connect_ip}:0 instead of a pre-selected port
    • Read actual bound endpoint via self.remote_socket.getsockopt(zmq.LAST_ENDPOINT).decode()
    • Removed unused get_open_port import

Existing PRs

PRs #30520, #35977, and #39496 address the same issue but have been stalled with merge conflicts for 3+ months. This PR is a minimal, focused fix for the one remaining
TOCTOU site (shm_broadcast.py), since the coordinator side was already fixed by PR #37452.

Test plan

  • Existing shm_broadcast tests pass (the fix is a drop-in replacement — same bind semantics, same address format from LAST_ENDPOINT)
  • Multi-node deployments no longer hit Address already in use on the remote XPUB socket

AI disclosure

This PR was developed with AI assistance (Claude). All code has been reviewed and understood by the human submitter.

Closes #28498

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the bug Something isn't working label Jun 4, 2026
Replace `get_open_port()` with late binding (port 0) for the remote
XPUB socket in `MessageQueue.__init__`, then read back the actual
bound address via `zmq.LAST_ENDPOINT`. This eliminates the window
between port discovery and socket bind where another process could
claim the port.

Follows the same pattern already used in the DP coordinator
(PR vllm-project#37452).

Closes vllm-project#28498

Signed-off-by: RTCartist <wangshengb@buaa.edu.cn>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug][RL]: Port Conflict

1 participant