Fix DP coordinator ZMQ port TOCTOU by itayalroy · Pull Request #37452 · vllm-project/vllm

itayalroy · 2026-03-18T16:02:34Z

Previously the parent selected the DP coordinator's TCP ZMQ ports with
get_open_port() before the coordinator actually bound them, leaving a
window where another socket could claim the ports.

Fix this by letting the coordinator bind first and report the bound ZMQ
addresses back to the parent via pipe.

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
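
For readers following along, the bind-then-report flow described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the PR's code: a plain TCP socket stands in for the ZMQ socket, both pipe ends live in one process for brevity (in the PR the binding side is the coordinator child process), and `bind_and_report` is a hypothetical helper name.

```python
import multiprocessing
import socket


def bind_and_report(conn, host="127.0.0.1"):
    """Bind to an OS-assigned port, then report the bound address over conn.

    Hypothetical helper for illustration; a plain TCP socket stands in
    for the coordinator's ZMQ socket.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the OS to pick a free port as part of the bind itself,
    # so there is no gap between choosing the port and claiming it.
    sock.bind((host, 0))
    bound_host, bound_port = sock.getsockname()
    conn.send(f"tcp://{bound_host}:{bound_port}")
    conn.close()
    # Return the socket so it stays open and the port stays claimed.
    return sock


# Pipe(duplex=False) returns (receive_end, send_end).
parent_conn, child_conn = multiprocessing.Pipe(duplex=False)
listener = bind_and_report(child_conn)
address = parent_conn.recv()  # the parent learns the port only after the bind
print(address)
```

The key property is ordering: the address travels to the parent only after the bind has succeeded, so the parent never advertises a port that something else could still claim.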

gemini-code-assist

Code Review

This pull request addresses a TOCTOU (time-of-check to time-of-use) race in the DP coordinator's ZMQ port selection. Previously, the parent process selected the ports before the coordinator bound them, creating a window in which other processes could claim those ports. This is fixed by having the coordinator bind the ports first and then report the bound addresses back to the parent via a pipe. The changes modify network_utils.py and coordinator.py to implement this mechanism.

gemini-code-assist · 2026-03-18T16:13:39Z

+            ready = multiprocessing.connection.wait(
+                [zmq_addr_pipe, self.proc.sentinel], timeout=30
+            )
+            if not ready:
+                raise RuntimeError(
+                    "DP Coordinator process failed to report ZMQ addresses "
+                    "during startup."
+                )
+            try:
+                return zmq_addr_pipe.recv()
+            except EOFError:
+                raise RuntimeError(
+                    "DP Coordinator process failed during startup."
+                ) from None


The _wait_for_zmq_addrs method includes a timeout of 30 seconds. If the DP Coordinator process fails to report ZMQ addresses within this time, a RuntimeError is raised. However, there's no mechanism to handle or retry this failure. Consider adding a retry mechanism or a more robust error handling strategy to improve the resilience of the system. This is a critical issue because a failure here will prevent the engine from starting up.

This is fatal IMO: if the DP Coordinator cannot report its ZMQ addresses within 30 seconds, it is reasonable to fail.

@itayalroy @tlrmchlsmth can we make the timeout configurable? The current 30s limit can be too short. For example, when spawn is forced, the child process has to re-import many modules.

Same issue here: `from vllm.v1.engine import coordinator` takes 70+ seconds to import.
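
One way the configurable timeout requested in this thread could look is sketched below. It is only an illustration: the environment variable name and both helper functions are hypothetical and do not exist in vLLM. Like the diff above, it relies on `multiprocessing.connection.wait`, and it additionally separates a timeout from an early child exit.

```python
import multiprocessing
import multiprocessing.connection
import os

# Hypothetical variable name for illustration; not an existing vLLM setting.
TIMEOUT_ENV_VAR = "EXAMPLE_COORDINATOR_STARTUP_TIMEOUT_S"


def startup_timeout(default: float = 30.0) -> float:
    """Read the startup timeout in seconds, falling back to the default."""
    return float(os.environ.get(TIMEOUT_ENV_VAR, default))


def wait_for_addrs(addr_pipe, sentinels=(), timeout=None):
    """Return what the child reported, or raise if it timed out or exited."""
    if timeout is None:
        timeout = startup_timeout()
    # Wake up when the pipe is readable (data or EOF) or a sentinel fires.
    ready = multiprocessing.connection.wait(
        [addr_pipe, *sentinels], timeout=timeout
    )
    if not ready:
        raise TimeoutError(f"no addresses reported within {timeout} seconds")
    try:
        return addr_pipe.recv()
    except EOFError:
        # The sending end closed without data: the child died during startup.
        raise RuntimeError("child exited before reporting addresses") from None
```

In real use, `sentinels` would be something like `[proc.sentinel]`. The `EOFError` branch only triggers if the parent has closed its own copy of the sending end, as the diff does with `child_zmq_addr_pipe.close()`; otherwise `recv()` could block after the child exits.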

gemini-code-assist · 2026-03-18T16:13:39Z

+        child_zmq_addr_pipe.close()
+        (
+            front_publish_address,
+            back_output_address,
+            back_publish_address,
+        ) = self._wait_for_zmq_addrs(parent_zmq_addr_pipe)


After starting the coordinator process, the parent process retrieves the bound ZMQ addresses using self._wait_for_zmq_addrs. However, if self._wait_for_zmq_addrs fails, the addresses used to initialize self.stats_publish_address and self.coord_in_address will be the original, unbound addresses. This could lead to the parent process attempting to communicate with the coordinator on the wrong ports. This is a critical issue because it can lead to communication failures between the parent and coordinator processes.

No. If _wait_for_zmq_addrs() fails, we raise an exception, so we never proceed using wrong ports.

gemini-code-assist · 2026-03-18T16:13:39Z

+        def bind_address(local_only: bool) -> str:
+            return (
+                get_engine_client_zmq_addr(local_only=True, host=host)
+                if local_only
+                else get_tcp_uri(host, 0)
+            )


The bind_address function uses get_engine_client_zmq_addr when local_only is true, which returns an IPC path. However, when local_only is false, it uses get_tcp_uri with port 0, which requests the OS to assign a port. This inconsistency in address types (IPC vs. TCP) could lead to unexpected behavior or configuration issues. Ensure that the address type is consistent based on the deployment environment or configuration. This is a high severity issue because it can lead to connectivity problems.

The inconsistency in address types (IPC/TCP) already exists. The only change is that with TCP we now let the OS assign the port at bind time instead of binding to a pre-chosen port that might already be taken.
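
To make the address selection in this thread concrete, here is a rough sketch of the two cases. The helpers `tcp_uri` and `ipc_uri` are hypothetical stand-ins written for this example; they are not the real `get_tcp_uri` or `get_engine_client_zmq_addr`, whose exact behavior is not shown on this page.

```python
import ipaddress
import uuid


def tcp_uri(host: str, port: int) -> str:
    """Format a TCP endpoint, bracketing IPv6 literals (illustrative helper)."""
    try:
        is_ipv6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_ipv6 = False  # a hostname rather than an IP literal
    return f"tcp://[{host}]:{port}" if is_ipv6 else f"tcp://{host}:{port}"


def ipc_uri(base_dir: str = "/tmp") -> str:
    """Build a unique IPC endpoint path (illustrative helper)."""
    return f"ipc://{base_dir}/{uuid.uuid4()}"


def bind_address(local_only: bool, host: str) -> str:
    # Local-only: a unique IPC path, so no port is involved at all.
    # Otherwise: TCP with port 0, leaving the port choice to the OS at bind time.
    return ipc_uri() if local_only else tcp_uri(host, 0)


print(bind_address(local_only=False, host="127.0.0.1"))
```

Either way, the string returned here is only what gets passed to bind. In the TCP case the real port is known only after binding, which is why the bound address has to be reported back to the parent.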

tlrmchlsmth

Looks good to me, @njhill if you can take a look, it'd be good to get another pair of eyes.

@itayalroy do you think there's a reasonable way to unit test this?

Replace `get_open_port()` with late binding (port 0) for the remote XPUB socket in `MessageQueue.__init__`, then read back the actual bound address via `zmq.LAST_ENDPOINT`. This eliminates the window between port discovery and socket bind where another process could claim the port. Follows the same pattern already used in the DP coordinator (PR vllm-project#37452).

Closes vllm-project#28498

Signed-off-by: RTCartist <wangshengb@buaa.edu.cn>

Fix ZMQ TCP port TOCTOU

5263afd

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

itayalroy requested a review from njhill as a code owner March 18, 2026 16:02

mergify Bot added the v1 label Mar 18, 2026

gemini-code-assist Bot reviewed Mar 18, 2026

tlrmchlsmth approved these changes Mar 19, 2026

Comment thread vllm/v1/engine/coordinator.py

Comment thread vllm/v1/engine/coordinator.py Outdated

tlrmchlsmth self-assigned this Mar 19, 2026

itayalroy added 2 commits March 19, 2026 12:33

Add missing mp.connection import

0edeb97

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

Close DPCoordinatorProc-side socket

b9aad9c

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 19, 2026

tlrmchlsmth enabled auto-merge (squash) March 19, 2026 22:15

tlrmchlsmth merged commit ca1ac1a into vllm-project:main Mar 20, 2026
51 of 52 checks passed

chooper26 pushed a commit to vLLM-HUST/vllm-hust that referenced this pull request Mar 21, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

5038b26

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

tzulingk mentioned this pull request Mar 24, 2026

[Bugfix] Fix elastic EP scale-up after scale-down #37357

Closed

SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

d32aa71

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

c961d01

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

cb93bae

Signed-off-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>

JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

aa0c8a0

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

AjAnubolu mentioned this pull request Apr 9, 2026

[Bugfix] Fix DP port conflict race condition with late binding #35977

Open

mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

0e9e634

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

KrxGu mentioned this pull request Apr 13, 2026

[Feature][BugFix] Add opt-in request watchdog to abort stuck requests #36130

Open

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

46d624f

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

hjh0119 mentioned this pull request May 11, 2026

fix vllm dp & reset_encoder_cache & fix vllm init with zero3 modelscope/ms-swift#9295

Merged

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

45a460f

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

e7283ef

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

zhudotexe mentioned this pull request May 25, 2026

[Frontend] Make DP ZMQ liveness timeout configurable #43611

Open

4 tasks

This was referenced Jun 4, 2026

[Bug][RL]: Port Conflict #28498

Open

[Bugfix] Fix ZMQ port TOCTOU race in shm_broadcast.py #44495

Open

mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026

Fix DP coordinator ZMQ port TOCTOU (vllm-project#37452)

4e12ecb

Signed-off-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

arnavnagzirkar mentioned this pull request Jun 10, 2026

fix: [Bug][RL]: Port Conflict #45064

Closed

Fix DP coordinator ZMQ port TOCTOU #37452
tlrmchlsmth merged 3 commits into vllm-project:main from itayalroy:zmq_toctou

itayalroy commented Mar 18, 2026 • edited by github-actions Bot

gemini-code-assist Bot left a comment

gemini-code-assist Bot Mar 18, 2026

itayalroy Mar 18, 2026

zch42 May 6, 2026 • edited

hjh0119 May 11, 2026

gemini-code-assist Bot Mar 18, 2026

itayalroy Mar 18, 2026

gemini-code-assist Bot Mar 18, 2026

itayalroy Mar 18, 2026

tlrmchlsmth left a comment
