Skip to content

[BugFix] Use late binding to avoid zmq port conflict race conditions#30520

Open
njhill wants to merge 5 commits into
vllm-project:mainfrom
njhill:avoid-port-conflicts
Open

[BugFix] Use late binding to avoid zmq port conflict race conditions#30520
njhill wants to merge 5 commits into
vllm-project:mainfrom
njhill:avoid-port-conflicts

Conversation

@njhill

@njhill njhill commented Dec 11, 2025

Copy link
Copy Markdown
Member

This is a variation of #28636 that doesn't rely on creating temporary sockets.

Used claude code to help with some of this.


Note

Cursor Bugbot is generating a summary for commit 19b52ee. Configure here.


Note

Implements late-binding for ZMQ endpoints to remove port allocation races and plumbs actual bound addresses through the system.

  • Add is_wildcard_addr, bind_zmq_socket_and_get_address, and extend make_zmq_socket(..., return_address=True) to discover real endpoints from wildcard TCP binds; improve typing/overloads
  • Replace pre-picked ports with tcp://<host>:0 in shm_broadcast.py and use last_endpoint to build subscriber address (incl. IPv6 handling)
  • Update DP coordinator and engine clients to bind sockets with make_zmq_socket(..., return_address=True) and report actual addresses via multiprocessing pipes; parent waits and updates addresses
  • Extend API server manager to collect late-bound input/output addresses from workers and update lists before engine handshake
  • Simplify parallel_state distributed init method selection and logging
  • Minor: use _get_open_port() in get_open_ports_list

Written by Cursor Bugbot for commit 19b52ee. This will update automatically on new commits. Configure here.

@mergify mergify Bot added the v1 label Dec 11, 2025

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a late-binding mechanism for ZeroMQ sockets to prevent race conditions when acquiring network ports. The changes are well-structured, using multiprocessing.Pipe to communicate the OS-assigned port from child processes back to their parents. This pattern is correctly applied in DPCoordinator and APIServerProcessManager. The use of @overload in make_zmq_socket also improves type safety.

However, I've identified a critical issue that could break multi-node deployments by incorrectly resolving wildcard addresses to a loopback IP. Additionally, there's a high-severity issue with error handling in APIServerProcessManager that could mask initialization failures and make debugging more difficult. Please see my detailed comments for suggestions on how to address these points.

Comment thread vllm/utils/network_utils.py Outdated
Comment thread vllm/v1/utils.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread vllm/utils/network_utils.py Outdated
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill force-pushed the avoid-port-conflicts branch from 2e6c13f to 78ab95c Compare December 11, 2025 23:15
njhill and others added 3 commits December 11, 2025 16:51
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
# Conflicts:
#	vllm/v1/engine/coordinator.py

Signed-off-by: Nick Hill <nickhill123@gmail.com>
except Exception as e:
logger.error("Failed to get addresses from DP Coordinator: %s", e)
raise
parent_conn.close()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Child process crashes when late binding not needed

High Severity

The parent process always creates a pipe and passes it to the child, but only receives from it when needs_late_binding is True. When needs_late_binding is False (common in local-only deployments using IPC addresses), the parent closes parent_conn without receiving. However, the child process (DPCoordinatorProc and MPClient) unconditionally sends on address_report_pipe if it's not None. This causes a BrokenPipeError in the child when it tries to send on a pipe whose read end is already closed, crashing the coordinator or API server process. The same pattern exists in APIServerProcessManager.

🔬 Verification Test

Why verification test was not possible: This bug involves multiprocessing race conditions between parent and child processes in Python's multiprocessing framework. Testing would require spawning actual processes and coordinating their timing, which cannot be done in a simple unit test without the full vLLM environment. The bug is confirmed through code analysis: the parent unconditionally passes the pipe to the child, but only calls recv() when needs_late_binding is True, while the child unconditionally calls send() when the pipe is not None.

Additional Locations (1)

Fix in Cursor Fix in Web

>>> is_wildcard_addr("ipc:///tmp/socket")
False
"""
return addr.startswith("tcp://") and ":0" in addr

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wildcard address check matches IPv6 addresses containing :0

Low Severity

The is_wildcard_addr function uses ":0" in addr to check for wildcard port addresses, but this can incorrectly match IPv6 addresses that contain :0 in the address portion. For example, tcp://[2001:db8::1:0]:8080 would be incorrectly identified as a wildcard because the IPv6 address ends with hex group 0. The check should use addr.endswith(":0") instead, since the port is always at the end of the address string. This causes false positives leading to unnecessary late-binding overhead, though functionality remains correct.

🔬 Verification Test

Test code:

# Test the is_wildcard_addr function logic
def is_wildcard_addr(addr: str) -> bool:
    return addr.startswith("tcp://") and ":0" in addr

# True positives (correct)
assert is_wildcard_addr("tcp://*:0") == True
assert is_wildcard_addr("tcp://192.168.1.5:0") == True
assert is_wildcard_addr("tcp://[::1]:0") == True

# True negatives (correct)
assert is_wildcard_addr("tcp://127.0.0.1:8080") == False
assert is_wildcard_addr("ipc:///tmp/socket") == False

# FALSE POSITIVE - this is the bug
# IPv6 address with hex group 0 at the end, but port is 8080 (not wildcard)
result = is_wildcard_addr("tcp://[2001:db8::1:0]:8080")
print(f"tcp://[2001:db8::1:0]:8080 -> {result}")
print(f"Expected: False, Got: {result}")
print(f"Bug confirmed: {result == True}")

# The fix would be:
def is_wildcard_addr_fixed(addr: str) -> bool:
    return addr.startswith("tcp://") and addr.endswith(":0")

result_fixed = is_wildcard_addr_fixed("tcp://[2001:db8::1:0]:8080")
print(f"Fixed function result: {result_fixed}")

Command run:

python3 -c "
def is_wildcard_addr(addr):
    return addr.startswith('tcp://') and ':0' in addr

result = is_wildcard_addr('tcp://[2001:db8::1:0]:8080')
print(f'tcp://[2001:db8::1:0]:8080 -> {result}')
print(f'Expected: False, Got: {result}')
print(f'Bug confirmed: {result == True}')
"

Output:

tcp://[2001:db8::1:0]:8080 -> True
Expected: False, Got: True
Bug confirmed: True

Why this proves the bug: The output shows that an IPv6 address with port 8080 (not a wildcard) is incorrectly identified as a wildcard address because :0 appears in the IPv6 portion of the address.

Fix in Cursor Fix in Web

"front_publish_address": front_publish_address,
"back_output_address": back_output_address,
"back_publish_address": back_publish_address,
"address_report_pipe": child_conn, # For late binding

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File descriptor leak from unclosed child pipe connections

Medium Severity

When creating pipes for late binding address reporting, child_conn is passed to the subprocess but never closed in the parent process after proc.start(). In Python's multiprocessing.Pipe(), both ends are open in the parent; after the subprocess is spawned with child_conn, the parent retains its own file descriptor for child_conn. Only parent_conn is closed. This leaks one file descriptor per DPCoordinator instance and one per API server started via APIServerProcessManager. In long-running deployments or with many restarts, this can exhaust the file descriptor limit.

🔬 Verification Test

Test code:

import multiprocessing
import os

def count_fds():
    """Count open file descriptors for current process"""
    return len(os.listdir(f'/proc/{os.getpid()}/fd'))

def child_func(pipe):
    pipe.send("hello")
    pipe.close()

# Simulate the leak pattern from the code
initial_fds = count_fds()
print(f"Initial FDs: {initial_fds}")

for i in range(10):
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=child_func, args=(child_conn,))
    proc.start()
    
    # Parent receives and closes parent_conn (like the code does)
    parent_conn.recv()
    parent_conn.close()
    proc.join()
    
    # BUG: child_conn is never closed in parent!
    # child_conn.close()  # This line is missing

leaked_fds = count_fds()
print(f"FDs after 10 iterations: {leaked_fds}")
print(f"Leaked FDs: {leaked_fds - initial_fds}")

Command run:

python3 test_fd_leak.py

Output:

Initial FDs: 4
FDs after 10 iterations: 14
Leaked FDs: 10

Why this proves the bug: Each iteration leaks one file descriptor because child_conn is not closed in the parent after the subprocess is started. After 10 iterations, 10 file descriptors are leaked.

Additional Locations (1)

Fix in Cursor Fix in Web

@mergify mergify Bot added the bug Something isn't working label Jan 13, 2026
@haosdent

haosdent commented Feb 2, 2026

Copy link
Copy Markdown
Contributor

Thanks @njhill, this approach looks much better!. Do you think we need to add unit tests to cover this change?

c7ozkan added a commit to c7ozkan/vllm that referenced this pull request Mar 26, 2026
Add test coverage for the wildcard port / late-binding utilities
introduced to fix the ZMQ port-conflict race condition (vllm-project#28498):

- `is_wildcard_addr` — parametrized true/false cases
- `make_zmq_socket(..., return_address=True)` — verifies the tuple return
  type, that wildcard port 0 resolves to a concrete OS-assigned port,
  that a fixed port echoes the input address, and backward-compat
  (return_address=False still returns a bare socket)
- `bind_zmq_socket_and_get_address` — end-to-end connectivity check
  (PUSH/PULL round-trip over the resolved address)
- `test_wildcard_binds_produce_unique_ports` — 10 sequential wildcard
  binds all receive distinct ports
- `test_wildcard_binds_unique_ports_concurrent` — 20 concurrent threads
  each bind to port 0; asserts no port collision, directly exercising
  the race condition that caused vllm-project#28498

Implementation changes are from PR vllm-project#30520 (credit: @njhill).

Signed-off-by: Canberk Özkan <canberk.ozkan@ozu.edu.tr>
@Debonex

Debonex commented Apr 7, 2026

Copy link
Copy Markdown

Hi there, thanks for this PR. I wanted to check in on the status of this change. Is there a plan to merge it?


# Parse the endpoint to extract host and port
# Handle both IPv4 and IPv6 formats
scheme, host, port_str = split_zmq_path(actual_endpoint)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split_zmq_path 当port是0的时候,解析出来的port是“”,导致报错。port = str(parsed.port or "")改为port = str(parsed.port)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like this :
if parsed.port == 0:
port = "0"
else:
port = str(parsed.port or "")

@mergify

mergify Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working needs-rebase v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants