Fix DP coordinator ZMQ port TOCTOU#37452
Conversation
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
There was a problem hiding this comment.
Code Review
This pull request addresses a TOCTOU (Time-of-Check Time-of-Use) vulnerability in the DP coordinator's ZMQ port selection. Previously, the parent process selected the ports before the coordinator bound them, creating a window for other processes to claim those ports. This is fixed by having the coordinator bind the ports first and then report the bound addresses back to the parent via a pipe. The changes involve modifications to network_utils.py and coordinator.py to implement this new mechanism.
| ready = multiprocessing.connection.wait( | ||
| [zmq_addr_pipe, self.proc.sentinel], timeout=30 | ||
| ) | ||
| if not ready: | ||
| raise RuntimeError( | ||
| "DP Coordinator process failed to report ZMQ addresses " | ||
| "during startup." | ||
| ) | ||
| try: | ||
| return zmq_addr_pipe.recv() | ||
| except EOFError: | ||
| raise RuntimeError( | ||
| "DP Coordinator process failed during startup." | ||
| ) from None |
There was a problem hiding this comment.
The _wait_for_zmq_addrs method includes a timeout of 30 seconds. If the DP Coordinator process fails to report ZMQ addresses within this time, a RuntimeError is raised. However, there's no mechanism to handle or retry this failure. Consider adding a retry mechanism or a more robust error handling strategy to improve the resilience of the system. This is a critical issue because a failure here will prevent the engine from starting up.
There was a problem hiding this comment.
This is fatal IMO, if the DP Coordinator cannot report ZMQ addresses within 30 seconds it is reasonable to fail
There was a problem hiding this comment.
@itayalroy @tlrmchlsmth can we make the timeout configurable? The current 30s limit can be too short, for example when spawn is forced, the child process will re-import many modules
There was a problem hiding this comment.
same issue. from vllm.v1.engine import coordinator takes 70+ seconds to import.
| child_zmq_addr_pipe.close() | ||
| ( | ||
| front_publish_address, | ||
| back_output_address, | ||
| back_publish_address, | ||
| ) = self._wait_for_zmq_addrs(parent_zmq_addr_pipe) |
There was a problem hiding this comment.
After starting the coordinator process, the parent process retrieves the bound ZMQ addresses using self._wait_for_zmq_addrs. However, if self._wait_for_zmq_addrs fails, the addresses used to initialize self.stats_publish_address and self.coord_in_address will be the original, unbound addresses. This could lead to the parent process attempting to communicate with the coordinator on the wrong ports. This is a critical issue because it can lead to communication failures between the parent and coordinator processes.
There was a problem hiding this comment.
No. If _wait_for_zmq_addrs() fails, we raise an exception, so we never proceed using wrong ports.
| def bind_address(local_only: bool) -> str: | ||
| return ( | ||
| get_engine_client_zmq_addr(local_only=True, host=host) | ||
| if local_only | ||
| else get_tcp_uri(host, 0) | ||
| ) |
There was a problem hiding this comment.
The bind_address function uses get_engine_client_zmq_addr when local_only is true, which returns an IPC path. However, when local_only is false, it uses get_tcp_uri with port 0, which requests the OS to assign a port. This inconsistency in address types (IPC vs. TCP) could lead to unexpected behavior or configuration issues. Ensure that the address type is consistent based on the deployment environment or configuration. This is a high severity issue because it can lead to connectivity problems.
There was a problem hiding this comment.
The inconsistency in address types (IPC/TCP) already exists, the only change is that with TCP we now let the OS assign the port on bind time instead of binding to a pre-chosen port that might be already taken
tlrmchlsmth
left a comment
There was a problem hiding this comment.
Looks good to me, @njhill if you can take a look, it'd be good to get another pair of eyes.
@itayalroy do you think there's a reasonable way to unit test this?
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Replace `get_open_port()` with late binding (port 0) for the remote XPUB socket in `MessageQueue.__init__`, then read back the actual bound address via `zmq.LAST_ENDPOINT`. This eliminates the window between port discovery and socket bind where another process could claim the port. Follows the same pattern already used in the DP coordinator (PR vllm-project#37452). Closes vllm-project#28498 Signed-off-by: RTCartist <wangshengb@buaa.edu.cn>
Signed-off-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Previously the parent selected the DP coordinator's TCP ZMQ ports with
get_open_port()before the coordinator actually bound them, leaving awindow where another socket could claim the ports.
Fix this by letting the coordinator bind first and report the bound ZMQ
addresses back to the parent via pipe.