[Bugfix] Fix message queue initialization order for cross-node DP#36001
[Bugfix] Fix message queue initialization order for cross-node DP#36001jianzs wants to merge 1 commit intovllm-project:mainfrom
Conversation
When nnodes_within_dp > 1 (cross-node data parallelism), the message queue initialization calls get_inner_dp_world_group() which requires _INNER_DP_WORLD to be initialized. However, _INNER_DP_WORLD is only set during initialize_model_parallel() which is called from worker.init_device(). Previously, _init_message_queues() was called BEFORE init_device(), causing an "inner dp world group is not initialized" error for any cross-node DP configuration. This fix moves _init_message_queues() to AFTER init_device() and load_model() to ensure parallel groups are initialized before the message queue setup. Fixes: Cross-node DP scenarios where nnodes_within_dp > 1 Signed-off-by: 沧濯 <zhengshoujian.zsj@antgroup.com>
There was a problem hiding this comment.
Code Review
This pull request addresses an initialization order bug for cross-node data parallelism by moving the _init_message_queues() call, ensuring that required distributed process groups are initialized before use and resolving the AssertionError. While the fix itself is logically sound, enabling this functionality exposes a critical security vulnerability in the distributed communication layer. The use of unauthenticated ZMQ sockets combined with the pickle serialization format allows for potential data leakage and Remote Code Execution (RCE) across the network, especially in multi-node deployments. The underlying architecture requires urgent security hardening.
| # Initialize message queues after parallel groups are initialized. | ||
| # This is required for cross-node DP (nnodes_within_dp > 1) which | ||
| # needs get_inner_dp_world_group() to be available. | ||
| self._init_message_queues(input_shm_handle, vllm_config) |
There was a problem hiding this comment.
The initialization of message queues via _init_message_queues (now correctly ordered for cross-node configurations) introduces a significant security risk when used in multi-node environments. The underlying MessageQueue implementation (in vllm/distributed/device_communicators/shm_broadcast.py) uses unauthenticated ZMQ sockets and the pickle serialization format.
In cross-node setups (where nnodes_within_dp > 1), the driver binds to a network-accessible IP (obtained via get_ip()), allowing any attacker on the network to eavesdrop on sensitive data (such as prompts and generated tokens) or achieve Remote Code Execution (RCE) on worker nodes by sending malicious pickled objects to the ZMQ ports.
While vLLM defines a VLLM_ALLOW_INSECURE_SERIALIZATION environment variable, it is not currently enforced in the MessageQueue component, which uses pickle unconditionally. It is highly recommended to implement ZMQ authentication (e.g., using the CURVE security mechanism) and transition to a safer serialization format (e.g., msgpack or protobuf) for distributed communication.
Summary
Fixes a bug where
_init_message_queues()was called beforeworker.init_device(), causing failures whennnodes_within_dp > 1.Problem
When
nnodes_within_dp > 1(cross-node DP group), the message queue initialization requiresget_inner_dp_world_group()which is only available after parallel groups are initialized duringinit_device(). The incorrect order caused:Root Cause
Dependency chain:
Fix
Move
_init_message_queues()call to afterworker.init_device()andworker.load_model().Affected Configurations
This bug affects ANY configuration where
nnodes_within_dp > 1, including:Test Plan
a60985b07