
[Bugfix] Fix message queue initialization order for cross-node DP #36001

Closed

jianzs wants to merge 1 commit into vllm-project:main from jianzs:bugfix/message-queue-init-order

Conversation

@jianzs (Contributor) commented Mar 4, 2026

Summary

Fixes a bug where _init_message_queues() was called before worker.init_device(), causing failures when nnodes_within_dp > 1.

Problem

When nnodes_within_dp > 1 (cross-node DP group), message queue initialization requires get_inner_dp_world_group(), which is only available after parallel groups are initialized in init_device(). The incorrect order caused:

AssertionError: inner dp world group is not initialized

Root Cause

Dependency chain:

_init_message_queues() (Branch B)
    └── requires get_inner_dp_world_group()
            └── requires _INNER_DP_WORLD variable
                    └── initialized in initialize_model_parallel()
                            └── called in worker.init_device()

Fix

Move the _init_message_queues() call to after worker.init_device() and worker.load_model().
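Conceptually, the reordering looks like the sketch below. The `Worker` class and `setup_worker` helper are invented for illustration; only the call order mirrors the actual fix:

```python
# Hypothetical sketch of the reordered startup sequence. The class
# and helper are invented; only the ordering reflects the fix.

class Worker:
    def __init__(self):
        self.groups_ready = False

    def init_device(self):
        # initialize_model_parallel() runs here, creating the
        # inner-DP world group.
        self.groups_ready = True

    def load_model(self):
        assert self.groups_ready

    def _init_message_queues(self):
        # Needs get_inner_dp_world_group(); fails when called before
        # init_device() if nnodes_within_dp > 1.
        assert self.groups_ready, \
            "inner dp world group is not initialized"


# Old order: message queues first -> AssertionError for cross-node DP.
broken = Worker()
try:
    broken._init_message_queues()
except AssertionError as e:
    print(e)


def setup_worker(worker):
    # Fixed order: device and model first, message queues last.
    worker.init_device()
    worker.load_model()
    worker._init_message_queues()
    return worker


w = setup_worker(Worker())
```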

Affected Configurations

This bug affects ANY configuration where nnodes_within_dp > 1, including:

  • Uniform cross-node DP (e.g., PP=4, TP=2, DP=1 across 2 nodes)
  • Non-uniform topology scenarios

Test Plan

  • E2E-20: Uniform cross-node PP=4, TP=2, DP=1 (nnodes_within_dp=2) - PASSES with fix
  • Verified same test FAILS on main branch commit a60985b07

When nnodes_within_dp > 1 (cross-node data parallelism), the message
queue initialization calls get_inner_dp_world_group() which requires
_INNER_DP_WORLD to be initialized. However, _INNER_DP_WORLD is only
set during initialize_model_parallel() which is called from
worker.init_device().

Previously, _init_message_queues() was called BEFORE init_device(),
causing an "inner dp world group is not initialized" error for any
cross-node DP configuration.

This fix moves _init_message_queues() to AFTER init_device() and
load_model() to ensure parallel groups are initialized before the
message queue setup.

Fixes: Cross-node DP scenarios where nnodes_within_dp > 1
Signed-off-by: 沧濯 <zhengshoujian.zsj@antgroup.com>
@jianzs jianzs requested a review from njhill as a code owner March 4, 2026 11:50
@mergify mergify bot added the v1 and bug (Something isn't working) labels Mar 4, 2026
@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request addresses an initialization order bug for cross-node data parallelism by moving the _init_message_queues() call, ensuring that required distributed process groups are initialized before use and resolving the AssertionError. While the fix itself is logically sound, enabling this functionality exposes a critical security vulnerability in the distributed communication layer. The use of unauthenticated ZMQ sockets combined with the pickle serialization format allows for potential data leakage and Remote Code Execution (RCE) across the network, especially in multi-node deployments. The underlying architecture requires urgent security hardening.

# Initialize message queues after parallel groups are initialized.
# This is required for cross-node DP (nnodes_within_dp > 1) which
# needs get_inner_dp_world_group() to be available.
self._init_message_queues(input_shm_handle, vllm_config)
security-critical (severity: critical)

The initialization of message queues via _init_message_queues (now correctly ordered for cross-node configurations) introduces a significant security risk when used in multi-node environments. The underlying MessageQueue implementation (in vllm/distributed/device_communicators/shm_broadcast.py) uses unauthenticated ZMQ sockets and the pickle serialization format.

In cross-node setups (where nnodes_within_dp > 1), the driver binds to a network-accessible IP (obtained via get_ip()), allowing any attacker on the network to eavesdrop on sensitive data (such as prompts and generated tokens) or achieve Remote Code Execution (RCE) on worker nodes by sending malicious pickled objects to the ZMQ ports.

While vLLM defines a VLLM_ALLOW_INSECURE_SERIALIZATION environment variable, it is not currently enforced in the MessageQueue component, which uses pickle unconditionally. It is highly recommended to implement ZMQ authentication (e.g., using the CURVE security mechanism) and transition to a safer serialization format (e.g., msgpack or protobuf) for distributed communication.
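To make the pickle concern concrete, here is a minimal stdlib-only demonstration (illustrative, not vLLM code) of why deserializing untrusted pickle bytes amounts to remote code execution: pickle invokes `__reduce__` during load, so an attacker-crafted payload executes an arbitrary call on the receiving side.

```python
# Stdlib-only demonstration of pickle's code-execution-on-load risk.
# This stands in for a malicious payload sent to an unauthenticated
# ZMQ socket; it is illustrative only.

import pickle


class Malicious:
    def __reduce__(self):
        # On unpickle, this tuple means "call eval(...)". A real
        # attacker would run an os command instead.
        return (eval, ("__import__('os').getpid()",))


payload = pickle.dumps(Malicious())

# The victim side: pickle.loads runs the attacker-chosen call and
# returns its result, not a Malicious instance.
result = pickle.loads(payload)
print(type(result))
```

This is why the recommendation above is to authenticate the transport (e.g. ZMQ CURVE) and move to a data-only format such as msgpack or protobuf, which cannot express "call this function" during decoding.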

@njhill (Member) commented Mar 4, 2026

Thanks @jianzs, I think this is a duplicate of #35892.

@jianzs (Contributor, Author) commented Mar 5, 2026

> Thanks @jianzs, I think this is a duplicate of #35892.

Got it.

@jianzs jianzs closed this Mar 5, 2026

Labels

bug (Something isn't working), v1