Merged
5 changes: 5 additions & 0 deletions vllm/v1/engine/utils.py
@@ -403,6 +403,11 @@ def __init__(
             range(dp_size), local_dp_ranks, placement_groups
         ):
             dp_vllm_config = copy.deepcopy(vllm_config)
+            if dp_size > 1:
+                # Append the DP rank to instance_id so that per-engine
+                # identifiers (e.g. Ray actor names in RayExecutorV2) are
+                # unique across DP replicas.
+                dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}"
Comment on lines +406 to +410
Contributor


high

The fix correctly addresses the actor name collision for the initial Ray DP startup. However, the same logic appears to be missing in two other critical locations where per-engine configurations are initialized:

  1. Elastic EP Scale-up: In CoreEngineActorManager.scale_up_elastic_ep (around line 766), new engines are launched but their instance_id is not updated with the new DP rank. The kv_transfer_config.engine_id update (present in __init__ at line 399) is missing here as well. This will cause name collisions and incorrect behavior when scaling up a cluster that uses RayExecutorV2 or KV transfer.
  2. Multiprocessing DP Path: In vllm/v1/engine/core.py::run_engine_core (around line 1083), the vllm_config is modified for kv_transfer_config, but instance_id is not updated. If data_parallel_backend="mp" is used in conjunction with RayExecutorV2, collisions will occur.
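
To illustrate why the suffix matters in each of these paths, here is a minimal, self-contained sketch. `ToyConfig` and `actor_name` are hypothetical stand-ins, not vLLM or RayExecutorV2 code, and the actor-name format is an assumption chosen only to show a name derived from instance_id.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical stand-in for VllmConfig; only the field discussed here.
    instance_id: str


def actor_name(cfg: ToyConfig, worker_rank: int) -> str:
    # Assumed naming scheme for illustration; the real format may differ.
    return f"worker_{cfg.instance_id}_{worker_rank}"


def launch_names(base: ToyConfig, dp_size: int, add_suffix: bool) -> list[str]:
    names = []
    for index in range(dp_size):
        cfg = copy.deepcopy(base)
        if add_suffix and dp_size > 1:
            # Same pattern as the diff: append the DP rank to instance_id.
            cfg.instance_id = f"{cfg.instance_id}_dp{index}"
        names.append(actor_name(cfg, 0))
    return names


without_fix = launch_names(ToyConfig("abc12"), dp_size=2, add_suffix=False)
with_fix = launch_names(ToyConfig("abc12"), dp_size=2, add_suffix=True)
```

Without the suffix, both replicas produce the same name for worker 0; with it, the names differ per DP rank. Any path that builds a per-engine config without the suffix would hit the first case.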

To ensure full coverage and consistency, please apply the instance_id update in these locations as well, using the global DP rank (rank in scale_up_elastic_ep and dp_rank in run_engine_core). Please also add the missing kv_transfer_config.engine_id update in scale_up_elastic_ep.
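
One possible way to keep the three code paths consistent is a single shared helper. The sketch below is only a suggestion: `make_dp_rank_config` and the `Toy*` classes are hypothetical and do not exist in vLLM, and the engine_id suffix format and the `dp_size > 1` gating for it are assumptions that mirror the instance_id change in this PR.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyKVTransferConfig:
    # Hypothetical stand-in for the KV transfer config.
    engine_id: str


@dataclass
class ToyVllmConfig:
    # Hypothetical stand-in for VllmConfig with only the relevant fields.
    instance_id: str
    kv_transfer_config: Optional[ToyKVTransferConfig] = None


def make_dp_rank_config(
    vllm_config: ToyVllmConfig, dp_rank: int, dp_size: int
) -> ToyVllmConfig:
    """Return a per-engine copy whose identifiers are unique per DP rank."""
    cfg = copy.deepcopy(vllm_config)
    if dp_size > 1:
        cfg.instance_id = f"{cfg.instance_id}_dp{dp_rank}"
        if cfg.kv_transfer_config is not None:
            # Assumed suffix format, chosen to match the instance_id one.
            kv = cfg.kv_transfer_config
            kv.engine_id = f"{kv.engine_id}_dp{dp_rank}"
    return cfg


# Example: scaling from 2 to 4 replicas, passing the global DP rank.
base = ToyVllmConfig("abc12", ToyKVTransferConfig("kv0"))
scaled = [make_dp_rank_config(base, rank, dp_size=4) for rank in range(2, 4)]
```

Because the helper deep-copies its input, the base config is left untouched and each call site (initial startup, elastic scale-up, multiprocessing DP) only needs to pass its global DP rank.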

             dp_vllm_config.parallel_config.placement_group = pg
             local_client = index < local_engine_count
