[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 #40398
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Code Review
This pull request updates the engine configuration in `vllm/v1/engine/utils.py` to append the DP rank to the `instance_id`, ensuring unique identifiers for Ray actors across data-parallel replicas. While this addresses the initial DP startup path, the review below notes that the same logic is missing in the elastic EP scale-up path and the multiprocessing DP path, which could still result in naming collisions or incorrect KV transfer behavior.
```python
if dp_size > 1:
    # Append the DP rank to instance_id so that per-engine
    # identifiers (e.g. Ray actor names in RayExecutorV2) are
    # unique across DP replicas.
    dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}"
```
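For readers outside the codebase, here is a minimal, self-contained sketch of what the hunk above does, using a stubbed config class in place of the real `VllmConfig`; the `vllm_Worker_{instance_id}` name format is taken from the PR description.

```python
import copy
from dataclasses import dataclass


@dataclass
class FakeVllmConfig:
    # Stand-in for vLLM's VllmConfig; only the field relevant here.
    instance_id: str = "abc123"


vllm_config = FakeVllmConfig()
dp_size = 2

actor_names = []
for index in range(dp_size):
    dp_vllm_config = copy.deepcopy(vllm_config)
    if dp_size > 1:
        # Same suffixing as the hunk above: one instance_id per DP replica.
        dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}"
    actor_names.append(f"vllm_Worker_{dp_vllm_config.instance_id}")

# Without the suffix every replica would request the same Ray actor name;
# with it, each replica gets a distinct one.
assert len(set(actor_names)) == dp_size
```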
The fix correctly addresses the actor name collision for the initial Ray DP startup. However, the same logic appears to be missing in two other critical locations where per-engine configurations are initialized:

- Elastic EP Scale-up: In `CoreEngineActorManager.scale_up_elastic_ep` (around line 766), new engines are launched but their `instance_id` is not updated with the new DP rank. Additionally, the `kv_transfer_config.engine_id` update (present in `__init__` at line 399) is also missing here. This will cause collisions and incorrect behavior when scaling up a cluster using `RayExecutorV2` or KV transfer.
- Multiprocessing DP Path: In `vllm/v1/engine/core.py::run_engine_core` (around line 1083), the `vllm_config` is modified for `kv_transfer_config`, but `instance_id` is not updated. If `data_parallel_backend="mp"` is used in conjunction with `RayExecutorV2`, collisions will occur.

To ensure full coverage and consistency, please apply the `instance_id` update in these locations as well, using the global DP rank (`rank` and `dp_rank`, respectively). You should also fix the missing `kv_transfer_config` update in `scale_up_elastic_ep`.
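A rough sketch of what the suggested follow-up could look like, expressed as a small helper that mirrors the merged hunk; the helper name and its call sites are assumptions rather than actual vLLM code, and the `kv_transfer_config.engine_id` part is deliberately left out since its exact form in `__init__` is not quoted here.

```python
def _suffix_instance_id_for_dp(vllm_config, dp_rank: int, dp_size: int) -> None:
    """Hypothetical helper (not vLLM source): mirror the merged hunk so
    per-engine identifiers stay unique across DP replicas."""
    if dp_size > 1:
        vllm_config.instance_id = f"{vllm_config.instance_id}_dp{dp_rank}"
```

Per the review, this would be applied to the new engines in `scale_up_elastic_ep` (using the global DP rank `rank`) and to the `vllm_config` in `run_engine_core` (using `dp_rank`).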
Thanks for the fix!
Summary
`RayExecutorV2` names its TP worker actors as `vllm_Worker_{instance_id}[_TP{n}]` (see `vllm/v1/executor/ray_utils.py::build_actor_name`). When `data_parallel_size > 1`, `CoreEngineActorManager.__init__` produces each DP engine's `VllmConfig` via `copy.deepcopy(vllm_config)`, which preserves the original `instance_id` across all DP replicas. With the same `instance_id`, every DP engine attempts to create Ray actors with the same names, and all but the first crash with a name-collision error.

The fix appends the DP rank to `instance_id` in each per-engine config copy, matching the existing precedent that does the same for `kv_transfer_config.engine_id` in the same function. It is gated on `dp_size > 1` so single-DP deployments are unaffected.
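To make the crash mode concrete, here is a minimal Ray-only sketch (not vLLM code) of what happens when two actors request the same name; the actor class and the name used here are illustrative.

```python
import ray

ray.init()


@ray.remote
class Worker:
    def ping(self) -> str:
        return "pong"


# First DP replica registers its named actor successfully.
first = Worker.options(name="vllm_Worker_abc123").remote()

# A second replica that reuses the same instance_id asks Ray for the same
# actor name; Ray rejects the duplicate with a ValueError.
try:
    second = Worker.options(name="vllm_Worker_abc123").remote()
except ValueError as err:
    print(f"name collision: {err}")
```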
The bug only reproduces when `VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1` (added in #36836); the legacy `RayDistributedExecutor` doesn't use named Ray actors and is unaffected.

Test plan
AI assistance disclosure
AI assistance was used to audit `instance_id` usage across the codebase and draft the patch.