
[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 #40398

Merged
vllm-bot merged 2 commits into vllm-project:main from tomeras91:fix-ray-v2-dp-instance-id-collision
May 3, 2026


Member

@tomeras91 tomeras91 commented Apr 20, 2026

Summary

  • RayExecutorV2 names its TP worker actors as vllm_Worker_{instance_id}[_TP{n}] (see vllm/v1/executor/ray_utils.py::build_actor_name). When data_parallel_size > 1, CoreEngineActorManager.__init__ produces each DP engine's VllmConfig via copy.deepcopy(vllm_config), which preserves the original instance_id across all DP replicas.
  • With a single shared instance_id, every DP engine attempts to create Ray actors with the same names and all but the first crash with:
    ray.exceptions.ActorAlreadyExistsError: Actor with name
    'vllm_Worker_<id>_TP0' already exists in the namespace ...
    
  • Fix: append the global DP rank to instance_id in each per-engine config copy, matching the existing precedent that does the same for kv_transfer_config.engine_id in the same function. Gated on dp_size > 1 so single-DP deployments are unaffected.

The bug reproduces only when VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1 is set (the flag was added in #36836); the legacy RayDistributedExecutor doesn't use named Ray actors and is unaffected.
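The collision and the fix described in the summary can be sketched without Ray or vLLM. This is a minimal illustration, not the actual vLLM code: `ToyConfig`, `toy_actor_name`, and `per_engine_configs` are hypothetical stand-ins, and the name format is assumed from the `vllm_Worker_{instance_id}[_TP{n}]` pattern quoted above.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyConfig:
    # Hypothetical stand-in for VllmConfig; only the field that matters here.
    instance_id: str


def toy_actor_name(instance_id: str, tp_rank: Optional[int] = None) -> str:
    # Assumed to mirror the vllm_Worker_{instance_id}[_TP{n}] pattern.
    name = f"vllm_Worker_{instance_id}"
    if tp_rank is not None:
        name += f"_TP{tp_rank}"
    return name


def per_engine_configs(base: ToyConfig, dp_size: int, fix: bool) -> List[ToyConfig]:
    configs = []
    for index in range(dp_size):
        # deepcopy preserves instance_id, so every replica starts identical.
        cfg = copy.deepcopy(base)
        if fix and dp_size > 1:
            # The patch: make the identifier unique per DP rank.
            cfg.instance_id = f"{cfg.instance_id}_dp{index}"
        configs.append(cfg)
    return configs


def actor_names(configs: List[ToyConfig], tp_size: int) -> List[str]:
    return [toy_actor_name(c.instance_id, tp) for c in configs for tp in range(tp_size)]


base = ToyConfig(instance_id="abc12")
before = actor_names(per_engine_configs(base, dp_size=2, fix=False), tp_size=2)
after = actor_names(per_engine_configs(base, dp_size=2, fix=True), tp_size=2)

# Without the suffix, 4 actors share only 2 distinct names (a collision).
print(len(before), len(set(before)))
# With the suffix, all 4 names are distinct.
print(len(after), len(set(after)))
```

In a real deployment, the duplicate names in `before` are what would surface as `ActorAlreadyExistsError` when the second DP engine tries to register its workers.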

Test plan

  • Reproduced on Nemotron-Super NVFP4, TP=2, DP=32, 16-node GB200 cluster with `VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1` + `VLLM_RAY_DP_PACK_STRATEGY=strict` — server previously crashed during actor creation with `ActorAlreadyExistsError`.
  • With this patch, all 32 DP engines start and the server serves requests normally.
  • No behavior change when DP=1 (guarded by `if dp_size > 1`).
  • Existing unit tests in `tests/distributed/test_ray_v2_executor*.py` still pass.

AI assistance disclosure

AI assistance was used to audit `instance_id` usage across the codebase and draft the patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@tomeras91 tomeras91 requested a review from njhill as a code owner April 20, 2026 19:58

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 and bug labels Apr 20, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the engine configuration in vllm/v1/engine/utils.py to append the DP rank to the instance_id, ensuring unique identifiers for Ray actors across data-parallel replicas. While this addresses the initial startup, feedback indicates that similar logic is missing in the elastic EP scale-up path and the multiprocessing DP path, which could still result in naming collisions or incorrect KV transfer behavior.

Comment thread vllm/v1/engine/utils.py
Comment on lines +391 to +395
    if dp_size > 1:
        # Append the DP rank to instance_id so that per-engine
        # identifiers (e.g. Ray actor names in RayExecutorV2) are
        # unique across DP replicas.
        dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}"
Contributor


high

The fix correctly addresses the actor name collision for the initial Ray DP startup. However, the same logic appears to be missing in two other critical locations where per-engine configurations are initialized:

  1. Elastic EP Scale-up: In CoreEngineActorManager.scale_up_elastic_ep (around line 766), new engines are launched but their instance_id is not updated with the new DP rank. Additionally, the kv_transfer_config.engine_id update (present in __init__ at line 399) is also missing here. This will cause collisions and incorrect behavior when scaling up a cluster using RayExecutorV2 or KV transfer.
  2. Multiprocessing DP Path: In vllm/v1/engine/core.py::run_engine_core (around line 1083), the vllm_config is modified for kv_transfer_config, but instance_id is not updated. If data_parallel_backend="mp" is used in conjunction with RayExecutorV2, collisions will occur.

To ensure full coverage and consistency, please apply the instance_id update in these locations as well, using the global DP rank (rank and dp_rank respectively). You should also fix the missing kv_transfer_config update in scale_up_elastic_ep.
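One way to keep the three code paths named in this review consistent would be a single shared helper that each of them calls. The sketch below is only an illustration of that idea: `ToyVllmConfig`, `ToyKVTransferConfig`, and `make_ids_unique_for_dp_rank` are hypothetical names, and reusing the `_dp{rank}` suffix for `engine_id` is an assumption modeled on the `instance_id` line in the diff, not a confirmed detail of the existing code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyKVTransferConfig:
    # Hypothetical stand-in for the KV transfer config.
    engine_id: str


@dataclass
class ToyVllmConfig:
    # Hypothetical stand-in for VllmConfig.
    instance_id: str
    kv_transfer_config: Optional[ToyKVTransferConfig] = None


def make_ids_unique_for_dp_rank(
    config: ToyVllmConfig, dp_rank: int, dp_size: int
) -> ToyVllmConfig:
    """Suffix per-engine identifiers with the global DP rank (mutates config)."""
    if dp_size <= 1:
        # Single-DP deployments keep their original identifiers.
        return config
    suffix = f"_dp{dp_rank}"  # assumed suffix format
    config.instance_id = f"{config.instance_id}{suffix}"
    if config.kv_transfer_config is not None:
        kv = config.kv_transfer_config
        kv.engine_id = f"{kv.engine_id}{suffix}"
    return config


cfg = ToyVllmConfig("abc12", ToyKVTransferConfig("kv0"))
make_ids_unique_for_dp_rank(cfg, dp_rank=3, dp_size=4)
print(cfg.instance_id, cfg.kv_transfer_config.engine_id)
```

The helper is not idempotent: calling it twice on the same config appends the suffix twice, so each path would need to apply it exactly once to a fresh copy.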

@jeffreywang-anyscale
Contributor

Thanks for the fix!

@tomeras91 tomeras91 added the ready label May 3, 2026
@tomeras91 tomeras91 enabled auto-merge (squash) May 3, 2026 09:50
@vllm-bot vllm-bot merged commit cb03fee into vllm-project:main May 3, 2026
52 of 54 checks passed
@tomeras91 tomeras91 deleted the fix-ray-v2-dp-instance-id-collision branch May 3, 2026 20:22
joa-stdn pushed a commit to joa-stdn/vllm that referenced this pull request May 4, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request May 7, 2026
…m-project#40398)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>