[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 #40398
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Code Review
This pull request updates the engine configuration in `vllm/v1/engine/utils.py` to append the DP rank to the `instance_id`, ensuring unique identifiers for Ray actors across data-parallel replicas. While this addresses the initial DP startup path, the review below notes that the same logic is missing in the elastic EP scale-up path and the multiprocessing DP path, which could still result in naming collisions or incorrect KV transfer behavior.
```python
if dp_size > 1:
    # Append the DP rank to instance_id so that per-engine
    # identifiers (e.g. Ray actor names in RayExecutorV2) are
    # unique across DP replicas.
    dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}"
```
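For readers outside the codebase, here is a minimal, self-contained sketch of what the hunk above does, using a stubbed config class in place of the real `VllmConfig`; the `vllm_Worker_{instance_id}` name format is taken from the PR description.

```python
import copy
from dataclasses import dataclass


@dataclass
class FakeVllmConfig:
    # Stand-in for vLLM's VllmConfig; only the field relevant here.
    instance_id: str = "abc123"


vllm_config = FakeVllmConfig()
dp_size = 2

actor_names = []
for index in range(dp_size):
    dp_vllm_config = copy.deepcopy(vllm_config)
    if dp_size > 1:
        # Same suffixing as the hunk above: one instance_id per DP replica.
        dp_vllm_config.instance_id = f"{dp_vllm_config.instance_id}_dp{index}"
    actor_names.append(f"vllm_Worker_{dp_vllm_config.instance_id}")

# Without the suffix every replica would request the same Ray actor name;
# with it, each replica gets a distinct one.
assert len(set(actor_names)) == dp_size
```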
The fix correctly addresses the actor name collision for the initial Ray DP startup. However, the same logic appears to be missing in two other critical locations where per-engine configurations are initialized:

- Elastic EP Scale-up: In `CoreEngineActorManager.scale_up_elastic_ep` (around line 766), new engines are launched but their `instance_id` is not updated with the new DP rank. Additionally, the `kv_transfer_config.engine_id` update (present in `__init__` at line 399) is also missing here. This will cause collisions and incorrect behavior when scaling up a cluster using `RayExecutorV2` or KV transfer.
- Multiprocessing DP Path: In `vllm/v1/engine/core.py::run_engine_core` (around line 1083), the `vllm_config` is modified for `kv_transfer_config`, but `instance_id` is not updated. If `data_parallel_backend="mp"` is used in conjunction with `RayExecutorV2`, collisions will occur.

To ensure full coverage and consistency, please apply the `instance_id` update in these locations as well, using the global DP rank (`rank` and `dp_rank`, respectively). You should also fix the missing `kv_transfer_config` update in `scale_up_elastic_ep`.
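A rough sketch of what the suggested follow-up could look like, expressed as a small helper that mirrors the merged hunk; the helper name and its call sites are assumptions rather than actual vLLM code, and the `kv_transfer_config.engine_id` part is deliberately left out since its exact form in `__init__` is not quoted here.

```python
def _suffix_instance_id_for_dp(vllm_config, dp_rank: int, dp_size: int) -> None:
    """Hypothetical helper (not vLLM source): mirror the merged hunk so
    per-engine identifiers stay unique across DP replicas."""
    if dp_size > 1:
        vllm_config.instance_id = f"{vllm_config.instance_id}_dp{dp_rank}"
```

Per the review, this would be applied to the new engines in `scale_up_elastic_ep` (using the global DP rank `rank`) and to the `vllm_config` in `run_engine_core` (using `dp_rank`).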
Thanks for the fix!
Summary
`RayExecutorV2` names its TP worker actors as `vllm_Worker_{instance_id}[_TP{n}]` (see `vllm/v1/executor/ray_utils.py::build_actor_name`). When `data_parallel_size > 1`, `CoreEngineActorManager.__init__` produces each DP engine's `VllmConfig` via `copy.deepcopy(vllm_config)`, which preserves the original `instance_id` across all DP replicas. With the same `instance_id`, every DP engine attempts to create Ray actors with the same names, and all but the first crash with a name-collision error.

The fix appends the DP rank to `instance_id` in each per-engine config copy, matching the existing precedent that does the same for `kv_transfer_config.engine_id` in the same function. It is gated on `dp_size > 1` so single-DP deployments are unaffected.
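To make the crash mode concrete, here is a minimal Ray-only sketch (not vLLM code) of what happens when two actors request the same name; the actor class and the name used here are illustrative.

```python
import ray

ray.init()


@ray.remote
class Worker:
    def ping(self) -> str:
        return "pong"


# First DP replica registers its named actor successfully.
first = Worker.options(name="vllm_Worker_abc123").remote()

# A second replica that reuses the same instance_id asks Ray for the same
# actor name; Ray rejects the duplicate with a ValueError.
try:
    second = Worker.options(name="vllm_Worker_abc123").remote()
except ValueError as err:
    print(f"name collision: {err}")
```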
The bug only reproduces when `VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1` (added in #36836); the legacy `RayDistributedExecutor` doesn't use named Ray actors and is unaffected.

Test plan
AI assistance disclosure
AI assistance was used to audit `instance_id` usage across the codebase and draft the patch.