[Ray] Enable RayExecutorV2 by default by jeffreywang-anyscale · Pull Request #41421 · vllm-project/vllm

jeffreywang-anyscale · 2026-04-30T23:08:33Z

Purpose

Fix the failing CI step "Elastic EP Scaling Test" by enabling RayExecutorV2 by default, which sidesteps the compiled graph bug. The compiled-graph-based RayExecutor is on a deprecation path.
RayExecutorV2 needs to become the default anyway; RayExecutor will be deprecated once RayExecutorV2 is stabilized.

Test Plan

Test Result

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request updates the Buildkite configuration for expert parallelism tests by enabling the Ray V2 executor backend via an environment variable when running distributed/test_elastic_ep.py. I have no feedback to provide.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang-anyscale · 2026-05-02T04:07:12Z

@khluu can you also kick off elastic EP scaling test to ensure this fixes the flakiness that this PR intends to resolve? Thank you :)

mergify · 2026-05-04T08:24:37Z

Hi @jeffreywang-anyscale, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

vllm-project#41421 made RayExecutorV2 the default Ray DP executor and surfaced two regressions in tests/distributed/test_elastic_ep.py.

1. Ray actor name collision during DP scale-up. With TP=PP=PCP=1, build_actor_name() reduces to "vllm_Worker_{instance_id}". _init_engines uniquifies instance_id per DP rank; scale_up_elastic_ep did not, so newly-spawned engines registered on the same Ray name as already-running DP siblings. _init_engines also used local_dp_rank for kv_transfer_config.engine_id, which collides across nodes whenever two DP engines on different nodes share the same local rank.

2. MoE workspace locked too small on engines participating in scale-up. WorkspaceManager locks after warmup; EPLB reshuffle changes per-rank token routing afterwards. New engines never ran an unlock/warm/relock cycle, and even after one is added, compile_or_warm_up_model uses uniform dummy routing, leaving the workspace ~10-14 MB while real GSM8K traffic concentrates tokens on a subset of post-reshuffle hot experts and needs hundreds of MB. Existing engines absorbed this only because they accumulated workspace during real DP=2 traffic before scale-up.

Changes:
- Extract _apply_dp_identity_suffix(dp_vllm_config, dp_rank) for the per-DP-rank suffix convention. Used by _init_engines (deduped) and scale_up_elastic_ep (newly applied). Uses global dp_rank for both instance_id and kv_transfer_config.engine_id to avoid cross-node collisions.
- Add ElasticEPScalingExecutor.rewarm_workspace and dispatch it from _eplb_reshuffle so both new and existing engines invalidate captured CUDA graphs (whose data pointers would otherwise be stale after the workspace is reallocated), rerun compile_or_warm_up_model, and relock against the post-reshuffle expert mapping.
- MAX-all-reduce the per-rank workspace size across the DP group at the end of rewarm_workspace, then force-grow every rank to the synced size before relocking. Existing engines (with traffic history) act as the upper bound; newly-spawned engines adopt that high-watermark.
- Add WorkspaceManager.current_size_bytes / ensure_size_bytes (lock-bypass force-grow), plus module-level workspace_current_size_bytes / workspace_ensure_size_bytes wrappers.

Signed-off-by: haosdent <haosdent@gmail.com>
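The high-watermark sync described in this commit message can be illustrated with a small sketch. This is not the vLLM implementation: ToyWorkspace and sync_to_high_watermark are hypothetical names, the MAX all-reduce is emulated with a plain max() over an in-process list instead of a real collective, and the 300 MB figure is only an illustrative stand-in for "hundreds of MB".

```python
from dataclasses import dataclass


@dataclass
class ToyWorkspace:
    """Hypothetical stand-in for a per-rank MoE workspace manager."""

    size_bytes: int = 0
    locked: bool = False

    def request(self, nbytes: int) -> None:
        # Normal path: growth is only allowed while the workspace is unlocked.
        if nbytes > self.size_bytes:
            if self.locked:
                raise RuntimeError("workspace is locked; cannot grow")
            self.size_bytes = nbytes

    def current_size_bytes(self) -> int:
        return self.size_bytes

    def ensure_size_bytes(self, nbytes: int) -> None:
        # Lock-bypass force-grow: ignores the lock and never shrinks.
        self.size_bytes = max(self.size_bytes, nbytes)


def sync_to_high_watermark(workspaces: list[ToyWorkspace]) -> int:
    """Emulate a MAX all-reduce over the DP group, then grow and relock."""
    synced = max(ws.current_size_bytes() for ws in workspaces)
    for ws in workspaces:
        ws.ensure_size_bytes(synced)
        ws.locked = True
    return synced


MB = 1 << 20
# Illustrative sizes only: one engine grew under real traffic, one only saw warmup.
existing = ToyWorkspace(size_bytes=300 * MB, locked=True)
fresh = ToyWorkspace(size_bytes=14 * MB, locked=True)
synced_size = sync_to_high_watermark([existing, fresh])
```

After the sync, the freshly spawned rank holds the same workspace size as the rank with traffic history, so a later request at that size no longer hits the lock.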

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>

vllm-project#41421 made RayExecutorV2 the default Ray DP executor and broke tests/distributed/test_elastic_ep.py:
- New engines collided on Ray actor names because scale_up_elastic_ep didn't append _dp{rank} to instance_id / kv_transfer_config.engine_id like _init_engines does.
- After EPLB reshuffle, the locked MoE workspace was sized for cudagraph-capture batches (~14 MB) while real traffic needed hundreds of MB.

Extract _apply_dp_identity_suffix and use it in both call sites (global dp_rank for instance_id and engine_id so multi-node DP doesn't collide).

Add ElasticEPScalingExecutor.rewarm_workspace dispatched from _eplb_reshuffle on every DP sibling: save/clear block tables, release captured CUDA graphs, unlock workspace, run _dummy_run at max_num_tokens with skip_eplb=True so the MoE workspace grows for full-shape post-reshuffle traffic without polluting the just-rebalanced EPLB stats, re-capture cudagraphs, relock, restore block tables.

Signed-off-by: haosdent <haosdent@gmail.com>
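To make the naming collision concrete, here is a minimal sketch of the per-DP-rank suffix convention. It assumes simplified config objects: ToyVllmConfig, ToyKVTransferConfig, apply_dp_identity_suffix, and actor_name are hypothetical stand-ins, not the vLLM classes or the actual _apply_dp_identity_suffix, and returning a deep copy is a choice made for the sketch. Only the "_dp{rank}" suffix and the "vllm_Worker_{instance_id}" name shape come from the commit messages above.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyKVTransferConfig:
    """Hypothetical stand-in for the KV transfer config."""

    engine_id: str


@dataclass
class ToyVllmConfig:
    """Hypothetical stand-in holding only the identity fields."""

    instance_id: str
    kv_transfer_config: Optional[ToyKVTransferConfig] = None


def apply_dp_identity_suffix(config: ToyVllmConfig, dp_rank: int) -> ToyVllmConfig:
    """Return a copy whose identifiers carry a suffix for the global DP rank."""
    dp_config = copy.deepcopy(config)
    suffix = f"_dp{dp_rank}"
    dp_config.instance_id = f"{config.instance_id}{suffix}"
    if dp_config.kv_transfer_config is not None:
        base_engine_id = config.kv_transfer_config.engine_id
        dp_config.kv_transfer_config.engine_id = f"{base_engine_id}{suffix}"
    return dp_config


def actor_name(config: ToyVllmConfig) -> str:
    # Name shape quoted in the commit message for the TP=PP=PCP=1 case.
    return f"vllm_Worker_{config.instance_id}"


base = ToyVllmConfig("abc", ToyKVTransferConfig("kv"))

# Without the suffix, every DP engine would register under the same name.
unsuffixed = {actor_name(base) for _ in range(4)}

# With the suffix keyed on the global DP rank, each name is distinct.
names = [actor_name(apply_dp_identity_suffix(base, rank)) for rank in range(4)]
engine_ids = [
    apply_dp_identity_suffix(base, rank).kv_transfer_config.engine_id
    for rank in range(4)
]
```

Keying the suffix on the global rank rather than the node-local rank is what keeps two engines on different nodes from sharing an identifier.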

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

Under the new RayExecutorV2 default (PR vllm-project#41421), the example runs with async_scheduling enabled. AsyncScheduler drops the first post-resume token after pause_generation(mode="keep")'s cache reset, producing a deterministic off-by-one shift that fails 0/13 prompts in the H100 distributed CI step. Pin both LLM instances to async_scheduling=False as a workaround until the underlying AsyncScheduler bug is resolved. Signed-off-by: haosdent <haosdent@gmail.com>

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Libin Tang <libin.tang@intel.com>

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

Use RayExecutorV2 for EP tests

fdcbcc6

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

claude Bot reviewed Apr 30, 2026

mergify Bot added the ci/build label Apr 30, 2026

gemini-code-assist Bot reviewed Apr 30, 2026

Flip RayExecutorV2 on by default

6f2a686

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang-anyscale changed the title ~~Use RayExecutorV2 for EP tests~~ [Ray] Enable RayExecutorV2 by default Apr 30, 2026

khluu added the ready label ("ONLY add when PR is ready to merge/full CI is needed") May 1, 2026

jeffreywang-anyscale mentioned this pull request May 2, 2026

[BugFix] Prevent orphaned process on NCCL destroy #39846

Merged

5 tasks

NickLucche approved these changes May 4, 2026

Merge branch 'main' into ray-executor-eep

656fe2f

NickLucche enabled auto-merge (squash) May 4, 2026 08:04

jeffreywang-anyscale added 3 commits May 4, 2026 10:44

Merge branch 'main' into ray-executor-eep

e4c238e

Merge branch 'main' into ray-executor-eep

3044353

Merge branch 'main' into ray-executor-eep

8f49475

NickLucche merged commit f04fd16 into vllm-project:main May 5, 2026
52 checks passed

haosdent mentioned this pull request May 6, 2026

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure #41792

Merged

chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026

[Ray] Enable RayExecutorV2 by default (vllm-project#41421)

7cd8e6d

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

ikaadil pushed a commit to ikaadil/vllm that referenced this pull request May 7, 2026

[Ray] Enable RayExecutorV2 by default (vllm-project#41421)

f550c6c

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>

This was referenced May 8, 2026

[CI][Examples][RLHF] Disable async scheduling in rlhf_async_new_apis #42042

Merged

[CI Failure][Bug] AsyncScheduler drops first post-resume token after pause_generation(mode="keep") + clear_cache #42043

Closed

libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026

[Ray] Enable RayExecutorV2 by default (vllm-project#41421)

f77139f

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Libin Tang <libin.tang@intel.com>

weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026

[Ray] Enable RayExecutorV2 by default (vllm-project#41421)

4eecefd

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[Ray] Enable RayExecutorV2 by default (vllm-project#41421)

eff7463

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[Ray] Enable RayExecutorV2 by default (vllm-project#41421)

83966bd

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

NickLucche merged 6 commits into vllm-project:main from jeffreywang-anyscale:ray-executor-eep

3 participants
