[Ray] Enable RayExecutorV2 by default #41421
Merged
NickLucche merged 6 commits on May 5, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@khluu can you also kick off elastic EP scaling test to ensure this fixes the flakiness that this PR intends to resolve? Thank you :)
NickLucche approved these changes on May 4, 2026
Hi @jeffreywang-anyscale, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request on May 6, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
haosdent added a commit to haosdent/vllm that referenced this pull request on May 6, 2026

vllm-project#41421 made RayExecutorV2 the default Ray DP executor and surfaced two regressions in tests/distributed/test_elastic_ep.py.

1. Ray actor name collision during DP scale-up. With TP=PP=PCP=1, build_actor_name() reduces to "vllm_Worker_{instance_id}". _init_engines uniquifies instance_id per DP rank; scale_up_elastic_ep did not, so newly-spawned engines registered on the same Ray name as already-running DP siblings. _init_engines also used local_dp_rank for kv_transfer_config.engine_id, which collides across nodes whenever two DP engines on different nodes share the same local rank.

2. MoE workspace locked too small on engines participating in scale-up. WorkspaceManager locks after warmup; EPLB reshuffle changes per-rank token routing afterwards. New engines never ran an unlock/warm/relock cycle, and even after one is added, compile_or_warm_up_model uses uniform dummy routing, leaving the workspace ~10-14 MB while real GSM8K traffic concentrates tokens on a subset of post-reshuffle hot experts and needs hundreds of MB. Existing engines absorbed this only because they accumulated workspace during real DP=2 traffic before scale-up.

Changes:
- Extract _apply_dp_identity_suffix(dp_vllm_config, dp_rank) for the per-DP-rank suffix convention. Used by _init_engines (deduped) and scale_up_elastic_ep (newly applied). Uses global dp_rank for both instance_id and kv_transfer_config.engine_id to avoid cross-node collisions.
- Add ElasticEPScalingExecutor.rewarm_workspace and dispatch it from _eplb_reshuffle so both new and existing engines invalidate captured CUDA graphs (whose data pointers would otherwise be stale after the workspace is reallocated), rerun compile_or_warm_up_model, and relock against the post-reshuffle expert mapping.
- MAX-all-reduce the per-rank workspace size across the DP group at the end of rewarm_workspace, then force-grow every rank to the synced size before relocking. Existing engines (with traffic history) act as the upper bound; newly-spawned engines adopt that high-watermark.
- Add WorkspaceManager.current_size_bytes / ensure_size_bytes (lock-bypass force-grow), plus module-level workspace_current_size_bytes / workspace_ensure_size_bytes wrappers.

Signed-off-by: haosdent <haosdent@gmail.com>
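The actor name collision described in this commit message can be illustrated with a small, self-contained sketch. It is not vLLM code: `ToyEngineConfig`, `kv_engine_id`, and the free functions below are hypothetical stand-ins, and the `_dp{rank}` suffix format is taken from the later commit message on this page. Only the reduced name pattern `vllm_Worker_{instance_id}` comes from the text above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyEngineConfig:
    """Hypothetical stand-in for a per-engine config; not vLLM's config class."""
    instance_id: str
    kv_engine_id: str


def build_actor_name(cfg: ToyEngineConfig) -> str:
    # Reduced form quoted in the commit message for TP=PP=PCP=1.
    return f"vllm_Worker_{cfg.instance_id}"


def apply_dp_identity_suffix(cfg: ToyEngineConfig, dp_rank: int) -> ToyEngineConfig:
    # Suffix both identifiers with the *global* DP rank, so engines on
    # different nodes that share a local rank still get distinct IDs.
    suffix = f"_dp{dp_rank}"
    return replace(
        cfg,
        instance_id=cfg.instance_id + suffix,
        kv_engine_id=cfg.kv_engine_id + suffix,
    )


base = ToyEngineConfig(instance_id="abc123", kv_engine_id="kv0")

# Without the suffix, every DP engine derives the same actor name.
unsuffixed = [build_actor_name(base) for _ in range(3)]

# With the suffix applied per global DP rank, the names are distinct.
suffixed = [build_actor_name(apply_dp_identity_suffix(base, r)) for r in range(3)]
```

Routing both the initial launch path and the scale-up path through one helper like this is what keeps the two call sites from drifting apart again.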
haosdent added a commit to haosdent/vllm that referenced this pull request on May 6, 2026 (same commit message as above)
haosdent added a commit to haosdent/vllm that referenced this pull request on May 7, 2026 (same commit message as above)
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request on May 7, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
haosdent added a commit to haosdent/vllm that referenced this pull request on May 7, 2026

vllm-project#41421 made RayExecutorV2 the default Ray DP executor and broke tests/distributed/test_elastic_ep.py:
- New engines collided on Ray actor names because scale_up_elastic_ep didn't append _dp{rank} to instance_id / kv_transfer_config.engine_id like _init_engines does.
- After EPLB reshuffle, the locked MoE workspace was sized for cudagraph-capture batches (~14 MB) while real traffic needed hundreds of MB.

Extract _apply_dp_identity_suffix and use it in both call sites (global dp_rank for instance_id and engine_id so multi-node DP doesn't collide).

Add ElasticEPScalingExecutor.rewarm_workspace dispatched from _eplb_reshuffle on every DP sibling: save/clear block tables, release captured CUDA graphs, unlock workspace, run _dummy_run at max_num_tokens with skip_eplb=True so the MoE workspace grows for full-shape post-reshuffle traffic without polluting the just-rebalanced EPLB stats, re-capture cudagraphs, relock, restore block tables.

Signed-off-by: haosdent <haosdent@gmail.com>
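The unlock/warm/relock cycle in this commit message can be sketched with a toy model of a lock-after-warmup workspace. Everything here is an assumption made for illustration: `ToyWorkspace`, `dummy_run`, `rewarm`, and the token and byte counts are hypothetical and do not reflect vLLM's WorkspaceManager or its real sizes. The sketch only shows why a workspace locked after a small warmup rejects larger requests, and how warming at the maximum token count before relocking avoids that.

```python
class ToyWorkspace:
    """Hypothetical lock-after-warmup size tracker; not vLLM's WorkspaceManager."""

    def __init__(self) -> None:
        self.size_bytes = 0
        self.locked = False

    def request(self, nbytes: int) -> None:
        # Requests that fit in the current allocation always succeed.
        if nbytes <= self.size_bytes:
            return
        # Growing is only allowed while the workspace is unlocked.
        if self.locked:
            raise RuntimeError(
                f"workspace locked at {self.size_bytes} B, need {nbytes} B"
            )
        self.size_bytes = nbytes

    def lock(self) -> None:
        self.locked = True

    def unlock(self) -> None:
        self.locked = False


def dummy_run(ws: ToyWorkspace, num_tokens: int, bytes_per_token: int) -> None:
    # Stand-in for a forward pass that needs scratch space per token.
    ws.request(num_tokens * bytes_per_token)


def rewarm(ws: ToyWorkspace, max_num_tokens: int, bytes_per_token: int) -> None:
    # Unlock, warm at the largest shape, then relock even if warming fails.
    ws.unlock()
    try:
        dummy_run(ws, max_num_tokens, bytes_per_token)
    finally:
        ws.lock()


ws = ToyWorkspace()
dummy_run(ws, num_tokens=512, bytes_per_token=1024)  # small capture-sized warmup
ws.lock()

try:
    dummy_run(ws, num_tokens=8192, bytes_per_token=1024)  # larger real traffic
    failed_before_rewarm = False
except RuntimeError:
    failed_before_rewarm = True

rewarm(ws, max_num_tokens=8192, bytes_per_token=1024)
dummy_run(ws, num_tokens=8192, bytes_per_token=1024)  # fits after rewarming
```

The real sequence in the commit also saves and restores block tables and releases and re-captures CUDA graphs around this cycle; those steps are omitted here.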
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request on May 7, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
haosdent added a commit to haosdent/vllm that referenced this pull request on May 8, 2026

Under the new RayExecutorV2 default (PR vllm-project#41421), the example runs with async_scheduling enabled. AsyncScheduler drops the first post-resume token after the cache reset in pause_generation(mode="keep"), producing a deterministic off-by-one shift, so 0/13 prompts pass in the H100 distributed CI step.

Pin both LLM instances to async_scheduling=False as a workaround until the underlying AsyncScheduler bug is resolved.

Signed-off-by: haosdent <haosdent@gmail.com>
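The workaround amounts to passing async_scheduling=False when each LLM instance is constructed. A minimal sketch follows, assuming the flag is accepted as a constructor keyword as the commit message implies; `pinned_llm_kwargs` is a hypothetical helper, and the model name and `tensor_parallel_size` value are placeholders.

```python
from typing import Any


def pinned_llm_kwargs(model: str, **overrides: Any) -> dict[str, Any]:
    """Build constructor kwargs with async scheduling pinned off (hypothetical helper)."""
    kwargs: dict[str, Any] = {"model": model, **overrides}
    # Applied last so a caller cannot accidentally re-enable it.
    kwargs["async_scheduling"] = False
    return kwargs


kwargs = pinned_llm_kwargs("my-org/my-model", tensor_parallel_size=1)

# Assumed usage, not executed here because it needs vLLM and a GPU:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```

Building the kwargs in one place makes it easier to remove the pin from both instances once the AsyncScheduler bug is fixed.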
haosdent added two further commits to haosdent/vllm that referenced this pull request on May 8, 2026 (same commit message as above)
libinta pushed a commit to libinta/vllm that referenced this pull request on May 8, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Libin Tang <libin.tang@intel.com>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request on May 13, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request on May 19, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request on May 20, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request on Jun 4, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

Purpose
Turn RayExecutorV2 on by default to sidestep the compiled graph bug. The compiled-graph-based RayExecutor is on a deprecation path, so RayExecutorV2 would be on by default anyway. RayExecutor will be deprecated once RayExecutorV2 is stabilized.
Test Plan
Test Result