[Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down by haosdent · Pull Request #42203 · vllm-project/vllm

haosdent · 2026-05-10T04:59:38Z

Purpose

Follow-up to #41792. Defer warmup on new engines during init, so existing ranks don't need a duplicate cudagraph capture to keep DP-coord in lockstep; dispatch a single warm_and_capture post-reshuffle on every elastic transition. One capture per rank per scale; scale-down gets the workspace fix it lacked. Closes #42107.
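Why scale-down needs the workspace fix can be illustrated with expert counts alone. The sketch below is only arithmetic under stated assumptions: `NUM_EXPERTS = 64` is a hypothetical model size, `local_experts` is a hypothetical helper, and the even (ceil) split ignores any redundant experts EPLB may place. The actual workspace sizing formula is not stated in this PR.

```python
import math


def local_experts(num_experts: int, ep_size: int) -> int:
    """Upper bound on experts hosted per rank, assuming an even (ceil) split."""
    return math.ceil(num_experts / ep_size)


NUM_EXPERTS = 64  # hypothetical; not taken from the PR

# The two scale-down transitions exercised by the test plan: 4 -> 2 and 3 -> 2.
for before, after in [(4, 2), (3, 2)]:
    grew_from = local_experts(NUM_EXPERTS, before)
    grew_to = local_experts(NUM_EXPERTS, after)
    print(f"DP {before} -> {after}: local experts per rank {grew_from} -> {grew_to}")
```

Under these assumptions every remaining rank hosts more experts after a scale-down, which is consistent with the description that the per-rank workspace requirement grows.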

EPLB rearrange (rebalance_execute.py:383) is in-place, so the single capture stays valid.
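The in-place argument can be shown with a pure-Python analogy (not torch, and not vLLM code). Here a "captured graph" is modelled as a closure that remembers the exact buffer object it saw at capture time; `capture` and `replay` are hypothetical names. An in-place write keeps the captured reference valid, while rebinding to a fresh buffer would leave the capture reading stale memory.

```python
def capture(buffer):
    """Return a replay function bound to this specific buffer object."""
    def replay():
        return sum(buffer)
    return replay


weights = [1, 2, 3]
replay = capture(weights)
addr_before = id(weights)

# In-place write, analogous to an in-place tensor copy: same object, new contents.
weights[:] = [10, 20, 30]
same_address = id(weights) == addr_before
in_place_result = replay()  # the capture sees the new contents

# Rebinding, analogous to allocating a fresh buffer: the capture goes stale.
weights = [100, 200, 300]
rebound_result = replay()  # still reads the old buffer, not the new one
```

The analogy only covers object identity; real cudagraphs bake in device addresses, which is why preserving `.data_ptr()` matters there.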

Test plan

tests/distributed/test_elastic_ep.py on 4×H20, DP 2→4→2 and 2→3→2 with GSM8K at every stage:

================== 2 passed, 18 warnings in 388.70s (0:06:28) ==================

| Test | Initial | After scale up | After scale down |
| --- | --- | --- | --- |
| `test_elastic_ep_scaling` (2→4→2) | 0.656 | 0.664 (+0.008) | passed |
| `test_elastic_ep_scaling_uneven` (2→3→2) | 0.637 | 0.648 (+0.011) | passed |

gemini-code-assist

Code Review

This pull request consolidates the warming and CUDA graph capture processes into a single "warm_and_capture" phase within the elastic execution logic. It removes redundant block table management from the "switch_and_prepare" method and ensures that both scale-up and scale-down transitions trigger this unified phase to correctly size the MoE workspace and refresh CUDA graphs for the updated topology. I have no feedback to provide as there were no review comments.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Follow-up to vllm-project#41792 (53dc9f3). Two issues remained:

- Scale-up captured cudagraphs twice per existing rank: once in switch_and_prepare() (pre-EPLB-reshuffle), once in rewarm_workspace (post-reshuffle). The first capture was load-bearing (it kept existing workers in lockstep with new workers' init-time coordinate_batch_across_dp), but the duplicate cost 5-10s per existing rank.
- Scale-down didn't get the workspace fix at all. rewarm_workspace was only dispatched from _eplb_reshuffle (scale-up). After scale-down, each remaining rank serves more local experts, so the per-rank workspace requirement grows and would trip the locked-too-small assertion under real traffic.

Break the DP-coord dependency at the source:

- Split Executor.initialize_from_config into two public methods: KV cache setup, and compile_or_warm_up_model for warmup + cudagraph capture + compilation-time propagation. EngineCore calls them separately and skips the warmup under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH. New engines finish __init__ with KV cache allocated but no cudagraphs and no DP-coord all_reduce, so there is no deadlock.
- Strip the now-unnecessary warmup block from switch_and_prepare.
- Rename rewarm_workspace -> warm_and_capture (it is now the sole capture site for elastic transitions, not a re-warm bolt-on).
- Dispatch warm_and_capture post-reshuffle on every elastic transition: _eplb_reshuffle (scale-up, existing + new) and _progress_remaining_engine (scale-down remaining). Removing engines skip it because they are about to shut down.

Net: one cudagraph capture per rank per scale, workspace correctly grown to max_num_tokens before capture, scale-down gets the workspace fix it lacked.

EPLB rearrange uses w[dst].copy_(b[dst], non_blocking=True) (rebalance_execute.py:383), an in-place write that preserves .data_ptr(), so the single capture remains valid after reshuffle.

Closes vllm-project#42107

Tested on h20-server-0 (4xH20, DP 2->4->2 and 2->3->2 with GSM8K eval at every stage): 2 passed in 6:28. GSM8K accuracy holds within 0.011 across all post-scale stages. The previously-hung deadlock signature (30 min of shm_broadcast timeouts during cudagraph capture) is gone.

Signed-off-by: haosdent <haosdent@gmail.com>
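The control flow described in this commit message can be sketched with toy classes. This is a minimal illustration, not the vLLM implementation: `ToyExecutor`, `ToyEngineCore`, `setup_kv_cache`, `scale_up`, and `scale_down` are hypothetical names, the truthiness check on `VLLM_ELASTIC_EP_SCALE_UP_LAUNCH` is an assumption about how the variable is read, and a "capture" is reduced to incrementing a counter.

```python
import os


class ToyExecutor:
    """Hypothetical stand-in for an executor; it only counts captures."""

    def __init__(self):
        self.kv_cache_ready = False
        self.captures = 0

    def setup_kv_cache(self):  # hypothetical name for the KV cache half
        self.kv_cache_ready = True

    def compile_or_warm_up_model(self):  # warmup + capture half
        self.captures += 1


class ToyEngineCore:
    def __init__(self, env=None):
        env = os.environ if env is None else env
        self.executor = ToyExecutor()
        self.executor.setup_kv_cache()
        # Engines launched for a scale-up skip warmup during init (assumed check).
        if not env.get("VLLM_ELASTIC_EP_SCALE_UP_LAUNCH"):
            self.executor.compile_or_warm_up_model()

    def warm_and_capture(self):
        # Toy version: the sole capture site for elastic transitions.
        self.executor.captures += 1


def scale_up(existing, num_new):
    launch_env = {"VLLM_ELASTIC_EP_SCALE_UP_LAUNCH": "1"}
    engines = existing + [ToyEngineCore(env=launch_env) for _ in range(num_new)]
    # The EPLB reshuffle would run here; capture happens once, afterwards.
    for engine in engines:
        engine.warm_and_capture()
    return engines


def scale_down(engines, num_keep):
    remaining = engines[:num_keep]  # removed engines skip warm_and_capture
    for engine in remaining:
        engine.warm_and_capture()
    return remaining


engines = [ToyEngineCore(env={}) for _ in range(2)]  # initial DP=2
engines = scale_up(engines, 2)                       # DP 2 -> 4
after_up = [e.executor.captures for e in engines]
engines = scale_down(engines, 2)                     # DP 4 -> 2
after_down = [e.executor.captures for e in engines]
```

In this toy model each rank's counter rises by exactly one per transition: new engines leave init with a ready KV cache and zero captures, then receive their first capture together with the existing ranks after the reshuffle.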

mergify Bot added the bug Something isn't working label May 10, 2026

gemini-code-assist Bot reviewed May 10, 2026

haosdent force-pushed the elastic-ep-unify-warm-capture branch 2 times, most recently from f4064b4 to 712ccab Compare May 10, 2026 11:08

haosdent changed the title ~~[WIP][Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down~~ [Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down May 10, 2026

haosdent marked this pull request as ready for review May 10, 2026 11:09

claude Bot reviewed May 10, 2026

haosdent mentioned this pull request May 10, 2026

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure #41792

Merged

haosdent force-pushed the elastic-ep-unify-warm-capture branch 2 times, most recently from 6c30841 to 0036b01 Compare May 11, 2026 03:03

haosdent marked this pull request as draft May 11, 2026 03:35

haosdent force-pushed the elastic-ep-unify-warm-capture branch from 0036b01 to c1fe7ef Compare May 11, 2026 03:43

mergify Bot added the v1 label May 11, 2026

haosdent force-pushed the elastic-ep-unify-warm-capture branch from c1fe7ef to 89aadc6 Compare May 11, 2026 03:47

haosdent force-pushed the elastic-ep-unify-warm-capture branch from 89aadc6 to bfae75e Compare May 11, 2026 04:08

haosdent marked this pull request as ready for review May 11, 2026 04:27

haosdent requested a review from njhill as a code owner May 11, 2026 04:27

haosdent wants to merge 1 commit into vllm-project:main from haosdent:elastic-ep-unify-warm-capture
