[Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down #42203
Open
haosdent wants to merge 1 commit into
Code Review
This pull request consolidates the warming and CUDA graph capture processes into a single "warm_and_capture" phase within the elastic execution logic. It removes redundant block table management from the "switch_and_prepare" method and ensures that both scale-up and scale-down transitions trigger this unified phase to correctly size the MoE workspace and refresh CUDA graphs for the updated topology. I have no feedback to provide as there were no review comments.
Follow-up to vllm-project#41792 (53dc9f3). Two issues remained:

- Scale-up captured cudagraphs twice per existing rank: once in switch_and_prepare() (pre-EPLB-reshuffle), once in rewarm_workspace (post-reshuffle). The first capture was load-bearing (it kept existing workers in lockstep with new workers' init-time coordinate_batch_across_dp), but the duplicate cost 5-10s per existing rank.
- Scale-down didn't get the workspace fix at all. rewarm_workspace was only dispatched from _eplb_reshuffle (scale-up). After scale-down, each remaining rank serves more local experts, so the per-rank workspace requirement grows and would trip the locked-too-small assertion under real traffic.

Break the DP-coord dependency at the source:

- Split Executor.initialize_from_config into two public methods: KV cache setup, and compile_or_warm_up_model for warmup + cudagraph capture + compilation-time propagation. EngineCore calls them separately and skips the warmup under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH. New engines finish __init__ with KV cache allocated but no cudagraphs and no DP-coord all_reduce, so there is no deadlock.
- Strip the now-unnecessary warmup block from switch_and_prepare.
- Rename rewarm_workspace -> warm_and_capture (it is now the sole capture site for elastic transitions, not a re-warm bolt-on).
- Dispatch warm_and_capture post-reshuffle on every elastic transition: _eplb_reshuffle (scale-up, existing + new) and _progress_remaining_engine (scale-down remaining). Removing engines skip it because they are about to shut down.

Net: one cudagraph capture per rank per scale, workspace correctly grown to max_num_tokens before capture, scale-down gets the workspace fix it lacked. EPLB rearrange uses w[dst].copy_(b[dst], non_blocking=True) (rebalance_execute.py:383), an in-place write that preserves .data_ptr(), so the single capture remains valid after reshuffle.

Closes vllm-project#42107

Tested on h20-server-0 (4xH20, DP 2->4->2 and 2->3->2 with GSM8K eval at every stage): 2 passed in 6:28. GSM8K accuracy holds within 0.011 across all post-scale stages. The previously-hung deadlock signature (30 min of shm_broadcast timeouts during cudagraph capture) is gone.

Signed-off-by: haosdent <haosdent@gmail.com>
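The control flow described in the commit message can be sketched roughly as follows. This is a toy model, not vLLM code: ToyEngine, setup_kv_cache, init_engine and finish_transition are hypothetical names invented for illustration, and treating the flag as set when it equals "1" is an assumption. Only compile_or_warm_up_model, warm_and_capture and VLLM_ELASTIC_EP_SCALE_UP_LAUNCH are names taken from the message above.

```python
import os
from dataclasses import dataclass, field

# Environment variable named in the commit message; the "1" check below is an assumption.
SCALE_UP_FLAG = "VLLM_ELASTIC_EP_SCALE_UP_LAUNCH"


@dataclass
class ToyEngine:
    """Hypothetical stand-in for one engine rank; not vLLM's real class."""
    name: str
    kv_cache_ready: bool = False
    captures: int = 0
    log: list = field(default_factory=list)

    def setup_kv_cache(self) -> None:
        # Hypothetical name for the KV cache half of the split.
        self.kv_cache_ready = True
        self.log.append("kv_cache")

    def compile_or_warm_up_model(self) -> None:
        # Init-time warmup + capture, used on a normal (non-elastic) launch.
        self.captures += 1
        self.log.append("compile_or_warm_up_model")

    def warm_and_capture(self) -> None:
        # In this sketch, the only capture that runs during an elastic transition.
        self.captures += 1
        self.log.append("warm_and_capture")


def init_engine(name: str, env=os.environ) -> ToyEngine:
    engine = ToyEngine(name)
    engine.setup_kv_cache()
    # Engines launched for scale-up skip warmup here and capture after the reshuffle.
    if env.get(SCALE_UP_FLAG) != "1":
        engine.compile_or_warm_up_model()
    return engine


def finish_transition(staying, removing):
    """After the reshuffle, capture once on every rank that keeps serving."""
    for engine in staying:
        engine.warm_and_capture()
    # Engines in `removing` are about to shut down, so they skip the capture.
    return [engine.name for engine in staying]
```

In this model a freshly launched scale-up engine ends init with a KV cache and zero captures, then gets exactly one capture from finish_transition, while an engine that is being removed gets none.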
Purpose

Follow-up to #41792. Defer warmup on new engines during init, so existing ranks don't need a duplicate cudagraph capture to keep DP-coord in lockstep; dispatch a single warm_and_capture post-reshuffle on every elastic transition. One capture per rank per scale; scale-down gets the workspace fix it lacked. Closes #42107.

EPLB rearrange (rebalance_execute.py:383) is in-place, so the single capture stays valid.

Test plan

tests/distributed/test_elastic_ep.py on 4×H20, DP 2→4→2 and 2→3→2 with GSM8K at every stage:

- test_elastic_ep_scaling (2→4→2)
- test_elastic_ep_scaling_uneven (2→3→2)
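The claim that an in-place EPLB rearrange keeps the single capture valid rests on the buffer address not changing. The following is only an analogy using the standard library, not PyTorch or vLLM code: a ctypes array stands in for a weight tensor and ctypes.addressof stands in for .data_ptr(). The variable names are hypothetical.

```python
import ctypes

# Hypothetical stand-ins for an expert weight buffer and the incoming weights.
weights = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
incoming = (ctypes.c_float * 4)(5.0, 6.0, 7.0, 8.0)

# The address a captured graph would have recorded for this buffer.
captured_address = ctypes.addressof(weights)

# In-place write: the contents change, the buffer stays at the same address.
ctypes.memmove(weights, incoming, ctypes.sizeof(weights))
in_place_address = ctypes.addressof(weights)

# Rebinding to a fresh buffer instead: the old buffer is still alive,
# so the new one necessarily lives at a different address.
old_buffer = weights
weights = (ctypes.c_float * 4)(9.0, 10.0, 11.0, 12.0)
rebound_address = ctypes.addressof(weights)
```

Anything that recorded captured_address still sees the updated values after the in-place write, whereas after rebinding it would keep pointing at the old buffer. That is the property the description relies on when it says the capture stays valid after the reshuffle.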