[CI][Elastic EP] Fix Elastic EP Scaling Test Failure #41792
Code Review
This pull request introduces a workspace rewarming mechanism in the elastic EP to manage CUDA graph lifecycle during MoE workspace resizing and refactors the DP identity suffixing logic into a shared utility. The review feedback highlights a potential ID collision issue in multi-node environments, suggesting the use of global DP ranks instead of node-local ranks for engine identification, along with corresponding signature updates for the new helper function.
Hi @haosdent, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Hi, thanks @NickLucche. I checked your changes in #39907, and the two PRs look able to work together. In short, it should be overridden by your changes if it belongs to the same DP.
just one thing
I wouldn't say it's rare with the GB series and NVL72 anymore
Indeed, updated.
@haosdent this might be related. Do you have permissions to re-run CI here?
I didn't notice the CI failure before; let me fix it.
3bbf260 to a9d72aa
It seems to fail on unrelated test cases; let me try to re-trigger CI.
    # Save and clear block tables so profile_run/compile_or_warm_up_model
    # don't write dummy slot mappings into real KV-cache blocks (mirrors
    # switch_and_prepare's pattern).
    multi_block_table = self.worker.model_runner.input_batch.block_table
    saved_block_tables: list[tuple[torch.Tensor, torch.Tensor]] = []
    for bt in multi_block_table.block_tables:
        saved_block_tables.append(
            (bt.block_table.gpu.clone(), bt.block_table.cpu.clone())
        )
    multi_block_table.clear()
Is there any model runner state that we're missing here?
I duplicated this code from https://github.com/haosdent/vllm/blob/53dc9f3923a8438b7fd31f4c348a59e17894a453/vllm/distributed/elastic_ep/elastic_execute.py#L435-L451
Alternatively, we could add something like model_runner.prepare_for_warmup() and restore_after_warmup() to avoid the duplication.
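One way the suggested prepare/restore pair could look is a single context manager. This is only a sketch: `ToyBlockTable` and `cleared_block_tables` are hypothetical names, and plain Python lists stand in for the GPU/CPU tensors that the real block tables hold.

```python
from contextlib import contextmanager
from typing import Iterator


class ToyBlockTable:
    """Stand-in for one block table; nested lists replace the real tensors."""

    def __init__(self, rows: list[list[int]]) -> None:
        self.rows = rows

    def clear(self) -> None:
        # Zero every entry in place, keeping the same row objects.
        for row in self.rows:
            for i in range(len(row)):
                row[i] = 0


@contextmanager
def cleared_block_tables(tables: list[ToyBlockTable]) -> Iterator[None]:
    """Save every table, clear it for the warmup, and restore it afterwards."""
    saved = [[list(row) for row in table.rows] for table in tables]
    for table in tables:
        table.clear()
    try:
        yield
    finally:
        # Restore in place so other holders of `rows` see the real data again,
        # even if the warmup raised.
        for table, snapshot in zip(tables, saved):
            for row, old in zip(table.rows, snapshot):
                row[:] = old


tables = [ToyBlockTable([[3, 7], [9, 0]])]
with cleared_block_tables(tables):
    # A warmup run here would only ever see zeroed tables.
    during_warmup = [list(row) for row in tables[0].rows]
after_warmup = tables[0].rows
```

The `try`/`finally` is the main design point: both call sites would get the restore step for free, including on a failed warmup.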
tlrmchlsmth
left a comment
Looks good but @LucasWilkinson should check as well
vllm-project#41421 made RayExecutorV2 the default Ray DP executor and broke tests/distributed/test_elastic_ep.py:

- New engines collided on Ray actor names because scale_up_elastic_ep didn't append _dp{rank} to instance_id / kv_transfer_config.engine_id like _init_engines does.
- After EPLB reshuffle, the locked MoE workspace was sized for cudagraph-capture batches (~14 MB) while real traffic needed hundreds of MB.

Extract _apply_dp_identity_suffix and use it in both call sites (global dp_rank for instance_id and engine_id so multi-node DP doesn't collide).

Add ElasticEPScalingExecutor.rewarm_workspace, dispatched from _eplb_reshuffle on every DP sibling: save/clear block tables, release captured CUDA graphs, unlock workspace, run _dummy_run at max_num_tokens with skip_eplb=True so the MoE workspace grows for full-shape post-reshuffle traffic without polluting the just-rebalanced EPLB stats, re-capture cudagraphs, relock, restore block tables.

Signed-off-by: haosdent <haosdent@gmail.com>
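To illustrate why the suffix uses the global DP rank, here is a minimal sketch. It is not the helper from the PR: `ToyEngineConfig`, `apply_dp_identity_suffix`, and the `node_index * ranks_per_node + local_rank` formula are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToyEngineConfig:
    """Hypothetical stand-in for the config fields that carry engine identity."""

    instance_id: str
    kv_engine_id: Optional[str] = None


def apply_dp_identity_suffix(cfg: ToyEngineConfig, dp_rank: int) -> ToyEngineConfig:
    """Return a copy of cfg whose identifiers end in _dp{rank}."""
    suffix = f"_dp{dp_rank}"
    return replace(
        cfg,
        instance_id=cfg.instance_id + suffix,
        kv_engine_id=None if cfg.kv_engine_id is None else cfg.kv_engine_id + suffix,
    )


def global_dp_rank(node_index: int, local_rank: int, ranks_per_node: int) -> int:
    """Assumed layout: ranks are numbered contiguously node by node."""
    return node_index * ranks_per_node + local_rank


base = ToyEngineConfig(instance_id="vllm-abc", kv_engine_id="kv-abc")
placements = [(node, local) for node in range(2) for local in range(2)]

# Node-local ranks repeat across nodes, so two engines end up with the same name.
local_ids = {apply_dp_identity_suffix(base, local).instance_id for _, local in placements}
# Global ranks are unique across the whole DP group.
global_ids = {
    apply_dp_identity_suffix(base, global_dp_rank(node, local, 2)).instance_id
    for node, local in placements
}
```

With two nodes of two ranks each, the node-local variant yields only two distinct names for four engines, which is the collision the review flagged.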
LucasWilkinson
left a comment
This looks OK to me, but I'm not super familiar with elastic EP; this seems to add a lot of overhead,
i.e. are we now adding another round of recapturing cudagraphs when we weren't before?
cc @SageMoore
SageMoore
left a comment
We should not add a second capture to scale up. If this fix is an emergency to get CI green, I think it's fine to make the temporary sacrifice since scale up is not widely used. That being said if we go that route we need to get the second capture out of scale up ASAP.
Have you confirmed that this problem doesn't exist for scale_down? I could see some buffers being sized based on the number of EP ranks.
LucasWilkinson
left a comment
tracking the followup from @SageMoore here: #42107
Follow-up to vllm-project#41792. Two issues remained:

- Scale-up captured cudagraphs twice per existing engine: once in switch_and_prepare() (pre-EPLB-reshuffle) and once in rewarm_workspace (post-reshuffle). Each capture is 5-10 s of redundant work.
- Scale-down didn't get the workspace fix at all. _eplb_reshuffle() (which dispatched rewarm_workspace) is only called from scale-up paths; scale-down's _progress_remaining_engine uses a different reshuffle helper. After scale-down, per-rank M_full grows (each remaining rank serves more local experts), so the locked MoE workspace can hit the same too-small assertion that vllm-project#41792 fixed for scale-up.

Strip the warmup/capture block out of switch_and_prepare(), rename rewarm_workspace → warm_and_capture, and dispatch it once post-reshuffle on every elastic transition (scale-up existing, scale-up new, scale-down remaining). switch_and_prepare retains _release_cuda_graphs and group/MoE-config updates; warm_and_capture is now the sole capture site for elastic transitions, growing the MoE workspace via _dummy_run at max_num_tokens before cudagraph capture pins it.

Net: one capture per transition (vs. two on scale-up); scale-down now gets the workspace fix it lacked.

Signed-off-by: haosdent <haosdent@gmail.com>
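The locked-workspace failure described in this commit message can be modelled with a toy class. Everything here is illustrative: `ToyWorkspace`, `dummy_run`, the bytes-per-token constant, and the capture sizes are assumptions, not vLLM's workspace manager.

```python
class ToyWorkspace:
    """Toy model of a workspace that may grow until it is locked."""

    def __init__(self) -> None:
        self.size_bytes = 0
        self.locked = False

    def reserve(self, needed_bytes: int) -> None:
        if needed_bytes <= self.size_bytes:
            return
        if self.locked:
            raise AssertionError(
                f"workspace locked at {self.size_bytes} bytes, need {needed_bytes}"
            )
        self.size_bytes = needed_bytes

    def lock(self) -> None:
        self.locked = True

    def unlock(self) -> None:
        self.locked = False


BYTES_PER_TOKEN = 1024  # illustrative constant, not a measured value


def dummy_run(ws: ToyWorkspace, num_tokens: int) -> None:
    """Stand-in for a forward pass that needs workspace proportional to tokens."""
    ws.reserve(num_tokens * BYTES_PER_TOKEN)


def warm_and_capture(ws: ToyWorkspace, capture_sizes: list[int], max_num_tokens: int) -> None:
    """Grow the workspace at the largest shape first, then 'capture' and lock."""
    ws.unlock()
    dummy_run(ws, max_num_tokens)
    for size in capture_sizes:
        dummy_run(ws, size)  # stands in for cudagraph capture at this size
    ws.lock()


# Locking after only capture-sized runs leaves the workspace too small.
small = ToyWorkspace()
for size in (1, 8, 64):
    dummy_run(small, size)
small.lock()
try:
    dummy_run(small, 4096)
    too_small = False
except AssertionError:
    too_small = True

# Growing at max_num_tokens before locking lets full-size traffic fit.
grown = ToyWorkspace()
warm_and_capture(grown, capture_sizes=[1, 8, 64], max_num_tokens=8192)
dummy_run(grown, 4096)
```

The ordering inside `warm_and_capture` is the point of the sketch: the full-shape run has to happen before the lock, since a later request can no longer enlarge the buffer.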
@SageMoore It looks like the scale-down path may trigger the issue as well.
The follow-up PR is #42203. @SageMoore @LucasWilkinson, could you take a look when you are available?
The block I stripped in commit 712ccab ("Unify warm+capture across scale-up and scale-down") wasn't a duplicate of warm_and_capture; it was load-bearing DP-coord synchronization. New-engine workers run compile_or_warm_up_model during their own __init__ and call coordinate_batch_across_dp on every warmup size; that DP all-reduce deadlocks unless existing workers participate in lockstep via their own compile_or_warm_up_model. py-spy of a hung scale-up confirmed: existing workers idle in worker_busy_loop, new workers blocked at dp_utils.all_reduce inside compile_or_warm_up_model.

Restore the warmup block (save/clear block tables, unlock workspace, compile_or_warm_up_model, lock, restore tables). warm_and_capture in EPLB_RESHUFFLE remains: it grows the MoE workspace at max_num_tokens post-reshuffle, which compile_or_warm_up_model alone (only exercises cudagraph-capture sizes) doesn't cover.

Scale-up cost regresses to two captures per existing rank (same as vllm-project#41792); the scale-down warmup this PR added is the net new fix.

Signed-off-by: haosdent <haosdent@gmail.com>
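The hang described in this commit message is the usual failure mode of a collective that only part of the group enters. As a rough analogy only, the sketch below uses `threading.Barrier` in place of the DP all-reduce; `run_collective` is a hypothetical helper, and a timeout stands in for the hang so the example terminates.

```python
import threading


def run_collective(participants: int, group_size: int, timeout_s: float = 0.5) -> list[str]:
    """Start `participants` threads that all wait on a barrier sized for the group."""
    barrier = threading.Barrier(group_size)
    results: list[str] = []
    results_lock = threading.Lock()

    def worker() -> None:
        try:
            # Every member of the group must arrive here for anyone to proceed.
            barrier.wait(timeout=timeout_s)
            outcome = "ok"
        except threading.BrokenBarrierError:
            outcome = "timed out"
        with results_lock:
            results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(participants)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Two new workers enter the collective while two existing workers sit idle.
only_new_workers = run_collective(participants=2, group_size=4)
# All four workers enter in lockstep.
lockstep = run_collective(participants=4, group_size=4)
```

In the real system there is no timeout on the waiting side, which is why the py-spy dump showed new workers parked inside the all-reduce.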
Follow-up to vllm-project#41792 that landed `rewarm_workspace` (renamed here to `warm_and_capture`) post-EPLB-reshuffle. The lockstep PR vllm-project#41792 implicitly relied on (new-engine workers entering coordinate_batch_across_dp during their init `compile_or_warm_up_model`, existing workers matching via `switch_and_prepare`'s own compile_or_warm_up_model) was load-bearing for DP-coord but resulted in two cudagraph captures per existing rank per scale.

Break the dependency at the source: add `skip_warmup` to `Executor.initialize_from_config` and pass it through from `_initialize_kv_caches` when VLLM_ELASTIC_EP_SCALE_UP_LAUNCH is set. New engines now finish __init__ with KV cache allocated but no cudagraphs and no DP-coord all_reduce. The deferred warm+capture runs later in `warm_and_capture` (already dispatched from `_eplb_reshuffle` and `_progress_remaining_engine`), in lockstep on the new DP group across all ranks.

Net: one capture per rank per scale operation, no deadlock, workspace correctly grown to max_num_tokens before capture. EPLB rearrange uses `w[dst].copy_(b[dst], non_blocking=True)` (rebalance_execute.py:383), an in-place write that preserves .data_ptr(), so the single capture remains valid after reshuffle.

py-spy of the previously-hung scale-up confirmed the deadlock chain; the test passes locally on h20-server-1 in ~7 min (vs. timeouts before).

Signed-off-by: haosdent <haosdent@gmail.com>
Follow-up to vllm-project#41792 that landed `rewarm_workspace` (renamed in 712ccab to `warm_and_capture`) post-EPLB-reshuffle. The lockstep PR vllm-project#41792 implicitly relied on (new-engine workers entering coordinate_batch_across_dp during their init `compile_or_warm_up_model`, existing workers matching via `switch_and_prepare`'s own compile_or_warm_up_model) was load-bearing for DP-coord but cost two cudagraph captures per existing rank per scale.

Split `Executor.initialize_from_config` into two methods: KV cache setup, and `compile_or_warm_up_model` for warmup + cudagraph capture + compilation-time propagation. EngineCore now calls the second explicitly, and skips it under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH. New engines finish __init__ with KV cache allocated but no cudagraphs and no DP-coord all_reduce. The deferred warm+capture runs later in `warm_and_capture` (already dispatched from `_eplb_reshuffle` and `_progress_remaining_engine`), in lockstep on the new DP group across all ranks.

Net: one capture per rank per scale, no deadlock, workspace correctly grown to max_num_tokens before capture. EPLB rearrange uses `w[dst].copy_(b[dst], non_blocking=True)` (rebalance_execute.py:383), an in-place write that preserves .data_ptr(), so the single capture remains valid after reshuffle.

py-spy of the previously-hung scale-up confirmed the deadlock chain; the test passes locally on h20-server-1 in ~7 min (vs. timeouts before).

Signed-off-by: haosdent <haosdent@gmail.com>
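The split described here can be sketched as two explicit calls with the second gated on the environment variable. `ToyExecutor`, `initialize_kv_cache`, and `init_engine` are hypothetical names, and treating the value `"1"` as "set" is an assumption about how the flag is parsed.

```python
import os
from typing import Mapping, Optional


class ToyExecutor:
    """Records which initialization phases ran, in order."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def initialize_kv_cache(self) -> None:
        # Hypothetical name for the KV-cache half of the split.
        self.events.append("kv_cache")

    def compile_or_warm_up_model(self) -> None:
        self.events.append("warmup+capture")


def init_engine(executor: ToyExecutor, env: Optional[Mapping[str, str]] = None) -> None:
    """Always set up the KV cache; defer warmup for engines launched by scale-up."""
    env = os.environ if env is None else env
    executor.initialize_kv_cache()
    # Assumption: "1" means the engine was launched as part of an elastic scale-up.
    if env.get("VLLM_ELASTIC_EP_SCALE_UP_LAUNCH", "0") != "1":
        executor.compile_or_warm_up_model()


normal = ToyExecutor()
init_engine(normal, env={})

scaled_up = ToyExecutor()
init_engine(scaled_up, env={"VLLM_ELASTIC_EP_SCALE_UP_LAUNCH": "1"})
# The deferred warmup would run later, in lockstep with the existing ranks.
scaled_up.compile_or_warm_up_model()
```

Passing `env` explicitly keeps the sketch testable without mutating the process environment.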
Follow-up to vllm-project#41792 (53dc9f3). Two issues remained:

- Scale-up captured cudagraphs twice per existing rank: once in switch_and_prepare() (pre-EPLB-reshuffle), once in rewarm_workspace (post-reshuffle). The first capture was load-bearing (it kept existing workers in lockstep with new workers' init-time coordinate_batch_across_dp), but the duplicate cost 5-10s per existing rank.
- Scale-down didn't get the workspace fix at all. rewarm_workspace was only dispatched from _eplb_reshuffle (scale-up). After scale-down, each remaining rank serves more local experts, so the per-rank workspace requirement grows and would trip the locked-too-small assertion under real traffic.

Break the DP-coord dependency at the source:

- Split Executor.initialize_from_config into two public methods: KV cache setup, and compile_or_warm_up_model for warmup + cudagraph capture + compilation-time propagation. EngineCore calls them separately and skips the warmup under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH. New engines finish __init__ with KV cache allocated but no cudagraphs and no DP-coord all_reduce, so no deadlock.
- Strip the now-unnecessary warmup block from switch_and_prepare.
- Rename rewarm_workspace -> warm_and_capture (it is now the sole capture site for elastic transitions, not a re-warm bolt-on).
- Dispatch warm_and_capture post-reshuffle on every elastic transition: _eplb_reshuffle (scale-up, existing + new) and _progress_remaining_engine (scale-down remaining). Removing engines skip it: they are about to shut down.

Net: one cudagraph capture per rank per scale, workspace correctly grown to max_num_tokens before capture, scale-down gets the workspace fix it lacked.

EPLB rearrange uses w[dst].copy_(b[dst], non_blocking=True) (rebalance_execute.py:383), an in-place write that preserves .data_ptr(), so the single capture remains valid after reshuffle.

Closes vllm-project#42107

Tested on h20-server-0 (4xH20, DP 2->4->2 and 2->3->2 with GSM8K eval at every stage): 2 passed in 6:28. GSM8K accuracy holds within 0.011 across all post-scale stages. The previously-hung deadlock signature (30 min of shm_broadcast timeouts during cudagraph capture) is gone.

Signed-off-by: haosdent <haosdent@gmail.com>
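The claim that an in-place write keeps a single capture valid can be illustrated without a GPU. The sketch below is an analogy in plain Python: a closure plays the captured graph, a list plays the weight buffer, and `id()` plays `.data_ptr()`. None of these names come from vLLM or PyTorch.

```python
from typing import Callable


def capture(buffer: list[float]) -> Callable[[], float]:
    """Toy 'captured graph': it remembers the buffer object, not its contents."""

    def replay() -> float:
        return sum(buffer)

    return replay


weights = [1.0, 2.0, 3.0]
replay = capture(weights)
buffer_id_before = id(weights)

# In-place write (the analogue of w[dst].copy_(...)): same buffer, new contents.
weights[:] = [10.0, 20.0, 30.0]
same_buffer = id(weights) == buffer_id_before
in_place_result = replay()

# Rebinding allocates a new buffer; the capture still reads the old one.
weights = [100.0, 200.0, 300.0]
rebound_result = replay()
```

After the in-place write the replay sees the new values; after rebinding it keeps returning the stale sum, which is the situation a re-capture would otherwise be needed to fix.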
Purpose
#41421 made `RayExecutorV2` the default Ray DP executor and broke `tests/distributed/test_elastic_ep.py`: `scale_up_elastic_ep` didn't append `_dp{rank}` to `instance_id` / `kv_transfer_config.engine_id` like `_init_engines` does.
Fix: extract `_apply_dp_identity_suffix` and use it in both call sites; add `ElasticEPScalingExecutor.rewarm_workspace` (called from `_eplb_reshuffle` on every DP sibling) that runs `profile_run()` to grow the workspace at `max_num_tokens` before re-locking.
Test Plan
Test Result