[Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down #42203

Open
haosdent wants to merge 1 commit into vllm-project:main from haosdent:elastic-ep-unify-warm-capture

@haosdent haosdent commented May 10, 2026

Contributor

Purpose

Follow-up to #41792. New engines now defer warmup during init, so existing ranks no longer need a duplicate cudagraph capture to keep DP-coord in lockstep. Instead, a single warm_and_capture is dispatched post-reshuffle on every elastic transition. Net effect: one capture per rank per scale event, and scale-down gets the workspace fix it lacked. Closes #42107.

EPLB rearrange (rebalance_execute.py:383) is in-place, so the single capture stays valid.
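
Why an in-place write keeps a capture valid can be illustrated with a toy, standard-library-only sketch. It is an analogy, not vLLM or PyTorch code: `ToyCapturedGraph` is a hypothetical stand-in for a captured graph, and an `array` buffer stands in for an expert weight tensor. The point it tries to show is that a capture tied to a buffer sees new contents written in place, but never sees a freshly allocated replacement.

```python
from array import array


class ToyCapturedGraph:
    """Hypothetical stand-in for a captured graph: it keeps the buffer it saw at capture."""

    def __init__(self, weights):
        # Holding the same object is the analogue of a baked-in device pointer.
        self._weights = weights

    def replay(self):
        return sum(self._weights)


weights = array("d", [1.0, 2.0, 3.0])
addr_at_capture = weights.buffer_info()[0]  # address of the underlying buffer
graph = ToyCapturedGraph(weights)

# In-place write (the analogue of an in-place copy): same buffer, new contents.
weights[:] = array("d", [10.0, 20.0, 30.0])
addr_after_inplace = weights.buffer_info()[0]
replay_after_inplace = graph.replay()

# Rebinding allocates a new buffer; the captured graph never sees it.
weights = array("d", [100.0, 200.0, 300.0])
addr_after_rebind = weights.buffer_info()[0]
replay_after_rebind = graph.replay()

print(replay_after_inplace, replay_after_rebind)  # 60.0 60.0
```

On CPython, `addr_after_inplace` equals `addr_at_capture`, while `addr_after_rebind` differs because the old buffer is still alive inside the toy graph. Only the in-place update is visible on replay, which is the property the single capture relies on.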

Test plan

tests/distributed/test_elastic_ep.py on 4×H20, DP 2→4→2 and 2→3→2 with GSM8K at every stage:

================== 2 passed, 18 warnings in 388.70s (0:06:28) ==================
| Test | Initial | After scale up | After scale down |
| --- | --- | --- | --- |
| test_elastic_ep_scaling (2→4→2) | 0.656 | 0.664 (+0.008) | passed |
| test_elastic_ep_scaling_uneven (2→3→2) | 0.637 | 0.648 (+0.011) | passed |

@mergify mergify Bot added the bug Something isn't working label May 10, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request consolidates the warming and CUDA graph capture processes into a single "warm_and_capture" phase within the elastic execution logic. It removes redundant block table management from the "switch_and_prepare" method and ensures that both scale-up and scale-down transitions trigger this unified phase to correctly size the MoE workspace and refresh CUDA graphs for the updated topology. I have no feedback to provide as there were no review comments.

@haosdent haosdent force-pushed the elastic-ep-unify-warm-capture branch 2 times, most recently from f4064b4 to 712ccab Compare May 10, 2026 11:08
@haosdent haosdent changed the title from [WIP][Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down to [Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down May 10, 2026
@haosdent haosdent marked this pull request as ready for review May 10, 2026 11:09

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@haosdent haosdent force-pushed the elastic-ep-unify-warm-capture branch 2 times, most recently from 6c30841 to 0036b01 Compare May 11, 2026 03:03
@haosdent haosdent marked this pull request as draft May 11, 2026 03:35
@haosdent haosdent force-pushed the elastic-ep-unify-warm-capture branch from 0036b01 to c1fe7ef Compare May 11, 2026 03:43
@mergify mergify Bot added the v1 label May 11, 2026
@haosdent haosdent force-pushed the elastic-ep-unify-warm-capture branch from c1fe7ef to 89aadc6 Compare May 11, 2026 03:47
Follow-up to vllm-project#41792 (53dc9f3). Two issues remained:

- Scale-up captured cudagraphs twice per existing rank: once in
  switch_and_prepare() (pre-EPLB-reshuffle), once in rewarm_workspace
  (post-reshuffle). The first capture was load-bearing — it kept
  existing workers in lockstep with new workers' init-time
  coordinate_batch_across_dp — but the duplicate cost 5-10s per
  existing rank.
- Scale-down didn't get the workspace fix at all. rewarm_workspace was
  only dispatched from _eplb_reshuffle (scale-up). After scale-down,
  each remaining rank serves more local experts, so the per-rank
  workspace requirement grows and would trip the locked-too-small
  assertion under real traffic.
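
The lockstep constraint behind the first issue can be sketched with a toy model. This is not vLLM code: `threading.Barrier` is an assumed stand-in for a DP-wide collective such as the all_reduce inside coordinate_batch_across_dp, and `run_ranks` is a hypothetical helper. It shows that a collective only completes when every rank enters it the same number of times.

```python
import threading


def run_ranks(collective_calls, timeout=1.0):
    """Run one thread per rank; rank i enters the collective collective_calls[i] times.

    Returns True if every rank finished, False if any rank timed out waiting
    for a peer that never entered the collective.
    """
    n = len(collective_calls)
    barrier = threading.Barrier(n, timeout=timeout)
    finished = [False] * n

    def rank(idx, n_calls):
        try:
            for _ in range(n_calls):
                barrier.wait()  # toy stand-in for a DP-wide all_reduce
            finished[idx] = True
        except threading.BrokenBarrierError:
            finished[idx] = False

    threads = [
        threading.Thread(target=rank, args=(i, calls))
        for i, calls in enumerate(collective_calls)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(finished)


# Index 0 = existing rank, index 1 = newly launched rank.
matched = run_ranks([1, 1])    # existing rank captures while the new rank warms up
unmatched = run_ranks([0, 1])  # new rank warms up at init, existing rank does not
deferred = run_ranks([0, 0]) and run_ranks([1, 1])  # nobody at init, everyone later
```

In this model `matched` and `deferred` succeed while `unmatched` times out. That mirrors why the extra capture could not simply be dropped on existing ranks without also removing the init-time collective on new ranks.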

Break the DP-coord dependency at the source:

- Split Executor.initialize_from_config into two public methods: KV
  cache setup, and compile_or_warm_up_model for warmup + cudagraph
  capture + compilation-time propagation. EngineCore calls them
  separately and skips the warmup under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH.
  New engines finish __init__ with KV cache allocated but no
  cudagraphs and no DP-coord all_reduce — no deadlock.
- Strip the now-unnecessary warmup block from switch_and_prepare.
- Rename rewarm_workspace -> warm_and_capture (it is now the sole
  capture site for elastic transitions, not a re-warm bolt-on).
- Dispatch warm_and_capture post-reshuffle on every elastic transition:
  _eplb_reshuffle (scale-up, existing + new) and
  _progress_remaining_engine (scale-down remaining). Removing engines
  skip it — they are about to shut down.
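
The steps above can be sketched with toy classes. This is a simplified illustration under stated assumptions, not the actual vLLM implementation: `ToyExecutor`, `ToyEngineCore`, `scale_up`, `scale_down`, and the method name `initialize_kv_cache` are hypothetical, and treating VLLM_ELASTIC_EP_SCALE_UP_LAUNCH as a `"1"`-valued flag is an assumption. Only the names compile_or_warm_up_model and warm_and_capture come from the description.

```python
class ToyExecutor:
    def __init__(self):
        self.kv_cache_ready = False
        self.captures = 0  # number of warmup + cudagraph capture passes

    def initialize_kv_cache(self):  # hypothetical name for the KV cache half
        self.kv_cache_ready = True

    def compile_or_warm_up_model(self):
        self.captures += 1


class ToyEngineCore:
    def __init__(self, env):
        self.executor = ToyExecutor()
        self.executor.initialize_kv_cache()
        # Assumption: the flag is a simple "1"-valued environment variable.
        if env.get("VLLM_ELASTIC_EP_SCALE_UP_LAUNCH") != "1":
            self.executor.compile_or_warm_up_model()

    def warm_and_capture(self):
        self.executor.compile_or_warm_up_model()


def scale_up(existing, num_new):
    launch_env = {"VLLM_ELASTIC_EP_SCALE_UP_LAUNCH": "1"}
    new = [ToyEngineCore(launch_env) for _ in range(num_new)]
    engines = existing + new
    # (EPLB reshuffle would run here.)
    for engine in engines:  # existing + new, post-reshuffle
        engine.warm_and_capture()
    return engines


def scale_down(engines, num_remove):
    keep = len(engines) - num_remove
    remaining, removing = engines[:keep], engines[keep:]
    for engine in remaining:  # removing engines skip the capture
        engine.warm_and_capture()
    return remaining, removing


initial = [ToyEngineCore({}) for _ in range(2)]
after_up = scale_up(initial, 2)
captures_after_up = [e.executor.captures for e in after_up]
remaining, removing = scale_down(after_up, 2)
captures_remaining = [e.executor.captures for e in remaining]
captures_removing = [e.executor.captures for e in removing]

print(captures_after_up)   # [2, 2, 1, 1]
print(captures_remaining)  # [3, 3]
```

Each transition adds exactly one capture per participating rank: existing ranks go from 1 to 2 on scale-up, new ranks start at 0 and reach 1, and removing ranks are left untouched on scale-down.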

Net: one cudagraph capture per rank per scale, workspace correctly
grown to max_num_tokens before capture, scale-down gets the workspace
fix it lacked. EPLB rearrange uses
w[dst].copy_(b[dst], non_blocking=True) (rebalance_execute.py:383),
an in-place write that preserves .data_ptr(), so the single capture
remains valid after reshuffle.
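
The workspace half of this (grow to max_num_tokens before capture, and the locked-too-small assertion that scale-down would otherwise trip) can be sketched as follows. Everything here is illustrative: `ToyWorkspace`, `local_experts`, and `required_bytes` are hypothetical, the sizing formula (tokens x local experts x a fixed byte count) is an assumption rather than vLLM's real formula, and the 64-expert / 8192-token figures are made up.

```python
class ToyWorkspace:
    """Toy buffer that may grow until it is locked (e.g. once graphs are captured)."""

    def __init__(self):
        self.capacity = 0
        self.locked = False

    def reserve(self, nbytes):
        if nbytes <= self.capacity:
            return
        assert not self.locked, "workspace locked too small"
        self.capacity = nbytes


def local_experts(num_experts, ep_size):
    # Assumes experts divide evenly across ranks.
    return num_experts // ep_size


def required_bytes(num_tokens, n_local, bytes_per_token_expert=1024):
    # Assumption: the requirement scales with tokens x local experts.
    return num_tokens * n_local * bytes_per_token_expert


def warm_and_capture(num_experts, ep_size, max_num_tokens):
    """Size a fresh workspace for the current topology, then lock it."""
    ws = ToyWorkspace()
    ws.reserve(required_bytes(max_num_tokens, local_experts(num_experts, ep_size)))
    ws.locked = True
    return ws


NUM_EXPERTS, MAX_NUM_TOKENS = 64, 8192  # illustrative values only

# Workspace sized and locked while 4 ranks share the experts (16 each).
ws_at_4 = warm_and_capture(NUM_EXPERTS, 4, MAX_NUM_TOKENS)

# After scale-down to 2 ranks, each rank serves 32 experts.
need_at_2 = required_bytes(MAX_NUM_TOKENS, local_experts(NUM_EXPERTS, 2))

try:
    ws_at_4.reserve(need_at_2)  # no re-warm: the stale lock is too small
    stale_ok = True
except AssertionError:
    stale_ok = False

# With a warm_and_capture after the transition, the same request fits.
ws_at_2 = warm_and_capture(NUM_EXPERTS, 2, MAX_NUM_TOKENS)
ws_at_2.reserve(need_at_2)
```

Under these assumptions the per-rank requirement doubles when going from 4 to 2 ranks, so the workspace locked at the old size fails, while one sized after the transition does not.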

Closes vllm-project#42107

Tested on h20-server-0 (4xH20, DP 2->4->2 and 2->3->2 with GSM8K eval
at every stage): 2 passed in 6:28. GSM8K accuracy holds within 0.011
across all post-scale stages. The previously-hung deadlock signature
(30 min of shm_broadcast timeouts during cudagraph capture) is gone.

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent haosdent force-pushed the elastic-ep-unify-warm-capture branch from 89aadc6 to bfae75e Compare May 11, 2026 04:08
@haosdent haosdent marked this pull request as ready for review May 11, 2026 04:27
@haosdent haosdent requested a review from njhill as a code owner May 11, 2026 04:27
Labels

bug Something isn't working v1

Development

Successfully merging this pull request may close these issues.

[Performance]: Remove duplicate cudagraph capture in elastic EP

1 participant