[CI][Elastic EP] Fix Elastic EP Scaling Test Failure by haosdent · Pull Request #41792 · vllm-project/vllm

haosdent · 2026-05-06T07:19:46Z

Purpose

#41421 made RayExecutorV2 the default Ray DP executor and broke tests/distributed/test_elastic_ep.py:

New engines collided on Ray actor names because scale_up_elastic_ep didn't append _dp{rank} to instance_id / kv_transfer_config.engine_id like _init_engines does.
After EPLB reshuffle, the locked MoE workspace was sized for cudagraph-capture batches (~14 MB) while real traffic needed hundreds of MB.

Fix: extract _apply_dp_identity_suffix and use it in both call sites; add ElasticEPScalingExecutor.rewarm_workspace (called from _eplb_reshuffle on every DP sibling) that runs profile_run() to grow the workspace at max_num_tokens before re-locking.

Test Plan

Test Result

gemini-code-assist

Code Review

This pull request introduces a workspace rewarming mechanism in the elastic EP to manage CUDA graph lifecycle during MoE workspace resizing and refactors the DP identity suffixing logic into a shared utility. The review feedback highlights a potential ID collision issue in multi-node environments, suggesting the use of global DP ranks instead of node-local ranks for engine identification, along with corresponding signature updates for the new helper function.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify · 2026-05-06T07:27:27Z

Hi @haosdent, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

NickLucche

thanks for looking into this @haosdent . I wonder if the engine_id change here will happen after the tp-sync I am adding here #39907 (tp ranks may get different engine_ids across numa nodes currently)

haosdent · 2026-05-06T07:59:54Z

Hi, thanks @NickLucche I checked your changes in #39907 , it looks like able to work together

══ [API server process] ════════════════════════════════════════════

  KVTransferConfig.__post_init__()              # engine_id = uuid4()

  _init_engines / scale_up_elastic_ep:
    cfg = deepcopy(vllm_config)                 # per DP engine
    _apply_dp_identity_suffix(cfg, dp_rank)     # ← this PR
        cfg.engine_id = f"{UUID}_dp{dp_rank}"
    ray.remote(...).remote(vllm_config=cfg)     # ship to actor

══ [Ray actor: DP engine r]   (one per DP rank) ════════════════════

  inherits cfg.engine_id = f"{UUID}_dp{r}"
  # __post_init__ does NOT re-fire on unpickle

══ [Worker process: DP=r, TP=t] ════════════════════════════════════

  # multi-node TP setups re-construct config on the remote node
  if config reconstructed on this node:
      __post_init__() re-fires
      cfg.engine_id = uuid4()                   # drift!

  ensure_kv_transfer_initialized():
    _sync_engine_id_across_tp(cfg)              # ← your fix (#39907)
        cfg.engine_id = broadcast(rank0.engine_id, group=tp_group)

  # post-sync: every TP worker in DP r matches its rank 0

In short, it should be overridden by your changes if it belongs to the same DP

NickLucche · 2026-05-06T08:12:22Z

just one thing

if config got reconstructed on this node: # rare: only on some multi-node TP setups

I wouldn't say it's rare with gb series and nvl72 anymore

haosdent · 2026-05-06T08:15:10Z

I wouldn't say it's rare with gb series and nvl72 anymore

Indeed, updated.

NickLucche · 2026-05-06T12:14:03Z

@haosdent might be related

(EngineCore pid=5926)     assert self.local_rank < torch.accelerator.device_count(), (
--
(EngineCore pid=5926)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=5926) AssertionError: DP adjusted local rank 0 is out of bounds.

do you have permissions to re-run CI here?

haosdent · 2026-05-06T14:38:27Z

Didn't notice the CI failure before, let me fix it

haosdent · 2026-05-07T10:39:05Z

Seems fail on unrelated test cases, let me try to re-trigger

tlrmchlsmth · 2026-05-07T14:17:30Z

+        # Save and clear block tables so profile_run/compile_or_warm_up_model
+        # don't write dummy slot mappings into real KV-cache blocks (mirrors
+        # switch_and_prepare's pattern).
+        multi_block_table = self.worker.model_runner.input_batch.block_table
+        saved_block_tables: list[tuple[torch.Tensor, torch.Tensor]] = []
+        for bt in multi_block_table.block_tables:
+            saved_block_tables.append(
+                (bt.block_table.gpu.clone(), bt.block_table.cpu.clone())
+            )
+        multi_block_table.clear()


Is there any model runner state that we're missing here?

For the code at here, I duplicated it from https://github.com/haosdent/vllm/blob/53dc9f3923a8438b7fd31f4c348a59e17894a453/vllm/distributed/elastic_ep/elastic_execute.py#L435-L451

Or we add something like model_runner.prepare_for_warmup() , restore_after_warmup() to avoid duplicate?

tlrmchlsmth

Looks good but @LucasWilkinson should check as well

vllm-project#41421 made RayExecutorV2 the default Ray DP executor and broke tests/distributed/test_elastic_ep.py: - New engines collided on Ray actor names because scale_up_elastic_ep didn't append _dp{rank} to instance_id / kv_transfer_config.engine_id like _init_engines does. - After EPLB reshuffle, the locked MoE workspace was sized for cudagraph-capture batches (~14 MB) while real traffic needed hundreds of MB. Extract _apply_dp_identity_suffix and use it in both call sites (global dp_rank for instance_id and engine_id so multi-node DP doesn't collide). Add ElasticEPScalingExecutor.rewarm_workspace dispatched from _eplb_reshuffle on every DP sibling: save/clear block tables, release captured CUDA graphs, unlock workspace, run _dummy_run at max_num_tokens with skip_eplb=True so the MoE workspace grows for full-shape post- reshuffle traffic without polluting the just-rebalanced EPLB stats, re-capture cudagraphs, relock, restore block tables. Signed-off-by: haosdent <haosdent@gmail.com>

LucasWilkinson

This looks ok to mean but im not super familiar with elastic EP; this seems to add alot of overhead

i.e. are we now adding another round of recapturing cudagraphs when we weren't before?

cc @SageMoore

SageMoore

We should not add a second capture to scale up. If this fix is an emergency to get CI green, I think it's fine to make the temporary sacrifice since scale up is not widely used. That being said if we go that route we need to get the second capture out of scale up ASAP.

Have you confirmed that this problem doesn't exist for scale_down? I could see some buffers being sized based on the number of EP ranks.

LucasWilkinson

tracking the followup from @SageMoore here: #42107

Follow-up to vllm-project#41792. Two issues remained: - Scale-up captured cudagraphs twice per existing engine: once in switch_and_prepare() (pre-EPLB-reshuffle) and once in rewarm_workspace (post-reshuffle). Each capture is 5-10 s of redundant work. - Scale-down didn't get the workspace fix at all. _eplb_reshuffle() (which dispatched rewarm_workspace) is only called from scale-up paths; scale-down's _progress_remaining_engine uses a different reshuffle helper. After scale-down, per-rank M_full grows (each remaining rank serves more local experts), so the locked MoE workspace can hit the same too-small assertion that vllm-project#41792 fixed for scale-up. Strip the warmup/capture block out of switch_and_prepare(), rename rewarm_workspace → warm_and_capture, and dispatch it once post-reshuffle on every elastic transition (scale-up existing, scale-up new, scale-down remaining). switch_and_prepare retains _release_cuda_graphs and group/ MoE-config updates; warm_and_capture is now the sole capture site for elastic transitions, growing the MoE workspace via _dummy_run at max_num_tokens before cudagraph capture pins it. Net: one capture per transition (vs. two on scale-up); scale-down now gets the workspace fix it lacked. Signed-off-by: haosdent <haosdent@gmail.com>

haosdent · 2026-05-10T10:32:47Z

Have you confirmed that this problem doesn't exist for scale_down? I could see some buffers being sized based on the number of EP ranks.

@SageMoore The scale-down path looks may trigger the issue as well

Follow-up to vllm-project#41792. Two issues remained: - Scale-up captured cudagraphs twice per existing engine: once in switch_and_prepare() (pre-EPLB-reshuffle) and once in rewarm_workspace (post-reshuffle). Each capture is 5-10 s of redundant work. - Scale-down didn't get the workspace fix at all. _eplb_reshuffle() (which dispatched rewarm_workspace) is only called from scale-up paths; scale-down's _progress_remaining_engine uses a different reshuffle helper. After scale-down, per-rank M_full grows (each remaining rank serves more local experts), so the locked MoE workspace can hit the same too-small assertion that vllm-project#41792 fixed for scale-up. Strip the warmup/capture block out of switch_and_prepare(), rename rewarm_workspace → warm_and_capture, and dispatch it once post-reshuffle on every elastic transition (scale-up existing, scale-up new, scale-down remaining). switch_and_prepare retains _release_cuda_graphs and group/ MoE-config updates; warm_and_capture is now the sole capture site for elastic transitions, growing the MoE workspace via _dummy_run at max_num_tokens before cudagraph capture pins it. Net: one capture per transition (vs. two on scale-up); scale-down now gets the workspace fix it lacked. Signed-off-by: haosdent <haosdent@gmail.com>

haosdent · 2026-05-10T11:10:30Z

The follow-up PR is #42203 @SageMoore @LucasWilkinson could take a look when you are available

The block I stripped in commit 712ccab ("Unify warm+capture across scale-up and scale-down") wasn't a duplicate of warm_and_capture — it was load-bearing DP-coord synchronization. New-engine workers run compile_or_warm_up_model during their own __init__ and call coordinate_batch_across_dp on every warmup size; that DP all-reduce deadlocks unless existing workers participate in lockstep via their own compile_or_warm_up_model. py-spy of a hung scale-up confirmed: existing workers idle in worker_busy_loop, new workers blocked at dp_utils.all_reduce inside compile_or_warm_up_model. Restore the warmup block (save/clear block tables, unlock workspace, compile_or_warm_up_model, lock, restore tables). warm_and_capture in EPLB_RESHUFFLE remains — it grows the MoE workspace at max_num_tokens post-reshuffle, which compile_or_warm_up_model alone (only exercises cudagraph-capture sizes) doesn't cover. Scale-up cost regresses to two captures per existing rank (same as vllm-project#41792); the scale-down warmup this PR added is the net new fix. Signed-off-by: haosdent <haosdent@gmail.com>

Follow-up to vllm-project#41792 that landed `rewarm_workspace` (renamed here to `warm_and_capture`) post-EPLB-reshuffle. The lockstep PR vllm-project#41792 implicitly relied on — new-engine workers entering coordinate_batch_across_dp during their init `compile_or_warm_up_model`, existing workers matching via `switch_and_prepare`'s own compile_or_warm_up_model — was load-bearing for DP-coord but resulted in two cudagraph captures per existing rank per scale. Break the dependency at the source: add `skip_warmup` to `Executor.initialize_from_config` and pass it through from `_initialize_kv_caches` when VLLM_ELASTIC_EP_SCALE_UP_LAUNCH is set. New engines now finish __init__ with KV cache allocated but no cudagraphs and no DP-coord all_reduce. The deferred warm+capture runs later in `warm_and_capture` (already dispatched from `_eplb_reshuffle` and `_progress_remaining_engine`), in lockstep on the new DP group across all ranks. Net: one capture per rank per scale operation, no deadlock, workspace correctly grown to max_num_tokens before capture. EPLB rearrange uses `w[dst].copy_(b[dst], non_blocking=True)` (rebalance_execute.py:383), an in-place write that preserves .data_ptr(), so the single capture remains valid after reshuffle. py-spy of the previously-hung scale-up confirmed the deadlock chain; the test passes locally on h20-server-1 in ~7 min (vs. timeouts before). Signed-off-by: haosdent <haosdent@gmail.com>

Follow-up to vllm-project#41792 that landed `rewarm_workspace` (renamed in 712ccab to `warm_and_capture`) post-EPLB-reshuffle. The lockstep PR vllm-project#41792 implicitly relied on — new-engine workers entering coordinate_batch_across_dp during their init `compile_or_warm_up_model`, existing workers matching via `switch_and_prepare`'s own compile_or_warm_up_model — was load-bearing for DP-coord but cost two cudagraph captures per existing rank per scale. Split `Executor.initialize_from_config` into two methods: KV cache setup, and `compile_or_warm_up_model` for warmup + cudagraph capture + compilation-time propagation. EngineCore now calls the second explicitly, and skips it under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH. New engines finish __init__ with KV cache allocated but no cudagraphs and no DP-coord all_reduce. The deferred warm+capture runs later in `warm_and_capture` (already dispatched from `_eplb_reshuffle` and `_progress_remaining_engine`), in lockstep on the new DP group across all ranks. Net: one capture per rank per scale, no deadlock, workspace correctly grown to max_num_tokens before capture. EPLB rearrange uses `w[dst].copy_(b[dst], non_blocking=True)` (rebalance_execute.py:383), an in-place write that preserves .data_ptr(), so the single capture remains valid after reshuffle. py-spy of the previously-hung scale-up confirmed the deadlock chain; the test passes locally on h20-server-1 in ~7 min (vs. timeouts before). Signed-off-by: haosdent <haosdent@gmail.com>

Follow-up to vllm-project#41792 (53dc9f3). Two issues remained: - Scale-up captured cudagraphs twice per existing rank: once in switch_and_prepare() (pre-EPLB-reshuffle), once in rewarm_workspace (post-reshuffle). The first capture was load-bearing — it kept existing workers in lockstep with new workers' init-time coordinate_batch_across_dp — but the duplicate cost 5-10s per existing rank. - Scale-down didn't get the workspace fix at all. rewarm_workspace was only dispatched from _eplb_reshuffle (scale-up). After scale-down, each remaining rank serves more local experts, so the per-rank workspace requirement grows and would trip the locked-too-small assertion under real traffic. Break the DP-coord dependency at the source: - Split Executor.initialize_from_config into two public methods: KV cache setup, and compile_or_warm_up_model for warmup + cudagraph capture + compilation-time propagation. EngineCore calls them separately and skips the warmup under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH. New engines finish __init__ with KV cache allocated but no cudagraphs and no DP-coord all_reduce — no deadlock. - Strip the now-unnecessary warmup block from switch_and_prepare. - Rename rewarm_workspace -> warm_and_capture (it is now the sole capture site for elastic transitions, not a re-warm bolt-on). - Dispatch warm_and_capture post-reshuffle on every elastic transition: _eplb_reshuffle (scale-up, existing + new) and _progress_remaining_engine (scale-down remaining). Removing engines skip it — they are about to shut down. Net: one cudagraph capture per rank per scale, workspace correctly grown to max_num_tokens before capture, scale-down gets the workspace fix it lacked. EPLB rearrange uses w[dst].copy_(b[dst], non_blocking=True) (rebalance_execute.py:383), an in-place write that preserves .data_ptr(), so the single capture remains valid after reshuffle. Closes vllm-project#42107 Tested on h20-server-0 (4xH20, DP 2->4->2 and 2->3->2 with GSM8K eval at every stage): 2 passed in 6:28. GSM8K accuracy holds within 0.011 across all post-scale stages. The previously-hung deadlock signature (30 min of shm_broadcast timeouts during cudagraph capture) is gone. Signed-off-by: haosdent <haosdent@gmail.com>

) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

mergify Bot added v1 bug Something isn't working labels May 6, 2026

haosdent changed the title ~~[WIP][Bugfix][Elastic EP] Fix scale-up under RayExecutorV2~~ [CI][Elastic EP] Fix Elastic EP Scaling Test Failure May 6, 2026

gemini-code-assist Bot reviewed May 6, 2026

View reviewed changes

Comment thread vllm/v1/engine/utils.py Outdated

Comment thread vllm/v1/engine/utils.py Outdated

Comment thread vllm/v1/engine/utils.py Outdated

haosdent marked this pull request as ready for review May 6, 2026 07:26

haosdent requested a review from njhill as a code owner May 6, 2026 07:26

claude Bot reviewed May 6, 2026

View reviewed changes

NickLucche reviewed May 6, 2026

View reviewed changes

NickLucche added the ready ONLY add when PR is ready to merge/full CI is needed label May 6, 2026

haosdent force-pushed the ci-8b9b36c0 branch 3 times, most recently from 3bbf260 to a9d72aa Compare May 7, 2026 02:17

tlrmchlsmth reviewed May 7, 2026

View reviewed changes

tlrmchlsmth approved these changes May 7, 2026

View reviewed changes

haosdent force-pushed the ci-8b9b36c0 branch from 11b6e77 to 53dc9f3 Compare May 7, 2026 14:26

haosdent and others added 3 commits May 8, 2026 11:54

Merge branch 'main' into ci-8b9b36c0

a7a354a

Merge branch 'main' into ci-8b9b36c0

1b80767

Merge branch 'main' into ci-8b9b36c0

244a308

LucasWilkinson reviewed May 8, 2026

View reviewed changes

SageMoore reviewed May 8, 2026

View reviewed changes

LucasWilkinson mentioned this pull request May 8, 2026

[Performance]: Remove duplicate cudagraph capture in elastic EP #42107

Open

LucasWilkinson approved these changes May 8, 2026

View reviewed changes

robertgshaw2-redhat merged commit 44e6b44 into vllm-project:main May 8, 2026
61 of 67 checks passed

haosdent mentioned this pull request May 10, 2026

[Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down #42203

Open

weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure (vllm-project#41792

1d91d89

) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure (vllm-project#41792

1d282f3

) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure (vllm-project#41792

34f271e

) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure (vllm-project#41792

77dc1f3

) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

Uh oh!

Conversation

haosdent commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

mergify Bot commented May 6, 2026

Uh oh!

NickLucche left a comment

Choose a reason for hiding this comment

Uh oh!

haosdent commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NickLucche commented May 6, 2026

Uh oh!

haosdent commented May 6, 2026

Uh oh!

NickLucche commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

haosdent commented May 6, 2026

Uh oh!

haosdent commented May 7, 2026

Uh oh!

tlrmchlsmth May 7, 2026

Choose a reason for hiding this comment

Uh oh!

haosdent May 7, 2026

Choose a reason for hiding this comment

Uh oh!

tlrmchlsmth left a comment

Choose a reason for hiding this comment

Uh oh!

LucasWilkinson left a comment

Choose a reason for hiding this comment

Uh oh!

SageMoore left a comment

Choose a reason for hiding this comment

Uh oh!

LucasWilkinson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

haosdent commented May 10, 2026

Uh oh!

haosdent commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

haosdent commented May 6, 2026 •

edited

Loading

haosdent commented May 6, 2026 •

edited

Loading

NickLucche commented May 6, 2026 •

edited

Loading