Skip to content

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure #41792

Merged
robertgshaw2-redhat merged 4 commits into
vllm-project:mainfrom
haosdent:ci-8b9b36c0
May 8, 2026
Merged

[CI][Elastic EP] Fix Elastic EP Scaling Test Failure #41792
robertgshaw2-redhat merged 4 commits into
vllm-project:mainfrom
haosdent:ci-8b9b36c0

Conversation

@haosdent

@haosdent haosdent commented May 6, 2026

Copy link
Copy Markdown
Contributor

Purpose

#41421 made RayExecutorV2 the default Ray DP executor and broke tests/distributed/test_elastic_ep.py:

  • New engines collided on Ray actor names because scale_up_elastic_ep didn't append _dp{rank} to instance_id / kv_transfer_config.engine_id like _init_engines does.
  • After EPLB reshuffle, the locked MoE workspace was sized for cudagraph-capture batches (~14 MB) while real traffic needed hundreds of MB.

Fix: extract _apply_dp_identity_suffix and use it in both call sites; add ElasticEPScalingExecutor.rewarm_workspace (called from _eplb_reshuffle on every DP sibling) that runs profile_run() to grow the workspace at max_num_tokens before re-locking.

Test Plan

Test Result

@mergify mergify Bot added v1 bug Something isn't working labels May 6, 2026
@haosdent haosdent changed the title [WIP][Bugfix][Elastic EP] Fix scale-up under RayExecutorV2 [CI][Elastic EP] Fix Elastic EP Scaling Test Failure May 6, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workspace rewarming mechanism in the elastic EP to manage CUDA graph lifecycle during MoE workspace resizing and refactors the DP identity suffixing logic into a shared utility. The review feedback highlights a potential ID collision issue in multi-node environments, suggesting the use of global DP ranks instead of node-local ranks for engine identification, along with corresponding signature updates for the new helper function.

Comment thread vllm/v1/engine/utils.py Outdated
Comment thread vllm/v1/engine/utils.py Outdated
Comment thread vllm/v1/engine/utils.py Outdated
@haosdent haosdent marked this pull request as ready for review May 6, 2026 07:26
@haosdent haosdent requested a review from njhill as a code owner May 6, 2026 07:26

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify

mergify Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Hi @haosdent, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking into this @haosdent . I wonder if the engine_id change here will happen after the tp-sync I am adding here #39907 (tp ranks may get different engine_ids across numa nodes currently)

@haosdent

haosdent commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Hi, thanks @NickLucche I checked your changes in #39907 , it looks like able to work together

══ [API server process] ════════════════════════════════════════════

  KVTransferConfig.__post_init__()              # engine_id = uuid4()

  _init_engines / scale_up_elastic_ep:
    cfg = deepcopy(vllm_config)                 # per DP engine
    _apply_dp_identity_suffix(cfg, dp_rank)     # ← this PR
        cfg.engine_id = f"{UUID}_dp{dp_rank}"
    ray.remote(...).remote(vllm_config=cfg)     # ship to actor

══ [Ray actor: DP engine r]   (one per DP rank) ════════════════════

  inherits cfg.engine_id = f"{UUID}_dp{r}"
  # __post_init__ does NOT re-fire on unpickle

══ [Worker process: DP=r, TP=t] ════════════════════════════════════

  # multi-node TP setups re-construct config on the remote node
  if config reconstructed on this node:
      __post_init__() re-fires
      cfg.engine_id = uuid4()                   # drift!

  ensure_kv_transfer_initialized():
    _sync_engine_id_across_tp(cfg)              # ← your fix (#39907)
        cfg.engine_id = broadcast(rank0.engine_id, group=tp_group)

  # post-sync: every TP worker in DP r matches its rank 0

In short, it should be overridden by your changes if it belongs to the same DP

@NickLucche

Copy link
Copy Markdown
Member

just one thing

if config got reconstructed on this node: # rare: only on some multi-node TP setups

I wouldn't say it's rare with gb series and nvl72 anymore

@haosdent

haosdent commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

I wouldn't say it's rare with gb series and nvl72 anymore

Indeed, updated.

@NickLucche NickLucche added the ready ONLY add when PR is ready to merge/full CI is needed label May 6, 2026
@NickLucche

NickLucche commented May 6, 2026

Copy link
Copy Markdown
Member

@haosdent might be related

(EngineCore pid=5926)     assert self.local_rank < torch.accelerator.device_count(), (
--
(EngineCore pid=5926)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=5926) AssertionError: DP adjusted local rank 0 is out of bounds.

do you have permissions to re-run CI here?

@haosdent

haosdent commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Didn't notice the CI failure before, let me fix it

@haosdent haosdent force-pushed the ci-8b9b36c0 branch 3 times, most recently from 3bbf260 to a9d72aa Compare May 7, 2026 02:17
@haosdent

haosdent commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Seems fail on unrelated test cases, let me try to re-trigger

Comment on lines +575 to +584
# Save and clear block tables so profile_run/compile_or_warm_up_model
# don't write dummy slot mappings into real KV-cache blocks (mirrors
# switch_and_prepare's pattern).
multi_block_table = self.worker.model_runner.input_batch.block_table
saved_block_tables: list[tuple[torch.Tensor, torch.Tensor]] = []
for bt in multi_block_table.block_tables:
saved_block_tables.append(
(bt.block_table.gpu.clone(), bt.block_table.cpu.clone())
)
multi_block_table.clear()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any model runner state that we're missing here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the code at here, I duplicated it from https://github.com/haosdent/vllm/blob/53dc9f3923a8438b7fd31f4c348a59e17894a453/vllm/distributed/elastic_ep/elastic_execute.py#L435-L451

Or we add something like model_runner.prepare_for_warmup() , restore_after_warmup() to avoid duplicate?

@tlrmchlsmth tlrmchlsmth left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but @LucasWilkinson should check as well

vllm-project#41421 made RayExecutorV2 the default Ray DP executor and broke
tests/distributed/test_elastic_ep.py:

- New engines collided on Ray actor names because scale_up_elastic_ep
  didn't append _dp{rank} to instance_id / kv_transfer_config.engine_id
  like _init_engines does.
- After EPLB reshuffle, the locked MoE workspace was sized for
  cudagraph-capture batches (~14 MB) while real traffic needed hundreds
  of MB.

Extract _apply_dp_identity_suffix and use it in both call sites
(global dp_rank for instance_id and engine_id so multi-node DP doesn't
collide). Add ElasticEPScalingExecutor.rewarm_workspace dispatched from
_eplb_reshuffle on every DP sibling: save/clear block tables, release
captured CUDA graphs, unlock workspace, run _dummy_run at max_num_tokens
with skip_eplb=True so the MoE workspace grows for full-shape post-
reshuffle traffic without polluting the just-rebalanced EPLB stats,
re-capture cudagraphs, relock, restore block tables.

Signed-off-by: haosdent <haosdent@gmail.com>

@LucasWilkinson LucasWilkinson left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok to mean but im not super familiar with elastic EP; this seems to add alot of overhead

i.e. are we now adding another round of recapturing cudagraphs when we weren't before?

cc @SageMoore

@SageMoore SageMoore left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not add a second capture to scale up. If this fix is an emergency to get CI green, I think it's fine to make the temporary sacrifice since scale up is not widely used. That being said if we go that route we need to get the second capture out of scale up ASAP.

Have you confirmed that this problem doesn't exist for scale_down? I could see some buffers being sized based on the number of EP ranks.

@LucasWilkinson LucasWilkinson left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracking the followup from @SageMoore here: #42107

@robertgshaw2-redhat robertgshaw2-redhat merged commit 44e6b44 into vllm-project:main May 8, 2026
61 of 67 checks passed
haosdent added a commit to haosdent/vllm that referenced this pull request May 10, 2026
Follow-up to vllm-project#41792. Two issues remained:

- Scale-up captured cudagraphs twice per existing engine: once in
  switch_and_prepare() (pre-EPLB-reshuffle) and once in rewarm_workspace
  (post-reshuffle). Each capture is 5-10 s of redundant work.
- Scale-down didn't get the workspace fix at all. _eplb_reshuffle() (which
  dispatched rewarm_workspace) is only called from scale-up paths;
  scale-down's _progress_remaining_engine uses a different reshuffle
  helper. After scale-down, per-rank M_full grows (each remaining rank
  serves more local experts), so the locked MoE workspace can hit the
  same too-small assertion that vllm-project#41792 fixed for scale-up.

Strip the warmup/capture block out of switch_and_prepare(), rename
rewarm_workspace → warm_and_capture, and dispatch it once post-reshuffle
on every elastic transition (scale-up existing, scale-up new, scale-down
remaining). switch_and_prepare retains _release_cuda_graphs and group/
MoE-config updates; warm_and_capture is now the sole capture site for
elastic transitions, growing the MoE workspace via _dummy_run at
max_num_tokens before cudagraph capture pins it.

Net: one capture per transition (vs. two on scale-up); scale-down now
gets the workspace fix it lacked.

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent

Copy link
Copy Markdown
Contributor Author

Have you confirmed that this problem doesn't exist for scale_down? I could see some buffers being sized based on the number of EP ranks.

@SageMoore The scale-down path looks may trigger the issue as well

haosdent added a commit to haosdent/vllm that referenced this pull request May 10, 2026
Follow-up to vllm-project#41792. Two issues remained:

- Scale-up captured cudagraphs twice per existing engine: once in
  switch_and_prepare() (pre-EPLB-reshuffle) and once in rewarm_workspace
  (post-reshuffle). Each capture is 5-10 s of redundant work.
- Scale-down didn't get the workspace fix at all. _eplb_reshuffle() (which
  dispatched rewarm_workspace) is only called from scale-up paths;
  scale-down's _progress_remaining_engine uses a different reshuffle
  helper. After scale-down, per-rank M_full grows (each remaining rank
  serves more local experts), so the locked MoE workspace can hit the
  same too-small assertion that vllm-project#41792 fixed for scale-up.

Strip the warmup/capture block out of switch_and_prepare(), rename
rewarm_workspace → warm_and_capture, and dispatch it once post-reshuffle
on every elastic transition (scale-up existing, scale-up new, scale-down
remaining). switch_and_prepare retains _release_cuda_graphs and group/
MoE-config updates; warm_and_capture is now the sole capture site for
elastic transitions, growing the MoE workspace via _dummy_run at
max_num_tokens before cudagraph capture pins it.

Net: one capture per transition (vs. two on scale-up); scale-down now
gets the workspace fix it lacked.

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent

Copy link
Copy Markdown
Contributor Author

The follow-up PR is #42203 @SageMoore @LucasWilkinson could take a look when you are available

haosdent added a commit to haosdent/vllm that referenced this pull request May 11, 2026
The block I stripped in commit 712ccab ("Unify warm+capture across
scale-up and scale-down") wasn't a duplicate of warm_and_capture — it
was load-bearing DP-coord synchronization. New-engine workers run
compile_or_warm_up_model during their own __init__ and call
coordinate_batch_across_dp on every warmup size; that DP all-reduce
deadlocks unless existing workers participate in lockstep via their
own compile_or_warm_up_model. py-spy of a hung scale-up confirmed:
existing workers idle in worker_busy_loop, new workers blocked at
dp_utils.all_reduce inside compile_or_warm_up_model.

Restore the warmup block (save/clear block tables, unlock workspace,
compile_or_warm_up_model, lock, restore tables). warm_and_capture in
EPLB_RESHUFFLE remains — it grows the MoE workspace at max_num_tokens
post-reshuffle, which compile_or_warm_up_model alone (only exercises
cudagraph-capture sizes) doesn't cover. Scale-up cost regresses to two
captures per existing rank (same as vllm-project#41792); the scale-down warmup
this PR added is the net new fix.

Signed-off-by: haosdent <haosdent@gmail.com>
haosdent added a commit to haosdent/vllm that referenced this pull request May 11, 2026
The block I stripped in commit 712ccab ("Unify warm+capture across
scale-up and scale-down") wasn't a duplicate of warm_and_capture — it
was load-bearing DP-coord synchronization. New-engine workers run
compile_or_warm_up_model during their own __init__ and call
coordinate_batch_across_dp on every warmup size; that DP all-reduce
deadlocks unless existing workers participate in lockstep via their
own compile_or_warm_up_model. py-spy of a hung scale-up confirmed:
existing workers idle in worker_busy_loop, new workers blocked at
dp_utils.all_reduce inside compile_or_warm_up_model.

Restore the warmup block (save/clear block tables, unlock workspace,
compile_or_warm_up_model, lock, restore tables). warm_and_capture in
EPLB_RESHUFFLE remains — it grows the MoE workspace at max_num_tokens
post-reshuffle, which compile_or_warm_up_model alone (only exercises
cudagraph-capture sizes) doesn't cover. Scale-up cost regresses to two
captures per existing rank (same as vllm-project#41792); the scale-down warmup
this PR added is the net new fix.

Signed-off-by: haosdent <haosdent@gmail.com>
haosdent added a commit to haosdent/vllm that referenced this pull request May 11, 2026
Follow-up to vllm-project#41792 that landed `rewarm_workspace` (renamed here to
`warm_and_capture`) post-EPLB-reshuffle. The lockstep PR vllm-project#41792
implicitly relied on — new-engine workers entering
coordinate_batch_across_dp during their init `compile_or_warm_up_model`,
existing workers matching via `switch_and_prepare`'s own
compile_or_warm_up_model — was load-bearing for DP-coord but resulted
in two cudagraph captures per existing rank per scale.

Break the dependency at the source: add `skip_warmup` to
`Executor.initialize_from_config` and pass it through from
`_initialize_kv_caches` when VLLM_ELASTIC_EP_SCALE_UP_LAUNCH is set.
New engines now finish __init__ with KV cache allocated but no
cudagraphs and no DP-coord all_reduce. The deferred warm+capture runs
later in `warm_and_capture` (already dispatched from
`_eplb_reshuffle` and `_progress_remaining_engine`), in lockstep on
the new DP group across all ranks. Net: one capture per rank per
scale operation, no deadlock, workspace correctly grown to
max_num_tokens before capture.

EPLB rearrange uses `w[dst].copy_(b[dst], non_blocking=True)`
(rebalance_execute.py:383), an in-place write that preserves
.data_ptr(), so the single capture remains valid after reshuffle.

py-spy of the previously-hung scale-up confirmed the deadlock chain;
the test passes locally on h20-server-1 in ~7 min (vs. timeouts before).

Signed-off-by: haosdent <haosdent@gmail.com>
haosdent added a commit to haosdent/vllm that referenced this pull request May 11, 2026
Follow-up to vllm-project#41792 that landed `rewarm_workspace` (renamed in
712ccab to `warm_and_capture`) post-EPLB-reshuffle. The lockstep
PR vllm-project#41792 implicitly relied on — new-engine workers entering
coordinate_batch_across_dp during their init `compile_or_warm_up_model`,
existing workers matching via `switch_and_prepare`'s own
compile_or_warm_up_model — was load-bearing for DP-coord but cost two
cudagraph captures per existing rank per scale.

Split `Executor.initialize_from_config` into two methods: KV cache
setup, and `compile_or_warm_up_model` for warmup + cudagraph capture +
compilation-time propagation. EngineCore now calls the second
explicitly, and skips it under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH. New
engines finish __init__ with KV cache allocated but no cudagraphs and
no DP-coord all_reduce. The deferred warm+capture runs later in
`warm_and_capture` (already dispatched from `_eplb_reshuffle` and
`_progress_remaining_engine`), in lockstep on the new DP group across
all ranks. Net: one capture per rank per scale, no deadlock, workspace
correctly grown to max_num_tokens before capture.

EPLB rearrange uses `w[dst].copy_(b[dst], non_blocking=True)`
(rebalance_execute.py:383), an in-place write that preserves
.data_ptr(), so the single capture remains valid after reshuffle.

py-spy of the previously-hung scale-up confirmed the deadlock chain;
the test passes locally on h20-server-1 in ~7 min (vs. timeouts before).

Signed-off-by: haosdent <haosdent@gmail.com>
haosdent added a commit to haosdent/vllm that referenced this pull request May 11, 2026
Follow-up to vllm-project#41792 (53dc9f3). Two issues remained:

- Scale-up captured cudagraphs twice per existing rank: once in
  switch_and_prepare() (pre-EPLB-reshuffle), once in rewarm_workspace
  (post-reshuffle). The first capture was load-bearing — it kept
  existing workers in lockstep with new workers' init-time
  coordinate_batch_across_dp — but the duplicate cost 5-10s per
  existing rank.
- Scale-down didn't get the workspace fix at all. rewarm_workspace was
  only dispatched from _eplb_reshuffle (scale-up). After scale-down,
  each remaining rank serves more local experts, so the per-rank
  workspace requirement grows and would trip the locked-too-small
  assertion under real traffic.

Break the DP-coord dependency at the source:

- Split Executor.initialize_from_config into two public methods: KV
  cache setup, and compile_or_warm_up_model for warmup + cudagraph
  capture + compilation-time propagation. EngineCore calls them
  separately and skips the warmup under VLLM_ELASTIC_EP_SCALE_UP_LAUNCH.
  New engines finish __init__ with KV cache allocated but no
  cudagraphs and no DP-coord all_reduce — no deadlock.
- Strip the now-unnecessary warmup block from switch_and_prepare.
- Rename rewarm_workspace -> warm_and_capture (it is now the sole
  capture site for elastic transitions, not a re-warm bolt-on).
- Dispatch warm_and_capture post-reshuffle on every elastic transition:
  _eplb_reshuffle (scale-up, existing + new) and
  _progress_remaining_engine (scale-down remaining). Removing engines
  skip it — they are about to shut down.

Net: one cudagraph capture per rank per scale, workspace correctly
grown to max_num_tokens before capture, scale-down gets the workspace
fix it lacked. EPLB rearrange uses
w[dst].copy_(b[dst], non_blocking=True) (rebalance_execute.py:383),
an in-place write that preserves .data_ptr(), so the single capture
remains valid after reshuffle.

Closes vllm-project#42107

Tested on h20-server-0 (4xH20, DP 2->4->2 and 2->3->2 with GSM8K eval
at every stage): 2 passed in 6:28. GSM8K accuracy holds within 0.011
across all post-scale stages. The previously-hung deadlock signature
(30 min of shm_broadcast timeouts during cudagraph capture) is gone.

Signed-off-by: haosdent <haosdent@gmail.com>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
)

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
)

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
)

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
)

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
)

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants