
elastic_ep: Fix issues with repeated scale up/down cycles #37131

Merged
tlrmchlsmth merged 4 commits into
vllm-project:main from
itayalroy:elastic_ep_repeated_scaling
Mar 20, 2026


@itayalroy itayalroy commented Mar 16, 2026

Contributor

This PR fixes two issues found when testing repeated scale up/down cycles:

  1. PyNCCL communicators were never explicitly destroyed during
    elastic EP reconfigurations. Over repeated scale up/down cycles,
    this causes CUDA OOM crashes. Fixed by cleaning up PyNCCL correctly.

  2. Elastic EP EPLB suppression was scattered across executor, worker, and
    state-machine code, which made its lifecycle hard to follow and caused
    bugs such as EPLB staying suppressed after scale down. Fixed by making
    elastic_execute the sole owner of EPLB suppression.

While removing EPLB suppression from load_model, also clean up the
entire Elastic EP new-worker load path, which was extremely messy:
move EEP-specific logic from GPUWorker.load_model() into
elastic_execute, go through the same executor load path for both
regular and EEP new workers, and move the remaining scale-up logic,
such as receive_weights, into the elastic state machine where it belongs
(symmetric to the existing worker's send_weights).
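
The "sole owner" idea for EPLB suppression can be sketched as a context manager that sets the flag on entry and always restores it on exit. This is only an illustration of the ownership pattern, not the code in this PR: `EplbState`, `suppress_eplb`, and the `elastic_execute` signature shown here are hypothetical names.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class EplbState:
    """Hypothetical stand-in for whatever object tracks EPLB state."""
    suppressed: bool = False


@contextmanager
def suppress_eplb(state: EplbState):
    """Suppress EPLB for the duration of one reconfiguration step.

    The previous value is restored in `finally`, so suppression cannot
    leak past the block even if the step raises.
    """
    previous = state.suppressed
    state.suppressed = True
    try:
        yield state
    finally:
        state.suppressed = previous


def elastic_execute(state: EplbState, step):
    """Hypothetical single entry point: the only place that toggles suppression."""
    with suppress_eplb(state):
        return step()
```

With one owner, a scale-down step that fails or returns early still leaves EPLB unsuppressed afterwards, which is the class of bug described in item 2.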

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces significant improvements and bug fixes for elastic expert parallelism. The primary changes include fixing a CUDA out-of-memory issue by ensuring PyNCCL communicators are properly destroyed, and refactoring the EPLB (Expert Parallelism Load Balancing) suppression logic to be centralized within elastic_execute. This centralization resolves a bug where EPLB would remain suppressed after a scale-down operation. Furthermore, the code for handling new worker loading has been substantially cleaned up, with EEP-specific logic being moved to more appropriate locations, which enhances maintainability and clarifies the execution flow. The changes are well-structured and directly address the issues described, leading to a more robust and reliable elastic EP implementation.

@itayalroy itayalroy force-pushed the elastic_ep_repeated_scaling branch from 28f7904 to 969b51a Compare March 16, 2026 01:06
@mergify

mergify Bot commented Mar 16, 2026

Contributor

Hi @itayalroy, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@itayalroy itayalroy force-pushed the elastic_ep_repeated_scaling branch 2 times, most recently from f43dda8 to bd4cf16 Compare March 16, 2026 01:26
rtourgeman and others added 3 commits March 16, 2026 16:59
PyNCCL communicators were never explicitly destroyed during
elastic EP reconfigurations (CudaCommunicator.destroy() just
set pynccl_comm to None). Over repeated scale up/down cycles,
this causes CUDA OOM crashes.

Fix it by explicitly destroying PyNCCL comms on every elastic
EP reconfiguration. Note that PyNCCL destruction is a collective
operation, so it must also be called on processes that are exiting,
and all CUDA graphs referencing the comm must be destroyed beforehand
as well.

Co-authored-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Elastic EP EPLB suppression was scattered across executor, worker, and
state-machine code, which made its lifecycle hard to follow and caused
bugs such as EPLB staying suppressed after scale down. Fix by making
elastic_execute the sole owner of EPLB suppression.

While removing EPLB suppression from load_model, also clean up the
Elastic EP new-worker load path: keep GPUWorker.load_model() generic,
move EEP-specific loading into elastic_execute, use the same executor
load path for regular and EEP new workers, and move the remaining
scale-up work, such as receive_weights, into the state machine
(symmetric to the existing-worker's send_weights)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
During Elastic EP scale down, removing engines could destroy their active
DP/EP groups and still continue the current busy-loop iteration. If that
iteration happens to run a dummy batch, it fails with "data parallel group
is not initialized".

Fix this by exiting the busy loop as soon as a removing worker finishes its
Elastic EP state machine, and let the SystemExit handler call
engine_core.shutdown() instead of calling it directly from the EEP state
machine.

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
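
The control flow of the scale-down fix in the commit message above can be sketched roughly as follows. This is a toy model under stated assumptions, not vLLM's engine core: `Engine`, `advance_state_machine`, `run_dummy_batch`, and `run_busy_loop` are hypothetical names, and the "groups" are modeled as a boolean.

```python
class Engine:
    """Hypothetical engine core; only the control flow matters here."""

    def __init__(self, steps_until_removed):
        self.steps_until_removed = steps_until_removed
        self.groups_alive = True
        self.shutdown_calls = 0
        self.iterations = 0

    def advance_state_machine(self):
        """Return True once this (removing) worker has finished scaling down."""
        self.steps_until_removed -= 1
        if self.steps_until_removed <= 0:
            self.groups_alive = False  # DP/EP groups are torn down here
            return True
        return False

    def run_dummy_batch(self):
        if not self.groups_alive:
            raise RuntimeError("data parallel group is not initialized")
        self.iterations += 1

    def shutdown(self):
        self.shutdown_calls += 1


def run_busy_loop(engine):
    try:
        while True:
            if engine.advance_state_machine():
                # Leave immediately; do not finish the current iteration.
                raise SystemExit(0)
            engine.run_dummy_batch()
    except SystemExit:
        # Shutdown happens in one place, not inside the state machine.
        engine.shutdown()
```

Because the loop exits before the dummy batch runs, the destroyed groups are never touched, and `shutdown()` is reached exactly once through the handler.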
@itayalroy itayalroy force-pushed the elastic_ep_repeated_scaling branch from 78bc5d9 to 9e6f377 Compare March 16, 2026 14:59
@tlrmchlsmth tlrmchlsmth self-assigned this Mar 17, 2026

@SageMoore SageMoore left a comment

Contributor


This looks reasonable to me @itayalroy. Thanks for the contribution!

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 20, 2026

@tlrmchlsmth tlrmchlsmth left a comment

Member


Looks good to me, thank you!

Comment on lines +150 to +153
with torch.accelerator.device_index(self.device.index):
    self.nccl.ncclCommDestroy(self.comm)
self.available = False
self.disabled = True

Member


paranoia: consider setting self.comm to None to prevent use after destroy but before garbage collection
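
One possible shape for this suggestion, as a minimal sketch: clear the handle right after destroying it so that later calls fail loudly and a second `destroy()` is a no-op. `FakeNccl` and `Communicator` are hypothetical stand-ins written for this example, not vLLM's classes, and the device context manager from the snippet under review is omitted so the sketch runs without a GPU.

```python
class FakeNccl:
    """Hypothetical stand-in for the NCCL bindings; records destroyed handles."""

    def __init__(self):
        self.destroyed = []

    def ncclCommDestroy(self, comm):
        self.destroyed.append(comm)


class Communicator:
    """Hypothetical wrapper used only to illustrate the pattern."""

    def __init__(self, nccl, comm):
        self.nccl = nccl
        self.comm = comm
        self.available = True
        self.disabled = False

    def destroy(self):
        # Idempotent: a second call finds comm already cleared and returns.
        if self.comm is None:
            return
        self.nccl.ncclCommDestroy(self.comm)
        # Drop the handle so it cannot be reused before garbage collection.
        self.comm = None
        self.available = False
        self.disabled = True

    def all_reduce(self, value):
        # Any collective should refuse to run on a destroyed communicator.
        if self.comm is None:
            raise RuntimeError("communicator already destroyed")
        return value
```

The guard in `all_reduce` turns a potential use-after-destroy into an explicit Python error.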

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Mar 20, 2026
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) March 20, 2026 21:29
@tlrmchlsmth tlrmchlsmth merged commit c57d38d into vllm-project:main Mar 20, 2026
77 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Mar 20, 2026
chooper26 pushed a commit to vLLM-HUST/vllm-hust that referenced this pull request Mar 21, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…ct#37131)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done


4 participants