elastic_ep: Fix issues with repeated scale up/down cycles by itayalroy · Pull Request #37131 · vllm-project/vllm

itayalroy · 2026-03-16T01:00:25Z

This PR fixes two issues found when testing repeated scale up/down cycles:

1. PyNCCL communicators were never explicitly destroyed during
   elastic EP reconfigurations. Over repeated scale up/down cycles,
   this causes CUDA OOM crashes. Fixed by explicitly destroying the
   PyNCCL communicators on every elastic EP reconfiguration.
2. Elastic EP EPLB suppression was scattered across executor, worker, and
   state-machine code, which made its lifecycle hard to follow and caused
   bugs such as EPLB staying suppressed after scale down. Fixed by making
   elastic_execute the sole owner of EPLB suppression.

While removing EPLB suppression from load_model, also clean up the
entire Elastic EP new-worker load path, which was extremely messy:
move EEP-specific logic from GPUWorker.load_model() into
elastic_execute, go through the same executor load path for both
regular and EEP new workers, and move the remaining scale-up logic,
such as receive_weights, into the elastic state machine where it belongs
(symmetric to existing-worker send_weights).
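The symmetry described above can be pictured with a small sketch. This is not the vLLM state machine: the class `ScaleUpStateMachine`, the `Role` enum, and the step names `create_groups` and `switch_groups` are hypothetical and exist only for illustration. Only `send_weights` and `receive_weights` are names taken from the description.

```python
from enum import Enum, auto


class Role(Enum):
    EXISTING = auto()  # worker that was already serving before the scale-up
    NEW = auto()       # worker added by the scale-up


class ScaleUpStateMachine:
    """Hypothetical state machine; surrounding step names are illustrative."""

    def __init__(self, role, log):
        self.role = role
        self.log = log  # records executed steps so the flow can be inspected

    def steps(self):
        # Both roles share one skeleton; only the weight-transfer step differs.
        if self.role is Role.EXISTING:
            transfer = "send_weights"
        else:
            transfer = "receive_weights"
        return ["create_groups", transfer, "switch_groups"]

    def run(self):
        for step in self.steps():
            self.log.append((self.role.name, step))


log = []
ScaleUpStateMachine(Role.EXISTING, log).run()
ScaleUpStateMachine(Role.NEW, log).run()
```

Keeping both transfer directions in the same state machine means a reader can compare the two roles step by step instead of hunting for the receive side in the model-loading code.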

gemini-code-assist

Code Review

This pull request introduces significant improvements and bug fixes for elastic expert parallelism. The primary changes include fixing a CUDA out-of-memory issue by ensuring PyNCCL communicators are properly destroyed, and refactoring the EPLB (Expert Parallelism Load Balancing) suppression logic to be centralized within elastic_execute. This centralization resolves a bug where EPLB would remain suppressed after a scale-down operation. Furthermore, the code for handling new worker loading has been substantially cleaned up, with EEP-specific logic being moved to more appropriate locations, which enhances maintainability and clarifies the execution flow. The changes are well-structured and directly address the issues described, leading to a more robust and reliable elastic EP implementation.

mergify · 2026-03-16T01:13:20Z

Hi @itayalroy, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

PyNCCL communicators were never explicitly destroyed during elastic EP reconfigurations (CudaCommunicator.destroy() just set pynccl_comm to None). Over repeated scale up/down cycles, this causes CUDA OOM crashes. Fix it by explicitly destroying PyNCCL comms on every elastic EP reconfiguration. Note that PyNCCL destruction is a collective operation, so it must be called on exiting processes, and all CUDA graphs referencing the comm must be destroyed beforehand as well.

Co-authored-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
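The ordering constraint in this commit message (release the CUDA graphs first, then destroy the communicator) can be sketched without a GPU. Everything below is a stand-in: `FakeComm`, `FakeGraph`, and `reconfigure` are hypothetical names, and the reference counter only simulates the real dependency between captured graphs and the communicator.

```python
class FakeComm:
    """Stand-in for a PyNCCL communicator (hypothetical, no GPU needed)."""

    def __init__(self):
        self.destroyed = False
        self.graph_refs = 0  # live captured graphs that reference this comm

    def destroy(self):
        if self.graph_refs:
            raise RuntimeError("captured graphs still reference this communicator")
        self.destroyed = True


class FakeGraph:
    """Stand-in for a captured CUDA graph that uses a communicator."""

    def __init__(self, comm):
        self.comm = comm
        comm.graph_refs += 1

    def release(self):
        self.comm.graph_refs -= 1


def reconfigure(comm, graphs):
    """Tear down in the required order, then hand back a fresh communicator."""
    for graph in graphs:
        graph.release()   # 1. drop every graph that references the comm
    graphs.clear()
    comm.destroy()        # 2. only now destroy the comm itself
    return FakeComm()     # 3. communicator for the new configuration
```

In the real system the destroy step is collective, so every rank in the old group, including ranks that are about to exit, would need to reach it; the single-process sketch cannot show that part.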

Elastic EP EPLB suppression was scattered across executor, worker, and state-machine code, which made its lifecycle hard to follow and caused bugs such as EPLB staying suppressed after scale down. Fix by making elastic_execute the sole owner of EPLB suppression. While removing EPLB suppression from load_model, also clean up the Elastic EP new-worker load path: keep GPUWorker.load_model() generic, move EEP-specific loading into elastic_execute, use the same executor load path for regular and EEP new workers, and move the remaining scale-up work, such as receive_weights, into the state machine (symmetric to the existing-worker's send_weights).

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
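One way to read "sole owner" is that a single function both sets and clears the flag, so no other code path can leave it stuck. The sketch below uses a context manager for that; `EplbState`, `suppress_eplb`, and `elastic_execute_sketch` are hypothetical names, and the real `elastic_execute` signature is not shown in this PR page.

```python
from contextlib import contextmanager


class EplbState:
    """Hypothetical holder for the suppression flag."""

    def __init__(self):
        self.suppressed = False


@contextmanager
def suppress_eplb(state):
    # Set the flag on entry and always clear it on exit, even on errors.
    state.suppressed = True
    try:
        yield state
    finally:
        state.suppressed = False


def elastic_execute_sketch(state, step):
    """Only this function touches the flag; callers never set it directly."""
    with suppress_eplb(state):
        return step(state)
```

Because the flag is cleared in a `finally` block, a failing reconfiguration step cannot leave EPLB suppressed, which is the class of bug the commit message describes.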

During Elastic EP scale down, removing engines could destroy their active DP/EP groups and still continue the current busy-loop iteration. If that iteration happens to run a dummy batch, it fails with "data parallel group is not initialized". Fix this by exiting the busy loop as soon as a removing worker finishes its Elastic EP state machine, and by letting the SystemExit handler call engine_core.shutdown() instead of calling it directly from the EEP state machine.

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
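The control flow of this fix can be sketched as a loop that raises `SystemExit` the moment the removing worker's state machine finishes, with shutdown handled by the exception handler. `EngineCoreSketch` and `run` are hypothetical; the real engine core has far more state, and the iteration count here only simulates how long the scale-down takes.

```python
class EngineCoreSketch:
    """Hypothetical engine core for a worker that is being removed."""

    def __init__(self, iterations_until_removed):
        self.remaining = iterations_until_removed
        self.groups_alive = True
        self.shutdown_called = False
        self.dummy_batches = 0

    def step_state_machine(self):
        """Return True once this removing worker has finished scale-down."""
        self.remaining -= 1
        if self.remaining == 0:
            self.groups_alive = False  # DP/EP groups are gone from here on
            return True
        return False

    def run_dummy_batch(self):
        if not self.groups_alive:
            raise RuntimeError("data parallel group is not initialized")
        self.dummy_batches += 1

    def busy_loop(self):
        while True:
            if self.step_state_machine():
                # Leave the loop before anything can touch the groups again.
                raise SystemExit
            self.run_dummy_batch()

    def shutdown(self):
        self.shutdown_called = True


def run(engine):
    try:
        engine.busy_loop()
    except SystemExit:
        # Shutdown lives in the handler, not in the state machine.
        engine.shutdown()
```

If the `raise SystemExit` line were replaced by a plain `continue`-free fall-through, the next `run_dummy_batch()` call would hit the "not initialized" error, which mirrors the failure described above.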

SageMoore

This looks reasonable to me @itayalroy. Thanks for the contribution!

tlrmchlsmth

Looks good to me, thank you!

tlrmchlsmth · 2026-03-20T20:58:29Z

+            with torch.accelerator.device_index(self.device.index):
+                self.nccl.ncclCommDestroy(self.comm)
+            self.available = False
+            self.disabled = True


paranoia: consider setting self.comm to None to prevent use after destroy but before garbage collection
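A minimal sketch of what this suggestion could look like, assuming a simplified communicator class. `CommunicatorSketch` and its `destroy_fn` parameter are hypothetical; `destroy_fn` stands in for `self.nccl.ncclCommDestroy`, and the device-index context from the diff is omitted so the example runs without a GPU.

```python
class CommunicatorSketch:
    """Hypothetical, GPU-free stand-in for the communicator in the diff."""

    def __init__(self, destroy_fn, comm):
        self._destroy_fn = destroy_fn  # stands in for nccl.ncclCommDestroy
        self.comm = comm
        self.available = True
        self.disabled = False

    def destroy(self):
        if self.comm is None:
            return  # already destroyed; a second call is a no-op
        self._destroy_fn(self.comm)
        self.comm = None  # the reviewer's suggestion: drop the stale handle
        self.available = False
        self.disabled = True

    def all_reduce(self, value):
        if self.comm is None:
            raise RuntimeError("communicator has been destroyed")
        return value  # placeholder for the real collective
```

Clearing the handle turns a use-after-destroy into an immediate Python error instead of a call into the native library with a dangling communicator, and it also makes `destroy()` safe to call twice.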

…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>

…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

itayalroy requested a review from njhill as a code owner March 16, 2026 01:00

mergify Bot added nvidia v1 labels Mar 16, 2026

github-project-automation Bot added this to NVIDIA Mar 16, 2026

gemini-code-assist Bot reviewed Mar 16, 2026

View reviewed changes

itayalroy force-pushed the elastic_ep_repeated_scaling branch from 28f7904 to 969b51a Compare March 16, 2026 01:06

itayalroy force-pushed the elastic_ep_repeated_scaling branch 2 times, most recently from f43dda8 to bd4cf16 Compare March 16, 2026 01:26

rtourgeman and others added 3 commits March 16, 2026 16:59

itayalroy force-pushed the elastic_ep_repeated_scaling branch from 78bc5d9 to 9e6f377 Compare March 16, 2026 14:59

tlrmchlsmth self-assigned this Mar 17, 2026

SageMoore approved these changes Mar 18, 2026

View reviewed changes

tlrmchlsmth mentioned this pull request Mar 20, 2026

[Bugfix] Fix elastic EP scale-up after scale-down #37357

Closed

tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 20, 2026

tlrmchlsmth approved these changes Mar 20, 2026

View reviewed changes

github-project-automation Bot moved this to Ready in NVIDIA Mar 20, 2026

Merge branch 'main' into elastic_ep_repeated_scaling

04c2bd1

tlrmchlsmth enabled auto-merge (squash) March 20, 2026 21:29

tlrmchlsmth merged commit c57d38d into vllm-project:main Mar 20, 2026
77 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Mar 20, 2026

chooper26 pushed a commit to vLLM-HUST/vllm-hust that referenced this pull request Mar 21, 2026

elastic_ep: Fix issues with repeated scale up/down cycles (vllm-proje…

46f798a

…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026

elastic_ep: Fix issues with repeated scale up/down cycles (vllm-proje…

018abc5

…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026

elastic_ep: Fix issues with repeated scale up/down cycles (vllm-proje…

f6db0d3

…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

elastic_ep: Fix issues with repeated scale up/down cycles (vllm-proje…

7e2bffa

…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

jeffreywang-anyscale mentioned this pull request Apr 15, 2026

[BugFix] Prevent orphaned process on NCCL destroy #39846

Merged

5 tasks

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

elastic_ep: Fix issues with repeated scale up/down cycles (vllm-proje…

024d264

…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>

tlrmchlsmth merged 4 commits into vllm-project:main from itayalroy:elastic_ep_repeated_scaling

4 participants
