elastic_ep: Fix issues with repeated scale up/down cycles#37131
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces significant improvements and bug fixes for elastic expert parallelism. The primary changes include fixing a CUDA out-of-memory issue by ensuring PyNCCL communicators are properly destroyed, and refactoring the EPLB (Expert Parallelism Load Balancing) suppression logic to be centralized within elastic_execute. This centralization resolves a bug where EPLB would remain suppressed after a scale-down operation. Furthermore, the code for handling new worker loading has been substantially cleaned up, with EEP-specific logic being moved to more appropriate locations, which enhances maintainability and clarifies the execution flow. The changes are well-structured and directly address the issues described, leading to a more robust and reliable elastic EP implementation.
28f7904 to
969b51a
Compare
|
Hi @itayalroy, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
f43dda8 to
bd4cf16
Compare
PyNCCL communicators were never explicitly destroyed during elastic EP reconfigurations (CudaCommunicator.destroy() just set pynccl_comm to None). Over repeated scale up/down cycles, this causes CUDA OOM crashes. Fix it by explicitly destroying PyNCCL comms on every elastic ep reconfiguration. Note that PyNCCL destuction is a collective operation so it must be called on exiting processes, and all CUDA graphs referencing the comm must be destroyed beforehand as well. Co-authored-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Elastic EP EPLB suppression was scattered across executor, worker, and state-machine code, which made its lifecycle hard to follow and caused bugs such as EPLB staying suppressed after scale down. Fix by making elastic_execute the sole owner of EPLB suppression. While removing EPLB suppression from load_model, also clean up the Elastic EP new-worker load path: keep GPUWorker.load_model() generic, move EEP-specific loading into elastic_execute, use the same executor load path for regular and EEP new workers, and move the remaining scale-up work, such as receive_weights, into the state machine (symmetric to the existing-worker's send_weights) Signed-off-by: Itay Alroy <ialroy@nvidia.com>
During Elastic EP scale down, removing engines could destroy their active DP/EP groups and still continue the current busy-loop iteration. If that iteration happens to run a dummy-batch, it fails with "data parallel group is not initialized". Fix this by exiting the busy loop as soon as a removing worker finishes its Elastic EP state machine, and let SystemExit handler call engine_core.shutdown() instead of calling it directly from the EEP state machine. Signed-off-by: Itay Alroy <ialroy@nvidia.com>
78bc5d9 to
9e6f377
Compare
SageMoore
left a comment
There was a problem hiding this comment.
This looks reasonable to me @itayalroy. Thanks for the contribution!
tlrmchlsmth
left a comment
There was a problem hiding this comment.
Looks good to me, thank you!
| with torch.accelerator.device_index(self.device.index): | ||
| self.nccl.ncclCommDestroy(self.comm) | ||
| self.available = False | ||
| self.disabled = True |
There was a problem hiding this comment.
paranoia: consider setting self.comm to None to prevent use after destroy but before garbage collection
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
…ct#37131) Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
This PR fixes two issues found when testing repeated scale up/down cycles:
PyNCCL communicators were never explicitly destroyed during
elastic EP reconfigurations. Over repeated scale up/down cycles,
this causes CUDA OOM crashes. Fixed by cleaning up PyNCCL correctly.
Elastic EP EPLB suppression was scattered across executor, worker, and
state-machine code, which made its lifecycle hard to follow and caused
bugs such as EPLB staying suppressed after scale down. Fixed by making
elastic_execute the sole owner of EPLB suppression.
While removing EPLB suppression from load_model, also clean up the
entire Elastic EP new-worker load path which was extremely messy:
Move EEP-specific logic from GPUWorker.load_model() into
elastic_execute, go through the same executor load path for both
regular and EEP new workers, and move the remaining scale-up logic,
such as receive_weights, into the elastic state machine where it belongs
(symmetric to existing-worker send_weights)