-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Monitor subprocesses to avoid zombies #18218
Conversation
for more information, see https://pre-commit.ci
…feature/monitor-processes
for more information, see https://pre-commit.ci
…feature/monitor-processes
for more information, see https://pre-commit.ci
…feature/monitor-processes
⚡ Required checks status: All passing 🟢Groups summary🟢 pytorch_lightning: Tests workflow
These checks are required after the changes to 🟢 pytorch_lightning: Azure GPU
These checks are required after the changes to 🟢 pytorch_lightning: Benchmarks
These checks are required after the changes to 🟢 fabric: Docs
These checks are required after the changes to 🟢 pytorch_lightning: Docs
These checks are required after the changes to 🟢 lightning_fabric: CPU workflowThese checks are required after the changes to 🟢 lightning_fabric: Azure GPU
These checks are required after the changes to 🟢 mypy
These checks are required after the changes to 🟢 installThese checks are required after the changes to 🟢 link-check
These checks are required after the changes to Thank you for your contribution! 💜
|
…feature/monitor-processes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to add this to the spawn launchers too?
I would also update the existing PyTorch logic (added in #16525) to use the same code
What does this PR do?
This PR adds a process observer thread that monitors the child processes launched with the
SubprocessScriptLauncher
. The observer periodically checks whether the processes have terminated or not. If any processes have non-zero exit code, the thread will forcefully shutdown all other processes. Note that this only quality-of-life improvement applies only to our special process launcher from Lightning for single-node use. For multi-node training, we still rely on the external process launcher (SLURM, torchelastic, etc.) to handle this.Fixes #16410
Fixes #16518
Here is a simple example script if you want to test it yourself. It crashes on rank 1 and hangs on rank 0. On this branch, the processes will be cleaned up properly.
cc @Borda @carmocca @justusschock @awaelchli