Make the SLURM Preemption/Timeout Signal Configurable #14626
@awaelchli This is ready for review. I have two failing tests, but I'm not sure what's causing them. Besides the test cases I added, I also confirmed that it works on our SLURM cluster with a real job.
One thing to discuss: the way I implemented this, if the user specifies an invalid signal, we emit a warning message and the SignalConnector falls back to using SIGUSR1. This could instead be a fatal error of some sort (either an exception and crash, or an error log and no signal listening).
Overall approach and implementation looks good to me. Minor comments. Thanks for this contribution, and the testing on the cluster!
src/pytorch_lightning/plugins/environments/slurm_environment.py
I think we should check this immediately in the init of SLURMEnvironment. Did you choose a warning for a reason? For example, because a job that waits a long time in the queue and then errors out can be annoying to the user? If that is the concern, I think a warning is fine. Otherwise I would error out, as warnings can easily be overlooked or lost in logs.
Will look at it once we have resolved the discussions here. Probably a difference in the signal library between Python versions.
I'm fine with erroring out. I think this is probably safer, since a user passing an invalid signal would likely not want SIGUSR1 anyway, so we probably shouldn't assume it. That said, forcing them to pass an actual signal instead of a string more or less solves this problem for us.
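The "error out in the init" option discussed above could look roughly like the following. This is only a sketch: `RequeueConfig` is a hypothetical stand-in, not the actual SLURMEnvironment code, and the parameter name `requeue_signal` is an assumption made for illustration. It accepts only a real `signal.Signals` member (or `None`) and rejects anything else right away instead of falling back to SIGUSR1.

```python
import signal
from typing import Optional


class RequeueConfig:
    """Hypothetical stand-in for an environment class that takes a requeue signal."""

    def __init__(self, requeue_signal: Optional[signal.Signals] = None) -> None:
        # Fail fast at construction time rather than warning and silently
        # falling back to a default signal later.
        if requeue_signal is not None and not isinstance(requeue_signal, signal.Signals):
            raise TypeError(
                f"requeue_signal must be a signal.Signals member or None, got {requeue_signal!r}"
            )
        self.requeue_signal = requeue_signal
```

With this shape, passing a string such as "SIGUSR2" fails when the object is created, before the job is ever queued.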
tests/tests_pytorch/trainer/connectors/test_signal_connector.py
Signed-off-by: Max Ehrlich <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
This is rebased on the latest master and ready to go unless there is more feedback.
@awaelchli seems to be ready... :)
Co-authored-by: Jirka Borovec <[email protected]>
Just for my own curiosity, can you tell me what those changes are for?
Earlier I promised I would take care of the failing Windows tests (https://github.com/Lightning-AI/lightning/actions/runs/3039218876/jobs/4893893890), which is what I'm doing now. We can't merge with failing tests.
OK, I understand now: Python on Windows doesn't even define those constants for the signals.
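For readers unfamiliar with the Windows issue: `signal.SIGUSR1` is simply missing from the `signal` module there, so referencing it directly raises `AttributeError`. One way to guard against that, shown here as a sketch and not as the change made in this PR, is to look the constant up with `getattr`. The helper name `default_requeue_signal` is hypothetical.

```python
import signal
from typing import Optional


def default_requeue_signal() -> Optional[signal.Signals]:
    """Return SIGUSR1 where the platform defines it, otherwise None.

    On Windows the signal module has no SIGUSR1 attribute, so a plain
    `signal.SIGUSR1` reference would raise AttributeError at import time.
    """
    return getattr(signal, "SIGUSR1", None)
```

Callers can then skip handler registration when the result is `None`.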
@Queuecumber Looks like we got the Windows cases figured out.
Thanks for merging. Do you know the ETA for it showing up in an installable package (pre-release or full release)? I'm eager to use this in semi-production.
This feature will go into 1.8; we are targeting end of September / beginning of October. To use this immediately, you can install the build from the master branch.
What does this PR do?
This PR adds a flag to the SLURMEnvironment class that lets users change which signal is watched for preemption/timeout. Although SIGUSR1 is fairly standard, some libraries that submit to SLURM (e.g., submitit) use a different signal, and users may have other compatibility issues with SIGUSR1 that require a different signal.
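To illustrate what "watching" a configurable signal means, here is a minimal standard-library sketch. It is not the SignalConnector implementation: `watch_for_requeue` and `on_requeue` are hypothetical names, and a real handler would save a checkpoint and requeue the job, not just call a callback.

```python
import signal
from types import FrameType
from typing import Callable, Optional


def watch_for_requeue(requeue_signal: signal.Signals, on_requeue: Callable[[], None]) -> None:
    """Register a handler that calls `on_requeue` when `requeue_signal` arrives.

    Must be called from the main thread, as required by signal.signal.
    """

    def _handler(signum: int, frame: Optional[FrameType]) -> None:
        # A real implementation would checkpoint and requeue here.
        on_requeue()

    signal.signal(requeue_signal, _handler)
```

The signal chosen here has to match the one SLURM is asked to send, for example through the `--signal` option of `sbatch`.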
Fixes #14610
Does your PR introduce any breaking changes? If yes, please list them.
None