Make the SLURM Preemption/Timeout Signal Configurable #14626

Merged
merged 27 commits into from
Sep 12, 2022

Conversation

Queuecumber
Contributor

@Queuecumber Queuecumber commented Sep 9, 2022

What does this PR do?

This PR adds a flag to the SLURMEnvironment class that lets users choose which signal is watched for preemption/timeout.

Although SIGUSR1 is fairly standard, some libraries that submit to SLURM (e.g., submitit) use a different signal, and some users may have other compatibility issues with SIGUSR1 that require a different signal.
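
For illustration only (this is not the code from this PR): a minimal, standard-library sketch of what a configurable preemption signal could look like. The class name `PreemptionWatcher`, its `install` method, and the `preempted` flag are hypothetical; only the idea of a `requeue_signal`-style option that defaults to SIGUSR1 comes from the description above. It assumes a POSIX platform and Python 3.8+ (for `signal.raise_signal`).

```python
import signal
from typing import Optional


class PreemptionWatcher:
    """Hypothetical stand-in for the signal-handling part of a SLURM environment."""

    def __init__(self, requeue_signal: Optional[signal.Signals] = None) -> None:
        # Fall back to SIGUSR1 (POSIX only) when the user does not pick a signal.
        self.requeue_signal = signal.SIGUSR1 if requeue_signal is None else requeue_signal
        self.preempted = False

    def install(self) -> None:
        # signal.signal() must be called from the main thread.
        signal.signal(self.requeue_signal, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        # A real handler would save a checkpoint and requeue the job here.
        self.preempted = True


# SIGUSR2 is just an example of a non-default choice.
watcher = PreemptionWatcher(requeue_signal=signal.SIGUSR2)
watcher.install()
signal.raise_signal(signal.SIGUSR2)  # simulate the scheduler's notice
print(watcher.preempted)
```

The point of the design is that the signal is plain configuration: the handler logic stays the same and only the signal number passed to `signal.signal` changes.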

Fixes #14610

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Sep 9, 2022
@Queuecumber Queuecumber marked this pull request as draft September 9, 2022 16:46
@Queuecumber Queuecumber changed the title Draft: Make the SLURM Preemption/Timeout Signal Configurable Make the SLURM Preemption/Timeout Signal Configurable Sep 9, 2022
@Queuecumber Queuecumber marked this pull request as ready for review September 10, 2022 14:36
@Queuecumber
Contributor Author

Queuecumber commented Sep 10, 2022

@awaelchli This is ready for review. I have two failing tests, but I'm not sure what's causing them.

Besides the test cases I added, I also confirmed that it works on our SLURM cluster with a real job.

@Queuecumber
Contributor Author

Queuecumber commented Sep 10, 2022

One thing to discuss:

The way I implemented this, if the user specifies an invalid signal, we emit a warning message and the SignalConnector falls back to using SIGUSR1.

This could also be a fatal error of some sort (either raise an exception and crash, or log an error and skip signal listening).

Contributor

@awaelchli awaelchli left a comment

The overall approach and implementation look good to me. Minor comments. Thanks for this contribution, and for testing on the cluster!

@awaelchli
Contributor

> The way I implemented this, if the user specifies an invalid signal, we emit a warning message and the SignalConnector falls back to using SIGUSR1

I think we should check this immediately in the init of SLURMEnvironment. Did you choose a warning for a reason? For example, because erroring out after a job has waited a long time in the queue would be annoying to the user? If that is a concern, I think a warning is fine. Otherwise I would error out, as warnings are easily overlooked or lost in logs.

@awaelchli
Contributor

> I have two failing tests, but I'm not sure what's causing them

I will look at it once we have resolved the discussions here. It is probably a difference in the signal library between Python versions.

@Queuecumber
Contributor Author

> > The way I implemented this, if the user specifies an invalid signal, we emit a warning message and the SignalConnector falls back to using SIGUSR1
>
> I think we should check this immediately in the init of SLURMEnvironment. Did you choose a warning for a reason? For example, because erroring out after a job has waited a long time in the queue would be annoying to the user? If that is a concern, I think a warning is fine. Otherwise I would error out, as warnings are easily overlooked or lost in logs.

I'm fine with erroring out. I think this is probably safer, since a user passing an invalid signal would likely not want SIGUSR1 anyway, so we probably shouldn't assume it.

That said, forcing them to pass an actual signal instead of a string sort of solves this problem for us.
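
For illustration only (not the check merged in this PR): a minimal sketch of the "error out at init" option discussed above, combined with requiring an actual signal object rather than a string. The helper name `check_requeue_signal` is hypothetical; `signal.Signals` is the standard-library enum of signal constants.

```python
import signal


def check_requeue_signal(requeue_signal):
    """Hypothetical init-time check: reject anything that is not a signal.Signals member."""
    if not isinstance(requeue_signal, signal.Signals):
        raise TypeError(
            f"requeue_signal must be a signal.Signals member, got {requeue_signal!r}"
        )
    return requeue_signal


check_requeue_signal(signal.SIGTERM)  # a real signal constant is accepted

try:
    check_requeue_signal("SIGUSR3")  # a string (here also misspelled) fails immediately
except TypeError as err:
    print(f"rejected: {err}")
```

Failing at construction time surfaces the mistake before the job is submitted, instead of as a warning that may be lost in the logs.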

@carmocca carmocca added feature Is an improvement or enhancement environment: slurm labels Sep 11, 2022
@carmocca carmocca added this to the pl:1.8 milestone Sep 11, 2022
@carmocca carmocca added the community This PR is from the community label Sep 11, 2022
@mergify mergify bot added the ready PRs ready to be merged label Sep 12, 2022
@mergify mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Sep 12, 2022
@Queuecumber
Contributor Author

This is rebased on the latest master and ready to go unless there is more feedback.

@Borda Borda enabled auto-merge (squash) September 12, 2022 17:15
@Borda
Member

Borda commented Sep 12, 2022

@awaelchli seems to be ready... :)

@Queuecumber
Contributor Author

Just for my own curiosity, can you tell me what those changes are for?

@awaelchli
Contributor

Earlier I promised I would take care of the failing Windows tests: https://github.com/Lightning-AI/lightning/actions/runs/3039218876/jobs/4893893890

That is what I'm doing now. We can't merge while tests are failing.

@Queuecumber
Contributor Author

OK, I understand now: Python on Windows doesn't even define those signal constants.
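
For illustration only (not the fix applied in this PR): a minimal sketch of one way to guard against the missing constants. Python's `signal` module exposes `SIGUSR1` only on Unix, so referencing `signal.SIGUSR1` directly raises `AttributeError` on Windows. The variable name `default_requeue_signal` is hypothetical.

```python
import signal
import sys

# Look the constant up defensively instead of referencing signal.SIGUSR1 directly,
# because the attribute does not exist on Windows.
default_requeue_signal = getattr(signal, "SIGUSR1", None)

if default_requeue_signal is None:
    print(f"{sys.platform}: SIGUSR1 unavailable, skipping handler registration")
else:
    print(f"{sys.platform}: would watch {default_requeue_signal.name}")
```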

@Borda Borda merged commit e5998e6 into Lightning-AI:master Sep 12, 2022
@awaelchli
Contributor

@Queuecumber Looks like we got the Windows cases figured out.
Great work, thanks for contributing this improvement!

@Queuecumber
Contributor Author

Thanks for merging. Do you guys know what the ETA is for it showing up in an installable package (pre-release or full release)?

I'm eager to use this in semi-production.

@awaelchli
Contributor

This feature will go into 1.8; we are targeting the end of September / beginning of October.

To use this immediately, you can install the build from the master branch:
pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip -U

Labels
community This PR is from the community environment: slurm feature Is an improvement or enhancement pl Generic label for PyTorch Lightning package ready PRs ready to be merged
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Make the Slurm Signal Configurable
6 participants