Add support for Slurm heterogeneous jobs #346
Thank you for putting this in and for working through and around Slurm
smartsim/settings/slurmSettings.py
@@ -270,6 +274,17 @@ def set_walltime(self, walltime: str) -> None:
        """
        self.run_args["time"] = str(walltime)

    def set_het_group(self, het_group: int) -> None:
One other thought that I just had: throw an error here and also in `make_mpmd` in case someone tries to set a `het_group` and also `mpmd` at the same time.
Yep!
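The mutual-exclusion check suggested in this thread could be sketched roughly as follows. This is not the code from the PR: `SketchRunSettings`, the `mpmd` list, the `"het-group"` key in `run_args`, and the use of `ValueError` are all assumptions made for illustration.

```python
class SketchRunSettings:
    """Hypothetical stand-in for a Slurm run-settings object."""

    def __init__(self) -> None:
        self.run_args: dict = {}  # launcher arguments, keyed by option name
        self.mpmd: list = []      # additional MPMD components, if any

    def set_het_group(self, het_group: int) -> None:
        # Refuse a het group once MPMD components have been attached.
        if self.mpmd:
            raise ValueError("Cannot set het_group on MPMD run settings")
        # Assumed key name, mirroring srun's --het-group option.
        self.run_args["het-group"] = str(het_group)

    def make_mpmd(self, other: "SketchRunSettings") -> None:
        # Refuse MPMD once a het group has been requested.
        if "het-group" in self.run_args:
            raise ValueError("Cannot make MPMD run settings with het_group set")
        self.mpmd.append(other)
```

Checking in both methods means the error is raised regardless of which call the user makes first.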
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## develop #346 +/- ##
========================================
Coverage 87.04% 87.04%
========================================
Files 59 59
Lines 3551 3551
========================================
Hits 3091 3091
Misses 460 460
Looks great!! I have a couple of small nitpicks and some potential follow-on work that may or may not be relevant. Lmk what you think!!
Co-authored-by: Matt Drozt <[email protected]>
Thanks for taking care of those nitpicks!! Everything here LGTM!!
This PR adds (limited) support for Slurm heterogeneous jobs. Het jobs overload the syntax used by MPMD workloads, which can cause problems when running on Slurm. This PR adds several checks to see if the allocation is heterogeneous and, if it is, prevents the user from running MPMD models (or `single_cmd` orchestrators).

Still missing:
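Separately from the open items, the allocation check mentioned in this description might look roughly like the sketch below. It assumes that Slurm exports the `SLURM_HET_SIZE` environment variable inside a heterogeneous allocation; the function names and the use of `RuntimeError` are hypothetical and not taken from the PR.

```python
import os
from typing import Mapping, Optional


def is_het_allocation(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the environment looks like a Slurm het job.

    Assumption: Slurm sets SLURM_HET_SIZE in heterogeneous allocations.
    """
    env = os.environ if env is None else env
    return "SLURM_HET_SIZE" in env


def check_mpmd_allowed(
    has_mpmd: bool, env: Optional[Mapping[str, str]] = None
) -> None:
    """Raise if an MPMD workload is requested inside a het allocation."""
    if has_mpmd and is_het_allocation(env):
        raise RuntimeError(
            "MPMD workloads are not supported in heterogeneous allocations"
        )
```

Passing the environment in as a parameter keeps the check easy to test without a real Slurm allocation.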