Make some Slurm errors fatal for execution #392
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           develop     #392     +/-   ##
===========================================
+ Coverage    90.11%   90.24%   +0.12%
===========================================
  Files           60       60
  Lines         3864     3864
===========================================
+ Hits          3482     3487       +5
+ Misses         382      377       -5
Per our offline conversation, I will create a ticket to review adding checks at a higher level of abstraction.
Otherwise, looks good!
Two potential change requests that I wanted to get your opinion on, but otherwise this LGTM!
LGTM pending a re-run of tests after the merge of the CrayLabs/SmartRedis#421 hotfix!
OK thanks, I'll let other, more urgent PRs get in before this.
When running on Slurm-based systems, some commands may be available but still return an error when SlurmDB is not working properly. Such an error could silently cause SmartSim's execution to loop forever. It was decided that any error from `sacct` will be fatal for execution, as SmartSim uses `sacct` to acquire information about running, paused, and finished jobs.
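
To illustrate the behavior described above, here is a minimal sketch of treating a non-zero `sacct` exit code as fatal instead of silently continuing. It is not the code from this PR: the names `SchedulerQueryError`, `run_command`, `query_sacct`, and the `raise_on_err` parameter are hypothetical and chosen only for illustration.

```python
import subprocess
from typing import Callable, List, Tuple

# Hypothetical type alias: a runner takes a command and returns
# (returncode, stdout, stderr).
Runner = Callable[[List[str]], Tuple[int, str, str]]


class SchedulerQueryError(RuntimeError):
    """Hypothetical error raised when a scheduler query command fails."""


def run_command(cmd: List[str]) -> Tuple[int, str, str]:
    """Run a command and return (returncode, stdout, stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout, proc.stderr


def query_sacct(
    args: List[str],
    runner: Runner = run_command,
    raise_on_err: bool = True,
) -> str:
    """Call sacct and return its stdout.

    With raise_on_err=True, a non-zero exit code (for example, when
    SlurmDB is unreachable) raises instead of returning empty output,
    so a caller polling for job status cannot loop forever on a
    failing command.
    """
    returncode, out, err = runner(["sacct", *args])
    if returncode != 0 and raise_on_err:
        raise SchedulerQueryError(
            f"sacct failed with exit code {returncode}: {err.strip()}"
        )
    return out
```

A caller on a Slurm system might invoke `query_sacct(["--noheader", "--parsable2", "-j", "12345"])`, where the job ID is a placeholder. Passing the runner as a parameter is a design choice made for this sketch so the failure path can be exercised without a working Slurm installation.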