Make some Slurm errors fatal for execution #392

Merged
merged 14 commits into from
Dec 13, 2023


al-rigazzi
Collaborator

On Slurm-based systems, some commands can be available and still return an error when the SlurmDB is not working properly. Such an error could silently cause SmartSim's execution to loop forever. It was decided that any error from sacct will be fatal for execution, since SmartSim uses sacct to obtain information about running, paused, and finished jobs.
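
As a rough illustration of the behavior described above, the following is a minimal sketch of a command wrapper that can turn a non-zero exit status into an exception. It is not the code changed in this PR: `run_command`, `query_accounting`, `SchedulerCommandError`, and the `raise_on_err` parameter are hypothetical names chosen for the example, and the only assumption about sacct is that it reports failure through a non-zero exit status.

```python
import subprocess
import sys


class SchedulerCommandError(RuntimeError):
    """Hypothetical error type for this sketch; not SmartSim's exception class."""


def run_command(cmd, raise_on_err=False):
    """Run cmd (a list of strings) and return (returncode, stdout, stderr).

    With raise_on_err=True, a non-zero exit status raises
    SchedulerCommandError instead of being returned to the caller.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if raise_on_err and proc.returncode != 0:
        raise SchedulerCommandError(
            f"{cmd[0]} exited with status {proc.returncode}: {proc.stderr.strip()}"
        )
    return proc.returncode, proc.stdout, proc.stderr


def query_accounting(args, executable="sacct"):
    """Query accounting data and treat any failure as fatal.

    The executable is a parameter only so the sketch can be exercised on
    machines without Slurm installed.
    """
    _, out, _ = run_command([executable, *args], raise_on_err=True)
    return out
```

With this shape, a caller that polls job status in a loop gets an exception on the first failed query instead of repeatedly receiving empty output, which is the failure mode the description warns about. Other commands can keep the default `raise_on_err=False` when their errors are recoverable.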

@codecov

codecov bot commented Oct 11, 2023

Codecov Report

Merging #392 (0449d1e) into develop (3ee7794) will increase coverage by 0.12%.
The diff coverage is n/a.

❗ Current head 0449d1e differs from pull request most recent head 6bbb54a. Consider uploading reports for the commit 6bbb54a to get more accurate results

Additional details and impacted files


@@             Coverage Diff             @@
##           develop     #392      +/-   ##
===========================================
+ Coverage    90.11%   90.24%   +0.12%     
===========================================
  Files           60       60              
  Lines         3864     3864              
===========================================
+ Hits          3482     3487       +5     
+ Misses         382      377       -5     

see 1 file with indirect coverage changes

@al-rigazzi al-rigazzi self-assigned this Oct 11, 2023
@al-rigazzi al-rigazzi added labels type: design (Issues related to architecture and code design), area: launcher (Issues related to any of the launchers within SmartSim), type: usability (Issues related to ease of use) Oct 11, 2023
@al-rigazzi al-rigazzi changed the title Make some Slurm error fatal for execution Make some Slurm errors fatal for execution Oct 18, 2023
@ankona ankona self-requested a review October 23, 2023 22:48
Contributor

@ankona ankona left a comment


per offline convo, I will create a ticket to review adding checks to a higher level abstraction.

otherwise, looks good!

Member

@MattToast MattToast left a comment


Two potential change requests that I wanted to get your opinion on, but otherwise this LGTM!

smartsim/_core/launcher/slurm/slurmCommands.py (outdated)
smartsim/_core/launcher/slurm/slurmCommands.py (outdated)
Member

@MattToast MattToast left a comment


LGTM pending a re-run of tests after the merge of the CrayLabs/SmartRedis#421 hotfix!

@al-rigazzi
Collaborator Author

> LGTM pending a re-run of tests after the merge of the CrayLabs/SmartRedis#421 hotfix!

OK, thanks. I'll let other, more urgent PRs get in before this.

@al-rigazzi al-rigazzi merged commit 67785e1 into CrayLabs:develop Dec 13, 2023
24 checks passed
@al-rigazzi al-rigazzi deleted the slurm-err branch December 13, 2023 16:01
Labels
area: launcher (Issues related to any of the launchers within SmartSim)
type: design (Issues related to architecture and code design)
type: usability (Issues related to ease of use)

3 participants