-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add base RunSettings support for slurm, pbs, and cobalt #90
Add base RunSettings support for slurm, pbs, and cobalt #90
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment, but in general it looks good.
settings = RunSettings(exe, args, run_command="srun", run_args=run_args) | ||
return settings | ||
if test_launcher == "pbs": | ||
run_args = {"--pes": ntasks} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we don't specify nodes and time for pbs
and cobalt
base settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aprun doesn't have the concept of nodes
. it's just processing elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to push a change to this branch with a new approach for creating a Step
@@ -77,6 +77,9 @@ def create_step(self, name, cwd, step_settings): | |||
if isinstance(step_settings, MpirunSettings): | |||
step = MpirunStep(name, cwd, step_settings) | |||
return step | |||
if isinstance(step_settings, RunSettings): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so lets say im on a slurm system and I put in the AprunSettings
as my model run settings.
and the launcher doesn’t have AprunStep
but it does have the base class… the launcher will still try to launch the AprunSettings
as a LocalStep
.
so then if I launch I would get this error
Traceback (most recent call last):
File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/control/controller.py", line 359, in _launch_step
job_id = self._launcher.run(job_step)
File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/launcher/slurm/slurmLauncher.py", line 159, in run
task_id = self.task_manager.start_task(
File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/launcher/taskManager.py", line 124, in start_task
proc = execute_async_cmd(cmd_list, cwd, env=env, out=out, err=err)
File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/launcher/util/shell.py", line 102, in execute_async_cmd
raise ShellError("Failed to run command", e, cmd_list) from None
smartsim.error.errors.ShellError: Failed to run command
Command: aprun /usr/bin/echo hello
Error from shell: [Errno 2] No such file or directory: 'aprun'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "hello.py", line 10, in <module>
exp.start(model)
File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/experiment.py", line 103, in start
self._control.start(manifest=start_manifest, block=block)
File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/control/controller.py", line 78, in start
self._launch(manifest)
File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/control/controller.py", line 296, in _launch
self._launch_step(*job_step)
File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/control/controller.py", line 365, in _launch_step
raise SmartSimError(f"Job step {entity.name} failed to launch") from e
smartsim.error.errors.SmartSimError: Job step hello failed to launch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't a bad error but then the AprunSettings
won't necessarily have all the arguments prepared correctly.
in order to correctly support custom run commands for each launcher type, the launcher.create_step function has been moved to the launcher parent class and a new class variable supported_rs now defines the preset RunSettings types supported by each launcher
6dc42a0
to
a22589c
Compare
… into runsettings-nonlocal Merge remote runsettings-nonlocal into local runsettings-nonlocal
… into runsettings-nonlocal Merge in remote
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is clean and extensible. I like it.
197 tests passed on horizon. 84% coverage. thanks for using make style
already.
LGTM.
This PR addresses issue #84. The SmartSim
SlurmLauncher
,PBSLauncher
, andCobaltLauncher
now support baseRunSettings
.Tested on slurm and pbs.