Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add base RunSettings support for slurm, pbs, and cobalt #90

Merged
merged 19 commits into from
Oct 20, 2021

Conversation

EricGustin
Copy link
Contributor

This PR addresses issue #84. The SmartSim SlurmLauncher, PBSLauncher, and CobaltLauncher now support base RunSettings.

Tested on slurm and pbs.

@EricGustin EricGustin added area: PBSpro area: settings Issues related to Batch or Run settings labels Sep 24, 2021
@EricGustin EricGustin self-assigned this Sep 24, 2021
@EricGustin EricGustin marked this pull request as draft September 25, 2021 17:59
@EricGustin EricGustin marked this pull request as ready for review September 25, 2021 18:19
Copy link
Collaborator

@al-rigazzi al-rigazzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment, but in general it looks good.

settings = RunSettings(exe, args, run_command="srun", run_args=run_args)
return settings
if test_launcher == "pbs":
run_args = {"--pes": ntasks}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we don't specify nodes and time for pbs and cobalt base settings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aprun doesn't have the concept of nodes. it's just processing elements.

Copy link
Contributor

@Spartee Spartee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to push a change to this branch with a new approach for creating a Step

@@ -77,6 +77,9 @@ def create_step(self, name, cwd, step_settings):
if isinstance(step_settings, MpirunSettings):
step = MpirunStep(name, cwd, step_settings)
return step
if isinstance(step_settings, RunSettings):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so lets say im on a slurm system and I put in the AprunSettings as my model run settings.
and the launcher doesn’t have AprunStep but it does have the base class… the launcher will still try to launch the AprunSettings as a LocalStep.

so then if I launch I would get this error

Traceback (most recent call last):
  File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/control/controller.py", line 359, in _launch_step
    job_id = self._launcher.run(job_step)
  File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/launcher/slurm/slurmLauncher.py", line 159, in run
    task_id = self.task_manager.start_task(
  File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/launcher/taskManager.py", line 124, in start_task
    proc = execute_async_cmd(cmd_list, cwd, env=env, out=out, err=err)
  File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/launcher/util/shell.py", line 102, in execute_async_cmd
    raise ShellError("Failed to run command", e, cmd_list) from None
smartsim.error.errors.ShellError: Failed to run command

Command: aprun /usr/bin/echo hello
Error from shell: [Errno 2] No such file or directory: 'aprun'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "hello.py", line 10, in <module>
    exp.start(model)
  File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/experiment.py", line 103, in start
    self._control.start(manifest=start_manifest, block=block)
  File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/control/controller.py", line 78, in start
    self._launch(manifest)
  File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/control/controller.py", line 296, in _launch
    self._launch_step(*job_step)
  File "/lus/cls01029/spartee/poseidon/SmartSim/smartsim/control/controller.py", line 365, in _launch_step
    raise SmartSimError(f"Job step {entity.name} failed to launch") from e
smartsim.error.errors.SmartSimError: Job step hello failed to launch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't a bad error but then the AprunSettings won't necessarily have all the arguments prepared correctly.

Sam Partee and others added 2 commits October 15, 2021 15:12
in order to correctly support custom run commands
for each launcher type, the launcher.create_step
function has been moved to the launcher parent
class and a new class variable supported_rs now
defines the preset RunSettings types supported
by each launcher
@Spartee Spartee self-requested a review October 19, 2021 23:23
Copy link
Contributor

@Spartee Spartee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clean and extensible. I like it.

197 tests passed on horizon. 84% coverage. thanks for using make style already.

LGTM.

@EricGustin EricGustin merged commit 8be8ae4 into CrayLabs:develop Oct 20, 2021
@EricGustin EricGustin deleted the runsettings-nonlocal branch February 18, 2022 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: settings Issues related to Batch or Run settings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants