Add support for NeMo SDK #131

Merged
merged 13 commits on Jul 9, 2024
Add slurm sdk example
Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
ryantwolf committed Jun 27, 2024
commit edf4bcb042ab26f9111ac2f5c2606bcb809edc8d
examples/nemo_sdk/launch_slurm.py (51 additions, 0 deletions)
@@ -0,0 +1,51 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import nemo_sdk as sdk
from nemo_sdk.core.execution import SlurmExecutor

from nemo_curator.nemo_sdk import SlurmJobConfig


@sdk.factory
def nemo_curator_slurm_executor():
"""
Configure the following function with the details of your SLURM cluster
"""
return SlurmExecutor(
job_name_prefix="nemo-curator",
nodes=2,
exclusive=True,
time="04:00:00",
container_image="nvcr.io/nvidia/nemo:dev",
container_mounts="/path/on/machine:/path/in/container",
)


if __name__ == "__main__":
    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
    container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
    # The NeMo Curator command to run
    curator_command = "text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output"
    curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command=curator_command,
    )

    with sdk.Experiment(
        "example_nemo_curator_exp", executor="nemo_curator_slurm_executor"
    ) as exp:
        exp.add(curator_job.to_script(), tail_logs=True)
        exp.run(detach=False)
nemo_curator/nemo_sdk/__init__.py (17 additions, 0 deletions)
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .slurm import SlurmJobConfig

__all__ = ["SlurmJobConfig"]
nemo_curator/nemo_sdk/slurm.py (75 additions, 0 deletions)
@@ -0,0 +1,75 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass
from typing import Dict

import nemo_sdk as sdk


@dataclass
class SlurmJobConfig:
Collaborator:

Also tagging @jacobtomlinson who's done a lot of work on the dask/dask-cuda clusters with Slurm (among other things).

For now this mimics the command-line setup for starting clusters, but feel free to share any opinions you might have, since this overlaps a lot with the dask-runners/dask-jobqueue API.

Member:

Thanks @ayushdg. This seems to follow the common pattern that a lot of Slurm implementations use, so I don't have any particular comments. I'm always keen to see how we can reuse code, though, so maybe we could work towards a common base in dask-jobqueue that projects like this can use instead of reinventing it each time.
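
For reference, a minimal sketch of the overlapping dask-jobqueue pattern, assuming a recent dask-jobqueue release; the core, memory, and walltime values below are hypothetical and cluster-specific:

# Minimal sketch of the dask-jobqueue API this config overlaps with.
# Resource values are hypothetical and depend on the cluster.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=16,             # CPU cores requested per Slurm job
    memory="64GB",        # memory requested per Slurm job
    walltime="04:00:00",
)
cluster.scale(jobs=2)     # roughly analogous to nodes=2 in the example executor
client = Client(cluster)

SlurmJobConfig below takes the inverse approach: rather than launching the Dask cluster from the Python client, it packages a set of environment variables for container_entrypoint.sh, which is expected to start the scheduler and workers on the allocated nodes.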

    job_dir: str
    container_entrypoint: str
    script_command: str
    device: str = "cpu"
    interface: str = "eth0"
    protocol: str = "tcp"
    cpu_worker_memory_limit: str = "0"
    rapids_no_initialize: str = "1"
    cudf_spill: str = "1"
    rmm_scheduler_pool_size: str = "1GB"
    rmm_worker_pool_size: str = "72GiB"
    libcudf_cufile_policy: str = "OFF"
Comment on lines +61 to +68
Collaborator:

Similar to my other comment, I often have trouble knowing what to set for these types of parameters. Is there anywhere the user might be able to refer to for recommendations of how to set these parameters for their specific cluster?

Collaborator Author:

I agree we should release a bigger guide on our recommendations for each parameter. For now I've included a docstring that should provide a bit more context. Let me know if you want me to change anything else to make it clearer.
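
Purely as an illustration (not an official recommendation), a GPU-oriented configuration might look like the sketch below; every value is hypothetical and should be tuned to the cluster's hardware and network:

# Hypothetical GPU-oriented settings; illustrative only, not recommended values.
from nemo_curator.nemo_sdk import SlurmJobConfig

gpu_job = SlurmJobConfig(
    job_dir="/home/user/jobs",
    container_entrypoint="/cluster/path/slurm/container_entrypoint.sh",
    script_command="text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output",
    device="gpu",                    # run the Dask workers on GPUs instead of the default "cpu"
    interface="ib0",                 # hypothetical: an InfiniBand interface, if the cluster exposes one
    protocol="ucx",                  # hypothetical: UCX transport for GPU-to-GPU communication
    rmm_worker_pool_size="72GiB",    # keep the RMM pool below each GPU's total memory
    cudf_spill="1",                  # let cuDF spill from device to host memory under pressure
)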


    def to_script(
        self, add_scheduler_file: bool = True, add_device: bool = True
    ) -> sdk.Script:
        """
        Converts to a script object executable by NeMo SDK

        Args:
            add_scheduler_file: Automatically appends a '--scheduler-file' argument to the
                script_command where the value is job_dir/logs/scheduler.json. All
                scripts included in NeMo Curator accept and require this argument to scale
                properly on SLURM clusters.
            add_device: Automatically appends a '--device' argument to the script_command
                where the value is the member variable of device. All scripts included in
                NeMo Curator accept and require this argument.

        Returns:
            A NeMo SDK Script that will initialize a Dask cluster and run the specified
            command. It is designed to be executed on a SLURM cluster.
        """
        env_vars = self._build_env_vars()

        if add_scheduler_file:
            env_vars[
                "SCRIPT_COMMAND"
            ] += f" --scheduler-file={env_vars['SCHEDULER_FILE']}"
        if add_device:
            env_vars["SCRIPT_COMMAND"] += f" --device={env_vars['DEVICE']}"

        return sdk.Script(path=self.container_entrypoint, env=env_vars)

    def _build_env_vars(self) -> Dict[str, str]:
        env_vars = vars(self)
        # Convert to uppercase to match container_entrypoint.sh
        env_vars = {key.upper(): val for key, val in env_vars.items()}

        env_vars["LOGDIR"] = f"{self.job_dir}/logs"
        env_vars["PROFILESDIR"] = f"{self.job_dir}/profiles"
        env_vars["SCHEDULER_FILE"] = f"{env_vars['LOGDIR']}/scheduler.json"
        env_vars["SCHEDULER_LOG"] = f"{env_vars['LOGDIR']}/scheduler.log"
        env_vars["DONE_MARKER"] = f"{env_vars['LOGDIR']}/done.txt"

        return env_vars
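
To make the wiring concrete, a rough sketch of what to_script() hands to the container entrypoint for the example job above, assuming the default add_scheduler_file=True and add_device=True (attribute access on sdk.Script is assumed here purely for illustration):

# Rough illustration; exact env contents follow _build_env_vars above.
from nemo_curator.nemo_sdk import SlurmJobConfig

job = SlurmJobConfig(
    job_dir="/home/user/jobs",
    container_entrypoint="/cluster/path/slurm/container_entrypoint.sh",
    script_command="text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output",
)
script = job.to_script()
# script.path                      -> "/cluster/path/slurm/container_entrypoint.sh"
# script.env["LOGDIR"]             -> "/home/user/jobs/logs"
# script.env["SCHEDULER_FILE"]     -> "/home/user/jobs/logs/scheduler.json"
# script.env["DEVICE"]             -> "cpu"
# script.env["SCRIPT_COMMAND"]     -> "text_cleaning ... --scheduler-file=/home/user/jobs/logs/scheduler.json --device=cpu"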