
Commit 0cbe447

ryantwolf authored and sarahyurick committed
Add support for NeMo SDK (NVIDIA#131)
* Begin docs
* Add slurm sdk example
* Use safe import
* Fix bugs in sdk
* Update docs and tweak scripts
* Add interface helper function
* Update docs
* Fix formatting
* Add config docstring
* Address comments

Signed-off-by: Ryan Wolf <[email protected]>
1 parent 1e6acd8 commit 0cbe447

File tree

10 files changed: +338, -8 lines

docs/user-guide/index.rst

+4
@@ -30,6 +30,9 @@
 :ref:`NeMo Curator on Kubernetes <data-curator-kubernetes>`
    Demonstration of how to run the NeMo Curator on a Dask Cluster deployed on top of Kubernetes

+:ref:`NeMo Curator with NeMo SDK <data-curator-nemo-sdk>`
+   Example of how to use NeMo Curator with NeMo SDK to run on various platforms
+
 `Tutorials <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials>`__
    To get started, you can explore the NeMo Curator GitHub repository and follow the available tutorials and notebooks. These resources cover various aspects of data curation, including training from scratch and Parameter-Efficient Fine-Tuning (PEFT).

@@ -49,3 +52,4 @@
    personalidentifiableinformationidentificationandremoval.rst
    distributeddataclassification.rst
    kubernetescurator.rst
+   nemosdk.rst

docs/user-guide/nemosdk.rst

+127
@@ -0,0 +1,127 @@
.. _data-curator-nemo-sdk:

======================================
NeMo Curator with NeMo SDK
======================================
-----------------------------------------
NeMo SDK
-----------------------------------------

The NeMo SDK is a general-purpose tool for configuring and executing Python functions and scripts across various computing environments.
It is used across the NeMo Framework for managing machine learning experiments.
One of the key features of the NeMo SDK is the ability to run code locally or on platforms like Slurm with minimal changes.

-----------------------------------------
Usage
-----------------------------------------

We recommend becoming familiar with the NeMo SDK and its documentation before jumping into this section.

Let's walk through an example of launching a Slurm job using `examples/launch_slurm.py <https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/nemo_sdk/launch_slurm.py>`_.

.. code-block:: python

    import nemo_sdk as sdk
    from nemo_sdk.core.execution import SlurmExecutor

    from nemo_curator.nemo_sdk import SlurmJobConfig

    @sdk.factory
    def nemo_curator_slurm_executor() -> SlurmExecutor:
        """
        Configure the following function with the details of your SLURM cluster
        """
        return SlurmExecutor(
            job_name_prefix="nemo-curator",
            account="my-account",
            nodes=2,
            exclusive=True,
            time="04:00:00",
            container_image="nvcr.io/nvidia/nemo:dev",
            container_mounts=["/path/on/machine:/path/in/container"],
        )

First, we need to define a factory that can produce a ``SlurmExecutor``.
This executor is where you define all your cluster parameters. Note: NeMo SDK currently only supports running on SLURM clusters with `Pyxis <https://github.com/NVIDIA/pyxis>`_.
After this, there is the main function:

.. code-block:: python

    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
    container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
    # The NeMo Curator command to run
    curator_command = "text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output"
    curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command=curator_command,
    )

First, we need to specify the path to `examples/slurm/container-entrypoint.sh <https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/slurm/container-entrypoint.sh>`_ on the cluster.
This shell script is responsible for setting up the Dask cluster on Slurm and will be the main script run, so we need to define the path to it.

Second, we need to establish the NeMo Curator script we want to run.
This can be a command line utility like the ``text_cleaning`` command above, or it can be your own custom script run with ``python path/to/script.py``.
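For instance, a custom script could be passed as the ``script_command`` like this (a sketch only; the path and flag below are hypothetical placeholders):

.. code-block:: python

    # Hypothetical custom script: any command runnable inside the container can be used.
    curator_command = "python /path/to/my_curation_script.py --input-data-dir=/path/to/data"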
Finally, we combine all of these into a ``SlurmJobConfig``. This config has many options for configuring the Dask cluster.
We'll highlight a couple of important ones:

* ``device="cpu"`` determines the type of Dask cluster to initialize. If you are using GPU modules, please set this to ``"gpu"`` (see the GPU sketch after the interface example below).
* ``interface="eth0"`` specifies the network interface to use for communication within the Dask cluster. It will likely be different for your Slurm cluster, so please modify it as needed. You can determine which interfaces are available by running the following function on your cluster.

.. code-block:: python

    from nemo_curator import get_network_interfaces

    print(get_network_interfaces())
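For GPU workloads, a configuration might look like the following sketch. The script path, interface name, and pool size here are illustrative placeholders; substitute values appropriate for your cluster.

.. code-block:: python

    # Hypothetical GPU job; all parameters shown are options of SlurmJobConfig.
    gpu_curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command="python /path/to/my_gpu_script.py",  # placeholder command
        device="gpu",                  # start GPU Dask workers
        protocol="ucx",                # recommended for GPU jobs if the cluster supports it
        interface="ib0",               # replace with an interface from get_network_interfaces()
        rmm_worker_pool_size="72GiB",  # roughly 80-90% of available GPU memory
    )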
After configuring the job, we can finally run it:

.. code-block:: python

    executor = sdk.resolve(SlurmExecutor, "nemo_curator_slurm_executor")
    with sdk.Experiment("example_nemo_curator_exp", executor=executor) as exp:
        exp.add(curator_job.to_script(), tail_logs=True)
        exp.run(detach=False)

First, we use ``sdk.resolve`` to resolve our custom factory.
Next, we use it to begin an experiment named "example_nemo_curator_exp" running on our Slurm executor.

``exp.add(curator_job.to_script(), tail_logs=True)`` adds the NeMo Curator script to the experiment.
It converts the ``SlurmJobConfig`` to an ``sdk.Script``.
The ``curator_job.to_script()`` call has two important parameters:

* ``add_scheduler_file=True``
* ``add_device=True``

Both of these modify the command specified in ``curator_command``.
Setting both to ``True`` (the default) transforms the original command from:

.. code-block:: bash

    # Original command
    text_cleaning \
        --input-data-dir=/path/to/data \
        --output-clean-dir=/path/to/output

to:

.. code-block:: bash

    # Modified command
    text_cleaning \
        --input-data-dir=/path/to/data \
        --output-clean-dir=/path/to/output \
        --scheduler-file=/path/to/scheduler/file \
        --device="cpu"

As you can see, ``add_scheduler_file=True`` causes ``--scheduler-file=/path/to/scheduler/file`` to be appended to the command, and ``add_device=True`` causes ``--device="cpu"`` (or whatever the device is set to) to be appended.
``/path/to/scheduler/file`` is determined by ``SlurmJobConfig``, and ``device`` is whatever the user specified in the ``device`` parameter previously.

The scheduler file argument is necessary to connect to the Dask cluster on Slurm.
All NeMo Curator scripts accept both arguments, so the default is to add them automatically.
If your script is configured differently, feel free to turn these off, as in the sketch below.
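A minimal sketch of disabling both flags, assuming your custom script manages its own Dask connection and device selection:

.. code-block:: python

    # The script command is passed through to the entrypoint unmodified in this case.
    exp.add(
        curator_job.to_script(add_scheduler_file=False, add_device=False),
        tail_logs=True,
    )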
The final line, ``exp.run(detach=False)``, starts the experiment on the Slurm cluster.

examples/nemo_sdk/launch_slurm.py

+56
@@ -0,0 +1,56 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import nemo_sdk as sdk
from nemo_sdk.core.execution import SlurmExecutor

from nemo_curator.nemo_sdk import SlurmJobConfig


@sdk.factory
def nemo_curator_slurm_executor() -> SlurmExecutor:
    """
    Configure the following function with the details of your SLURM cluster
    """
    return SlurmExecutor(
        job_name_prefix="nemo-curator",
        account="my-account",
        nodes=2,
        exclusive=True,
        time="04:00:00",
        container_image="nvcr.io/nvidia/nemo:dev",
        container_mounts=["/path/on/machine:/path/in/container"],
    )


def main():
    # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
    container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
    # The NeMo Curator command to run
    # This command can be substituted with any NeMo Curator command
    curator_command = "text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output"
    curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command=curator_command,
    )

    executor = sdk.resolve(SlurmExecutor, "nemo_curator_slurm_executor")
    with sdk.Experiment("example_nemo_curator_exp", executor=executor) as exp:
        exp.add(curator_job.to_script(), tail_logs=True)
        exp.run(detach=False)


if __name__ == "__main__":
    main()

examples/slurm/container-entrypoint.sh

+7-1
@@ -16,6 +16,12 @@

 # Start the scheduler on the rank 0 node
 if [[ -z "$SLURM_NODEID" ]] || [[ $SLURM_NODEID == 0 ]]; then
+  # Make the directories needed
+  echo "Making log directory $LOGDIR"
+  mkdir -p $LOGDIR
+  echo "Making profile directory $PROFILESDIR"
+  mkdir -p $PROFILESDIR
+
   echo "Starting scheduler"
   if [[ $DEVICE == 'cpu' ]]; then
     dask scheduler \
@@ -58,7 +64,7 @@ fi
 sleep 60

 if [[ -z "$SLURM_NODEID" ]] || [[ $SLURM_NODEID == 0 ]]; then
-  echo "Starting $SCRIPT_PATH"
+  echo "Starting $SCRIPT_COMMAND"
   bash -c "$SCRIPT_COMMAND"
   touch $DONE_MARKER
 fi

examples/slurm/start-slurm.sh

+2-4
@@ -28,7 +28,8 @@
 export BASE_JOB_DIR=`pwd`/nemo-curator-jobs
 export JOB_DIR=$BASE_JOB_DIR/$SLURM_JOB_ID

-# Logging information
+# Directory for Dask cluster communication and logging
+# Must be paths inside the container that are accessible across nodes
 export LOGDIR=$JOB_DIR/logs
 export PROFILESDIR=$JOB_DIR/profiles
 export SCHEDULER_FILE=$LOGDIR/scheduler.json
@@ -74,9 +75,6 @@ export DASK_DATAFRAME__QUERY_PLANNING=False
 # End easy customization
 # =================================================================

-mkdir -p $LOGDIR
-mkdir -p $PROFILESDIR
-
 # Start the container
 srun \
   --container-mounts=${MOUNTS} \

nemo_curator/__init__.py

+1-1
@@ -41,7 +41,7 @@
     NemoDeployClient,
     OpenAIClient,
 )
-from .utils.distributed_utils import get_client
+from .utils.distributed_utils import get_client, get_network_interfaces

 # Dask will automatically convert the list score type
 # to a string without this option.

nemo_curator/nemo_sdk/__init__.py

+17
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .slurm import SlurmJobConfig

__all__ = ["SlurmJobConfig"]

nemo_curator/nemo_sdk/slurm.py

+110
@@ -0,0 +1,110 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass
from typing import Dict

from nemo_curator.utils.import_utils import safe_import

sdk = safe_import("nemo_sdk")


@dataclass
class SlurmJobConfig:
    """
    Configuration for running a NeMo Curator script on a SLURM cluster using
    NeMo SDK

    Args:
        job_dir: The base directory where all the files related to setting up
            the Dask cluster for NeMo Curator will be written
        container_entrypoint: A path to the container-entrypoint.sh script
            on the cluster. container-entrypoint.sh is found in the repo
            here: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/slurm/container-entrypoint.sh
        script_command: The NeMo Curator CLI tool to run. Pass any additional arguments
            needed directly in this string.
        device: The type of script that will be running, and therefore the type
            of Dask cluster that will be created. Must be either "cpu" or "gpu".
        interface: The network interface the Dask cluster will communicate over.
            Use nemo_curator.get_network_interfaces() to get a list of available ones.
        protocol: The networking protocol to use. Can be either "tcp" or "ucx".
            Setting to "ucx" is recommended for GPU jobs if your cluster supports it.
        cpu_worker_memory_limit: The maximum memory per process that a Dask worker can use.
            "5GB" or "5000M" are examples. "0" means no limit.
        rapids_no_initialize: Will delay or disable the CUDA context creation of RAPIDS libraries,
            allowing for improved compatibility with UCX-enabled clusters and preventing runtime warnings.
        cudf_spill: Enables automatic spilling (and “unspilling”) of buffers from device to host to
            enable out-of-memory computation, i.e., computing on objects that occupy more memory
            than is available on the GPU.
        rmm_scheduler_pool_size: Sets a small pool of GPU memory for message transfers when
            the scheduler is using ucx
        rmm_worker_pool_size: The amount of GPU memory each GPU worker process may use.
            Recommended to set at 80-90% of available GPU memory. 72GiB is good for A100/H100
        libcudf_cufile_policy: Allows reading/writing directly from storage to GPU.
    """

    job_dir: str
    container_entrypoint: str
    script_command: str
    device: str = "cpu"
    interface: str = "eth0"
    protocol: str = "tcp"
    cpu_worker_memory_limit: str = "0"
    rapids_no_initialize: str = "1"
    cudf_spill: str = "1"
    rmm_scheduler_pool_size: str = "1GB"
    rmm_worker_pool_size: str = "72GiB"
    libcudf_cufile_policy: str = "OFF"

    def to_script(self, add_scheduler_file: bool = True, add_device: bool = True):
        """
        Converts to a script object executable by NeMo SDK
        Args:
            add_scheduler_file: Automatically appends a '--scheduler-file' argument to the
                script_command where the value is job_dir/logs/scheduler.json. All
                scripts included in NeMo Curator accept and require this argument to scale
                properly on SLURM clusters.
            add_device: Automatically appends a '--device' argument to the script_command
                where the value is the member variable of device. All scripts included in
                NeMo Curator accept and require this argument.
        Returns:
            A NeMo SDK Script that will initialize a Dask cluster and run the specified command.
            It is designed to be executed on a SLURM cluster.
        """
        env_vars = self._build_env_vars()

        if add_scheduler_file:
            env_vars[
                "SCRIPT_COMMAND"
            ] += f" --scheduler-file={env_vars['SCHEDULER_FILE']}"
        if add_device:
            env_vars["SCRIPT_COMMAND"] += f" --device={env_vars['DEVICE']}"

        # Surround the command in quotes so the variable gets set properly
        env_vars["SCRIPT_COMMAND"] = f"\"{env_vars['SCRIPT_COMMAND']}\""

        return sdk.Script(path=self.container_entrypoint, env=env_vars)

    def _build_env_vars(self) -> Dict[str, str]:
        env_vars = vars(self)
        # Convert to uppercase to match container_entrypoint.sh
        env_vars = {key.upper(): val for key, val in env_vars.items()}

        env_vars["LOGDIR"] = f"{self.job_dir}/logs"
        env_vars["PROFILESDIR"] = f"{self.job_dir}/profiles"
        env_vars["SCHEDULER_FILE"] = f"{env_vars['LOGDIR']}/scheduler.json"
        env_vars["SCHEDULER_LOG"] = f"{env_vars['LOGDIR']}/scheduler.log"
        env_vars["DONE_MARKER"] = f"{env_vars['LOGDIR']}/done.txt"

        return env_vars
