.. _data-curator-nemo-sdk:

======================================
NeMo Curator with NeMo SDK
======================================
-----------------------------------------
NeMo SDK
-----------------------------------------

The NeMo SDK is a general-purpose tool for configuring and executing Python functions and scripts across various computing environments.
It is used across the NeMo Framework for managing machine learning experiments.
One of its key features is the ability to run code locally or on platforms like Slurm with minimal changes.

-----------------------------------------
Usage
-----------------------------------------

We recommend becoming somewhat familiar with the NeMo SDK before diving in; see its documentation for details.

Let's walk through `examples/launch_slurm.py <https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/nemo_sdk/launch_slurm.py>`_, which shows how to launch a Slurm job with NeMo Curator.

.. code-block:: python
|
    import nemo_sdk as sdk
    from nemo_sdk.core.execution import SlurmExecutor

    from nemo_curator.nemo_sdk import SlurmJobConfig

    @sdk.factory
    def nemo_curator_slurm_executor() -> SlurmExecutor:
        """
        Configure this factory with the details of your Slurm cluster
        """
        return SlurmExecutor(
            job_name_prefix="nemo-curator",
            account="my-account",
            nodes=2,
            exclusive=True,
            time="04:00:00",
            container_image="nvcr.io/nvidia/nemo:dev",
            container_mounts=["/path/on/machine:/path/in/container"],
        )
|
First, we need to define a factory that can produce a ``SlurmExecutor``.
This executor is where you define all of your cluster parameters. Note that NeMo SDK currently only supports running on Slurm clusters with `Pyxis <https://github.com/NVIDIA/pyxis>`_.
After this comes the main function:

.. code-block:: python
|
    # Path to NeMo-Curator/examples/slurm/container-entrypoint.sh on the Slurm cluster
    container_entrypoint = "/cluster/path/slurm/container-entrypoint.sh"
    # The NeMo Curator command to run
    curator_command = "text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output"
    curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command=curator_command,
    )
|
First, we need to specify the path to `examples/slurm/container-entrypoint.sh <https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/slurm/container-entrypoint.sh>`_ on the cluster.
This shell script is responsible for setting up the Dask cluster on Slurm and is the main script that gets run, so the job needs to know where to find it.

Second, we need to establish the NeMo Curator script we want to run.
This can be a command-line utility like the ``text_cleaning`` command above, or your own custom script run with ``python path/to/script.py``, as in the sketch below.
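
For instance, here is a minimal sketch of using your own script as the command; the script path and its flag are hypothetical placeholders, and the rest of the ``SlurmJobConfig`` stays the same:

.. code-block:: python

    # A custom curation script as the command; the path and flag below are placeholders.
    # Pass this string as script_command exactly as in the example above.
    curator_command = "python /cluster/path/my_curation_script.py --input-data-dir=/path/to/data"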
Finally, we combine all of these into a ``SlurmJobConfig``. This config has many options for configuring the Dask cluster.
We'll highlight a couple of important ones (see the sketch after this list):

* ``device="cpu"`` determines the type of Dask cluster to initialize. If you are using GPU modules, set this to ``"gpu"``.
* ``interface="eth0"`` specifies the network interface to use for communication within the Dask cluster. It will likely be different for your Slurm cluster, so please modify it as needed. You can determine which interfaces are available by running the following function on your cluster:

  .. code-block:: python

      from nemo_curator import get_network_interfaces

      print(get_network_interfaces())
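
Putting these options together, here is a sketch of a GPU-based job configuration. The ``interface`` value below is only a placeholder; substitute one of the interfaces reported by ``get_network_interfaces`` on your cluster.

.. code-block:: python

    curator_job = SlurmJobConfig(
        job_dir="/home/user/jobs",
        container_entrypoint=container_entrypoint,
        script_command=curator_command,
        device="gpu",       # initialize a GPU Dask cluster instead of the default CPU one
        interface="ib0",    # placeholder; use an interface available on your cluster
    )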
|
With the executor factory and job configuration in place, we can put everything together and run it:

.. code-block:: python

    executor = sdk.resolve(SlurmExecutor, "nemo_curator_slurm_executor")
    with sdk.Experiment("example_nemo_curator_exp", executor=executor) as exp:
        exp.add(curator_job.to_script(), tail_logs=True)
        exp.run(detach=False)
|
First, we use ``sdk.resolve`` to resolve our custom factory into a ``SlurmExecutor``.
Next, we use it to begin an experiment named "example_nemo_curator_exp" running on our Slurm executor.

``exp.add(curator_job.to_script(), tail_logs=True)`` adds the NeMo Curator script to the experiment.
It converts the ``SlurmJobConfig`` into an ``sdk.Script``.
``curator_job.to_script()`` has two important parameters:

* ``add_scheduler_file=True``
* ``add_device=True``

Both of these modify the command specified in ``curator_command``.
Setting both to ``True`` (the default) transforms the original command from:

.. code-block:: bash
|
    # Original command
    text_cleaning \
        --input-data-dir=/path/to/data \
        --output-clean-dir=/path/to/output
|
to:

.. code-block:: bash
|
    # Modified command
    text_cleaning \
        --input-data-dir=/path/to/data \
        --output-clean-dir=/path/to/output \
        --scheduler-file=/path/to/scheduler/file \
        --device="cpu"
|
As you can see, ``add_scheduler_file=True`` causes ``--scheduler-file=/path/to/scheduler/file`` to be appended to the command, and ``add_device=True`` causes ``--device="cpu"`` (or whatever the device is set to) to be appended.
``/path/to/scheduler/file`` is determined by the ``SlurmJobConfig``, and the device value is whatever was passed to the ``device`` parameter described previously.

The scheduler file argument is necessary for the script to connect to the Dask cluster on Slurm.
All NeMo Curator scripts accept both arguments, so the default is to add them automatically.
If your script is configured differently, feel free to turn these off, as in the sketch below.
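
For example, a minimal sketch that disables both options when adding the script to the experiment:

.. code-block:: python

    # Add the script without appending --scheduler-file or --device to the command
    exp.add(
        curator_job.to_script(add_scheduler_file=False, add_device=False),
        tail_logs=True,
    )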

The final line of the experiment block, ``exp.run(detach=False)``, starts the experiment on the Slurm cluster.