Add support for NeMo SDK #131
Conversation
LGTM! I really like the user guide and how straightforward it is. Just left a couple comments about helping the user know what parameters they should use.
```python
interface: str = "eth0"
protocol: str = "tcp"
cpu_worker_memory_limit: str = "0"
rapids_no_initialize: str = "1"
cudf_spill: str = "1"
rmm_scheduler_pool_size: str = "1GB"
rmm_worker_pool_size: str = "72GiB"
libcudf_cufile_policy: str = "OFF"
```
Similar to my other comment, I often have trouble knowing what to set for these types of parameters. Is there anywhere the user can refer to for recommendations on how to set them for their specific cluster?
I agree we should release a bigger guide on our recommendations for each parameter. For now I've included a docstring that should provide a bit more context. Let me know if you want me to change anything else to make it clearer.
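For reference, here is a sketch of the kind of per-parameter guidance such a docstring could carry. The field names and defaults come from this diff; the recommendations are general dask-cuda/RAPIDS rules of thumb, not official tuning advice:

```python
from dataclasses import dataclass


@dataclass
class SlurmJobConfig:
    """Dask cluster settings for a Slurm job (guidance below is illustrative).

    - interface: network interface Dask communicates over; prefer the
      cluster's high-bandwidth interface (e.g. "ib0" on InfiniBand systems).
    - protocol: "tcp" is the safe default; "ucx" can be faster on GPU
      clusters with UCX installed.
    - cpu_worker_memory_limit: per-CPU-worker memory cap; "0" means no limit.
    - rapids_no_initialize: "1" delays CUDA context creation until first use.
    - cudf_spill: "1" lets cuDF spill device memory to host under pressure.
    - rmm_scheduler_pool_size: RMM pool on the scheduler; keep it small.
    - rmm_worker_pool_size: RMM pool per GPU worker; commonly sized to most
      of the GPU's memory (e.g. "72GiB" on an 80GiB GPU).
    - libcudf_cufile_policy: "OFF" disables GPUDirect Storage unless the
      cluster is configured for it.
    """

    interface: str = "eth0"
    protocol: str = "tcp"
    cpu_worker_memory_limit: str = "0"
    rapids_no_initialize: str = "1"
    cudf_spill: str = "1"
    rmm_scheduler_pool_size: str = "1GB"
    rmm_worker_pool_size: str = "72GiB"
    libcudf_cufile_policy: str = "OFF"
```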
Mostly looks good, will take another look once you link nemo_sdk
```bash
mkdir -p $LOGDIR
mkdir -p $PROFILESDIR
```
Is this run inside the container?
Yes, you're correct. It's a good callout, and it makes me think we maybe should've done this from the beginning: the contents of `$LOGDIR` and `$PROFILESDIR` get written inside the container, so they ought to be initialized in it too. Let me know if you disagree.
Since the logs are accessed from outside the container, we should make it clear what the path is from the outside. To make that clear, I think we should echo `$LOGDIR` and `$PROFILESDIR` to help anyone debugging this.
Agreed. Most of my setups mount these to a location that's accessible from outside the compute nodes, so it's good to keep in mind that the end goal for these logs/profiles is someplace within the mounted dirs.
Added echo and comments.
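For reference, the relevant section of the script presumably ends up looking something like this sketch (only the mkdir lines appear in the diff; the echo wording and comments are assumed):

```bash
# These paths are container-side; where they land on the host depends on the
# container mounts, so echo them to help anyone debugging from outside.
echo "Writing logs to $LOGDIR"
echo "Writing profiles to $PROFILESDIR"
mkdir -p $LOGDIR
mkdir -p $PROFILESDIR
```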
Mostly looks good to me. I've added non-blocking comments around `LOGDIR` (which is mostly unrelated to this PR).
Thanks for the updates, LGTM!
```python
exclusive=True,
time="04:00:00",
container_image="nvcr.io/nvidia/nemo:dev",
container_mounts=["/path/on/machine:/path/in/container"],
```
This is maybe a question for NeMo SDK (apologies for my lack of familiarity): can users pass in additional args here for other Slurm options?
Haha, no need to apologize for your lack of familiarity. Yes, the user can pass in additional args.
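For illustration, additional options would be forwarded alongside the ones shown above. A sketch, assuming the import path and treating account/partition as examples of extra args (neither is confirmed by this PR):

```python
import nemo_sdk as sdk  # assumed import path, not confirmed in this PR

# Options shown in the diff, plus extra Slurm options passed the same way.
executor = sdk.SlurmExecutor(
    exclusive=True,
    time="04:00:00",
    container_image="nvcr.io/nvidia/nemo:dev",
    container_mounts=["/path/on/machine:/path/in/container"],
    account="my-account",  # assumed: an additional Slurm option
    partition="gpu",       # assumed: an additional Slurm option
)
```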
```python
@dataclass
class SlurmJobConfig:
```
Also tagging @jacobtomlinson, who's done a lot of work on the dask/dask-cuda clusters with Slurm (among other things). For now this mimics the command-line setup to start clusters, but feel free to share any opinions you might have, since this overlaps a lot with the dask-runners/dask-jobqueue API.
Thanks @ayushdg. This seems to follow the common pattern that a lot of Slurm implementations use, so I don't have any particular comments. I'm always keen to see how we can reuse code though, so maybe we could work towards a common base in dask-jobqueue that projects like this can use instead of reinventing it each time.
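For comparison, the equivalent hand-rolled setup with dask-jobqueue today looks roughly like this (standard dask_jobqueue API; the resource values are placeholders):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each Slurm job submitted by the cluster runs one Dask worker with these resources.
cluster = SLURMCluster(
    queue="gpu",        # placeholder partition name
    cores=8,
    memory="64GB",
    walltime="04:00:00",
)
cluster.scale(jobs=2)   # submit two Slurm jobs
client = Client(cluster)
```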
* Begin docs
* Add slurm sdk example
* Use safe import
* Fix bugs in sdk
* Update docs and tweak scripts
* Add interface helper function
* Update docs
* Fix formatting
* Add config docstring
* Address comments

Signed-off-by: Ryan Wolf <[email protected]>
Description
NeMo SDK is a library designed to make running different parts of the NeMo FW easier across computing platforms. It serves as an enhanced version of the NeMo Framework Launcher. This PR adds an example and simple config shortcut to run NeMo Curator scripts on Slurm clusters using NeMo SDK.
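As a rough sketch of how the pieces fit together (import paths are assumptions; only SlurmJobConfig, its config fields, and the executor options appear in this PR; see `examples/nemo_sdk/slurm.py` for the actual usage):

```python
# Sketch only: import paths and anything not shown in this PR are assumptions.
import nemo_sdk as sdk                            # assumed import path
from nemo_curator.nemo_sdk import SlurmJobConfig  # assumed import path

# Dask cluster settings for the job (fields from this PR's diff; any required
# fields not shown in the diff are omitted here).
config = SlurmJobConfig(interface="eth0", protocol="tcp")

# Slurm allocation settings handed to NeMo SDK (options from this PR's diff).
executor = sdk.SlurmExecutor(
    exclusive=True,
    time="04:00:00",
    container_image="nvcr.io/nvidia/nemo:dev",
    container_mounts=["/path/on/machine:/path/in/container"],
)
```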
Usage
See `examples/nemo_sdk/slurm.py` for example usage.

Checklist