Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for NeMo SDK #131

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Add support for NeMo SDK #131

wants to merge 11 commits into from

Conversation

ryantwolf
Copy link
Collaborator

@ryantwolf ryantwolf commented Jun 27, 2024

Description

NeMo SDK is a library designed to make running different parts of the NeMo FW easier across computing platforms. It serves as an enhanced version of the NeMo Framework Launcher. This PR adds an example and simple config shortcut to run NeMo Curator scripts on Slurm clusters using NeMo SDK.

Usage

See examples/nemo_sdk/slurm.py for example usage.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
@ryantwolf ryantwolf marked this pull request as ready for review June 28, 2024 17:54
@ryantwolf ryantwolf requested a review from ayushdg July 1, 2024 22:31
Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I really like the user guide and how straightforward it is. Just left a couple comments about helping the user know what parameters they should use.

docs/user-guide/nemosdk.rst Outdated Show resolved Hide resolved
Comment on lines +29 to +36
interface: str = "eth0"
protocol: str = "tcp"
cpu_worker_memory_limit: str = "0"
rapids_no_initialize: str = "1"
cudf_spill: str = "1"
rmm_scheduler_pool_size: str = "1GB"
rmm_worker_pool_size: str = "72GiB"
libcudf_cufile_policy: str = "OFF"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my other comment, I often have trouble knowing what to set for these types of parameters. Is there anywhere the user might be able to refer to for recommendations of how to set these parameters for their specific cluster?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should release a bigger guide on our recommendations for each parameter. For now I've included a docstring that should provide a bit more context. Let me know if you want me to change anything else to make it clearer.

Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, will take another look once you link nemo_sdk

docs/user-guide/nemosdk.rst Show resolved Hide resolved
# Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
# The NeMo Curator command to run
curator_command = "text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this work for multiple commands ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should be able to work for all dask-based curator scripts. It'll work for the rest too, though it'll be a bit overkill.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool , we can leave it as is. Maybe we add two.

Comment on lines +20 to +21
mkdir -p $LOGDIR
mkdir -p $PROFILESDIR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is run inside the container ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're correct. It's a good call out, and it makes me think we maybe should've done this from the beginning since the contents of $LOGDIR and $PROFILESDIR get written inside the container, so they ought to be initialized in it too. Let me know if you disagree.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, i think because logs are accessed from outside the container so we should make it clear where the path is from outside the container, I think to make it clear we should echo $LOGDIR and $PROFILESDIR to help someone debugging this.

examples/slurm/start-slurm.sh Show resolved Hide resolved
nemo_curator/nemo_sdk/slurm.py Show resolved Hide resolved
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, have added non blocking comments around LOGDIR (which is mostly unrelated to this PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants