Add support for NeMo SDK #131
base: main
Signed-off-by: Ryan Wolf <[email protected]>
LGTM! I really like the user guide and how straightforward it is. Just left a couple comments about helping the user know what parameters they should use.
interface: str = "eth0"
protocol: str = "tcp"
cpu_worker_memory_limit: str = "0"
rapids_no_initialize: str = "1"
cudf_spill: str = "1"
rmm_scheduler_pool_size: str = "1GB"
rmm_worker_pool_size: str = "72GiB"
libcudf_cufile_policy: str = "OFF"
Similar to my other comment, I often have trouble knowing what to set for these types of parameters. Is there anywhere the user might be able to refer to for recommendations of how to set these parameters for their specific cluster?
I agree we should release a bigger guide on our recommendations for each parameter. For now I've included a docstring that should provide a bit more context. Let me know if you want me to change anything else to make it clearer.
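As an illustration of how the fields quoted above could be overridden per cluster, here is a minimal sketch. The class name `DaskWorkerSettings` and the `to_env` helper are hypothetical and not taken from this PR, and the mapping of each field to an upper-cased environment variable is an assumption about how the values reach the Slurm entrypoint script. The defaults are copied from the diff; the override value is only an example, not a recommendation.

```python
from dataclasses import dataclass, fields
from typing import Dict


@dataclass
class DaskWorkerSettings:
    """Hypothetical container for the parameters quoted in the review thread."""

    interface: str = "eth0"
    protocol: str = "tcp"
    cpu_worker_memory_limit: str = "0"
    rapids_no_initialize: str = "1"
    cudf_spill: str = "1"
    rmm_scheduler_pool_size: str = "1GB"
    rmm_worker_pool_size: str = "72GiB"
    libcudf_cufile_policy: str = "OFF"

    def to_env(self) -> Dict[str, str]:
        # Assumption: each field is exported as an upper-cased environment
        # variable, e.g. rmm_worker_pool_size -> RMM_WORKER_POOL_SIZE.
        return {f.name.upper(): getattr(self, f.name) for f in fields(self)}


# Override only the field that differs on a given cluster; the rest keep
# the defaults shown in the diff.
settings = DaskWorkerSettings(rmm_worker_pool_size="32GiB")
env = settings.to_env()
```

Keeping the defaults in one dataclass means a user only has to state the values that differ on their cluster, which is also a natural place for the per-parameter docstring mentioned above.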
Mostly looks good, will take another look once you link nemo_sdk
# Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster
container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh"
# The NeMo Curator command to run
curator_command = "text_cleaning --input-data-dir=/path/to/data --output-clean-dir=/path/to/output"
Can this work for multiple commands?
Yes, this should be able to work for all dask-based curator scripts. It'll work for the rest too, though it'll be a bit overkill.
Cool, we can leave it as is. Maybe we add two.
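Since `curator_command` is a plain string, one way to produce it for more than one script is a small builder. The sketch below is only an illustration: `build_curator_command` is a hypothetical helper, not part of this PR or of NeMo SDK, and it assumes every option is passed in the `--flag=value` form used in the example above.

```python
import shlex


def build_curator_command(script: str, **options: str) -> str:
    """Hypothetical helper: render a curator script name and its options."""
    args = [script]
    for name, value in options.items():
        # Keyword names use underscores; the command line uses dashes.
        flag = "--" + name.replace("_", "-")
        args.append(f"{flag}={value}")
    # shlex.join quotes any argument that needs it (for example, paths
    # containing spaces).
    return shlex.join(args)


# Reproduces the command string from the example in the diff.
curator_command = build_curator_command(
    "text_cleaning",
    input_data_dir="/path/to/data",
    output_clean_dir="/path/to/output",
)
```

The same helper could be called again with a different script name to produce a second command string for a second example.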
mkdir -p $LOGDIR
mkdir -p $PROFILESDIR
Is this run inside the container?
Yes, you're correct. It's a good callout, and it makes me think we maybe should've done this from the beginning, since the contents of $LOGDIR and $PROFILESDIR get written inside the container, so they ought to be initialized in it too. Let me know if you disagree.
Since logs are accessed from outside the container, I think we should make it clear where the path is from outside the container. To do that, we should echo $LOGDIR and $PROFILESDIR to help someone debugging this.
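The suggestion can be sketched as follows, written in Python rather than in the shell of the entrypoint script. The function name `init_output_dirs` is hypothetical. The printed paths are the ones visible inside the container; translating them to host paths depends on the container mounts, which are not shown in this thread.

```python
import os
from pathlib import Path


def init_output_dirs(env=os.environ):
    """Create LOGDIR and PROFILESDIR, then print where they are."""
    created = {}
    for name in ("LOGDIR", "PROFILESDIR"):
        path = Path(env[name])
        # Same effect as `mkdir -p`: create parents, ignore existing dirs.
        path.mkdir(parents=True, exist_ok=True)
        # Print the resolved location so it shows up in the job output.
        print(f"{name}={path.resolve()}")
        created[name] = path
    return created
```

Passing a mapping instead of relying on `os.environ` keeps the function easy to exercise outside a Slurm job.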
Mostly looks good to me. I have added non-blocking comments around LOGDIR (which is mostly unrelated to this PR).
Description
NeMo SDK is a library designed to make running different parts of the NeMo FW easier across computing platforms. It serves as an enhanced version of the NeMo Framework Launcher. This PR adds an example and simple config shortcut to run NeMo Curator scripts on Slurm clusters using NeMo SDK.
Usage
See examples/nemo_sdk/slurm.py for example usage.
Checklist