Add support for NeMo SDK #131
Conversation
LGTM! I really like the user guide and how straightforward it is. Just left a couple comments about helping the user know what parameters they should use.
```python
interface: str = "eth0"
protocol: str = "tcp"
cpu_worker_memory_limit: str = "0"
rapids_no_initialize: str = "1"
cudf_spill: str = "1"
rmm_scheduler_pool_size: str = "1GB"
rmm_worker_pool_size: str = "72GiB"
libcudf_cufile_policy: str = "OFF"
```
Similar to my other comment, I often have trouble knowing what to set for these types of parameters. Is there anywhere the user can refer to for recommendations on how to set them for their specific cluster?
I agree we should release a bigger guide on our recommendations for each parameter. For now I've included a docstring that should provide a bit more context. Let me know if you want me to change anything else to make it clearer.
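For reference, here is a sketch of the kind of per-parameter guidance such a docstring could carry. The field names and defaults come from this diff; the recommendations are general dask-cuda/RAPIDS rules of thumb, not official tuning advice:

```python
from dataclasses import dataclass


@dataclass
class SlurmJobConfig:
    """Dask cluster settings for a Slurm job (guidance below is illustrative).

    - interface: network interface Dask communicates over; prefer the
      cluster's high-bandwidth interface (e.g. "ib0" on InfiniBand systems).
    - protocol: "tcp" is the safe default; "ucx" can be faster on GPU
      clusters with UCX installed.
    - cpu_worker_memory_limit: per-CPU-worker memory cap; "0" means no limit.
    - rapids_no_initialize: "1" delays CUDA context creation until first use.
    - cudf_spill: "1" lets cuDF spill device memory to host under pressure.
    - rmm_scheduler_pool_size: RMM pool on the scheduler; keep it small.
    - rmm_worker_pool_size: RMM pool per GPU worker; commonly sized to most
      of the GPU's memory (e.g. "72GiB" on an 80GiB GPU).
    - libcudf_cufile_policy: "OFF" disables GPUDirect Storage unless the
      cluster is configured for it.
    """

    interface: str = "eth0"
    protocol: str = "tcp"
    cpu_worker_memory_limit: str = "0"
    rapids_no_initialize: str = "1"
    cudf_spill: str = "1"
    rmm_scheduler_pool_size: str = "1GB"
    rmm_worker_pool_size: str = "72GiB"
    libcudf_cufile_policy: str = "OFF"
```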
Mostly looks good, will take another look once you link nemo_sdk
```bash
mkdir -p $LOGDIR
mkdir -p $PROFILESDIR
```
Is this run inside the container?
Yes, you're correct. It's a good callout, and it makes me think we maybe should've done this from the beginning: the contents of `$LOGDIR` and `$PROFILESDIR` get written inside the container, so they ought to be initialized in it too. Let me know if you disagree.
Since the logs are accessed from outside the container, we should make it clear what the path is from the outside. To make that clear, I think we should echo `$LOGDIR` and `$PROFILESDIR` to help anyone debugging this.
Agreed. Most of my setups mount these to a location that's accessible from outside the compute nodes, so it's good to keep in mind that the end goal for these logs/profiles is someplace within the mounted dirs.
Added echo and comments.
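For reference, the relevant section of the script presumably ends up looking something like this sketch (only the mkdir lines appear in the diff; the echo wording and comments are assumed):

```bash
# These paths are container-side; where they land on the host depends on the
# container mounts, so echo them to help anyone debugging from outside.
echo "Writing logs to $LOGDIR"
echo "Writing profiles to $PROFILESDIR"
mkdir -p $LOGDIR
mkdir -p $PROFILESDIR
```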
Mostly looks good to me. I've added non-blocking comments around `LOGDIR` (which is mostly unrelated to this PR).
Thanks for the updates, LGTM!
```python
exclusive=True,
time="04:00:00",
container_image="nvcr.io/nvidia/nemo:dev",
container_mounts=["/path/on/machine:/path/in/container"],
```
This is maybe a question for NeMo SDK (apologies for my lack of familiarity): can users pass in additional args here for other Slurm options?
Haha, no need to apologize for your lack of familiarity. Yes, the user can pass in additional args.
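For illustration, additional options would be forwarded alongside the ones shown above. A sketch, assuming the import path and treating account/partition as examples of extra args (neither is confirmed by this PR):

```python
import nemo_sdk as sdk  # assumed import path, not confirmed in this PR

# Options shown in the diff, plus extra Slurm options passed the same way.
executor = sdk.SlurmExecutor(
    exclusive=True,
    time="04:00:00",
    container_image="nvcr.io/nvidia/nemo:dev",
    container_mounts=["/path/on/machine:/path/in/container"],
    account="my-account",  # assumed: an additional Slurm option
    partition="gpu",       # assumed: an additional Slurm option
)
```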
```python
@dataclass
class SlurmJobConfig:
```
Also tagging @jacobtomlinson, who's done a lot of work on the dask/dask-cuda clusters with Slurm (among other things). For now this mimics the command-line setup to start clusters, but feel free to share any opinions you might have, since this overlaps a lot with the dask-runners/dask-jobqueue API.
Thanks @ayushdg. This seems to follow the common pattern that a lot of Slurm implementations use, so I don't have any particular comments. I'm always keen to see how we can reuse code though, so maybe we could work towards a common base in dask-jobqueue that projects like this can use instead of reinventing it each time.
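For comparison, the equivalent hand-rolled setup with dask-jobqueue today looks roughly like this (standard dask_jobqueue API; the resource values are placeholders):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each Slurm job submitted by the cluster runs one Dask worker with these resources.
cluster = SLURMCluster(
    queue="gpu",        # placeholder partition name
    cores=8,
    memory="64GB",
    walltime="04:00:00",
)
cluster.scale(jobs=2)   # submit two Slurm jobs
client = Client(cluster)
```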
* Begin docs
* Add slurm sdk example
* Use safe import
* Fix bugs in sdk
* Update docs and tweak scripts
* Add interface helper function
* Update docs
* Fix formatting
* Add config docstring
* Address comments

Signed-off-by: Ryan Wolf <[email protected]>
Description
NeMo SDK is a library designed to make running different parts of the NeMo FW easier across computing platforms. It serves as an enhanced version of the NeMo Framework Launcher. This PR adds an example and simple config shortcut to run NeMo Curator scripts on Slurm clusters using NeMo SDK.
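As a rough sketch of how the pieces fit together (import paths are assumptions; only SlurmJobConfig, its config fields, and the executor options appear in this PR; see `examples/nemo_sdk/slurm.py` for the actual usage):

```python
# Sketch only: import paths and anything not shown in this PR are assumptions.
import nemo_sdk as sdk                            # assumed import path
from nemo_curator.nemo_sdk import SlurmJobConfig  # assumed import path

# Dask cluster settings for the job (fields from this PR's diff; any required
# fields not shown in the diff are omitted here).
config = SlurmJobConfig(interface="eth0", protocol="tcp")

# Slurm allocation settings handed to NeMo SDK (options from this PR's diff).
executor = sdk.SlurmExecutor(
    exclusive=True,
    time="04:00:00",
    container_image="nvcr.io/nvidia/nemo:dev",
    container_mounts=["/path/on/machine:/path/in/container"],
)
```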
Usage
See `examples/nemo_sdk/slurm.py` for example usage.

Checklist