-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Switch Models to use Crossfit #58
Conversation
CC: @sarahyurick for an initial review. |
Side note: NeMo-Curator requires all PR's to include both signed commits as well as signed-off. This can be enabled with |
b41387e
to
54b0a91
Compare
Thanks, fixed it. |
nemo_curator/distributed_data_classification/domain_classifier_inference.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't been able to try running this yet (conda environments are acting up so I'm running an extensive cleanup right now) but wanted to leave at least an initial review for now. Changes look good and very straightforward to me! I don't have any major concerns or issues with this; I plan to approve/request further changes as needed later today when I actually run it.
Only question atm is whether you're planning to add the quality classifier in this PR or sometime later? Either way is ok with me.
examples/distributed_data_classification_examples/domain_api_example.py
Outdated
Show resolved
Hide resolved
nemo_curator/distributed_data_classification/domain_classifier_inference.py
Outdated
Show resolved
Hide resolved
nemo_curator/distributed_data_classification/domain_classifier_inference.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to run this as is! Thanks again.
examples/distributed_data_classification_examples/domain_api_example.py
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,373 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move this file if we don't like having notebooks but personally i think for examples notebooks are a better way to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep the notebook, but could you please move it to the tutorials folder? The name of the folder could be distributed_data_classification
and the name of the notebook could be distributed_data_classification.ipynb
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually while we're moving stuff around, do you mind also moving everything in distributed_data_classification_examples
to the root examples
folder? I think they should be with everything else. And rename the files too please.
domain_api_example.py
->domain_classifier_example.py
quality_api_example.py
->quality_classifier_example.py
368df8b
to
c8690d3
Compare
@sarahyurick , I have added the quality model and cleaned up scripts, Can i get a re-review please. |
c8690d3
to
bbee12f
Compare
6c9d5ea
to
6bff4fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm super glad to see a speedup like this, and thanks so much for removing a lot of code. I left a lot of nitpicks, and a couple of additional things I'd like to see cleaned up. Thanks for doing this!
@@ -0,0 +1,373 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep the notebook, but could you please move it to the tutorials folder? The name of the folder could be distributed_data_classification
and the name of the notebook could be distributed_data_classification.ipynb
.
nemo_curator/distributed_data_classification/domain_classifier_inference.py
Outdated
Show resolved
Hide resolved
nemo_curator/distributed_data_classification/quality_classifier_inference.py
Outdated
Show resolved
Hide resolved
examples/distributed_data_classification_examples/test_models.ipynb
Outdated
Show resolved
Hide resolved
examples/distributed_data_classification_examples/test_models.ipynb
Outdated
Show resolved
Hide resolved
examples/distributed_data_classification_examples/test_models.ipynb
Outdated
Show resolved
Hide resolved
examples/distributed_data_classification_examples/test_models.ipynb
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,373 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually while we're moving stuff around, do you mind also moving everything in distributed_data_classification_examples
to the root examples
folder? I think they should be with everything else. And rename the files too please.
domain_api_example.py
->domain_classifier_example.py
quality_api_example.py
->quality_classifier_example.py
e2811e7
to
cfd77f7
Compare
nemo_curator/distributed_data_classification/quality_classifier_multiple_models_inference.py
Outdated
Show resolved
Hide resolved
540f5d8
to
982d7b6
Compare
Really sorry for the commit noise guys (@ryantwolf , @sarahyurick ), i messed up one of the git rebase (forgot to sign off a commit). This should be ready for another review now. Have addressed all the reviews and added issues to follow up ones that i could not get to in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, I missed one file in my initial review, but other than that mostly just nits. Thanks!
tutorials/distributed_data_classification/distributed_data_classification.ipynb
Outdated
Show resolved
Hide resolved
tutorials/distributed_data_classification/distributed_data_classification.ipynb
Outdated
Show resolved
Hide resolved
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
…ssification.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
…ssification.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
ca7b772
to
ca713c8
Compare
Signed-off-by: Vibhu Jawa <[email protected]>
@ryantwolf , Thanks again for all the careful reviews. Appreciate the help . I think i should have those resolved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Just a couple of nits and things I think you missed from my last review. I'll run some tests to do some final verifications too.
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good with me, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! Added a couple questions for general discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to make sure there's nothing here we want to keep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add it back as a followup if we think we need it .
"input_file_path=\"/input_data_dir/\"\n", | ||
"output_file_path = \"output_data_dir/\"\n", | ||
"domain_model_path = \"domain_model.pth\"\n", | ||
"quality_model_path = \"quality_model.pth\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you opened #72, I think once we add support for that we should use real links, paths, etc. so that the user can run it without changing anything.
Should we open an issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add to the issue for that to track we also change this file. Thanks for the suggestion. 👍🏼
This PR enables using Crossfit.
Todo:
Benchmarks:
A100 numbers: