[REVIEW] Switch Models to use Crossfit #58

Merged: 43 commits, May 21, 2024

Collaborator

@VibhuJawa VibhuJawa commented May 9, 2024

This PR enables using Crossfit.

Todo:

  • Domain Model
  • Quality Model
  • Update installation instructions (Verified locally)

Benchmarks:

| Subset Place | Dataset Size | Batch Size | GPUs | Time Taken | Implementation | Speedup | Note |
|---|---|---|---|---|---|---|---|
| subset_CC-MAIN-2023-14_english | 1.8 GB | 1024** (dynamic) | 16 V100 | 163 | Crossfit | 1 | Batch size is dynamic |
| subset_CC-MAIN-2023-14_english | 1.8 GB | 256 (static) | 16 V100 | 230 | MainLine | 1.41 | Batch size is static; 512 and 1024 both OOM |

A100 numbers:

| Dataset | Size (MB) | Resolution | GPU | Time (s) | Model | Speedup | Batch Size |
|---|---|---|---|---|---|---|---|
| subset_CC-MAIN-2023-14_english | 148 | 1024** | 2 A100 | 50.558 | Crossfit | 1 | Batch size is dynamic |
| subset_CC-MAIN-2023-14_english | 148 | 1024** | 2 A100 | 78.487 | Mainline | 1.55 | Batch size is static |
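
A note on reading the Speedup column: the values appear to be each run's time divided by the Crossfit time, so Crossfit is 1 by definition and the Mainline figure is how many times longer Mainline took. The sketch below recomputes the ratios from the times in the two benchmark tables. The names `BENCHMARKS` and `time_ratio` are illustrative only and are not part of NeMo-Curator or Crossfit. The first table does not state a unit for "Time Taken", but the ratio does not depend on it.

```python
# Times copied from the two benchmark tables in the PR description.
BENCHMARKS = {
    "16 V100": {"Crossfit": 163.0, "Mainline": 230.0},
    "2 A100": {"Crossfit": 50.558, "Mainline": 78.487},
}


def time_ratio(times: dict[str, float], implementation: str) -> float:
    """Run time of `implementation` divided by the Crossfit run time."""
    return times[implementation] / times["Crossfit"]


# Ratio per GPU setup and implementation, rounded as in the tables.
ratios = {
    gpus: {impl: round(time_ratio(times, impl), 2) for impl in times}
    for gpus, times in BENCHMARKS.items()
}

for gpus, by_impl in ratios.items():
    print(gpus, by_impl)
```

Read this way, Crossfit finishes the same subset about 1.41x faster on 16 V100s and about 1.55x faster on 2 A100s.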

@VibhuJawa
Collaborator Author

CC: @sarahyurick for an initial review.

@ayushdg
Collaborator

ayushdg commented May 9, 2024

Side note: NeMo-Curator requires all PRs to include commits that are both signed and signed off. This can be enabled with git commit -sS ....
There's some more info in https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md#pull-requests-pr-guidelines.

@VibhuJawa VibhuJawa force-pushed the switch_to_crossfit branch from b41387e to 54b0a91 Compare May 9, 2024 18:32
@VibhuJawa
Collaborator Author

Side note: NeMo-Curator requires all PRs to include commits that are both signed and signed off. This can be enabled with git commit -sS .... There's some more info in https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md#pull-requests-pr-guidelines.

Thanks, fixed it.

Collaborator

@sarahyurick sarahyurick left a comment

I haven't been able to try running this yet (conda environments are acting up so I'm running an extensive cleanup right now) but wanted to leave at least an initial review for now. Changes look good and very straightforward to me! I don't have any major concerns or issues with this; I plan to approve/request further changes as needed later today when I actually run it.

Only question atm is whether you're planning to add the quality classifier in this PR or sometime later? Either way is ok with me.

Collaborator

@sarahyurick sarahyurick left a comment

I was able to run this as is! Thanks again.

@@ -0,0 +1,373 @@
{
Collaborator Author

I can move this file if we don't like having notebooks, but personally I think notebooks are a better way to understand examples.

Collaborator

We can keep the notebook, but could you please move it to the tutorials folder? The name of the folder could be distributed_data_classification and the name of the notebook could be distributed_data_classification.ipynb.

Collaborator

Actually while we're moving stuff around, do you mind also moving everything in distributed_data_classification_examples to the root examples folder? I think they should be with everything else. And rename the files too please.

  • domain_api_example.py -> domain_classifier_example.py
  • quality_api_example.py -> quality_classifier_example.py
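
The move-and-rename requested above can be sketched as follows. This is a minimal illustration, not the change made in the PR: it assumes distributed_data_classification_examples sits directly under examples/ (the thread does not say where it lives), and `RENAMES` and `move_examples` are hypothetical names. It runs against a throwaway directory so it is safe to execute.

```python
import tempfile
from pathlib import Path

# Renames requested in the review comment.
RENAMES = {
    "domain_api_example.py": "domain_classifier_example.py",
    "quality_api_example.py": "quality_classifier_example.py",
}


def move_examples(src_dir: Path, dest_dir: Path) -> list[str]:
    """Move every file in src_dir into dest_dir, applying RENAMES by file name."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(src_dir.iterdir()):
        if path.is_file():
            target = dest_dir / RENAMES.get(path.name, path.name)
            path.rename(target)
            moved.append(target.name)
    return moved


# Demo on a temporary tree; the directory layout here is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "examples" / "distributed_data_classification_examples"
    src.mkdir(parents=True)
    for name in RENAMES:
        (src / name).touch()
    moved = move_examples(src, root / "examples")

print(moved)
```

Inside a real checkout, git mv would be the more natural tool, since it stages the rename in the same step.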

@VibhuJawa VibhuJawa force-pushed the switch_to_crossfit branch from 368df8b to c8690d3 Compare May 16, 2024 23:23
@VibhuJawa
Collaborator Author

@sarahyurick, I have added the quality model and cleaned up the scripts. Can I get a re-review, please?

@VibhuJawa VibhuJawa changed the base branch from main to pii-perf May 16, 2024 23:24
@VibhuJawa VibhuJawa changed the base branch from pii-perf to main May 16, 2024 23:24
@VibhuJawa VibhuJawa force-pushed the switch_to_crossfit branch from c8690d3 to bbee12f Compare May 16, 2024 23:28
@VibhuJawa VibhuJawa self-assigned this May 16, 2024
@VibhuJawa VibhuJawa force-pushed the switch_to_crossfit branch 2 times, most recently from 6c9d5ea to 6bff4fa Compare May 17, 2024 00:00
@VibhuJawa VibhuJawa changed the title [WIP] Switch Models to use Crossfit [REVIEW] Switch Models to use Crossfit May 17, 2024
@VibhuJawa VibhuJawa requested a review from ryantwolf May 17, 2024 00:02
Collaborator

@ryantwolf ryantwolf left a comment

I'm super glad to see a speedup like this, and thanks so much for removing a lot of code. I left a lot of nitpicks, and a couple of additional things I'd like to see cleaned up. Thanks for doing this!

@VibhuJawa VibhuJawa force-pushed the switch_to_crossfit branch from e2811e7 to cfd77f7 Compare May 20, 2024 19:01
@VibhuJawa
Collaborator Author

Really sorry for the commit noise, guys (@ryantwolf, @sarahyurick); I messed up one of the git rebases (forgot to sign off a commit).

This should be ready for another review now. I have addressed all the review comments and opened follow-up issues for the ones I could not get to in this PR.

Collaborator

@ryantwolf ryantwolf left a comment

Great, I missed one file in my initial review, but other than that mostly just nits. Thanks!

VibhuJawa and others added 20 commits May 20, 2024 17:32
Signed-off-by: Vibhu Jawa <[email protected]>
…ssification.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa VibhuJawa force-pushed the switch_to_crossfit branch from ca7b772 to ca713c8 Compare May 21, 2024 00:33
@VibhuJawa
Collaborator Author

@ryantwolf, thanks again for all the careful reviews. I appreciate the help. I think I have those resolved now.

Collaborator

@ryantwolf ryantwolf left a comment

Looking good! Just a couple of nits and things I think you missed from my last review. I'll run some tests to do some final verifications too.

@ryantwolf ryantwolf self-requested a review May 21, 2024 19:59
Collaborator

@ryantwolf ryantwolf left a comment

Good with me, thanks!

Collaborator

@sarahyurick sarahyurick left a comment

LGTM, thanks! Added a couple questions for general discussion.

Collaborator

Just wanted to make sure there's nothing here we want to keep?

Collaborator Author

We can add it back as a follow-up if we think we need it.

Comment on lines +70 to +73
"input_file_path=\"/input_data_dir/\"\n",
"output_file_path = \"output_data_dir/\"\n",
"domain_model_path = \"domain_model.pth\"\n",
"quality_model_path = \"quality_model.pth\""
Collaborator

I see you opened #72. I think once we add support for that, we should use real links, paths, etc. so that the user can run it without changing anything.

Should we open an issue for this?

Collaborator Author

Let's add a note to that issue so we also track changing this file. Thanks for the suggestion. 👍🏼
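
For readers unfamiliar with the .ipynb format: the four lines quoted in this thread are JSON-encoded strings from the notebook cell's source list, which is why the quotes and newlines appear escaped. The sketch below decodes them back into the Python the cell would run. The values are the placeholders from the diff, which is the point of the review comment: a user has to replace them with real paths before the notebook will work. The variable names `raw_source`, `cell_source` and `namespace` are illustrative.

```python
import json

# The four quoted lines as they appear in the notebook JSON
# (trailing commas between list items dropped).
raw_source = [
    r'"input_file_path=\"/input_data_dir/\"\n"',
    r'"output_file_path = \"output_data_dir/\"\n"',
    r'"domain_model_path = \"domain_model.pth\"\n"',
    r'"quality_model_path = \"quality_model.pth\""',
]

# Each entry is a JSON string; decoding it removes the escaping.
cell_source = "".join(json.loads(line) for line in raw_source)
print(cell_source)

# Execute the decoded cell in an isolated namespace to inspect the variables.
namespace = {}
exec(cell_source, namespace)
```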

@VibhuJawa VibhuJawa merged commit 9f8578b into NVIDIA:main May 21, 2024
3 checks passed