Fix 69 - Refactor how arguments are added to scripts #102

miguelusque · 2024-06-10T08:41:56Z

Description

I noticed that the argument-parsing code is repeated across multiple scripts.

I have refactored the different parser.add_argument calls, grouping them in a single file that is consumed by the different scripts.
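As a rough illustration of the pattern described above (not the actual code in this PR), shared `parser.add_argument` calls can live in one function that each script imports. The names `add_common_io_args`, `build_script_a_parser`, `build_script_b_parser`, and the `--id-prefix` flag are hypothetical; the `--input-data-dir` and `--input-type` flags follow the names mentioned later in this thread, and their defaults here are assumptions.

```python
import argparse


def add_common_io_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register arguments that several scripts share (hypothetical helper)."""
    parser.add_argument(
        "--input-data-dir",
        type=str,
        default=None,
        help="Directory containing the input files.",
    )
    parser.add_argument(
        "--input-type",
        type=str,
        default="jsonl",  # assumed default, for illustration only
        help="File type of the input dataset.",
    )
    return parser


def build_script_a_parser() -> argparse.ArgumentParser:
    # Script A only needs the shared arguments.
    parser = argparse.ArgumentParser(description="Hypothetical script A")
    return add_common_io_args(parser)


def build_script_b_parser() -> argparse.ArgumentParser:
    # Script B reuses the shared arguments and adds one of its own.
    parser = argparse.ArgumentParser(description="Hypothetical script B")
    add_common_io_args(parser)
    parser.add_argument("--id-prefix", type=str, default="doc_id")
    return parser
```

With this layout, a change to a shared help string or default is made once and picked up by every script that calls the helper.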

Signed-off-by: Miguel Martínez <[email protected]>

ayushdg

Thanks for the changes. Most of my comments are around the same topic:

There are some parts of the codebase that repeat the same arguments across multiple files. It makes sense to group some of those into the common file for reuse. For other arguments that are very specific to a use case or notebook, do you have a strong opinion on why they should reside in a common class rather than being added directly in the example itself?

examples/domain_classifier_example.py

examples/k8s/create_dask_cluster.py

examples/quality_classifier_example.py

nemo_curator/scripts/add_id.py

nemo_curator/scripts/filter_documents.py

ryantwolf

In general I like this idea going forward, but I think I want to see a couple of changes. Relating to Ayush's review summary, I think that common arguments (like those found in add_distributed_args) should be grouped together. I think this class also makes more sense as a home for add_distributed_args and similar functions rather than them being entirely independent.

However, I think if there isn't any reuse of a list of arguments in a script, we shouldn't make a method in ArgumentHelper for it. Arguments that are hyper-specific to each script should remain only in that script and not be abstracted away. I'm fine abstracting away collections of arguments that are used in 2+ scripts though (like input-data-dir, input-type, etc.). So I think I'd like to see ArgumentHelper shrink a bit for now and return a lot of the niche arguments to their original scripts.
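A minimal sketch of the split suggested here, assuming a helper class that wraps an `argparse.ArgumentParser`. The class below is an illustrative stand-in, not the real `ArgumentHelper` from this PR: the method `add_arg_input_data_dir`, the flags `--scheduler-address`, `--num-workers`, and `--starting-index`, and all defaults are assumptions. Only the names `ArgumentHelper`, `add_distributed_args`, and `input-data-dir` come from the discussion.

```python
import argparse


class ArgumentHelper:
    """Illustrative stand-in: holds only arguments reused by 2+ scripts."""

    def __init__(self, parser: argparse.ArgumentParser):
        self.parser = parser

    def add_arg_input_data_dir(self) -> None:
        # A single argument that many scripts share.
        self.parser.add_argument(
            "--input-data-dir",
            type=str,
            default=None,
            help="Directory containing the input files.",
        )

    def add_distributed_args(self) -> None:
        # A group of related arguments that are always used together.
        self.parser.add_argument("--scheduler-address", type=str, default=None)
        self.parser.add_argument("--num-workers", type=int, default=1)


def attach_args() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical script")
    helper = ArgumentHelper(parser)
    helper.add_arg_input_data_dir()
    helper.add_distributed_args()
    # Script-specific argument: declared here, not in the shared helper.
    parser.add_argument("--starting-index", type=int, default=0)
    return parser
```

Keeping the niche argument in the script means the helper grows only when a second script needs the same argument.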

nemo_curator/utils/script_utils.py

miguelusque · 2024-06-18T09:21:47Z

Hi Ryan,

Thank you! Just done!

ryantwolf

This version is great and much more in line with what I think is the most useful. I found a couple of outdated arguments (that were a lot easier to spot thanks to you) that would be good to fix before merging this in. Once all my changes are addressed and I verify via the CI that everything is ok, we should be good.

ryantwolf · 2024-06-24T22:30:51Z

examples/k8s/create_dask_cluster.py

Nit: Leave this file unchanged. ArgumentHelper isn't being used and the rest of the changes are inconsequential.

This still needs to be addressed.

nemo_curator/scripts/add_id.py

nemo_curator/scripts/blend_datasets.py

nemo_curator/scripts/text_cleaning.py

nemo_curator/scripts/train_fasttext.py

nemo_curator/utils/fuzzy_dedup_utils/io_utils.py

nemo_curator/scripts/make_data_shards.py

ryantwolf

Just one comment of mine is left to be addressed. Other than that, it should be good! I'll run the CI now just to double-check.

ryantwolf · 2024-06-25T20:12:21Z

examples/k8s/create_dask_cluster.py

This still needs to be addressed.

ryantwolf

Great, thanks again for this!

ayushdg

Confirmed it works with the fuzzy dedup CLI pipeline.

miguelusque added 4 commits June 10, 2024 04:30

Refactor arguments usage

e0adc47

Signed-off-by: Miguel Martínez <[email protected]>

Refactor arguments usage

816e7fb

Signed-off-by: miguelusque <[email protected]>

Refactor arguments usage

052e25a

Signed-off-by: miguelusque <[email protected]>

Refactor arguments usage

21c71e1

Signed-off-by: miguelusque <[email protected]>

ayushdg requested changes Jun 10, 2024

miguelusque added 3 commits June 10, 2024 19:01

Refactor arguments usage

66be5f7

Signed-off-by: miguelusque <[email protected]>

Refactor arguments usage

38ca0ca

Signed-off-by: miguelusque <[email protected]>

Refactor arguments usage

e74aa04

Signed-off-by: miguelusque <[email protected]>

miguelusque requested a review from ayushdg June 10, 2024 23:28

ryantwolf requested changes Jun 13, 2024

nemo_curator/utils/script_utils.py

nemo_curator/utils/script_utils.py Outdated

miguelusque added 4 commits June 14, 2024 19:00

Add missing parser

812f777

Signed-off-by: miguelusque <[email protected]>

Add missing parser

742406d

Signed-off-by: miguelusque <[email protected]>

Fix missing default parameter

93c8a29

Signed-off-by: miguelusque <[email protected]>

Move unique arguments to their corresponding scripts

3fb851f

Signed-off-by: miguelusque <[email protected]>

Merge branch 'NVIDIA:main' into miguelusque-fix-69

843149f

miguelusque requested a review from ryantwolf June 18, 2024 23:38

ryantwolf requested changes Jun 24, 2024

miguelusque and others added 12 commits June 25, 2024 14:29

Update help message

990c1c0

The former message was out of date.
Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>

Improve help wording

1bd0ef4

Improve help wording for output_data_dir argument
Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>

Update help wording

ed542b8

Update help wording for output-data-dir argument
Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>

Update nemo_curator/scripts/text_cleaning.py

653db8e

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>

Update nemo_curator/scripts/make_data_shards.py

7d59d65

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>

Fix help wording typo

63f0744

There is a typo in help for argument output-model
Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>

Improve help wording for output-data-dir argument

4ba9280

Signed-off-by: miguelusque <[email protected]>

Remove unused arguments

d9407b2

Signed-off-by: miguelusque <[email protected]>

Fix help wording

3a1bd09

Signed-off-by: miguelusque <[email protected]>

Remove unneeded print

35e47a4

Signed-off-by: miguelusque <[email protected]>

Fix help string for output-data-dir argument

d86308c

Signed-off-by: miguelusque <[email protected]>

Improve argument passing

2980420

Signed-off-by: miguelusque <[email protected]>

miguelusque requested a review from ryantwolf June 25, 2024 16:04

ryantwolf requested changes Jun 25, 2024

examples/k8s/create_dask_cluster.py Outdated

ryantwolf Jun 25, 2024

This still needs to be addressed.

Revert changes

1a36735

Signed-off-by: Miguel Martínez <[email protected]>

ryantwolf self-requested a review June 27, 2024 16:44

ryantwolf approved these changes Jun 27, 2024

ayushdg approved these changes Jul 1, 2024

ayushdg merged commit 274d3a9 into NVIDIA:main Jul 1, 2024
3 checks passed

ayushdg mentioned this pull request Jul 1, 2024

[FEA] Refactor how arguments are added to scripts #69

Closed

miguelusque mentioned this pull request Jul 1, 2024

Broken Tutorial: single_gpu_tutorial.ipynb #105

Closed
