Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes #75

Merged
merged 24 commits into from
May 30, 2024

Conversation

miguelusque
Copy link
Contributor

When reading jsonl files with Dask, the dataframe datatypes are inferred unless explicitly specified.

Inferring the data types can lead to several issues, such as incorrect type inference, degradation of performance and increased memory usage among others.

I think we could mitigate those issues if we would add a --input-meta parameter, which would receive a dictionary of datatypes.

That parameter would be optional, and be similar to the --meta parameter available here: https://docs.dask.org/en/latest/generated/dask.dataframe.read_json.html.

@miguelusque miguelusque force-pushed the miguelusque-fix-63 branch from 1aa9a72 to 18fce31 Compare May 22, 2024 11:56
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
@miguelusque miguelusque force-pushed the miguelusque-fix-63 branch from 18fce31 to 638f7ff Compare May 22, 2024 12:25
@ayushdg ayushdg requested review from ryantwolf and ayushdg May 23, 2024 21:19
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea! I left a few changes I'd like to see made, but nothing major. Also it looks like there are some merge conflicts so you should probably resolve those.

nemo_curator/download/doc_builder.py Outdated Show resolved Hide resolved
examples/blend_and_shuffle.py Outdated Show resolved Hide resolved
nemo_curator/scripts/download_and_extract.py Show resolved Hide resolved
nemo_curator/utils/distributed_utils.py Outdated Show resolved Hide resolved
tutorials/peft-curation/main.py Outdated Show resolved Hide resolved
@ryantwolf
Copy link
Collaborator

ryantwolf commented May 24, 2024

After sleeping on it, I do have one more nit. I think the argument for all the read_* functions should be meta. The argparse arguments can still be --input-meta. I want to try and keep the arguments and behavior as close to Dask as possible.

@miguelusque
Copy link
Contributor Author

After sleeping on it, I do have one more nit. I think the argument for all the read_* functions should be meta. The argparse arguments can still be --input-meta. I want to try and keep the arguments and behavior as close to Dask as possible.

Hi @ryantwolf , thank you for all your feedback.

I am addressing all the comments you have made. My first thought about the name of the parameter was to name it meta, but then I thought that, in the near future, we may want to also specify the output-meta parameter, where we may want to specify specific formats for the output files.

Even if I think it is very likely that meta might work for both use-cases, I thought it would be clearer to differentiate them.

Looking forward to your thoughts. Thanks!!!

miguelusque and others added 17 commits May 25, 2024 04:13
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! Left a few suggestions, but generally looks good.

In addition to the changes, is it also possible to add a couple of tests, that attempt to use DocumentDataset.read_json with the input_meta param set both as a string and a dict, and verify that the result dtypes are what we'd expect.

nemo_curator/datasets/doc_dataset.py Show resolved Hide resolved
nemo_curator/scripts/verify_classification_results.py Outdated Show resolved Hide resolved
nemo_curator/scripts/verify_classification_results.py Outdated Show resolved Hide resolved
nemo_curator/utils/distributed_utils.py Outdated Show resolved Hide resolved
nemo_curator/utils/distributed_utils.py Outdated Show resolved Hide resolved
nemo_curator/scripts/verify_classification_results.py Outdated Show resolved Hide resolved
nemo_curator/utils/fuzzy_dedup_utils/io_utils.py Outdated Show resolved Hide resolved
@miguelusque
Copy link
Contributor Author

miguelusque commented May 30, 2024

Thanks for the changes! Left a few suggestions, but generally looks good.

In addition to the changes, is it also possible to add a couple of tests, that attempt to use DocumentDataset.read_json with the input_meta param set both as a string and a dict, and verify that the result dtypes are what we'd expect.

Done! See test_io.py file, where I think we could start adding other IO-related tests.

Signed-off-by: Miguel Martínez <[email protected]>
@miguelusque miguelusque requested a review from ayushdg May 30, 2024 22:27
Signed-off-by: Miguel Martínez <[email protected]>
Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for ddressing all my comments.

@ayushdg ayushdg merged commit 757b389 into NVIDIA:main May 30, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants