[REVIEW] Add Resiliparse option for text extraction #128
Hi @ryantwolf @jojennin sorry for the confusion. I'm reopening #90 to fix the commit signoff issues I was dealing with; the PRs are identical otherwise. Thanks!
Overall looks great! I just have some design comments on how I think we can make it easier to swap between algorithms and customize them.
Hi @ryantwolf this is ready for another review. The only question I have is what you think the best way to go about adding the unit tests is? Locally I'm testing it with download_common_crawl.py but that seems a bit heavy for CI? Even doing a single snapshot with Edit: Perhaps it may be sufficient to add examples of
Good question. I wouldn't add a unit test for the
It isn't a bad idea to showcase how users can change the algorithm. Do you mind updating the
from nemo_curator.download import (
    download_common_crawl,
    ResiliparseExtraction,
)

# Change the extraction algorithm
extraction_algorithm = ResiliparseExtraction()
common_crawl = download_common_crawl(
    "/extracted/output/folder",
    "2020-50",
    "2021-04",
    output_type="jsonl",
    algorithm=extraction_algorithm,
)
Thanks @ryantwolf! Should be ready now.
Great, glad to have this in. Thanks again!
Signed-off-by: Sarah Yurick <[email protected]>
Duplicate of #90 with successful DCO check.
Right now, we only support Common Crawl text extraction with jusText. Resiliparse is known to be a faster text extraction algorithm which may also produce better tokens.
This PR adds optional support for the Resiliparse algorithm while still keeping jusText as the default.
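To illustrate the "optional algorithm with a default" design described above, here is a minimal sketch using only the standard library. It does not call jusText or Resiliparse and is not the code from this PR: `ExtractionAlgorithm`, `LengthFilterExtraction`, `KeepAllExtraction`, and `extract_page` are hypothetical names chosen for illustration, and the two toy algorithms merely stand in for wrappers around the real libraries.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import List, Optional


class _ParagraphCollector(HTMLParser):
    """Stdlib helper that collects the text found inside <p> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._in_p = False
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def _paragraphs(html: str) -> List[str]:
    collector = _ParagraphCollector()
    collector.feed(html)
    return collector.paragraphs


class ExtractionAlgorithm(ABC):
    """Hypothetical interface: turn raw HTML into a list of paragraphs."""

    @abstractmethod
    def extract_text(self, html: str) -> List[str]:
        ...


class LengthFilterExtraction(ExtractionAlgorithm):
    """Toy stand-in for the default algorithm: drops short paragraphs."""

    def __init__(self, min_length: int = 10) -> None:
        self.min_length = min_length

    def extract_text(self, html: str) -> List[str]:
        return [p for p in _paragraphs(html) if len(p) >= self.min_length]


class KeepAllExtraction(ExtractionAlgorithm):
    """Toy stand-in for the optional algorithm: keeps every paragraph."""

    def extract_text(self, html: str) -> List[str]:
        return _paragraphs(html)


def extract_page(
    html: str, algorithm: Optional[ExtractionAlgorithm] = None
) -> List[str]:
    # Fall back to the default when the caller does not choose an algorithm.
    if algorithm is None:
        algorithm = LengthFilterExtraction()
    return algorithm.extract_text(html)
```

With this shape, callers who pass nothing keep the default behavior, while `extract_page(html, algorithm=KeepAllExtraction())` swaps the algorithm, and constructor arguments such as `min_length` allow per-algorithm customization.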