Develop a notebook that creates a pipeline (recipe) for running new GneissWeb transforms in sequence on some data of your choosing. #983

Open · 1 of 2 tasks
shahrokhDaijavad opened this issue Jan 27, 2025 · 2 comments


shahrokhDaijavad commented Jan 27, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

We now have the four new transforms that were used in creating GneissWeb as 3 PRs in the repo (one PR implements two of these transforms). These PRs are still being tested, but they are all in reasonable shape, and each has a notebook example that will be helpful in creating ONE notebook that runs the transforms sequentially (plus existing transforms such as Filter). The sequence is:
Read Data -> Repetition Removal -> Readability/FastText/DCLM etc. Annotation -> Extreme Token Annotation -> Filter
The 3 PRs are:
#953 (rep removal), #965 (extreme tokenization and readability), and #974 (fasttext classification).
The relevant notebooks are all in the main transform directories, for example, https://github.com/swith005/data-prep-kit-outer/blob/rep_removal/transforms/universal/rep_removal/rep_removal.ipynb for rep removal. A rough sketch of the combined pipeline is below.
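
A minimal sketch of what the combined notebook could look like, assuming each transform can be applied as a pure-Python step over a pyarrow Table. The transform functions below are illustrative stubs standing in for the actual APIs in #953, #965, and #974, and the filter column/threshold are made up for the example:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Illustrative stubs only -- replace each with the real transform from the
# corresponding PR (#953 rep removal, #965 readability/extreme tokenization,
# #974 fasttext classification).
def rep_removal_transform(table: pa.Table) -> pa.Table:
    return table  # stand-in: removes repeated spans from the text column

def annotate_readability(table: pa.Table) -> pa.Table:
    # stand-in: the real transform appends readability-score columns
    return table.append_column(
        "readability_score", pa.array([1.0] * table.num_rows)
    )

def annotate_fasttext(table: pa.Table) -> pa.Table:
    return table  # stand-in: appends fasttext/DCLM quality annotations

def annotate_extreme_tokens(table: pa.Table) -> pa.Table:
    return table  # stand-in: appends extreme-tokenization annotations

def run_pipeline(in_path: str, out_path: str) -> None:
    table = pq.read_table(in_path)           # 1. Read Data
    table = rep_removal_transform(table)     # 2. Repetition Removal
    table = annotate_readability(table)      # 3. Readability annotation
    table = annotate_fasttext(table)         #    FastText/DCLM annotation
    table = annotate_extreme_tokens(table)   # 4. Extreme-token annotation
    # 5. Filter on the annotation columns (column name and threshold are
    #    placeholders, not the transforms' actual defaults).
    table = table.filter(pc.greater(table["readability_score"], 0.5))
    pq.write_table(table, out_path)

run_pipeline("input/sample.parquet", "output/sample.parquet")
```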
@touma-I @yousafshah @Swanand-Kadhe @Hajar-Emami

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shahrokhDaijavad commented:

@Swanand-Kadhe @Hajar-Emami #965 has now been merged into the dev branch.

@shahrokhDaijavad commented:

@BishwaBhatta The GneissWeb fasttext classifier transform has so far been tested only with the facebook/fasttext-language-identification model.bin.
Whenever the models that were used in creating the GneissWeb fasttext classification are uploaded to HF or a similar public place, please let us know.
Also, I assume the small number of input parquet/arrow files that @Swanand-Kadhe and @Hajar-Emami need for their notebook will have to be downloaded from a public place, correct?
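
For reference, a minimal sketch of how the notebook can fetch and try that one tested model from the Hugging Face Hub (the sample text and printed output are illustrative only):

```python
from huggingface_hub import hf_hub_download
import fasttext

# Download the only model the transform has been tested with so far.
model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# fastText expects single-line input; predict returns (labels, probabilities).
labels, probs = model.predict("GneissWeb is a large-scale pretraining dataset.")
print(labels[0], probs[0])  # e.g. __label__eng_Latn 0.99...
```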
