Develop a notebook that creates a pipeline (recipe) for running new GneissWeb transforms in sequence on some data of your choosing. #983

Open · 1 of 2 tasks
shahrokhDaijavad opened this issue Jan 27, 2025 · 2 comments


shahrokhDaijavad commented Jan 27, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

We now have the four new transforms that were used in creating GneissWeb as 3 PRs in the repo (one PR implements two of these transforms). These PRs are still being tested, but they are all in reasonable shape, and each has a notebook example that will be helpful in creating ONE notebook that runs the transforms sequentially (plus existing transforms such as Filter). The sequence is:
Read Data -> Repetition Removal -> Readability/FastText/DCLM etc. Annotation -> Extreme Token Annotation -> Filter
The 3 PRs are:
#953 (rep removal), #965 (extreme tokenization and readability), and #974 (fasttext classification).
The relevant notebooks are all in the main transform directories, for example, https://github.com/swith005/data-prep-kit-outer/blob/rep_removal/transforms/universal/rep_removal/rep_removal.ipynb for rep removal. A rough sketch of the combined pipeline is below.
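
A minimal sketch of what the combined notebook could look like, assuming each transform can be applied as a pure-Python step over a pyarrow Table. The transform functions below are illustrative stubs standing in for the actual APIs in #953, #965, and #974, and the filter column/threshold are made up for the example:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Illustrative stubs only -- replace each with the real transform from the
# corresponding PR (#953 rep removal, #965 readability/extreme tokenization,
# #974 fasttext classification).
def rep_removal_transform(table: pa.Table) -> pa.Table:
    return table  # stand-in: removes repeated spans from the text column

def annotate_readability(table: pa.Table) -> pa.Table:
    # stand-in: the real transform appends readability-score columns
    return table.append_column(
        "readability_score", pa.array([1.0] * table.num_rows)
    )

def annotate_fasttext(table: pa.Table) -> pa.Table:
    return table  # stand-in: appends fasttext/DCLM quality annotations

def annotate_extreme_tokens(table: pa.Table) -> pa.Table:
    return table  # stand-in: appends extreme-tokenization annotations

def run_pipeline(in_path: str, out_path: str) -> None:
    table = pq.read_table(in_path)           # 1. Read Data
    table = rep_removal_transform(table)     # 2. Repetition Removal
    table = annotate_readability(table)      # 3. Readability annotation
    table = annotate_fasttext(table)         #    FastText/DCLM annotation
    table = annotate_extreme_tokens(table)   # 4. Extreme-token annotation
    # 5. Filter on the annotation columns (column name and threshold are
    #    placeholders, not the transforms' actual defaults).
    table = table.filter(pc.greater(table["readability_score"], 0.5))
    pq.write_table(table, out_path)

run_pipeline("input/sample.parquet", "output/sample.parquet")
```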
@touma-I @yousafshah @Swanand-Kadhe @Hajar-Emami

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shahrokhDaijavad commented:

@Swanand-Kadhe @Hajar-Emami #965 has now been merged into the dev branch.

@shahrokhDaijavad commented:

@BishwaBhatta The GneissWeb fasttext classifier transform has so far been tested only with the facebook/fasttext-language-identification model.bin.
Whenever the models that were used in creating the GneissWeb fasttext classification are uploaded to HF or a similar public place, please let us know.
Also, I assume the small number of input parquet/arrow files that @Swanand-Kadhe and @Hajar-Emami need for their notebook will have to be downloaded from a public place, correct?
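
For reference, a minimal sketch of how the notebook can fetch and try that one tested model from the Hugging Face Hub (the sample text and printed output are illustrative only):

```python
from huggingface_hub import hf_hub_download
import fasttext

# Download the only model the transform has been tested with so far.
model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# fastText expects single-line input; predict returns (labels, probabilities).
labels, probs = model.predict("GneissWeb is a large-scale pretraining dataset.")
print(labels[0], probs[0])  # e.g. __label__eng_Latn 0.99...
```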
