Data Prep Kit

Data Prep Kit accelerates unstructured data preparation for LLM app developers. Developers can use Data Prep Kit to cleanse, transform, and enrich use case-specific unstructured data to pre-train LLMs, fine-tune LLMs, instruct-tune LLMs, or build retrieval augmented generation (RAG) applications for LLMs. Data Prep Kit can readily scale from a commodity laptop all the way to data center scale.

Features

The kit provides a growing set of modules/transforms targeting laptop-scale to datacenter-scale processing.
The data modalities supported today are: Natural Language and Code.
The modules are built on common frameworks for Python, Ray and Spark runtimes for scaling up data processing.
The kit provides a framework for developing custom transforms for processing parquet files.
The kit uses Kubeflow Pipelines-based workflow automation.

Installation

The latest version of the Data Prep Kit is available on PyPi for Python 3.10, 3.11 and 3.12. It can be installed using:

pip install  'data-prep-toolkit-transforms[all]'

This will install all available transforms.

For guidance on creating the virtual environment for installing the data prep kit, click here.

🚀 Getting Started

Fastest way to experience Data Prep Kit

With no setup necessary, let's use a Google Colab friendly notebook to try Data Prep Kit. This is a simple transform to extract content from PDF files: examples/notebooks/Run_your_first_transform_colab.ipynb | . (Here are some tips for running Data Prep Kit transforms on Google Colab. For this simple example, these tips are either already taken care of, or are not needed.) The same notebook can be downloaded and run on the local machine, without cloning the repo or any other setup.

Examples

Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for end to end real enterprise use cases like fine-tuning a model or building a RAG application.

We have a complete set of data processing recipes for such use cases.

We also have a developer tutorial for contributing a new transform to the kit.

For advanced users, here is more information for adding your own transform, running transforms from the command line, scaling and automation and more. Also, repository structure and use are discussed here.

Windows users

Please click here for guidance on how to run transforms in Windows.

Using HuggingFace data files

All the transforms in the kit include small sample data files for testing, but advanced users who want to download real data files from HuggingFace and use them in testing, can refer to this.

Supported data transforms

Click to expand for detailed list of transforms.

The matrix below shows the the combination of modules and supported runtimes. All the modules can be accessed here and can be combined to form data processing pipelines, as shown in the examples folder.

Modules	Python-only	Ray	Spark	KFP on Ray
Data Ingestion
Code (from zip) to Parquet	✅	✅		✅
PDF to Parquet	✅	✅		✅
HTML to Parquet	✅	✅		✅
Web to Parquet	✅
Universal (Code & Language)
Exact dedup filter	✅	✅		✅
Fuzzy dedup filter	✅	✅	✅	✅
Unique ID annotation	✅	✅	✅	✅
Filter on annotations	✅	✅	✅	✅
Profiler	✅	✅	✅	✅
Resize	✅	✅	✅	✅
Hate, Abuse, Profanity (HAP)	✅	✅		✅
Tokenizer	✅	✅		✅
Tokenization2Arrow	✅	✅
Repetition removal	✅	✅
Bloom filter	✅
Language-only
Language identification	✅	✅		✅
Document quality	✅	✅		✅
Document chunking for RAG	✅	✅		✅
Text encoder	✅	✅		✅
PII Annotator/Redactor	✅	✅		✅
Similarity	✅
GneissWeb classification	✅	✅
Readability scores	✅	✅
Extreme tokenized annotation	✅	✅
Code-only
Programming language annotation	✅	✅		✅
Code quality annotation	✅	✅		✅
Malware annotation	✅	✅		✅
Header cleanser	✅	✅		✅
Semantic file ordering		✅
License Select Annotation	✅	✅		✅
Code profiler	✅	✅

Contributing

Contributors are welcome to add new modules to expand to other data modalities as well as add runtime support for existing modules! Please read this for details.

Get help and support

Please feel free to connect with us using the discussion section.

MAINTAINERS

For a list of current maintainers, please see.

CHANGELOG

For the history of releases and changes, please see.

Resources

Papers, talks, presentations and tutorials

Granite open source LLM models

GneissWeb

Citation

If you use Data Prep Kit in your research, please cite our paper:

@misc{wood2024dataprepkitgettingdataready,
      title={Data-Prep-Kit: getting your data ready for LLM application development}, 
      author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh 
      and Constantin Adam and Abdulhamid Adebayo and Sungeun An and Yuan Chi Chang 
      and Xuan-Hong Dang and Nirmit Desai and Michele Dolfi and Hajar Emami-Gohari 
      and Revital Eres and Takuya Goto and Dhiraj Joshi and Yan Koyfman 
      and Mohammad Nassar and Hima Patel and Paramesvaran Selvam and Yousaf Shah  
      and Saptha Surendran and Daiki Tsuzuku and Petros Zerfos and Shahrokh Daijavad},
      year={2024},
      eprint={2409.18164},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2409.18164}, 
}

License

All source files must include a Copyright and License header. If you would like to see the detailed LICENSE click here.

LF AI & Data

Data Prep Kit is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the Data for AI Models team at IBM Research.

Name		Name	Last commit message	Last commit date
Latest commit History 5,155 Commits
.github		.github
data-connector-lib		data-connector-lib
data-processing-lib		data-processing-lib
doc		doc
examples		examples
kfp		kfp
scripts		scripts
tools		tools
transforms		transforms
.black.toml		.black.toml
.gitignore		.gitignore
.isort.cfg		.isort.cfg
.make.defaults		.make.defaults
.make.subdirs		.make.subdirs
.make.versions		.make.versions
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierignore		.prettierignore
.prettierrc.yaml		.prettierrc.yaml
.secrets.baseline		.secrets.baseline
ADVANCED.md		ADVANCED.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
GOVERNANCE.md		GOVERNANCE.md
LICENSE		LICENSE
MAINTAINERS.md		MAINTAINERS.md
Makefile		Makefile
README.md		README.md
RELEASE.md		RELEASE.md
mkdocs.yml		mkdocs.yml
release-notes.md		release-notes.md
resources.md		resources.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Prep Kit

Features

Installation

🚀 Getting Started

Fastest way to experience Data Prep Kit

Examples

Windows users

Using HuggingFace data files

Supported data transforms

Contributing

Get help and support

MAINTAINERS

CHANGELOG

Resources

Citation

License

LF AI & Data

IBM ❤️ Open Source AI

About

Releases

Packages

Languages

License

revit13/data-prep-lab

Folders and files

Latest commit

History

Repository files navigation

Data Prep Kit

Features

Installation

🚀 Getting Started

Fastest way to experience Data Prep Kit

Examples

Windows users

Using HuggingFace data files

Supported data transforms

Contributing

Get help and support

MAINTAINERS

CHANGELOG

Resources

Citation

License

LF AI & Data

IBM ❤️ Open Source AI

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages