A toolkit for creating and processing the `pretokenized-dolma` and `pretokenized-paloma` datasets available on our HuggingFace org.
This repository contains tools and scripts for creating datasets derived from the Dolma and Paloma corpora, suitable for training and evaluating large language models. The toolkit provides functionality for downloading source data, preprocessing, sharding, and preparing evaluation datasets.
⚠️ Note: To run these instructions you need A LOT of storage, roughly on the order of 20TB.
These scripts are released for transparency; we encourage you to use our uploaded datasets directly rather than reproducing the work.
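
If you just want the data, it can be loaded straight from the Hub; a minimal sketch using `datasets` (the `<org>` below is a placeholder for our HuggingFace org, and exact repo names and configs may differ):

```python
# Sketch: consuming the released datasets directly instead of rebuilding them.
# "<org>" is a placeholder for the HuggingFace org hosting the datasets.
from datasets import load_dataset

dolma = load_dataset("<org>/pretokenized-dolma", streaming=True)  # large; streaming avoids a full download
paloma = load_dataset("<org>/pretokenized-paloma")
print(paloma)
```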
- Hugging Face account with API token
- Python environment with dependencies installed (see `pyproject.toml`)
- **Configure Environment**
  - Create a `.env` file in the root directory and add your Hugging Face token: `HF_TOKEN=your_token_here`
  - Run `poetry install` to install dependencies
  - Run `poetry shell` to launch the virtual environment
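
  Presumably the scripts pick the token up from this `.env` file. A minimal sketch of how that can look in Python, assuming `python-dotenv` and `huggingface_hub` are available (the actual scripts may handle authentication differently):

  ```python
  # Sketch: reading HF_TOKEN from the .env file and authenticating with the Hub.
  # The actual scripts may handle authentication differently.
  import os

  from dotenv import load_dotenv
  from huggingface_hub import login

  load_dotenv()                         # loads HF_TOKEN from .env into the environment
  login(token=os.environ["HF_TOKEN"])   # authenticates this session with HuggingFace
  ```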
- **Download Data**

  Run `./download_data.sh` to automatically download data from https://huggingface.co/datasets/allenai/dolma (this downloads on the order of 10TB of data).
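
  For reference, a very rough Python equivalent of this step using `huggingface_hub` (the actual script may fetch only specific files, and the target directory here is a placeholder; `./download_data.sh` is the supported path):

  ```python
  # Sketch: a rough Python equivalent of the download step.
  # download_data.sh may fetch only specific files; local_dir is a placeholder.
  from huggingface_hub import snapshot_download

  snapshot_download(
      repo_id="allenai/dolma",
      repo_type="dataset",
      local_dir="data/dolma",  # placeholder target directory (~10TB)
  )
  ```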
- **Create Dolma Dataset**

  To create the `pretokenized-dolma` dataset, run:

  `python create_dolma_dataset.py --idx ... --num_workers ...`

  The `--idx` argument is required and specifies which shard of the dataset to process; it must be an integer between 0 and 99. The `--num_workers` argument (the number of worker processes) is optional, but ideally should be roughly the number of available CPUs.
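
  Each invocation processes a single shard, so producing the full dataset means running all 100 indices. A minimal sketch of looping over them on one machine (in practice you would likely spread the indices across jobs or machines):

  ```python
  # Sketch: running all 100 shard indices sequentially with the documented CLI flags.
  import os
  import subprocess

  num_workers = os.cpu_count() or 1  # ideally roughly the number of available CPUs

  for idx in range(100):  # --idx must be an integer in [0, 99]
      subprocess.run(
          ["python", "create_dolma_dataset.py",
           "--idx", str(idx),
           "--num_workers", str(num_workers)],
          check=True,
      )
  ```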
- **Create Evaluation Batch**

  Open and run `create_paloma_dataset.ipynb` to generate the `pretokenized-paloma` and `pretokenized-paloma-tinsy` datasets.
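
  If you prefer not to run the notebook interactively, it can also be executed headlessly, e.g. via `jupyter nbconvert` (a generic approach, not something the repo prescribes):

  ```python
  # Sketch: executing the notebook non-interactively with jupyter nbconvert.
  # Opening the notebook and running it manually works just as well.
  import subprocess

  subprocess.run(
      ["jupyter", "nbconvert", "--to", "notebook", "--execute",
       "--output", "create_paloma_dataset.out.ipynb",
       "create_paloma_dataset.ipynb"],
      check=True,
  )
  ```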
- **Optional: Further shard Dolma Dataset**

  The original `create_dolma_dataset.py` script shards the dataset into 100 shards. We realized in hindsight that smaller shards would make the dataset easier to load, so you can open and run `finegrain_shard_dolma_dataset.ipynb` to further shard the preprocessed Dolma dataset into 10,000 shards. Note that this creates a second copy of the dataset; we ran the notebook to generate the 10,000 shards and then deleted the original 100-shard version.
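
  The notebook is the source of truth here; the underlying idea is splitting each of the 100 shards into 100 sub-shards (100 × 100 = 10,000). A rough sketch, assuming the preprocessed shards are stored in `datasets` format on disk (paths and naming are hypothetical):

  ```python
  # Sketch: splitting one preprocessed shard into 100 finer sub-shards.
  # Paths and naming are hypothetical; the notebook defines the actual layout.
  from datasets import load_from_disk

  SUBSHARDS_PER_SHARD = 100  # 100 original shards x 100 sub-shards = 10,000 total

  shard = load_from_disk("pretokenized-dolma/shard_00")  # one of the 100 original shards
  for i in range(SUBSHARDS_PER_SHARD):
      sub = shard.shard(num_shards=SUBSHARDS_PER_SHARD, index=i, contiguous=True)
      sub.save_to_disk(f"pretokenized-dolma-finegrained/shard_00_{i:02d}")
  ```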