A toolkit for creating and processing the `pretokenized-dolma` and `pretokenized-paloma` datasets available on our HuggingFace org.
This repository contains tools and scripts for creating datasets derived from the Dolma and Paloma corpora, suitable for training and evaluating large language models. The toolkit provides functionality for downloading source data, preprocessing, sharding, and preparing evaluation datasets.
⚠️ Note: To run these instructions you need A LOT of storage, roughly on the order of 20TB.
These scripts are released for transparency; we encourage you to use our uploaded datasets directly rather than reproducing the work.
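
If you just want the data, it can be loaded straight from the Hub; a minimal sketch using `datasets` (the `<org>` below is a placeholder for our HuggingFace org, and exact repo names and configs may differ):

```python
# Sketch: consuming the released datasets directly instead of rebuilding them.
# "<org>" is a placeholder for the HuggingFace org hosting the datasets.
from datasets import load_dataset

dolma = load_dataset("<org>/pretokenized-dolma", streaming=True)  # large; streaming avoids a full download
paloma = load_dataset("<org>/pretokenized-paloma")
print(paloma)
```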
- Hugging Face account with API token
- Python environment with dependencies installed (see `pyproject.toml`)
- **Configure Environment**
  - Create a `.env` file in the root directory and add your Hugging Face token: `HF_TOKEN=your_token_here`
  - Run `poetry install` to install dependencies
  - Run `poetry shell` to launch the virtual environment
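
  Presumably the scripts pick the token up from this `.env` file. A minimal sketch of how that can look in Python, assuming `python-dotenv` and `huggingface_hub` are available (the actual scripts may handle authentication differently):

  ```python
  # Sketch: reading HF_TOKEN from the .env file and authenticating with the Hub.
  # The actual scripts may handle authentication differently.
  import os

  from dotenv import load_dotenv
  from huggingface_hub import login

  load_dotenv()                         # loads HF_TOKEN from .env into the environment
  login(token=os.environ["HF_TOKEN"])   # authenticates this session with HuggingFace
  ```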
- **Download Data**

  Run `./download_data.sh` to automatically download data from https://huggingface.co/datasets/allenai/dolma (this downloads on the order of 10TB of data).
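
  For reference, a very rough Python equivalent of this step using `huggingface_hub` (the actual script may fetch only specific files, and the target directory here is a placeholder; `./download_data.sh` is the supported path):

  ```python
  # Sketch: a rough Python equivalent of the download step.
  # download_data.sh may fetch only specific files; local_dir is a placeholder.
  from huggingface_hub import snapshot_download

  snapshot_download(
      repo_id="allenai/dolma",
      repo_type="dataset",
      local_dir="data/dolma",  # placeholder target directory (~10TB)
  )
  ```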
- **Create Dolma Dataset**

  To create the `pretokenized-dolma` dataset, run:

  `python create_dolma_dataset.py --idx ... --num_workers ...`

  The `--idx` argument is required and specifies which shard of the dataset to process; it must be an integer between 0 and 99. The `--num_workers` argument (the number of worker processes) is optional, but ideally should be roughly the number of available CPUs.
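
  Each invocation processes a single shard, so producing the full dataset means running all 100 indices. A minimal sketch of looping over them on one machine (in practice you would likely spread the indices across jobs or machines):

  ```python
  # Sketch: running all 100 shard indices sequentially with the documented CLI flags.
  import os
  import subprocess

  num_workers = os.cpu_count() or 1  # ideally roughly the number of available CPUs

  for idx in range(100):  # --idx must be an integer in [0, 99]
      subprocess.run(
          ["python", "create_dolma_dataset.py",
           "--idx", str(idx),
           "--num_workers", str(num_workers)],
          check=True,
      )
  ```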
- **Create Evaluation Batch**

  Open and run `create_paloma_dataset.ipynb` to generate the `pretokenized-paloma` and `pretokenized-paloma-tinsy` datasets.
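
  If you prefer not to run the notebook interactively, it can also be executed headlessly, e.g. via `jupyter nbconvert` (a generic approach, not something the repo prescribes):

  ```python
  # Sketch: executing the notebook non-interactively with jupyter nbconvert.
  # Opening the notebook and running it manually works just as well.
  import subprocess

  subprocess.run(
      ["jupyter", "nbconvert", "--to", "notebook", "--execute",
       "--output", "create_paloma_dataset.out.ipynb",
       "create_paloma_dataset.ipynb"],
      check=True,
  )
  ```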
- **Optional: Further shard Dolma Dataset**

  The original `create_dolma_dataset.py` script shards the dataset into 100 shards. We realized in hindsight that smaller shards would make the dataset easier to load, so you can open and run `finegrain_shard_dolma_dataset.ipynb` to further shard the preprocessed Dolma dataset into 10,000 shards. Note that this creates a second copy of the dataset; we ran the notebook to generate the 10,000 shards and then deleted the original 100-shard version.
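
  The notebook is the source of truth here; the underlying idea is splitting each of the 100 shards into 100 sub-shards (100 × 100 = 10,000). A rough sketch, assuming the preprocessed shards are stored in `datasets` format on disk (paths and naming are hypothetical):

  ```python
  # Sketch: splitting one preprocessed shard into 100 finer sub-shards.
  # Paths and naming are hypothetical; the notebook defines the actual layout.
  from datasets import load_from_disk

  SUBSHARDS_PER_SHARD = 100  # 100 original shards x 100 sub-shards = 10,000 total

  shard = load_from_disk("pretokenized-dolma/shard_00")  # one of the 100 original shards
  for i in range(SUBSHARDS_PER_SHARD):
      sub = shard.shard(num_shards=SUBSHARDS_PER_SHARD, index=i, contiguous=True)
      sub.save_to_disk(f"pretokenized-dolma-finegrained/shard_00_{i:02d}")
  ```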