Curates a mix of datasets for LLM training. Based on the Dolma library (see https://arxiv.org/abs/2402.00159).
This script lets you download a subset of a Hugging Face dataset for a specified language (e.g., English), filtering records as they are loaded (a sketch of this pattern follows the installation instructions below). It uses the Dolma library to process the Wikipedia data. The following dependencies are required:
- wikiextractor
- requests
- smart_open
- tqdm
- dolma
- datasets
You can install these dependencies using pip:

```bash
pip install -r requirements.txt
```
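As background to the description above, filtering a Hugging Face dataset by language during loading can be sketched with the `datasets` library. The dataset name and the `language` field below are placeholders for illustration, not arguments used by this script:

```python
# Minimal sketch: stream a Hugging Face dataset and keep only English
# records. "some/dataset" and the "language" field are hypothetical;
# substitute the dataset and metadata field you actually use.
from datasets import load_dataset

ds = load_dataset("some/dataset", split="train", streaming=True)

# Filter lazily while streaming, so non-matching records are never materialized.
english = ds.filter(lambda ex: ex.get("language") == "en")

for i, example in enumerate(english):
    print(example)
    if i == 4:  # peek at the first five matching records
        break
```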
First, you need to extract Wikipedia data and create a Wikipedia mix using Dolma:

```bash
python make_wikipedia.py \
    --output data/wikipedia \
    --date {latest date in YYYYMMDD format} \
    --lang simple \
    --processes 16 \
    --overwrite
```
Then, use the Dolma CLI to mix the Wikipedia data:

```bash
dolma -c config/wikipedia-mix.yaml mix --processes 16
```
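The contents of `config/wikipedia-mix.yaml` are not reproduced here. As a rough sketch, a Dolma mixer config declares a list of streams, each mapping input document globs to an output path; the stream name, paths, and size limit below are assumptions for illustration, so check the Dolma documentation for the exact schema:

```yaml
# Illustrative sketch only; the stream name, paths, and size limit
# are assumptions, not the repository's actual config.
streams:
  - name: wikipedia
    documents:
      - data/wikipedia/*.gz          # shards produced by make_wikipedia.py
    output:
      path: data/wikipedia-mix       # where mixed shards are written
      max_size_in_bytes: 1000000000  # split output files at ~1 GB
```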
You can subsample by passing a probability p between 0 and 1, or oversample by passing a value greater than 1; for oversampling, the script uses an effective probability of oversampled_p = 1/p. Run it as follows:

```bash
python sampling.py -s 'data/dir/location1/*.gz' 'data/dir/location2/*.gz' -p 1.65 -d data/mixed -n 16
```
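For intuition, here is a hypothetical sketch of per-document sub/oversampling over gzipped shards. It uses a common scheme in which the expected copy count of each document is exactly p; it is not necessarily the exact logic inside sampling.py (which, per the note above, works with 1/p internally):

```python
# Hypothetical sketch of sub/oversampling; not the actual sampling.py logic.
import glob
import gzip
import random

def sample_lines(pattern: str, p: float):
    """Yield lines from gzipped files, subsampled (p < 1) or oversampled (p > 1)."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if p <= 1.0:
                    # Subsample: keep each document with probability p.
                    if random.random() < p:
                        yield line
                else:
                    # Oversample: emit floor(p) copies, plus one extra copy
                    # with probability equal to the fractional part of p,
                    # so the expected copy count is exactly p.
                    copies = int(p) + (1 if random.random() < p - int(p) else 0)
                    for _ in range(copies):
                        yield line

# Example: oversample everything under data/dir/location1 by ~1.65x.
for doc in sample_lines("data/dir/location1/*.gz", 1.65):
    pass  # write each emitted document to the output directory, e.g. data/mixed
```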
For private datasets, you may need to log in via the Hugging Face CLI before downloading.
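For example, using the standard CLI command:

```bash
huggingface-cli login
```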