Commit f9bcddb

change package name to data_selection
1 parent bc38bbd · commit f9bcddb

14 files changed: +164 −43 lines

README.md

+53 −22
@@ -1,6 +1,57 @@
 # Data Selection for Language Models via Importance Resampling (DSIR)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![arXiv](https://img.shields.io/badge/arXiv-2302.03169-00ff00.svg)](https://arxiv.org/abs/2302.03169)
 
-This repository contains pre-filtered datasets and code for selecting relevant language model training data from The Pile.
+This repository contains the [DSIR](https://arxiv.org/abs/2302.03169) data selection tool for selecting relevant language model training data from any raw data source given a target dataset, as well as pre-filtered datasets and some pretrained models.
+
+DSIR is built for:
+- fast, large-scale (trillion-token scale) data selection from large raw text datasets (Pile, RefinedWeb, RedPajama, ...)
+- selecting data that is distributed like a given target dataset (domain-specific data, Wikipedia, ...). Relevance and diversity are balanced automatically.
+
+Compute needed:
+- 1 CPU node
+- a large amount of RAM (at least a few hundred GB)
+- a high number of cores (parallelism at the file level; for best performance, use as many CPU cores as data files)
+
+![DSIR figure](fig1.png)
+
+Code related to the DSIR paper's experiments is in the `experimental/` directory.
+
+## Quickstart
+
+Install from pip:
+```
+pip install data-selection
+```
+
+Install from source by cloning this repo and installing via pip:
+```
+git clone git@github.com:/p-lambda/dsir
+pip install ./dsir
+```
+
+To select data, initialize a `HashedNgramDSIR` object and call the following functions:
+```
+from data_selection import HashedNgramDSIR
+
+raw_datasets = [<list of paths>]
+target_datasets = [<list of paths>]
+
+dsir = HashedNgramDSIR(raw_datasets, num_proc=30)
+dsir.fit_importance_estimator(target_datasets)
+dsir.compute_importance_weights()
+dsir.resample(out_dir='resampled', num_to_sample=1000000, cache_dir='/scr/resampled_cache')
+```
+This will save 1M examples in `jsonl` files inside an output directory named `resampled`. The files will first be written to `cache_dir` and moved to `out_dir` upon completion.
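As a quick sanity check of that output, the resampled files can be read with the standard library. This is only a sketch: the shard file names and the JSON field names inside `resampled` are not spelled out in the README, so they are treated as unknowns here.
```
import json
from pathlib import Path

out_dir = Path('resampled')  # directory written by dsir.resample in the example above

examples = []
for shard in sorted(out_dir.glob('*.jsonl')):  # shard naming is an assumption
    with open(shard) as f:
        for line in f:
            examples.append(json.loads(line))  # one JSON example per line

print(len(examples))             # should total num_to_sample (1M in the example)
print(list(examples[0].keys()))  # inspect which fields were actually written
```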
+
+The `dsir` intermediate results (after `fit_importance_estimator` and `compute_importance_weights`) can be saved and loaded for later use, for example to resample a different number of examples:
+```
+dsir.save('dsir_params')
+
+# later on
+dsir.load('dsir_params')
+dsir.resample(out_dir='resampled', num_to_sample=10000000, cache_dir='/scr/resampled_cache')
+```
 
 ## Pre-filtered datasets
 Note: previous versions of the datasets had a small validation and test split (50000 examples each), but we concatenated these onto the end of the train set (in the order validation, then test) to better align with the paper. The datasets should be further shuffled during preprocessing before training.
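A minimal sketch of that shuffling step, assuming the pre-filtered data has been downloaded to local `jsonl` shards (the file paths below are placeholders, not actual dataset locations):
```
from datasets import load_dataset

# Hypothetical local shard paths; substitute the real downloaded files.
ds = load_dataset('json', data_files={'train': ['dsir_filtered/part-00.jsonl',
                                                'dsir_filtered/part-01.jsonl']})['train']

# The old validation/test examples sit at the end of train, so shuffle before training.
ds = ds.shuffle(seed=42)
```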
@@ -52,32 +103,12 @@ In the table below, `{dataset}` can be replaced with one of `{ag, amazon, citati
 | heuristiccls-roberta-continuedpretrain-{dataset} | Link format: `https://huggingface.co/sangmichaelxie/dsir-roberta-continuedpretrain-{dataset}` | 6.4B tokens (25M examples) | 256 | 25000 | roberta-base | roberta-base | RoBERTa model with continued pretraining on data selected by heuristic classification with target={dataset} |
 | randomselect-roberta-continuedpretrain | [Link](https://huggingface.co/sangmichaelxie/randomselect-roberta-continuedpretrain) | 6.4B tokens (25M examples) | 256 | 25000 | roberta-base | roberta-base | RoBERTa model with continued pretraining on random subset of The Pile |
 
-## Code for data selection
-
-To select your own subset of The Pile, all you need is a small set of target examples representing the kind of data you want to select.
-This target dataset should be in jsonl format -- it can also be a dataset from HuggingFace Datasets. Note that our current workflow requires about 2TB of storage space --- we're working on reducing this! All the code should be run from the outer `dsir` directory.
-1. Create a virtualenv using `requirements.txt`: `virtualenv .venv; source .venv/bin/activate; pip install -r requirements.txt`
-2. Download The Pile to `PILE_PATH` and change the corresponding variables in `config.sh`.
-3. Run preprocessing on The Pile: Run `bash preprocessing/run_slurm.sh`. You can also run `bash preprocessing/run.sh` directly using the arguments in `preprocessing/run_slurm.sh`. This only needs to be run once.
-4. Precompute quality filter stats: Run `bash preprocessing/quality_scores/run_slurm_quality_stats.sh`. After this, run `bash preprocessing/quality_scores/run_merge_quality_scores.sh`. This only needs to be run once. (We're working on streamlining steps 3 and 4. Stay tuned!)
-5. Run DSIR: For an example, run `bash data_selection/run_cmds.sh`. For new target datasets, some information about which fields in the dataset to use should be placed in the `dsname_to_args` dictionary at the top of the `data_selection/dsir_pipeline.py` file. If you wish to retrieve from custom subsets of the Pile (for example, only select data from one chunk of the Pile), you will need to tweak one part of the code, in the main part of the script (an example is provided of how to do so as a comment). Many of the steps in DSIR can be cached and will only run the first time. For example, resampling a different number of examples with the same target dataset uses cached importance weights.
-
-## Code for pretraining and GLUE evaluation
-
-We provide scripts for training BERT-style masked language models on the selected data and evaluating it on GLUE in the `train` and `glue_eval` directories, respectively. All code should be run from the outer `dsir` directory.
-1. Install further dependencies using `train/requirements.txt`: `pip install -r train/requirements.txt`
-2. Change the `PRETRAIN_OUTPUT_DIR` variable in `config.sh`.
-3. Write a job command in `train/run_slurm.sh`. An example command in this file. You will need to change the path to the training data. If you want to skip preprocessing (if it's already done), set the first of two boolean variables to `false`. By setting both to `true`, there will be two jobs launched: one for preprocessing and one for pretraining. The pretraining job should take about 50 hours on 4 RTX 3090 GPUs. Kick off the jobs by running `bash train/run_slurm.sh`.
-4. Evaluate the trained model by editing the evaluation job command in `glue_eval/run_eval_exps.sh` with the path to the model checkpoint. This script runs 5 seeds for each GLUE dataset. The results and finetuned models will be saved a new `finetune_runs` directory inside the pretrained model checkpoint directory. Kick off the jobs by running `bash glue_exps/run_eval_exps.sh`.
-5. Read the GLUE results by running `python read_glue_results.py --results_dir </path/to/checkpoint>/finetune_runs` in the `glue_eval` directory.
-
-
 ## Citation Information
 Paper: <https://arxiv.org/abs/2302.03169>
 ```
 @article{xie2023data,
 author = {Sang Michael Xie and Shibani Santurkar and Tengyu Ma and Percy Liang},
-journal = {arXiv preprint arXiv:2302.03169},
+journal = {Advances in Neural Information Processing Systems (NeurIPS)},
 title = {Data Selection for Language Models via Importance Resampling},
 year = {2023},
 }

data_selection/__init__.py

+2
@@ -0,0 +1,2 @@
+from .base import DSIR
+from .hashed_ngram_dsir import HashedNgramDSIR
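With the rename in this commit, downstream imports move from the old `dsir.*` module paths to `data_selection`. A minimal before/after sketch (the old path is taken from the updated tests further down):
```
# Before this commit (old package name):
# from dsir.hashed_ngram_dsir import HashedNgramDSIR

# After this commit; the new __init__.py above also re-exports both classes:
from data_selection import DSIR, HashedNgramDSIR
```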

dsir/base.py → data_selection/base.py

+1 −1
@@ -9,7 +9,7 @@
 from datasets import load_dataset
 from tqdm import tqdm
 
-from dsir.utils import parallelize
+from data_selection.utils import parallelize
 
 
 def default_load_dataset_fn(path: str) -> Iterable[Dict]:

dsir/hashed_ngram_dsir.py → data_selection/hashed_ngram_dsir.py

+2 −2
@@ -10,8 +10,8 @@
 from nltk import ngrams as get_ngrams
 import numpy as np
 
-from dsir.base import DSIR, default_load_dataset_fn, default_parse_example_fn
-from dsir.utils import parallelize
+from data_selection.base import DSIR, default_load_dataset_fn, default_parse_example_fn
+from data_selection.utils import parallelize
 
 
 wpt = WordPunctTokenizer()
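The diff here only touches import paths, but as context for what `HashedNgramDSIR` computes, below is a minimal, illustrative sketch of hashed n-gram featurization. The bucket count, n-gram range, and the use of Python's built-in `hash` are assumptions for illustration, not the module's actual implementation.
```
import numpy as np
from nltk import ngrams as get_ngrams
from nltk.tokenize import WordPunctTokenizer

wpt = WordPunctTokenizer()

def hashed_ngram_features(text: str, num_buckets: int = 10000, max_n: int = 2) -> np.ndarray:
    """Count unigrams through max_n-grams, hashed into a fixed number of buckets."""
    counts = np.zeros(num_buckets, dtype=np.int64)
    tokens = wpt.tokenize(text.lower())
    for n in range(1, max_n + 1):
        for gram in get_ngrams(tokens, n):
            # Built-in hash is process-salted; a real implementation would use a stable hash.
            counts[hash(' '.join(gram)) % num_buckets] += 1
    return counts

# Importance weights would then come from bag-of-ngrams models fit on these
# bucket counts for the target data vs. the raw data.
features = hashed_ngram_features("Data selection via importance resampling.")
print(features.sum())
```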
File renamed without changes.

dsir/__init__.py

Whitespace-only changes.

experimental/README.md

+33
@@ -0,0 +1,33 @@
+# Code for the DSIR paper
+This directory has the code for preprocessing, data selection, pretraining, and fine-tuning for the experiments in the DSIR paper. Pre-filtered datasets and pre-trained models from the paper are linked in the README at the outer directory.
+
+## Code for data selection
+
+To select your own subset of The Pile, all you need is a small set of target examples representing the kind of data you want to select.
+This target dataset should be in jsonl format -- it can also be a dataset from HuggingFace Datasets. Note that our current workflow requires about 2TB of storage space --- we're working on reducing this! All the code should be run from the `experimental/` directory.
+1. Create a virtualenv using `requirements.txt`: `virtualenv .venv; source .venv/bin/activate; pip install -r requirements.txt`
+2. Download The Pile to `PILE_PATH` and change the corresponding variables in `config.sh`.
+3. Run preprocessing on The Pile: Run `bash preprocessing/run_slurm.sh`. You can also run `bash preprocessing/run.sh` directly using the arguments in `preprocessing/run_slurm.sh`. This only needs to be run once.
+4. Precompute quality filter stats: Run `bash preprocessing/quality_scores/run_slurm_quality_stats.sh`. After this, run `bash preprocessing/quality_scores/run_merge_quality_scores.sh`. This only needs to be run once. (We're working on streamlining steps 3 and 4. Stay tuned!)
+5. Run DSIR: For an example, run `bash data_selection/run_cmds.sh`. For new target datasets, some information about which fields in the dataset to use should be placed in the `dsname_to_args` dictionary at the top of the `data_selection/dsir_pipeline.py` file. If you wish to retrieve from custom subsets of the Pile (for example, only select data from one chunk of the Pile), you will need to tweak one part of the code, in the main part of the script (an example of how to do so is provided as a comment). Many of the steps in DSIR can be cached and will only run the first time. For example, resampling a different number of examples with the same target dataset uses cached importance weights.
+
+## Code for pretraining and GLUE evaluation
+
+We provide scripts for training BERT-style masked language models on the selected data and evaluating them on GLUE in the `train` and `glue_eval` directories, respectively. All code should be run from the `experimental/` directory.
+1. Install further dependencies using `train/requirements.txt`: `pip install -r train/requirements.txt`
+2. Change the `PRETRAIN_OUTPUT_DIR` variable in `config.sh`.
+3. Write a job command in `train/run_slurm.sh`. An example command is in this file. You will need to change the path to the training data. If you want to skip preprocessing (if it's already done), set the first of the two boolean variables to `false`. By setting both to `true`, two jobs will be launched: one for preprocessing and one for pretraining. The pretraining job should take about 50 hours on 4 RTX 3090 GPUs. Kick off the jobs by running `bash train/run_slurm.sh`.
+4. Evaluate the trained model by editing the evaluation job command in `glue_eval/run_eval_exps.sh` with the path to the model checkpoint. This script runs 5 seeds for each GLUE dataset. The results and finetuned models will be saved in a new `finetune_runs` directory inside the pretrained model checkpoint directory. Kick off the jobs by running `bash glue_exps/run_eval_exps.sh`.
+5. Read the GLUE results by running `python read_glue_results.py --results_dir </path/to/checkpoint>/finetune_runs` in the `glue_eval` directory.
+
+## Citation Information
+Paper: <https://arxiv.org/abs/2302.03169>
+```
+@article{xie2023data,
+author = {Sang Michael Xie and Shibani Santurkar and Tengyu Ma and Percy Liang},
+journal = {arXiv preprint arXiv:2302.03169},
+title = {Data Selection for Language Models via Importance Resampling},
+year = {2023},
+}
+```
+
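Since the target dataset above is expected in jsonl format (one JSON object per line), here is a tiny illustrative sketch of writing one. The `text` field name and the example contents are placeholders; the pipeline reads whichever fields `dsname_to_args` in `data_selection/dsir_pipeline.py` specifies for a given dataset.
```
import json

# Two toy target examples; field name "text" is a placeholder.
target_examples = [
    {"text": "Patients with type 2 diabetes were randomized to ..."},
    {"text": "We prove a bound on the generalization error of ..."},
]

with open("my_target.jsonl", "w") as f:
    for ex in target_examples:
        f.write(json.dumps(ex) + "\n")
```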

experimental/config.sh

+13
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+CACHE='/path/to/cachedir'
+ROOT_DIR='/path/to/dsir/experimental'
+VIRTUAL_ENV='/path/to/.env'
+PILE_PATH='/path/to/pile'
+DSIR_OUTPUT_DIR='/path/to/outputdir'
+PRETRAIN_OUTPUT_DIR='/path/to/model_outputdir'
+WORD_VECTORS_PATH='/path/to/pretrained_fasttext_wordvecs.vec'
+# Slurm
+cluster_info='--partition <PARTITION_NAME>'
+
+source ${VIRTUAL_ENV}/bin/activate
File renamed without changes.

fig1.png

300 KB

pyproject.toml

+34
@@ -0,0 +1,34 @@
+[build-system]
+requires = ["setuptools>=61.0.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "data-selection"
+version = "0.0.1"
+authors = [
+    { name="Sang Michael Xie", email="[email protected]" },
+]
+description = "Data Selection with Importance Resampling"
+readme = "README.md"
+requires-python = ">=3.6"
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+]
+
+license = { file = "LICENSE" }
+keywords = ["data selection", "importance resampling", "dsir", "nlp", "language models"]
+dependencies = [
+    'numpy>=1.21.6',
+    'datasets>=2.13.2',
+    'tqdm>=4.62.3',
+    'joblib>=1.1.0',
+    'nltk>=3.7',
+]
+
+[project.optional-dependencies]
+dev = ["pytest"]
+
+[project.urls]
+"Homepage" = "https://github.com/p-lambda/dsir"

setup.py

+16 −15
@@ -1,17 +1,18 @@
 from setuptools import setup, find_packages
 
-setup(name='dsir',
-      version='0.0.1',
-      description='Data Selection with Importance Resampling',
-      url='https://github.com/p-lambda/dsir',
-      author='Sang Michael Xie',
-      author_email='[email protected]',
-      packages=find_packages('.'),
-      install_requires=[
-          'numpy>=1.21.6',
-          'datasets>=2.13.2',
-          'tqdm>=4.62.3',
-          'joblib>=1.1.0',
-          'nltk>=3.7',
-      ]
-      )
+if __name__ == "__main__":
+    setup(name='data-selection',
+          version='0.0.1',
+          description='Data Selection with Importance Resampling',
+          url='https://github.com/p-lambda/dsir',
+          author='Sang Michael Xie',
+          author_email='[email protected]',
+          packages=find_packages('.'),
+          install_requires=[
+              'numpy>=1.21.6',
+              'datasets>=2.13.2',
+              'tqdm>=4.62.3',
+              'joblib>=1.1.0',
+              'nltk>=3.7',
+          ]
+          )

tests/test_hashed_ngram.py

+9 −2
@@ -4,7 +4,7 @@
 import json
 import shutil
 
-from dsir.hashed_ngram_dsir import HashedNgramDSIR, hash_buckets, get_ngram_counts
+from data_selection.hashed_ngram_dsir import HashedNgramDSIR, hash_buckets, get_ngram_counts
 
 
 toy_dataset = Path(__file__).parent / "toy_pile_data.jsonl"
@@ -147,4 +147,11 @@ def test_save_load(dsir_obj):
     assert dsir_obj_2.ngrams == dsir_obj.ngrams
 
 if __name__ == "__main__":
-    test_get_ngram_counts()
+    dsir = HashedNgramDSIR(
+        raw_datasets,
+        parse_example_fn=parse_example_fn,
+        num_proc=2,
+        ngrams=2,
+        num_buckets=10000)
+
+    test_resample(dsir)

tests/test_utils.py

+1 −1
@@ -1,4 +1,4 @@
-from dsir import utils
+from data_selection import utils
 
 
 def job(arg):

0 commit comments