Pretraining Corpus

We provide a set of shared scripts for downloading and preparing text corpora for pretraining NLP models. This helps create a unified text corpus for studying the performance of different pretraining algorithms. When picking datasets to support, we follow the FAIR principle: each dataset needs to be findable, accessible, interoperable, and reusable.

For all scripts, you can either use nlp_data SCRIPT_NAME or call the script directly.
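For example, the two invocations below should be equivalent (a sketch assuming each nlp_data subcommand shares its script's name, as in the Gutenberg example further down):

# Via the nlp_data entry point
nlp_data prepare_gutenberg --save_dir gutenberg

# By calling the script directly
python3 prepare_gutenberg.py --save_dir gutenberg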

Gutenberg BookCorpus

Unfortunately, we are unable to provide the Toronto BookCorpus dataset due to licensing issues.

There are some open-source efforts to reproduce the dataset, e.g., using soskek/bookcorpus or directly downloading the preprocessed version.

Thus, we use Project Gutenberg as an alternative to the Toronto BookCorpus.

You can use the following command to download and prepare the Gutenberg corpus.

python3 prepare_gutenberg.py --save_dir gutenberg

Also note that you must follow the Project Gutenberg license when using the data.
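To sanity-check the result, the minimal sketch below counts the prepared text files. It assumes the script writes one .txt file per book under the --save_dir directory; the actual output layout may differ.

# Hypothetical sanity check: count the prepared Gutenberg text files.
# Assumes prepare_gutenberg.py writes .txt files under the save directory.
import glob
files = glob.glob("gutenberg/**/*.txt", recursive=True)
print(f"Found {len(files)} text files")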

Wikipedia

We use the attardi/wikiextractor package to prepare the data.

# Download
python3 prepare_wikipedia.py --mode download --lang en --date latest -o ./

# Properly format the text files
python3 prepare_wikipedia.py --mode format -i [path-to-wiki.xml.bz2] -o ./

Downloading and formatting the dump is time consuming, so we also offer an alternative: downloading the prepared raw text file from our S3 bucket. This raw text file is in English, was dumped on 2020-06-20, and has been formatted by the process above (--lang en --date 20200620).

python3 prepare_wikipedia.py --mode download_prepared -o ./

OpenWebText

You can download OpenWebText from its release page (https://skylion007.github.io/OpenWebTextCorpus/). After downloading and extracting the archive (i.e., tar xf openwebtext.tar.xz), you can use the following command to preprocess the dataset.

python3 prepare_openwebtext.py --input openwebtext/ --output prepared_owt --shuffle

In this step, the archived .txt files are read directly, without decompressing the inner archives to disk. The documents from each archive are concatenated into a single .txt file with the same name as the archive, using double empty lines as the document separator.
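To consume the prepared files, the minimal sketch below splits one output shard back into documents. It assumes the double-empty-line separator described above; the shard name is hypothetical.

# Minimal sketch: read the documents back from one prepared shard.
# The file name is hypothetical; pick any .txt file under prepared_owt/.
with open("prepared_owt/example_shard.txt", encoding="utf-8") as f:
    raw = f.read()
# Two consecutive empty lines ("\n\n\n") separate documents, per the description above.
documents = [doc.strip() for doc in raw.split("\n\n\n") if doc.strip()]
print(f"Loaded {len(documents)} documents")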