This page describes how to download and prepare the datasets used in GluonNLP.
Essentially, we provide scripts for downloading and preparing the datasets. The directory structure and the format of the processed datasets are well documented so that you are able to reuse the scripts with your own data (as long as the structure/format matches).
Thus, the typical workflow for running experiments is:
- Download and prepare data with the scripts in `datasets`.
- If the dataset needs further preprocessing, use the toolkits in `preprocess`.
- Run the experiments in `scripts`:
- Machine Translation
- Question Answering
- Language Modeling
- Music Generation
- Pretraining Corpus
- General NLP Benchmarks
We are very happy to receive and merge your contributions of new datasets 😃.
To add a new dataset, create a `prepare_{DATASET_NAME}.py` file in the appropriate folder. Also, remember to document in the README.md 1) the directory structure and 2) how to use the CLI tool for downloading and preprocessing. In addition, add citations in `prepare_{DATASET_NAME}.py` to credit the original authors.
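As a starting point, here is a minimal sketch of such a script for a hypothetical dataset called MyDataset; the file name (`prepare_mydataset.py`), the URLs, the `--save-path` argument, and the citation are all illustrative placeholders, not part of GluonNLP:

```python
"""Hypothetical prepare_mydataset.py -- a sketch of a new dataset script."""
import argparse
import os
import urllib.request

# Citation crediting the original dataset authors (placeholder BibTeX).
_CITATION = """\
@article{author2020mydataset,
  title  = {MyDataset: A Placeholder Example},
  author = {Author, A.},
  year   = {2020},
}
"""

# Download URL for each split (placeholders).
_URLS = {
    "train": "https://example.com/mydataset/train.json",
    "dev": "https://example.com/mydataset/dev.json",
}


def get_parser():
    """Build the CLI so the script can be run as `python3 prepare_mydataset.py`."""
    parser = argparse.ArgumentParser(description="Download and prepare MyDataset.")
    parser.add_argument("--save-path", type=str, default="mydataset",
                        help="Directory in which to store the prepared data.")
    return parser


def main(args):
    os.makedirs(args.save_path, exist_ok=True)
    for split, url in _URLS.items():
        out_path = os.path.join(args.save_path, os.path.basename(url))
        if not os.path.exists(out_path):  # skip files that are already present
            urllib.request.urlretrieve(url, out_path)


if __name__ == "__main__":
    main(get_parser().parse_args())
```

The real scripts additionally verify checksums and convert the raw files into the documented directory structure; refer to an existing `prepare_*.py` for the full pattern.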
Refer to the existing scripts or ask questions on GitHub if you need help.
Every URL is paired with a SHA1 checksum so that downloaded files can be verified against corruption. You can refer to the files in `url_checksums` for examples.
To generate the hash values of the data files, revise `update_download_stats.py` to include the new URLs and create the stats file that stores the hash keys. Then update the hash keys with the following command:

```bash
python3 update_download_stats.py
```
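The hash values themselves are plain SHA1 digests of the downloaded files, which can be computed with the Python standard library; a minimal sketch (the helper name `file_sha1` is ours, not GluonNLP's):

```python
import hashlib


def file_sha1(path):
    """Return the SHA1 hex digest of a file, read in 1 MB chunks."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    return sha1.hexdigest()
```

Reading in chunks keeps memory usage constant even for multi-gigabyte corpus files.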
- After installing GluonNLP, I cannot access the command line toolkits; running them reports `nlp_data: command not found`. The reason is that GluonNLP has been installed to a folder that is not in `PATH`, e.g., `~/.local/bin`. You can change the `PATH` variable to also include `~/.local/bin` via the following command:

```bash
export PATH=${PATH}:~/.local/bin
```