The SQuAD datasets are distributed under the CC BY-SA 4.0 license.
Run the following commands to download SQuAD:
python3 prepare_squad.py --version 1.1 # SQuAD 1.1
python3 prepare_squad.py --version 2.0 # SQuAD 2.0
For every dataset we support, we also provide a command-line toolkit for downloading it:
nlp_data prepare_squad --version 1.1
nlp_data prepare_squad --version 2.0
The directory structure of the SQuAD dataset will be as follows, where version
is either 1.1 or 2.0:
squad
├── train-v{version}.json
├── dev-v{version}.json
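Once the files are in place, they can be read with the standard library. The following is a minimal sketch, assuming the published SQuAD JSON layout (a top-level data list of articles, each holding paragraphs with a context and a list of qas). The helper names iter_squad_examples and load_squad are hypothetical and are not part of prepare_squad.py.

```python
import json


def iter_squad_examples(squad_dict):
    """Yield one flat record per question from a parsed SQuAD JSON object."""
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # SQuAD 2.0 flags unanswerable questions; 1.1 has no such key.
                    "is_impossible": qa.get("is_impossible", False),
                    # Each answer is a text span plus its character offset in context.
                    "answers": [(a["text"], a["answer_start"]) for a in qa["answers"]],
                }


def load_squad(path):
    """Read a file such as squad/dev-v2.0.json into a list of flat records."""
    with open(path, encoding="utf-8") as f:
        return list(iter_squad_examples(json.load(f)))
```

For example, load_squad("squad/dev-v1.1.json") would return one record per question, assuming the download above has completed.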
In accordance with the BSD-3-Clause license, we uploaded SearchQA to our S3 bucket and provide a link for downloading the processed txt files. Please check out the Google Drive link to download the raw and split files, which were collected through web search using the scraper from the GitHub repository.
Download the SearchQA dataset with the Python command or the command-line toolkit:
python3 prepare_searchqa.py
# Or download with command-line toolkits
nlp_data prepare_searchqa
The directory structure of the SearchQA dataset will be as follows:
searchqa
├── train.txt
├── val.txt
├── test.txt
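The processed txt files can be parsed line by line. The sketch below rests on an assumption that is not stated in this document: that each line has the form "snippets ||| question ||| answer", with every snippet wrapped in <s> ... </s>. Please verify this against the files you downloaded before relying on it. The function name parse_searchqa_line is hypothetical.

```python
def parse_searchqa_line(line, sep="|||"):
    """Split one line into (snippets, question, answer).

    ASSUMPTION: the line looks like
        <s> snippet 1 </s> <s> snippet 2 </s> ||| question ||| answer
    Adjust sep and the snippet markers if your files differ.
    """
    parts = [p.strip() for p in line.rstrip("\n").split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}")
    raw_snippets, question, answer = parts

    snippets = []
    for chunk in raw_snippets.split("</s>"):
        text = chunk.replace("<s>", "").strip()
        if text:  # skip the empty tail left after the last </s>
            snippets.append(text)
    return snippets, question, answer
```

Raising on a malformed line, rather than skipping it silently, makes it obvious early if the assumed format does not match the data.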
TriviaQA is an open-domain QA dataset. See the official GitHub repository for more useful scripts.
Run the following commands to download TriviaQA:
python3 prepare_triviaqa.py --version rc # Download TriviaQA version 1.0 for RC (2.5G)
python3 prepare_triviaqa.py --version unfiltered # Download unfiltered TriviaQA version 1.0 (604M)
# Or download with command-line toolkits
nlp_data prepare_triviaqa --version rc
nlp_data prepare_triviaqa --version unfiltered
The directory structure of the TriviaQA (rc and unfiltered) dataset will be as follows:
triviaqa
├── triviaqa-rc
    ├── qa
        ├── verified-web-dev.json
        ├── web-dev.json
        ├── web-train.json
        ├── web-test-without-answers.json
        ├── verified-wikipedia-dev.json
        ├── wikipedia-test-without-answers.json
        ├── wikipedia-dev.json
        ├── wikipedia-train.json
    ├── evidence
        ├── web
        ├── wikipedia
├── triviaqa-unfiltered
    ├── unfiltered-web-train.json
    ├── unfiltered-web-dev.json
    ├── unfiltered-web-test-without-answers.json
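The question files listed above are plain JSON. The following is a minimal sketch for extracting question and answer pairs, assuming the published TriviaQA layout (a top-level Data list whose items carry Question, QuestionId and, except in the test-without-answers files, an Answer object with Value and Aliases). The helper names are hypothetical and are not part of prepare_triviaqa.py.

```python
import json


def iter_triviaqa_questions(triviaqa_dict):
    """Yield a flat record per question from a parsed TriviaQA JSON object."""
    for item in triviaqa_dict["Data"]:
        # The *-test-without-answers.json files carry no "Answer" key.
        answer = item.get("Answer")
        yield {
            "id": item["QuestionId"],
            "question": item["Question"],
            "answer": answer["Value"] if answer else None,
            "aliases": answer.get("Aliases", []) if answer else [],
        }


def load_triviaqa(path):
    """Read a file such as triviaqa/triviaqa-rc/qa/web-dev.json."""
    with open(path, encoding="utf-8") as f:
        return list(iter_triviaqa_questions(json.load(f)))
```

The evidence documents under the evidence directory are not touched here; this sketch only reads the question files.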
HotpotQA is distributed under a CC BY-SA 4.0 license. We only provide download scripts, which are run with the following commands. Please check out the GitHub repository for details of preprocessing and evaluation.
python3 prepare_hotpotqa.py
# Or download with command-line toolkits
nlp_data prepare_hotpotqa
The directory structure of the HotpotQA dataset will be as follows:
hotpotqa
├── hotpot_train_v1.1.json
├── hotpot_dev_fullwiki_v1.json
├── hotpot_dev_distractor_v1.json
├── hotpot_test_fullwiki_v1.json
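Since preprocessing is left to the official repository, the sketch below only illustrates how the downloaded JSON might be inspected. It assumes the published HotpotQA layout: each file is a list of examples, where context is a list of [title, sentences] pairs and supporting_facts is a list of [title, sentence_index] pairs (absent in the test file). The helper names are hypothetical.

```python
import json


def supporting_sentences(example):
    """Return the context sentences referenced by supporting_facts."""
    # Map each paragraph title to its list of sentences.
    paragraphs = {title: sentences for title, sentences in example["context"]}
    result = []
    for title, sent_idx in example.get("supporting_facts", []):
        sentences = paragraphs.get(title, [])
        # Guard against indices that point past the end of a paragraph.
        if 0 <= sent_idx < len(sentences):
            result.append(sentences[sent_idx])
    return result


def load_hotpotqa(path):
    """Read a file such as hotpotqa/hotpot_dev_distractor_v1.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The bounds check is a defensive choice: it drops any supporting fact whose index does not exist in the paragraph instead of raising an IndexError.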