Question Answering

SQuAD

The SQuAD dataset is distributed under the CC BY-SA 4.0 license.

Run one of the following commands to download SQuAD:

python3 prepare_squad.py --version 1.1 # SQuAD 1.1
python3 prepare_squad.py --version 2.0 # SQuAD 2.0

For every dataset we support, we also provide a command-line toolkit for downloading it:

nlp_data prepare_squad --version 1.1
nlp_data prepare_squad --version 2.0

The directory structure of the SQuAD dataset will be as follows, where version is 1.1 or 2.0:

squad
├── train-v{version}.json
├── dev-v{version}.json
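
Once the files are in place, they can be read with the standard library alone. The sketch below flattens the nested SQuAD JSON (articles, then paragraphs, then questions) into one record per question. The function names `flatten_squad` and `load_squad` are hypothetical helpers for illustration and are not part of `prepare_squad.py`; the field names follow the published SQuAD JSON layout.

```python
import json


def flatten_squad(dataset):
    """Yield one flat record per question from a parsed SQuAD JSON object."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # Each answer is a dict with "text" and "answer_start".
                    "answers": [a["text"] for a in qa["answers"]],
                    # Only SQuAD 2.0 marks unanswerable questions,
                    # so default to False for version 1.1 files.
                    "is_impossible": qa.get("is_impossible", False),
                }


def load_squad(path):
    """Read a SQuAD JSON file (e.g. squad/dev-v1.1.json) into a list of records."""
    with open(path, encoding="utf-8") as f:
        return list(flatten_squad(json.load(f)))
```

For example, `load_squad("squad/dev-v2.0.json")` would return a list of dictionaries, assuming the download step above has already been run in the current directory.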

SearchQA

In accordance with the BSD-3-Clause License, we uploaded SearchQA to our S3 bucket and provide a link to download the processed txt files. Please check out the Google Drive link to download the raw and split files, which were collected through web search using the scraper from the GitHub repository.

Download the SearchQA dataset with the Python command or the command-line toolkit:

python3 prepare_searchqa.py

# Or download with command-line toolkits
nlp_data prepare_searchqa

The directory structure of the SearchQA dataset will be as follows:

searchqa
├── train.txt
├── val.txt
├── test.txt
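
The processed txt files can then be read line by line. The sketch below rests on two assumptions that this document does not state: that each line holds three fields (search snippets, question, answer) separated by `|||`, and that each snippet is wrapped in `<s> ... </s>` markers. Inspect a few lines of `train.txt` before relying on it. `parse_searchqa_line` and `iter_searchqa` are hypothetical helper names.

```python
def parse_searchqa_line(line, delimiter="|||"):
    """Split one processed line into (snippets, question, answer).

    ASSUMPTION: fields are ordered snippets, question, answer and are
    separated by `delimiter`; adjust if the files use a different layout.
    """
    parts = [p.strip() for p in line.rstrip("\n").split(delimiter)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}")
    context, question, answer = parts
    # ASSUMPTION: each snippet is wrapped in <s> ... </s> markers.
    snippets = [
        s.replace("</s>", "").strip()
        for s in context.split("<s>")
        if s.strip()
    ]
    return snippets, question, answer


def iter_searchqa(path):
    """Yield parsed examples from a file such as searchqa/train.txt."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield parse_searchqa_line(line)
```

Raising `ValueError` on a malformed line, rather than skipping it silently, makes a wrong delimiter assumption show up on the first line read.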

TriviaQA

TriviaQA is an open-domain QA dataset. More useful scripts are available in the official GitHub repository.

Run one of the following commands to download TriviaQA:

python3 prepare_triviaqa.py --version rc         # Download TriviaQA version 1.0 for RC (2.5G)
python3 prepare_triviaqa.py --version unfiltered # Download unfiltered TriviaQA version 1.0 (604M)

# Or download with command-line toolkits
nlp_data prepare_triviaqa --version rc
nlp_data prepare_triviaqa --version unfiltered

The directory structure of the TriviaQA (rc and unfiltered) dataset will be as follows:

triviaqa
├── triviaqa-rc
│   ├── qa
│   │   ├── verified-web-dev.json
│   │   ├── web-dev.json
│   │   ├── web-train.json
│   │   ├── web-test-without-answers.json
│   │   ├── verified-wikipedia-dev.json
│   │   ├── wikipedia-test-without-answers.json
│   │   ├── wikipedia-dev.json
│   │   ├── wikipedia-train.json
│   ├── evidence
│   │   ├── web
│   │   ├── wikipedia
├── triviaqa-unfiltered
│   ├── unfiltered-web-train.json
│   ├── unfiltered-web-dev.json
│   ├── unfiltered-web-test-without-answers.json
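
Because the rc download is large, it can be worth confirming that the layout matches the listing above before starting a long job. The following is a minimal sketch of such a check. The file list is copied from the listing, while `missing_entries` and `list_relative_files` are hypothetical helpers, not part of `prepare_triviaqa.py`. The `evidence` directories are left out because the listing does not enumerate their contents.

```python
import os

# Relative file paths copied from the directory listing above.
EXPECTED_TRIVIAQA = [
    "triviaqa-rc/qa/verified-web-dev.json",
    "triviaqa-rc/qa/web-dev.json",
    "triviaqa-rc/qa/web-train.json",
    "triviaqa-rc/qa/web-test-without-answers.json",
    "triviaqa-rc/qa/verified-wikipedia-dev.json",
    "triviaqa-rc/qa/wikipedia-test-without-answers.json",
    "triviaqa-rc/qa/wikipedia-dev.json",
    "triviaqa-rc/qa/wikipedia-train.json",
    "triviaqa-unfiltered/unfiltered-web-train.json",
    "triviaqa-unfiltered/unfiltered-web-dev.json",
    "triviaqa-unfiltered/unfiltered-web-test-without-answers.json",
]


def missing_entries(expected, present):
    """Return the expected relative paths that are absent from `present`."""
    present = set(present)
    return [p for p in expected if p not in present]


def list_relative_files(root):
    """Collect every file under `root` as a '/'-separated relative path."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return found


# Usage, assuming the dataset was downloaded into ./triviaqa:
#   missing = missing_entries(EXPECTED_TRIVIAQA, list_relative_files("triviaqa"))
#   print("missing:", missing or "none")
```

If only one of the two versions was downloaded, the entries for the other version will be reported as missing, which is expected.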

HotpotQA

HotpotQA is distributed under a CC BY-SA 4.0 license. We provide only a download script (run with the command below); please check out the GitHub repository for details on preprocessing and evaluation.

python3 prepare_hotpotqa.py

# Or download with command-line toolkits
nlp_data prepare_hotpotqa

The directory structure of the HotpotQA dataset will be as follows:

hotpotqa
├── hotpot_train_v1.1.json
├── hotpot_dev_fullwiki_v1.json
├── hotpot_dev_distractor_v1.json
├── hotpot_test_fullwiki_v1.json
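
Since no preprocessing is provided here, a short sketch of reading the raw files may help as a starting point. It assumes the published HotpotQA JSON layout: a top-level list of examples, each with a `context` list of `[title, sentences]` pairs and a `supporting_facts` list of `[title, sentence_index]` pairs. `load_hotpotqa` and `supporting_sentences` are hypothetical helper names, and this is not the official preprocessing.

```python
import json


def load_hotpotqa(path):
    """Read a HotpotQA file such as hotpotqa/hotpot_dev_distractor_v1.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # ASSUMPTION: the top level is a JSON list


def supporting_sentences(example):
    """Return the sentences that the example's supporting facts point at."""
    # "context" is a list of [title, [sentence, ...]] pairs.
    paragraphs = {title: sentences for title, sentences in example["context"]}
    selected = []
    # "supporting_facts" is a list of [title, sentence_index] pairs.
    # ASSUMPTION: it may be absent (e.g. in the test file), so default to [].
    for title, sent_idx in example.get("supporting_facts", []):
        sentences = paragraphs.get(title, [])
        # Skip indices that fall outside the paragraph instead of failing.
        if 0 <= sent_idx < len(sentences):
            selected.append(sentences[sent_idx])
    return selected
```

Out-of-range indices are skipped rather than raised so that one inconsistent example does not abort a pass over the whole file; a stricter pipeline might prefer to log them.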