README for retrieval-based baselines

This repo provides guidelines for training and testing retrieval-based baselines for the NeurIPS Competition on Efficient Open-domain Question Answering.

We provide two retrieval-based baselines:

TF-IDF: TF-IDF retrieval built on fixed-length passages, adapted from the DrQA system's implementation.
DPR: A learned dense passage retriever, detailed in Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020). Our baseline is adapted from the original implementation.

Note that for both baselines, we use text blocks of 100 words as passages and a BERT-base multi-passage reader. See more details in the DPR paper.
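
To make the passage unit concrete, here is a minimal sketch of cutting an article into non-overlapping 100-word blocks. The function name split_into_passages is hypothetical, and the whitespace tokenization and the handling of the short final block are assumptions; the DPR preprocessing that produced psgs_w100.tsv may differ in both respects.

```python
def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping blocks of at most N words.

    Assumption: words are whitespace-separated and a short trailing block
    is kept as its own passage.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


# A toy "article" of 250 words yields blocks of 100, 100 and 50 words.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])  # [100, 100, 50]
```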

We provide two variants of each model, using (1) full Wikipedia (full) and (2) a subset of Wikipedia articles that are relevant to the questions in the training data (subset). We consider subset a naive way to reduce the disk usage of the retrieval-based baselines.

Note: If you want to try parameter-only (T5-based) baselines for the competition, an implementation of the T5-based closed-book QA model is available here.

Note: If you want simple guidelines on making end-to-end QA predictions using pretrained models, please refer to this tutorial.

Getting ready

Git clone

git clone https://github.com/facebookresearch/DPR.git # dependency
git clone https://github.com/efficientqa/retrieval-based-baselines.git # this repo

Download data

Follow the DPR repo to download the NQ data and the Wikipedia DB. Specifically, after running cd DPR and setting base_dir to the base directory where you store data and pretrained models:

Download QA pairs by python3 data/download_data.py --resource data.retriever.qas --output_dir ${base_dir} and python3 data/download_data.py --resource data.retriever.nq --output_dir ${base_dir}.
Download the Wikipedia DB by python3 data/download_data.py --resource data.wikipedia_split --output_dir ${base_dir}.
Download gold question-passage pairs by python3 data/download_data.py --resource data.gold_passages_info --output_dir ${base_dir}.

Optionally, if you want to try the subset variant, run cd ../retrieval-based-baselines; python3 filter_subset_wiki.py --db_path ${base_dir}/data/wikipedia_split/psgs_w100.tsv --data_path ${base_dir}/data/retriever/nq-train.json. This script creates a new passage DB containing only passages from articles that are paired with a question in the original NQ data (78,050 unique articles; 1,642,855 unique passages). The new DB is stored at ${base_dir}/data/wikipedia_split/psgs_w100_subset.tsv.
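
The filtering idea can be sketched as follows. This is not the code of filter_subset_wiki.py; it is an illustration under two assumptions: that the passage TSV has the columns id, text and title, and that each training example lists its paired passages under positive_ctxs with a title field. Both helper names are hypothetical.

```python
import csv
import io


def titles_from_training_data(examples):
    """Collect article titles paired with training questions.

    Assumption: each example has a "positive_ctxs" list of dicts with "title".
    """
    titles = set()
    for example in examples:
        for ctx in example.get("positive_ctxs", []):
            titles.add(ctx["title"])
    return titles


def filter_passages_by_title(rows, kept_titles):
    """Yield only the passage rows whose article title is in kept_titles."""
    for row in rows:
        if row["title"] in kept_titles:
            yield row


# Tiny in-memory stand-ins for psgs_w100.tsv and nq-train.json.
tsv = (
    "id\ttext\ttitle\n"
    "1\tParis is the capital of France.\tParis\n"
    "2\tBerlin is in Germany.\tBerlin\n"
    "3\tParis has many museums.\tParis\n"
)
train = [{"question": "capital of france", "positive_ctxs": [{"title": "Paris"}]}]

kept = titles_from_training_data(train)
rows = csv.DictReader(io.StringIO(tsv), delimiter="\t")
subset = list(filter_passages_by_title(rows, kept))
print([row["id"] for row in subset])  # ['1', '3']
```

Filtering by article title rather than by passage id keeps every passage of a relevant article, which matches the description above of passages whose originating articles are paired with a question.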

From now on, db_path denotes the Wikipedia DB you use (either full or subset).

TFIDF retrieval

Make sure you are in the retrieval-based-baselines directory when running the TFIDF scripts (largely adapted from the DrQA repo).

Step 1: Run pip install -r requirements.txt

Step 2: Build Sqlite DB via:

mkdir -p ${base_dir}/tfidf
python3 build_db.py ${db_path} ${base_dir}/tfidf/db.db --num-workers 60

Step 3: Run the following command to build the TFIDF index.

python3 build_tfidf.py ${base_dir}/tfidf/db.db ${base_dir}/tfidf

This saves the TF-IDF index in ${base_dir}/tfidf.
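
For readers unfamiliar with TF-IDF retrieval, the sketch below shows the general technique on three toy passages. It uses one common weighting, log(1 + tf) * log(N / df), over whitespace-separated unigrams; this is an assumption chosen for brevity, and build_tfidf.py may use a different weighting, n-gram features and sparse matrices. The names build_index and retrieve are hypothetical.

```python
import math
from collections import Counter


def build_index(passages):
    """Return one TF-IDF weight dict per passage, plus the IDF table."""
    tokenized = [p.lower().split() for p in passages]
    n = len(tokenized)
    # Document frequency: in how many passages each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: math.log1p(c) * idf[t] for t, c in tf.items()})
    return vectors, idf


def retrieve(query, vectors, idf, k=1):
    """Rank passages by the dot product of query and passage TF-IDF vectors."""
    q_tf = Counter(query.lower().split())
    q_vec = {t: math.log1p(c) * idf[t] for t, c in q_tf.items() if t in idf}
    scores = [
        sum(weight * vec.get(term, 0.0) for term, weight in q_vec.items())
        for vec in vectors
    ]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


passages = [
    "the eiffel tower is in paris",
    "the brandenburg gate is in berlin",
    "the colosseum is in rome",
]
vectors, idf = build_index(passages)
print(retrieve("where is the eiffel tower", vectors, idf, k=1))  # [0]
```

Terms that occur in every passage ("the", "is", "in") get an IDF of zero here, so only the rare terms "eiffel" and "tower" contribute to the score.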

Step 4: Run inference code to save retrieval results.

python3 inference_tfidf.py --qa_file ${base_dir}/data/retriever/qas/nq-{train|dev|test}.csv --db_path ${db_path} --out_file ${base_dir}/tfidf/nq-{train|dev|test}.json --tfidf_path {path_to_tfidf_index}

The resulting files, ${base_dir}/tfidf/nq-{train|dev|test}.json, are ready to be fed into the DPR reader.
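
Before training a reader, it can help to check how often the gold answer appears among the top retrieved passages. The sketch below computes top-k retrieval accuracy under the assumption that each entry of the results file has a ctxs list, ordered by score, whose items carry a boolean has_answer field (the layout used by DPR retrieval results); verify this against your own output before relying on it. The function name top_k_accuracy is hypothetical.

```python
def top_k_accuracy(results, k):
    """Fraction of questions with an answer-bearing passage in the top k.

    Assumption: results is a list of dicts, each with a score-ordered
    "ctxs" list whose items have a boolean "has_answer" field.
    """
    hits = sum(
        any(ctx["has_answer"] for ctx in example["ctxs"][:k])
        for example in results
    )
    return hits / len(results)


# Toy results for two questions; with a real file, load it via json.load.
results = [
    {"question": "q1", "ctxs": [{"has_answer": False}, {"has_answer": True}]},
    {"question": "q2", "ctxs": [{"has_answer": False}, {"has_answer": False}]},
]
print(top_k_accuracy(results, 1), top_k_accuracy(results, 2))  # 0.0 0.5
```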

DPR retrieval

Follow the DPR repo to train the DPR retriever and run inference. You can follow its steps up to Retriever validation.

If you want to use the retriever checkpoint provided by DPR, follow these three steps.

Step 1: Make sure you are in the DPR directory, and download the retriever checkpoint by python3 data/download_data.py --resource checkpoint.retriever.multiset.bert-base-encoder --output_dir ${base_dir}.

Step 2: Save passage vectors by following Generating representations. Note that you can set ctx_file to your own db_path if you are trying the subset ("seen only") version. In particular, you can run

python3 generate_dense_embeddings.py --model_file ${base_dir}/checkpoint/retriever/multiset/bert-base-encoder.cp --ctx_file ${db_path} --shard_id {0-19} --num_shards 20 --out_file ${base_dir}/dpr_ctx
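
The {0-19} placeholder means the command is run once per shard. As a convenience, the sketch below builds the 20 argument lists from the flags shown above; the helper name shard_commands and the example paths are hypothetical, and launching the jobs (for example with subprocess.run(cmd, check=True), one GPU per job) is left to your own setup.

```python
def shard_commands(base_dir, db_path, num_shards=20):
    """Build one generate_dense_embeddings.py argument list per shard."""
    model_file = f"{base_dir}/checkpoint/retriever/multiset/bert-base-encoder.cp"
    return [
        [
            "python3", "generate_dense_embeddings.py",
            "--model_file", model_file,
            "--ctx_file", db_path,
            "--shard_id", str(shard_id),
            "--num_shards", str(num_shards),
            "--out_file", f"{base_dir}/dpr_ctx",
        ]
        for shard_id in range(num_shards)
    ]


# Example paths are placeholders; substitute your own base_dir and db_path.
commands = shard_commands("/data/efficientqa", "/data/efficientqa/psgs.tsv")
print(len(commands))          # 20
print(" ".join(commands[0]))  # the command for shard 0
```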

Step 3: Save retrieval results by following Retriever validation. In particular, you can run

mkdir -p ${base_dir}/dpr_retrieval
python3 dense_retriever.py \
  --model_file ${base_dir}/checkpoint/retriever/multiset/bert-base-encoder.cp \
  --ctx_file ${db_path} \
  --qa_file ${base_dir}/data/retriever/qas/nq-{train|dev|test}.csv \
  --encoded_ctx_file ${base_dir}/'dpr_ctx*' \
  --out_file ${base_dir}/dpr_retrieval/nq-{train|dev|test}.json \
  --n-docs 200 \
  --save_or_load_index # saves the dense index the first time it is built and loads it on later runs

Now, ${base_dir}/dpr_retrieval/nq-{train|dev|test}.json is ready to be fed into the DPR reader.
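
Conceptually, dense retrieval scores each passage by the inner product between its vector and the question vector and keeps the highest-scoring passages (200 per question with --n-docs 200). The brute-force sketch below illustrates that scoring on toy 2-dimensional vectors. It is only an illustration: the real vectors come from the BERT encoders, and dense_retriever.py searches them through an index rather than a Python loop. The function name and the vectors are made up.

```python
import heapq


def top_k_by_inner_product(query_vec, passage_vecs, k):
    """Return the ids of the k passages with the largest inner product."""
    scored = (
        (sum(q * p for q, p in zip(query_vec, vec)), passage_id)
        for passage_id, vec in passage_vecs.items()
    )
    return [passage_id for _, passage_id in heapq.nlargest(k, scored)]


# Toy vectors standing in for encoder outputs.
passage_vecs = {
    "p1": [1.0, 0.0],
    "p2": [0.6, 0.8],
    "p3": [0.0, 1.0],
}
question_vec = [0.9, 0.1]
print(top_k_by_inner_product(question_vec, passage_vecs, k=2))  # ['p1', 'p2']
```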

DPR reader

Note: The following instructions are identical to those in the DPR README, but we rewrite them with the hyperparameters specified for our baselines.

The following instructions train the reader using TFIDF results, saved in ${base_dir}/tfidf/nq-{train|dev|test}.json. To use DPR retrieval results instead, simply replace these paths with ${base_dir}/dpr_retrieval/nq-{train|dev|test}.json.

Step 1: Preprocess data.

python3 preprocess_reader_data.py \
  --retriever_results ${base_dir}/tfidf/nq-{train|dev|test}.json \
  --gold_passages ${base_dir}/data/gold_passages_info/nq_{train|dev|test}.json \
  --do_lower_case \
  --pretrained_model_cfg bert-base-uncased \
  --encoder_model_type hf_bert \
  --out_file ${base_dir}/tfidf/nq-{train|dev|test}-tfidf \
  --is_train_set # specify this only when it is train data
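
Since the command above is run once per split and --is_train_set applies only to the training split, a small helper can expand the {train|dev|test} placeholder. The sketch below only assembles the argument lists from the flags shown above; the name preprocess_command and the example base directory are hypothetical.

```python
def preprocess_command(base_dir, split):
    """Build the preprocess_reader_data.py argument list for one split."""
    cmd = [
        "python3", "preprocess_reader_data.py",
        "--retriever_results", f"{base_dir}/tfidf/nq-{split}.json",
        "--gold_passages", f"{base_dir}/data/gold_passages_info/nq_{split}.json",
        "--do_lower_case",
        "--pretrained_model_cfg", "bert-base-uncased",
        "--encoder_model_type", "hf_bert",
        "--out_file", f"{base_dir}/tfidf/nq-{split}-tfidf",
    ]
    # Only the training split is preprocessed with --is_train_set.
    if split == "train":
        cmd.append("--is_train_set")
    return cmd


for split in ("train", "dev", "test"):
    print(" ".join(preprocess_command("/data/efficientqa", split)))
```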

Step 2: Train the reader.

python3 train_reader.py \
        --encoder_model_type hf_bert \
        --pretrained_model_cfg bert-base-uncased \
        --train_file ${base_dir}/tfidf/'nq-train*.pkl' \
        --dev_file ${base_dir}/tfidf/'nq-dev*.pkl' \
        --output_dir ${base_dir}/checkpoints/reader_from_tfidf \
        --seed 42 \
        --learning_rate 1e-5 \
        --eval_step 2000 \
        --eval_top_docs 50 \
        --warmup_steps 0 \
        --sequence_length 350 \
        --batch_size 16 \
        --passages_per_question 24 \
        --num_train_epochs 100000 \
        --dev_batch_size 72 \
        --passages_per_question_predict 50

Step 3: Test the reader.

python3 train_reader.py \
  --prediction_results_file ${base_dir}/checkpoints/reader_from_tfidf/dev_predictions.json \
  --eval_top_docs 10 20 40 50 80 100 \
  --dev_file ${base_dir}/tfidf/'nq-dev*.pkl' \
  --model_file ${base_dir}/checkpoints/reader_from_tfidf/{checkpoint file} \
  --dev_batch_size 80 \
  --passages_per_question_predict 100 \
  --sequence_length 350

Result

Model	Exact Match	Disk usage (GB)
TFIDF-full	32.0	20.1
TFIDF-subset	31.0	2.8
DPR-full	41.0	66.4
DPR-subset	34.8	5.9
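
For reference, Exact Match is usually computed by normalizing the predicted and gold answer strings and checking for equality with any gold answer. The sketch below follows the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); it is an assumption that the numbers above were produced with exactly this normalization, and the function names are illustrative.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # True
print(exact_match("Paris", ["London", "Berlin"]))          # False
```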
