Source code for our EMNLP 2022 Findings paper "Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in OpenQA". We open-source a two-stage OpenQA system that first retrieves relevant table-text blocks and then extracts answers from the retrieved evidence.
data_ottqa: this folder contains the original dataset copied from OTT-QA.
data_wikitable: this folder contains the crawled tables and linked passages from Wikipedia.
preprocessing: this folder contains the data for training, validating, and testing the retriever. The code to obtain data for the ablation study in the paper is also included.
retrieval: this folder contains the source code for the table-text retrieval stage.
qa: this folder contains the source code for the question answering stage.
scripts: this folder contains the .py and .sh files to run experiments.
preprocessed_data: this folder contains the preprocessed data produced by the preprocessing scripts.
BLINK: this folder contains the source code adapted from https://github.com/facebookresearch/BLINK for entity linking.
pillow==5.4.1
torch==1.8.0
transformers==4.5.0
faiss-gpu
tensorboard==1.15.0
tqdm
torch-scatter
scikit-learn
scipy
bottle
nltk
sentencepiece
pexpect
prettytable
fuzzywuzzy
dateparser
pathos
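You can install the dependencies above with pip, for example by saving them to a requirements.txt file (a hypothetical filename if your copy of the repository does not already ship one) and running:
pip install -r requirements.txt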
We also use apex to support mixed-precision training. You can install apex with the following commands.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir ./
git clone https://github.com/wenhuchen/OTT-QA.git
cp OTT-QA/release_data/* ./data_ottqa
cd data_wikitable/
wget https://opendomainhybridqa.s3-us-west-2.amazonaws.com/all_plain_tables.json
wget https://opendomainhybridqa.s3-us-west-2.amazonaws.com/all_passages.json
cd ../
This script downloads the crawled tables and linked passages from Wikipedia in a cleaned format.
We strongly suggest downloading our processed linked passages from all_constructed_blink_tables.json and skipping this Step 1-1, since the linking is very time-consuming. You can download the all_constructed_blink_tables.json.gz file, unzip it with gunzip, and move the resulting json file to ./data_wikitable. After that, go to Step 2-2 to preprocess. (You can also use the linked passages in all_constructed_tables.json following OTT-QA.)
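For example, assuming the archive has been downloaded to the repository root:
gunzip all_constructed_blink_tables.json.gz
mv all_constructed_blink_tables.json ./data_wikitable/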
If you want to run the entity linking yourself, you can use the following script:
cd scripts/
for i in {0..7}
do
echo "Starting process", ${i}
CUDA_VISIBLE_DEVICES=$i python link_prediction_blink.py --shard ${i}@8 --do_all --dataset ../data_wikitable/all_plain_tables.json --data_output ../data_wikitable/all_constructed_blink_tables.json 2>&1 |tee ./logs/blink_${i}@8.log &
done
Linking with the above script takes about 40-50 hours with 8 Tesla V100 32GB GPUs. After linking, you can merge the 8 split files all_constructed_blink_tables_${i}@8.json into a single json file all_constructed_blink_tables.json.
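A minimal merge sketch, run from the scripts/ directory and assuming each shard file is a single JSON object keyed by table id:
cd ../data_wikitable/
python - <<'EOF'
import glob
import json

merged = {}
for shard in sorted(glob.glob("all_constructed_blink_tables_*@8.json")):
    with open(shard) as f:
        merged.update(json.load(f))  # assumed: each shard maps table ids to linked tables

with open("all_constructed_blink_tables.json", "w") as f:
    json.dump(merged, f)
EOF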
python retriever_preprocess.py --split train --nega intable_contra --aug_blink
python retriever_preprocess.py --split dev --nega intable_contra --aug_blink
These two commands create the data used for training and validating the retriever.
python corpus_preprocess.py
This script encodes the whole corpus of table-text blocks used for retrieval.
Download the tfidf_augmentation_results.json.gz file here, then use the following commands to unzip it and move the unzipped json file to ./data_wikitable. This file will be used for preprocessing in Step 4-3 and Step 5-1.
gunzip tfidf_augmentation_results.json.gz
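Then move the unzipped file, for example (assuming the archive was downloaded to the repository root):
mv tfidf_augmentation_results.json ./data_wikitable/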
In this step, we pre-train OTTeR with the BART-generated mixed-modality synthetic corpus. You have two choices here.
(1) Skip Step 2 and jump to Step 3. In this case, you just need to remove the argument --init_checkpoint ${PRETRAIN_MODEL_PATH}/checkpoint_best.pt from the training script in Step 3.
(2) Download the pre-trained checkpoint, which can be found here. Download it, unzip it with the following command, and then move the unzipped folder to ${BASIC_PATH}/models.
unzip -d ./checkpoint-pretrain checkpoint-pretrain.zip
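For example (assuming ${BASIC_PATH} is set as in Step 3 below):
mkdir -p ${BASIC_PATH}/models
mv ./checkpoint-pretrain ${BASIC_PATH}/models/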
If you don't want to use the checkpoint pre-trained with our proposed Mixed-Modality Synthetic Pre-training, you can remove the line --init_checkpoint ${PRETRAIN_MODEL_PATH}/checkpoint_best.pt \ from the command below and run the remaining script.
export RUN_ID=0
export BASIC_PATH=.
export DATA_PATH=${BASIC_PATH}/preprocessed_data/retrieval
export TRAIN_DATA_PATH=${BASIC_PATH}/preprocessed_data/retrieval/train_intable_contra_blink_row.pkl
export DEV_DATA_PATH=${BASIC_PATH}/preprocessed_data/retrieval/dev_intable_contra_blink_row.pkl
export RT_MODEL_PATH=${BASIC_PATH}/models/otter
export PRETRAIN_MODEL_PATH=${BASIC_PATH}/models/checkpoint-pretrain/
export TABLE_CORPUS=table_corpus_blink
mkdir ${RT_MODEL_PATH}
cd ./scripts
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_1hop_tb_retrieval.py \
--do_train \
--prefix ${RUN_ID} \
--predict_batch_size 800 \
--model_name roberta-base \
--shared_encoder \
--train_batch_size 64 \
--fp16 \
--init_checkpoint ${PRETRAIN_MODEL_PATH}/checkpoint_best.pt \
--max_c_len 512 \
--max_q_len 70 \
--num_train_epochs 20 \
--warmup_ratio 0.1 \
--train_file ${TRAIN_DATA_PATH} \
--predict_file ${DEV_DATA_PATH} \
--output_dir ${RT_MODEL_PATH} \
2>&1 |tee ./retrieval_training.log
The training step takes about 10~12 hours with 8 Tesla V100 16GB GPUs.
Encode dev questions.
cd ./scripts
CUDA_VISIBLE_DEVICES="0,1,2,3" python encode_corpus.py \
--do_predict \
--predict_batch_size 100 \
--model_name roberta-base \
--shared_encoder \
--predict_file ${BASIC_PATH}/data_ottqa/dev.json \
--init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
--embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/question_dev \
--fp16 \
--max_c_len 512 \
--num_workers 8 2>&1 |tee ./encode_corpus_dev.log
Encode the table-text block corpus. It takes about 3 hours to encode.
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python encode_corpus.py \
--do_predict \
--encode_table \
--shared_encoder \
--predict_batch_size 1600 \
--model_name roberta-base \
--predict_file ${DATA_PATH}/${TABLE_CORPUS}.pkl \
--init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
--embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS} \
--fp16 \
--max_c_len 512 \
--num_workers 24 2>&1 |tee ./encode_corpus_table_blink.log
The reported results are table recalls.
python eval_ottqa_retrieval.py \
--raw_data_path ${BASIC_PATH}/data_ottqa/dev.json \
--eval_only_ans \
--query_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/question_dev.npy \
--corpus_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}.npy \
--id2doc_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}/id2doc.json \
--output_save_path ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json \
--beam_size 100 2>&1 |tee ./results_retrieval_dev.log
This step also evaluates the table block recall defined in our paper. We use the top 15 table-text blocks for QA, i.e., CONCAT_TBS=15.
export CONCAT_TBS=15
python ../preprocessing/qa_preprocess.py \
--split dev \
--topk_tbs ${CONCAT_TBS} \
--retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json \
--qa_save_path ${RT_MODEL_PATH}/dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
2>&1 |tee ./preprocess_qa_dev_k100cat${CONCAT_TBS}.log;
As we mainly focus on improving retrieval accuracy in this paper, we use the state-of-the-art reader model to evaluate downstream QA performance.
As mentioned in our paper, to balance the distributions of the training data and the inference data, we also take k table-text blocks for training, consisting of several ground-truth blocks plus retrieved blocks for the rest. We use the following scripts to obtain the training data.
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python encode_corpus.py \
--do_predict \
--predict_batch_size 200 \
--model_name roberta-base \
--shared_encoder \
--predict_file ${BASIC_PATH}/data_ottqa/train.json \
--init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
--embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/question_train \
--fp16 \
--max_c_len 512 \
--num_workers 16 2>&1 |tee ./encode_corpus_train.log
python eval_ottqa_retrieval.py \
--raw_data_path ${BASIC_PATH}/data_ottqa/train.json \
--eval_only_ans \
--query_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/question_train.npy \
--corpus_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}.npy \
--id2doc_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}/id2doc.json \
--output_save_path ${RT_MODEL_PATH}/indexed_embeddings/train_output_k100_${TABLE_CORPUS}.json \
--beam_size 100 2>&1 |tee ./results_retrieval_train.log
python ../preprocessing/qa_preprocess.py \
--split train \
--topk_tbs 15 \
--retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/train_output_k100_${TABLE_CORPUS}.json \
--qa_save_path ${RT_MODEL_PATH}/train_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
2>&1 |tee ./preprocess_qa_train_k100.log
Note that we also require the retrieval output for the dev set. You can refer to Step 4-3 to obtain the processed QA data.
export BASIC_PATH=.
export MODEL_NAME=mrm8488/longformer-base-4096-finetuned-squadv2
export TOPK=15
export QA_MODEL_PATH=${BASIC_PATH}/models/qa_longformer_${TOPK}_squadv2
mkdir ${QA_MODEL_PATH}
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_final_qa.py \
--do_train \
--do_eval \
--model_type longformer \
--dont_save_cache \
--overwrite_cache \
--model_name_or_path ${MODEL_NAME} \
--evaluate_during_training \
--data_dir ${RT_MODEL_PATH} \
--output_dir ${QA_MODEL_PATH} \
--train_file train_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
--dev_file dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
--per_gpu_train_batch_size 2 \
--per_gpu_eval_batch_size 8 \
--learning_rate 1e-5 \
--num_train_epochs 4 \
--max_seq_length 4096 \
--doc_stride 1024 \
--topk_tbs ${TOPK} \
2>&1 | tee ./train_qa_longformer-base-top${TOPK}.log
export BASIC_PATH=.
export TOPK=15
export QA_MODEL_PATH=${BASIC_PATH}/models/qa_longformer_${TOPK}_squadv2
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_final_qa.py \
--do_eval \
--model_type longformer \
--dont_save_cache \
--overwrite_cache \
--model_name_or_path ${MODEL_NAME} \
--data_dir ${RT_MODEL_PATH} \
--output_dir ${QA_MODEL_PATH} \
--dev_file dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
--per_gpu_eval_batch_size 16 \
--max_seq_length 4096 \
--doc_stride 1024 \
--topk_tbs ${TOPK} \
2>&1 | tee ./test_qa_longformer-base-top${TOPK}.log
If you find our code useful, please cite our paper using the following format:
@inproceedings{huang-etal-2022-mixed,
title = "Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in {O}pen{QA}",
author={Huang, Junjie and Zhong, Wanjun and Liu, Qian and Gong, Ming and Jiang, Daxin and Duan, Nan},
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.303",
pages = "4117--4129",
}
You can also check out our other paper focusing on reasoning:
@inproceedings{Zhong2022ReasoningOH,
title={Reasoning over Hybrid Chain for Table-and-Text Open Domain Question Answering},
author={Wanjun Zhong and Junjie Huang and Qian Liu and Ming Zhou and Jiahai Wang and Jian Yin and Nan Duan},
booktitle={IJCAI},
year={2022}
}