We support two retrievers, BM25 and ColBERT, and two retrieval corpora: kilt_wikipedia (the KILT version of Wikipedia) and the 2023 Annual Medline Corpus.
There are three steps:
- Get top-k passages from the retrievers
- Evaluate each retriever separately
- Compare retrievers
Getting BM25 outputs involves two steps: 1) using the pyserini repo to build BM25 indices, and 2) using the KILT repo to produce BM25 outputs.
- Install Java 11.
- Run `BM25/get_indices.sh`. This will output a folder `${index_dir}/${corpus}_jsonl`.
- Replace `${index_dir}/${corpus}_jsonl` in `BM25/default_bm25.json`.
- Customize `KILT/kilt/configs/${dataset}.json` for your selected dataset.
- Run `BM25/bm25.sh` to output the top-k passages for each query in the file `${prediction_dir}/bm25/${dataset}.jsonl` (a sketch of querying the index directly follows this list).
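If you want to sanity-check the index outside the KILT wrapper, the sketch below queries a Pyserini Lucene index directly. The index path, query, and `k` are illustrative assumptions, not values used by the pipeline.

```python
# Minimal sketch: query a Pyserini BM25 (Lucene) index directly.
# The index directory below is a hypothetical ${index_dir}/${corpus}_jsonl.
from pyserini.search.lucene import LuceneSearcher

index_dir = "indices/kilt_wikipedia_jsonl"  # adjust to your index location
searcher = LuceneSearcher(index_dir)

query = "pace maker is associated with which body organ"
hits = searcher.search(query, k=5)  # top-k passages by BM25 score

for rank, hit in enumerate(hits, start=1):
    print(f"{rank}. docid={hit.docid} score={hit.score:.4f}")
```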
Each line corresponds to a query. This is an example of one line:
{"id": "-1027463814348738734",
"input": "pace maker is associated with which body organ",
"output": [{"provenance": [
# top 1 retrieved paragraph
{"page_id": "557054",
"start_par_id": "4",
"end_par_id": "4",
"text": "Peristalsis of the smooth muscle originating in pace-maker cells originating in the walls of the calyces propels urine through the renal pelvis and ureters to the bladder. The initiation is caused by the increase in volume that stretches the walls of the calyces. This causes them to fire impulses which stimulate rhythmical contraction and relaxation, called peristalsis. Parasympathetic innervation enhances the peristalsis while sympathetic innervation inhibits it.",
"score": 20.9375},
# top 2 retrieved paragraph ...
]}]}
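For a quick look at the predictions, the following sketch loads the prediction file and prints the top retrieved passages per query. Field names follow the example above; the file path is a hypothetical stand-in for `${prediction_dir}/bm25/${dataset}.jsonl`.

```python
# Minimal sketch: inspect a retriever prediction file (one JSON object per line).
import json

pred_path = "predictions/bm25/nq.jsonl"  # hypothetical ${prediction_dir}/bm25/${dataset}.jsonl

with open(pred_path) as f:
    for line in f:
        record = json.loads(line)
        provenance = record["output"][0]["provenance"]
        print(record["input"])
        for rank, passage in enumerate(provenance[:3], start=1):
            print(f"  {rank}. page_id={passage['page_id']} "
                  f"par={passage['start_par_id']}-{passage['end_par_id']} "
                  f"score={passage['score']}")
```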
- Process the previously downloaded corpus file into ColBERT format by running `python retriever/data_processing/create_corpus_tsv.py --corpus $corpus --corpus_dir $corpus_dir`, which outputs `$corpus_dir/${corpus}/${corpus}.json` (see the sketch after this list for the kind of conversion involved).
- Clone our modified version of the original ColBERT repo.
- Download the pre-trained ColBERTv2 checkpoint into your `$model_dir`. This checkpoint was trained on the MS MARCO Passage Ranking task. You can also optionally train your own ColBERT model.
- Run `ColBERT/colbert.sh` to output `${prediction_dir}/colbert/${dataset}.jsonl`.
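ColBERT typically consumes a TSV collection of the form `pid<TAB>passage`. The sketch below illustrates that kind of conversion under the assumption that the corpus is stored as JSONL with `id` and `text` fields; the actual `create_corpus_tsv.py` may use different field names, paths, and output format.

```python
# Illustrative sketch: convert a JSONL corpus into a ColBERT-style TSV collection.
# Field names and paths are assumptions, not the script's actual behavior.
import json

corpus_jsonl = "corpus/kilt_wikipedia/kilt_wikipedia.jsonl"  # hypothetical input
collection_tsv = "corpus/kilt_wikipedia/collection.tsv"      # hypothetical output

with open(corpus_jsonl) as fin, open(collection_tsv, "w") as fout:
    for pid, line in enumerate(fin):
        doc = json.loads(line)
        # ColBERT collections are usually "pid \t passage", one passage per line.
        text = doc["text"].replace("\t", " ").replace("\n", " ")
        fout.write(f"{pid}\t{text}\n")
```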
To evaluate the predictions, first compile all gold (evidence) information by running `python retriever/data_processing/get_gold_compilation.py --data_dir $data_dir --dataset $dataset`, which outputs `$data_dir/gold_compilation_files/gold_${dataset}_compilation_file.json`.
Then, run `evaluate_retriever.sh`, which outputs the following three files.
The first output file is `${evaluation_dir}/${retriever}/${dataset}.jsonl`.
Each line corresponds to a query. This is an example of one line:
{"id": "-1027463814348738734",
"gold provenance metadata": {"num_page_ids": 2, "num_page_par_ids": 2},
"passage-level results": [
{"page_id": "557054", "page_id_match": false, "answer_in_context": false, "page_par_id": "557054_4", "page_par_id_match": false},
...
{"page_id": "12887799", "page_id_match": false, "answer_in_context": true, "page_par_id": "12887799_2", "page_par_id_match": false}
]
}
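To see how these per-passage flags roll up into the metrics reported below, the sketch reads this file and computes top-k page_id accuracy and answer_in_context@k. Field names follow the example above; the path and the assumption that "top-k accuracy" means "at least one match in the first k passages" are ours, and the official numbers come from `evaluate_retriever.sh`.

```python
# Illustrative sketch: aggregate per-passage flags into top-k metrics.
import json

eval_path = "evaluation/bm25/nq.jsonl"  # hypothetical ${evaluation_dir}/${retriever}/${dataset}.jsonl
k = 5

n_queries = 0
page_id_hits = 0
answer_hits = 0

with open(eval_path) as f:
    for line in f:
        record = json.loads(line)
        top_k = record["passage-level results"][:k]
        n_queries += 1
        page_id_hits += any(p["page_id_match"] for p in top_k)
        answer_hits += any(p["answer_in_context"] for p in top_k)

print(f"top-{k} page_id accuracy: {page_id_hits / n_queries:.4f}")
print(f"answer_in_context@{k}:    {answer_hits / n_queries:.4f}")
```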
The second output file is `${evaluation_dir}/${retriever}/${dataset}_results_by_k.json`, which reports retrieval performance for each value of k. An example for k = 1 is shown below:
"1": {
"top-k page_id accuracy": 0.3595347197744096,
"top-k page_par_id accuracy": 0.24814945364821994,
"precision@k page_id": 0.3595347197744096,
"precision@k page_par_id": 0.24814945364821994,
"recall@k page_id": 0.27195394195746725,
"recall@k page_par_id": 0.12315278912917237,
"answer_in_context@k": 0.44448360944659854
}
The third output file is `${evaluation_dir}/${retriever}/${dataset}_results_by_k.jpg`, which plots retriever performance as a function of k.
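If you want to produce a similar plot yourself (for example, with a different subset of metrics), a minimal matplotlib sketch is shown below. The path, metric selection, and styling are assumptions, not necessarily what `evaluate_retriever.sh` does.

```python
# Illustrative sketch: plot selected metrics from ${dataset}_results_by_k.json against k.
import json
import matplotlib.pyplot as plt

results_path = "evaluation/bm25/nq_results_by_k.json"  # hypothetical path
metrics = ["top-k page_id accuracy", "recall@k page_par_id", "answer_in_context@k"]

with open(results_path) as f:
    results_by_k = json.load(f)  # {"1": {...}, "2": {...}, ...}

ks = sorted(results_by_k, key=int)
for metric in metrics:
    plt.plot([int(k) for k in ks],
             [results_by_k[k][metric] for k in ks],
             marker="o", label=metric)

plt.xlabel("k")
plt.ylabel("score")
plt.legend()
plt.savefig("results_by_k.jpg")
```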
After evaluating each retriever separately as described above, use `evaluate_retriever.ipynb` to compare different retrievers across values of k.
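If you prefer a script over the notebook, the sketch below overlays one metric from multiple retrievers' `*_results_by_k.json` files on a single plot. Retriever names, dataset name, and paths are illustrative.

```python
# Illustrative sketch: compare retrievers on one metric across k.
import json
import matplotlib.pyplot as plt

retrievers = ["bm25", "colbert"]
dataset = "nq"  # hypothetical dataset name
metric = "top-k page_par_id accuracy"

for retriever in retrievers:
    path = f"evaluation/{retriever}/{dataset}_results_by_k.json"  # hypothetical ${evaluation_dir}
    with open(path) as f:
        results_by_k = json.load(f)
    ks = sorted(results_by_k, key=int)
    plt.plot([int(k) for k in ks],
             [results_by_k[k][metric] for k in ks],
             marker="o", label=retriever)

plt.xlabel("k")
plt.ylabel(metric)
plt.legend()
plt.savefig(f"{dataset}_retriever_comparison.jpg")
```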