This README demonstrates how to reproduce the ranking runs on the TREC-COVID relevance dataset. Go through the steps on how to build and deploy the app first.
Export the documents using ir_datasets. Note that there are two trec-covid datasets in ir_datasets, cord19/trec-covid and beir/trec-covid. There are differences in the total number of documents and the number of relevance judgments. In this work we use the beir/trec-covid version.
ir_datasets export beir/trec-covid docs --format jsonl --fields text title doc_id | python3 scripts/trec-covid-dataset.py > trec_covid_feed.jsonl
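The contents of scripts/trec-covid-dataset.py are not reproduced here; the sketch below only illustrates the kind of conversion such a script performs, turning the exported JSONL documents into Vespa feed operations. The namespace, document type and field names used below are assumptions, not necessarily the sample app's actual schema.

```python
# Hypothetical sketch of a converter in the spirit of scripts/trec-covid-dataset.py.
# Reads ir_datasets JSONL documents from stdin and writes one Vespa feed
# operation per line. The namespace "covid", document type "doc" and the
# field names are assumptions.
import json
import sys

for line in sys.stdin:
    doc = json.loads(line)
    feed_operation = {
        "put": "id:covid:doc::" + doc["doc_id"],
        "fields": {
            "doc_id": doc["doc_id"],
            "title": doc.get("title", ""),
            "text": doc.get("text", ""),
        },
    }
    print(json.dumps(feed_operation))
```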
Index the dataset into Vespa using the Vespa CLI:
vespa feed -t http://localhost:8080 trec_covid_feed.jsonl
Dump query-document relevance judgements in trec_eval format using ir_datasets:
ir_datasets export beir/trec-covid qrels > beir-trec-covid-qrels.txt
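The exported qrels file is in the standard trec_eval qrels format: four whitespace-separated columns per line (query id, iteration, document id, relevance grade). A quick sanity check of the export, assuming the file name above:

```python
# Sanity-check the exported qrels file: every line should contain exactly
# four columns (query_id, iteration, doc_id, relevance grade).
with open("beir-trec-covid-qrels.txt") as qrels:
    for line_number, line in enumerate(qrels, 1):
        columns = line.split()
        assert len(columns) == 4, "unexpected qrels line %d: %r" % (line_number, line)
print("qrels file looks well-formed")
```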
Run the evaluation script against the Vespa endpoint for each of the three ranking profiles, which produces the run files (bm25.run, colbert.run, hybrid-colbert.run) evaluated with trec_eval below:
python3 scripts/evaluate.py --endpoint http://localhost:8080/search/ --ranking bm25
python3 scripts/evaluate.py --endpoint http://localhost:8080/search/ --ranking colbert
python3 scripts/evaluate.py --endpoint http://localhost:8080/search/ --ranking hybrid-colbert
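scripts/evaluate.py is not reproduced here; the sketch below shows one way such a client could work, assuming it runs the beir/trec-covid queries against the Vespa search endpoint with the given rank profile and writes a TREC-format run file. The YQL, the doc_id field name and the output file naming are assumptions, and the ColBERT/hybrid profiles would in practice also need query embedding input, which is omitted.

```python
# Hypothetical sketch of an evaluation client in the spirit of scripts/evaluate.py.
# For every beir/trec-covid query it asks Vespa for the top hits using the given
# rank profile and writes them as a TREC run file: query_id Q0 doc_id rank score tag.
import ir_datasets
import requests

def run_queries(endpoint, ranking, out_file, hits=100):
    dataset = ir_datasets.load("beir/trec-covid")
    with open(out_file, "w") as out:
        for query in dataset.queries_iter():
            response = requests.post(endpoint, json={
                "yql": "select * from sources * where userQuery()",
                "query": query.text,
                "hits": hits,
                "ranking": ranking,
            })
            response.raise_for_status()
            children = response.json().get("root", {}).get("children", [])
            for rank, hit in enumerate(children, 1):
                out.write("%s Q0 %s %d %.6f %s\n" % (
                    query.query_id, hit["fields"]["doc_id"],
                    rank, hit["relevance"], ranking))

run_queries("http://localhost:8080/search/", "bm25", "bm25.run")
```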
Install trec_eval:
git clone https://github.com/usnistgov/trec_eval.git && cd trec_eval
make install
trec_eval -mndcg_cut.10 beir-trec-covid-qrels.txt bm25.run
ndcg_cut_10 all 0.6903
trec_eval -mndcg_cut.10 beir-trec-covid-qrels.txt colbert.run
ndcg_cut_10 all 0.6603
trec_eval -mndcg_cut.10 beir-trec-covid-qrels.txt hybrid-colbert.run
ndcg_cut_10 all 0.7501
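To evaluate all three runs in one go, the same trec_eval invocation can be looped over the run files, for example:

```python
# Run trec_eval with the nDCG@10 metric over each of the three run files.
import subprocess

for run_file in ["bm25.run", "colbert.run", "hybrid-colbert.run"]:
    subprocess.run(
        ["trec_eval", "-mndcg_cut.10", "beir-trec-covid-qrels.txt", run_file],
        check=True)
```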
The following table summarizes the results above together with results reported on the BEIR trec-covid leaderboard:
| Method | nDCG@10 |
|---|---|
| Elasticsearch default (BM25) | 0.616 |
| Anserini IR toolkit based on Lucene BM25 (k=0.9, b=0.4) | 0.656 |
| Vespa BM25 (k=0.9, b=0.4) | 0.690 |
| Vespa BM25 + ColBERT (hybrid) | 0.750 |