This guide contains instructions for running a BM25 baseline on the MS MARCO passage ranking task, which is nearly identical to a similar guide in Anserini, except that everything is in Python here (no Java). Note that there is a separate guide for the MS MARCO document ranking task. This exercise will require a machine with >8 GB RAM and >15 GB free disk space.
If you're a Waterloo student traversing the onboarding path (which starts here), make sure you've already done the BM25 Baselines for MS MARCO Passage Ranking in Anserini. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.
Learning outcomes for this guide, building on previous steps in the onboarding path:
- Be able to use Pyserini to build a Lucene inverted index on the MS MARCO passage collection.
- Be able to use Pyserini to perform a batch retrieval run on the MS MARCO passage collection with the dev queries.
- Be able to evaluate the retrieved results above.
- Be able to generate the retrieved results above interactively by directly manipulating Pyserini Python classes.
In short, you'll do everything you did with Anserini (in Java) on the MS MARCO passage ranking test collection, but now with Pyserini (in Python).
What's Pyserini? Well, it's the repo that you're in right now. Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. The toolkit provides Python bindings for our group's Anserini IR toolkit, which is built on Lucene (in Java). Pyserini provides entrée into the broader deep learning ecosystem, which is heavily Python-centric.
This guide requires the development installation of Pyserini, so get your Python environment set up first.
Once you've done that: congratulations, you've passed the most difficult part! Everything else below mirrors what you did in Anserini (in Java), so it should be easy.
We're going to use `collections/msmarco-passage/` as the working directory.
First, we need to download and extract the MS MARCO passage dataset:
```bash
mkdir collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
```
To confirm, `collectionandqueries.tar.gz` should have an MD5 checksum of `31644046b18952c1386cd4564ba2ae69`.
Next, we need to convert the MS MARCO tsv collection into Pyserini's jsonl files (which have one JSON object per line):
```bash
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl
```
The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
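Under the hood, the conversion just wraps each `docid<TAB>passage` row in a JSON object with the `id` and `contents` fields that `JsonCollection` expects. A minimal sketch of the same transformation (the `tsv_to_jsonl_lines` helper and the toy input are illustrative, not the repo script itself):

```python
import json

def tsv_to_jsonl_lines(tsv_lines):
    """Turn MS MARCO 'docid<TAB>passage' rows into one JSON object per line."""
    for line in tsv_lines:
        docid, text = line.rstrip('\n').split('\t', 1)
        # These are the two fields Pyserini's JsonCollection expects.
        yield json.dumps({'id': docid, 'contents': text})

# Toy example (not an actual passage from the collection):
print(next(tsv_to_jsonl_lines(['0\tThe presence of communication amid scientific minds...'])))
```

The real script additionally shards the output into multiple files, which is what enables multi-threaded indexing later.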
We can now index these documents as a `JsonCollection` using Pyserini:
```bash
python -m pyserini.index.lucene \
 --collection JsonCollection \
 --input collections/msmarco-passage/collection_jsonl \
 --index indexes/lucene-index-msmarco-passage \
 --generator DefaultLuceneDocumentGenerator \
 --threads 9 \
 --storePositions --storeDocvectors --storeRaw
```
The command-line invocation should look familiar: it essentially mirrors the command with Anserini (in Java). If you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.
Upon completion, you should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.
The 6980 queries in the development set are already stored in the repo. Let's take a peek:
```bash
$ head tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
1048585	what is paula deen's brother
2	Androgen receptor define
524332	treating tension headaches without medication
1048642	what is paranoid sc
524447	treatment of varicose veins in legs
786674	what is prime rate in canada
1048876	who plays young dr mallard on ncis
1048917	what is operating system misconfiguration
786786	what is priority pass
524699	tricare service number

$ wc tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
    6980   48335  290193 tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
```
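Since each topic line is just a tab-delimited (query id, query) pair, parsing the file by hand takes only a few lines of Python. A minimal sketch with a toy input (the `load_topics` helper is illustrative, not part of Pyserini):

```python
def load_topics(lines):
    """Parse tab-delimited 'qid<TAB>query' topic lines into a dict."""
    topics = {}
    for line in lines:
        qid, query = line.rstrip('\n').split('\t', 1)
        topics[qid] = query
    return topics

# Toy example using the first topic shown above:
print(load_topics(["1048585\twhat is paula deen's brother"]))
```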
Each line contains a tab-delimited (query id, query) pair. Conveniently, Pyserini already knows how to load and iterate through these pairs. We can now perform retrieval using these queries:
```bash
python -m pyserini.search.lucene \
 --index indexes/lucene-index-msmarco-passage \
 --topics msmarco-passage-dev-subset \
 --output runs/run.msmarco-passage.bm25tuned.txt \
 --output-format msmarco \
 --hits 1000 \
 --bm25 --k1 0.82 --b 0.68 \
 --threads 4 --batch-size 16
```
Here, we set the BM25 parameters to `k1=0.82`, `b=0.68` (tuned by grid search).
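For intuition about what `k1` and `b` control, here is a per-term sketch in the shape of Lucene's BM25 scoring function, idf · tf / (tf + k1 · (1 − b + b · dl/avgdl)). This is purely illustrative; Anserini/Lucene handle the actual scoring internally, and Lucene's stored document lengths are lossily encoded, so exact scores will differ:

```python
import math

def bm25_term_score(tf, df, N, dl, avgdl, k1=0.82, b=0.68):
    """One term's BM25 contribution: tf saturates via k1, length normalizes via b.
    tf: term frequency in the document; df: document frequency of the term;
    N: collection size; dl: document length; avgdl: average document length."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    return idf * tf / (tf + k1 * (1 - b + b * dl / avgdl))

# Longer-than-average documents are penalized (via b); repeated terms
# gain with diminishing returns (via k1).
short_doc = bm25_term_score(tf=2, df=1000, N=8841823, dl=30, avgdl=56)
long_doc = bm25_term_score(tf=2, df=1000, N=8841823, dl=200, avgdl=56)
```

Raising `b` toward 1 strengthens the length penalty; raising `k1` lets term frequency matter more before saturating.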
The option `--output-format msmarco` says to generate output in the MS MARCO output format.
The option `--hits` specifies the number of documents to return per query.
Thus, the output file should have approximately 6980 × 1000 = 6.9M lines.
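Each line of that output file is a tab-separated (qid, docid, rank) triple. If you want to inspect a run programmatically, here's a minimal sketch of a parser for this format (the `load_msmarco_run` helper and the toy input are illustrative):

```python
from collections import defaultdict

def load_msmarco_run(lines):
    """Parse MS MARCO-format run lines ('qid<TAB>docid<TAB>rank')
    into a dict mapping each qid to its rank-ordered list of docids."""
    run = defaultdict(list)
    for line in lines:
        qid, docid, rank = line.rstrip('\n').split('\t')
        run[qid].append((int(rank), docid))
    # Sort each query's hits by rank before returning.
    return {qid: [docid for _, docid in sorted(pairs)] for qid, pairs in run.items()}

# Toy example: two hits for one query, given out of order.
print(load_msmarco_run(['q1\td7\t2', 'q1\td3\t1']))  # {'q1': ['d3', 'd7']}
```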
Once again, if you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.
Retrieval speed will vary by hardware: on a reasonably modern CPU with an SSD, we might get around 13 qps (queries per second), so a single-threaded run should finish in under ten minutes.
We can perform multi-threaded retrieval using the `--threads` and `--batch-size` arguments.
For example, with `--threads 16 --batch-size 64` on a CPU with sufficient cores, the entire run will finish in a couple of minutes.
After the run finishes, we can evaluate the results using the official MS MARCO evaluation script, which has been incorporated into Pyserini:
```bash
$ python -m pyserini.eval.msmarco_passage_eval \
   tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
   runs/run.msmarco-passage.bm25tuned.txt

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
```
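MRR@10 (mean reciprocal rank with a cutoff of 10) rewards how high the *first* relevant passage appears for each query. On toy data the metric boils down to a few lines; a self-contained sketch (an illustrative helper, not the official evaluation script):

```python
def mrr_at_10(run, qrels):
    """run: qid -> rank-ordered list of docids;
    qrels: qid -> set of relevant docids."""
    total = 0.0
    for qid, ranked in run.items():
        for rank, docid in enumerate(ranked[:10], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)

# Toy example: q1 finds its relevant doc at rank 2, q2 at rank 1.
run = {'q1': ['d3', 'd7'], 'q2': ['d1']}
qrels = {'q1': {'d7'}, 'q2': {'d1'}}
print(mrr_at_10(run, qrels))  # (1/2 + 1) / 2 = 0.75
```

A query whose first relevant passage falls outside the top 10 contributes zero, which is why a BM25 MRR@10 around 0.19 still leaves plenty of headroom for rerankers.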
We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10.
The tool needs a different run format, so it's easier to just run retrieval again:
```bash
python -m pyserini.search.lucene \
 --index indexes/lucene-index-msmarco-passage \
 --topics msmarco-passage-dev-subset \
 --output runs/run.msmarco-passage.bm25tuned.trec \
 --hits 1000 \
 --bm25 --k1 0.82 --b 0.68 \
 --threads 4 --batch-size 16
```
The only difference here is that we've removed `--output-format msmarco`.
Then, convert qrels files to the TREC format:
```bash
python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
 --input collections/msmarco-passage/qrels.dev.small.tsv \
 --output collections/msmarco-passage/qrels.dev.small.trec
```
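The TREC qrels format is just `qid 0 docid rel` with whitespace separators, so the conversion amounts to little more than rewriting field separators. A minimal sketch of the same idea (the repo script remains the authoritative version; the helper name is illustrative):

```python
def msmarco_qrels_to_trec(lines):
    """Rewrite tab-separated 'qid<TAB>0<TAB>docid<TAB>rel' rows as
    space-separated TREC qrels: 'qid 0 docid rel'."""
    for line in lines:
        qid, _, docid, rel = line.split()
        yield f'{qid} 0 {docid} {rel}'

# Toy example (illustrative qid/docid, not necessarily a real qrels row):
print(next(msmarco_qrels_to_trec(['300674\t0\t7067032\t1'])))  # 300674 0 7067032 1
```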
Finally, run the `trec_eval` tool, which has been incorporated into Pyserini:
```bash
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap \
   collections/msmarco-passage/qrels.dev.small.trec \
   runs/run.msmarco-passage.bm25tuned.trec

map                   	all	0.1957
recall_1000           	all	0.8573
```
If you want to examine the MRR@10 for `qid` 1048585:
```bash
$ python -m pyserini.eval.trec_eval -q -c -M 10 -m recip_rank \
   collections/msmarco-passage/qrels.dev.small.trec \
   runs/run.msmarco-passage.bm25tuned.trec | grep 1048585

recip_rank            	1048585	1.0000
```
Once again, if you can't make sense of what's going on here, back up and make sure you've first done the BM25 Baselines for MS MARCO Passage Ranking in Anserini.
Otherwise, congratulations! You've done everything that you did in Anserini (in Java), but now in Pyserini (in Python).
There's one final thing we should go over. Because we're in Python now, we get the benefit of having an interactive shell. Thus, we can run Pyserini interactively.
Try the following:
```python
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/lucene-index-msmarco-passage')
searcher.set_bm25(0.82, 0.68)
hits = searcher.search('what is paula deen\'s brother')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')
```
The `LuceneSearcher` class provides search capabilities for BM25.
In the code snippet above, we're issuing the query about Paula Deen's brother (from above).
Note that we're explicitly setting the BM25 parameters, which are not the default parameters.
We get back a list of results (`hits`), which we then iterate through and print out:
```
 1 7187158 18.811600
 2 7187157 18.333401
 3 7187163 17.878799
 4 7546327 16.962099
 5 7187160 16.564699
 6 8227279 16.432501
 7 7617404 16.239901
 8 7187156 16.024900
 9 2298838 15.701500
10 7187155 15.513300
```
You can confirm that the output is the same as the `pyserini.search.lucene` run from above:
```bash
$ grep 1048585 runs/run.msmarco-passage.bm25tuned.trec | head -10
1048585 Q0 7187158 1 18.811600 Anserini
1048585 Q0 7187157 2 18.333401 Anserini
1048585 Q0 7187163 3 17.878799 Anserini
1048585 Q0 7546327 4 16.962099 Anserini
1048585 Q0 7187160 5 16.564699 Anserini
1048585 Q0 8227279 6 16.432501 Anserini
1048585 Q0 7617404 7 16.239901 Anserini
1048585 Q0 7187156 8 16.024900 Anserini
1048585 Q0 2298838 9 15.701500 Anserini
1048585 Q0 7187155 10 15.513300 Anserini
```
To pull up the actual contents of a hit:

```python
hits[0].lucene_document.get('raw')
```
And you should get:

```
'{\n  "id" : "7187158",\n  "contents" : "Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba\'sâ\x80¦ Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba\'sâ\x80¦"\n}'
```
Everything make sense? If so, now you're truly done with this guide and are ready to move on and learn about the relationship between sparse and dense retrieval!
Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-hexadecimal prefix for the link anchor text.
## Reproduction Log
- Results reproduced by @JeffreyCA on 2020-09-14 (commit `49fd7cb`)
- Results reproduced by @jhuang265 on 2020-09-14 (commit `2ed2acc`)
- Results reproduced by @Dahlia-Chehata on 2020-11-11 (commit `8172015`)
- Results reproduced by @rakeeb123 on 2020-12-07 (commit `3bcd4e5`)
- Results reproduced by @jrzhang12 on 2021-01-03 (commit `7caedfc`)
- Results reproduced by @HEC2018 on 2021-01-04 (commit `46a6d47`)
- Results reproduced by @KaiSun314 on 2021-01-08 (commit `aeec31f`)
- Results reproduced by @yemiliey on 2021-01-18 (commit `98f3236`)
- Results reproduced by @larryli1999 on 2021-01-22 (commit `74a87e4`)
- Results reproduced by @ArthurChen189 on 2021-04-08 (commit `7261223`)
- Results reproduced by @printfCalvin on 2021-04-12 (commit `0801f7f`)
- Results reproduced by @saileshnankani on 2021-04-26 (commit `6d48609`)
- Results reproduced by @andrewyguo on 2021-04-30 (commit `ecfed61`)
- Results reproduced by @mayankanand007 on 2021-05-04 (commit `a9d6f66`)
- Results reproduced by @rootofallevii on 2021-05-14 (commit `e764797`)
- Results reproduced by @jpark621 on 2021-06-13 (commit `f614111`)
- Results reproduced by @nimasadri11 on 2021-06-28 (commit `d31e2e6`)
- Results reproduced by @mzzchy on 2021-07-05 (commit `45083f5`)
- Results reproduced by @d1shs0ap on 2021-07-16 (commit `a6b6545`)
- Results reproduced by @apokali on 2021-08-19 (commit `45a2fb4`)
- Results reproduced by @leungjch on 2021-09-12 (commit `c71a69e`)
- Results reproduced by @AlexWang000 on 2021-10-10 (commit `8599c81`)
- Results reproduced by @manveertamber on 2021-12-05 (commit `c280dad`)
- Results reproduced by @lingwei-gu on 2021-12-15 (commit `7249409`)
- Results reproduced by @tyao-t on 2021-12-19 (commit `fc54ed6`)
- Results reproduced by @kevin-wangg on 2022-01-05 (commit `b9fcae7`)
- Results reproduced by @vivianliu0 on 2021-01-06 (commit `937ec63`)
- Results reproduced by @mikhail-tsir on 2022-01-10 (commit `f1084a0`)
- Results reproduced by @AceZhan on 2022-01-14 (commit `68be809`)
- Results reproduced by @jh8liang on 2022-02-06 (commit `e03e068`)
- Results reproduced by @HAKSOAT on 2022-03-10 (commit `7796685`)
- Results reproduced by @jasper-xian on 2022-03-27 (commit `5668edd`)
- Results reproduced by @jx3yang on 2022-04-25 (commit `53333e0`)
- Results reproduced by @alvind1 on 2022-05-04 (commit `244828f`)
- Results reproduced by @Pie31415 on 2022-06-20 (commit `52db3a7`)
- Results reproduced by @aivan6842 on 2022-07-11 (commit `f553d43`)
- Results reproduced by @Jasonwu-0803 on 2022-09-27 (commit `563e4e7`)
- Results reproduced by @limelody on 2022-09-27 (commit `7b53918`)
- Results reproduced by @minconszhang on 2022-11-25 (commit `a3b0631`)
- Results reproduced by @jingliu on 2022-12-08 (commit `f5a73f0`)
- Results reproduced by @farazkh80 on 2022-12-18 (commit `3d8c473`)
- Results reproduced by @Cath on 2023-01-14 (commit `ec37c5e`)
- Results reproduced by @dlrudwo1269 on 2023-03-08 (commit `dfae4bb5`)
- Results reproduced by @aryamancodes on 2023-04-11 (commit `1aea2b0`)
- Results reproduced by @Jocn2020 on 2023-05-01 (commit `ca5a2be`)
- Results reproduced by @zoehahaha on 2023-05-12 (commit `68be809`)
- Results reproduced by @Richard5678 on 2023-06-13 (commit `ccb6df5`)
- Results reproduced by @pratyushpal on 2023-07-14 (commit `760c22a`)
- Results reproduced by @sahel-sh on 2023-07-22 (commit `863ff361`)
- Results reproduced by @yilinjz on 2023-08-25 (commit `b57b583`)
- Results reproduced by @Andrwyl on 2023-08-26 (commit `0b3ec90`)
- Results reproduced by @UShivani3 on 2023-08-29 (commit `d9da49e`)
- Results reproduced by @Edward-J-Xu on 2023-09-04 (commit `8063322`)
- Results reproduced by @mchlp on 2023-09-07 (commit `d8dc5b3`)
- Results reproduced by @lucedes27 on 2023-09-10 (commit `54014af`)
- Results reproduced by @MojTabaa4 on 2023-09-14 (commit `d4a829d`)
- Results reproduced by @Kshama on 2023-09-24 (commit `7d18f4b`)
- Results reproduced by @MelvinMo on 2023-09-24 (commit `7d18f4b`)
- Results reproduced by @ksunisth on 2023-09-27 (commit `142c774`)
- Results reproduced by @maizerrr on 2023-10-01 (commit `bdb9504`)
- Results reproduced by @Stefan824 on 2023-10-04 (commit `4f3da10`)
- Results reproduced by @shayanbali on 2023-10-13 (commit `f889bc4`)
- Results reproduced by @gituserbs on 2023-10-18 (commit `f1d623c`)
- Results reproduced by @shakibaam on 2023-11-04 (commit `01889cc`)
- Results reproduced by @gitHubAndyLee2020 on 2023-11-05 (commit `01889cc`)
- Results reproduced by @Melissa1412 on 2023-11-05 (commit `acd969f`)
- Results reproduced by @aliranjbari on 2023-11-08 (commit `12cbb11`)
- Results reproduced by @salinaria on 2023-11-11 (commit `086e16b`)
- Results reproduced by @oscarbelda86 on 2023-11-13 (commit `086e16b`)
- Results reproduced by @Seun-Ajayi on 2023-11-13 (commit `086e16b`)
- Results reproduced by @AndreSlavescu on 2023-11-28 (commit `1219cdb`)
- Results reproduced by @tudou0002 on 2023-11-28 (commit `723e06c`)
- Results reproduced by @golnooshasefi on 2023-11-28 (commit `1219cdb`)
- Results reproduced by @alimt1992 on 2023-11-29 (commit `e6700f6`)
- Results reproduced by @sueszli on 2023-12-01 (commit `170e271`)
- Results reproduced by @kdricci on 2023-12-01 (commit `a2049c4`)
- Results reproduced by @ljk423 on 2023-12-04 (commit `35002ad`)
- Results reproduced by @saharsamr on 2023-12-14 (commit `039c137`)
- Results reproduced by @Panizghi on 2023-12-17 (commit `0f5db95`)
- Results reproduced by @AreelKhan on 2023-12-22 (commit `f75adca`)
- Results reproduced by @wu-ming233 on 2023-12-31 (commit `38a571f`)
- Results reproduced by @Yuan-Hou on 2024-01-02 (commit `38a571f`)
- Results reproduced by @himasheth on 2024-01-10 (commit `a6ed27e`)
- Results reproduced by @Tanngent on 2024-01-13 (commit `57a00cf`)
- Results reproduced by @BeginningGradeMaker on 2024-01-15 (commit `d4ea011`)
- Results reproduced by @ia03 on 2024-01-18 (commit `05ee8ef`)
- Results reproduced by @AlexStan0 on 2024-01-20 (commit `833ee19`)
- Results reproduced by @charlie-liuu on 2024-01-23 (commit `87a120e`)
- Results reproduced by @dannychn11 on 2024-01-28 (commit `2f7702f`)
- Results reproduced by @ru5h16h on 2024-02-19 (commit `758eaaa`)
- Results reproduced by @ASChampOmega on 2024-02-23 (commit `442e7e1`)