Update commands for MS MARCO V2 reproduction - uniCOIL noexp and uniCOIL + TILDE (#891)
castorini · Nov 29, 2021 · d101ffa
1 parent 00a7d66
Showing 2 changed files with 49 additions and 29 deletions.
diff --git a/docs/experiments-msmarco-v2-unicoil-tilde-expansion.md b/docs/experiments-msmarco-v2-unicoil-tilde-expansion.md
@@ -26,31 +26,35 @@ First, we need to download and extract the MS MARCO V2 passage dataset with uniC
 
 ```bash
 # Alternate mirrors of the same data, pick one:
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-v2-passage-unicoil-tilde-expansion-b8.tar -P collections/
-wget https://vault.cs.uwaterloo.ca/s/tb3m3J45HFJNAbq/download -O collections/msmarco-v2-passage-unicoil-tilde-expansion-b8.tar
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-v2-unicoil-tilde-expansion-b8.tar -P collections/
+wget https://vault.cs.uwaterloo.ca/s/tb3m3J45HFJNAbq/download -O collections/msmarco-passage-v2-unicoil-tilde-expansion-b8.tar
 
-tar -xvf collections/msmarco-v2-passage-unicoil-tilde-expansion-b8.tar -C collections/
+tar -xvf collections/msmarco-passage-v2-unicoil-tilde-expansion-b8.tar -C collections/
 ```
 
-To confirm, `msmarco-v2-passage-unicoil-tilde-expansion-b8.tar` is around 58 GB and should have an MD5 checksum of `acc4c9bc3506c3a496bf3e009fa6e50b`.
+To confirm, `msmarco-passage-v2-unicoil-tilde-expansion-b8.tar` is around 58 GB and should have an MD5 checksum of `acc4c9bc3506c3a496bf3e009fa6e50b`.
 
 ## Indexing
 
 We can now index these docs:
 
 ```
-python -m pyserini.index -collection JsonVectorCollection \
- -input collections/msmarco-v2-passage-unicoil-tilde-expansion-b8/ \
- -index indexes/lucene-index.msmarco-v2-passage-unicoil-tilde-expansion-b8 \
- -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
- -threads 12
+python -m pyserini.index --collection JsonVectorCollection \
+                         --input collections/msmarco-passage-v2-unicoil-tilde-expansion-b8/ \
+                         --index indexes/lucene-index.msmarco-v2-passage-unicoil-tilde-expansion-b8 \
+                         --generator DefaultLuceneDocumentGenerator \
+                         --threads 12 \
+                         --impact \
+                         --pretokenized
 ```
 
 The important indexing options to note here are `-impact -pretokenized`: the first tells Pyserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
 
 Upon completion, we should have an index with 138,364,198 documents.
 The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 5 hours.
 
+<!-- This is deprecated because we have pre-built indexes. Retaining for historic reasons.
+
 If you want to save time and skip the indexing step, download the prebuilt index directly:
 
 ```bash
@@ -64,6 +68,8 @@ tar -xzvf indexes/lucene-index.msmarco-v2-passage-unicoil-tilde-expansion-b8.tar
 To confirm, `lucene-index.msmarco-v2-passage-unicoil-tilde-expansion-b8.tar.gz` is around 30 GB and should have an MD5 checksum of `0f9b1f90751d49dd3a66be54dd0b4f82`.
 This pre-built index was created with the above command, but with the addition of the `-optimize` option to merge index segments.
 
+-->
+
 ## Retrieval
 
 > If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-passage-unicoil-tilde` in the command below.

diff --git a/docs/experiments-msmarco-v2-unicoil.md b/docs/experiments-msmarco-v2-unicoil.md
@@ -14,7 +14,7 @@ We are working on figuring out ways to distribute the indexes.
 ## Zero-Shot uniCOIL
 
 For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model and we did not have time to finish doc2query-T5 expansions.
-Thus, we applied uniCOIL without expansions in a zero-shot manner using the model trained on the MS MARCO (V1) passage corpus, described [here](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil.md).
+Thus, we applied uniCOIL without expansions in a zero-shot manner using the model trained on the MS MARCO (V1) passage corpus, described [here](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-unicoil.md).
 
 Specifically, we applied inference over the MS MARCO V2 [passage corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#passage-collection) and [segmented document corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#document-collection-segmented) to obtain the term weights.
 
@@ -28,18 +28,25 @@ As an alternative, we also make available pre-built indexes (in which case the i
 Download the sparse representation of the corpus generated by uniCOIL:
 
 ```bash
-wget https://vault.cs.uwaterloo.ca/s/a29gEzyXrK5NG4o/download -O collections/msmarco-v2-passage-unicoil-noexp-0shot-b8.tar
-tar -xvf collections/msmarco-v2-passage-unicoil-noexp-0shot-b8.tar -C collections/
+# Alternate mirrors of the same data, pick one:
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -P collections/
+wget https://vault.cs.uwaterloo.ca/s/a29gEzyXrK5NG4o/download -O collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar
+
+tar -xvf collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -C collections/
 ```
 
+To confirm, `msmarco-passage-v2-unicoil-noexp-0shot-b8.tar` is 24 GB and has an MD5 checksum of `fcf21991103197a7e8823b0e2045aca1`.
+
 Index the sparse vectors:
 
 ```bash
-python -m pyserini.index -collection JsonVectorCollection \
- -input collections/msmarco-v2-passage-unicoil-noexp-0shot-b8 \
- -index indexes/lucene.unicoil-noexp.0shot.msmarco-v2-passage \
- -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
- -threads 32
+python -m pyserini.index --collection JsonVectorCollection \
+                         --input collections/msmarco-passage-v2-unicoil-noexp-0shot-b8 \
+                         --index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
+                         --generator DefaultLuceneDocumentGenerator \
+                         --threads 32 \
+                         --impact \
+                         --pretokenized
 ```
 
 > If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-passage-unicoil-noexp-0shot` in the command below.
@@ -49,7 +56,7 @@ Sparse retrieval with uniCOIL:
 ```bash
 python -m pyserini.search --topics msmarco-v2-passage-dev \
                           --encoder castorini/unicoil-noexp-msmarco-passage \
-                          --index indexes/lucene.unicoil-noexp.0shot.msmarco-v2-passage  \
+                          --index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
                           --output runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt \
                           --impact \
                           --hits 1000 \
@@ -85,18 +92,25 @@ As an alternative, we also make available pre-built indexes (in which case the i
 Download the sparse representation of the corpus generated by uniCOIL:
 
 ```bash
-wget https://vault.cs.uwaterloo.ca/s/x5cEaM3rXnTaE7j/download -O collections/msmarco-v2-doc-seg-unicoil-noexp-0shot-b8.tar
-tar -xvf collections/msmarco-v2-doc-seg-unicoil-noexp-0shot-b8.tar -C collections/
+# Alternate mirrors of the same data, pick one:
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -P collections/
+wget https://vault.cs.uwaterloo.ca/s/x5cEaM3rXnTaE7j/download -O collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar
+
+tar -xvf collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -C collections/
 ```
 
+To confirm, `msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar` is 54 GB and has an MD5 checksum of `af54061ab5c2ce6cf90a1e60fd92924c`.
+
 Index the sparse vectors:
 
 ```bash
-python -m pyserini.index -collection JsonVectorCollection \
- -input collections/msmarco-v2-doc-seg-unicoil-noexp-0shot-b8 \
- -index indexes/lucene.unicoil-noexp.0shot.msmarco-v2-doc-segmented \
- -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
- -threads 32
+python -m pyserini.index --collection JsonVectorCollection \
+                         --input collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8 \
+                         --index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
+                         --generator DefaultLuceneDocumentGenerator \
+                         --threads 32 \
+                         --impact \
+                         --pretokenized
 ```
 
 > If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-doc-per-passage-unicoil-noexp-0shot` in the command below.
@@ -106,8 +120,8 @@ Sparse retrieval with uniCOIL:
 ```bash
 python -m pyserini.search --topics msmarco-v2-doc-dev \
                           --encoder castorini/unicoil-noexp-msmarco-passage \
-                          --index indexes/lucene.unicoil-noexp.0shot.msmarco-v2-doc-segmented  \
-                          --output runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt \
+                          --index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
+                          --output runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt \
                           --impact \
                           --hits 10000 \
                           --batch 144 \
@@ -122,12 +136,12 @@ For the document corpus, since we are searching the segmented version, we retrie
 To evaluate, using `trec_eval`:
 
 ```bash
-$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt
+$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt
 Results:
 map                   	all	0.2012
 recip_rank            	all	0.2032
 
-$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt
+$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt
 Results:
 recall_100            	all	0.7190
 recall_1000           	all	0.8813
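
Note on the checksums added in this commit: the updated docs state MD5 values for three tarballs. A minimal sketch for checking them after download follows, assuming the files were saved under `collections/` with the names used in the `wget` commands above. The helper names `md5_of_file` and `check_downloads` are hypothetical and not part of Pyserini.

```python
import hashlib
from pathlib import Path

# Expected MD5 checksums, copied from the documentation text in the diff above.
EXPECTED_MD5 = {
    "msmarco-passage-v2-unicoil-tilde-expansion-b8.tar": "acc4c9bc3506c3a496bf3e009fa6e50b",
    "msmarco-passage-v2-unicoil-noexp-0shot-b8.tar": "fcf21991103197a7e8823b0e2045aca1",
    "msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar": "af54061ab5c2ce6cf90a1e60fd92924c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so multi-GB tarballs are never loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="collections"):
    """Return {filename: True | False | None}; None means the file is not present."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, ok in check_downloads().items():
        print(f"{name}: {labels[ok]}")
```

Run from the repository root, this reports each tarball as `OK`, `MISMATCH`, or `not found`. Hashing tens of gigabytes takes a while, so it may be worth checking only the archive that was actually downloaded.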