Tweak uniCOIL + TILDE repro docs (castorini#754)
lintool authored and MXueguang committed Nov 5, 2021
1 parent c2762ca commit 601c66b
Showing 3 changed files with 25 additions and 35 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -393,7 +393,7 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe
+ Reproducing [BM25 baselines on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2.md)
+ Reproducing [DeepImpact experiments for MS MARCO (V1) Passage Ranking](docs/experiments-deepimpact.md)
+ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil.md)
+ Reproducing [uniCOIL experiments with TILDE document expansion for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
+ Reproducing [uniCOIL experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-unicoil.md)

### Dense Retrieval
52 changes: 21 additions & 31 deletions docs/experiments-unicoil-tilde-expansion.md
@@ -1,20 +1,19 @@
# Pyserini: uniCOIL for MS MARCO Passage Ranking with TILDE Passage Expansion
# Pyserini: uniCOIL (w/ TILDE) for MS MARCO Passage Ranking

This page describes how to reproduce the uniCOIL experiments in the following papers:

> Jimmy Lin and Xueguang Ma. [A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_.
This page describes how to reproduce experiments using uniCOIL with TILDE document expansion, as described in the following paper:

> Shengyao Zhuang and Guido Zuccon. [Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion.](https://arxiv.org/pdf/2108.08513) _arXiv:2108.08513_.

In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
The original uniCOIL model is described here:

> Jimmy Lin and Xueguang Ma. [A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_.
Instead of using doc2query-T5 to perform document expansion, which is slow and expensive, this guide uses the TILDE model to expand the corpus, resulting in a faster and cheaper document expansion process. For details on how to use TILDE to expand documents, please see [this guide](https://github.com/ielab/TILDE).

Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md) based on Java.
Here, we can get _exactly_ the same results from Python.
In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

## Data Prep

@@ -25,12 +24,12 @@ First, we need to download and extract the MS MARCO passage dataset with uniCOIL
wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-tilde-expansion-b8.tar -P collections/

# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/Rm6fknT432YdBts/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar
wget https://vault.cs.uwaterloo.ca/s/6LECmLdiaBoPwrL/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar

tar -xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/
```

To confirm, `msmarco-passage-unicoil-tilde-expansion-b8.tar` should have MD5 checksum of `a506ef9315c933f9d2040ce3e7385cff`.
To confirm, `msmarco-passage-unicoil-tilde-expansion-b8.tar` should have MD5 checksum of `be0a786033140ebb7a984a3e155c19ae`.
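
One quick way to check (a minimal sketch, assuming GNU coreutils' `md5sum`; on macOS, `md5 -q` prints the same digest):

```bash
# Compare the printed digest against the checksum listed above.
md5sum collections/msmarco-passage-unicoil-tilde-expansion-b8.tar
```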


## Indexing
@@ -48,40 +47,29 @@ python -m pyserini.index -collection JsonVectorCollection \
The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (which is the default), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
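
To make the `-pretokenized` point concrete: each line of the corpus is a JSON record in Anserini's JsonVectorCollection layout, whose `vector` field maps the (already TILDE-expanded) tokens to quantized impact weights. A quick peek (purely illustrative; the shard file name below is a guess and may differ in the extracted directory):

```bash
# Print the first record of one corpus shard; the "vector" field holds the
# uniCOIL term weights, so no neural inference happens at indexing time.
head -1 collections/msmarco-passage-unicoil-tilde-expansion-b8/docs00.json
```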

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around ten minutes.

If you want to save time and skip the indexing step, download the prebuilt index directly:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8.tar.gz -P indexes/

# Alternate mirror
# wget https://vault.cs.uwaterloo.ca/s/bKbHmN6CjRtmoJq/download -O indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8.tar.gz

tar -xvf indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8.tar.gz -C indexes/
```
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around half an hour.
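
As a sanity check on the document count, Pyserini's `IndexReader` can report index statistics (a sketch; it assumes the index was written to the path used in this guide):

```bash
# The 'documents' entry of the printed stats should be 8841823.
python -c "from pyserini.index import IndexReader; print(IndexReader('indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8').stats())"
```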

## Retrieval

We can now run retrieval:

```bash
python -m pyserini.search --topics msmarco-passage-dev-subset \
--index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
--encoder ielab/unicoil-tilde200-msmarco-passage \
--output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
--impact \
--hits 1000 --batch 32 --threads 12 \
--output-format msmarco
--index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
--encoder ielab/unicoil-tilde200-msmarco-passage \
--output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
--impact \
--hits 1000 --batch 32 --threads 12 \
--output-format msmarco
```

Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 15 min.
Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 20 minutes.
Note that the important option here is `--impact`, which specifies impact scoring.

The output is in MS MARCO format, so we can evaluate it directly:

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
```

The results should be as follows:
@@ -94,3 +82,5 @@ QueriesRanked: 6980
```
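
If TREC-style metrics (e.g., MAP and recall@1000) are also of interest, the usual Pyserini route is to convert the MS MARCO run into a TREC run and score it with the bundled `trec_eval` wrapper; a sketch, assuming the run file produced above:

```bash
# Convert the MS MARCO-format run into a TREC-format run.
python -m pyserini.eval.convert_msmarco_run_to_trec_run \
    --input runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
    --output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.trec

# Score the TREC run against the MS MARCO passage dev-subset qrels.
python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap \
    msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.trec
```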

## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@lintool](https://github.com/lintool) on 2021-09-08 (commit [`f026b87`](https://github.com/castorini/pyserini/commit/f026b871e0e581743fcb09d1eb309e9698767a8d))
6 changes: 3 additions & 3 deletions docs/experiments-unicoil.md
@@ -1,4 +1,4 @@
# Pyserini: uniCOIL for MS MARCO Passage Ranking
# Pyserini: uniCOIL (w/ doc2query-T5) for MS MARCO Passage Ranking

This page describes how to reproduce the uniCOIL experiments in the following paper:

@@ -43,7 +43,7 @@ python -m pyserini.index -collection JsonVectorCollection \
The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (which is the default), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around ten minutes.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 20 minutes.


## Retrieval
@@ -71,7 +71,7 @@ $ python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subs
--output-format msmarco
```

Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 15 min.
Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 15 minutes.
Note that the important option here is `--impact`, which specifies impact scoring.

The output is in MS MARCO format, so we can evaluate it directly:
