diff --git a/docs/regressions-dl19-doc-segmented-unicoil-noexp.md b/docs/regressions-dl19-doc-segmented-unicoil-noexp.md
index b7ef094a19..687e492d69 100644
--- a/docs/regressions-dl19-doc-segmented-unicoil-noexp.md
+++ b/docs/regressions-dl19-doc-segmented-unicoil-noexp.md
@@ -1,6 +1,6 @@
 # Anserini Regressions: TREC 2019 Deep Learning Track (Document)
 
-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [TREC 2019 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2019.html).
 The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
 Download the corpus and unpack into `collections/`:
 
 ```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
 
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
 ```
 
-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.
 
 With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
 
@@ -59,7 +59,7 @@ target/appassembler/bin/IndexCollection \
   >& logs/log.msmarco-doc-segmented-unicoil-noexp &
 ```
 
-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.
 The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
 
 Upon completion, we should have an index with 20,545,677 documents.
@@ -98,22 +98,22 @@ With the above commands, you should be able to reproduce the following results:
 
 | AP@100 | uniCOIL (no expansions)|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.2621 |
+| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.2665 |
 
 
 | nDCG@10 | uniCOIL (no expansions)|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.6118 |
+| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.6349 |
 
 
 | R@100 | uniCOIL (no expansions)|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.3956 |
+| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.3943 |
 
 
 | R@1000 | uniCOIL (no expansions)|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.6382 |
+| [DL19 (Doc)](https://trec.nist.gov/data/deep2019.html) | 0.6391 |
 
 Note that in the official evaluation for document ranking, all runs were truncated to top-100 hits per query (whereas all top-1000 hits per query were retained for passage ranking).
 Thus, average precision is computed to depth 100 (i.e., AP@100); nDCG@10 remains unaffected.
diff --git a/docs/regressions-dl19-doc-segmented-unicoil.md b/docs/regressions-dl19-doc-segmented-unicoil.md
index 6625ba2718..331ef9ba68 100644
--- a/docs/regressions-dl19-doc-segmented-unicoil.md
+++ b/docs/regressions-dl19-doc-segmented-unicoil.md
@@ -1,6 +1,6 @@
 # Anserini Regressions: TREC 2019 Deep Learning Track (Document)
 
-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [TREC 2019 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2019.html).
 The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
diff --git a/docs/regressions-dl20-doc-segmented-unicoil-noexp.md b/docs/regressions-dl20-doc-segmented-unicoil-noexp.md
index 0ba14b7a4e..ca1e9480cf 100644
--- a/docs/regressions-dl20-doc-segmented-unicoil-noexp.md
+++ b/docs/regressions-dl20-doc-segmented-unicoil-noexp.md
@@ -1,6 +1,6 @@
 # Anserini Regressions: TREC 2020 Deep Learning Track (Document)
 
-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [TREC 2020 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2020.html).
 The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
 Download the corpus and unpack into `collections/`:
 
 ```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
 
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
 ```
 
-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.
 
 With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
 
@@ -59,7 +59,7 @@ target/appassembler/bin/IndexCollection \
   >& logs/log.msmarco-doc-segmented-unicoil-noexp &
 ```
 
-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.
 The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
 
 Upon completion, we should have an index with 20,545,677 documents.
@@ -98,22 +98,22 @@ With the above commands, you should be able to reproduce the following results:
 
 | AP@100 | uniCOIL w/ doc2query-T5 expansion|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.3586 |
+| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.3698 |
 
 
 | nDCG@10 | uniCOIL w/ doc2query-T5 expansion|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.5632 |
+| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.5893 |
 
 
 | R@100 | uniCOIL w/ doc2query-T5 expansion|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.5932 |
+| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.5872 |
 
 
 | R@1000 | uniCOIL w/ doc2query-T5 expansion|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.7562 |
+| [DL20 (Doc)](https://trec.nist.gov/data/deep2020.html) | 0.7623 |
 
 Note that in the official evaluation for document ranking, all runs were truncated to top-100 hits per query (whereas all top-1000 hits per query were retained for passage ranking).
 Thus, average precision is computed to depth 100 (i.e., AP@100); nDCG@10 remains unaffected.
diff --git a/docs/regressions-dl20-doc-segmented-unicoil.md b/docs/regressions-dl20-doc-segmented-unicoil.md
index b8df31d72a..8193f80d2a 100644
--- a/docs/regressions-dl20-doc-segmented-unicoil.md
+++ b/docs/regressions-dl20-doc-segmented-unicoil.md
@@ -1,6 +1,6 @@
 # Anserini Regressions: TREC 2020 Deep Learning Track (Document)
 
-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [TREC 2020 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2020.html).
 The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
diff --git a/docs/regressions-msmarco-doc-segmented-unicoil-noexp.md b/docs/regressions-msmarco-doc-segmented-unicoil-noexp.md
index 29cb6270be..456020e8d2 100644
--- a/docs/regressions-msmarco-doc-segmented-unicoil-noexp.md
+++ b/docs/regressions-msmarco-doc-segmented-unicoil-noexp.md
@@ -1,6 +1,6 @@
 # Anserini Regressions: MS MARCO Document Ranking
 
-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking).
 The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
 Download the corpus and unpack into `collections/`:
 
 ```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
 
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
 ```
 
-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.
 
 With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
 
@@ -59,7 +59,7 @@ target/appassembler/bin/IndexCollection \
   >& logs/log.msmarco-doc-segmented-unicoil-noexp &
 ```
 
-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.
 The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
 
 Upon completion, we should have an index with 20,545,677 documents.
@@ -97,22 +97,22 @@ With the above commands, you should be able to reproduce the following results:
 
 | AP@1000 | uniCOIL (no expansions)|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.3200 |
+| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.3413 |
 
 
 | RR@100 | uniCOIL (no expansions)|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.3195 |
+| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.3409 |
 
 
 | R@100 | uniCOIL (no expansions)|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.8398 |
+| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.8639 |
 
 
 | R@1000 | uniCOIL (no expansions)|
 |:-------------------------------------------------------------------------------------------------------------|-----------|
-| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.9286 |
+| [MS MARCO Doc: Dev](https://github.com/microsoft/MSMARCO-Document-Ranking) | 0.9420 |
 
 This model corresponds to the run named "uniCOIL-d2q" on the official MS MARCO Document Ranking Leaderboard, submitted 2021/09/16.
 The following command generates a comparable run:
diff --git a/docs/regressions-msmarco-doc-segmented-unicoil.md b/docs/regressions-msmarco-doc-segmented-unicoil.md
index ab3e7e92fc..5a864cd7ea 100644
--- a/docs/regressions-msmarco-doc-segmented-unicoil.md
+++ b/docs/regressions-msmarco-doc-segmented-unicoil.md
@@ -1,6 +1,6 @@
 # Anserini Regressions: MS MARCO Document Ranking
 
-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking).
 The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
diff --git a/src/main/resources/docgen/templates/dl19-doc-segmented-unicoil-noexp.template b/src/main/resources/docgen/templates/dl19-doc-segmented-unicoil-noexp.template
index ebb595979f..9e5404e8e1 100644
--- a/src/main/resources/docgen/templates/dl19-doc-segmented-unicoil-noexp.template
+++ b/src/main/resources/docgen/templates/dl19-doc-segmented-unicoil-noexp.template
@@ -1,6 +1,6 @@
 # Anserini Regressions: TREC 2019 Deep Learning Track (Document)
 
-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [TREC 2019 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2019.html).
 The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
 Download the corpus and unpack into `collections/`:
 
 ```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
 
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
 ```
 
-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.
 
 With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
 
@@ -53,7 +53,7 @@ Sample indexing command:
 ${index_cmds}
 ```
 
-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.
 The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
 
 Upon completion, we should have an index with 20,545,677 documents.
diff --git a/src/main/resources/docgen/templates/dl19-doc-segmented-unicoil.template b/src/main/resources/docgen/templates/dl19-doc-segmented-unicoil.template
index b69925e016..470d0b30d1 100644
--- a/src/main/resources/docgen/templates/dl19-doc-segmented-unicoil.template
+++ b/src/main/resources/docgen/templates/dl19-doc-segmented-unicoil.template
@@ -1,6 +1,6 @@
 # Anserini Regressions: TREC 2019 Deep Learning Track (Document)
 
-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [TREC 2019 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2019.html).
 The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
diff --git a/src/main/resources/docgen/templates/dl20-doc-segmented-unicoil-noexp.template b/src/main/resources/docgen/templates/dl20-doc-segmented-unicoil-noexp.template
index 47b35e8322..601190703c 100644
--- a/src/main/resources/docgen/templates/dl20-doc-segmented-unicoil-noexp.template
+++ b/src/main/resources/docgen/templates/dl20-doc-segmented-unicoil-noexp.template
@@ -1,6 +1,6 @@
 # Anserini Regressions: TREC 2020 Deep Learning Track (Document)
 
-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [TREC 2020 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2020.html).
 The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
 Download the corpus and unpack into `collections/`:
 
 ```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
 
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
 ```
 
-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.
 
 With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
 
@@ -53,7 +53,7 @@ Sample indexing command:
 ${index_cmds}
 ```
 
-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.
 The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
 
 Upon completion, we should have an index with 20,545,677 documents.
diff --git a/src/main/resources/docgen/templates/dl20-doc-segmented-unicoil.template b/src/main/resources/docgen/templates/dl20-doc-segmented-unicoil.template
index 9596f73c4c..d7b051ab2f 100644
--- a/src/main/resources/docgen/templates/dl20-doc-segmented-unicoil.template
+++ b/src/main/resources/docgen/templates/dl20-doc-segmented-unicoil.template
@@ -1,6 +1,6 @@
 # Anserini Regressions: TREC 2020 Deep Learning Track (Document)
 
-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [TREC 2020 Deep Learning Track document ranking task](https://trec.nist.gov/data/deep2020.html).
 The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
diff --git a/src/main/resources/docgen/templates/msmarco-doc-segmented-unicoil-noexp.template b/src/main/resources/docgen/templates/msmarco-doc-segmented-unicoil-noexp.template
index 7ed65065cd..656489c9cc 100644
--- a/src/main/resources/docgen/templates/msmarco-doc-segmented-unicoil-noexp.template
+++ b/src/main/resources/docgen/templates/msmarco-doc-segmented-unicoil-noexp.template
@@ -1,6 +1,6 @@
 # Anserini Regressions: MS MARCO Document Ranking
 
-**Model**: uniCOIL (without any expansions) on segmented documents
+**Model**: uniCOIL (without any expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (without any expansions) on the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking).
 The uniCOIL model is described in the following paper:
@@ -22,19 +22,19 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
 Download the corpus and unpack into `collections/`:
 
 ```
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar -P collections/
+wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar -P collections/
 
-tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
+tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
 ```
 
-To confirm, `msmarco-doc-segmented-unicoil.tar` is 18 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.
+To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.
 
 With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
 
@@ -53,7 +53,7 @@ Sample indexing command:
 ${index_cmds}
 ```
 
-The directory `/path/to/msmarco-doc-segmented-unicoil/` should point to the corpus downloaded above.
+The directory `/path/to/msmarco-doc-segmented-unicoil-noexp/` should point to the corpus downloaded above.
 The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.
 
 Upon completion, we should have an index with 20,545,677 documents.
diff --git a/src/main/resources/docgen/templates/msmarco-doc-segmented-unicoil.template b/src/main/resources/docgen/templates/msmarco-doc-segmented-unicoil.template
index d61ec9622e..2129a448cb 100644
--- a/src/main/resources/docgen/templates/msmarco-doc-segmented-unicoil.template
+++ b/src/main/resources/docgen/templates/msmarco-doc-segmented-unicoil.template
@@ -1,6 +1,6 @@
 # Anserini Regressions: MS MARCO Document Ranking
 
-**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents
+**Model**: uniCOIL (with doc2query-T5 expansions) on segmented documents (title/segment encoding)
 
 This page describes regression experiments, integrated into Anserini's regression testing framework, using uniCOIL (with doc2query-T5 expansions) on the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking).
 The uniCOIL model is described in the following paper:
@@ -22,7 +22,7 @@ python src/main/python/run_regression.py --index --verify --search --regression
 
 ## Corpus
 
-We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
diff --git a/src/main/resources/regression/dl19-doc-segmented-unicoil-noexp.yaml b/src/main/resources/regression/dl19-doc-segmented-unicoil-noexp.yaml
index 82bc67a827..017ce91548 100644
--- a/src/main/resources/regression/dl19-doc-segmented-unicoil-noexp.yaml
+++ b/src/main/resources/regression/dl19-doc-segmented-unicoil-noexp.yaml
@@ -10,7 +10,7 @@ index_options: -impact -pretokenized
 index_stats:
   documents: 20545677
   documents (non-empty): 20545677
-  total terms: 152325913715
+  total terms: 152323732876
 
 metrics:
   - metric: AP@100
@@ -57,10 +57,10 @@ models:
     params: -impact -pretokenized -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
     results:
       AP@100:
-        - 0.2621
+        - 0.2665
       nDCG@10:
-        - 0.6118
+        - 0.6349
       R@100:
-        - 0.3956
+        - 0.3943
       R@1000:
-        - 0.6382
+        - 0.6391
diff --git a/src/main/resources/regression/dl20-doc-segmented-unicoil-noexp.yaml b/src/main/resources/regression/dl20-doc-segmented-unicoil-noexp.yaml
index bcf8babdf7..adf0aeacf1 100644
--- a/src/main/resources/regression/dl20-doc-segmented-unicoil-noexp.yaml
+++ b/src/main/resources/regression/dl20-doc-segmented-unicoil-noexp.yaml
@@ -10,7 +10,7 @@ index_options: -impact -pretokenized
 index_stats:
   documents: 20545677
   documents (non-empty): 20545677
-  total terms: 152325913715
+  total terms: 152323732876
 
 metrics:
   - metric: AP@100
@@ -57,10 +57,10 @@ models:
     params: -impact -pretokenized -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
     results:
       AP@100:
-        - 0.3586
+        - 0.3698
       nDCG@10:
-        - 0.5632
+        - 0.5893
       R@100:
-        - 0.5932
+        - 0.5872
       R@1000:
-        - 0.7562
+        - 0.7623
diff --git a/src/main/resources/regression/msmarco-doc-segmented-unicoil-noexp.yaml b/src/main/resources/regression/msmarco-doc-segmented-unicoil-noexp.yaml
index ca83ed9e6d..47f8989d78 100644
--- a/src/main/resources/regression/msmarco-doc-segmented-unicoil-noexp.yaml
+++ b/src/main/resources/regression/msmarco-doc-segmented-unicoil-noexp.yaml
@@ -10,7 +10,7 @@ index_options: -impact -pretokenized
 index_stats:
   documents: 20545677
   documents (non-empty): 20545677
-  total terms: 152325913715
+  total terms: 152323732876
 
 metrics:
   - metric: AP@1000
@@ -57,11 +57,10 @@ models:
     params: -impact -pretokenized -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
     results:
       AP@1000:
-        - 0.3200
+        - 0.3413
       RR@100:
-        - 0.3195
+        - 0.3409
       R@100:
-        - 0.8398
+        - 0.8639
       R@1000:
-        - 0.9286
-
+        - 0.9420
\ No newline at end of file
diff --git a/src/main/resources/regression/msmarco-doc-segmented-unicoil.yaml b/src/main/resources/regression/msmarco-doc-segmented-unicoil.yaml
index 83c0e2d10b..a09adedab5 100644
--- a/src/main/resources/regression/msmarco-doc-segmented-unicoil.yaml
+++ b/src/main/resources/regression/msmarco-doc-segmented-unicoil.yaml
@@ -64,4 +64,3 @@ models:
         - 0.8858
       R@1000:
         - 0.9546
-
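
The updated docs in this diff state that `msmarco-doc-segmented-unicoil-noexp.tar` has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`. Below is a minimal Python sketch of that check, assuming the tarball was downloaded into `collections/` as in the `wget` command above. The helper names `md5_of_file` and `verify_corpus` are hypothetical and not part of Anserini.

```python
import hashlib
from pathlib import Path

# Checksum quoted in the updated docs for the no-expansion corpus tarball.
EXPECTED_MD5 = "11b226e1cacd9c8ae0a660fd14cdd710"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_corpus(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location, following the `wget ... -P collections/` command in the docs.
    tarball = Path("collections/msmarco-doc-segmented-unicoil-noexp.tar")
    if tarball.exists():
        print("checksum OK" if verify_corpus(tarball) else "checksum MISMATCH")
    else:
        print(f"{tarball} not found; download the corpus first")
```

Reading in chunks matters here because the docs give the tarball size as 11 GB, so loading it into memory at once would be impractical on many machines.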