Cohere MS MARCO v1 passage 2CR (#2357)

Adds a 2CR regression doc for MS MARCO passage, embedded with Cohere embed-english-v3
castorini · Feb 9, 2024 · f86a65f
1 parent fd8655f
commit f86a65f

Showing 4 changed files with 212 additions and 1 deletion.
diff --git a/docs/regressions/regressions-msmarco-passage-cohere-embed-english-v3.md b/docs/regressions/regressions-msmarco-passage-cohere-embed-english-v3.md
@@ -0,0 +1,84 @@
+# Anserini Regressions: MS MARCO Passage Ranking
+
+**Model**: [Cohere embed-english-v3.0](https://docs.cohere.com/reference/embed) with HNSW indexes (using pre-encoded queries)
+
+This page describes regression experiments, integrated into Anserini's regression testing framework, using the [Cohere embed-english-v3.0](https://docs.cohere.com/reference/embed) model on the [MS MARCO passage ranking task](https://github.com/microsoft/MSMARCO-Passage-Ranking).
+
+In these experiments, we are using pre-encoded queries (i.e., cached results of query encoding).
+
+The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml).
+Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and then run `bin/build.sh` to rebuild the documentation.
+
+## Corpus Download
+
+Download the corpus and unpack into `collections/`:
+
+```bash
+wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-cohere-embed-english-v3.tar -P collections/
+tar xvf collections/msmarco-passage-cohere-embed-english-v3.tar -C collections/
+```
+
+To confirm, `msmarco-passage-cohere-embed-english-v3.tar` is 38 GB and has MD5 checksum `6b7d9795806891b227378f6c290464a9`.
+
+## Indexing
+
+Sample indexing command, building HNSW indexes:
+
+```bash
+target/appassembler/bin/IndexHnswDenseVectors \
+  -collection JsonDenseVectorCollection \
+  -input /path/to/msmarco-passage-cohere-embed-english-v3 \
+  -generator HnswDenseVectorDocumentGenerator \
+  -index indexes/lucene-hnsw.msmarco-passage-cohere-embed-english-v3/ \
+  -threads 16 -M 16 -efC 100 \
+  >& logs/log.msmarco-passage-cohere-embed-english-v3 &
+```
+
+The path `/path/to/msmarco-passage-cohere-embed-english-v3/` should point to the corpus downloaded above.
+Upon completion, we should have an index with 8,841,823 documents.
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+The regression experiments here evaluate on the 6980 dev set questions; see [this page](../../docs/experiments-msmarco-passage.md) for more details.
+
+After indexing has completed, you should be able to perform retrieval as follows using HNSW indexes:
+
+```bash
+target/appassembler/bin/SearchHnswDenseVectors \
+  -index indexes/lucene-hnsw.msmarco-passage-cohere-embed-english-v3/ \
+  -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.gz \
+  -topicReader JsonIntVector \
+  -output runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt \
+  -generator VectorQueryGenerator -topicField vector -threads 16 -hits 1000 -efSearch 1000 &
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```bash
+target/appassembler/bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt
+target/appassembler/bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt
+target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt
+target/appassembler/bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.cohere-embed-english-v3.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.txt
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+| **nDCG@10**                                                                                                  | **cohere-embed-english-v3**|
+|:-------------------------------------------------------------------------------------------------------------|-----------|
+| [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking)                                | 0.428     |
+| **AP@1000**                                                                                                  | **cohere-embed-english-v3**|
+| [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking)                                | 0.371     |
+| **RR@10**                                                                                                    | **cohere-embed-english-v3**|
+| [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking)                                | 0.365     |
+| **R@1000**                                                                                                   | **cohere-embed-english-v3**|
+| [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking)                                | 0.974     |
+
+Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
+Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml).
+
+## Reproduction Log[*](../../docs/reproducibility.md)
+
+To add to this reproduction log, modify [this template](../../src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template) and run `bin/build.sh` to rebuild the documentation.
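
The doc above states the MD5 checksum the downloaded tarball should have but does not show how to check it. The following is a minimal sketch of that check, not part of the commit: the helper names `md5_of_file` and `verify_download` are hypothetical, and the tarball path assumes the `wget ... -P collections/` command was run as written.

```python
import hashlib
from pathlib import Path

# Checksum quoted in the regression doc for the corpus tarball.
EXPECTED_MD5 = "6b7d9795806891b227378f6c290464a9"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that a
    38 GB tarball never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: where the wget command in the doc places the file.
    tarball = Path("collections/msmarco-passage-cohere-embed-english-v3.tar")
    if tarball.exists():
        print("checksum OK" if verify_download(tarball) else "checksum MISMATCH")
    else:
        print(f"{tarball} not found; download it first")
```

Reading in fixed-size chunks keeps memory use constant regardless of the tarball size.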
diff --git a/src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template b/src/main/resources/docgen/templates/msmarco-passage-cohere-embed-english-v3.template
@@ -0,0 +1,62 @@
+# Anserini Regressions: MS MARCO Passage Ranking
+
+**Model**: [Cohere embed-english-v3.0](https://docs.cohere.com/reference/embed) with HNSW indexes (using pre-encoded queries)
+
+This page describes regression experiments, integrated into Anserini's regression testing framework, using the [Cohere embed-english-v3.0](https://docs.cohere.com/reference/embed) model on the [MS MARCO passage ranking task](https://github.com/microsoft/MSMARCO-Passage-Ranking).
+
+In these experiments, we are using pre-encoded queries (i.e., cached results of query encoding).
+
+The exact configurations for these regressions are stored in [this YAML file](${yaml}).
+Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and then run `bin/build.sh` to rebuild the documentation.
+
+## Corpus Download
+
+Download the corpus and unpack into `collections/`:
+
+```bash
+wget ${download_url} -P collections/
+tar xvf collections/${corpus}.tar -C collections/
+```
+
+To confirm, `${corpus}.tar` is 38 GB and has MD5 checksum `${download_checksum}`.
+
+## Indexing
+
+Sample indexing command, building HNSW indexes:
+
+```bash
+${index_cmds}
+```
+
+The path `/path/to/${corpus}/` should point to the corpus downloaded above.
+Upon completion, we should have an index with 8,841,823 documents.
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+The regression experiments here evaluate on the 6980 dev set questions; see [this page](${root_path}/docs/experiments-msmarco-passage.md) for more details.
+
+After indexing has completed, you should be able to perform retrieval as follows using HNSW indexes:
+
+```bash
+${ranking_cmds}
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```bash
+${eval_cmds}
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+${effectiveness}
+
+Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
+Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](${yaml}).
+
+## Reproduction Log[*](${root_path}/docs/reproducibility.md)
+
+To add to this reproduction log, modify [this template](${template}) and run `bin/build.sh` to rebuild the documentation.
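
The template above uses `${name}` placeholders that are filled from the YAML configuration to produce the generated page. As an illustration only, the sketch below shows that kind of substitution with Python's `string.Template`, which happens to use the same placeholder syntax. It is an assumption made for illustration and says nothing about how Anserini's own docgen code is implemented; the `render` helper is hypothetical, and the values are copied from the YAML file in this commit.

```python
from string import Template

# Two lines taken from the "Corpus Download" section of the template.
TEMPLATE_SNIPPET = """\
wget ${download_url} -P collections/
tar xvf collections/${corpus}.tar -C collections/
"""

# Values copied from the YAML regression configuration.
VALUES = {
    "corpus": "msmarco-passage-cohere-embed-english-v3",
    "download_url": (
        "https://rgw.cs.uwaterloo.ca/pyserini/data/"
        "msmarco-passage-cohere-embed-english-v3.tar"
    ),
}


def render(template_text, values):
    """Fill every ${name} placeholder; raises KeyError if one is missing,
    so an incomplete configuration fails loudly instead of silently."""
    return Template(template_text).substitute(values)


rendered = render(TEMPLATE_SNIPPET, VALUES)
print(rendered)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a placeholder with no matching key is more likely a configuration mistake than something to leave in the output.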
diff --git a/src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml b/src/main/resources/regression/msmarco-passage-cohere-embed-english-v3.yaml
@@ -0,0 +1,65 @@
+---
+corpus: msmarco-passage-cohere-embed-english-v3
+corpus_path: collections/msmarco/msmarco-passage-cohere-embed-english-v3/
+
+download_url: https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-cohere-embed-english-v3.tar
+download_checksum: 6b7d9795806891b227378f6c290464a9
+
+index_path: indexes/lucene-hnsw.msmarco-passage-cohere-embed-english-v3/
+index_type: hnsw
+collection_class: JsonDenseVectorCollection
+generator_class: HnswDenseVectorDocumentGenerator
+index_threads: 16
+index_options: -M 16 -efC 100
+
+metrics:
+  - metric: nDCG@10
+    command: target/appassembler/bin/trec_eval
+    params: -c -m ndcg_cut.10
+    separator: "\t"
+    parse_index: 2
+    metric_precision: 4
+    can_combine: false
+  - metric: AP@1000
+    command: target/appassembler/bin/trec_eval
+    params: -c -m map
+    separator: "\t"
+    parse_index: 2
+    metric_precision: 4
+    can_combine: false
+  - metric: RR@10
+    command: target/appassembler/bin/trec_eval
+    params: -c -M 10 -m recip_rank
+    separator: "\t"
+    parse_index: 2
+    metric_precision: 4
+    can_combine: false
+  - metric: R@1000
+    command: target/appassembler/bin/trec_eval
+    params: -c -m recall.1000
+    separator: "\t"
+    parse_index: 2
+    metric_precision: 4
+    can_combine: false
+
+topic_reader: JsonIntVector
+topics:
+  - name: "[MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking)"
+    id: dev
+    path: topics.msmarco-passage.dev-subset.cohere-embed-english-v3.jsonl.gz
+    qrel: qrels.msmarco-passage.dev-subset.txt
+
+models:
+  - name: cohere-embed-english-v3
+    display: cohere-embed-english-v3
+    type: hnsw
+    params: -generator VectorQueryGenerator -topicField vector -threads 16 -hits 1000 -efSearch 1000
+    results:
+      nDCG@10:
+        - 0.4275
+      AP@1000:
+        - 0.3706
+      RR@10:
+        - 0.3648
+      R@1000:
+        - 0.9735
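
Each metric entry in the YAML above carries a `separator`, a `parse_index`, and a `metric_precision`, and the generated doc says scores are generally within 0.005 of the recorded reference values. The sketch below shows one plausible reading of those fields: split a `trec_eval` summary line on the separator, take the field at `parse_index`, and compare it against the reference. The functions `parse_score` and `within_tolerance` are hypothetical, the sample line is a hand-written illustration of the tab-separated `trec_eval` summary format rather than a captured log, and the real regression script may apply its checks differently.

```python
# Reference values copied from the `results` section of the YAML file.
REFERENCE = {
    "nDCG@10": 0.4275,
    "AP@1000": 0.3706,
    "RR@10": 0.3648,
    "R@1000": 0.9735,
}

# Tolerance quoted in the generated doc ("generally within 0.005").
TOLERANCE = 0.005


def parse_score(output_line, separator="\t", parse_index=2, precision=4):
    """Extract the score from one trec_eval summary line.

    The line has three tab-separated fields: metric name, query id
    ("all" for the summary), and the value, so index 2 is the value.
    """
    fields = output_line.strip().split(separator)
    return round(float(fields[parse_index]), precision)


def within_tolerance(metric, measured, tolerance=TOLERANCE):
    """True if a measured score is within `tolerance` of the reference."""
    return abs(measured - REFERENCE[metric]) <= tolerance


# Hand-written example line in trec_eval's summary format (not a real log).
sample_line = "map                   \tall\t0.3706"
score = parse_score(sample_line)
print(score, within_tolerance("AP@1000", score))
```

An absolute tolerance matters here because, as the doc notes, HNSW indexing is non-deterministic, so an exact-match comparison would fail on otherwise healthy runs.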
diff --git a/tools b/tools