diff --git a/docs/elastirini.md b/docs/elastirini.md index 272f459d0d..9d47bcf7f0 100644 --- a/docs/elastirini.md +++ b/docs/elastirini.md @@ -4,17 +4,17 @@ Anserini provides code for indexing into an ELK stack, thus providing interopera ## Deploying Elasticsearch Locally -From the [Elasticsearch](http://elastic.co/start), download the correct distribution for you platform to the `anserini/` directory. +From [here](http://elastic.co/start), download the latest Elasticsearch distribution for your platform to the `anserini/` directory (currently, v8.1.0). Unpacking: -``` +```bash mkdir elastirini && tar -zxvf elasticsearch*.tar.gz -C elastirini --strip-components=1 ``` Start running: -``` +```bash elastirini/bin/elasticsearch ``` @@ -39,23 +39,33 @@ Now, we can start indexing through Elastirini. Here, instead of passing in `-index` (to index with Lucene directly), we use `-es` for Elasticsearch: ```bash -sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator DefaultLuceneDocumentGenerator \ - -es -es.index robust04 -threads 16 -input /path/to/disk45 -storePositions -storeDocvectors -storeRaw +sh target/appassembler/bin/IndexCollection \ + -collection TrecCollection \ + -input /path/to/disk45 \ + -generator DefaultLuceneDocumentGenerator \ + -es \ + -es.index robust04 \ + -threads 8 \ + -storePositions -storeDocvectors -storeRaw ``` We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off. 
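The "catch up" above is Elasticsearch's near-real-time refresh. One way to confirm that indexing has settled before searching is to force a refresh and read back the document count; `_refresh` and `_count` are standard Elasticsearch REST endpoints, while the sed-based JSON parsing below is only a sketch to avoid a `jq` dependency (the helper name is ours, not part of Anserini):

```bash
# Extract the number from the {"count":N,...} JSON returned by _count.
count_from_json() { sed 's/.*"count":\([0-9]*\).*/\1/'; }

# Against a live cluster (assumes Elasticsearch on localhost:9200):
#   curl -s -XPOST localhost:9200/robust04/_refresh
#   curl -s localhost:9200/robust04/_count | count_from_json
# Standalone demonstration of the parsing on a canned response:
echo '{"count":528030,"_shards":{"total":1,"successful":1}}' | count_from_json
# prints 528030
```

If the count stops changing between calls and matches what `IndexCollection` reported, the index has caught up.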
Run the following command to reproduce Anserini BM25 retrieval: ```bash -sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index robust04 \ +sh target/appassembler/bin/SearchElastic \ -topics src/main/resources/topics-and-qrels/topics.robust04.txt \ + -topicreader Trec -es.index robust04 \ -output runs/run.es.robust04.bm25.topics.robust04.txt ``` To evaluate effectiveness: ```bash -$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust04.txt runs/run.es.robust04.bm25.topics.robust04.txt +$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \ + src/main/resources/topics-and-qrels/qrels.robust04.txt \ + runs/run.es.robust04.bm25.topics.robust04.txt + map all 0.2531 P_30 all 0.3102 ``` @@ -73,8 +83,14 @@ cat src/main/resources/elasticsearch/index-config.core18.json \ Indexing: ```bash -sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -generator WashingtonPostGenerator \ - -es -es.index core18 -threads 8 -input /path/to/WashingtonPost -storePositions -storeDocvectors -storeContents +sh target/appassembler/bin/IndexCollection \ + -collection WashingtonPostCollection \ + -input /path/to/WashingtonPost \ + -generator WashingtonPostGenerator \ + -es \ + -es.index core18 \ + -threads 8 \ + -storePositions -storeDocvectors -storeContents ``` We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off. 
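When scripting runs like the ones above, it is convenient to pull a single aggregate value out of trec_eval's three-column output (metric name, query id, value). A minimal sketch assuming that standard output format (the `extract_metric` helper is ours, not part of Anserini):

```bash
# Print the aggregate ("all") value for one metric from trec_eval output.
extract_metric() {
  awk -v m="$1" '$1 == m && $2 == "all" { print $3 }'
}

# Demonstration on output in the format shown above:
printf 'map\tall\t0.2531\nP_30\tall\t0.3102\n' | extract_metric P_30
# prints 0.3102
```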
@@ -82,17 +98,22 @@ We may need to wait a few minutes after indexing for the index to "catch up" bef Retrieval: ```bash -sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index core18 \ +sh target/appassembler/bin/SearchElastic \ -topics src/main/resources/topics-and-qrels/topics.core18.txt \ + -topicreader Trec \ + -es.index core18 \ -output runs/run.es.core18.bm25.topics.core18.txt ``` Evaluation: ```bash -$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.core18.txt runs/run.es.core18.bm25.topics.core18.txt -map all 0.2495 -P_30 all 0.3567 +$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \ + src/main/resources/topics-and-qrels/qrels.core18.txt \ + runs/run.es.core18.bm25.topics.core18.txt + +map all 0.2496 +P_30 all 0.3573 ``` ## Indexing and Retrieval: MS MARCO Passage @@ -108,8 +129,14 @@ cat src/main/resources/elasticsearch/index-config.msmarco-passage.json \ Indexing: ```bash -sh target/appassembler/bin/IndexCollection -collection JsonCollection -generator DefaultLuceneDocumentGenerator \ - -es -es.index msmarco-passage -threads 9 -input /path/to/msmarco-passage -storePositions -storeDocvectors -storeRaw +sh target/appassembler/bin/IndexCollection \ + -collection JsonCollection \ + -input /path/to/msmarco-passage \ + -generator DefaultLuceneDocumentGenerator \ + -es \ + -es.index msmarco-passage \ + -threads 8 \ + -storePositions -storeDocvectors -storeRaw ``` We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off. 
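trec_eval expects runs in the six-column TREC format (`qid Q0 docid rank score tag`), and a malformed run usually surfaces only as a cryptic evaluation error. A quick column-count sanity check, as a sketch (the `check_run` name and the sample docids are ours):

```bash
# Report how many lines deviate from the six-column TREC run format.
check_run() {
  awk 'NF != 6 { bad++ } END { print bad + 0 }' "$1"
}

# Demonstration on a two-line sample run (hypothetical docids/scores):
printf '301 Q0 FBIS3-10082 1 21.53 Anserini\n301 Q0 FBIS3-10169 2 20.91 Anserini\n' > /tmp/sample.run
check_run /tmp/sample.run
# prints 0
```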
@@ -117,14 +144,20 @@ We may need to wait a few minutes after indexing for the index to "catch up" bef Retrieval: ```bash -sh target/appassembler/bin/SearchElastic -topicreader TsvString -es.index msmarco-passage \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output runs/run.es.msmacro-passage.txt +sh target/appassembler/bin/SearchElastic \ + -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \ + -topicreader TsvString \ + -es.index msmarco-passage \ + -output runs/run.es.msmarco-passage.txt ``` Evaluation: ```bash -$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.es.msmacro-passage.txt +$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map \ + src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ + runs/run.es.msmarco-passage.txt + map all 0.1956 recall_1000 all 0.8573 ``` @@ -142,8 +175,14 @@ cat src/main/resources/elasticsearch/index-config.msmarco-doc.json \ Indexing: ```bash -sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection -generator DefaultLuceneDocumentGenerator \ - -es -es.index msmarco-doc -threads 1 -input /path/to/msmarco-doc -storePositions -storeDocvectors -storeRaw +sh target/appassembler/bin/IndexCollection \ + -collection JsonCollection \ + -input /path/to/msmarco-doc \ + -generator DefaultLuceneDocumentGenerator \ + -es \ + -es.index msmarco-doc \ + -threads 8 \ + -storePositions -storeDocvectors -storeRaw ``` We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off. 
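These pages evaluate with trec_eval, but the MS MARCO leaderboard's own evaluation scripts typically read a tab-separated `qid docid rank` layout rather than the six-column TREC run format; converting is a one-line field projection. A sketch (the helper name and sample ids are ours; it assumes the run is already sorted by rank within each query):

```bash
# Keep qid, docid, rank from a TREC run (qid Q0 docid rank score tag).
trec_to_msmarco() { awk -v OFS='\t' '{ print $1, $3, $4 }' "$@"; }

# Demonstration on one sample line (hypothetical ids):
printf '1048585 Q0 7187158 1 11.87 Anserini\n' | trec_to_msmarco
```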
@@ -151,8 +190,11 @@ We may need to wait a few minutes after indexing for the index to "catch up" bef Retrieval: ```bash -sh target/appassembler/bin/SearchElastic -topicreader TsvInt -es.index msmarco-doc \ - -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output runs/run.es.msmacro-doc.txt +sh target/appassembler/bin/SearchElastic \ + -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \ + -topicreader TsvInt \ + -es.index msmarco-doc \ + -output runs/run.es.msmarco-doc.txt ``` This can take potentially longer than `SearchCollection` with Lucene indexes. @@ -160,26 +202,34 @@ This can take potentially longer than `SearchCollection` with Lucene indexes. Evaluation: ```bash -$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.es.msmacro-doc.txt -map all 0.2308 +$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map \ + src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \ + runs/run.es.msmarco-doc.txt + +map all 0.2307 recall_1000 all 0.8856 ``` ## Elasticsearch Integration Test -We have an end-to-end integration testing script `run_es_regression.py` for [Robust04](regressions-robust04.md), [Core18](regressions-core18.md), [MS MARCO passage](regressions-msmarco-passage.md) and [MS MARCO document](regressions-msmarco-doc.md): +We have an end-to-end integration testing script `run_es_regression.py` for [Robust04](regressions-disk45.md), [Core18](regressions-core18.md), [MS MARCO passage](regressions-msmarco-passage.md) and [MS MARCO document](regressions-msmarco-doc.md): -``` +```bash # Check if Elasticsearch server is on python src/main/python/run_es_regression.py --ping + # Check if collection exists python src/main/python/run_es_regression.py --check-index-exists [collection] + # Create collection if it does not exist python src/main/python/run_es_regression.py --create-index [collection] + # Delete collection if it exists 
python src/main/python/run_es_regression.py --delete-index [collection] + # Insert documents from input directory into collection python src/main/python/run_es_regression.py --insert-docs [collection] --input [directory] + # Search and evaluate on collection python src/main/python/run_es_regression.py --evaluate [collection] @@ -191,14 +241,14 @@ For the `collection` meta-parameter, use `robust04`, `core18`, `msmarco-passage` ## Reproduction Log[*](reproducibility.md) -+ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`d5ee069`](https://github.com/castorini/anserini/commit/d5ee069399e6a306d7685bda756c1f19db721156)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-robust04.md) -+ Results reproduced by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-26 (commit [`7b76dfb`](https://github.com/castorini/anserini/commit/7b76dfbea7e0c01a3a5dc13e74f54852c780ec9b)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-robust04.md) -+ Results reproduced by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`07a9b05`](https://github.com/castorini/anserini/commit/07a9b053173637e15be79de4e7fce4d5a93d04fe)) for [MS Marco Passage](regressions-msmarco-passage.md), [Robust04](regressions-robust04.md) and [Core18](regressions-core18.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py) ++ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`d5ee069`](https://github.com/castorini/anserini/commit/d5ee069399e6a306d7685bda756c1f19db721156)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-disk45.md) ++ Results reproduced by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-26 (commit [`7b76dfb`](https://github.com/castorini/anserini/commit/7b76dfbea7e0c01a3a5dc13e74f54852c780ec9b)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and 
[Robust04](regressions-disk45.md) ++ Results reproduced by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`07a9b05`](https://github.com/castorini/anserini/commit/07a9b053173637e15be79de4e7fce4d5a93d04fe)) for [MS Marco Passage](regressions-msmarco-passage.md), [Robust04](regressions-disk45.md) and [Core18](regressions-core18.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py) + Results reproduced by [@shaneding](https://github.com/shaneding) on 2020-05-25 (commit [`1de3274`](https://github.com/castorini/anserini/commit/1de3274b057a63382534c5277ffcd772c3fc0d43)) for [MS Marco Passage](regressions-msmarco-passage.md) + Results reproduced by [@adamyy](https://github.com/adamyy) on 2020-05-29 (commit [`94893f1`](https://github.com/castorini/anserini/commit/94893f170e047d77c3ef5b8b995d7fbdd13f4298)) for [MS MARCO Passage](regressions-msmarco-passage.md), [MS MARCO Document](experiments-msmarco-doc.md) + Results reproduced by [@YimingDou](https://github.com/YimingDou) on 2020-05-29 (commit [`2947a16`](https://github.com/castorini/anserini/commit/2947a1622efae35637b83e321aba8e6fccd43489)) for [MS MARCO Passage](regressions-msmarco-passage.md) -+ Results reproduced by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04](regressions-robust04.md), [Core18](regressions-core18.md), and [MS MARCO Passage](regressions-msmarco-passage.md) -+ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-10 (commit [`e19755`](https://github.com/castorini/anserini/commit/e19755b5fa976127830597bc9fbca203b9f5ad24)), all commands and end-to-end regression script for all four collections ++ Results reproduced by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for 
[Robust04](regressions-disk45.md), [Core18](regressions-core18.md), and [MS MARCO Passage](regressions-msmarco-passage.md) ++ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-10 (commit [`e19755b`](https://github.com/castorini/anserini/commit/e19755b5fa976127830597bc9fbca203b9f5ad24)), all commands and end-to-end regression script for all four collections + Results reproduced by [@jrzhang12](https://github.com/jrzhang12) on 2021-01-02 (commit [`be4e44d`](https://github.com/castorini/anserini/commit/02c52ee606ba0ebe32c130af1e26d24d8f10566a)) for [MS MARCO Passage](regressions-msmarco-passage.md) + Results reproduced by [@tyao-t](https://github.com/tyao-t) on 2022-01-13 (commit [`06fb4f9`](https://github.com/castorini/anserini/commit/06fb4f9947ff2167c276d8893287453af7680786)) for [MS MARCO Passage](regressions-msmarco-passage.md) and [MS MARCO Document](regressions-msmarco-doc.md) -+ Results reproduced by [@d1shs0ap](https://github.com/d1shs0ap) on 2022-01-21 (commit [`a81299e5`](https://github.com/castorini/anserini/commit/a81299e59eff24512d635e0d49fba6e373286469)) for [MS MARCO Document](regressions-msmarco-doc.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py) \ No newline at end of file ++ Results reproduced by [@d1shs0ap](https://github.com/d1shs0ap) on 2022-01-21 (commit [`a81299e`](https://github.com/castorini/anserini/commit/a81299e59eff24512d635e0d49fba6e373286469)) for [MS MARCO Document](regressions-msmarco-doc.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py) \ No newline at end of file diff --git a/docs/experiments-cord19-extras.md b/docs/experiments-cord19-extras.md index fa39f5fb9d..1a6aa11571 100644 --- a/docs/experiments-cord19-extras.md +++ b/docs/experiments-cord19-extras.md @@ -16,7 +16,7 @@ python src/main/python/trec-covid/index_cord19.py --date 2020-07-16 --download ## Solr -From the Solr [archives](https://archive.apache.org/dist/lucene/solr/), download 
the Solr (non `-src`) version that matches Anserini's [Lucene version](https://github.com/castorini/anserini/blob/master/pom.xml#L36) to the `anserini/` directory. +Download the latest Solr version (binary release) from [here](https://solr.apache.org/downloads.html) to the `anserini/` directory (currently, v8.11.1). Extract the archive: @@ -62,10 +62,15 @@ solrini/bin/solr create -n anserini -c cord19 We can now index into Solr: ```bash -sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \ - -threads 8 -input collections/cord19-2020-07-16 \ - -solr -solr.index cord19 -solr.zkUrl localhost:9983 \ - -storePositions -storeDocvectors -storeContents -storeRaw +sh target/appassembler/bin/IndexCollection \ + -collection Cord19AbstractCollection \ + -input collections/cord19-2020-07-16 \ + -generator Cord19Generator \ + -solr \ + -solr.index cord19 \ + -solr.zkUrl localhost:9983 \ + -threads 8 \ + -storePositions -storeDocvectors -storeContents -storeRaw ``` Once indexing is complete, you can query in Solr at [`http://localhost:8983/solr/#/cord19/query`](http://localhost:8983/solr/#/cord19/query). @@ -74,8 +79,7 @@ You'll need to make sure your query is searching the `contents` field, so the qu ## Elasticsearch + Kibana -From the [Elasticsearch](http://elastic.co/start), download the correct distribution for your platform to the `anserini/` directory. -These instructions below work with version 7.10.0. +From [here](http://elastic.co/start), download the latest Elasticsearch and Kibana distributions for your platform to the `anserini/` directory (currently, v8.1.0). First, unpack and deploy Elasticsearch: @@ -95,8 +99,9 @@ Elasticsearch has a built-in safeguard to disable indexing if you're running low The error is something like "flood stage disk watermark [95%] exceeded on ..." with indexes placed into readonly mode. 
Obviously, be careful, but if you're sure things are going to be okay and you won't run out of disk space, disable the safeguard as follows: -``` -curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }' +```bash +curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings \ + -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }' ``` Set up the proper schema using [this config](../src/main/resources/elasticsearch/index-config.cord19.json): @@ -109,16 +114,22 @@ cat src/main/resources/elasticsearch/index-config.cord19.json \ Indexing abstracts: ```bash -sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \ - -es -es.index cord19 -threads 8 -input collections/cord19-2020-07-16 -storePositions -storeDocvectors -storeContents -storeRaw +sh target/appassembler/bin/IndexCollection \ + -collection Cord19AbstractCollection \ + -input collections/cord19-2020-07-16 \ + -generator Cord19Generator \ + -es \ + -es.index cord19 \ + -threads 8 \ + -storePositions -storeDocvectors -storeContents -storeRaw ``` We are now able to access interactive search and visualization capabilities from Kibana at [`http://localhost:5601/`](http://localhost:5601). Here's an example: -1. Click on the hamburger icon, then click "Discover" under "Analytics". -2. Create "Index Pattern": set the index pattern to `cord19`, and use `publish_time` as the timestamp field. +1. Click on the hamburger icon, then click "Dashboard" under "Analytics". +2. Create "Data View": set the index pattern to `cord19`, and use `publish_time` as the timestamp field. 3. Go back to "Discover" under "Analytics"; now run a search, e.g., "incubation period". Be sure to expand the date, which is a dropdown box to the right of the search box; something like "Last 10 years" works well. 4. 
You should be able to see search results as well as a histogram of the dates on which those articles are published! diff --git a/docs/solrini.md b/docs/solrini.md index 63d283684e..e50377f8ce 100644 --- a/docs/solrini.md +++ b/docs/solrini.md @@ -2,45 +2,49 @@ This page documents code for reproducing results from the following paper: -+ Ryan Clancy, Toke Eskildsen, Nick Ruest, and Jimmy Lin. [Solr Integration in the Anserini Information Retrieval Toolkit.](https://cs.uwaterloo.ca/~jimmylin/publications/Clancy_etal_SIGIR2019a.pdf) _Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019)_, July 2019, Paris, France. +> Ryan Clancy, Toke Eskildsen, Nick Ruest, and Jimmy Lin. [Solr Integration in the Anserini Information Retrieval Toolkit.](https://cs.uwaterloo.ca/~jimmylin/publications/Clancy_etal_SIGIR2019a.pdf) _Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019)_, July 2019, Paris, France. We provide instructions for setting up a single-node SolrCloud instance running locally and indexing into it from Anserini. Instructions for setting up SolrCloud clusters can be found by searching the web. ## Setting up a Single-Node SolrCloud Instance -From the Solr [archives](https://archive.apache.org/dist/lucene/solr/), download the Solr (non `-src`) version that matches Anserini's [Lucene version](https://github.com/castorini/anserini/blob/master/pom.xml#L36) to the `anserini/` directory. 
+Download the latest Solr version (binary release) from [here](https://solr.apache.org/downloads.html) and extract the archive (currently, v8.11.1): -Extract the archive: - -``` +```bash mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1 ``` Start Solr: +```bash +solrini/bin/solr start -c -m 16G ``` -solrini/bin/solr start -c -m 8G + +When you're done, remember to stop Solr: + +```bash +solrini/bin/solr stop ``` -Adjust memory usage (i.e., `-m 8G` as appropriate). +Adjust memory usage (i.e., `-m 16G` as appropriate). Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper: -``` +```bash pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd ``` Solr should now be available at [http://localhost:8983/](http://localhost:8983/) for browsing. The Solr index schema can also be modified using the [Schema API](https://lucene.apache.org/solr/guide/8_3/schema-api.html). This is useful for specifying field types and other properties including multiValued fields. - Schemas for setting up specific Solr index schemas can be found in the [src/main/resources/solr/schemas/](../src/main/resources/solr/schemas/) folder. - To set the schema, we can make a request to the Schema API: -``` -curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json http://localhost:8983/solr/COLLECTION_NAME/schema +```bash +curl -X POST -H 'Content-type:application/json' \ + --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json \ + http://localhost:8983/solr/COLLECTION_NAME/schema ``` For Robust04 example below, this isn't necessary. @@ -50,41 +54,50 @@ For Robust04 example below, this isn't necessary. We can use Anserini as a common "front-end" for indexing into SolrCloud, thus supporting the same range of test collections that's already included in Anserini (when directly building local Lucene indexes). 
Indexing into Solr is similar to indexing to disk with Lucene, with a few added parameters. Most notably, we replace the `-index` parameter (which specifies the Lucene index path on disk) with Solr parameters. -Alternatively, Solr can also be configured to [read prebuilt Lucene index](#solr-with-prebuilt-lucene-index), since Solr uses Lucene indexes under the hood. +Alternatively, Solr can also be configured to read pre-built Lucene indexes, since Solr uses Lucene indexes under the hood (more details below). We'll index [Robust04](regressions-disk45.md) as an example. First, create the `robust04` collection in Solr: -``` +```bash solrini/bin/solr create -n anserini -c robust04 ``` Run the Solr indexing command for `robust04`: -``` -sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator DefaultLuceneDocumentGenerator \ - -threads 8 -input /path/to/disk45 \ - -solr -solr.index robust04 -solr.zkUrl localhost:9983 \ +```bash +sh target/appassembler/bin/IndexCollection \ + -collection TrecCollection \ + -input /path/to/disk45 \ + -generator DefaultLuceneDocumentGenerator \ + -solr \ + -solr.index robust04 \ + -solr.zkUrl localhost:9983 \ + -threads 8 \ -storePositions -storeDocvectors -storeRaw ``` Make sure `/path/to/disk45` is updated with the appropriate path for the Robust04 collection. Once indexing has completed, you should be able to query `robust04` from the Solr [query interface](http://localhost:8983/solr/#/robust04/query). 
- You can also run the following command to reproduce Anserini BM25 retrieval: -``` -sh target/appassembler/bin/SearchSolr -topicreader Trec \ - -solr.index robust04 -solr.zkUrl localhost:9983 \ +```bash +sh target/appassembler/bin/SearchSolr \ -topics src/main/resources/topics-and-qrels/topics.robust04.txt \ + -topicreader Trec \ + -solr.index robust04 \ + -solr.zkUrl localhost:9983 \ -output runs/run.solr.robust04.bm25.topics.robust04.txt ``` Evaluation can be performed using `trec_eval`: ```bash -$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust04.txt runs/run.solr.robust04.bm25.topics.robust04.txt +$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \ + src/main/resources/topics-and-qrels/qrels.robust04.txt \ + runs/run.solr.robust04.bm25.topics.robust04.txt + map all 0.2531 P_30 all 0.3102 ``` @@ -92,8 +105,8 @@ P_30 all 0.3102 Solrini has also been verified to work with following collections as well: + [TREC Washington Post Corpus](regressions-core18.md) -+ [MS MARCO Passage Retrieval Corpus](experiments-msmarco-passage.md) -+ [MS MARCO document](regressions-msmarco-doc.md) ++ [MS MARCO passage ranking task](experiments-msmarco-passage.md) ++ [MS MARCO document ranking task](regressions-msmarco-doc.md) See `run_solr_regression.py` regression script for more details. @@ -101,13 +114,13 @@ See `run_solr_regression.py` regression script for more details. It is possible for Solr to read pre-built Lucene indexes. To achieve this, some housekeeping is required to "install" the pre-built indexes. -The following uses [Robust04](regressions-robust04.md) as an example. -Let's assume the pre-built index is stored at `indexes/lucene-index.robust04.pos+docvectors+raw/`. +The following uses [Robust04](regressions-disk45.md) as an example. +Let's assume the pre-built index is stored at `indexes/lucene-index.disk45/`. First, a Solr collection must be created to house the index. 
Here, we create a collection `robust04` with configset `anserini`. -``` +```bash solrini/bin/solr create -n anserini -c robust04 ``` @@ -119,28 +132,30 @@ Second, make proper Solr schema adjustments if necessary. Here, `robust04` is a TREC collection whose schema is already handled by [managed-schema](https://github.com/castorini/anserini/blob/master/src/main/resources/solr/anserini/conf/managed-schema) in the Solr configset. However, for a collection such as `cord19`, remember to make proper adjustments to the Solr schema (also see above): -``` -curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json http://localhost:8983/solr/COLLECTION_NAME/schema +```bash +curl -X POST -H 'Content-type:application/json' \ + --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json \ + http://localhost:8983/solr/COLLECTION_NAME/schema ``` Finally, we can copy the pre-built index to the location where Solr expects it. Start by removing data that's there: -``` +```bash rm solrini/server/solr/robust04_shard1_replica_n1/data/index/* ``` Then, simply copy the pre-built Lucene indexes into that location: -``` -cp indexes/lucene-index.robust04.pos+docvectors+raw/* solrini/server/solr/robust04_shard1_replica_n1/data/index +```bash +cp indexes/lucene-index.disk45/* solrini/server/solr/robust04_shard1_replica_n1/data/index ``` Restart Solr to make sure changes take effect: -``` +```bash solrini/bin/solr stop -solrini/bin/solr start -c -m 8G +solrini/bin/solr start -c -m 16G ``` You can confirm that everything works by performing a retrieval run and checking the results (see above). @@ -148,7 +163,7 @@ You can confirm that everything works by performing a retrieval run and checking ## Solr integration test We have an end-to-end integration testing script `run_solr_regression.py`. 
-See example usage for [Robust04](regressions-robust04.md) below: +See example usage for [Robust04](regressions-disk45.md) below: ```bash # Check if Solr server is on @@ -164,7 +179,7 @@ python src/main/python/run_solr_regression.py --create-index robust04 python src/main/python/run_solr_regression.py --delete-index robust04 # Insert documents from /path/to/disk45 into robust04 -python src/main/python/run_solr_regression.py --insert-docs core18 --input /path/to/disk45 +python src/main/python/run_solr_regression.py --insert-docs robust04 --input /path/to/disk45 # Search and evaluate on robust04 python src/main/python/run_solr_regression.py --evaluate robust04 @@ -176,20 +191,20 @@ To run end-to-end, issue the following command: python src/main/python/run_solr_regression.py --regression robust04 --input /path/to/disk45 ``` -The regression script has been verified to work for [`robust04`](regressions-robust04.md), [`core18`](regressions-core18.md), [`msmarco-passage`](experiments-msmarco-passage.md), [`msmarco-doc`](regressions-msmarco-doc.md). +The regression script has been verified to work for [`robust04`](regressions-disk45.md), [`core18`](regressions-core18.md), [`msmarco-passage`](experiments-msmarco-passage.md), [`msmarco-doc`](regressions-msmarco-doc.md). 
 ## Reproduction Log[*](reproducibility.md)
-+ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`1882d84`](https://github.com/castorini/anserini/commit/1882d84236b13cd4673d2d8fa91003438eea2d82)) for both [Washington Post](regressions-core18.md) and [Robust04](regressions-robust04.md)
-+ Results reproduced by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-28 (commit [`a79cb62`](https://github.com/castorini/anserini/commit/a79cb62a57a059113a6c3b1523b582b89dccf0a1)) for both [Washington Post](regressions-core18.md) and [Robust04](regressions-robust04.md)
-+ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-02-12 (commit [`eff7755`](https://github.com/castorini/anserini/commit/eff7755a611bd20ee1d63ac0167f5c8f38cd3074)) for [Washington Post `core18`](regressions-core18.md), [Robust04 `robust04`](regressions-robust04.md), and [MS Marco Passage `msmarco-passage`](regressions-msmarco-passage.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
++ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`1882d84`](https://github.com/castorini/anserini/commit/1882d84236b13cd4673d2d8fa91003438eea2d82)) for both [Washington Post](regressions-core18.md) and [Robust04](regressions-disk45.md)
++ Results reproduced by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-28 (commit [`a79cb62`](https://github.com/castorini/anserini/commit/a79cb62a57a059113a6c3b1523b582b89dccf0a1)) for both [Washington Post](regressions-core18.md) and [Robust04](regressions-disk45.md)
++ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-02-12 (commit [`eff7755`](https://github.com/castorini/anserini/commit/eff7755a611bd20ee1d63ac0167f5c8f38cd3074)) for [Washington Post `core18`](regressions-core18.md), [Robust04 `robust04`](regressions-disk45.md), and [MS Marco Passage `msmarco-passage`](regressions-msmarco-passage.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
 + Results reproduced by [@yuki617](https://github.com/yuki617) on 2020-03-30 (commit [`ec8ee41`](https://github.com/castorini/anserini/commit/ec8ee4145edf6db767cb86fa0d244d17e652eb2e)) for [MS Marco Passage `msmarco-passage`](regressions-msmarco-passage.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
 + Results reproduced by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`31d843a`](https://github.com/castorini/anserini/commit/31d843a6073bfd7eff7e326f543e3f11845df7fa)) for [MS Marco Passage `msmarco-passage`](regressions-msmarco-passage.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
 + Results reproduced by [@shaneding](https://github.com/shaneding) on 2020-05-26 (commit [`bed8ead`](https://github.com/castorini/anserini/commit/bed8eadad5f2ba859a2ddd2801db4aaeb3c81485)) for [MS Marco Passage `msmarco-passage`](regressions-msmarco-passage.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
 + Results reproduced by [@YimingDou](https://github.com/YimingDou) on 2020-05-29 (commit [`2947a16`](https://github.com/castorini/anserini/commit/2947a1622efae35637b83e321aba8e6fccd43489)) for [MS MARCO Passage `msmarco-passage`](regressions-msmarco-passage.md)
 + Results reproduced by [@adamyy](https://github.com/adamyy) on 2020-05-29 (commit [`2947a16`](https://github.com/castorini/anserini/commit/2947a1622efae35637b83e321aba8e6fccd43489)) for [MS Marco Passage `msmarco-passage`](regressions-msmarco-passage.md) and [MS Marco Document `msmarco-doc`](regressions-msmarco-doc.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
-+ Results reproduced by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04 `robust04`](regressions-robust04.md), [Washington Post `core18`](regressions-core18.md), and [MS Marco Passage `msmarco-passage`](regressions-msmarco-passage.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
-+ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-10 (commit [`e19755`](https://github.com/castorini/anserini/commit/e19755b5fa976127830597bc9fbca203b9f5ad24)), all commands and end-to-end regression script for all four collections
++ Results reproduced by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04 `robust04`](regressions-disk45.md), [Washington Post `core18`](regressions-core18.md), and [MS Marco Passage `msmarco-passage`](regressions-msmarco-passage.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
++ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-10 (commit [`e19755b`](https://github.com/castorini/anserini/commit/e19755b5fa976127830597bc9fbca203b9f5ad24)), all commands and end-to-end regression script for all four collections
 + Results reproduced by [@jrzhang12](https://github.com/jrzhang12) on 2021-01-10 (commit [`be4e44d`](https://github.com/castorini/anserini/commit/02c52ee606ba0ebe32c130af1e26d24d8f10566a)) for [MS MARCO Passage](regressions-msmarco-passage.md)
 + Results reproduced by [@tyao-t](https://github.com/tyao-t) on 2021-01-13 (commit [`a62aca0`](https://github.com/castorini/anserini/commit/a62aca06c1603617207c1c148133de0f90f24738)) for [MS MARCO Passage](regressions-msmarco-passage.md) and [MS MARCO Document](regressions-msmarco-doc.md)
-+ Results reproduced by [@d1shs0ap](https://github.com/d1shs0ap) on 2022-01-21 (commit [`a81299e5`](https://github.com/castorini/anserini/commit/a81299e59eff24512d635e0d49fba6e373286469)) for [MS MARCO Document](regressions-msmarco-doc.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
++ Results reproduced by [@d1shs0ap](https://github.com/d1shs0ap) on 2022-01-21 (commit [`a81299e`](https://github.com/castorini/anserini/commit/a81299e59eff24512d635e0d49fba6e373286469)) for [MS MARCO Document](regressions-msmarco-doc.md) using end-to-end [`run_solr_regression`](../src/main/python/run_solr_regression.py)
diff --git a/src/main/python/run_es_regression.py b/src/main/python/run_es_regression.py
index 1209480fa3..b2084de21d 100644
--- a/src/main/python/run_es_regression.py
+++ b/src/main/python/run_es_regression.py
@@ -106,19 +106,19 @@ def insert_docs(self, collection, path):
         # TODO: abstract this into an external config instead of hard-coded.
         if collection == 'robust04':
             command = 'sh target/appassembler/bin/IndexCollection -collection TrecCollection ' + \
-                      '-generator DefaultLuceneDocumentGenerator -es -es.index robust04 -threads 16 -input ' + \
+                      '-generator DefaultLuceneDocumentGenerator -es -es.index robust04 -threads 8 -input ' + \
                       path + ' -storePositions -storeDocvectors -storeRaw'
         elif collection == 'msmarco-passage':
             command = 'sh target/appassembler/bin/IndexCollection -collection JsonCollection ' + \
-                      '-generator DefaultLuceneDocumentGenerator -es -es.index msmarco-passage -threads 9 -input ' + \
+                      '-generator DefaultLuceneDocumentGenerator -es -es.index msmarco-passage -threads 8 -input ' + \
                       path + ' -storePositions -storeDocvectors -storeRaw'
         elif collection == 'core18':
             command = 'sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection ' + \
                       '-generator WashingtonPostGenerator -es -es.index core18 -threads 8 -input ' + \
                       path + ' -storePositions -storeDocvectors -storeContents'
         elif collection == 'msmarco-doc':
-            command = 'sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection ' + \
-                      '-generator DefaultLuceneDocumentGenerator -es -es.index msmarco-doc -threads 1 -input ' + \
+            command = 'sh target/appassembler/bin/IndexCollection -collection JsonCollection ' + \
+                      '-generator DefaultLuceneDocumentGenerator -es -es.index msmarco-doc -threads 8 -input ' + \
                       path + ' -storePositions -storeDocvectors -storeRaw'
         else:
             raise Exception('Unknown collection: {}'.format(collection))
@@ -180,7 +180,7 @@ def evaluate(self, collection):
         elif collection == 'core18':
             expected = 0.2496
         elif collection == 'msmarco-doc':
-            expected = 0.2308
+            expected = 0.2307
         else:
             raise Exception('Unknown collection: {}'.format(collection))
diff --git a/src/main/python/run_solr_regression.py b/src/main/python/run_solr_regression.py
index ab42193179..3fa8486a4b 100644
--- a/src/main/python/run_solr_regression.py
+++ b/src/main/python/run_solr_regression.py
@@ -99,7 +99,7 @@ def insert_docs(self, collection, path):
                       '-solr -solr.index msmarco-passage -solr.zkUrl localhost:9983 ' + \
                       '-threads 8 -input ' + path + ' -storePositions -storeDocvectors -storeRaw'
         elif collection == 'msmarco-doc':
-            command = 'sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection ' + \
+            command = 'sh target/appassembler/bin/IndexCollection -collection JsonCollection ' + \
                       '-generator DefaultLuceneDocumentGenerator ' + \
                       '-solr -solr.index msmarco-doc -solr.zkUrl localhost:9983 ' + \
                       '-threads 8 -input ' + path + ' -storePositions -storeDocvectors -storeRaw'
@@ -176,7 +176,7 @@ def evaluate(self, collection):
         elif collection == 'msmarco-passage':
             expected = 0.1926
         elif collection == 'msmarco-doc':
-            expected = 0.2310
+            expected = 0.2305
         else:
             raise Exception('Unknown collection: {}'.format(collection))
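The `expected = …` changes above re-pin the effectiveness baselines that the regression scripts compare trec_eval output against. As a rough sketch of that pinned-baseline pattern (the `EXPECTED_MAP` dict, the values chosen, and the `check_effectiveness` helper are illustrative, not part of either script):

```python
# Hypothetical sketch of a pinned-baseline effectiveness check; the real
# scripts hard-code expected values per collection inside evaluate().
EXPECTED_MAP = {
    'msmarco-doc': 0.2307,  # ES baseline after this diff
    'core18': 0.2496,       # unchanged ES baseline
}

def check_effectiveness(collection, actual, tol=1e-4):
    """Return True if the measured MAP matches the pinned expected value."""
    if collection not in EXPECTED_MAP:
        raise Exception('Unknown collection: {}'.format(collection))
    return abs(actual - EXPECTED_MAP[collection]) < tol
```

Pinning exact scores (rather than a loose range) is what makes stale baselines like `0.2308` vs. `0.2307` show up as failures when the indexing pipeline changes.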