Refactor Solr and Elasticsearch integration and tests (#1799)

Fix minor issues that broke since last update.
castorini · Mar 21, 2022 · 3d1fc34 · 3d1fc34
1 parent f42bbbe
commit 3d1fc34

Showing 5 changed files with 172 additions and 96 deletions.
diff --git a/docs/elastirini.md b/docs/elastirini.md
@@ -4,17 +4,17 @@ Anserini provides code for indexing into an ELK stack, thus providing interopera
 
 ## Deploying Elasticsearch Locally
 
-From the [Elasticsearch](http://elastic.co/start), download the correct distribution for you platform to the `anserini/` directory. 
+From [here](http://elastic.co/start), download the latest Elasticsearch distribution (currently, v8.1.0) for your platform to the `anserini/` directory.
 
 Unpacking:
 
-```
+```bash
 mkdir elastirini && tar -zxvf elasticsearch*.tar.gz -C elastirini --strip-components=1
 ```
 
 Start running:
 
-```
+```bash
 elastirini/bin/elasticsearch
 ```
 
@@ -39,23 +39,33 @@ Now, we can start indexing through Elastirini.
 Here, instead of passing in `-index` (to index with Lucene directly), we use `-es` for Elasticsearch:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator DefaultLuceneDocumentGenerator \
- -es -es.index robust04 -threads 16 -input /path/to/disk45 -storePositions -storeDocvectors -storeRaw
+sh target/appassembler/bin/IndexCollection \
+  -collection TrecCollection \
+  -input /path/to/disk45 \
+  -generator DefaultLuceneDocumentGenerator \
+  -es \
+  -es.index robust04 \
+  -threads 8 \
+  -storePositions -storeDocvectors -storeRaw
 ```
 
 We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.
 Run the following command to reproduce Anserini BM25 retrieval:
 
 ```bash
-sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index robust04 \
+sh target/appassembler/bin/SearchElastic \
   -topics src/main/resources/topics-and-qrels/topics.robust04.txt \
+  -topicreader Trec -es.index robust04 \
   -output runs/run.es.robust04.bm25.topics.robust04.txt
 ```
 
 To evaluate effectiveness:
 
 ```bash
-$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust04.txt runs/run.es.robust04.bm25.topics.robust04.txt
+$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \
+    src/main/resources/topics-and-qrels/qrels.robust04.txt \
+    runs/run.es.robust04.bm25.topics.robust04.txt
+
 map                   	all	0.2531
 P_30                  	all	0.3102
 ```
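The scores above are the values the run is expected to reproduce, so a scripted check can be handy. The following is a minimal sketch, assuming the tab-separated `trec_eval` output format shown above; `check_metric` is a hypothetical helper and not part of Anserini.

```bash
# check_metric METRIC EXPECTED [TOLERANCE]
# Reads trec_eval output on stdin (lines of the form "<metric> all <value>")
# and succeeds only if METRIC is present and within TOLERANCE of EXPECTED.
# Hypothetical helper for illustration; not part of Anserini or trec_eval.
check_metric() {
  local metric="$1" expected="$2" tol="${3:-0.00005}"
  awk -v m="$metric" -v e="$expected" -v t="$tol" '
    $1 == m && $2 == "all" {
      found = 1
      d = $3 - e
      if (d < 0) d = -d
      if (d <= t) { print m " OK (" $3 ")"; ok = 1 }
      else        { print m " MISMATCH: got " $3 ", expected " e }
    }
    END {
      if (!found) print m " not found"
      exit (ok ? 0 : 1)
    }
  '
}

# Usage, with the paths from the command above:
# tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \
#     src/main/resources/topics-and-qrels/qrels.robust04.txt \
#     runs/run.es.robust04.bm25.topics.robust04.txt | check_metric map 0.2531
```

A non-zero exit status here may simply mean the index had not finished catching up when retrieval ran, in which case rerunning retrieval after a short wait is worth trying first.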
@@ -73,26 +83,37 @@ cat src/main/resources/elasticsearch/index-config.core18.json \
 Indexing:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -generator WashingtonPostGenerator \
- -es -es.index core18 -threads 8 -input /path/to/WashingtonPost -storePositions -storeDocvectors -storeContents
+sh target/appassembler/bin/IndexCollection \
+  -collection WashingtonPostCollection \
+  -input /path/to/WashingtonPost \
+  -generator WashingtonPostGenerator \
+  -es \
+  -es.index core18 \
+  -threads 8 \
+  -storePositions -storeDocvectors -storeContents
 ```
 
 We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.
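
Rather than waiting a fixed number of minutes, one option is to poll the document count until it stops changing. This is only a heuristic sketch: `wait_until_stable` and `es_doc_count` are hypothetical helpers, the endpoint `http://localhost:9200` is assumed to be the default local deployment, and a count that is stable across two polls does not strictly guarantee that indexing has finished.

```bash
# wait_until_stable CMD [ARGS...]
# Runs CMD repeatedly until it prints the same non-empty value twice in a row,
# then prints that value. Gives up after MAX_POLLS attempts (default 60),
# sleeping POLL_INTERVAL seconds (default 10) between attempts.
# Hypothetical helper for illustration; not part of Anserini.
wait_until_stable() {
  local interval="${POLL_INTERVAL:-10}" max="${MAX_POLLS:-60}"
  local prev="" cur="" i=0
  while [ "$i" -lt "$max" ]; do
    cur="$("$@")" || cur=""
    if [ -n "$cur" ] && [ "$cur" = "$prev" ]; then
      echo "$cur"
      return 0
    fi
    prev="$cur"
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# es_doc_count INDEX
# Prints the document count reported by the Elasticsearch _count API.
# Assumes an unsecured local server at http://localhost:9200.
es_doc_count() {
  curl -s "http://localhost:9200/$1/_count" \
    | sed -n 's/.*"count":\([0-9]*\).*/\1/p'
}

# Usage:
# wait_until_stable es_doc_count core18
```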
 
 Retrieval:
 
 ```bash
-sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index core18 \
+sh target/appassembler/bin/SearchElastic \
   -topics src/main/resources/topics-and-qrels/topics.core18.txt \
+  -topicreader Trec \
+  -es.index core18 \
   -output runs/run.es.core18.bm25.topics.core18.txt
 ```
 
 Evaluation:
 
 ```bash
-$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.core18.txt runs/run.es.core18.bm25.topics.core18.txt
-map                   	all	0.2495
-P_30                  	all	0.3567
+$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \
+    src/main/resources/topics-and-qrels/qrels.core18.txt \
+    runs/run.es.core18.bm25.topics.core18.txt
+
+map                   	all	0.2496
+P_30                  	all	0.3573
 ```
 
 ## Indexing and Retrieval: MS MARCO Passage
@@ -108,23 +129,35 @@ cat src/main/resources/elasticsearch/index-config.msmarco-passage.json \
 Indexing:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
- -es -es.index msmarco-passage -threads 9 -input /path/to/msmarco-passage -storePositions -storeDocvectors -storeRaw
+sh target/appassembler/bin/IndexCollection \
+  -collection JsonCollection \
+  -input /path/to/msmarco-passage \
+  -generator DefaultLuceneDocumentGenerator \
+  -es \
+  -es.index msmarco-passage \
+  -threads 8 \
+  -storePositions -storeDocvectors -storeRaw
 ```
 
 We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.
 
 Retrieval:
 
 ```bash
-sh target/appassembler/bin/SearchElastic -topicreader TsvString -es.index msmarco-passage \
- -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output runs/run.es.msmacro-passage.txt
+sh target/appassembler/bin/SearchElastic \
+  -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
+  -topicreader TsvString \
+  -es.index msmarco-passage \
+  -output runs/run.es.msmacro-passage.txt
 ```
 
 Evaluation:
 
 ```bash
-$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.es.msmacro-passage.txt
+$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map \
+    src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
+    runs/run.es.msmacro-passage.txt
+
 map                   	all	0.1956
 recall_1000           	all	0.8573
 ```
@@ -142,44 +175,61 @@ cat src/main/resources/elasticsearch/index-config.msmarco-doc.json \
 Indexing:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection -generator DefaultLuceneDocumentGenerator \
- -es -es.index msmarco-doc -threads 1 -input /path/to/msmarco-doc -storePositions -storeDocvectors -storeRaw
+sh target/appassembler/bin/IndexCollection \
+  -collection JsonCollection \
+  -input /path/to/msmarco-doc \
+  -generator DefaultLuceneDocumentGenerator \
+  -es \
+  -es.index msmarco-doc \
+  -threads 8 \
+  -storePositions -storeDocvectors -storeRaw
 ```
 
 We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.
 
 Retrieval:
 
 ```bash
-sh target/appassembler/bin/SearchElastic -topicreader TsvInt -es.index msmarco-doc \
- -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output runs/run.es.msmacro-doc.txt
+sh target/appassembler/bin/SearchElastic \
+  -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
+  -topicreader TsvInt \
+  -es.index msmarco-doc \
+  -output runs/run.es.msmarco-doc.txt
 ```
 
 This can potentially take longer than `SearchCollection` with Lucene indexes.
 
 Evaluation:
 
 ```bash
-$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.es.msmacro-doc.txt
-map                   	all	0.2308
+$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map \
+    src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
+    runs/run.es.msmarco-doc.txt
+
+map                   	all	0.2307
 recall_1000           	all	0.8856
 ```
 
 ## Elasticsearch Integration Test
 
-We have an end-to-end integration testing script `run_es_regression.py` for [Robust04](regressions-robust04.md), [Core18](regressions-core18.md), [MS MARCO passage](regressions-msmarco-passage.md) and [MS MARCO document](regressions-msmarco-doc.md):
+We have an end-to-end integration testing script `run_es_regression.py` for [Robust04](regressions-disk45.md), [Core18](regressions-core18.md), [MS MARCO passage](regressions-msmarco-passage.md) and [MS MARCO document](regressions-msmarco-doc.md):
 
-```
+```bash
 # Check if Elasticsearch server is on
 python src/main/python/run_es_regression.py --ping
+
 # Check if collection exists
 python src/main/python/run_es_regression.py --check-index-exists [collection]
+
 # Create collection if it does not exist
 python src/main/python/run_es_regression.py --create-index [collection]
+
 # Delete collection if it exists
 python src/main/python/run_es_regression.py --delete-index [collection]
+
 # Insert documents from input directory into collection
 python src/main/python/run_es_regression.py --insert-docs [collection] --input [directory]
+
 # Search and evaluate on collection
 python src/main/python/run_es_regression.py --evaluate [collection]
 
@@ -191,14 +241,14 @@ For the `collection` meta-parameter, use `robust04`, `core18`, `msmarco-passage`
 
 ## Reproduction Log[*](reproducibility.md)
 
-+ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`d5ee069`](https://github.com/castorini/anserini/commit/d5ee069399e6a306d7685bda756c1f19db721156)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-robust04.md)
-+ Results reproduced by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-26 (commit [`7b76dfb`](https://github.com/castorini/anserini/commit/7b76dfbea7e0c01a3a5dc13e74f54852c780ec9b)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-robust04.md)
-+ Results reproduced by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`07a9b05`](https://github.com/castorini/anserini/commit/07a9b053173637e15be79de4e7fce4d5a93d04fe)) for [MS Marco Passage](regressions-msmarco-passage.md), [Robust04](regressions-robust04.md) and [Core18](regressions-core18.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
++ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`d5ee069`](https://github.com/castorini/anserini/commit/d5ee069399e6a306d7685bda756c1f19db721156)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-disk45.md)
++ Results reproduced by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-26 (commit [`7b76dfb`](https://github.com/castorini/anserini/commit/7b76dfbea7e0c01a3a5dc13e74f54852c780ec9b)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-disk45.md)
++ Results reproduced by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`07a9b05`](https://github.com/castorini/anserini/commit/07a9b053173637e15be79de4e7fce4d5a93d04fe)) for [MS Marco Passage](regressions-msmarco-passage.md), [Robust04](regressions-disk45.md) and [Core18](regressions-core18.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
 + Results reproduced by [@shaneding](https://github.com/shaneding) on 2020-05-25 (commit [`1de3274`](https://github.com/castorini/anserini/commit/1de3274b057a63382534c5277ffcd772c3fc0d43)) for [MS Marco Passage](regressions-msmarco-passage.md)
 + Results reproduced by [@adamyy](https://github.com/adamyy) on 2020-05-29 (commit [`94893f1`](https://github.com/castorini/anserini/commit/94893f170e047d77c3ef5b8b995d7fbdd13f4298)) for [MS MARCO Passage](regressions-msmarco-passage.md), [MS MARCO Document](experiments-msmarco-doc.md)
 + Results reproduced by [@YimingDou](https://github.com/YimingDou) on 2020-05-29 (commit [`2947a16`](https://github.com/castorini/anserini/commit/2947a1622efae35637b83e321aba8e6fccd43489)) for [MS MARCO Passage](regressions-msmarco-passage.md)
-+ Results reproduced by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04](regressions-robust04.md), [Core18](regressions-core18.md), and [MS MARCO Passage](regressions-msmarco-passage.md)
-+ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-10 (commit [`e19755`](https://github.com/castorini/anserini/commit/e19755b5fa976127830597bc9fbca203b9f5ad24)), all commands and end-to-end regression script for all four collections
++ Results reproduced by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04](regressions-disk45.md), [Core18](regressions-core18.md), and [MS MARCO Passage](regressions-msmarco-passage.md)
++ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-10 (commit [`e19755b`](https://github.com/castorini/anserini/commit/e19755b5fa976127830597bc9fbca203b9f5ad24)), all commands and end-to-end regression script for all four collections
 + Results reproduced by [@jrzhang12](https://github.com/jrzhang12) on 2021-01-02 (commit [`be4e44d`](https://github.com/castorini/anserini/commit/02c52ee606ba0ebe32c130af1e26d24d8f10566a)) for [MS MARCO Passage](regressions-msmarco-passage.md)
 + Results reproduced by [@tyao-t](https://github.com/tyao-t) on 2022-01-13 (commit [`06fb4f9`](https://github.com/castorini/anserini/commit/06fb4f9947ff2167c276d8893287453af7680786)) for [MS MARCO Passage](regressions-msmarco-passage.md) and [MS MARCO Document](regressions-msmarco-doc.md)
-+ Results reproduced by [@d1shs0ap](https://github.com/d1shs0ap) on 2022-01-21 (commit [`a81299e5`](https://github.com/castorini/anserini/commit/a81299e59eff24512d635e0d49fba6e373286469)) for [MS MARCO Document](regressions-msmarco-doc.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
++ Results reproduced by [@d1shs0ap](https://github.com/d1shs0ap) on 2022-01-21 (commit [`a81299e`](https://github.com/castorini/anserini/commit/a81299e59eff24512d635e0d49fba6e373286469)) for [MS MARCO Document](regressions-msmarco-doc.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
diff --git a/docs/experiments-cord19-extras.md b/docs/experiments-cord19-extras.md
@@ -16,7 +16,7 @@ python src/main/python/trec-covid/index_cord19.py --date 2020-07-16 --download
 
 ## Solr
 
-From the Solr [archives](https://archive.apache.org/dist/lucene/solr/), download the Solr (non `-src`) version that matches Anserini's [Lucene version](https://github.com/castorini/anserini/blob/master/pom.xml#L36) to the `anserini/` directory.
+Download the latest Solr version (binary release) from [here](https://solr.apache.org/downloads.html) (currently, v8.11.1).
 
 Extract the archive:
 
@@ -62,10 +62,15 @@ solrini/bin/solr create -n anserini -c cord19
 We can now index into Solr:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \
- -threads 8 -input collections/cord19-2020-07-16 \
- -solr -solr.index cord19 -solr.zkUrl localhost:9983 \
- -storePositions -storeDocvectors -storeContents -storeRaw
+sh target/appassembler/bin/IndexCollection \
+  -collection Cord19AbstractCollection \
+  -input collections/cord19-2020-07-16 \
+  -generator Cord19Generator \
+  -solr \
+  -solr.index cord19 \
+  -solr.zkUrl localhost:9983 \
+  -threads 8 \
+  -storePositions -storeDocvectors -storeContents -storeRaw
 ```
 
 Once indexing is complete, you can query in Solr at [`http://localhost:8983/solr/#/cord19/query`](http://localhost:8983/solr/#/cord19/query).
@@ -74,8 +79,7 @@ You'll need to make sure your query is searching the `contents` field, so the qu
 
 ## Elasticsearch + Kibana
 
-From the [Elasticsearch](http://elastic.co/start), download the correct distribution for your platform to the `anserini/` directory.
-These instructions below work with version 7.10.0.
+From [here](http://elastic.co/start), download the latest Elasticsearch and Kibana distributions (currently, v8.1.0) for your platform to the `anserini/` directory.
 
 First, unpack and deploy Elasticsearch:
 
@@ -95,8 +99,9 @@ Elasticsearch has a built-in safeguard to disable indexing if you're running low
 The error is something like "flood stage disk watermark [95%] exceeded on ..." with indexes placed into readonly mode.
 Obviously, be careful, but if you're sure things are going to be okay and you won't run out of disk space, disable the safeguard as follows:
 
-```
-curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
+```bash
+curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings \
+  -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
 ```
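
Since the command above switches the safeguard off, it may be worth having the reverse operation at hand once indexing is done. The sketch below only builds the request body; `disk_threshold_body` is a hypothetical helper, and the commented `curl` calls assume the same local endpoint as above. Passing `null` resets the transient setting to its default.

```bash
# disk_threshold_body true|false|null
# Prints the JSON body for the cluster settings API:
#   false disables the disk threshold check (as in the command above),
#   true  enables it, and null resets the transient setting to its default.
# Hypothetical helper for illustration.
disk_threshold_body() {
  case "$1" in
    true|false|null) ;;
    *) echo "usage: disk_threshold_body true|false|null" >&2; return 1 ;;
  esac
  printf '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": %s } }\n' "$1"
}

# Usage: restore the default behavior after indexing.
# curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings \
#   -d "$(disk_threshold_body null)"
#
# If an index was already placed into read-only mode, clearing the block on
# that index may also be needed (index name cord19 as used in this guide):
# curl -XPUT -H "Content-Type: application/json" http://localhost:9200/cord19/_settings \
#   -d '{ "index.blocks.read_only_allow_delete": null }'
```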
 
 Set up the proper schema using [this config](../src/main/resources/elasticsearch/index-config.cord19.json):
@@ -109,16 +114,22 @@ cat src/main/resources/elasticsearch/index-config.cord19.json \
 Indexing abstracts:
 
 ```bash
-sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \
- -es -es.index cord19 -threads 8 -input collections/cord19-2020-07-16 -storePositions -storeDocvectors -storeContents -storeRaw
+sh target/appassembler/bin/IndexCollection \
+  -collection Cord19AbstractCollection \
+  -input collections/cord19-2020-07-16 \
+  -generator Cord19Generator \
+  -es \
+  -es.index cord19 \
+  -threads 8 \
+  -storePositions -storeDocvectors -storeContents -storeRaw
 ```
 
 We are now able to access interactive search and visualization capabilities from Kibana at [`http://localhost:5601/`](http://localhost:5601).
 
 Here's an example:
 
-1. Click on the hamburger icon, then click "Discover" under "Analytics".
-2. Create "Index Pattern": set the index pattern to `cord19`, and use `publish_time` as the timestamp field.
+1. Click on the hamburger icon, then click "Dashboard" under "Analytics".
+2. Create "Data View": set the index pattern to `cord19`, and use `publish_time` as the timestamp field.
 3. Go back to "Discover" under "Analytics"; now run a search, e.g., "incubation period". Be sure to expand the date, which is a dropdown box to the right of the search box; something like "Last 10 years" works well.
 4. You should be able to see search results as well as a histogram of the dates in which those articles are published!
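
The Kibana steps above can be approximated from the command line by querying Elasticsearch directly. This is a rough sketch under stated assumptions: `cord19_query_body` is a hypothetical helper, the text field is assumed to be named `contents`, `publish_time` is assumed to be mapped as a date (it is used as the timestamp field above), and the search text is inserted without JSON escaping, so it should not contain quotes or backslashes.

```bash
# cord19_query_body TEXT [YEARS]
# Prints a search body that matches TEXT in the contents field and keeps only
# documents whose publish_time falls within the last YEARS years (default 10),
# mirroring the "Last 10 years" date range chosen in Kibana.
# Hypothetical helper; field names are assumptions, TEXT is not JSON-escaped.
cord19_query_body() {
  local text="$1" years="${2:-10}"
  printf '{ "query": { "bool": { "must": { "match": { "contents": "%s" } }, "filter": { "range": { "publish_time": { "gte": "now-%sy" } } } } } }\n' \
    "$text" "$years"
}

# Usage (assumes the local endpoint used elsewhere in this guide):
# curl -s -H "Content-Type: application/json" \
#   "http://localhost:9200/cord19/_search?size=5" \
#   -d "$(cord19_query_body "incubation period")"
```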