Refactor Solr and Elasticsearch integration and tests (#1799)
Fix minor issues that broke since last update.
lintool authored Mar 21, 2022
1 parent f42bbbe commit 3d1fc34
Showing 5 changed files with 172 additions and 96 deletions.
114 changes: 82 additions & 32 deletions docs/elastirini.md
@@ -4,17 +4,17 @@ Anserini provides code for indexing into an ELK stack, thus providing interopera

## Deploying Elasticsearch Locally

From the [Elasticsearch](http://elastic.co/start), download the correct distribution for you platform to the `anserini/` directory.
From [here](http://elastic.co/start), download the latest Elasticsearch distribution for your platform to the `anserini/` directory (currently, v8.1.0).

Unpacking:

```
```bash
mkdir elastirini && tar -zxvf elasticsearch*.tar.gz -C elastirini --strip-components=1
```

Start running:

```
```bash
elastirini/bin/elasticsearch
```

@@ -39,23 +39,33 @@ Now, we can start indexing through Elastirini.
Here, instead of passing in `-index` (to index with Lucene directly), we use `-es` for Elasticsearch:

```bash
sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator DefaultLuceneDocumentGenerator \
-es -es.index robust04 -threads 16 -input /path/to/disk45 -storePositions -storeDocvectors -storeRaw
sh target/appassembler/bin/IndexCollection \
-collection TrecCollection \
-input /path/to/disk45 \
-generator DefaultLuceneDocumentGenerator \
-es \
-es.index robust04 \
-threads 8 \
-storePositions -storeDocvectors -storeRaw
```

We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.
Run the following command to reproduce Anserini BM25 retrieval:

```bash
sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index robust04 \
sh target/appassembler/bin/SearchElastic \
-topics src/main/resources/topics-and-qrels/topics.robust04.txt \
-topicreader Trec -es.index robust04 \
-output runs/run.es.robust04.bm25.topics.robust04.txt
```

To evaluate effectiveness:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust04.txt runs/run.es.robust04.bm25.topics.robust04.txt
$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \
src/main/resources/topics-and-qrels/qrels.robust04.txt \
runs/run.es.robust04.bm25.topics.robust04.txt

map all 0.2531
P_30 all 0.3102
```
@@ -73,26 +83,37 @@ cat src/main/resources/elasticsearch/index-config.core18.json \
Indexing:

```bash
sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -generator WashingtonPostGenerator \
-es -es.index core18 -threads 8 -input /path/to/WashingtonPost -storePositions -storeDocvectors -storeContents
sh target/appassembler/bin/IndexCollection \
-collection WashingtonPostCollection \
-input /path/to/WashingtonPost \
-generator WashingtonPostGenerator \
-es \
-es.index core18 \
-threads 8 \
-storePositions -storeDocvectors -storeContents
```

We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.
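
Rather than waiting a fixed amount of time, one option is to ask Elasticsearch to refresh the index and then check how many documents are visible. The sketch below assumes Elasticsearch is reachable without authentication at `http://localhost:9200` and that the index is named `core18`; the `ES_URL` variable and the helper function names are illustrative and not part of Anserini.

```bash
# Base URL of the local Elasticsearch node (assumed default; override via ES_URL).
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the endpoint URL for an index and an API name, e.g. _refresh or _count.
es_endpoint() {
  printf '%s/%s/%s\n' "$ES_URL" "$1" "$2"
}

# Force a refresh so recently indexed documents become searchable,
# then print the number of documents currently visible in the index.
es_refresh_and_count() {
  curl -s -XPOST "$(es_endpoint "$1" _refresh)" > /dev/null
  curl -s "$(es_endpoint "$1" _count)"
}
```

Calling `es_refresh_and_count core18` prints a JSON response containing a `count` field. If the count stops changing between calls, the index has likely caught up and retrieval can proceed.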

Retrieval:

```bash
sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index core18 \
sh target/appassembler/bin/SearchElastic \
-topics src/main/resources/topics-and-qrels/topics.core18.txt \
-topicreader Trec \
-es.index core18 \
-output runs/run.es.core18.bm25.topics.core18.txt
```

Evaluation:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.core18.txt runs/run.es.core18.bm25.topics.core18.txt
map all 0.2495
P_30 all 0.3567
$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 \
src/main/resources/topics-and-qrels/qrels.core18.txt \
runs/run.es.core18.bm25.topics.core18.txt

map all 0.2496
P_30 all 0.3573
```

## Indexing and Retrieval: MS MARCO Passage
@@ -108,23 +129,35 @@ cat src/main/resources/elasticsearch/index-config.msmarco-passage.json \
Indexing:

```bash
sh target/appassembler/bin/IndexCollection -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-es -es.index msmarco-passage -threads 9 -input /path/to/msmarco-passage -storePositions -storeDocvectors -storeRaw
sh target/appassembler/bin/IndexCollection \
-collection JsonCollection \
-input /path/to/msmarco-passage \
-generator DefaultLuceneDocumentGenerator \
-es \
-es.index msmarco-passage \
-threads 8 \
-storePositions -storeDocvectors -storeRaw
```

We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.

Retrieval:

```bash
sh target/appassembler/bin/SearchElastic -topicreader TsvString -es.index msmarco-passage \
-topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output runs/run.es.msmacro-passage.txt
sh target/appassembler/bin/SearchElastic \
-topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
-topicreader TsvString \
-es.index msmarco-passage \
-output runs/run.es.msmacro-passage.txt
```

Evaluation:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.es.msmacro-passage.txt
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map \
src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
runs/run.es.msmacro-passage.txt

map all 0.1956
recall_1000 all 0.8573
```
@@ -142,44 +175,61 @@ cat src/main/resources/elasticsearch/index-config.msmarco-doc.json \
Indexing:

```bash
sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection -generator DefaultLuceneDocumentGenerator \
-es -es.index msmarco-doc -threads 1 -input /path/to/msmarco-doc -storePositions -storeDocvectors -storeRaw
sh target/appassembler/bin/IndexCollection \
-collection JsonCollection \
-input /path/to/msmarco-doc \
-generator DefaultLuceneDocumentGenerator \
-es \
-es.index msmarco-doc \
-threads 8 \
-storePositions -storeDocvectors -storeRaw
```

We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.

Retrieval:

```bash
sh target/appassembler/bin/SearchElastic -topicreader TsvInt -es.index msmarco-doc \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt -output runs/run.es.msmacro-doc.txt
sh target/appassembler/bin/SearchElastic \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-topicreader TsvInt \
-es.index msmarco-doc \
-output runs/run.es.msmarco-doc.txt
```

This can potentially take longer than `SearchCollection` with Lucene indexes.

Evaluation:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.es.msmacro-doc.txt
map all 0.2308
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map \
src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
runs/run.es.msmarco-doc.txt

map all 0.2307
recall_1000 all 0.8856
```

## Elasticsearch Integration Test

We have an end-to-end integration testing script `run_es_regression.py` for [Robust04](regressions-robust04.md), [Core18](regressions-core18.md), [MS MARCO passage](regressions-msmarco-passage.md) and [MS MARCO document](regressions-msmarco-doc.md):
We have an end-to-end integration testing script `run_es_regression.py` for [Robust04](regressions-disk45.md), [Core18](regressions-core18.md), [MS MARCO passage](regressions-msmarco-passage.md) and [MS MARCO document](regressions-msmarco-doc.md):

```
```bash
# Check if Elasticsearch server is on
python src/main/python/run_es_regression.py --ping

# Check if collection exists
python src/main/python/run_es_regression.py --check-index-exists [collection]

# Create collection if it does not exist
python src/main/python/run_es_regression.py --create-index [collection]

# Delete collection if it exists
python src/main/python/run_es_regression.py --delete-index [collection]

# Insert documents from input directory into collection
python src/main/python/run_es_regression.py --insert-docs [collection] --input [directory]

# Search and evaluate on collection
python src/main/python/run_es_regression.py --evaluate [collection]

@@ -191,14 +241,14 @@ For the `collection` meta-parameter, use `robust04`, `core18`, `msmarco-passage`

## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`d5ee069`](https://github.com/castorini/anserini/commit/d5ee069399e6a306d7685bda756c1f19db721156)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-robust04.md)
+ Results reproduced by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-26 (commit [`7b76dfb`](https://github.com/castorini/anserini/commit/7b76dfbea7e0c01a3a5dc13e74f54852c780ec9b)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-robust04.md)
+ Results reproduced by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`07a9b05`](https://github.com/castorini/anserini/commit/07a9b053173637e15be79de4e7fce4d5a93d04fe)) for [MS Marco Passage](regressions-msmarco-passage.md), [Robust04](regressions-robust04.md) and [Core18](regressions-core18.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
+ Results reproduced by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`d5ee069`](https://github.com/castorini/anserini/commit/d5ee069399e6a306d7685bda756c1f19db721156)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-disk45.md)
+ Results reproduced by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-26 (commit [`7b76dfb`](https://github.com/castorini/anserini/commit/7b76dfbea7e0c01a3a5dc13e74f54852c780ec9b)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-disk45.md)
+ Results reproduced by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`07a9b05`](https://github.com/castorini/anserini/commit/07a9b053173637e15be79de4e7fce4d5a93d04fe)) for [MS Marco Passage](regressions-msmarco-passage.md), [Robust04](regressions-disk45.md) and [Core18](regressions-core18.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
+ Results reproduced by [@shaneding](https://github.com/shaneding) on 2020-05-25 (commit [`1de3274`](https://github.com/castorini/anserini/commit/1de3274b057a63382534c5277ffcd772c3fc0d43)) for [MS Marco Passage](regressions-msmarco-passage.md)
+ Results reproduced by [@adamyy](https://github.com/adamyy) on 2020-05-29 (commit [`94893f1`](https://github.com/castorini/anserini/commit/94893f170e047d77c3ef5b8b995d7fbdd13f4298)) for [MS MARCO Passage](regressions-msmarco-passage.md), [MS MARCO Document](experiments-msmarco-doc.md)
+ Results reproduced by [@YimingDou](https://github.com/YimingDou) on 2020-05-29 (commit [`2947a16`](https://github.com/castorini/anserini/commit/2947a1622efae35637b83e321aba8e6fccd43489)) for [MS MARCO Passage](regressions-msmarco-passage.md)
+ Results reproduced by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04](regressions-robust04.md), [Core18](regressions-core18.md), and [MS MARCO Passage](regressions-msmarco-passage.md)
+ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-10 (commit [`e19755`](https://github.com/castorini/anserini/commit/e19755b5fa976127830597bc9fbca203b9f5ad24)), all commands and end-to-end regression script for all four collections
+ Results reproduced by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04](regressions-disk45.md), [Core18](regressions-core18.md), and [MS MARCO Passage](regressions-msmarco-passage.md)
+ Results reproduced by [@lintool](https://github.com/lintool) on 2020-11-10 (commit [`e19755b`](https://github.com/castorini/anserini/commit/e19755b5fa976127830597bc9fbca203b9f5ad24)), all commands and end-to-end regression script for all four collections
+ Results reproduced by [@jrzhang12](https://github.com/jrzhang12) on 2021-01-02 (commit [`be4e44d`](https://github.com/castorini/anserini/commit/02c52ee606ba0ebe32c130af1e26d24d8f10566a)) for [MS MARCO Passage](regressions-msmarco-passage.md)
+ Results reproduced by [@tyao-t](https://github.com/tyao-t) on 2022-01-13 (commit [`06fb4f9`](https://github.com/castorini/anserini/commit/06fb4f9947ff2167c276d8893287453af7680786)) for [MS MARCO Passage](regressions-msmarco-passage.md) and [MS MARCO Document](regressions-msmarco-doc.md)
+ Results reproduced by [@d1shs0ap](https://github.com/d1shs0ap) on 2022-01-21 (commit [`a81299e5`](https://github.com/castorini/anserini/commit/a81299e59eff24512d635e0d49fba6e373286469)) for [MS MARCO Document](regressions-msmarco-doc.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
+ Results reproduced by [@d1shs0ap](https://github.com/d1shs0ap) on 2022-01-21 (commit [`a81299e`](https://github.com/castorini/anserini/commit/a81299e59eff24512d635e0d49fba6e373286469)) for [MS MARCO Document](regressions-msmarco-doc.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
37 changes: 24 additions & 13 deletions docs/experiments-cord19-extras.md
@@ -16,7 +16,7 @@ python src/main/python/trec-covid/index_cord19.py --date 2020-07-16 --download

## Solr

From the Solr [archives](https://archive.apache.org/dist/lucene/solr/), download the Solr (non `-src`) version that matches Anserini's [Lucene version](https://github.com/castorini/anserini/blob/master/pom.xml#L36) to the `anserini/` directory.
Download the latest Solr version (binary release) from [here](https://solr.apache.org/downloads.html) and extract the archive (currently, v8.11.1):

Extract the archive:

@@ -62,10 +62,15 @@ solrini/bin/solr create -n anserini -c cord19
We can now index into Solr:

```bash
sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \
-threads 8 -input collections/cord19-2020-07-16 \
-solr -solr.index cord19 -solr.zkUrl localhost:9983 \
-storePositions -storeDocvectors -storeContents -storeRaw
sh target/appassembler/bin/IndexCollection \
-collection Cord19AbstractCollection \
-input collections/cord19-2020-07-16 \
-generator Cord19Generator \
-solr \
-solr.index cord19 \
-solr.zkUrl localhost:9983 \
-threads 8 \
-storePositions -storeDocvectors -storeContents -storeRaw
```

Once indexing is complete, you can query in Solr at [`http://localhost:8983/solr/#/cord19/query`](http://localhost:8983/solr/#/cord19/query).
@@ -74,8 +79,7 @@ You'll need to make sure your query is searching the `contents` field, so the qu

## Elasticsearch + Kibana

From the [Elasticsearch](http://elastic.co/start), download the correct distribution for your platform to the `anserini/` directory.
These instructions below work with version 7.10.0.
From [here](http://elastic.co/start), download the latest Elasticsearch and Kibana distributions for your platform to the `anserini/` directory (currently, v8.1.0).

First, unpack and deploy Elasticsearch:

@@ -95,8 +99,9 @@ Elasticsearch has a built-in safeguard to disable indexing if you're running low
The error is something like "flood stage disk watermark [95%] exceeded on ..." with indexes placed into readonly mode.
Obviously, be careful, but if you're sure things are going to be okay and you won't run out of disk space, disable the safeguard as follows:

```
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
```bash
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings \
-d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
```
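
Once disk space is no longer a concern, it is probably wise to undo this change. The sketch below shows one way to do so, assuming the same unauthenticated `http://localhost:9200` endpoint: setting a transient cluster setting to `null` removes the override, and setting `index.blocks.read_only_allow_delete` to `null` clears the read-only flag on indexes that hit the flood stage. The variable and function names are illustrative.

```bash
# Base URL of the local Elasticsearch node (assumed default; override via ES_URL).
ES_URL="${ES_URL:-http://localhost:9200}"

# A null value removes the transient override, restoring the default safeguard.
RESET_THRESHOLD='{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": null } }'

# A null value clears the read-only block that the flood-stage watermark applies.
CLEAR_READONLY='{ "index.blocks.read_only_allow_delete": null }'

# Clear the read-only block on all indexes, then re-enable the disk safeguard.
restore_disk_safeguard() {
  curl -s -XPUT -H "Content-Type: application/json" "$ES_URL/_all/_settings" -d "$CLEAR_READONLY"
  curl -s -XPUT -H "Content-Type: application/json" "$ES_URL/_cluster/settings" -d "$RESET_THRESHOLD"
}
```

Run `restore_disk_safeguard` after indexing completes. If the disk is still above the flood-stage watermark, Elasticsearch may put the indexes back into read-only mode.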

Set up the proper schema using [this config](../src/main/resources/elasticsearch/index-config.cord19.json):
@@ -109,16 +114,22 @@ cat src/main/resources/elasticsearch/index-config.cord19.json \
Indexing abstracts:

```bash
sh target/appassembler/bin/IndexCollection -collection Cord19AbstractCollection -generator Cord19Generator \
-es -es.index cord19 -threads 8 -input collections/cord19-2020-07-16 -storePositions -storeDocvectors -storeContents -storeRaw
sh target/appassembler/bin/IndexCollection \
-collection Cord19AbstractCollection \
-input collections/cord19-2020-07-16 \
-generator Cord19Generator \
-es \
-es.index cord19 \
-threads 8 \
-storePositions -storeDocvectors -storeContents -storeRaw
```

We are now able to access interactive search and visualization capabilities from Kibana at [`http://localhost:5601/`](http://localhost:5601).

Here's an example:

1. Click on the hamburger icon, then click "Discover" under "Analytics".
2. Create "Index Pattern": set the index pattern to `cord19`, and use `publish_time` as the timestamp field.
1. Click on the hamburger icon, then click "Dashboard" under "Analytics".
2. Create "Data View": set the index pattern to `cord19`, and use `publish_time` as the timestamp field.
3. Go back to "Discover" under "Analytics"; now run a search, e.g., "incubation period". Be sure to expand the date, which is a dropdown box to the right of the search box; something like "Last 10 years" works well.
4. You should be able to see search results as well as a histogram of the dates on which those articles were published!
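
The same search can also be tried from the command line as a quick sanity check that the index is populated. This is a minimal sketch using the Elasticsearch `_search` API; it assumes an unauthenticated `http://localhost:9200` endpoint and that the searchable text lives in a field named `contents` (an assumption about the index schema). The helper names are illustrative.

```bash
# Base URL of the local Elasticsearch node (assumed default; override via ES_URL).
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a match query body: field name, query text, number of hits.
# The query text must not contain double quotes, since no escaping is done here.
match_query() {
  printf '{ "query": { "match": { "%s": "%s" } }, "size": %d }\n' "$1" "$2" "$3"
}

# Send the query to the cord19 index and print the raw JSON response.
search_cord19() {
  curl -s -H "Content-Type: application/json" "$ES_URL/cord19/_search" \
    -d "$(match_query contents "$1" 5)"
}
```

For example, `search_cord19 "incubation period"` returns the top five hits as JSON.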
