# Ingesting CORD-19 into Solr and Elasticsearch

This document describes how to ingest the COVID-19 Open Research Dataset (CORD-19) from the Allen Institute for AI into Solr and Elasticsearch. If you want to build or download Lucene indexes for CORD-19, see this guide.

## Getting the Data

Follow the instructions here to get access to the data. This version of the guide has been verified to work with the 2020/07/16 release, which is the corpus used in round 5 of the TREC-COVID challenge.

Download the corpus using our script:

```bash
python src/main/python/trec-covid/index_cord19.py --date 2020-07-16 --download
```
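If the download succeeds, the corpus should land under `collections/`, matching the `-input` path used by the indexing commands below. A quick sanity check (the exact file listing may vary by release):

```bash
ls collections/cord19-2020-07-16
```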

## Solr

Download the latest Solr version (binary release) from here (currently, v8.11.1) and extract the archive:

```bash
mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1
```

Start Solr (adjust memory usage with `-m` as appropriate):

```bash
solrini/bin/solr start -c -m 8G
```

Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:

```bash
pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
```

Solr should now be available at http://localhost:8983/ for browsing.
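To confirm from the command line that Solr is up (an optional check; this hits Solr's standard system-info admin endpoint):

```bash
curl 'http://localhost:8983/solr/admin/info/system?wt=json'
```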

Next, create the collection:

```bash
solrini/bin/solr create -n anserini -c cord19
```

Adjust the schema (if there are errors, follow the instructions below):

```bash
curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/cord19.json \
 http://localhost:8983/solr/cord19/schema
```

Note: If there are errors from field conflicts, you'll need to reset the configset and recreate the collection:

```bash
solrini/bin/solr delete -c cord19
pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
solrini/bin/solr create -n anserini -c cord19
```

We can now index into Solr:

```bash
sh target/appassembler/bin/IndexCollection \
  -collection Cord19AbstractCollection \
  -input collections/cord19-2020-07-16 \
  -generator Cord19Generator \
  -solr \
  -solr.index cord19 \
  -solr.zkUrl localhost:9983 \
  -threads 8 \
  -storePositions -storeDocvectors -storeContents -storeRaw
```

Once indexing is complete, you can query in Solr at http://localhost:8983/solr/#/cord19/query.

You'll need to make sure your query is searching the `contents` field, so the query should look something like `contents:"incubation period"`.
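The same query can also be issued from the command line via Solr's standard select API (a minimal sketch; adjust `rows` as needed):

```bash
curl 'http://localhost:8983/solr/cord19/select?q=contents:%22incubation%20period%22&rows=10'
```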

## Elasticsearch + Kibana

From here, download the latest Elasticsearch and Kibana distributions for your platform to the `anserini/` directory (currently, v8.1.0).

First, unpack and deploy Elasticsearch:

```bash
mkdir elastirini && tar -zxvf elasticsearch*.tar.gz -C elastirini --strip-components=1
elastirini/bin/elasticsearch
```

Unpack and deploy Kibana:

```bash
tar -zxvf kibana*.tar.gz -C elastirini --strip-components=1
elastirini/bin/kibana
```

Elasticsearch has a built-in safeguard that disables indexing if you're running low on disk space. The error looks something like "flood stage disk watermark [95%] exceeded on ...", and indexes are placed into read-only mode. Obviously, be careful, but if you're sure you won't run out of disk space, disable the safeguard as follows:

```bash
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings \
  -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
```

Set up the proper schema using this config:

```bash
cat src/main/resources/elasticsearch/index-config.cord19.json \
 | curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' 'localhost:9200/cord19' -d @-
```
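To verify that the mapping was applied (an optional check via Elasticsearch's standard `_mapping` endpoint):

```bash
curl --user elastic:changeme 'localhost:9200/cord19/_mapping?pretty'
```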

Indexing abstracts:

```bash
sh target/appassembler/bin/IndexCollection \
  -collection Cord19AbstractCollection \
  -input collections/cord19-2020-07-16 \
  -generator Cord19Generator \
  -es \
  -es.index cord19 \
  -threads 8 \
  -storePositions -storeDocvectors -storeContents -storeRaw
```
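Once indexing completes, a quick sanity check on the number of indexed documents (the `_count` API is standard Elasticsearch):

```bash
curl --user elastic:changeme 'localhost:9200/cord19/_count?pretty'
```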

We are now able to access interactive search and visualization capabilities from Kibana at http://localhost:5601/.

Here's an example:

1. Click on the hamburger icon, then click "Dashboard" under "Analytics".
2. Create a "Data View": set the name to `cord19`, and use `publish_time` as the timestamp field. (Note, "Data Views" used to be called "Index Patterns".)
3. Go back to "Discover" under "Analytics" and run a search, e.g., "incubation period". Be sure to expand the date range, which is a dropdown box to the right of the search box; something like "Last 10 years" works well.
4. You should see search results as well as a histogram of the dates on which those articles were published!
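If you prefer to query Elasticsearch directly instead of going through Kibana, here's a minimal sketch against the standard `_search` API (the match query on the `contents` field mirrors the Solr example above):

```bash
curl --user elastic:changeme -XPOST -H 'Content-Type: application/json' \
  'localhost:9200/cord19/_search?pretty' \
  -d '{ "query": { "match": { "contents": "incubation period" } } }'
```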

## Reproduction Log*

+ Reproduced by @adamyy on 2020-05-29 (commit 2947a16) on CORD-19 release of 2020/05/26.
+ Reproduced by @yxzhu16 on 2020-07-17 (commit fad12be) on CORD-19 release of 2020/06/19.
+ Reproduced by @LizzyZhang-tutu on 2020-07-26 (commit fad12be) on CORD-19 release of 2020/07/25.
+ Reproduced by @lintool on 2020-11-23 (commit 746447a) on CORD-19 release of 2020/07/16 with Solr v8.3.0 and ES/Kibana v7.10.0.
+ Reproduced by @lintool on 2021-11-02 (commit cb0c44c) on CORD-19 release of 2020/07/16 with Solr v8.10.1 and ES/Kibana v7.15.1.
+ Reproduced by @lintool on 2022-03-21 (commit 3d1fc34) on CORD-19 release of 2020/07/16 with Solr v8.11.1 and ES/Kibana v8.1.0.