This project evaluates the retrieval quality of Tantivy, Elasticsearch, and Apache Lucene using NDCG.
The project is based on BEIR. Indexing and retrieval are performed by Tantivy and Lucene in their respective Rust and Java environments.
Name | Engine | Tokeniser | BM25 settings | Query style |
---|---|---|---|---|
Tantivy default | Tantivy | Default | Default (k1=1.2, b=0.75) | Multifield |
Tantivy English | Tantivy | English stemming and stopwords | Default (k1=1.2, b=0.75) | Multifield |
Tantivy disjunction max | Tantivy | English stemming and stopwords | Default (k1=1.2, b=0.75) | Disjunction max (tie_breaker=0.5) |
Apache Lucene default | Apache Lucene | Default | Default (k1=1.2, b=0.75) | Multifield |
Elasticsearch default | Elasticsearch | Default | Default (k1=1.2, b=0.75) | Disjunction max (tie_breaker=0.5) |
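
The two query styles combine per-field BM25 scores differently: a multifield (boolean OR) query sums the scores of all matching fields, while disjunction max takes the best-scoring field and adds `tie_breaker` times each remaining field's score. A minimal Python sketch of that combining step, with made-up field scores:

```python
# Sketch of combining per-field BM25 scores under each query style.
# These field scores are made up for illustration; in practice they
# come from the engine's BM25 scorer.
field_scores = {"title": 7.2, "text": 3.1}

# Multifield (boolean OR over fields): matching field scores add up.
multifield = sum(field_scores.values())  # 10.3

# Disjunction max: the best field wins, the rest contribute
# tie_breaker times their score.
tie_breaker = 0.5
ranked = sorted(field_scores.values(), reverse=True)
dismax = ranked[0] + tie_breaker * sum(ranked[1:])  # 7.2 + 0.5 * 3.1 = 8.75

print(multifield, dismax)
```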
Retrieval results are exported as TSV files, which are then scored with pytrec_eval. This approach lets us manually examine the search output and ensures every engine is scored by the same code base.
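
For reference, the pytrec_eval side boils down to two nested dictionaries, one for the qrels and one for the run. A minimal sketch with illustrative IDs and scores (the real ones are parsed from the BEIR qrels and the exported TSVs):

```python
import pytrec_eval

# Illustrative qrels and run; the real ones come from the BEIR
# qrels TSV and the exported result TSVs.
qrels = {"q1": {"doc1": 1, "doc2": 0}}
run = {"q1": {"doc1": 11.3, "doc2": 9.8, "doc3": 4.2}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg"})
per_query = evaluator.evaluate(run)

# Average NDCG across queries.
print(sum(q["ndcg"] for q in per_query.values()) / len(per_query))
```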
Evaluation datasets are available in the BEIR GitHub repository.
Dataset | Tantivy default | Tantivy English | Tantivy disjunction max | Apache Lucene default | BEIR BM25 multifield | Elasticsearch 8.12.0 default |
---|---|---|---|---|---|---|
SciFact | 0.6111 | 0.6467 | 0.6905 | 0.6106 | 0.665 | 0.6906 |
NFCorpus | 0.3178 | 0.3294 | 0.3429 | 0.3179 | 0.325 | 0.3428 |
TREC-COVID | 0.4233 | 0.4190 | 0.6796 | 0.4244 | 0.656 | 0.6880 |
NQ | 0.3018 | 0.3133 | 0.3245 | 0.3017 | 0.329 | 0.3261 |
- This project is built in a Linux container because pytrec_eval does not install cleanly with pip on Windows. If you prefer to run it in your local environment, make sure you have:
  - Python 3.9
  - a recent Cargo toolchain
  - a recent OpenJDK (Java) and Gradle
- Download and unzip a dataset into the `./data` folder. For instance, if you choose the SciFact dataset, your folder should look like this:
  ```
  data
    scifact
      corpus.jsonl
      queries.jsonl
      qrels
        test.tsv
        dev.tsv
  ```
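
  These files follow the standard BEIR layout: `corpus.jsonl` holds one JSON document per line with `_id`, `title`, and `text` fields; `queries.jsonl` holds `_id` and `text`; and the qrels TSVs carry a `query-id`, `corpus-id`, `score` header. A minimal loader sketch assuming that layout:

  ```python
  import csv
  import json

  def load_jsonl(path):
      with open(path, encoding="utf-8") as f:
          return [json.loads(line) for line in f]

  corpus = load_jsonl("data/scifact/corpus.jsonl")    # _id, title, text
  queries = load_jsonl("data/scifact/queries.jsonl")  # _id, text

  # qrels/test.tsv has a header row: query-id, corpus-id, score.
  qrels = {}
  with open("data/scifact/qrels/test.tsv", encoding="utf-8") as f:
      for row in csv.DictReader(f, delimiter="\t"):
          qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])

  print(len(corpus), len(queries), len(qrels))
  ```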
- Run the following steps to generate results for the Tantivy retrieval task. For instance, to run retrieval on the SciFact corpus:
  ```
  cd tantivy-retrieval
  cargo update
  cargo run -- scifact
  ```
  For a retrieval task using the disjunction max query:
  ```
  cd tantivy-retrieval
  cargo update
  cargo run -- scifact dismax
  ```
- If the run succeeds, a new file called `result_tantivy.tsv` will be created in the dataset folder.
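
  To sanity-check the output before scoring, it can be loaded into the nested run dictionary that pytrec_eval expects. The column order assumed below (query id, document id, score) is a guess; inspect the actual file before relying on it:

  ```python
  import csv

  # Build the run dictionary pytrec_eval expects. The column order
  # (query id, document id, score) is an assumption about the TSV --
  # inspect the actual file before relying on it.
  run = {}
  with open("data/scifact/result_tantivy.tsv", encoding="utf-8") as f:
      for query_id, doc_id, score in csv.reader(f, delimiter="\t"):
          run.setdefault(query_id, {})[doc_id] = float(score)

  print(len(run), "queries in the run")
  ```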
- Run the following steps to generate results for the Lucene retrieval task. For instance, to run retrieval on the SciFact corpus:
  ```
  cd lucene-retrieval
  ./gradlew run --args="scifact"
  ```
- The result of the run will be written to the dataset folder as `result_lucene.tsv`.
- Download and install the self-managed version of Elasticsearch for your platform.
- Update the Elasticsearch connection details in `main.py`.
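
  With the 8.x Python client, the connection setup typically looks like the sketch below; the host, password, and certificate path are placeholders for your own install's values:

  ```python
  from elasticsearch import Elasticsearch

  # Placeholder connection details -- substitute the host, password,
  # and CA certificate path from your own Elasticsearch install.
  es = Elasticsearch(
      "https://localhost:9200",
      basic_auth=("elastic", "<your-password>"),
      ca_certs="./http_ca.crt",
  )
  print(es.info())
  ```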
- Set up the environment:
  ```
  cd elasticsearch-retrieval
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirement.txt
  ```
- Run the retrieval task. For instance, to run retrieval on the SciFact corpus:
  ```
  python3 main.py scifact
  ```
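
  Per the configuration table above, this run issues disjunction max queries. A sketch of what such a query looks like through the Python client; the index and field names (`scifact`, `title`, `text`) are assumptions about how `main.py` indexes the corpus:

  ```python
  from elasticsearch import Elasticsearch

  # Reuse the connection details from the sketch above.
  es = Elasticsearch("https://localhost:9200",
                     basic_auth=("elastic", "<your-password>"),
                     ca_certs="./http_ca.crt")

  # Disjunction max over title and text with tie_breaker=0.5; the
  # index and field names are assumptions, not confirmed by main.py.
  response = es.search(
      index="scifact",
      query={
          "dis_max": {
              "queries": [
                  {"match": {"title": "example query text"}},
                  {"match": {"text": "example query text"}},
              ],
              "tie_breaker": 0.5,
          }
      },
      size=100,
  )
  for hit in response["hits"]["hits"]:
      print(hit["_id"], hit["_score"])
  ```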
- Run the following steps to create a Python virtual environment and install the necessary packages:
  ```
  cd evaluation
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirement.txt
  ```
- Run the evaluation script. For instance, to evaluate Tantivy's performance on the SciFact corpus:
  ```
  python main.py scifact tantivy
  ```