triandco/tantivy-pytrec-eval

Repository files navigation

Introduction

This project evaluates Tantivy, Elasticsearch, and Apache Lucene retrieval quality using NDCG.

This project is based on BEIR. Indexing and retrieval are performed by Tantivy and Lucene in their respective Rust and Java environments.

Retrieval task configuration

| Name | Engine | Tokeniser | BM25 settings | Query style |
|---|---|---|---|---|
| Tantivy default | Tantivy | Default | Default (K1=1.2, B=0.75) | Multifield |
| Tantivy english | Tantivy | English stem and stopword | Default (K1=1.2, B=0.75) | Multifield |
| Tantivy disjunction max | Tantivy | English stem and stopword | Default (K1=1.2, B=0.75) | Disjunction max (tie_breaker=0.5) |
| Apache Lucene default | Apache Lucene | Default | Default (K1=1.2, B=0.75) | Multifield |
| Elastic Search default | Elasticsearch | Default | Default (K1=1.2, B=0.75) | Disjunction max (tie_breaker=0.5) |
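
The two query styles in the table combine per-field BM25 scores differently: a multifield query roughly sums the field scores (Lucene boolean-OR behaviour), while disjunction max keeps the best field score and lets the other fields contribute only through the tie_breaker. A minimal sketch of the arithmetic (field names and scores are made up for illustration):

```python
def multifield(field_scores):
    # Boolean-OR style: every matching field adds its full BM25 score.
    return sum(field_scores.values())

def dismax(field_scores, tie_breaker=0.5):
    # Disjunction max: the best field dominates, the rest are damped by tie_breaker.
    best = max(field_scores.values())
    return best + tie_breaker * (sum(field_scores.values()) - best)

scores = {"title": 7.2, "text": 3.1}    # hypothetical per-field BM25 scores
print(multifield(scores))                # 10.3
print(dismax(scores, tie_breaker=0.5))   # 7.2 + 0.5 * 3.1 = 8.75
```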

Retrieval results are exported as TSV files, which are then scored with pytrec_eval. This approach lets us manually examine the search output and ensures every engine's performance is scored by the same code base.
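
A rough sketch of how such a run can be scored with pytrec_eval (the column layout assumed for the exported run TSV is an assumption; adjust the parsing to match the real files):

```python
import csv
import pytrec_eval

def read_qrels(path):
    # BEIR qrels are TSV with a header line: query-id, corpus-id, score.
    qrels = {}
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels

def read_run(path):
    # Assumed run layout: query_id <tab> doc_id <tab> score, one line per retrieved document.
    run = {}
    with open(path) as f:
        for query_id, doc_id, score in csv.reader(f, delimiter="\t"):
            run.setdefault(query_id, {})[doc_id] = float(score)
    return run

qrels = read_qrels("data/scifact/qrels/test.tsv")
run = read_run("data/scifact/result_tantivy.tsv")

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
per_query = evaluator.evaluate(run)
ndcg_10 = sum(m["ndcg_cut_10"] for m in per_query.values()) / len(per_query)
print(f"NDCG@10: {ndcg_10}")
```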

Evaluation datasets are available in the BEIR GitHub repository.

NDCG@10 results

| Dataset | Tantivy default | Tantivy English | Tantivy disjunction max | Apache Lucene default | Beir BM25 Multifield | Elastic Search 8.12.0 default |
|---|---|---|---|---|---|---|
| Scifact | 0.6110550406527024 | 0.6466632511040359 | 0.6905108331112109 | 0.6105774540257333 | 0.665 | 0.690638173453613 |
| NFCorpus | 0.31783463374400994 | 0.32944073593202805 | 0.3429044899000987 | 0.31788159965582696 | 0.325 | 0.34281013102961966 |
| TREC-COVID | 0.42327118043942563 | 0.41898217050240794 | 0.6796090083796931 | 0.42438665909618467 | 0.656 | 0.6880298232606303 |
| NQ | 0.30181710921729077 | 0.3132771880207455 | 0.32453876742895865 | 0.30170174644291564 | 0.329 | 0.3260731485135678 |

Running evaluation

1. Prerequisites

  1. This project is built in a Linux container because pytrec_eval does not install cleanly with pip on Windows. If you prefer to run it in your local environment, make sure you have:
    • Python 3.9
    • latest Rust toolchain (cargo)
    • latest Java (OpenJDK) and Gradle
  2. Download and unzip a dataset into the ./data folder. For instance, if you choose the Scifact dataset, your folder should look like the layout below (a sketch of reading these files follows the layout).
data
    scifact
        corpus.jsonl
        queries.jsonl
        qrels
            test.tsv
            dev.tsv
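
These files follow the standard BEIR layout: corpus.jsonl and queries.jsonl hold one JSON object per line, and the qrels are tab-separated relevance judgements. A minimal sketch of loading them (field names follow the usual BEIR schema and should be checked against the actual files):

```python
import json

def load_jsonl(path):
    # One JSON object per line.
    with open(path) as f:
        return [json.loads(line) for line in f]

# corpus.jsonl: documents, e.g. {"_id": "...", "title": "...", "text": "..."}
corpus = {doc["_id"]: doc for doc in load_jsonl("data/scifact/corpus.jsonl")}

# queries.jsonl: queries, e.g. {"_id": "...", "text": "..."}
queries = {q["_id"]: q["text"] for q in load_jsonl("data/scifact/queries.jsonl")}

# qrels/test.tsv and qrels/dev.tsv are TSV files with a header line:
# query-id, corpus-id, score (one judgement per line).
```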

2. Running the Tantivy retrieval task

  1. Run the following commands to generate results for the Tantivy retrieval task. For instance, to run retrieval on the Scifact corpus:
cd tantivy-retrieval
cargo update
cargo run -- scifact

For a retrieval task using the disjunction max query:

cd tantivy-retrieval
cargo update
cargo run -- scifact dismax
  2. If the run succeeds, a new file called result_tantivy.tsv will be created in the dataset folder.

3. Running the Lucene retrieval task

  1. Run the following commands to generate results for the Lucene retrieval task. For instance, to run retrieval on the Scifact corpus:
cd lucene-retrieval
./gradlew run --args="scifact"
  2. The result of the run will be written to the dataset folder as result_lucene.tsv.

4. Running the Elasticsearch retrieval task

  1. Download and install the self-managed version of Elasticsearch for your platform.
  2. Update the Elasticsearch connection details in main.py (a sketch of the query shape used appears after these steps).
  3. Set up the environment:
cd elasticsearch-retrieval
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirement.txt
  4. Run the retrieval task. For instance, to run Elasticsearch retrieval on the Scifact corpus:
python3 main.py scifact
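
For reference, the Elasticsearch configuration in the table above uses a dis_max query with tie_breaker=0.5. A hedged sketch of what such a query looks like with the official Python client (the connection URL, index name, field names, and example query text are assumptions; the real values live in main.py):

```python
from elasticsearch import Elasticsearch

# Connection details are placeholders; match them to your local installation.
es = Elasticsearch("http://localhost:9200")

def search(index, query_text, size=10):
    # dis_max keeps the best-scoring field match and adds the other
    # field matches scaled by tie_breaker (0.5, as in the table above).
    return es.search(
        index=index,
        size=size,
        query={
            "dis_max": {
                "queries": [
                    {"match": {"title": query_text}},
                    {"match": {"text": query_text}},
                ],
                "tie_breaker": 0.5,
            }
        },
    )

hits = search("scifact", "example query text")["hits"]["hits"]
for hit in hits:
    print(hit["_id"], hit["_score"])
```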

5. Running evaluation

  1. Run the following commands to create a Python virtual environment and install the necessary packages:
cd evaluation
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirement.txt
  2. Run the evaluation script. For instance, to evaluate Tantivy performance on the Scifact corpus:
python main.py scifact tantivy
