A simple Lucene framework to get started with Information Retrieval experiments on TREC documents

gdebasis/luc4ir

Luc4IR

Luc4IR (pronounced Lucifer) is a Java implementation of sparse indexing and retrieval. The code is distributed in the hope that it will be useful for IR practitioners and students who want to get started with retrieving documents from a collection and measuring effectiveness with standard evaluation metrics.

To index TREC document disks 4/5

Due to file size restrictions, the index could not be made available in this repository. To recreate the index, download the TREC disks 4/5 collection from here.

After downloading and unzipping the collection, build the index by executing the following script:

./index_trecd45 <path to the collection>

Alternatively, you may download a prebuilt index from this shared OneDrive folder.

For retrieval, simply run the script

./retrieve_trecd45.sh <INDEX-PATH> <QUERY FILE> <QRELS FILE>

which executes a series of queries from a TREC-formatted topic file (using the LM-Dir retrieval model, i.e. a language model with Dirichlet smoothing) and reports MAP (mean average precision).
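
For readers new to the metric, the sketch below shows one way MAP can be computed from ranked lists and relevance judgments. It is a minimal, plain-Java illustration and not the code used by `retrieve_trecd45.sh`; the class name `MapSketch`, its method names, and the toy document ids are all hypothetical. It assumes binary relevance and averages over every query that appears in the judgments.

```java
import java.util.*;

// Hypothetical illustration of MAP; not part of the Luc4IR code base.
public class MapSketch {

    // Average precision for one query: sum precision@k over each rank k
    // that holds a relevant document, then divide by the total number of
    // relevant documents for that query.
    public static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    // Mean of the per-query average precision. Assumption of this sketch:
    // a query with judgments but no retrieved list contributes 0.
    public static double meanAveragePrecision(Map<String, List<String>> runs,
                                              Map<String, Set<String>> qrels) {
        if (qrels.isEmpty()) return 0.0;
        double total = 0.0;
        for (Map.Entry<String, Set<String>> e : qrels.entrySet()) {
            List<String> ranked = runs.getOrDefault(e.getKey(), Collections.emptyList());
            total += averagePrecision(ranked, e.getValue());
        }
        return total / qrels.size();
    }

    public static void main(String[] args) {
        // Toy data: two queries with made-up document ids.
        Map<String, List<String>> runs = new HashMap<>();
        runs.put("q1", Arrays.asList("d1", "d2", "d3", "d4"));
        runs.put("q2", Arrays.asList("d5", "d6"));

        Map<String, Set<String>> qrels = new HashMap<>();
        qrels.put("q1", new HashSet<>(Arrays.asList("d1", "d3")));
        qrels.put("q2", new HashSet<>(Collections.singletonList("d6")));

        System.out.println("MAP = " + meanAveragePrecision(runs, qrels));
    }
}
```

In the toy data, q1 has relevant documents at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2, and q2 has its single relevant document at rank 2, giving 1/2.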

The repository also includes a small test collection, the ToucheV2 dataset. To run BM25 on it, execute the following commands, which build the index and run retrieval on 49 test queries. The result file, named touche.res, is saved in the project base folder and can then be evaluated with trec_eval.

./index_touche.sh 
./retrieve_touche.sh 
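
If you want to inspect the result file programmatically before passing it to trec_eval, a small reader along the following lines may help. This is a sketch that assumes touche.res follows the conventional six-column TREC run layout (`query-id Q0 doc-id rank score run-tag`); the class name `RunFileSketch`, the method `parseRun`, and the sample lines are hypothetical and not taken from the repository.

```java
import java.util.*;

// Hypothetical reader for a TREC-style run file; not part of Luc4IR.
public class RunFileSketch {

    // Groups document ids by query id, keeping the order in which the
    // lines appear. Assumed line layout (whitespace separated):
    //   <query-id> Q0 <doc-id> <rank> <score> <run-tag>
    public static Map<String, List<String>> parseRun(List<String> lines) {
        Map<String, List<String>> byQuery = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 6) continue; // skip blank or malformed lines
            byQuery.computeIfAbsent(fields[0], k -> new ArrayList<>()).add(fields[2]);
        }
        return byQuery;
    }

    public static void main(String[] args) {
        // Made-up sample lines; a real run would be read from touche.res,
        // e.g. with java.nio.file.Files.readAllLines.
        List<String> sample = Arrays.asList(
            "1 Q0 docA 1 12.5 demo",
            "1 Q0 docB 2 11.0 demo",
            "",
            "2 Q0 docC 1 9.7 demo");
        System.out.println(parseRun(sample));
    }
}
```

The per-query lists produced this way keep file order, so they are only meaningful as rankings if the file is already sorted by rank within each query.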
