The Information Retrieval Experiment Platform integrates ir_datasets, ir_measures, and PyTerrier with TIRA to promote more standardized, reproducible, and scalable retrieval experiments---and ultimately blinded experiments in IR. Standardization is achieved when the input and output of an experiment are compatible with ir_datasets and ir_measures, and the retrieval approach implements PyTerrier’s interfaces. However, none of this is required for reproducibility and scalability, as TIRA can run any dockerized software locally or remotely in a cloud-native execution environment. Version control and caching ensure efficient (re)execution. TIRA allows for blind evaluation when an experiment runs on a remote server/cloud not under the control of the experimenter. The test data and ground truth are then hidden from view, and the retrieval software has to process them in a sandbox that prevents data leaks.
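As a rough sketch of what this standardization looks like in practice (the dataset id `antique/test` and the run file `my-run.txt` are illustrative placeholders, not something the platform prescribes): qrels are loaded via ir_datasets, and effectiveness is computed with ir_measures on a run file in TREC format.

```python
# Hedged sketch: evaluate a TREC-format run against ir_datasets qrels with ir_measures.
# 'antique/test' and 'my-run.txt' are illustrative placeholders.
import ir_datasets
import ir_measures
from ir_measures import AP, nDCG

dataset = ir_datasets.load('antique/test')      # topics and qrels via ir_datasets
run = ir_measures.read_trec_run('my-run.txt')   # a system's output in TREC run format

# Aggregate effectiveness over all queries of the dataset
print(ir_measures.calc_aggregate([nDCG@10, AP], dataset.qrels_iter(), run))
```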
The platform currently includes 15 corpora (1.9 billion documents) on which 32 well-known shared tasks are based, as well as Docker images of 50 standard retrieval approaches. Within this setup, we were able to automatically run and evaluate the 50 approaches on the 32 tasks (1600 runs) in less than a week.
The hosted version of the IR Experiment Platform is open for submissions at https://www.tira.io/task-overview/ir-benchmarks.
All evaluations and analyses (including those reported in the paper) are located in the directory analysis-of-submissions.
Comparing the leaderboards across different tasks is quite interesting (the paper contains a large-scale evaluation of this). For example, compare MS MARCO DL 2019 with Antique or Args.me: on MS MARCO, deep learning models of all kinds occupy the top ranks, whereas this ordering is largely reversed on other corpora such as Args.me or Antique.
The current leaderboards can be viewed on tira.io:
- Antique
- Args.me 2020 Task 1
- Args.me 2021 Task 1
- Cranfield
- TREC COVID
- TREC Deep Learning 2019 (passage)
- TREC Deep Learning 2020 (passage)
- TREC Genomics 2004
- TREC Genomics 2005
- TREC 7
- TREC 8
- Robust04
- TREC Web Track 2002 (gov)
- TREC Web Track 2003 (gov)
- TREC Web Track 2004 (gov)
- TREC Web Track 2009 (ClueWeb09)
- TREC Web Track 2010 (ClueWeb09)
- TREC Web Track 2011 (ClueWeb09)
- TREC Web Track 2012 (ClueWeb09)
- TREC Web Track 2013 (ClueWeb12)
- TREC Web Track 2014 (ClueWeb12)
- Touché 2020 Task 2 (ClueWeb12)
- Touché 2021 Task 2 (ClueWeb12)
- Touché 2023 Task 2 (ClueWeb22) (Task is still ongoing, so the leaderboard is not yet public)
- TREC Terabyte 2004 (gov2)
- TREC Terabyte 2005 (gov2)
- TREC Terabyte 2006 (gov2)
- NFCorpus
- Vaswani
- TREC Core 2018 (wapo)
- TREC Precision Medicine 2017
- TREC Precision Medicine 2018
All datasets from the main branch of ir_datasets are supported by default.
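For example, the inputs of any supported benchmark can be inspected locally through the ir_datasets API (a small sketch; `vaswani` is just one identifier from the ir_datasets catalogue):

```python
# Small sketch: inspect a supported dataset locally via ir_datasets ('vaswani' is an example id).
import ir_datasets

dataset = ir_datasets.load('vaswani')

print(dataset.docs_count(), 'documents')   # corpus size
print(next(dataset.queries_iter()))        # first topic (query_id, text)
print(next(dataset.qrels_iter()))          # first relevance judgment
```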
A tutorial showing how new, potentially work-in-progress data can be imported is available at ir-datasets/tutorial.
The hosted version of the IR Experiment Platform is open for submissions at https://www.tira.io/task-overview/ir-benchmarks. To simplify submissions, we provide several starters (yielding 50 different retrieval models) that you can use as a starting point.
After a run has been unblinded and published by an organizer, it becomes visible on the leaderboard (here, as an example, the top entries by nDCG@10 for ClueWeb09):
Examples of reproducibility experiments are available in the directory reproducibility-experiments. The main advantage of the IR Experiment Platform is that, after a shared task, the complete shared task repository can be archived in a fully self-contained archive (including all software, runs, etc.). The repository https://github.com/tira-io/ir-experiment-platform-benchmarks contains such an archived shared task repository, covering more than 50 retrieval systems on more than 32 benchmarks with over 2,000 software executions overall.
We provide starters for four frequently used IR research frameworks that can be used as a basis for software submissions to the Information Retrieval Experiment Platform. Retrieval systems submitted to the IR Experiment Platform have to be implemented as fully self-contained Docker images, i.e., the software must be able to run without an internet connection to improve reproducibility (e.g., preventing cases where an external dependency or API is no longer available a few years later). Our existing starters can be directly submitted to TIRA, as all of them have been extensively tested on 32 benchmarks in TIRA, and they may also serve as a starting point for custom development.
The starters are available and documented in the directory tira-ir-starters.
The simplest starter implements BM25 retrieval using a few lines of declarative PyTerrier code in a Jupyter notebook.
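A rough sketch of such a declarative pipeline (not the starter's exact code; the dataset id and paths are placeholders, and the actual notebook in tira-ir-starters additionally follows TIRA's input/output conventions):

```python
# Hedged sketch of a declarative PyTerrier BM25 pipeline; dataset id and paths are placeholders.
import pyterrier as pt

if not pt.started():
    pt.init()

dataset = pt.get_dataset('irds:vaswani')              # any ir_datasets id works via the 'irds:' prefix
indexer = pt.IterDictIndexer('./index', overwrite=True)
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25')     # the retrieval pipeline (a PyTerrier transformer)
run = bm25(dataset.get_topics())                      # retrieve for all topics of the dataset
                                                      # (some datasets' topics may need extra tokenization)
pt.io.write_results(run, './run.txt', format='trec')  # persist the run in TREC format
```

See the notebook in tira-ir-starters for how the same pipeline is wired into TIRA's input and output conventions.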
If you use TIRA/TIREx in your research (or cached run files, cached indices, or other outputs), please cite the TIRA and TIREx papers:
@InProceedings{froebe:2023e,
author = {Maik Fr{\"o}be and Jan Heinrich Reimer and Sean MacAvaney and Niklas Deckers and Simon Reich and Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
booktitle = {46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023)},
doi = {10.1145/3539618.3591888},
editor = {Hsin-Hsi Chen and {Wei-Jou (Edward)} Duh and Hen-Hsen Huang and {Makoto P.} Kato and Josiane Mothe and Barbara Poblete},
isbn = {9781450394086},
month = jul,
numpages = 11,
pages = {2826--2836},
publisher = {ACM},
site = {Taipei, Taiwan},
title = {{The Information Retrieval Experiment Platform}},
year = 2023
}
@InProceedings{froebe:2023b,
address = {Berlin Heidelberg New York},
author = {Maik Fr{\"o}be and Matti Wiegmann and Nikolay Kolyada and Bastian Grahm and Theresa Elstner and Frank Loebe and Matthias Hagen and Benno Stein and Martin Potthast},
booktitle = {Advances in Information Retrieval. 45th European Conference on {IR} Research ({ECIR} 2023)},
month = apr,
publisher = {Springer},
series = {Lecture Notes in Computer Science},
site = {Dublin, Ireland},
title = {{Continuous Integration for Reproducible Shared Tasks with TIRA.io}},
year = 2023
}