Jupyter notebook for MonoT5 on PyTerrier
mam10eks committed May 3, 2023
1 parent 4b23eba commit 7aa8d00
Showing 7 changed files with 338 additions and 0 deletions.
17 changes: 17 additions & 0 deletions tira-ir-starters/pyterrier-t5/Dockerfile
@@ -0,0 +1,17 @@
FROM webis/tira-ir-starter-pyterrier:0.0.1-base

RUN pip3 install --upgrade git+https://github.com/terrierteam/pyterrier_t5.git \
  && pip3 install tira==0.0.29

ARG MODEL_NAME=local
ENV MODEL_NAME ${MODEL_NAME}

ARG TOKENIZER_NAME=local
ENV TOKENIZER_NAME ${TOKENIZER_NAME}

RUN python3 -c "from tira.third_party_integrations import ensure_pyterrier_is_loaded; ensure_pyterrier_is_loaded(); from pyterrier_t5 import MonoT5ReRanker; mono_t5 = MonoT5ReRanker(model='${MODEL_NAME}', tok_model='${TOKENIZER_NAME}');"

COPY pyterrier-t5/bm25-monot5.ipynb /workspace

RUN jupyter trust /workspace/*.ipynb
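The `ARG`/`ENV` pair in the Dockerfile bakes the model and tokenizer names into the image so the notebook can read them at runtime via `os.environ`. A minimal sketch of the consuming side (the `'local'` fallback mirrors the build-arg default; the notebook itself accesses the variables without a fallback):

```python
import os

# Read the model/tokenizer names baked in by the Dockerfile's ENV directives;
# fall back to the build-arg default 'local' when the variables are unset.
model_name = os.environ.get('MODEL_NAME', 'local')
tokenizer_name = os.environ.get('TOKENIZER_NAME', 'local')
print(model_name, tokenizer_name)
```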

68 changes: 68 additions & 0 deletions tira-ir-starters/pyterrier-t5/README.md
@@ -0,0 +1,68 @@
# TIRA IR-Starter for MonoT5 in PyTerrier with Jupyter Notebooks

This directory contains a retrieval system that uses a Jupyter notebook with PyTerrier to rerank the top-1000 results of BM25 with MonoT5.

## Local Development

Please use the `tira-run` command (can be installed via `pip3 install tira`) to test that your retrieval approach is correctly installed inside the Docker image.
For example, you can run the following command inside this directory to re-rank with a PyTerrier re-ranker from our tira-ir-starter with BM25 on a small example (2 queries from the passage retrieval task of TREC DL 2019):

```
tira-run \
--input-directory ${PWD}/sample-input \
--image webis/tira-ir-starter-pyterrier:0.0.1-base \
--command '/workspace/pyterrier_cli.py --input $inputDataset --output $outputDir --params wmodel=BM25 --rerank True --retrieval_pipeline default_pipelines.wmodel_text_scorer'
```

In the example above, `/workspace/pyterrier_cli.py --input $inputDataset --output $outputDir --params wmodel=BM25 --rerank True --retrieval_pipeline default_pipelines.wmodel_text_scorer` is the command that you would enter in TIRA, while the `--input-directory` flag points to the inputs.

This creates a run file `tira-output/run.txt`, with content like (`cat tira-output/run.txt | head -3`):

```
19335 Q0 8412684 1 2.0044117909904275 pyterrier.default_pipelines.wmodel_text_scorer
19335 Q0 8412687 2 1.6165480088144524 pyterrier.default_pipelines.wmodel_text_scorer
19335 Q0 527689 3 0.7777388572417481 pyterrier.default_pipelines.wmodel_text_scorer
```
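Each line of such a run file follows the standard TREC run format: query id, the literal `Q0`, document id, rank, score, and a system tag. A small illustrative parser (the field names are my own choice, not part of the starter):

```python
from typing import NamedTuple

class RunLine(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: float
    tag: str

def parse_run_line(line: str) -> RunLine:
    # TREC run format: qid Q0 docno rank score tag
    qid, _q0, docno, rank, score, tag = line.split()
    return RunLine(qid, docno, int(rank), float(score), tag)

example = '19335 Q0 8412684 1 2.0044117909904275 pyterrier.default_pipelines.wmodel_text_scorer'
print(parse_run_line(example).rank)  # → 1
```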

Testing full-rank retrievers works analogously.

## Developing Retrieval Approaches in Declarative PyTerrier-Pipelines

The notebook [full-rank-pipeline.ipynb](full-rank-pipeline.ipynb) exemplifies how to directly run Jupyter Notebooks in TIRA.

You can run it locally via:

```
tira-run \
--input-directory ${PWD}/sample-input-full-rank \
--image webis/tira-ir-starter-pyterrier:0.0.1-base \
--command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/full-rank-pipeline.ipynb'
```

This creates a run file `tira-output/run.txt`, with content like (`cat tira-output/run.txt | head -3`):

```
1 0 pangram-03 1 -0.4919184192126373 BM25
1 0 pangram-01 2 -0.5271673505256447 BM25
1 0 pangram-04 3 -0.9838368384252746 BM25
```

## Submit the Image to TIRA

You need a team for your submission. In the following, we use `tira-ir-starter` as the team name; to submit the image under your own team, replace `tira-ir-starter` with your team name.

First, you have to upload the image:

```
docker pull webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k
docker tag webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k registry.webis.de/code-research/tira/tira-user-tira-ir-starter/pyterrier-monot5:0.0.1
docker push registry.webis.de/code-research/tira/tira-user-tira-ir-starter/pyterrier-monot5:0.0.1
```

## Build the Image

```
docker build --build-arg MODEL_NAME=castorini/monot5-base-msmarco-10k --build-arg TOKENIZER_NAME=t5-base -t webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k -f pyterrier-t5/Dockerfile .
```
205 changes: 205 additions & 0 deletions tira-ir-starters/pyterrier-t5/bm25-monot5.ipynb
@@ -0,0 +1,205 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "8c3da078-f7fc-4d37-904c-532bb26d4321",
"metadata": {},
"source": [
"# BM25 >> MonoT5 Pipeline"
]
},
{
"cell_type": "markdown",
"id": "66fd2911-c97a-4f91-af28-8c7e381573b6",
"metadata": {},
"source": [
"### Step 1: Import everything and load variables"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7ae3c54f-aba1-45bf-b074-e78a99f6405f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I will use a small hardcoded example located in ./sample-input-full-rank.\n",
"The output directory is /tmp/\n"
]
}
],
"source": [
"import pyterrier as pt\n",
"import pandas as pd\n",
"from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run\n",
"import json\n",
"from tqdm import tqdm\n",
"\n",
"ensure_pyterrier_is_loaded()\n",
"input_directory, output_directory = get_input_directory_and_output_directory('./sample-input-full-rank')\n"
]
},
{
"cell_type": "markdown",
"id": "8c563b0e-97ac-44a2-ba2f-18858f1506bb",
"metadata": {},
"source": [
"### Step 2: Load the Data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e35230af-66ec-4607-a97b-127bd890fa59",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Step 2: Load the data.\n"
]
}
],
"source": [
"print('Step 2: Load the data.')\n",
"\n",
"queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')\n",
"\n",
"documents = (json.loads(i) for i in open(input_directory + '/documents.jsonl', 'r'))\n"
]
},
{
"cell_type": "markdown",
"id": "72655916-07fe-4c58-82c1-2f9f93381e7f",
"metadata": {},
"source": [
"### Step 3: Create the Index"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "05ce062d-25e4-4c61-b6ce-9431b9f2bbd4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Step 3: Create the Index.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"5it [00:00, 48.24it/s]\n"
]
}
],
"source": [
"print('Step 3: Create the Index.')\n",
"\n",
"!rm -Rf ./index\n",
"iter_indexer = pt.IterDictIndexer(\"./index\", meta={'docno' : 100, 'text': 10240})\n",
"index_ref = iter_indexer.index(tqdm(documents))"
]
},
{
"cell_type": "markdown",
"id": "806c4638-ccee-4470-a74c-2a85d9ee2cfc",
"metadata": {},
"source": [
"### Step 4: Create Run"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a191f396-e896-4792-afaf-574e452640f5",
"metadata": {},
"outputs": [],
"source": [
"from pyterrier_t5 import MonoT5ReRanker\n",
"import os\n",
"\n",
"bm25 = pt.BatchRetrieve(index_ref, wmodel=\"BM25\", metadata=['docno', 'text'])\n",
"\n",
"mono_t5 = MonoT5ReRanker(model=os.environ['MODEL_NAME'], tok_model=os.environ['TOKENIZER_NAME'])\n",
"\n",
"pipeline = bm25 % 1000 >> mono_t5"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c0e07fca-de98-4de2-b6a7-abfd516c652c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"monoT5: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4.26batches/s]\n"
]
}
],
"source": [
"run = pipeline(queries)"
]
},
{
"cell_type": "markdown",
"id": "28c40a2e-0f96-4ae8-aa5e-55a5e7ef9dee",
"metadata": {},
"source": [
"### Step 5: Persist Run"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "12e5bb42-ed1f-41ba-b7a5-cb43ebca96f6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Step 5: Persist Run.\n"
]
}
],
"source": [
"print('Step 5: Persist Run.')\n",
"\n",
"persist_and_normalize_run(run, output_file=output_directory, system_name='MonoT5', depth=1000)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
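The core of the notebook is the pipeline `bm25 % 1000 >> mono_t5`, where PyTerrier's `%` operator cuts the ranking to the top k and `>>` feeds the result into the next stage. A hypothetical, PyTerrier-free toy re-implementation of that composition idea (all names and the scorer are made up for illustration):

```python
# Toy sketch of `bm25 % 1000 >> mono_t5`: `%` keeps the top-k results,
# `>>` passes them to the next stage, which re-scores and re-sorts.

def rank_cutoff(results, k):
    # results: list of (docno, score), best first
    return results[:k]

def rerank(results, scorer):
    rescored = [(docno, scorer(docno)) for docno, _ in results]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

bm25_results = [('d1', 9.1), ('d2', 7.3), ('d3', 5.0)]
neural_scores = {'d1': 0.2, 'd2': 0.8}  # stand-in for MonoT5 scores
final = rerank(rank_cutoff(bm25_results, 2), scorer=lambda d: neural_scores[d])
print(final)  # [('d2', 0.8), ('d1', 0.2)]
```

Note that the cutoff matters: `d3` never reaches the re-ranker, just as only BM25's top-1000 documents reach MonoT5 in the notebook.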
@@ -0,0 +1,5 @@
{"docno": "pangram-01", "text": "How quickly daft jumping zebras vex.", "original_document": {"doc_id": "pangram-01", "text": "How quickly daft jumping zebras vex.", "letters": 30}}
{"docno": "pangram-02", "text": "Quick fox jumps nightly above wizard.", "original_document": {"doc_id": "pangram-02", "text": "Quick fox jumps nightly above wizard.", "letters": 31}}
{"docno": "pangram-03", "text": "The jay, pig, fox, zebra and my wolves quack!", "original_document": {"doc_id": "pangram-03", "text": "The jay, pig, fox, zebra and my wolves quack!", "letters": 33}}
{"docno": "pangram-04", "text": "The quick brown fox jumps over the lazy dog.", "original_document": {"doc_id": "pangram-04", "text": "The quick brown fox jumps over the lazy dog.", "letters": 35}}
{"docno": "pangram-05", "text": "As quirky joke, chefs won\u2019t pay devil magic zebra tax.", "original_document": {"doc_id": "pangram-05", "text": "As quirky joke, chefs won\u2019t pay devil magic zebra tax.", "letters": 42}}
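The notebook consumes this file lazily, one JSON object per line, via a generator expression. A self-contained sketch of the same pattern over in-memory lines (the sample lines are shortened copies of the ones above):

```python
import json

# Two sample lines in the same shape as documents.jsonl.
jsonl_lines = [
    '{"docno": "pangram-01", "text": "How quickly daft jumping zebras vex."}',
    '{"docno": "pangram-04", "text": "The quick brown fox jumps over the lazy dog."}',
]

# Lazy parsing, mirroring the notebook's
# (json.loads(i) for i in open(...)) generator.
documents = (json.loads(line) for line in jsonl_lines)
for doc in documents:
    print(doc['docno'], '->', doc['text'])
```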
@@ -0,0 +1 @@
{"ir_datasets_id": "pangrams"}
@@ -0,0 +1,2 @@
{"qid": "1", "query": "fox jumps above animal", "original_query": {"query_id": "1", "title": "fox jumps above animal", "description": "What pangrams have a fox jumping above some animal?", "narrative": "Relevant pangrams have a fox jumping over an animal (e.g., a dog). Pangrams containing a fox that is not jumping or jumps over something that is not an animal are not relevant."}}
{"qid": "2", "query": "multiple animals including a zebra", "original_query": {"query_id": "2", "title": "multiple animals including a zebra", "description": "Which pangrams have multiple animals where one of the animals is a zebra?", "narrative": "Relevant pangrams have at least two animals, one of the animals must be a Zebra. Pangrams containing only a Zebra are not relevant."}}
40 changes: 40 additions & 0 deletions tira-ir-starters/pyterrier-t5/sample-input-full-rank/queries.xml
@@ -0,0 +1,40 @@
<topics ir-datasets-id="pangrams">
<topic number="1">
<query>
fox jumps above animal
</query>
<original_query>
<query_id>
1
</query_id>
<title>
fox jumps above animal
</title>
<description>
What pangrams have a fox jumping above some animal?
</description>
<narrative>
Relevant pangrams have a fox jumping over an animal (e.g., a dog). Pangrams containing a fox that is not jumping or jumps over something that is not an animal are not relevant.
</narrative>
</original_query>
</topic>
<topic number="2">
<query>
multiple animals including a zebra
</query>
<original_query>
<query_id>
2
</query_id>
<title>
multiple animals including a zebra
</title>
<description>
Which pangrams have multiple animals where one of the animals is a zebra?
</description>
<narrative>
Relevant pangrams have at least two animals, one of the animals must be a Zebra. Pangrams containing only a Zebra are not relevant.
</narrative>
</original_query>
</topic>
</topics>
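The notebook loads these topics with `pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')`. As a rough, PyTerrier-free illustration of the structure, an `ElementTree` sketch over a trimmed copy of the file (the whitespace stripping is my addition):

```python
import xml.etree.ElementTree as ET

xml_snippet = """
<topics ir-datasets-id="pangrams">
  <topic number="1">
    <query>fox jumps above animal</query>
  </topic>
  <topic number="2">
    <query>multiple animals including a zebra</query>
  </topic>
</topics>
"""

root = ET.fromstring(xml_snippet)
# Map topic number -> query text, stripping surrounding whitespace.
topics = {t.attrib['number']: t.findtext('query').strip() for t in root}
print(topics)
```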
