Jupyter notebook for MonoT5 on PyTerrier

tira-io · May 3, 2023 · commit 7aa8d00
1 parent 4b23eba

Showing 7 changed files with 338 additions and 0 deletions.
diff --git a/tira-ir-starters/pyterrier-t5/Dockerfile b/tira-ir-starters/pyterrier-t5/Dockerfile
@@ -0,0 +1,17 @@
+FROM webis/tira-ir-starter-pyterrier:0.0.1-base
+
+RUN pip3 install --upgrade git+https://github.com/terrierteam/pyterrier_t5.git \
+	&& pip install tira==0.0.29
+
+ARG MODEL_NAME=local
+ENV MODEL_NAME=${MODEL_NAME}
+
+ARG TOKENIZER_NAME=local
+ENV TOKENIZER_NAME=${TOKENIZER_NAME}
+
+RUN python3 -c "from tira.third_party_integrations import ensure_pyterrier_is_loaded; ensure_pyterrier_is_loaded(); from pyterrier_t5 import MonoT5ReRanker; mono_t5 = MonoT5ReRanker(model='${MODEL_NAME}', tok_model='${TOKENIZER_NAME}');"
+
+COPY pyterrier-t5/bm25-monot5.ipynb /workspace
+
+RUN jupyter trust /workspace/*.ipynb
+
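The Dockerfile above bakes `MODEL_NAME` and `TOKENIZER_NAME` into the image as environment variables, and the notebook later reads them with `os.environ[...]`, which raises `KeyError` when the notebook runs outside the image. A minimal sketch of a more forgiving lookup follows. The helper name `resolve_model_settings` and the fallback values are assumptions for illustration (the fallbacks mirror the build arguments shown in the README); they are not part of this commit.

```python
import os


def resolve_model_settings(env=None):
    """Return (model, tokenizer) names, falling back to illustrative defaults."""
    env = os.environ if env is None else env
    # Assumption: these fallbacks are copied from the README's docker build example.
    model = env.get("MODEL_NAME", "castorini/monot5-base-msmarco-10k")
    tokenizer = env.get("TOKENIZER_NAME", "t5-base")
    return model, tokenizer


if __name__ == "__main__":
    print(resolve_model_settings())
```

Passing a plain dict as `env` keeps the helper easy to exercise without touching the real process environment.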
diff --git a/tira-ir-starters/pyterrier-t5/README.md b/tira-ir-starters/pyterrier-t5/README.md
@@ -0,0 +1,68 @@
+# TIRA IR-Starter for MonoT5 in PyTerrier with Jupyter Notebooks
+
+This directory contains a retrieval system that uses a Jupyter notebook with PyTerrier to rerank the top-1000 results of BM25 with MonoT5.
+
+
+## Local Development
+
+Please use the `tira-run` command (can be installed via `pip3 install tira`) to test that your retrieval approach is correctly installed inside the Docker image.
+For example, you can run the following command inside this directory to re-rank with a PyTerrier re-ranker from our tira-ir-starter with BM25 on a small example (2 queries from the passage retrieval task of TREC DL 2019):
+
+```
+tira-run \
+    --input-directory ${PWD}/sample-input \
+    --image webis/tira-ir-starter-pyterrier:0.0.1-base \
+    --command '/workspace/pyterrier_cli.py --input $inputDataset --output $outputDir --params wmodel=BM25 --rerank True --retrieval_pipeline default_pipelines.wmodel_text_scorer'
+```
+
+In the example above, `/workspace/pyterrier_cli.py --input $inputDataset --output $outputDir --params wmodel=BM25 --rerank True --retrieval_pipeline default_pipelines.wmodel_text_scorer` is the command that you would enter in TIRA, and the `--input-directory` flag points to the inputs.
+
+This creates a run file `tira-output/run.txt`, with content like (`cat sample-output/run.txt |head -3`):
+
+```
+19335 Q0 8412684 1 2.0044117909904275 pyterrier.default_pipelines.wmodel_text_scorer
+19335 Q0 8412687 2 1.6165480088144524 pyterrier.default_pipelines.wmodel_text_scorer
+19335 Q0 527689 3 0.7777388572417481 pyterrier.default_pipelines.wmodel_text_scorer
+```
+
+Testing full-rank retrievers works analogously.
+
+## Developing Retrieval Approaches in Declarative PyTerrier-Pipelines
+
+The notebook [full-rank-pipeline.ipynb](full-rank-pipeline.ipynb) exemplifies how to directly run Jupyter Notebooks in TIRA.
+
+You can run it locally via:
+
+```
+tira-run \
+    --input-directory ${PWD}/sample-input-full-rank \
+    --image webis/tira-ir-starter-pyterrier:0.0.1-base \
+    --command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/full-rank-pipeline.ipynb'
+```
+
+This creates a run file `tira-output/run.txt`, with content like (`cat sample-output/run.txt |head -3`):
+
+```
+1 0 pangram-03 1 -0.4919184192126373 BM25
+1 0 pangram-01 2 -0.5271673505256447 BM25
+1 0 pangram-04 3 -0.9838368384252746 BM25
+```
+
+## Submit the Image to TIRA
+
+You need a team for your submission. In the following, we use `tira-ir-starter` as the team name; to resubmit the image, just replace `tira-ir-starter` with your team name.
+
+First, you have to upload the image:
+
+```
+docker pull webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k
+
+docker tag webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k registry.webis.de/code-research/tira/tira-user-tira-ir-starter/pyterrier-monot5:0.0.1
+docker push registry.webis.de/code-research/tira/tira-user-tira-ir-starter/pyterrier-monot5:0.0.1
+```
+
+## Build the Image
+
+```
+docker build --build-arg MODEL_NAME=castorini/monot5-base-msmarco-10k --build-arg TOKENIZER_NAME=t5-base -t webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k -f pyterrier-t5/Dockerfile .
+```
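The run files shown in the README use six whitespace-separated columns: query id, an iteration field (`Q0` or `0`), document id, rank, score, and a system tag. As a rough illustration of how such a line could be read back for inspection, here is a minimal sketch; `RunLine` and `parse_run_line` are hypothetical names introduced here, not part of `tira` or PyTerrier.

```python
from typing import NamedTuple


class RunLine(NamedTuple):
    qid: str
    iteration: str
    docno: str
    rank: int
    score: float
    tag: str


def parse_run_line(line):
    """Split one whitespace-separated run line into typed fields."""
    qid, iteration, docno, rank, score, tag = line.split()
    return RunLine(qid, iteration, docno, int(rank), float(score), tag)


# First line of the README's re-ranking example output.
sample = "19335 Q0 8412684 1 2.0044117909904275 pyterrier.default_pipelines.wmodel_text_scorer"
parsed = parse_run_line(sample)
print(parsed.docno, parsed.rank, parsed.score)
```

Unpacking into exactly six names makes a malformed line fail loudly with a `ValueError` instead of being silently misread.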
diff --git a/tira-ir-starters/pyterrier-t5/bm25-monot5.ipynb b/tira-ir-starters/pyterrier-t5/bm25-monot5.ipynb
@@ -0,0 +1,205 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "8c3da078-f7fc-4d37-904c-532bb26d4321",
+   "metadata": {},
+   "source": [
+    "# BM25 >> MonoT5 Pipeline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66fd2911-c97a-4f91-af28-8c7e381573b6",
+   "metadata": {},
+   "source": [
+    "### Step 1: Import everything and load variables"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "7ae3c54f-aba1-45bf-b074-e78a99f6405f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "I will use a small hardcoded example located in ./sample-input-full-rank.\n",
+      "The output directory is /tmp/\n"
+     ]
+    }
+   ],
+   "source": [
+    "import pyterrier as pt\n",
+    "import pandas as pd\n",
+    "from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run\n",
+    "import json\n",
+    "from tqdm import tqdm\n",
+    "\n",
+    "ensure_pyterrier_is_loaded()\n",
+    "input_directory, output_directory = get_input_directory_and_output_directory('./sample-input-full-rank')\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8c563b0e-97ac-44a2-ba2f-18858f1506bb",
+   "metadata": {},
+   "source": [
+    "### Step 2: Load the Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "e35230af-66ec-4607-a97b-127bd890fa59",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Step 2: Load the data.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print('Step 2: Load the data.')\n",
+    "\n",
+    "queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')\n",
+    "\n",
+    "documents = (json.loads(i) for i in open(input_directory + '/documents.jsonl', 'r'))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "72655916-07fe-4c58-82c1-2f9f93381e7f",
+   "metadata": {},
+   "source": [
+    "### Step 3: Create the Index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "05ce062d-25e4-4c61-b6ce-9431b9f2bbd4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Step 3: Create the Index.\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "5it [00:00, 48.24it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "print('Step 3: Create the Index.')\n",
+    "\n",
+    "!rm -Rf ./index\n",
+    "iter_indexer = pt.IterDictIndexer(\"./index\", meta={'docno' : 100, 'text': 10240})\n",
+    "index_ref = iter_indexer.index(tqdm(documents))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "806c4638-ccee-4470-a74c-2a85d9ee2cfc",
+   "metadata": {},
+   "source": [
+    "### Step 4: Create Run"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a191f396-e896-4792-afaf-574e452640f5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyterrier_t5 import MonoT5ReRanker\n",
+    "import os\n",
+    "\n",
+    "bm25 = pt.BatchRetrieve(index_ref, wmodel=\"BM25\", metadata=['docno', 'text'])\n",
+    "\n",
+    "mono_t5 = MonoT5ReRanker(model=os.environ['MODEL_NAME'], tok_model=os.environ['TOKENIZER_NAME'])\n",
+    "\n",
+    "pipeline = bm25 % 1000 >> mono_t5"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "c0e07fca-de98-4de2-b6a7-abfd516c652c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "monoT5: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.26batches/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "run = pipeline(queries)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28c40a2e-0f96-4ae8-aa5e-55a5e7ef9dee",
+   "metadata": {},
+   "source": [
+    "### Step 5: Persist Run"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "12e5bb42-ed1f-41ba-b7a5-cb43ebca96f6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Step 5: Persist Run.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print('Step 5: Persist Run.')\n",
+    "\n",
+    "persist_and_normalize_run(run, output_file=output_directory, system_name='MonoT5', depth=1000)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
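In the notebook, `bm25 % 1000 >> mono_t5` uses PyTerrier's rank-cutoff operator (`%`) and composition operator (`>>`): the top 1000 BM25 results per query are kept and then re-scored by MonoT5. The plain-Python sketch below mimics that cutoff-then-rerank pattern on toy data. The function names and the term-overlap scorer are assumptions for illustration only and say nothing about how PyTerrier or MonoT5 work internally.

```python
def toy_overlap_score(query, text):
    """Stand-in for a neural re-ranker: count distinct query terms found in the text."""
    terms = set(query.lower().split())
    return sum(1 for token in set(text.lower().split()) if token in terms)


def cutoff_then_rerank(query, first_stage, k, scorer):
    """Keep the top-k first-stage results, then re-score and re-sort them.

    first_stage: list of (docno, text, score) tuples from the first retriever.
    """
    top_k = sorted(first_stage, key=lambda r: r[2], reverse=True)[:k]
    rescored = [(docno, scorer(query, text)) for docno, text, _ in top_k]
    return sorted(rescored, key=lambda r: r[1], reverse=True)


# Hypothetical first-stage scores; only the document texts come from the sample data.
first_stage = [
    ("pangram-01", "How quickly daft jumping zebras vex.", 0.9),
    ("pangram-02", "Quick fox jumps nightly above wizard.", 0.5),
    ("pangram-04", "The quick brown fox jumps over the lazy dog.", 0.4),
]
reranked = cutoff_then_rerank("fox jumps above animal", first_stage, k=2, scorer=toy_overlap_score)
print(reranked)
```

With `k=2`, `pangram-04` never reaches the second stage, which is the trade-off of any cutoff: the re-ranker can only reorder what the first stage lets through.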
diff --git a/tira-ir-starters/pyterrier-t5/sample-input-full-rank/documents.jsonl b/tira-ir-starters/pyterrier-t5/sample-input-full-rank/documents.jsonl
@@ -0,0 +1,5 @@
+{"docno": "pangram-01", "text": "How quickly daft jumping zebras vex.", "original_document": {"doc_id": "pangram-01", "text": "How quickly daft jumping zebras vex.", "letters": 30}}
+{"docno": "pangram-02", "text": "Quick fox jumps nightly above wizard.", "original_document": {"doc_id": "pangram-02", "text": "Quick fox jumps nightly above wizard.", "letters": 31}}
+{"docno": "pangram-03", "text": "The jay, pig, fox, zebra and my wolves quack!", "original_document": {"doc_id": "pangram-03", "text": "The jay, pig, fox, zebra and my wolves quack!", "letters": 33}}
+{"docno": "pangram-04", "text": "The quick brown fox jumps over the lazy dog.", "original_document": {"doc_id": "pangram-04", "text": "The quick brown fox jumps over the lazy dog.", "letters": 35}}
+{"docno": "pangram-05", "text": "As quirky joke, chefs won\u2019t pay devil magic zebra tax.", "original_document": {"doc_id": "pangram-05", "text": "As quirky joke, chefs won\u2019t pay devil magic zebra tax.", "letters": 42}}
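The notebook streams `documents.jsonl` through a generator expression over an `open(...)` call that is never explicitly closed. One possible alternative, sketched here under the assumption that each non-empty line holds one JSON object, is a generator function that owns the file handle; `iter_documents` is a hypothetical helper, not something provided by `tira`.

```python
import json
import os
import tempfile


def iter_documents(path):
    """Yield one dict per non-empty line; the file is closed once iteration ends."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demo on a throwaway file holding two of the sample documents (fields trimmed).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "documents.jsonl")
    with open(path, "w", encoding="utf-8") as out:
        out.write('{"docno": "pangram-01", "text": "How quickly daft jumping zebras vex."}\n')
        out.write('{"docno": "pangram-02", "text": "Quick fox jumps nightly above wizard."}\n')
    docs = list(iter_documents(path))

print([d["docno"] for d in docs])
```

Because it is still a generator, it can be handed to an indexer that consumes an iterable of dicts without loading the whole collection into memory.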
diff --git a/tira-ir-starters/pyterrier-t5/sample-input-full-rank/metadata.json b/tira-ir-starters/pyterrier-t5/sample-input-full-rank/metadata.json
@@ -0,0 +1 @@
+{"ir_datasets_id": "pangrams"}
diff --git a/tira-ir-starters/pyterrier-t5/sample-input-full-rank/queries.jsonl b/tira-ir-starters/pyterrier-t5/sample-input-full-rank/queries.jsonl
@@ -0,0 +1,2 @@
+{"qid": "1", "query": "fox jumps above animal", "original_query": {"query_id": "1", "title": "fox jumps above animal", "description": "What pangrams have a fox jumping above some animal?", "narrative": "Relevant pangrams have a fox jumping over an animal (e.g., an dog). Pangrams containing a fox that is not jumping or jumps over something that is not an animal are not relevant."}}
+{"qid": "2", "query": "multiple animals including a zebra", "original_query": {"query_id": "2", "title": "multiple animals including a zebra", "description": "Which pangrams have multiple animals where one of the animals is a zebra?", "narrative": "Relevant pangrams have at least two animals, one of the animals must be a Zebra. Pangrams containing only a Zebra are not relevant."}}
diff --git a/tira-ir-starters/pyterrier-t5/sample-input-full-rank/queries.xml b/tira-ir-starters/pyterrier-t5/sample-input-full-rank/queries.xml
@@ -0,0 +1,40 @@
+<topics ir-datasets-id="pangrams">
+ <topic number="1">
+  <query>
+   fox jumps above animal
+  </query>
+  <original_query>
+   <query_id>
+    1
+   </query_id>
+   <title>
+    fox jumps above animal
+   </title>
+   <description>
+    What pangrams have a fox jumping above some animal?
+   </description>
+   <narrative>
+    Relevant pangrams have a fox jumping over an animal (e.g., an dog). Pangrams containing a fox that is not jumping or jumps over something that is not an animal are not relevant.
+   </narrative>
+  </original_query>
+ </topic>
+ <topic number="2">
+  <query>
+   multiple animals including a zebra
+  </query>
+  <original_query>
+   <query_id>
+    2
+   </query_id>
+   <title>
+    multiple animals including a zebra
+   </title>
+   <description>
+    Which pangrams have multiple animals where one of the animals is a zebra?
+   </description>
+   <narrative>
+    Relevant pangrams have at least two animals, one of the animals must be a Zebra. Pangrams containing only a Zebra are not relevant.
+   </narrative>
+  </original_query>
+ </topic>
+</topics>
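The topics file wraps each query in whitespace and indentation, so any reader has to strip the text before use. The notebook relies on `pt.io.read_topics(..., format='trecxml')` for this. Purely as an illustration of the file layout, the sketch below reads the same structure with the standard library; `read_queries` is a hypothetical helper and the embedded sample is abridged from the file above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<topics ir-datasets-id="pangrams">
 <topic number="1">
  <query>
   fox jumps above animal
  </query>
 </topic>
 <topic number="2">
  <query>
   multiple animals including a zebra
  </query>
 </topic>
</topics>"""


def read_queries(xml_text):
    """Return {topic number: query text} with surrounding whitespace stripped."""
    root = ET.fromstring(xml_text)
    return {
        topic.get("number"): topic.findtext("query", default="").strip()
        for topic in root.iter("topic")
    }


queries = read_queries(SAMPLE)
print(queries)
```

`findtext("query")` only looks at direct children of each `<topic>`, so the nested `<original_query>` block in the full file would not interfere.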