Jupyter notebook for MonoT5 on PyTerrier
Showing 7 changed files with 338 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
FROM webis/tira-ir-starter-pyterrier:0.0.1-base | ||
|
||
RUN pip3 install --upgrade git+https://github.com/terrierteam/pyterrier_t5.git \ | ||
&& pip install tira==0.0.29 | ||
|
||
ARG MODEL_NAME=local | ||
ENV MODEL_NAME=${MODEL_NAME} | ||
|
||
ARG TOKENIZER_NAME=local | ||
ENV TOKENIZER_NAME=${TOKENIZER_NAME} | ||
|
||
RUN python3 -c "from tira.third_party_integrations import ensure_pyterrier_is_loaded; ensure_pyterrier_is_loaded(); from pyterrier_t5 import MonoT5ReRanker; mono_t5 = MonoT5ReRanker(model='${MODEL_NAME}', tok_model='${TOKENIZER_NAME}');" | ||
|
||
COPY pyterrier-t5/bm25-monot5.ipynb /workspace | ||
|
||
RUN jupyter trust /workspace/*.ipynb | ||
|
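The Dockerfile above defaults both `MODEL_NAME` and `TOKENIZER_NAME` to `local` and exports them as environment variables, which the notebook in this commit reads via `os.environ[...]`. A minimal sketch of resolving the two names with the same fallback follows; the helper name `resolve_model_names` is hypothetical and not part of the commit, and the `local` fallback is an assumption that mirrors the `ARG` defaults.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_model_names(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (model, tokenizer), falling back to 'local' like the ARG defaults."""
    # Hypothetical helper: accepts any mapping so it can be tried without
    # touching the real process environment.
    env = os.environ if env is None else env
    model = env.get("MODEL_NAME", "local")
    tokenizer = env.get("TOKENIZER_NAME", "local")
    return model, tokenizer


# Values taken from the build command in the README of this commit.
build_args = {
    "MODEL_NAME": "castorini/monot5-base-msmarco-10k",
    "TOKENIZER_NAME": "t5-base",
}
model_name, tokenizer_name = resolve_model_names(build_args)
print(model_name, tokenizer_name)
```

Using `.get` with a default avoids the `KeyError` that `os.environ['MODEL_NAME']` raises when the notebook runs outside the image.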
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
# TIRA IR-Starter for MonoT5 in PyTerrier with Jupyter Notebooks | ||
|
||
This directory contains a retrieval system that uses a Jupyter notebook with PyTerrier to rerank the top-1000 results of BM25 with MonoT5. | ||
|
||
## Local Development | ||
|
||
Please use the `tira-run` command (can be installed via `pip3 install tira`) to test that your retrieval approach is correctly installed inside the Docker image. | ||
For example, you can run the following command inside this directory to re-rank with a PyTerrier re-ranker from our tira-ir-starter with BM25 on a small example (2 queries from the passage retrieval task of TREC DL 2019): | ||
|
||
``` | ||
tira-run \ | ||
--input-directory ${PWD}/sample-input \ | ||
--image webis/tira-ir-starter-pyterrier:0.0.1-base \ | ||
--command '/workspace/pyterrier_cli.py --input $inputDataset --output $outputDir --params wmodel=BM25 --rerank True --retrieval_pipeline default_pipelines.wmodel_text_scorer' | ||
``` | ||
|
||
In the example above, `/workspace/pyterrier_cli.py --input $inputDataset --output $outputDir --params wmodel=BM25 --rerank True --retrieval_pipeline default_pipelines.wmodel_text_scorer` is the command that you would enter in TIRA, and the `--input-directory` flag points to the inputs. | ||
|
||
This creates a run file `tira-output/run.txt`, with content like (`cat sample-output/run.txt |head -3`): | ||
|
||
``` | ||
19335 Q0 8412684 1 2.0044117909904275 pyterrier.default_pipelines.wmodel_text_scorer | ||
19335 Q0 8412687 2 1.6165480088144524 pyterrier.default_pipelines.wmodel_text_scorer | ||
19335 Q0 527689 3 0.7777388572417481 pyterrier.default_pipelines.wmodel_text_scorer | ||
``` | ||
|
||
Testing full-rank retrievers works analogously. | ||
|
||
## Developing Retrieval Approaches in Declarative PyTerrier-Pipelines | ||
|
||
The notebook [full-rank-pipeline.ipynb](full-rank-pipeline.ipynb) exemplifies how to directly run Jupyter Notebooks in TIRA. | ||
|
||
You can run it locally via: | ||
|
||
``` | ||
tira-run \ | ||
--input-directory ${PWD}/sample-input-full-rank \ | ||
--image webis/tira-ir-starter-pyterrier:0.0.1-base \ | ||
--command '/workspace/run-pyterrier-notebook.py --input $inputDataset --output $outputDir --notebook /workspace/full-rank-pipeline.ipynb' | ||
``` | ||
|
||
This creates a run file `tira-output/run.txt`, with content like (`cat sample-output/run.txt |head -3`): | ||
|
||
``` | ||
1 0 pangram-03 1 -0.4919184192126373 BM25 | ||
1 0 pangram-01 2 -0.5271673505256447 BM25 | ||
1 0 pangram-04 3 -0.9838368384252746 BM25 | ||
``` | ||
|
||
## Submit the Image to TIRA | ||
|
||
You need a team for your submission. In the following, we use `tira-ir-starter` as the team name; to resubmit the image, replace `tira-ir-starter` with your team name. | ||
|
||
First, you have to upload the image: | ||
|
||
``` | ||
docker pull webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k | ||
docker tag webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k registry.webis.de/code-research/tira/tira-user-tira-ir-starter/pyterrier-monot5:0.0.1 | ||
docker push registry.webis.de/code-research/tira/tira-user-tira-ir-starter/pyterrier-monot5:0.0.1 | ||
``` | ||
|
||
## Build the image | ||
|
||
``` | ||
docker build --build-arg MODEL_NAME=castorini/monot5-base-msmarco-10k --build-arg TOKENIZER_NAME=t5-base -t webis/tira-ir-starter-pyterrier-monot5:0.0.1-monot5-base-msmarco-10k -f pyterrier-t5/Dockerfile . | ||
``` |
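The run files shown in the README above have six whitespace-separated columns per line (query id, an iteration field, document id, rank, score, system tag). A minimal sketch of parsing such lines with the standard library, assuming that six-column layout; `RunEntry`, `parse_run_line`, and `parse_run` are hypothetical names, not part of `tira` or PyTerrier:

```python
from typing import List, NamedTuple


class RunEntry(NamedTuple):
    qid: str
    iteration: str
    docno: str
    rank: int
    score: float
    system: str


def parse_run_line(line: str) -> RunEntry:
    """Split one whitespace-separated run line into its six fields."""
    qid, iteration, docno, rank, score, system = line.split()
    return RunEntry(qid, iteration, docno, int(rank), float(score), system)


def parse_run(text: str) -> List[RunEntry]:
    """Parse every non-empty line of a run file's content."""
    return [parse_run_line(line) for line in text.splitlines() if line.strip()]


# The three sample lines are copied from the README above.
SAMPLE = """\
1 0 pangram-03 1 -0.4919184192126373 BM25
1 0 pangram-01 2 -0.5271673505256447 BM25
1 0 pangram-04 3 -0.9838368384252746 BM25
"""

entries = parse_run(SAMPLE)
for entry in entries:
    print(entry.qid, entry.docno, entry.rank, entry.score)
```

A quick sanity check on a produced run is that, per query, ranks increase while scores do not increase.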
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,205 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8c3da078-f7fc-4d37-904c-532bb26d4321", | ||
"metadata": {}, | ||
"source": [ | ||
"# BM25 >> MonoT5 Pipeline" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "66fd2911-c97a-4f91-af28-8c7e381573b6", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 1: Import everything and load variables" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "7ae3c54f-aba1-45bf-b074-e78a99f6405f", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"I will use a small hardcoded example located in ./sample-input-full-rank.\n", | ||
"The output directory is /tmp/\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import pyterrier as pt\n", | ||
"import pandas as pd\n", | ||
"from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run\n", | ||
"import json\n", | ||
"from tqdm import tqdm\n", | ||
"\n", | ||
"ensure_pyterrier_is_loaded()\n", | ||
"input_directory, output_directory = get_input_directory_and_output_directory('./sample-input-full-rank')\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8c563b0e-97ac-44a2-ba2f-18858f1506bb", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 2: Load the Data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "e35230af-66ec-4607-a97b-127bd890fa59", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Step 2: Load the data.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print('Step 2: Load the data.')\n", | ||
"\n", | ||
"queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')\n", | ||
"\n", | ||
"documents = (json.loads(i) for i in open(input_directory + '/documents.jsonl', 'r'))\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "72655916-07fe-4c58-82c1-2f9f93381e7f", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 3: Create the Index" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "05ce062d-25e4-4c61-b6ce-9431b9f2bbd4", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Step 3: Create the Index.\n" | ||
] | ||
}, | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"5it [00:00, 48.24it/s]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print('Step 3: Create the Index.')\n", | ||
"\n", | ||
"!rm -Rf ./index\n", | ||
"iter_indexer = pt.IterDictIndexer(\"./index\", meta={'docno' : 100, 'text': 10240})\n", | ||
"index_ref = iter_indexer.index(tqdm(documents))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "806c4638-ccee-4470-a74c-2a85d9ee2cfc", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 4: Create Run" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "a191f396-e896-4792-afaf-574e452640f5", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from pyterrier_t5 import MonoT5ReRanker\n", | ||
"import os\n", | ||
"\n", | ||
"bm25 = pt.BatchRetrieve(index_ref, wmodel=\"BM25\", metadata=['docno', 'text'])\n", | ||
"\n", | ||
"mono_t5 = MonoT5ReRanker(model=os.environ['MODEL_NAME'], tok_model=os.environ['TOKENIZER_NAME'])\n", | ||
"\n", | ||
"pipeline = bm25 % 1000 >> mono_t5" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "c0e07fca-de98-4de2-b6a7-abfd516c652c", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"monoT5: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4.26batches/s]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"run = pipeline(queries)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "28c40a2e-0f96-4ae8-aa5e-55a5e7ef9dee", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 5: Persist Run" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "12e5bb42-ed1f-41ba-b7a5-cb43ebca96f6", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Step 5: Persist Run.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print('Step 5: Persist Run.')\n", | ||
"\n", | ||
"persist_and_normalize_run(run, output_file=output_directory, system_name='MonoT5', depth=1000)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.13" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
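In the notebook above, `bm25 % 1000 >> mono_t5` uses PyTerrier's rank-cutoff operator (`%`) followed by composition (`>>`): keep the top 1000 BM25 results per query, then rescore them with MonoT5. The control flow can be sketched in plain Python as below. This is only an illustration of the retrieve, cut off, rerank pattern, not PyTerrier's implementation; the two toy scorers (`term_overlap`, `overlap_ratio`) are hypothetical stand-ins for BM25 and MonoT5.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]


def tokens(text: str) -> List[str]:
    """Lowercase and strip simple punctuation from whitespace-split tokens."""
    return [t.strip(".,!?").lower() for t in text.split()]


def term_overlap(query: str, text: str) -> float:
    """Toy first-stage scorer: number of document tokens that occur in the query."""
    query_terms = set(tokens(query))
    return float(sum(1 for t in tokens(text) if t in query_terms))


def overlap_ratio(query: str, text: str) -> float:
    """Toy second-stage scorer: overlap normalised by document length."""
    doc_tokens = tokens(text)
    return term_overlap(query, text) / len(doc_tokens) if doc_tokens else 0.0


def retrieve_then_rerank(query: str, docs: Dict[str, str], first_stage: Scorer,
                         reranker: Scorer, cutoff: int) -> List[Tuple[str, float]]:
    # Stage 1: rank all documents and keep only the top `cutoff` candidates.
    candidates = sorted(docs, key=lambda d: first_stage(query, docs[d]),
                        reverse=True)[:cutoff]
    # Stage 2: rescore only the surviving candidates and sort by the new score.
    rescored = [(d, reranker(query, docs[d])) for d in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Three documents from the sample input of this commit.
DOCS = {
    "pangram-01": "How quickly daft jumping zebras vex.",
    "pangram-02": "Quick fox jumps nightly above wizard.",
    "pangram-04": "The quick brown fox jumps over the lazy dog.",
}

results = retrieve_then_rerank("fox jumps above animal", DOCS,
                               term_overlap, overlap_ratio, cutoff=2)
print(results)
```

The cutoff bounds the cost of the second stage: the more expensive scorer only ever sees `cutoff` documents per query.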
5 changes: 5 additions & 0 deletions
tira-ir-starters/pyterrier-t5/sample-input-full-rank/documents.jsonl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
{"docno": "pangram-01", "text": "How quickly daft jumping zebras vex.", "original_document": {"doc_id": "pangram-01", "text": "How quickly daft jumping zebras vex.", "letters": 30}} | ||
{"docno": "pangram-02", "text": "Quick fox jumps nightly above wizard.", "original_document": {"doc_id": "pangram-02", "text": "Quick fox jumps nightly above wizard.", "letters": 31}} | ||
{"docno": "pangram-03", "text": "The jay, pig, fox, zebra and my wolves quack!", "original_document": {"doc_id": "pangram-03", "text": "The jay, pig, fox, zebra and my wolves quack!", "letters": 33}} | ||
{"docno": "pangram-04", "text": "The quick brown fox jumps over the lazy dog.", "original_document": {"doc_id": "pangram-04", "text": "The quick brown fox jumps over the lazy dog.", "letters": 35}} | ||
{"docno": "pangram-05", "text": "As quirky joke, chefs won\u2019t pay devil magic zebra tax.", "original_document": {"doc_id": "pangram-05", "text": "As quirky joke, chefs won\u2019t pay devil magic zebra tax.", "letters": 42}} |
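The notebook in this commit streams `documents.jsonl` through a generator expression over an open file. A minimal sketch of the same lazy reading as a reusable function follows; `iter_jsonl` is a hypothetical helper name, and the two sample records are shortened copies of the lines above (only `docno` and `text` are kept).

```python
import io
import json
from typing import Dict, IO, Iterator


def iter_jsonl(handle: IO[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line, without loading the whole file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for the file; with a real file one would use
# `with open(path, encoding="utf-8") as handle:` so the handle is closed.
SAMPLE = io.StringIO(
    '{"docno": "pangram-01", "text": "How quickly daft jumping zebras vex."}\n'
    '{"docno": "pangram-02", "text": "Quick fox jumps nightly above wizard."}\n'
)

docs = list(iter_jsonl(SAMPLE))
for doc in docs:
    print(doc["docno"], doc["text"])
```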
1 change: 1 addition & 0 deletions
tira-ir-starters/pyterrier-t5/sample-input-full-rank/metadata.json
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"ir_datasets_id": "pangrams"} |
2 changes: 2 additions & 0 deletions
tira-ir-starters/pyterrier-t5/sample-input-full-rank/queries.jsonl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
{"qid": "1", "query": "fox jumps above animal", "original_query": {"query_id": "1", "title": "fox jumps above animal", "description": "What pangrams have a fox jumping above some animal?", "narrative": "Relevant pangrams have a fox jumping over an animal (e.g., an dog). Pangrams containing a fox that is not jumping or jumps over something that is not an animal are not relevant."}} | ||
{"qid": "2", "query": "multiple animals including a zebra", "original_query": {"query_id": "2", "title": "multiple animals including a zebra", "description": "Which pangrams have multiple animals where one of the animals is a zebra?", "narrative": "Relevant pangrams have at least two animals, one of the animals must be a Zebra. Pangrams containing only a Zebra are not relevant."}} |
40 changes: 40 additions & 0 deletions
tira-ir-starters/pyterrier-t5/sample-input-full-rank/queries.xml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
<topics ir-datasets-id="pangrams"> | ||
<topic number="1"> | ||
<query> | ||
fox jumps above animal | ||
</query> | ||
<original_query> | ||
<query_id> | ||
1 | ||
</query_id> | ||
<title> | ||
fox jumps above animal | ||
</title> | ||
<description> | ||
What pangrams have a fox jumping above some animal? | ||
</description> | ||
<narrative> | ||
Relevant pangrams have a fox jumping over an animal (e.g., an dog). Pangrams containing a fox that is not jumping or jumps over something that is not an animal are not relevant. | ||
</narrative> | ||
</original_query> | ||
</topic> | ||
<topic number="2"> | ||
<query> | ||
multiple animals including a zebra | ||
</query> | ||
<original_query> | ||
<query_id> | ||
2 | ||
</query_id> | ||
<title> | ||
multiple animals including a zebra | ||
</title> | ||
<description> | ||
Which pangrams have multiple animals where one of the animals is a zebra? | ||
</description> | ||
<narrative> | ||
Relevant pangrams have at least two animals, one of the animals must be a Zebra. Pangrams containing only a Zebra are not relevant. | ||
</narrative> | ||
</original_query> | ||
</topic> | ||
</topics> |
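The notebook loads this file with `pt.io.read_topics(..., format='trecxml')`. To make the structure concrete, here is a standard-library sketch that extracts the topic number and query text from the same layout. It is an illustration only and does not claim to reproduce PyTerrier's parser or its query normalisation; `read_topics_xml` is a hypothetical name, and the sample string is an abbreviated copy of the file above (the `original_query` blocks are omitted).

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def read_topics_xml(xml_text: str) -> List[Dict[str, str]]:
    """Return one {'qid', 'query'} dict per <topic> element."""
    root = ET.fromstring(xml_text)
    topics = []
    for topic in root.findall("topic"):
        qid = topic.get("number", "")
        # The file stores the query on its own line, so surrounding whitespace is stripped.
        query = (topic.findtext("query") or "").strip()
        topics.append({"qid": qid, "query": query})
    return topics


SAMPLE = """\
<topics ir-datasets-id="pangrams">
<topic number="1">
<query>
fox jumps above animal
</query>
</topic>
<topic number="2">
<query>
multiple animals including a zebra
</query>
</topic>
</topics>
"""

topics = read_topics_xml(SAMPLE)
for topic in topics:
    print(topic["qid"], topic["query"])
```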