Project Modernization (#27)
* DS_Store in gitignore

* license

* pyproject.toml instead of setup.py

* pt_docs

* test and deploy actions

* coverage target

* faiss as dev dependency

* style check

* ruff fixes

* fix ruff errors in indexes.py

* dev dependencies
seanmacavaney authored Nov 24, 2024
1 parent ac4ed9d commit 6909587
Showing 15 changed files with 238 additions and 125 deletions.
File renamed without changes.
51 changes: 0 additions & 51 deletions .github/workflows/push.yml

This file was deleted.

36 changes: 36 additions & 0 deletions .github/workflows/style.yml
@@ -0,0 +1,36 @@
name: style

on:
  push: {branches: [master]} # pushes to master
  pull_request: {} # all PRs

jobs:
  ruff:
    strategy:
      matrix:
        python-version: ['3.10']
        os: ['ubuntu-latest']

    runs-on: ${{ matrix.os }}
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Install Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}

    - name: Cache Dependencies
      uses: actions/cache@v4
      with:
        path: ${{ env.pythonLocation }}
        key: ${{ matrix.os }}-${{ matrix.python-version }}-${{ hashFiles('requirements.txt', 'requirements-dev.txt') }}

    - name: Install Dependencies
      run: |
        pip install --upgrade -r requirements-dev.txt
        pip install -e .

    - name: Ruff
      run: 'ruff check --output-format=github pyterrier_dr'
62 changes: 62 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,62 @@
name: test

on:
  push: {branches: [master]} # pushes to master
  pull_request: {} # all PRs
  schedule: [cron: '0 12 * * 3'] # every Wednesday at noon

jobs:
  pytest:
    strategy:
      matrix:
        os: ['ubuntu-latest']
        python-version: ['3.8', '3.12']

    runs-on: ${{ matrix.os }}
    env:
      runtag: ${{ matrix.os }}-${{ matrix.python-version }}

    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Install Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}

    - name: Cache Dependencies
      uses: actions/cache@v4
      with:
        path: ${{ env.pythonLocation }}
        key: ${{ env.runtag }}-${{ hashFiles('requirements.txt', 'requirements-dev.txt') }}

    - name: Loading Torch models from cache
      uses: actions/cache@v3
      with:
        path: /home/runner/.cache/
        key: model-cache

    - name: Install Dependencies
      run: |
        pip install --upgrade -r requirements.txt -r requirements-dev.txt
        pip install -e .

    - name: Unit Test
      run: |
        pytest --durations=20 -p no:faulthandler --json-report --json-report-file ${{ env.runtag }}.results.json --cov pyterrier_dr --cov-report json:${{ env.runtag }}.coverage.json tests/

    - name: Upload Test Results
      if: always()
      uses: actions/upload-artifact@v4
      with:
        path: ${{ env.runtag }}.*.json
        overwrite: true

    - name: Report Test Results
      if: always()
      run: |
        printf "**Test Results**\n\n" >> $GITHUB_STEP_SUMMARY
        jq '.summary' ${{ env.runtag }}.results.json >> $GITHUB_STEP_SUMMARY
        printf "\n\n**Test Coverage**\n\n" >> $GITHUB_STEP_SUMMARY
        jq '.files | to_entries[] | " - `" + .key + "`: **" + .value.summary.percent_covered_display + "%**"' -r ${{ env.runtag }}.coverage.json >> $GITHUB_STEP_SUMMARY
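
Note on the "Report Test Results" step: the last jq filter turns the coverage JSON into one Markdown bullet per file. As a rough illustration of what it produces, the sketch below applies the same formatting in Python. The sample report is made up for the example (the paths and percentages are hypothetical); only the `files`, `summary`, and `percent_covered_display` keys are taken from the jq expression in the workflow.

```python
def coverage_summary_lines(report):
    """Mirror the jq filter: one Markdown bullet per file in the report."""
    lines = []
    for path, info in report["files"].items():
        pct = info["summary"]["percent_covered_display"]
        lines.append(f" - `{path}`: **{pct}%**")
    return lines


# Hypothetical sample data shaped like the keys the jq expression reads.
sample_report = {
    "files": {
        "pyterrier_dr/util.py": {"summary": {"percent_covered_display": "87"}},
        "pyterrier_dr/indexes.py": {"summary": {"percent_covered_display": "42"}},
    }
}

for line in coverage_summary_lines(sample_report):
    print(line)
```

Each printed line is what would be appended to `$GITHUB_STEP_SUMMARY` for one source file.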
2 changes: 2 additions & 0 deletions .gitignore
@@ -129,3 +129,5 @@ dmypy.json

 # Pyre type checker
 .pyre/
+
+.DS_Store
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024, Sean MacAvaney

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -1 +1 @@
-include requirements.txt
+recursive-include pyterrier_dr *.rst
43 changes: 43 additions & 0 deletions pyproject.toml
@@ -0,0 +1,43 @@
[build-system]
requires = ["setuptools >= 61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "pyterrier-dr"
description = "Dense Retrieval for PyTerrier"
requires-python = ">=3.8"
authors = [
{name = "Sean MacAvaney", email = "[email protected]"},
]
maintainers = [
{name = "Sean MacAvaney", email = "[email protected]"},
]
readme = "README.rst"
classifiers = [
"Programming Language :: Python",
"Operating System :: OS Independent",
"Topic :: Text Processing",
"Topic :: Text Processing :: Indexing",
"License :: OSI Approved :: MIT License",
]
dynamic = ["version", "dependencies"]

[tool.setuptools.dynamic]
version = {attr = "pyterrier_dr.__version__"}
dependencies = {file = ["requirements.txt"]}

[project.optional-dependencies]
bgem3 = [
"FlagEmbedding",
]

[tool.setuptools.packages.find]
exclude = ["tests"]

[project.urls]
Repository = "https://github.com/terrierteam/pyterrier_dr"
"Bug Tracker" = "https://github.com/terrierteam/pyterrier_dr/issues"

[project.entry-points."pyterrier.artifact"]
"dense_index.flex" = "pyterrier_dr:FlexIndex"
"cde_cache.np_pickle" = "pyterrier_dr:CDECache"
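
Note on the `[project.entry-points."pyterrier.artifact"]` table: it registers two names under that group once the package is installed. A minimal sketch of how such registrations can be read back with the standard library follows. The helper name `artifact_entry_points` is hypothetical, and how PyTerrier itself consumes this group is not shown in this commit.

```python
from importlib.metadata import entry_points


def artifact_entry_points(group="pyterrier.artifact"):
    """Return {name: "module:attr"} for every installed entry point in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports selection by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: the result is a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# With pyterrier-dr installed, this would be expected to include the
# "dense_index.flex" and "cde_cache.np_pickle" names declared above.
print(artifact_entry_points())
```

The version branch is there because `requires-python = ">=3.8"` spans both forms of the `entry_points()` return value.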
25 changes: 15 additions & 10 deletions pyterrier_dr/__init__.py
@@ -1,12 +1,17 @@
 __version__ = '0.2.0'

-from .util import SimFn, infer_device
-from .indexes import DocnoFile, NilIndex, NumpyIndex, RankedLists, FaissFlat, FaissHnsw, MemIndex, TorchIndex
-from .flex import FlexIndex
-from .biencoder import BiEncoder, BiQueryEncoder, BiDocEncoder, BiScorer
-from .hgf_models import HgfBiEncoder, TasB, RetroMAE
-from .sbert_models import SBertBiEncoder, Ance, Query2Query, GTR
-from .tctcolbert_model import TctColBert
-from .electra import ElectraScorer
-from .bge_m3 import BGEM3, BGEM3QueryEncoder, BGEM3DocEncoder
-from .cde import CDE, CDECache
+from pyterrier_dr.util import SimFn, infer_device
+from pyterrier_dr.indexes import DocnoFile, NilIndex, NumpyIndex, RankedLists, FaissFlat, FaissHnsw, MemIndex, TorchIndex
+from pyterrier_dr.flex import FlexIndex
+from pyterrier_dr.biencoder import BiEncoder, BiQueryEncoder, BiDocEncoder, BiScorer
+from pyterrier_dr.hgf_models import HgfBiEncoder, TasB, RetroMAE
+from pyterrier_dr.sbert_models import SBertBiEncoder, Ance, Query2Query, GTR
+from pyterrier_dr.tctcolbert_model import TctColBert
+from pyterrier_dr.electra import ElectraScorer
+from pyterrier_dr.bge_m3 import BGEM3, BGEM3QueryEncoder, BGEM3DocEncoder
+from pyterrier_dr.cde import CDE, CDECache
+
+__all__ = ["FlexIndex", "DocnoFile", "NilIndex", "NumpyIndex", "RankedLists", "FaissFlat", "FaissHnsw", "MemIndex", "TorchIndex",
+"BiEncoder", "BiQueryEncoder", "BiDocEncoder", "BiScorer", "HgfBiEncoder", "TasB", "RetroMAE", "SBertBiEncoder", "Ance",
+"Query2Query", "GTR", "TctColBert", "ElectraScorer", "BGEM3", "BGEM3QueryEncoder", "BGEM3DocEncoder", "CDE", "CDECache",
+"SimFn", "infer_device"]
2 changes: 1 addition & 1 deletion pyterrier_dr/bge_m3.py
@@ -16,7 +16,7 @@ def __init__(self, model_name='BAAI/bge-m3', batch_size=32, max_length=8192, tex
         self.device = torch.device(device)
         try:
             from FlagEmbedding import BGEM3FlagModel
-        except ImportError as e:
+        except ImportError:
             raise ImportError("BGE-M3 requires the FlagEmbedding package. You can install it using 'pip install pyterrier-dr[bgem3]'")

         self.model = BGEM3FlagModel(self.model_name, use_fp16=self.use_fp16, device=self.device)
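
Note on the optional-dependency guard: the hunk above only drops the unused `as e` binding. For readers unfamiliar with the pattern, here is a small self-contained sketch of an import guard that fails with an install hint. The helper `require_optional` is hypothetical and not part of `pyterrier_dr`. As a variation on the commit's code, it chains the original exception with `from exc`, which is one way to keep the binding in use rather than removing it.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"You can install it using 'pip install pyterrier-dr[{extra}]'"
        ) from exc


# A module that is always available imports normally.
json_module = require_optional("json", "bgem3")
print(json_module.__name__)
```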
20 changes: 11 additions & 9 deletions pyterrier_dr/flex/__init__.py
@@ -1,9 +1,11 @@
-from .core import FlexIndex, IndexingMode
-from .np_retr import *
-from .torch_retr import *
-from .corpus_graph import *
-from .faiss_retr import *
-from .scann_retr import *
-from .ladr import *
-from .gar import *
-from .voyager_retr import *
+from pyterrier_dr.flex.core import FlexIndex, IndexingMode
+from pyterrier_dr.flex import np_retr
+from pyterrier_dr.flex import torch_retr
+from pyterrier_dr.flex import corpus_graph
+from pyterrier_dr.flex import faiss_retr
+from pyterrier_dr.flex import scann_retr
+from pyterrier_dr.flex import ladr
+from pyterrier_dr.flex import gar
+from pyterrier_dr.flex import voyager_retr
+
+__all__ = ["FlexIndex", "IndexingMode", "np_retr", "torch_retr", "corpus_graph", "faiss_retr", "scann_retr", "ladr", "gar", "voyager_retr"]
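
Note on the module imports: `flex/__init__.py` now imports each submodule by name instead of star-importing it. If those submodules exist mainly to attach methods to `FlexIndex` at import time (an assumption here, suggested by the `index.np_retriever()` call in the docs added by this commit), then importing the module is enough and nothing needs to be pulled into the namespace. The sketch below illustrates that pattern with stand-in names; `FlexIndexSketch` and `_np_retriever` are hypothetical and are not the library's code.

```python
class FlexIndexSketch:
    """Stand-in for an index class that retrievers get attached to."""

    def __init__(self, path):
        self.path = path


def _np_retriever(self, num_results=1000):
    # A real implementation would build a retriever; this returns a description.
    return {"index": self.path, "backend": "numpy", "num_results": num_results}


# In a package, this assignment would sit at module level in a submodule,
# so that merely importing the submodule registers the method on the class.
FlexIndexSketch.np_retriever = _np_retriever

index = FlexIndexSketch("my.flex")
print(index.np_retriever(num_results=10))
```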
13 changes: 6 additions & 7 deletions pyterrier_dr/indexes.py
@@ -1,3 +1,4 @@
+# Deprecated module
 import torch
 import itertools
 import math
@@ -261,7 +262,7 @@ def index(self, inp):
 fout.write(doc_vecs.tobytes())
 docnos.extend([d['docno'] for d in docs])
 count += len(docs)
-DocnoFile.build(docnos, path/f'docnos.npy')
+DocnoFile.build(docnos, path/'docnos.npy')
 with open(path/'meta.json', 'wt') as f_meta:
 json.dump({'dtype': self.dtype, 'vec_size': vec_size, 'count': count}, f_meta)

@@ -468,8 +469,7 @@ def transform(self, inp):
 query_vecs = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
 query_vecs = query_vecs.copy()
 res = []
-query_heaps = [[] for _ in range(query_vecs.shape[0])]
-docnos = DocnoFile(self.index_path/f'docnos.npy')
+docnos = DocnoFile(self.index_path/'docnos.npy')
 num_q = query_vecs.shape[0]
 ranked_lists = RankedLists(self.num_results, num_q)
 dids_offset = 0
@@ -526,7 +526,7 @@ def index(self, inp):
 index.add(doc_vecs)
 docnos.extend(d['docno'] for d in batch)
 faiss.write_index(index, str(path/f'{shardid}.faiss'))
-DocnoFile.build(docnos, path/f'docnos.npy')
+DocnoFile.build(docnos, path/'docnos.npy')


 class FaissHnsw(pt.Indexer):
@@ -568,8 +568,7 @@ def transform(self, inp):
 query_vecs = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
 query_vecs = query_vecs.copy()
 res = []
-query_heaps = [[] for _ in range(query_vecs.shape[0])]
-docnos = DocnoFile(self.index_path/f'docnos.npy')
+docnos = DocnoFile(self.index_path/'docnos.npy')
 num_q = query_vecs.shape[0]
 ranked_lists = RankedLists(self.num_results, num_q)
 dids_offset = 0
@@ -629,7 +628,7 @@ def index(self, inp):
 index.add(doc_vecs)
 docnos.extend(d['docno'] for d in batch)
 faiss.write_index(index, str(path/f'{shardid}.faiss'))
-DocnoFile.build(docnos, path/f'docnos.npy')
+DocnoFile.build(docnos, path/'docnos.npy')
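
Note on the `f'docnos.npy'` changes: these look like fixes for ruff's F541 rule (f-string without any placeholders). The behaviour is unchanged, as the short sketch below shows with `pathlib`; the directory name `my_index` and the shard number are made up for the example. The `f'{shardid}.faiss'` strings stay as f-strings because they do interpolate a value.

```python
from pathlib import Path

path = Path("my_index")
shardid = 3

# An f-string with no placeholders is just a plain string, so both
# spellings build the same path (ruff reports the first one as F541).
same_path = (path / f"docnos.npy") == (path / "docnos.npy")
print(same_path)

# This f-string has a placeholder, so the prefix is needed.
shard_file = path / f"{shardid}.faiss"
print(shard_file.as_posix())
```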



36 changes: 36 additions & 0 deletions pyterrier_dr/pt_docs/index.rst
@@ -0,0 +1,36 @@
Dense Retrieval for PyTerrier
=======================================================

Features to support Dense Retrieval in `PyTerrier <https://github.com/terrier-org/pyterrier>`__.

.. rubric:: Getting Started

.. code-block:: console
    :caption: Install ``pyterrier-dr`` with ``pip``

    $ pip install pyterrier-dr

Import ``pyterrier_dr``, load a pre-built index and model, and retrieve:

.. code-block:: python
    :caption: Basic example of using ``pyterrier_dr``

    >>> from pyterrier_dr import FlexIndex, TasB
    >>> index = FlexIndex.from_hf('macavaney/vaswani.tasb.flex')
    >>> model = TasB('sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco')
    >>> pipeline = model.query_encoder() >> index.np_retriever()
    >>> pipeline.search('chemical reactions')
             score  docno  docid  rank  qid               query
    0    95.841721   7049   7048     0    1  chemical reactions
    1    94.669395   9374   9373     1    1  chemical reactions
    2    93.520027   3101   3100     2    1  chemical reactions
    3    92.809227   6480   6479     3    1  chemical reactions
    4    92.376190   3452   3451     4    1  chemical reactions
    ..         ...    ...    ...   ...   ..                 ...
    995  82.554390   7701   7700   995    1  chemical reactions
    996  82.552139   1553   1552   996    1  chemical reactions
    997  82.551933  10064  10063   997    1  chemical reactions
    998  82.546890   4417   4416   998    1  chemical reactions
    999  82.545776   7120   7119   999    1  chemical reactions
4 changes: 4 additions & 0 deletions requirements-dev.txt
@@ -1,5 +1,9 @@
pytest
pytest-subtests
pytest-cov
pytest-json-report
git+https://github.com/terrierteam/pyterrier_adaptive
voyager
FlagEmbedding
faiss-cpu
ruff