Open Retrieval

Retrieve semantically close text embeddings using a prebuilt FAISS index and a retrieval model from Hugging Face Transformers.

Query the FAISS index and optionally retrieve metadata from a Parquet file via pandas.
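As a rough illustration of the retrieval step, the sketch below ranks items by inner product against a query vector, which is the ranking an exhaustive inner-product FAISS index would produce. It is not this project's code: it uses plain NumPy instead of FAISS, the toy vectors stand in for real model embeddings, and the names top_n, item_vecs and texts are hypothetical. The index type actually used by this repository is not stated here.

```python
import numpy as np

def top_n(query_vec, item_vecs, n=5):
    """Return (indices, scores) of the n items with the highest inner product."""
    scores = item_vecs @ query_vec        # one similarity score per item
    order = np.argsort(-scores)[:n]       # highest-scoring items first
    return order, scores[order]

# Toy stand-ins for real embeddings (3 items, 4 dimensions).
item_vecs = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.7, 0.7, 0.0, 0.0],
    ],
    dtype=np.float32,
)
# Stand-in for the text column of a dataset; row i belongs to vector i.
texts = ["doc about A", "doc about B", "doc about A and B"]

query_vec = np.array([1.0, 0.1, 0.0, 0.0], dtype=np.float32)
ids, scores = top_n(query_vec, item_vecs, n=2)

# Map the returned indices back to the stored text.
results = [texts[i] for i in ids]
```

The returned indices play the same role as the result indices described under Parameters: they are row positions that can be used to look up the matching text or extra columns in a dataset.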

Requirements

faiss transformers pandas numpy

Get Started

%cd open_retrieval

Method 1. Use the command line (requires loading the data and models on every run): !python main.py -h

Method 2. Load each component yourself; see tests/test.py

Parameters:

index_path (Required) : A string that represents the path to a FAISS index containing the items for retrieval.

retriever (Required) : A string that specifies the Transformers AutoModel used to embed text for retrieval. The default value is "facebook/contriever-msmarco".

query (Optional) : A string that represents the query for retrieval. The default value is "Who was the last man on the moon?".

device (Optional): A string that specifies the device to run processing on, either "gpu" or "cpu". The default value is "cpu".

dataset_path (Optional): A string that represents the path to the Parquet dataset containing the items for retrieval as strings. The default value is None.

column (Required when using a dataset): A string that specifies the name of the column in the dataset containing the sentences/paragraphs for retrieval. The default value is None.

extra_columns (Optional): A string that specifies the columns to return alongside the result indices when using a dataset. The default value is None.

n (Optional): An integer that specifies the number of results to retrieve. The default value is 5.

use_nn (Optional): A Boolean value that determines whether to use 3 nearest neighbours for retrieval. The default value is False.

nn_threshold (Optional): A float value that represents the minimum similarity for the 3 nearest neighbours when use_nn is enabled. The default value is 0.7.

json_path (Optional): A string that represents the path to a JSON file to which the results are written. The default value is an empty string.
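The parameters above could be wired into a command-line parser along the following lines. This is a hypothetical sketch, not the contents of main.py: the flag spellings are assumed to mirror the parameter names, the choices for device and the check on column are inferred from the descriptions, and my.index is a made-up path. Running python main.py -h remains the authoritative reference.

```python
import argparse

def build_parser():
    # Flag names are assumed to match the documented parameter names.
    p = argparse.ArgumentParser(description="Query a FAISS index (illustrative sketch)")
    p.add_argument("--index_path", required=True, help="path to the FAISS index")
    # Listed as required in the text but documented with a default,
    # so it is left optional here (assumption).
    p.add_argument("--retriever", default="facebook/contriever-msmarco")
    p.add_argument("--query", default="Who was the last man on the moon?")
    p.add_argument("--device", default="cpu", choices=["cpu", "gpu"])
    p.add_argument("--dataset_path", default=None, help="path to a Parquet dataset")
    p.add_argument("--column", default=None, help="text column in the dataset")
    p.add_argument("--extra_columns", default=None)
    p.add_argument("--n", type=int, default=5, help="number of results")
    p.add_argument("--use_nn", action="store_true")  # defaults to False
    p.add_argument("--nn_threshold", type=float, default=0.7)
    p.add_argument("--json_path", default="", help="optional JSON output file")
    return p

def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # column is documented as required whenever a dataset is used.
    if args.dataset_path is not None and args.column is None:
        parser.error("--column is required when --dataset_path is given")
    return args

# "my.index" is a hypothetical path used only for illustration.
args = parse(["--index_path", "my.index", "--n", "3"])
```

Keeping the dataset check next to the parser means a missing column is reported before any index or model is loaded.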
