GitHub - SergioArnaud/information_retrieval: Train and prepare information retrieval systems in non-English languages

Main Idea

The following links explain the idea behind semantic search and how retrieve-and-rerank search mechanisms work.

Setup

Download trained models

There are two models trained for Spanish, a bi-encoder and a cross-encoder. Together they form the retrieval system, following the retrieve-and-rerank idea:

make setup
pip install -r requirements.txt

Basic usage

Set up an Elasticsearch index with semantic vectors. For this step we assume that a set of JSON files is stored in a folder. Each JSON file can contain several optional fields but must contain id and text fields.

from information_retrieval import SemanticEmbedder, CrossEncoder, Prepare, Search

data_folder = 'data/'
text_field = "texto_parrafo"
id_field = "id_parrafo"
elastic_index_name = "sentencias_2.0"

# Read the files, compute embeddings and upload them to elasticsearch
P = Prepare(data_folder, text_field, id_field, elastic_index_name)
P.prepare()
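
As a rough illustration of the input this step expects, the sketch below writes one JSON file that carries the id and text fields named in the snippet above. This README does not state whether each file holds a single object or a list of objects, so one object per file is an assumption here. The helper `write_sample_document`, the file naming scheme, and the optional `fuente` field are hypothetical and not part of the library.

```python
import json
from pathlib import Path


def write_sample_document(folder, doc_id, text, **optional_fields):
    """Write one JSON document containing the required id and text fields.

    Assumption: one JSON object per file. The layout that Prepare actually
    reads may differ, so check it against your own data.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Required fields use the names passed to Prepare as id_field and text_field.
    document = {"id_parrafo": doc_id, "texto_parrafo": text, **optional_fields}
    path = folder / f"{doc_id}.json"
    path.write_text(json.dumps(document, ensure_ascii=False), encoding="utf-8")
    return path
```

Calling `write_sample_document("data/", "p-001", "La vida es bella.", fuente="ejemplo")` would create `data/p-001.json` with the two required fields plus the optional one.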

Make queries to retrieve documents:

from information_retrieval import SearchEngine

query = "la vida es bella"
S = SearchEngine(elastic_index_name)
S.retrieve(query) # Only semantic search

S.rerank(query) # Retrieve and rerank
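
To make the difference between the two calls concrete, here is a minimal, self-contained sketch of the retrieve-and-rerank pattern. It does not use the trained models or Elasticsearch: `embed` is a toy bag-of-words stand-in for a bi-encoder, and `cross_score` is a toy bigram-overlap stand-in for a cross-encoder. All function names and the sample documents are assumptions for illustration and do not reflect the library's internals.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cross_score(query, doc):
    """Toy stand-in for a cross-encoder: scores the (query, doc) pair jointly
    by counting the adjacent word pairs they share, so word order matters."""
    def bigrams(text):
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))


def retrieve(query, docs, top_k=3):
    """Stage 1: rank every document by embedding similarity and keep top_k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def rerank(query, docs, top_k=3):
    """Stage 2: rescore only the retrieved candidates with the pairwise scorer."""
    candidates = retrieve(query, docs, top_k)
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)


docs = [
    "bella es la vida",
    "la vida es bella",
    "el perro come pan",
    "la casa es grande",
]
```

In this toy setup the first stage cannot tell the first two documents apart, because they contain the same words. The second stage looks at the query and each candidate together and puts the exact phrasing first. The general design point is that the cheap scorer runs over the whole collection, while the more expensive pairwise scorer runs only over a short candidate list.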
