The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank
There are two models trained for spanish, a bi-encoder and a cross-encoder. These serve to make the retrieval system using the retrieve and rerank idea:
make setup
pip install -r requirements.txt
- Setup Elasticsearch index with semantic vectors. For this step we supose that a set of json files is folder. Each json can contain several optional fields but need to contain id and text fiedlds.
from information_retrieval import SemanticEmbedder, CrossEncoder, Prepare, Search
data_folder = 'data/'
text_field = "texto_parrafo"
id_field = "id_parrafo"
elastic_index_name = "sentencias_2.0"
# Read the files, compute embeddings and upload them to elasticsearch
P = Prepare(data_folder, text_field, id_field, elastic_index_name)
P.prepare()
- Make queries to retrieve documents:
from information_retrieval import SearchEngine
query = "la vida es bella"
S = SearchEngine(elastic_index_name)
S.retrieve(query) # Only semantic search
S.rerank(query) # Retrieve and rerank