Merged
3 changes: 1 addition & 2 deletions Makefile
@@ -35,9 +35,8 @@ pr:
make lint
make test

build-docs:
build-docs: build-docs-overview
Contributor


oO does this work?

Member Author


Yes, everything after `:` is a prerequisite, and prerequisites are built before the target's recipe runs:

targets: prerequisites
	command

@echo "--- 📚 Building documentation ---"
make build-docs-overview
python -m mkdocs build


23 changes: 23 additions & 0 deletions docs/advanced_usage/retrieval_backend.md
Contributor


shouldn't we also add documentation on the search backend?

Maybe we should also add something here so people can discover what has been added:

*(screenshot attached)*

A kind of user-friendly changelog

Member Author

@Samoed Nov 25, 2025


It will be shown in advanced usage

Contributor


Yeah, but people will not know what has happened since 2.0.0

I would probably change `New in v2.0` to

- What is new
  - v2.3
  - v2.2
  - v2.1
  - v2.0

Member Author


I think this is more about changelog #3401

Contributor


Fair, we still need the API docs though.

@@ -0,0 +1,23 @@
# Retrieval search backend

!!! note "Available since 2.3.0"
    This feature was introduced in version **2.3.0**.

For large datasets, search can take a lot of time and memory. To reduce both, you can use `FaissSearchIndex`. To use it, install the FAISS extra: `pip install mteb[faiss]`.

Usage example:
```python
import mteb
from mteb.models import SearchEncoderWrapper
from mteb.models.search_encoder_index import FaissSearchIndex

model = mteb.get_model(...)
index_backend = FaissSearchIndex(model)
model = SearchEncoderWrapper(
    model,
    index_backend=index_backend,
)
...
```

For example, running `minishlab/potion-base-2M` on `SWEbenchVerifiedRR` took 694 seconds instead of 769.
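For intuition, a flat FAISS index performs exact brute-force search over all stored vectors. A standalone NumPy sketch of the same top-k retrieval (illustration only, not the mteb API):

```python
import numpy as np


def top_k_inner_product(queries: np.ndarray, corpus: np.ndarray, k: int):
    """Exact top-k retrieval by inner product, as a flat index does."""
    scores = queries @ corpus.T  # (num_queries, num_docs)
    idx = np.argsort(-scores, axis=1)[:, :k]  # best k documents per query
    vals = np.take_along_axis(scores, idx, axis=1)
    return vals, idx


queries = np.array([[1.0, 0.0], [0.0, 1.0]])
corpus = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
vals, idx = top_k_inner_product(queries, corpus, k=2)
print(idx[0])  # first query's best docs: [0 1]
```

A FAISS index gives the same exact results for flat index types, but with optimized memory layout and batching.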
4 changes: 4 additions & 0 deletions docs/api/model.md
@@ -33,3 +33,7 @@ length, valid frameworks, license, and degree of openness.
:::mteb.models.CrossEncoderProtocol

:::mteb.models.MTEBModels

:::mteb.models.IndexEncoderSearchProtocol

:::mteb.models.CacheBackendProtocol
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -83,6 +83,7 @@ nav:
- Advanced Usage:
- Two stage reranking: advanced_usage/two_stage_reranking.md
- Cache embeddings: advanced_usage/cache_embeddings.md
- Retrieval backend: advanced_usage/retrieval_backend.md
- Contributing:
- Adding a Model: contributing/adding_a_model.md
- Adding a Task: contributing/adding_a_dataset.md
4 changes: 4 additions & 0 deletions mteb/__init__.py
@@ -9,8 +9,10 @@
from mteb.get_tasks import get_task, get_tasks
from mteb.load_results import load_results
from mteb.models import (
CacheBackendProtocol,
CrossEncoderProtocol,
EncoderProtocol,
IndexEncoderSearchProtocol,
SearchProtocol,
SentenceTransformerEncoderWrapper,
)
@@ -27,8 +29,10 @@
"AbsTask",
"Benchmark",
"BenchmarkResults",
"CacheBackendProtocol",
"CrossEncoderProtocol",
"EncoderProtocol",
"IndexEncoderSearchProtocol",
"SearchProtocol",
"SentenceTransformerEncoderWrapper",
"TaskMetadata",
5 changes: 4 additions & 1 deletion mteb/models/__init__.py
@@ -1,11 +1,12 @@
from .cache_wrappers import CachedEmbeddingWrapper
from .cache_wrappers import CacheBackendProtocol, CachedEmbeddingWrapper
from .model_meta import ModelMeta
from .models_protocols import (
CrossEncoderProtocol,
EncoderProtocol,
MTEBModels,
SearchProtocol,
)
from .search_encoder_index.search_backend_protocol import IndexEncoderSearchProtocol
from .search_wrappers import SearchCrossEncoderWrapper, SearchEncoderWrapper
from .sentence_transformer_wrapper import (
CrossEncoderWrapper,
@@ -14,10 +15,12 @@
)

__all__ = [
"CacheBackendProtocol",
"CachedEmbeddingWrapper",
"CrossEncoderProtocol",
"CrossEncoderWrapper",
"EncoderProtocol",
"IndexEncoderSearchProtocol",
"MTEBModels",
"ModelMeta",
"SearchCrossEncoderWrapper",
3 changes: 2 additions & 1 deletion mteb/models/cache_wrappers/__init__.py
@@ -1,3 +1,4 @@
from .cache_backend_protocol import CacheBackendProtocol
from .cache_wrapper import CachedEmbeddingWrapper

__all__ = ["CachedEmbeddingWrapper"]
__all__ = ["CacheBackendProtocol", "CachedEmbeddingWrapper"]
26 changes: 8 additions & 18 deletions mteb/models/model_implementations/random_baseline.py
@@ -8,6 +8,10 @@

from mteb.abstasks.task_metadata import TaskMetadata
from mteb.models.model_meta import ModelMeta
from mteb.similarity_functions import (
select_pairwise_similarity,
select_similarity,
)
from mteb.types._encoder_io import Array, BatchedInput, PromptType


@@ -155,15 +159,9 @@ def similarity(
Returns:
Similarity matrix between the two sets of embeddings (per the model's `similarity_fn_name`)
"""
norm1 = np.linalg.norm(
embeddings1.reshape(-1, self.embedding_dim), axis=1, keepdims=True
)
norm2 = np.linalg.norm(
embeddings2.reshape(-1, self.embedding_dim), axis=1, keepdims=True
return select_similarity(
embeddings1, embeddings2, self.mteb_model_meta.similarity_fn_name
)
normalized1 = embeddings1 / (norm1 + 1e-10)
normalized2 = embeddings2 / (norm2 + 1e-10)
return np.dot(normalized1, normalized2.T)

def similarity_pairwise(
self,
@@ -179,17 +177,9 @@ def similarity_pairwise(
Returns:
Similarity for each pair of embeddings (per the model's `similarity_fn_name`)
"""
norm1 = np.linalg.norm(
embeddings1.reshape(-1, self.embedding_dim), axis=1, keepdims=True
)
norm2 = np.linalg.norm(
embeddings2.reshape(-1, self.embedding_dim), axis=1, keepdims=True
return select_pairwise_similarity(
embeddings1, embeddings2, self.mteb_model_meta.similarity_fn_name
)
normalized1 = embeddings1 / (norm1 + 1e-10)
normalized2 = embeddings2 / (norm2 + 1e-10)
normalized1 = np.asarray(normalized1)
normalized2 = np.asarray(normalized2)
return np.sum(normalized1 * normalized2, axis=1)


random_encoder_baseline = ModelMeta(
7 changes: 7 additions & 0 deletions mteb/models/search_encoder_index/__init__.py
@@ -0,0 +1,7 @@
from .search_backend_protocol import IndexEncoderSearchProtocol
from .search_indexes import FaissSearchIndex

__all__ = [
    "FaissSearchIndex",
    "IndexEncoderSearchProtocol",
]
50 changes: 50 additions & 0 deletions mteb/models/search_encoder_index/search_backend_protocol.py
@@ -0,0 +1,50 @@
from collections.abc import Callable
from typing import Protocol

from mteb.types import Array, TopRankedDocumentsType


class IndexEncoderSearchProtocol(Protocol):
    """Protocol for search backends used in encoder-based retrieval."""

    def add_documents(
        self,
        embeddings: Array,
        idxs: list[str],
    ) -> None:
        """Add documents to the search backend.

        Args:
            embeddings: Embeddings of the documents to add.
            idxs: IDs of the documents to add.
        """

    def search(
        self,
        embeddings: Array,
        top_k: int,
        similarity_fn: Callable[[Array, Array], Array],
        top_ranked: TopRankedDocumentsType | None = None,
        query_idx_to_id: dict[int, str] | None = None,
    ) -> tuple[list[list[float]], list[list[int]]]:
        """Search through added corpus embeddings or rerank top-ranked documents.

        Supports both full-corpus and reranking search modes:
        - Full-corpus mode: `top_ranked=None`; searches the added corpus embeddings.
        - Reranking mode: `top_ranked` contains a mapping {query_id: [doc_ids]}.

        Args:
            embeddings: Query embeddings, shape (num_queries, dim).
            top_k: Number of top results to return.
            similarity_fn: Function to compute similarity between query and corpus.
            top_ranked: Mapping of query_id -> list of candidate doc_ids. Used for reranking.
            query_idx_to_id: Mapping of query index -> query_id. Used for reranking.

        Returns:
            A tuple (top_k_values, top_k_indices), where for each query:
            - top_k_values: List of top-k similarity scores.
            - top_k_indices: List of indices of the top-k documents in the added corpus.
        """

    def clear(self) -> None:
        """Clear all stored documents and embeddings from the backend."""
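Since this is a `Protocol`, any object with these three methods satisfies it structurally; no inheritance is required. A minimal in-memory NumPy backend covering only the full-corpus mode (a hedged sketch, not part of the PR; `similarity_fn` is assumed to take `(queries, corpus)` and return a score matrix):

```python
import numpy as np


class NumpySearchIndex:
    """Minimal brute-force backend matching IndexEncoderSearchProtocol."""

    def __init__(self) -> None:
        self.idxs: list[str] = []
        self._embeddings: np.ndarray | None = None

    def add_documents(self, embeddings, idxs) -> None:
        embeddings = np.asarray(embeddings, dtype=np.float32)
        self._embeddings = (
            embeddings
            if self._embeddings is None
            else np.vstack([self._embeddings, embeddings])
        )
        self.idxs.extend(idxs)

    def search(self, embeddings, top_k, similarity_fn,
               top_ranked=None, query_idx_to_id=None):
        # Full-corpus mode only: score every query against every document.
        scores = np.asarray(similarity_fn(np.asarray(embeddings), self._embeddings))
        order = np.argsort(-scores, axis=1)[:, :top_k]
        values = np.take_along_axis(scores, order, axis=1)
        return values.tolist(), order.tolist()

    def clear(self) -> None:
        self.idxs = []
        self._embeddings = None


index = NumpySearchIndex()
index.add_documents(np.eye(3, dtype=np.float32), ["a", "b", "c"])
values, ids = index.search(
    np.array([[0.0, 1.0, 0.0]], dtype=np.float32),
    top_k=2,
    similarity_fn=lambda q, c: q @ c.T,
)
print(ids[0][0])  # doc "b" (index 1) ranks first
```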
5 changes: 5 additions & 0 deletions mteb/models/search_encoder_index/search_indexes/__init__.py
@@ -0,0 +1,5 @@
from .faiss_search_index import FaissSearchIndex

__all__ = [
    "FaissSearchIndex",
]
157 changes: 157 additions & 0 deletions mteb/models/search_encoder_index/search_indexes/faiss_search_index.py
@@ -0,0 +1,157 @@
import logging
from collections.abc import Callable

import numpy as np
import torch

from mteb._requires_package import requires_package
from mteb.models.model_meta import ScoringFunction
from mteb.models.models_protocols import EncoderProtocol
from mteb.types import Array, TopRankedDocumentsType

logger = logging.getLogger(__name__)


class FaissSearchIndex:
    """FAISS-based backend for encoder-based search.

    Supports both full-corpus retrieval and reranking (via `top_ranked`).

    Notes:
        - Stores *all* embeddings in memory (IndexFlatIP or IndexFlatL2).
        - Embeddings are L2-normalized internally when cosine similarity is used.
    """

    _normalize: bool = False

    def __init__(self, model: EncoderProtocol) -> None:
        requires_package(
            self,
            "faiss",
            "FAISS-based search",
            install_instruction="pip install mteb[faiss-cpu]",
        )

        import faiss
        from faiss import IndexFlatIP, IndexFlatL2

        # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
        if model.mteb_model_meta.similarity_fn_name is ScoringFunction.DOT_PRODUCT:
            self.index_type = IndexFlatIP
        elif model.mteb_model_meta.similarity_fn_name is ScoringFunction.COSINE:
            self.index_type = IndexFlatIP
            self._normalize = True
        elif model.mteb_model_meta.similarity_fn_name is ScoringFunction.EUCLIDEAN:
            self.index_type = IndexFlatL2
        else:
            raise ValueError(
                f"FAISS backend does not support similarity function {model.mteb_model_meta.similarity_fn_name}. "
                f"Available: {ScoringFunction.DOT_PRODUCT}, {ScoringFunction.COSINE}, {ScoringFunction.EUCLIDEAN}."
            )

        self.idxs: list[str] = []
        self.index: faiss.Index | None = None

    def add_documents(self, embeddings: Array, idxs: list[str]) -> None:
        """Add all document embeddings and their IDs to the FAISS index."""
        import faiss

        if isinstance(embeddings, torch.Tensor):
            embeddings = embeddings.detach().cpu().numpy()

        embeddings = embeddings.astype(np.float32)
        self.idxs.extend(idxs)

        if self._normalize:
            faiss.normalize_L2(embeddings)

        dim = embeddings.shape[1]
        if self.index is None:
            self.index = self.index_type(dim)

        self.index.add(embeddings)
        logger.info(f"FAISS index built with {len(idxs)} vectors of dim {dim}.")

    def search(
        self,
        embeddings: Array,
        top_k: int,
        similarity_fn: Callable[[Array, Array], Array],
        top_ranked: TopRankedDocumentsType | None = None,
        query_idx_to_id: dict[int, str] | None = None,
    ) -> tuple[list[list[float]], list[list[int]]]:
        """Search using FAISS."""
        import faiss

        if self.index is None:
            raise ValueError("No index built. Call add_documents() first.")

        if isinstance(embeddings, torch.Tensor):
            embeddings = embeddings.detach().cpu().numpy()

        # faiss.normalize_L2 requires float32, so cast before normalizing.
        embeddings = embeddings.astype(np.float32)
        if self._normalize:
            faiss.normalize_L2(embeddings)

        if top_ranked is not None:
            if query_idx_to_id is None:
                raise ValueError("query_idx_to_id must be provided when reranking.")

            similarities, ids = self._reranking(
                embeddings,
                top_k,
                top_ranked=top_ranked,
                query_idx_to_id=query_idx_to_id,
            )
        else:
            similarities, ids = self.index.search(embeddings, top_k)
            similarities = similarities.tolist()
            ids = ids.tolist()

        if issubclass(self.index_type, faiss.IndexFlatL2):
            # IndexFlatL2 returns squared L2 distances; convert to negated
            # distances so that higher scores are better. Converted per row,
            # since reranking may yield rows of different lengths.
            similarities = [
                (-np.sqrt(np.maximum(row, 0))).tolist() for row in similarities
            ]

        return similarities, ids

    def _reranking(
        self,
        embeddings: Array,
        top_k: int,
        top_ranked: TopRankedDocumentsType | None = None,
        query_idx_to_id: dict[int, str] | None = None,
    ) -> tuple[list[list[float]], list[list[int]]]:
        doc_id_to_idx = {doc_id: i for i, doc_id in enumerate(self.idxs)}
        scores_all: list[list[float]] = []
        idxs_all: list[list[int]] = []

        for query_idx, query_emb in enumerate(embeddings):
            query_id = query_idx_to_id[query_idx]
            ranked_ids = top_ranked.get(query_id)
            if not ranked_ids:
                logger.warning(f"No top-ranked documents for query {query_id}")
                scores_all.append([])
                idxs_all.append([])
                continue

            candidate_indices = [doc_id_to_idx[doc_id] for doc_id in ranked_ids]
            d = self.index.d
            candidate_embs = np.vstack(
                [self.index.reconstruct(idx) for idx in candidate_indices]
            )
            sub_reranking_index = self.index_type(d)
            sub_reranking_index.add(candidate_embs)

            # Search returns scores and indices in one call
            scores, local_indices = sub_reranking_index.search(
                query_emb.reshape(1, -1).astype(np.float32),
                min(top_k, len(candidate_indices)),
            )
            # FAISS outputs 2D arrays even for a single query
            scores_all.append(scores[0].tolist())
            idxs_all.append(local_indices[0].tolist())

        return scores_all, idxs_all

    def clear(self) -> None:
        """Clear all stored documents and embeddings from the backend."""
        self.index = None
        self.idxs = []
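The COSINE branch above reuses `IndexFlatIP` and relies on the identity that the inner product of L2-normalized vectors equals their cosine similarity; a quick standalone NumPy check (not part of the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8)).astype(np.float32)

# Cosine similarity computed directly.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# Inner product after normalizing both vectors to unit length.
inner_of_normalized = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))

print(np.isclose(cosine, inner_of_normalized))  # True
```

This is why the index normalizes document embeddings in `add_documents` and query embeddings in `search` before delegating to the inner-product index.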