diff --git a/Makefile b/Makefile
index 5e6d700407..0f0937f08d 100644
--- a/Makefile
+++ b/Makefile
@@ -1,12 +1,12 @@
install:
@echo "--- 🚀 Installing project dependencies ---"
- pip install -e ".[dev]"
+ pip install -e ".[dev,image]"
pre-commit install
install-for-tests:
@echo "--- 🚀 Installing project dependencies for test ---"
@echo "This ensures that the project is not installed in editable mode"
- pip install ".[dev,speedtask]"
+ pip install ".[dev,speedtask,image]"
lint:
@echo "--- 🧹 Running linters ---"
diff --git a/README.md b/README.md
index b3240ab305..ebcc3e1a26 100644
--- a/README.md
+++ b/README.md
@@ -36,9 +36,11 @@
pip install mteb
```
+
## Example Usage
-* Using a Python script:
+
+### Using a script
```python
import mteb
@@ -53,41 +55,10 @@ evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```
-
- Running SentenceTransformer model with prompts
-
-Prompts can be passed to the SentenceTransformer model using the `prompts` parameter. The following code shows how to use prompts with SentenceTransformer:
-
-```python
-from sentence_transformers import SentenceTransformer
-
-
-model = SentenceTransformer("average_word_embeddings_komninos", prompts={"query": "Query:", "passage": "Passage:"})
-evaluation = mteb.MTEB(tasks=tasks)
-```
-
-In prompts the key can be:
-1. Prompt types (`passage`, `query`) - they will be used in reranking and retrieval tasks
-2. Task type - these prompts will be used in all tasks of the given type
- 1. `BitextMining`
- 2. `Classification`
- 3. `MultilabelClassification`
- 4. `Clustering`
- 5. `PairClassification`
- 6. `Reranking`
- 7. `Retrieval`
- 8. `STS`
- 9. `Summarization`
- 10. `InstructionRetrieval`
-3. Pair of task type and prompt type like `Retrival-query` - these prompts will be used in all classification tasks
-4. Task name - these prompts will be used in the specific task
-5. Pair of task name and prompt type like `NFCorpus-query` - these prompts will be used in the specific task
-
-
-* Using CLI
+### Using the CLI
```bash
-mteb available_tasks
+mteb available_tasks # list _all_ available tasks
mteb run -m sentence-transformers/all-MiniLM-L6-v2 \
-t Banking77Classification \
@@ -96,427 +67,52 @@ mteb run -m sentence-transformers/all-MiniLM-L6-v2 \
# if nothing is specified default to saving the results in the results/{model_name} folder
```
-* Using multiple GPUs in parallel can be done by just having a custom encode function that distributes the inputs to multiple GPUs like e.g. [here](https://github.com/microsoft/unilm/blob/b60c741f746877293bb85eed6806736fc8fa0ffd/e5/mteb_eval.py#L60) or [here](https://github.com/ContextualAI/gritlm/blob/09d8630f0c95ac6a456354bcb6f964d7b9b6a609/gritlm/gritlm.py#L75).
+Note that evaluation on multiple GPUs in parallel can be done by supplying a custom encode function that distributes the inputs across the GPUs, as done e.g. [here](https://github.com/microsoft/unilm/blob/b60c741f746877293bb85eed6806736fc8fa0ffd/e5/mteb_eval.py#L60) or [here](https://github.com/ContextualAI/gritlm/blob/09d8630f0c95ac6a456354bcb6f964d7b9b6a609/gritlm/gritlm.py#L75). See [custom models](docs/usage/usage.md#using-a-custom-model) for more information.
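A minimal sketch of what such a custom encode function can look like, sharding each batch across devices and re-assembling the embeddings in order. `MultiDeviceEncoder`, `toy_encode`, and the device names are illustrative assumptions, not part of the mteb API:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


class MultiDeviceEncoder:
    """Hypothetical sketch: shard the input batch across devices,
    encode each shard in parallel, then re-assemble in order."""

    def __init__(self, base_encode, devices):
        self.base_encode = base_encode  # base_encode(sentences, device) -> np.ndarray
        self.devices = devices

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # split the batch into one contiguous shard per device
        shards = np.array_split(np.asarray(sentences, dtype=object), len(self.devices))
        with ThreadPoolExecutor(max_workers=len(self.devices)) as pool:
            parts = list(pool.map(self.base_encode, (s.tolist() for s in shards), self.devices))
        # concatenation preserves the original sentence order
        return np.concatenate(parts, axis=0)


# toy stand-in for a real per-device encoder (e.g. model.to(device).encode)
def toy_encode(sentences, device):
    return np.full((len(sentences), 4), float(device[-1]), dtype=np.float32)


encoder = MultiDeviceEncoder(toy_encode, devices=["cuda:0", "cuda:1"])
embeddings = encoder.encode(["a", "b", "c", "d", "e"])
print(embeddings.shape)  # (5, 4)
```

In a real setup `base_encode` would move the shard to its device and run the model; the thread pool only overlaps the per-device work.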
## Usage Documentation
-Click on each section below to see the details.
-
-
-
-
- Task selection
-
-### Task selection
-
-Tasks can be selected by providing the list of datasets, but also
-
-* by their task (e.g. "Clustering" or "Classification")
-
-```python
-tasks = mteb.get_tasks(task_types=["Clustering", "Retrieval"]) # Only select clustering and retrieval tasks
-```
-
-* by their categories e.g. "s2s" (sentence to sentence) or "p2p" (paragraph to paragraph)
-
-```python
-tasks = mteb.get_tasks(categories=["s2s", "p2p"]) # Only select sentence2sentence and paragraph2paragraph datasets
-```
-
-* by their languages
-
-```python
-tasks = mteb.get_tasks(languages=["eng", "deu"]) # Only select datasets which contain "eng" or "deu" (iso 639-3 codes)
-```
-
-You can also specify which languages to load for multilingual/cross-lingual tasks like below:
-
-```python
-import mteb
-
-tasks = [
- mteb.get_task("AmazonReviewsClassification", languages = ["eng", "fra"]),
- mteb.get_task("BUCCBitextMining", languages = ["deu"]), # all subsets containing "deu"
-]
-
-# or you can select specific huggingface subsets like this:
-from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining
-
-evaluation = mteb.MTEB(tasks=[
- AmazonReviewsClassification(hf_subsets=["en", "fr"]) # Only load "en" and "fr" subsets of Amazon Reviews
- BUCCBitextMining(hf_subsets=["de-en"]), # Only load "de-en" subset of BUCC
-])
-# for an example of a HF subset see "Subset" in the dataset viewer at: https://huggingface.co/datasets/mteb/bucc-bitext-mining
-```
-
-* by their modalities
-
-```python
-tasks = mteb.get_tasks(modalities=["text", "image"]) # Only select tasks with text or image modalities
-```
-
- You can also specify exclusive modality filtering to only get tasks with exactly the requested modalities (default behavior with exclusive_modality_filter=False):
-```python
-# Get tasks with text modality, this will also include tasks having both text and image modalities
-tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=False)
-
-# Get tasks that have ONLY text modality (no image or other modalities)
-tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=True)
-```
-
-
-
-
- Running a benchmark
-
-### Running a Benchmark
-
-`mteb` comes with a set of predefined benchmarks. These can be fetched using `get_benchmark` and run in a similar fashion to other sets of tasks.
-For instance to select the 56 English datasets that form the "Overall MTEB English leaderboard":
-
-```python
-import mteb
-benchmark = mteb.get_benchmark("MTEB(eng, v1)")
-evaluation = mteb.MTEB(tasks=benchmark)
-```
-
-The benchmark specified not only a list of tasks, but also what splits and language to run on. To get an overview of all available benchmarks simply run:
-
-```python
-import mteb
-benchmarks = mteb.get_benchmarks()
-```
-
-Generally we use the naming scheme for benchmarks `MTEB(*)`, where the "*" denotes the target of the benchmark. In the case of a language, we use the three-letter language code. For large groups of languages, we use the group notation, e.g., `MTEB(Scandinavian, v1)` for Scandinavian languages. External benchmarks implemented in MTEB like `CoIR` use their original name. When using a benchmark from MTEB please cite `mteb` along with the citations of the benchmark which you can access using:
-
-```python
-benchmark.citation
-```
-
-
-
-
- Passing in `encode` arguments
-
-
-### Passing in `encode` arguments
-
-To pass in arguments to the model's `encode` function, you can use the encode keyword arguments (`encode_kwargs`):
-
-```python
-evaluation.run(model, encode_kwargs={"batch_size": 32})
-```
-
-
-
-
- Selecting evaluation split
-
-### Selecting evaluation split
-You can evaluate only on `test` splits of all tasks by doing the following:
-
-```python
-evaluation.run(model, eval_splits=["test"])
-```
-
-Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used.
-
-
-
-
-
- Selecting evaluation subset
-
-### Selecting evaluation subset
-You can evaluate only on selected subsets. For example, if you want to evaluate only the `subset_name_to_run` subset of all tasks, do the following:
-
-```python
-evaluation.run(model, eval_subsets=["subset_name_to_run"])
-```
-
-Monolingual tasks have `default` subset, other tasks have subsets that are specific to the dataset.
-
-
-
-
- Using a custom model
-
-
-### Using a custom model
-
-Models should implement the following interface, implementing an `encode` function taking as inputs a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.
-
-```python
-import mteb
-from mteb.encoder_interface import PromptType
-import numpy as np
-
-
-class CustomModel:
- def encode(
- self,
- sentences: list[str],
- task_name: str,
- prompt_type: PromptType | None = None,
- **kwargs,
- ) -> np.ndarray:
- """Encodes the given sentences using the encoder.
-
- Args:
- sentences: The sentences to encode.
- task_name: The name of the task.
- prompt_type: The prompt type to use.
- **kwargs: Additional arguments to pass to the encoder.
-
- Returns:
- The encoded sentences.
- """
- pass
-
-model = CustomModel()
-tasks = mteb.get_tasks(tasks=["Banking77Classification"])
-evaluation = mteb.MTEB(tasks=tasks)
-evaluation.run(model)
-```
-
-
-
-
- Evaluating on a custom dataset
-
-
-### Evaluating on a custom dataset
-
-To evaluate on a custom task, you can run the following code on your custom task. See [how to add a new task](docs/adding_a_dataset.md), for how to create a new task in MTEB.
-
-```python
-from mteb import MTEB
-from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
-from sentence_transformers import SentenceTransformer
-
-
-class MyCustomTask(AbsTaskReranking):
- ...
-
-model = SentenceTransformer("average_word_embeddings_komninos")
-evaluation = MTEB(tasks=[MyCustomTask()])
-evaluation.run(model)
-```
-
-
-
-
- Using a cross encoder for reranking
-
-
-### Using a cross encoder for reranking
-
-To use a cross encoder for reranking, you can directly use a CrossEncoder from SentenceTransformers. The following code shows a two-stage run with the second stage reading results saved from the first stage.
-
-```python
-from mteb import MTEB
-import mteb
-from sentence_transformers import CrossEncoder, SentenceTransformer
-
-cross_encoder = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")
-dual_encoder = SentenceTransformer("all-MiniLM-L6-v2")
-
-tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
-
-subset = "default" # subset name used in the NFCorpus dataset
-eval_splits = ["test"]
-
-evaluation = MTEB(tasks=tasks)
-evaluation.run(
- dual_encoder,
- eval_splits=eval_splits,
- save_predictions=True,
- output_folder="results/stage1",
-)
-evaluation.run(
- cross_encoder,
- eval_splits=eval_splits,
- top_k=5,
- save_predictions=True,
- output_folder="results/stage2",
- previous_results=f"results/stage1/NFCorpus_{subset}_predictions.json",
-)
-```
-
-
-
-
- Late Interaction (ColBERT)
-
-### Using Late Interaction models for retrieval
-
-```python
-from mteb import MTEB
-import mteb
-
-
-colbert = mteb.get_model("colbert-ir/colbertv2.0")
-tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
-
-eval_splits = ["test"]
-
-evaluation = MTEB(tasks=tasks)
-
-evaluation.run(
- colbert,
- eval_splits=eval_splits,
- corpus_chunk_size=500,
-)
-```
-This implementation employs the MaxSim operation to compute the similarity between sentences. While MaxSim provides high-quality results, it processes a larger number of embeddings, potentially leading to increased resource usage. To manage resource consumption, consider lowering the `corpus_chunk_size` parameter.
-
-
-
-
-
- Saving retrieval task predictions
-
-### Saving retrieval task predictions
-
-To save the predictions from a retrieval task, add the `--save_predictions` flag in the CLI or set `save_predictions=True` in the run method. The filename will be in the "{task_name}_{subset}_predictions.json" format.
-
-Python:
-```python
-from mteb import MTEB
-import mteb
-from sentence_transformers import SentenceTransformer
-
-model = SentenceTransformer("all-MiniLM-L6-v2")
-
-tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
-
-evaluation = MTEB(tasks=tasks)
-evaluation.run(
- model,
- eval_splits=["test"],
- save_predictions=True,
- output_folder="results",
-)
-```
-
-CLI:
-```bash
-mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predictions
-```
-
-
-
-
- Fetching result from the results repository
-
-### Fetching results from the results repository
-
-Multiple models have already been run on tasks available within MTEB. These results are available results [repository](https://github.com/embeddings-benchmark/results).
-
-To make the results more easily accessible, we have designed custom functionality for retrieving from the repository. For instance, if you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:
-
-```python
-import mteb
-from mteb.task_selection import results_to_dataframe
-
-tasks = mteb.get_tasks(
- task_types=["Retrieval"], languages=["eng", "fra"], domains=["Legal"]
-)
-
-model_names = [
- "GritLM/GritLM-7B",
- "intfloat/multilingual-e5-small",
- "intfloat/multilingual-e5-base",
- "intfloat/multilingual-e5-large",
-]
-models = [mteb.get_model_meta(name) for name in model_names]
-
-results = mteb.load_results(models=models, tasks=tasks)
-
-df = results_to_dataframe(results)
-```
-
-
-
-
-
- Annotate Contamination in the training data of a model
-
-### Annotate Contamination
-
-have your found contamination in the training data of a model? Please let us know, either by opening an issue or ideally by submitting a PR
-annotatig the training datasets of the model:
-
-```py
-model_w_contamination = ModelMeta(
- name = "model-with-contamination"
- ...
- training_datasets: {"ArguAna": # name of dataset within MTEB
- ["test"]} # the splits that have been trained on
- ...
-)
-```
-
-
-
-
-
- Running the leaderboard locally
-
-
-### Running the Leaderboard
-
-It is possible to completely deploy the leaderboard locally or self-host it. This can e.g. be relevant for companies that might want to
-integrate build their own benchmarks or integrate custom tasks into existing benchmarks.
-
-Running the leaderboard is quite easy. Simply run:
-```py
-python -m mteb.leaderboard.app
-```
-
-The leaderboard requires gradio install, which can be installed using `pip install mteb[gradio]` and requires python >3.10.
-
-
-
-
- Caching Embeddings To Re-Use Them
-
-
-### Caching Embeddings To Re-Use Them
-
-There are times you may want to cache the embeddings so you can re-use them. This may be true if you have multiple query sets for the same corpus (e.g. Wikipedia) or are doing some optimization over the queries (e.g. prompting, other experiments). You can setup a cache by using a simple wrapper, which will save the cache per task in the `cache_embeddings/{task_name}` folder:
-
-```python
-# define your task and model above as normal
-...
-# wrap the model with the cache wrapper
-from mteb.models.cache_wrapper import CachedEmbeddingWrapper
-model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path='path_to_cache_dir')
-# run as normal
-evaluation.run(model, ...)
-```
-
-
-
-
-
-
-
-## Documentation
-
-| Documentation | |
+The following table links to the main sections of the usage documentation.
+
+| Section | |
+| ------- |- |
+| **General** | |
+| [Evaluating a Model](docs/usage/usage.md#evaluating-a-model) | How to evaluate a model |
+| [Evaluating on different Modalities](docs/usage/usage.md#evaluating-on-different-modalities) | How to evaluate image and image-text tasks |
+| **Selecting Tasks** | |
+| [Selecting a benchmark](docs/usage/usage.md#selecting-a-benchmark) | How to select a predefined benchmark |
+| [Task selection](docs/usage/usage.md#task-selection) | How to select and filter tasks |
+| [Selecting Split and Subsets](docs/usage/usage.md#selecting-evaluation-split-or-subsets) | How to select evaluation splits or subsets |
+| [Using a Custom Task](docs/usage/usage.md#using-a-custom-task) | How to evaluate on a custom task |
+| **Selecting a Model** | |
+| [Using a Pre-defined Model](docs/usage/usage.md#using-a-pre-defined-model) | How to run a pre-defined model |
+| [Using a SentenceTransformer Model](docs/usage/usage.md#using-a-sentence-transformer-model) | How to run a model loaded using sentence-transformers |
+| [Using a Custom Model](docs/usage/usage.md#using-a-custom-model) | How to run and implement a custom model |
+| **Running Evaluation** | |
+| [Passing Arguments to the model](docs/usage/usage.md#passing-in-encode-arguments) | How to pass `encode` arguments to the model |
+| [Running Cross Encoders](docs/usage/usage.md#running-cross-encoders-on-reranking) | How to run cross encoders for reranking |
+| [Running Late Interaction (ColBERT)](docs/usage/usage.md#using-late-interaction-models) | How to run late interaction models |
+| [Saving Retrieval Predictions](docs/usage/usage.md#saving-retrieval-task-predictions) | How to save predictions for later analysis |
+| [Caching Embeddings](docs/usage/usage.md#caching-embeddings-to-re-use-them) | How to cache and re-use embeddings |
+| **Leaderboard** | |
+| [Running the Leaderboard Locally](docs/usage/usage.md#running-the-leaderboard-locally) | How to run the leaderboard locally |
+| [Report Data Contamination](docs/usage/usage.md#annotate-contamination) | How to report data contamination for a model |
+| [Fetching Result from the Leaderboard](docs/usage/usage.md#fetching-results-from-the-leaderboard) | How to fetch the raw results from the leaderboard |
+
+
+## Overview
+
+| Overview | |
|--------------------------------|-------------------------------------------------------------------------------------|
+| 📈 [Leaderboard] | The interactive leaderboard of the benchmark |
| 📋 [Tasks] | Overview of available tasks |
| 📐 [Benchmarks] | Overview of available benchmarks |
-| 📈 [Leaderboard] | The interactive leaderboard of the benchmark |
+| **Contributing** | |
| 🤖 [Adding a model] | Information related to how to submit a model to MTEB and to the leaderboard |
-| 👩🔬 [Reproducible workflows] | Information related to how to reproduce and create reproducible workflows with MTEB |
+| 👩🔬 [Reproducible workflows] | Information related to how to create reproducible workflows with MTEB |
| 👩💻 [Adding a dataset] | How to add a new task/dataset to MTEB |
| 👩💻 [Adding a benchmark] | How to add a new benchmark to MTEB and to the leaderboard |
| 🤝 [Contributing] | How to contribute to MTEB and set it up for development |
-| 🌐 [MMTEB] | An open-source effort to extend MTEB to cover a broad set of languages |
-| 🖼️ [MIEB] | Extension of MTEB to image embeddings |
[Tasks]: docs/tasks.md
[Benchmarks]: docs/benchmarks.md
@@ -525,27 +121,51 @@ evaluation.run(model, ...)
[Adding a dataset]: docs/adding_a_dataset.md
[Adding a benchmark]: docs/adding_a_benchmark.md
[Leaderboard]: https://huggingface.co/spaces/mteb/leaderboard
-[MMTEB]: docs/mmteb/readme.md
-[MIEB]: docs/mieb.md
[Reproducible workflows]: docs/reproducible_workflow.md
## Citing
-MTEB was introduced in "[MTEB: Massive Text Embedding Benchmark](https://arxiv.org/abs/2210.07316)", feel free to cite:
+MTEB was introduced in "[MTEB: Massive Text Embedding Benchmark](https://arxiv.org/abs/2210.07316)", and heavily expanded in "[MMTEB: Massive Multilingual Text Embedding Benchmark](https://arxiv.org/abs/2502.13595)". When using `mteb` we recommend that you cite both articles.
+
+
+ Bibtex Citation (click to unfold)
+
```bibtex
+@article{enevoldsen2025mmtebmassivemultilingualtext,
+ title={MMTEB: Massive Multilingual Text Embedding Benchmark},
+ author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and Márton Kardos and Ashwin Mathur and David Stap and Jay Gala and Wissam Siblini and Dominik Krzemiński and Genta Indra Winata and Saba Sturua and Saiteja Utpala and Mathieu Ciancone and Marion Schaeffer and Gabriel Sequeira and Diganta Misra and Shreeya Dhakal and Jonathan Rystrøm and Roman Solomatin and Ömer Çağatan and Akash Kundu and Martin Bernstorff and Shitao Xiao and Akshita Sukhlecha and Bhavish Pahwa and Rafał Poświata and Kranthi Kiran GV and Shawon Ashraf and Daniel Auras and Björn Plüster and Jan Philipp Harries and Loïc Magne and Isabelle Mohr and Mariya Hendriksen and Dawei Zhu and Hippolyte Gisserot-Boukhlef and Tom Aarsen and Jan Kostkan and Konrad Wojtasik and Taemin Lee and Marek Šuppa and Crystina Zhang and Roberta Rocca and Mohammed Hamdy and Andrianos Michail and John Yang and Manuel Faysse and Aleksei Vatolin and Nandan Thakur and Manan Dey and Dipam Vasani and Pranjal Chitale and Simone Tedeschi and Nguyen Tai and Artem Snegirev and Michael Günther and Mengzhou Xia and Weijia Shi and Xing Han Lù and Jordan Clive and Gayatri Krishnakumar and Anna Maksimova and Silvan Wehrli and Maria Tikhonova and Henil Panchal and Aleksandr Abramov and Malte Ostendorff and Zheng Liu and Simon Clematide and Lester James Miranda and Alena Fenogenova and Guangyu Song and Ruqiya Bin Safi and Wen-Ding Li and Alessia Borghini and Federico Cassano and Hongjin Su and Jimmy Lin and Howard Yen and Lasse Hansen and Sara Hooker and Chenghao Xiao and Vaibhav Adlakha and Orion Weller and Siva Reddy and Niklas Muennighoff},
+ publisher = {arXiv},
+ journal={arXiv preprint arXiv:2502.13595},
+ year={2025},
+ url={https://arxiv.org/abs/2502.13595},
+ doi = {10.48550/arXiv.2502.13595},
+}
+
@article{muennighoff2022mteb,
- doi = {10.48550/ARXIV.2210.07316},
- url = {https://arxiv.org/abs/2210.07316},
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
publisher = {arXiv},
journal={arXiv preprint arXiv:2210.07316},
year = {2022}
+ url = {https://arxiv.org/abs/2210.07316},
+ doi = {10.48550/ARXIV.2210.07316},
}
```
+
+
+
+If you use any of the specific benchmarks, we also recommend that you cite their authors.
+
+```py
+benchmark = mteb.get_benchmark("MTEB(eng, v2)")
+benchmark.citation # get the citation for a specific benchmark
+
+# you can also create a table of the tasks for an appendix using:
+benchmark.tasks.to_latex()
+```
-You may also want to read and cite the amazing work that has extended MTEB & integrated new datasets:
+Publications that have extended MTEB and integrated new datasets include:
- Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff. "[C-Pack: Packaged Resources To Advance General Chinese Embedding](https://arxiv.org/abs/2309.07597)" arXiv 2023
- Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, Han Xiao. "[Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents](https://arxiv.org/abs/2310.19923)" arXiv 2023
- Silvan Wehrli, Bert Arnrich, Christopher Irrgang. "[German Text Embedding Clustering Benchmark](https://arxiv.org/abs/2401.02709)" arXiv 2024
@@ -553,5 +173,3 @@ You may also want to read and cite the amazing work that has extended MTEB & int
- Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li. "[LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096)" arXiv 2024
- Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, Kristoffer Laigaard Nielbo. "[The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding](https://arxiv.org/abs/2406.02396)" arXiv 2024
- Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nick Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee. "[ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain](https://arxiv.org/abs/2412.00532)" arXiv 2024
-
-For works that have used MTEB for benchmarking, you can find them on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
diff --git a/docs/mieb.md b/docs/mieb/readme.md
similarity index 79%
rename from docs/mieb.md
rename to docs/mieb/readme.md
index cf51d9f5c9..af23c8573e 100644
--- a/docs/mieb.md
+++ b/docs/mieb/readme.md
@@ -1,3 +1,6 @@
+**NOTE**: This collaboration has been finalized and the paper will be released soon. This document remains for reference.
+
+
# Welcome to MIEB! 👋
The Massive Image Embedding Benchmark (MIEB) is an image extension of [MTEB](https://arxiv.org/abs/2210.07316) to cover embedding tasks for image-text tasks.
@@ -76,41 +79,4 @@ evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model)
```
-By default, results will be under `results/laion__CLIP-ViT-L-14-laion2B-s32B-b82K/REVISION/CIFAR10ZeroShot.json`. Sometimes metrics can be a bit different than what the original paper claimed. This might be due to the resolution/layout difference of images in the remake of the dataset.
-
-
-## Specific Model running Instructions
-
-Some models require some specific steps before running. Those are collected here.
-
-
- Vista
-
- ## set up VISTA
-
- ```
- git clone https://github.com/FlagOpen/FlagEmbedding.git
- cd FlagEmbedding/research/visual_bge
- pip install -e .
- pip install torchvision timm einops ftfy
- ```
- back to the root folder of mteb; download the vision tower for bge-base
- ```
- cd ..
- wget https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth?download=true
- ```
- rename it to `visualized_base_en_V1.5.pth`
- ```
- mv Visualized_base_en_v1.5.pth?download=true visualized_base_en_V1.5.pth
- ```
- download the vision tower for bge-m3
- ```
- wget https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_m3.pth?download=true
- ```
- rename it to `visualized_m3.pth`
- ```
- mv Visualized_m3.pth?download=true visualized_m3.pth
- ```
-
-
-
+By default, results will be under `results/laion__CLIP-ViT-L-14-laion2B-s32B-b82K/REVISION/CIFAR10ZeroShot.json`. Sometimes metrics can differ slightly from what the original paper reported; this might be due to resolution/layout differences of images in the remake of the dataset.
diff --git a/docs/mmteb/readme.md b/docs/mmteb/readme.md
index 56b7a0bef4..ef5768b0be 100644
--- a/docs/mmteb/readme.md
+++ b/docs/mmteb/readme.md
@@ -1,3 +1,6 @@
+
+**NOTE**: This open collaboration has been finalized and the [paper](https://arxiv.org/abs/2502.13595) released. This document remains for reference.
+
# Welcome to MMTEB! 👋
The Massive Multilingual Text Embedding Benchmark (MMTEB) is a community-led extension of [MTEB](https://arxiv.org/abs/2210.07316) to cover embedding tasks for a massive number of languages.
diff --git a/docs/usage/usage.md b/docs/usage/usage.md
new file mode 100644
index 0000000000..cba88f21ea
--- /dev/null
+++ b/docs/usage/usage.md
@@ -0,0 +1,510 @@
+# Usage
+
+This usage documentation first introduces a simple example of how to evaluate a model in MTEB.
+It then covers, in more detail, how to define a model, select tasks, and run the evaluation, with each section containing the relevant subsections.
+
+
+## Evaluating a Model
+
+Evaluating a model on MTEB follows a three-step approach: 1) defining the model, 2) selecting the tasks, and 3) running the evaluation:
+
+```python
+import mteb
+
+# Specify the model that we want to evaluate
+model = ...
+
+# specify what you want to evaluate it on
+tasks = mteb.get_tasks(tasks=["{task1}", "{task2}"])
+
+# run the evaluation
+evaluation = mteb.MTEB(tasks=tasks)
+results = evaluation.run(model)
+```
+
+For instance if we want to run [`"sentence-transformers/all-MiniLM-L6-v2"`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) on
+`"Banking77Classification"` we can do this using the following code:
+
+```python
+import mteb
+from sentence_transformers import SentenceTransformer
+
+model_name = "sentence-transformers/all-MiniLM-L6-v2"
+
+# load the model using MTEB
+model = mteb.get_model(model_name)  # defaults to SentenceTransformer(model_name) if the model is not implemented in MTEB
+# or load it directly using sentence-transformers
+model = SentenceTransformer(model_name)
+
+# select the desired tasks and evaluate
+tasks = mteb.get_tasks(tasks=["Banking77Classification"])
+evaluation = mteb.MTEB(tasks=tasks)
+results = evaluation.run(model)
+```
+
+
+### Evaluating on Different Modalities
+MTEB not only evaluates text embeddings, but also allows you to evaluate image and image-text embeddings.
+
+> [!NOTE]
+> Running MTEB on images requires you to install the optional dependencies using `pip install mteb[image]`
+
+To evaluate image embeddings you can follow the same approach as for any other task in `mteb`; simply ensure that the task contains the "image" modality:
+
+```python
+tasks = mteb.get_tasks(modalities=["image"]) # only select tasks with the image modality
+task = tasks[0]
+
+print(task.metadata.modalities)
+# ['text', 'image']
+```
+
+However, we recommend starting with one of the predefined benchmarks:
+
+```python
+import mteb
+benchmark = mteb.get_benchmark("MIEB(eng)")
+evaluation = mteb.MTEB(tasks=benchmark)
+
+model = mteb.get_model("{model-of-choice}")
+evaluation.run(model)
+```
+
+You can also specify exclusive modality filtering to only get tasks with exactly the requested modalities (default behavior with `exclusive_modality_filter=False`):
+```python
+# Get tasks with image modality, this will also include tasks having both text and image modalities
+tasks = mteb.get_tasks(modalities=["image"], exclusive_modality_filter=False)
+
+# Get tasks that have ONLY image modality
+tasks = mteb.get_tasks(modalities=["image"], exclusive_modality_filter=True)
+```
+
+
+
+
+
+
+## Defining a Model
+
+### Using a pre-defined Model
+
+MTEB comes with an implementation of many popular models and APIs. These can be loaded using `mteb.get_model_meta`:
+
+```python
+model_name = "intfloat/multilingual-e5-small"
+meta = mteb.get_model_meta(model_name)
+model = meta.load_model()
+# or directly using
+model = mteb.get_model(model_name)
+```
+
+You can get an overview of the models available in `mteb` as follows:
+
+```py
+model_metas = mteb.get_model_metas()
+
+# You can e.g. use the model metas to find all openai models
+openai_models = [meta for meta in model_metas if "openai" in meta.name]
+```
+> [!TIP]
+> Some models require additional dependencies to run in MTEB, e.g. models relying on the OpenAI API.
+> These dependencies can be installed using `pip install mteb[openai]`
+
+### Using a Sentence Transformer Model
+
+MTEB is made to be compatible with sentence transformers, and thus you can readily evaluate any model that can be loaded via sentence transformers
+on `MTEB`:
+
+```python
+import mteb
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("sentence-transformers/LaBSE")
+
+# select the desired tasks and evaluate
+tasks = mteb.get_tasks(tasks=["Banking77Classification"])
+evaluation = mteb.MTEB(tasks=tasks)
+results = evaluation.run(model)
+```
+
+However, we do recommend checking whether mteb includes an implementation of the model before using sentence transformers directly, since some models (e.g. the [multilingual e5 models](https://huggingface.co/collections/intfloat/multilingual-e5-text-embeddings-67b2b8bb9bff40dec9fb3534)) require a prompt, and not specifying it may reduce performance.
+
+> [!NOTE]
+> If you want to evaluate a cross encoder on a reranking task, see section on [running cross encoders for reranking](#running-cross-encoders-on-reranking)
+
+### Using a Custom Model
+
+It is also possible to implement your own custom model in MTEB as long as it adheres to the [encoder interface](https://github.com/embeddings-benchmark/mteb/blob/main/mteb/encoder_interface.py#L21).
+
+This entails implementing an `encode` function taking as inputs a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.).
+
+```python
+import mteb
+from mteb.encoder_interface import PromptType
+import numpy as np
+
+
+class CustomModel:
+ def encode(
+ self,
+ sentences: list[str],
+ task_name: str,
+ prompt_type: PromptType | None = None,
+ **kwargs,
+ ) -> np.ndarray:
+ """Encodes the given sentences using the encoder.
+
+ Args:
+ sentences: The sentences to encode.
+ task_name: The name of the task.
+ prompt_type: The prompt type to use.
+ **kwargs: Additional arguments to pass to the encoder.
+
+ Returns:
+ The encoded sentences.
+ """
+ pass
+
+
+# evaluating the model:
+model = CustomModel()
+tasks = mteb.get_tasks(tasks=["Banking77Classification"])
+evaluation = mteb.MTEB(tasks=tasks)
+evaluation.run(model)
+```
+
+If you want to submit your implementation to be included in the leaderboard, see the section on [submitting a model](https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md).
+
+## Selecting Tasks
+
+This section describes how to select the benchmarks and tasks to evaluate, including selecting specific subsets or splits to run.
+
+### Selecting a Benchmark
+
+`mteb` comes with a set of predefined benchmarks. These can be fetched using `mteb.get_benchmark` and run in the same fashion as other sets of tasks.
+For instance, to select the 56 English datasets that form the English leaderboard:
+
+```python
+import mteb
+benchmark = mteb.get_benchmark("MTEB(eng, v2)")
+evaluation = mteb.MTEB(tasks=benchmark)
+```
+
+A benchmark specifies not only a list of tasks, but also which splits and languages to run on.
+
+To get an overview of all available benchmarks simply run:
+
+```python
+import mteb
+benchmarks = mteb.get_benchmarks()
+```
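+
+Each returned `Benchmark` carries, among other things, its name and tasks; for example, to print the names of all available benchmarks (a small sketch):
+
+```python
+import mteb
+
+# list the names of all registered benchmarks
+for benchmark in mteb.get_benchmarks():
+    print(benchmark.name)
+```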
+
+> [!NOTE]
+> Generally we use the naming scheme for benchmarks `MTEB(*)`, where the "*" denotes the target of the benchmark.
+> In the case of a language, we use the three-letter language code.
+> For large groups of languages, we use the group notation, e.g., `MTEB(Scandinavian, v1)` for Scandinavian languages.
+> External benchmarks implemented in MTEB like `CoIR` use their original name.
+
+When using a benchmark from MTEB, please cite `mteb` along with the citation of the benchmark, which you can access using:
+
+```python
+benchmark.citation
+```
+
+### Task selection
+
+`mteb` comes with the utility functions `mteb.get_task` and `mteb.get_tasks` for fetching and analysing the tasks of interest.
+
+This can be done in multiple ways, e.g.:
+
+* by the task name
+* by their type (e.g. "Clustering" or "Classification")
+* by their languages
+* by their domains
+* by their modalities
+* and many more
+
+```python
+import mteb
+
+# by name
+tasks = mteb.get_tasks(tasks=["Banking77Classification"])
+# by type
+tasks = mteb.get_tasks(task_types=["Clustering", "Retrieval"]) # Only select clustering and retrieval tasks
+# by language
+tasks = mteb.get_tasks(languages=["eng", "deu"]) # Only select datasets which contain "eng" or "deu" (iso 639-3 codes)
+# by domain
+tasks = mteb.get_tasks(domains=["Legal"])
+# by modality
+tasks = mteb.get_tasks(modalities=["text", "image"]) # Only select tasks with text or image modalities
+# or using multiple
+tasks = mteb.get_tasks(languages=["eng", "deu"], script=["Latn"], domains=["Legal"])
+```
+
+For more information, see the documentation for `mteb.get_tasks`.
+
+You can also specify which languages to load for multilingual/cross-lingual tasks, as shown below:
+
+```python
+import mteb
+
+tasks = [
+ mteb.get_task("AmazonReviewsClassification", languages = ["eng", "fra"]),
+ mteb.get_task("BUCCBitextMining", languages = ["deu"]), # all subsets containing "deu"
+]
+```
+
+### Selecting Evaluation Split or Subsets
+
+A task in `mteb` mirrors the structure of a dataset on Huggingface. It includes splits (e.g. "test") and subsets.
+
+```python
+# selecting an evaluation split
+task = mteb.get_task("Banking77Classification", eval_splits=["test"])
+# selecting a Huggingface subset
+task = mteb.get_task("AmazonReviewsClassification", hf_subsets=["en", "fr"])
+```
+
+> [!NOTE]
+> **What is a subset?** A subset on a Huggingface dataset is what you specify after the dataset name, e.g. `datasets.load_dataset("nyu-mll/glue", "cola")`.
+> Often the subset does not need to be defined and is left as "default". The subset is however useful, especially for multilingual datasets to specify the
+> desired language or language pair e.g. in [`mteb/bucc-bitext-mining`](https://huggingface.co/datasets/mteb/bucc-bitext-mining) we might want to evaluate only on the French-English subset `"fr-en"`.
+
+
+### Using a Custom Task
+
+To evaluate on a custom task, you can run the following code.
+See [how to add a new task](https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md) for how to create a new task in MTEB.
+
+
+```python
+import mteb
+from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
+
+
+class MyCustomTask(AbsTaskReranking):
+ ...
+
+model = mteb.get_model(...)
+evaluation = mteb.MTEB(tasks=[MyCustomTask()])
+evaluation.run(model)
+```
+
+
+## Running the Evaluation
+
+This section contains documentation related to the runtime of the evaluation: how to pass arguments to the encoder, how to save outputs, and similar.
+
+
+### Introduction to the runner
+
+By default `mteb` will save the results in the `results/{model_name}` folder; if you want to save the results in a specific folder, you
+can specify it as follows:
+
+```python
+evaluation = mteb.MTEB(tasks=tasks)
+results = evaluation.run(model, output_folder="my_results_folder")
+```
+
+### Tracking Carbon Emissions
+
+`mteb` allows for easy tracking of CO₂-equivalent carbon emissions using `codecarbon`. You simply need to install `mteb[codecarbon]` and enable CO₂ tracking:
+
+```python
+evaluation = mteb.MTEB(tasks=tasks)
+results = evaluation.run(model, co2_tracker=True)
+```
+
+
+### Passing in `encode` arguments
+
+To pass in arguments to the model's `encode` function, you can use the encode keyword arguments (`encode_kwargs`):
+
+```python
+evaluation.run(model, encode_kwargs={"batch_size": 32})
+```
+
+### Running SentenceTransformer model with prompts
+
+Prompts can be passed to the SentenceTransformer model using the `prompts` parameter. The following code shows how to use prompts with SentenceTransformer:
+
+```python
+import mteb
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("average_word_embeddings_komninos", prompts={"query": "Query:", "passage": "Passage:"})
+evaluation = mteb.MTEB(tasks=tasks)
+```
+
+In `prompts`, the key can be:
+1. Prompt types (`passage`, `query`) - they will be used in reranking and retrieval tasks
+2. Task type - these prompts will be used in all tasks of the given type
+ 1. `BitextMining`
+ 2. `Classification`
+ 3. `MultilabelClassification`
+ 4. `Clustering`
+ 5. `PairClassification`
+ 6. `Reranking`
+ 7. `Retrieval`
+ 8. `STS`
+ 9. `Summarization`
+ 10. `InstructionRetrieval`
+3. Pair of task type and prompt type like `Retrieval-query` - these prompts will be used in all retrieval tasks
+4. Task name - these prompts will be used in the specific task
+5. Pair of task name and prompt type like `NFCorpus-query` - these prompts will be used in the specific task
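+
+As an illustration, a hypothetical `prompts` mapping mixing several of the key kinds above might look as follows (the prompt strings themselves are made up for the example):
+
+```python
+# keys illustrate the kinds listed above; the prompt strings are illustrative only
+prompts = {
+    "query": "Query: ",                   # 1. prompt type
+    "Classification": "Classify: ",       # 2. task type
+    "Retrieval-query": "Search query: ",  # 3. task type + prompt type
+    "NFCorpus-query": "Medical query: ",  # 5. task name + prompt type
+}
+# then pass it to the model:
+# model = SentenceTransformer("average_word_embeddings_komninos", prompts=prompts)
+```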
+
+
+### Running Cross Encoders on Reranking
+
+To use a cross encoder for reranking, you can directly use a CrossEncoder from SentenceTransformers. The following code shows a two-stage run with the second stage reading results saved from the first stage.
+
+```python
+from mteb import MTEB
+import mteb
+from sentence_transformers import CrossEncoder, SentenceTransformer
+
+cross_encoder = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")
+dual_encoder = SentenceTransformer("all-MiniLM-L6-v2")
+
+tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
+
+subset = "default" # subset name used in the NFCorpus dataset
+eval_splits = ["test"]
+
+evaluation = MTEB(tasks=tasks)
+evaluation.run(
+ dual_encoder,
+ eval_splits=eval_splits,
+ save_predictions=True,
+ output_folder="results/stage1",
+)
+evaluation.run(
+ cross_encoder,
+ eval_splits=eval_splits,
+ top_k=5,
+ save_predictions=True,
+ output_folder="results/stage2",
+ previous_results=f"results/stage1/NFCorpus_{subset}_predictions.json",
+)
+```
+
+
+### Using Late Interaction Models
+
+This section outlines how to use late interaction models for retrieval.
+
+```python
+from mteb import MTEB
+import mteb
+
+
+colbert = mteb.get_model("colbert-ir/colbertv2.0")
+tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
+
+eval_splits = ["test"]
+
+evaluation = MTEB(tasks=tasks)
+
+evaluation.run(
+ colbert,
+ eval_splits=eval_splits,
+ corpus_chunk_size=500,
+)
+```
+This implementation employs the MaxSim operation to compute the similarity between sentences. While MaxSim provides high-quality results, it processes a larger number of embeddings, potentially leading to increased resource usage. To manage resource consumption, consider lowering the `corpus_chunk_size` parameter.
+
+
+### Saving retrieval task predictions
+
+To save the predictions from a retrieval task, add the `--save_predictions` flag in the CLI or set `save_predictions=True` in the run method. The filename will be in the `{task_name}_{subset}_predictions.json` format.
+
+Python:
+```python
+from mteb import MTEB
+import mteb
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("all-MiniLM-L6-v2")
+
+tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
+
+evaluation = MTEB(tasks=tasks)
+evaluation.run(
+ model,
+ eval_splits=["test"],
+ save_predictions=True,
+ output_folder="results",
+)
+```
+
+CLI:
+```bash
+mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predictions
+```
+
+### Caching Embeddings To Re-Use Them
+
+There are times you may want to cache the embeddings so you can re-use them. This may be true if you have multiple query sets for the same corpus (e.g. Wikipedia) or are doing some optimization over the queries (e.g. prompting, other experiments). You can set up a cache by using a simple wrapper, which will save the cache per task in the `cache_embeddings/{task_name}` folder:
+
+```python
+# define your task and model above as normal
+...
+# wrap the model with the cache wrapper
+from mteb.models.cache_wrapper import CachedEmbeddingWrapper
+model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path='path_to_cache_dir')
+# run as normal
+evaluation.run(model_with_cached_emb, ...)
+```
+
+## Leaderboard
+
+This section contains information on how to interact with the leaderboard, including running it locally, analysing the results, annotating contamination, and more.
+
+### Fetching results from the Leaderboard
+
+Multiple models have already been run on tasks available within MTEB. These results are available in the results [repository](https://github.com/embeddings-benchmark/results).
+
+To make the results more easily accessible, we have designed custom functionality for retrieving them from the repository. For instance, if you are selecting the best model for your French and English retrieval task on legal documents, you could fetch the relevant tasks and create a dataframe of the results using the following code:
+
+```python
+import mteb
+from mteb.task_selection import results_to_dataframe
+
+tasks = mteb.get_tasks(
+ task_types=["Retrieval"], languages=["eng", "fra"], domains=["Legal"]
+)
+
+model_names = [
+ "GritLM/GritLM-7B",
+ "intfloat/multilingual-e5-small",
+ "intfloat/multilingual-e5-base",
+ "intfloat/multilingual-e5-large",
+]
+models = [mteb.get_model_meta(name) for name in model_names]
+
+results = mteb.load_results(models=models, tasks=tasks)
+
+df = results_to_dataframe(results)
+```
+
+### Annotate Contamination
+
+Have you found contamination in the training data of a model? Please let us know, either by opening an issue or, ideally, by submitting a PR
+annotating the training datasets of the model:
+
+```py
+model_w_contamination = ModelMeta(
+    name="model-with-contamination",
+    ...
+    training_datasets={"ArguAna": ["test"]},  # name of the dataset within MTEB, mapped to the splits trained on
+    ...
+)
+```
+
+
+### Running the Leaderboard Locally
+
+It is possible to deploy the leaderboard entirely locally or to self-host it. This can, e.g., be relevant for companies that might want to
+build their own benchmarks or integrate custom tasks into existing benchmarks.
+
+Running the leaderboard is quite easy. Simply run:
+```py
+python -m mteb.leaderboard.app
+```
+
+The leaderboard requires gradio, which can be installed using `pip install mteb[gradio]`, and requires Python >3.10.
diff --git a/mteb/benchmarks/benchmarks.py b/mteb/benchmarks/benchmarks.py
index c350684561..56fde5afde 100644
--- a/mteb/benchmarks/benchmarks.py
+++ b/mteb/benchmarks/benchmarks.py
@@ -16,6 +16,18 @@
] # Allows the type to be a string, but ensures that the string is a URL
+MMTEB_CITATION = """
+@article{enevoldsen2025mmtebmassivemultilingualtext,
+ title={MMTEB: Massive Multilingual Text Embedding Benchmark},
+ author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and Márton Kardos and Ashwin Mathur and David Stap and Jay Gala and Wissam Siblini and Dominik Krzemiński and Genta Indra Winata and Saba Sturua and Saiteja Utpala and Mathieu Ciancone and Marion Schaeffer and Gabriel Sequeira and Diganta Misra and Shreeya Dhakal and Jonathan Rystrøm and Roman Solomatin and Ömer Çağatan and Akash Kundu and Martin Bernstorff and Shitao Xiao and Akshita Sukhlecha and Bhavish Pahwa and Rafał Poświata and Kranthi Kiran GV and Shawon Ashraf and Daniel Auras and Björn Plüster and Jan Philipp Harries and Loïc Magne and Isabelle Mohr and Mariya Hendriksen and Dawei Zhu and Hippolyte Gisserot-Boukhlef and Tom Aarsen and Jan Kostkan and Konrad Wojtasik and Taemin Lee and Marek Šuppa and Crystina Zhang and Roberta Rocca and Mohammed Hamdy and Andrianos Michail and John Yang and Manuel Faysse and Aleksei Vatolin and Nandan Thakur and Manan Dey and Dipam Vasani and Pranjal Chitale and Simone Tedeschi and Nguyen Tai and Artem Snegirev and Michael Günther and Mengzhou Xia and Weijia Shi and Xing Han Lù and Jordan Clive and Gayatri Krishnakumar and Anna Maksimova and Silvan Wehrli and Maria Tikhonova and Henil Panchal and Aleksandr Abramov and Malte Ostendorff and Zheng Liu and Simon Clematide and Lester James Miranda and Alena Fenogenova and Guangyu Song and Ruqiya Bin Safi and Wen-Ding Li and Alessia Borghini and Federico Cassano and Hongjin Su and Jimmy Lin and Howard Yen and Lasse Hansen and Sara Hooker and Chenghao Xiao and Vaibhav Adlakha and Orion Weller and Siva Reddy and Niklas Muennighoff},
+ publisher = {arXiv},
+ journal={arXiv preprint arXiv:2502.13595},
+ year={2025},
+ url={https://arxiv.org/abs/2502.13595},
+ doi = {10.48550/arXiv.2502.13595},
+}
+"""
+
MTEB_EN = Benchmark(
name="MTEB(eng, v2)",
tasks=MTEBTasks(
@@ -81,7 +93,7 @@
The original MTEB leaderboard is available under the [MTEB(eng, v1)](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28eng%2C+v1%29) tab.
""",
- citation="",
+ citation=MMTEB_CITATION,
contacts=["KennethEnevoldsen", "Muennighoff"],
)
@@ -646,7 +658,7 @@
),
description="A massive code embedding benchmark covering retrieval tasks in a miriad of popular programming languages.",
reference=None,
- citation=None,
+ citation=MMTEB_CITATION,
)
MTEB_multilingual = Benchmark(
@@ -789,7 +801,7 @@
),
description="A large-scale multilingual expansion of MTEB, driven mainly by highly-curated community contributions covering 250+ languages.",
reference=None,
- citation=None,
+ citation=MMTEB_CITATION,
contacts=["KennethEnevoldsen", "isaac-chung"],
)
@@ -904,7 +916,7 @@
),
description="A regional geopolitical text embedding benchmark targetting embedding performance on Indic languages.",
reference=None,
- citation=None,
+ citation=MMTEB_CITATION,
contacts=["KennethEnevoldsen", "isaac-chung"],
)
@@ -1036,7 +1048,7 @@
),
description="A regional geopolitical text embedding benchmark targetting embedding performance on European languages.",
reference=None,
- citation=None,
+ citation=MMTEB_CITATION,
contacts=["KennethEnevoldsen", "isaac-chung"],
)
@@ -1084,7 +1096,6 @@
}""",
)
-
BRIGHT_LONG = Benchmark(
name="BRIGHT (long)",
tasks=MTEBTasks(
diff --git a/mteb/evaluation/evaluators/Image/Any2AnyMultiChoiceEvaluator.py b/mteb/evaluation/evaluators/Image/Any2AnyMultiChoiceEvaluator.py
index c69f0153a2..dd41e5f23c 100644
--- a/mteb/evaluation/evaluators/Image/Any2AnyMultiChoiceEvaluator.py
+++ b/mteb/evaluation/evaluators/Image/Any2AnyMultiChoiceEvaluator.py
@@ -15,9 +15,9 @@
from datasets import Dataset
from PIL import Image
from torch.utils.data import DataLoader
-from torchvision import transforms
from mteb.encoder_interface import Encoder
+from mteb.requires_package import requires_image_dependencies
from ..Evaluator import Evaluator
from ..utils import (
@@ -36,7 +36,12 @@
logger = logging.getLogger(__name__)
-transform = transforms.Compose([transforms.PILToTensor()])
+
+def get_default_transform():
+ requires_image_dependencies()
+ from torchvision import transforms
+
+ return transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
class ImageDataset(torch.utils.data.Dataset):
@@ -121,6 +126,8 @@ def search(
q_modality = queries[0]["modality"]
+ default_transform = get_default_transform()
+
if q_modality == "text":
query_texts = queries["text"]
query_embeddings = self.model.get_text_embeddings(
@@ -130,7 +137,7 @@ def search(
)
else:
queries_dataset = ImageDataset(
- queries, image_column_name="image", transform=transform
+ queries, image_column_name="image", transform=default_transform
)
query_image_dataloader = DataLoader(
queries_dataset,
@@ -182,7 +189,7 @@ def search(
)
else:
corpus_dataset = ImageDataset(
- chunk, image_column_name="image", transform=transform
+ chunk, image_column_name="image", transform=default_transform
)
corpus_image_dataloader = DataLoader(
corpus_dataset,
diff --git a/mteb/evaluation/evaluators/Image/Any2AnyRetrievalEvaluator.py b/mteb/evaluation/evaluators/Image/Any2AnyRetrievalEvaluator.py
index 777e3b545f..74a41fb1a3 100644
--- a/mteb/evaluation/evaluators/Image/Any2AnyRetrievalEvaluator.py
+++ b/mteb/evaluation/evaluators/Image/Any2AnyRetrievalEvaluator.py
@@ -15,9 +15,9 @@
from datasets import Dataset
from PIL import Image
from torch.utils.data import DataLoader
-from torchvision import transforms
from mteb.encoder_interface import Encoder, PromptType
+from mteb.requires_package import requires_image_dependencies
from ..Evaluator import Evaluator
from ..utils import (
@@ -36,7 +36,12 @@
logger = logging.getLogger(__name__)
-DEFAULT_TRANSFORM = transforms.Compose([transforms.PILToTensor()])
+
+def get_default_transform():
+ requires_image_dependencies()
+ from torchvision import transforms
+
+ return transforms.Compose([transforms.PILToTensor()])
class ImageDataset(torch.utils.data.Dataset):
@@ -74,13 +79,14 @@ def __init__(
encode_kwargs: dict[str, Any] = {},
corpus_chunk_size: int = 20000,
previous_results: str | None = None,
- transform=DEFAULT_TRANSFORM,
+ transform=None,
**kwargs: Any,
):
# Model is class that provides get_text_embeddings() and get_image_embeddings()
self.model = model
self.encode_kwargs = encode_kwargs
- self.transform = transform
+ if transform is None:
+ self.transform = get_default_transform()
if "batch_size" not in encode_kwargs:
encode_kwargs["batch_size"] = 128
diff --git a/mteb/evaluation/evaluators/Image/Any2TextMultipleChoiceEvaluator.py b/mteb/evaluation/evaluators/Image/Any2TextMultipleChoiceEvaluator.py
index a93714e770..3aaa4a14ff 100644
--- a/mteb/evaluation/evaluators/Image/Any2TextMultipleChoiceEvaluator.py
+++ b/mteb/evaluation/evaluators/Image/Any2TextMultipleChoiceEvaluator.py
@@ -7,7 +7,6 @@
import torch
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity
-from torchvision import transforms
from tqdm import tqdm
from mteb.encoder_interface import Encoder, EncoderWithSimilarity
@@ -15,8 +14,6 @@
logger = logging.getLogger(__name__)
-transform = transforms.Compose([transforms.PILToTensor()])
-
class Any2TextMultipleChoiceEvaluator(Evaluator):
"""Evaluate a model based on the similarity of queries (can be interleaved) and candidate answers.
@@ -38,7 +35,6 @@ def __init__(
label_column_name: str,
choices_column_name: str,
task_name: str | None = None,
- transform=None,
limit: int | None = None,
**kwargs,
):
@@ -51,7 +47,6 @@ def __init__(
self.label_column_name = label_column_name
self.choices_column_name = choices_column_name
self.task_name = task_name
- self.transform = transform
def __call__(
self,
diff --git a/mteb/evaluation/evaluators/Image/ClassificationEvaluator.py b/mteb/evaluation/evaluators/Image/ClassificationEvaluator.py
index a0d84d5714..a3f7c022fa 100644
--- a/mteb/evaluation/evaluators/Image/ClassificationEvaluator.py
+++ b/mteb/evaluation/evaluators/Image/ClassificationEvaluator.py
@@ -16,9 +16,9 @@
from sklearn.neighbors import KNeighborsClassifier
from torch import Tensor
from torch.utils.data import DataLoader
-from torchvision import transforms
from mteb.encoder_interface import Encoder
+from mteb.requires_package import requires_image_dependencies
from ..Evaluator import Evaluator
@@ -29,7 +29,11 @@ def dot_distance(a: np.ndarray, b: np.ndarray) -> float:
return -np.dot(a, b)
-transform = transforms.Compose([transforms.PILToTensor()])
+def get_default_transform():
+ requires_image_dependencies()
+ from torchvision import transforms
+
+ return transforms.Compose([transforms.PILToTensor()])
class ImageDataset(torch.utils.data.Dataset):
@@ -71,13 +75,18 @@ def __init__(
if limit is not None:
dataset_train = dataset_train.select(list(range(limit)))
+ default_transform = get_default_transform()
self.dataset_train = ImageDataset(
- dataset_train, image_column_name=image_column_name, transform=transform
+ dataset_train,
+ image_column_name=image_column_name,
+ transform=default_transform,
)
self.y_train = dataset_train[label_column_name]
self.dataset_test = ImageDataset(
- dataset_test, image_column_name=image_column_name, transform=transform
+ dataset_test,
+ image_column_name=image_column_name,
+ transform=default_transform,
)
self.y_test = dataset_test[label_column_name]
self.task_name = task_name
@@ -155,13 +164,18 @@ def __init__(
if limit is not None:
dataset_train = dataset_train.select(list(range(limit)))
+ default_transform = get_default_transform()
self.dataset_train = ImageDataset(
- dataset_train, image_column_name=image_column_name, transform=transform
+ dataset_train,
+ image_column_name=image_column_name,
+ transform=default_transform,
)
self.y_train = dataset_train[label_column_name]
self.dataset_test = ImageDataset(
- dataset_test, image_column_name=image_column_name, transform=transform
+ dataset_test,
+ image_column_name=image_column_name,
+ transform=default_transform,
)
self.y_test = dataset_test[label_column_name]
self.task_name = task_name
@@ -322,12 +336,17 @@ def __init__(
if limit is not None:
dataset_train = dataset_train.select(list(range(limit)))
+ default_transform = get_default_transform()
self.dataset_train = ImageDataset(
- dataset_train, image_column_name=image_column_name, transform=transform
+ dataset_train,
+ image_column_name=image_column_name,
+ transform=default_transform,
)
self.y_train = dataset_train[label_column_name]
self.dataset_test = ImageDataset(
- dataset_test, image_column_name=image_column_name, transform=transform
+ dataset_test,
+ image_column_name=image_column_name,
+ transform=default_transform,
)
self.y_test = dataset_test[label_column_name]
diff --git a/mteb/evaluation/evaluators/Image/ImageTextPairClassificationEvaluator.py b/mteb/evaluation/evaluators/Image/ImageTextPairClassificationEvaluator.py
index f3188f7753..72cefbcfde 100644
--- a/mteb/evaluation/evaluators/Image/ImageTextPairClassificationEvaluator.py
+++ b/mteb/evaluation/evaluators/Image/ImageTextPairClassificationEvaluator.py
@@ -8,15 +8,12 @@
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
-from torchvision import transforms
from mteb.encoder_interface import Encoder, EncoderWithSimilarity
from mteb.evaluation.evaluators.Evaluator import Evaluator
logger = logging.getLogger(__name__)
-transform = transforms.Compose([transforms.PILToTensor()])
-
class ImageTextDataset(torch.utils.data.Dataset):
def __init__(
diff --git a/mteb/evaluation/evaluators/Image/VisualSTSEvaluator.py b/mteb/evaluation/evaluators/Image/VisualSTSEvaluator.py
index a042d22f5a..69d203711f 100644
--- a/mteb/evaluation/evaluators/Image/VisualSTSEvaluator.py
+++ b/mteb/evaluation/evaluators/Image/VisualSTSEvaluator.py
@@ -14,13 +14,19 @@
paired_manhattan_distances,
)
from torch.utils.data import DataLoader
-from torchvision import transforms
+
+from mteb.requires_package import requires_image_dependencies
from ..Evaluator import Evaluator
logger = logging.getLogger(__name__)
-transform = transforms.Compose([transforms.PILToTensor()])
+
+def get_default_transform():
+ requires_image_dependencies()
+ from torchvision import transforms
+
+ return transforms.Compose([transforms.PILToTensor()])
class ImageDataset(torch.utils.data.Dataset):
@@ -54,11 +60,17 @@ def __init__(
**kwargs,
):
super().__init__(**kwargs)
+
+ default_transform = get_default_transform()
self.sentence1_dataset = ImageDataset(
- dataset, image_column_name=sentences_column_names[0], transform=transform
+ dataset,
+ image_column_name=sentences_column_names[0],
+ transform=default_transform,
)
self.sentence2_dataset = ImageDataset(
- dataset, image_column_name=sentences_column_names[1], transform=transform
+ dataset,
+ image_column_name=sentences_column_names[1],
+ transform=default_transform,
)
self.gold_scores = gold_scores
self.task_name = task_name
diff --git a/mteb/evaluation/evaluators/Image/ZeroShotClassificationEvaluator.py b/mteb/evaluation/evaluators/Image/ZeroShotClassificationEvaluator.py
index da3e8f5f97..b25232d28e 100644
--- a/mteb/evaluation/evaluators/Image/ZeroShotClassificationEvaluator.py
+++ b/mteb/evaluation/evaluators/Image/ZeroShotClassificationEvaluator.py
@@ -8,15 +8,20 @@
import torch
from sklearn import metrics
from torch.utils.data import DataLoader
-from torchvision import transforms
from mteb.encoder_interface import Encoder
+from mteb.requires_package import requires_image_dependencies
from ..Evaluator import Evaluator
logger = logging.getLogger(__name__)
-transform = transforms.Compose([transforms.PILToTensor()])
+
+def get_default_transform():
+ requires_image_dependencies()
+ from torchvision import transforms
+
+ return transforms.Compose([transforms.PILToTensor()])
class ImageDataset(torch.utils.data.Dataset):
@@ -52,7 +57,9 @@ def __init__(
):
super().__init__(**kwargs)
self.dataset = ImageDataset(
- dataset, image_column_name=image_column_name, transform=transform
+ dataset,
+ image_column_name=image_column_name,
+ transform=get_default_transform(),
)
self.image_column_name = image_column_name
self.labels = labels
diff --git a/mteb/models/cohere_v.py b/mteb/models/cohere_v.py
index a3856627b1..a5d88ae9aa 100644
--- a/mteb/models/cohere_v.py
+++ b/mteb/models/cohere_v.py
@@ -10,14 +10,11 @@
import torch
from PIL import Image
from torch.utils.data import DataLoader
-from torchvision import transforms
from tqdm import tqdm
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
-
-api_key = os.getenv("COHERE_API_KEY")
-tensor_to_image = transforms.Compose([transforms.ToPILImage()])
+from mteb.requires_package import requires_image_dependencies
def cohere_v_loader(**kwargs):
@@ -32,15 +29,20 @@ def __init__(
model_name: str,
**kwargs: Any,
):
- self.model_name = model_name
- self.client = cohere.ClientV2(api_key)
- self.image_format = "JPEG"
- """ Wrapper for Cohere multimodal embedding model,
+ """Wrapper for Cohere multimodal embedding model,
do `export COHERE_API_KEY=` before running eval scripts.
Cohere currently supports 40 images/min, thus time.sleep(1.5) is applied after each image.
Remove or adjust this after Cohere API changes capacity.
"""
+ requires_image_dependencies()
+ from torchvision import transforms
+
+ self.model_name = model_name
+ api_key = os.getenv("COHERE_API_KEY")
+ self.client = cohere.ClientV2(api_key)
+ self.image_format = "JPEG"
+ self.transform = transforms.Compose([transforms.PILToTensor()])
def get_text_embeddings(
self,
@@ -81,7 +83,7 @@ def get_image_embeddings(
for image in batch:
# cohere only supports 1 image per call
buffered = io.BytesIO()
- image = tensor_to_image(image)
+ image = self.transform(image)
image.save(buffered, format=self.image_format)
image_bytes = buffered.getvalue()
stringified_buffer = base64.b64encode(image_bytes).decode(
@@ -142,8 +144,8 @@ def calculate_probs(self, text_embeddings, image_embeddings):
def get_fused_embeddings(
self,
- texts: list[str] = None,
- images: list[Image.Image] | DataLoader = None,
+ texts: list[str] | None = None,
+ images: list[Image.Image] | DataLoader | None = None,
fusion_mode="sum",
**kwargs: Any,
):
diff --git a/mteb/models/evaclip_models.py b/mteb/models/evaclip_models.py
index 0b9e0e19bc..8cbd184447 100644
--- a/mteb/models/evaclip_models.py
+++ b/mteb/models/evaclip_models.py
@@ -10,6 +10,7 @@
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
+from mteb.requires_package import requires_image_dependencies
def evaclip_loader(**kwargs):
@@ -36,6 +37,8 @@ def __init__(
device: str = "cuda" if torch.cuda.is_available() else "cpu",
**kwargs: Any,
):
+ requires_image_dependencies()
+
self.model_name = model_name
self.device = device
pretrained = "eva_clip" # or "/path/to/EVA02_CLIP_B_psz16_s8B.pt"
@@ -86,10 +89,10 @@ def get_image_embeddings(
batch_size: int = 32,
**kwargs: Any,
):
+ import torchvision.transforms.functional as F
+
all_image_embeddings = []
if isinstance(images, DataLoader):
- import torchvision.transforms.functional as F
-
with torch.no_grad(), torch.cuda.amp.autocast():
for batch in tqdm(images):
# import pdb; pdb.set_trace()
diff --git a/mteb/models/jina_clip.py b/mteb/models/jina_clip.py
index 94d498802f..208b77e44a 100644
--- a/mteb/models/jina_clip.py
+++ b/mteb/models/jina_clip.py
@@ -11,6 +11,7 @@
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
+from mteb.requires_package import requires_image_dependencies
class JinaCLIPModelWrapper:
@@ -20,6 +21,8 @@ def __init__(
device: str = "cuda" if torch.cuda.is_available() else "cpu",
**kwargs: Any,
):
+ requires_image_dependencies()
+
self.model_name = model_name
self.device = device
self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(
@@ -63,12 +66,12 @@ def get_image_embeddings(
convert_to_tensor=True,
**kwargs: Any,
):
+ import torchvision.transforms.functional as F
+
all_image_embeddings = []
if isinstance(images, DataLoader):
with torch.no_grad():
- import torchvision.transforms.functional as F
-
for batch in tqdm(images):
image_outputs = self.model.encode_image(
[F.to_pil_image(b.to("cpu")) for b in batch],
diff --git a/mteb/models/llm2clip_models.py b/mteb/models/llm2clip_models.py
index 25ed3c6808..5c2a17cfe8 100644
--- a/mteb/models/llm2clip_models.py
+++ b/mteb/models/llm2clip_models.py
@@ -12,6 +12,7 @@
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
+from mteb.requires_package import requires_image_dependencies
MODEL2PROCESSOR = {
"microsoft/LLM2CLIP-Openai-L-14-336": "openai/clip-vit-large-patch14-336",
@@ -36,6 +37,8 @@ def __init__(
device: str = "cuda" if torch.cuda.is_available() else "cpu",
**kwargs: Any,
):
+ requires_image_dependencies()
+
if model_name not in MODEL2PROCESSOR:
raise Exception(
f"This model {model_name} is not in the supported mode list: {list(MODEL2PROCESSOR.keys())}."
@@ -119,10 +122,10 @@ def get_image_embeddings(
batch_size: int = 32,
**kwargs: Any,
):
+ import torchvision.transforms.functional as F
+
all_image_embeddings = []
if isinstance(images, DataLoader):
- import torchvision.transforms.functional as F
-
with torch.no_grad(), torch.amp.autocast("cuda"):
for batch in tqdm(images):
input_pixels = self.processor(
diff --git a/mteb/models/moco_models.py b/mteb/models/moco_models.py
index cb2ee875da..b88e9805c7 100644
--- a/mteb/models/moco_models.py
+++ b/mteb/models/moco_models.py
@@ -10,6 +10,7 @@
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
+from mteb.requires_package import requires_image_dependencies
def mocov3_loader(**kwargs):
@@ -29,6 +30,8 @@ def __init__(
device: str = "cuda" if torch.cuda.is_available() else "cpu",
**kwargs: Any,
):
+ requires_image_dependencies()
+
self.model_name = model_name
self.device = device
name = "vit_base_patch16_224"
@@ -69,11 +72,11 @@ def get_image_embeddings(
batch_size: int = 32,
**kwargs: Any,
):
+ import torchvision.transforms.functional as F
+
all_image_embeddings = []
if isinstance(images, DataLoader):
- import torchvision.transforms.functional as F
-
with torch.no_grad():
for batch in tqdm(images):
inputs = torch.vstack(
@@ -107,8 +110,8 @@ def calculate_probs(text_embeddings, image_embeddings):
def get_fused_embeddings(
self,
- texts: list[str] = None,
- images: list[Image.Image] | DataLoader = None,
+ texts: list[str] | None = None,
+ images: list[Image.Image] | DataLoader | None = None,
*,
task_name: str | None = None,
prompt_type: PromptType | None = None,
diff --git a/mteb/models/openai_models.py b/mteb/models/openai_models.py
index 0163376930..630528c983 100644
--- a/mteb/models/openai_models.py
+++ b/mteb/models/openai_models.py
@@ -26,10 +26,20 @@ def __init__(
"""Wrapper for OpenAI's embedding API.
To handle documents larger than 8191 tokens, we truncate the document to the specified sequence length.
"""
- requires_package(self, "openai", "Openai text embedding")
+ requires_package(
+ self,
+ "openai",
+ "Openai text embedding",
+ install_instruction="pip install mteb[openai]",
+ )
from openai import OpenAI
- requires_package(self, "tiktoken", "Tiktoken package")
+ requires_package(
+ self,
+ "tiktoken",
+ "Tiktoken package",
+ install_instruction="pip install mteb[openai]",
+ )
import tiktoken
self._client = OpenAI()
diff --git a/mteb/models/openclip_models.py b/mteb/models/openclip_models.py
index 3079ff6933..8399fd4f64 100644
--- a/mteb/models/openclip_models.py
+++ b/mteb/models/openclip_models.py
@@ -10,6 +10,7 @@
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
+from mteb.requires_package import requires_image_dependencies
def openclip_loader(**kwargs):
@@ -25,6 +26,8 @@ def __init__(
device: str = "cuda" if torch.cuda.is_available() else "cpu",
**kwargs: Any,
):
+ requires_image_dependencies()
+
self.model_name = model_name
self.device = device
self.model, _, self.img_preprocess = open_clip.create_model_and_transforms(
@@ -71,10 +74,10 @@ def get_image_embeddings(
batch_size: int = 32,
**kwargs: Any,
):
+ import torchvision.transforms.functional as F
+
all_image_embeddings = []
if isinstance(images, DataLoader):
- import torchvision.transforms.functional as F
-
with torch.no_grad(), torch.cuda.amp.autocast():
for batch in tqdm(images):
@@ -112,8 +115,8 @@ def calculate_probs(self, text_embeddings, image_embeddings):
def get_fused_embeddings(
self,
- texts: list[str] = None,
- images: list[Image.Image] | DataLoader = None,
+ texts: list[str] | None = None,
+ images: list[Image.Image] | DataLoader | None = None,
fusion_mode="sum",
**kwargs: Any,
):
diff --git a/mteb/models/vista_models.py b/mteb/models/vista_models.py
index 47382fae4a..0905e649ab 100644
--- a/mteb/models/vista_models.py
+++ b/mteb/models/vista_models.py
@@ -6,13 +6,11 @@
import torch
from PIL import Image
from torch.utils.data import DataLoader
-from torchvision import transforms
from tqdm import tqdm
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
-
-tensor_to_image = transforms.Compose([transforms.ToPILImage()])
+from mteb.requires_package import requires_image_dependencies
def vista_loader(**kwargs):
@@ -24,18 +22,48 @@ def vista_loader(**kwargs):
)
class VisualizedBGEWrapper(Visualized_BGE):
+ """Setting up VISTA
+
+ ```
+ git clone https://github.com/FlagOpen/FlagEmbedding.git
+ cd FlagEmbedding/research/visual_bge
+ pip install -e .
+ pip install torchvision timm einops ftfy
+ ```
+ Back in the root folder of mteb, download the vision tower for bge-base:
+ ```
+ cd ..
+ wget https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth?download=true
+ ```
+ and rename it to `visualized_base_en_V1.5.pth`:
+ ```
+ mv Visualized_base_en_v1.5.pth?download=true visualized_base_en_V1.5.pth
+ ```
+ Then download the vision tower for bge-m3:
+ ```
+ wget https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_m3.pth?download=true
+ ```
+ and rename it to `visualized_m3.pth`:
+ ```
+ mv Visualized_m3.pth?download=true visualized_m3.pth
+ ```
+ """
+
def __init__(
self,
- model_name_bge: str = None,
+ model_name_bge: str | None = None,
model_weight=None,
normlized: bool = True,
sentence_pooling_method: str = "cls",
negatives_cross_device: bool = False,
temperature: float = 0.02,
from_pretrained=None,
- image_tokens_num: int = None,
+ image_tokens_num: int | None = None,
**kwargs: Any,
):
+ requires_image_dependencies()
+ from torchvision import transforms
+
super().__init__(
model_name_bge=model_name_bge,
model_weight=model_weight,
@@ -49,6 +77,7 @@ def __init__(
self.max_text_len_with_image = (
self.tokenizer.model_max_length - image_tokens_num
)
+ self.tensor_to_image = transforms.Compose([transforms.ToPILImage()])
self.eval()
def encode_text(self, texts):
@@ -120,7 +149,7 @@ def encode(
]
else:
images = [
- self.preprocess_val(tensor_to_image(image))
+ self.preprocess_val(self.tensor_to_image(image))
for image in images
]
images = torch.stack(images)
diff --git a/mteb/models/vlm2vec_models.py b/mteb/models/vlm2vec_models.py
index fbf7bf9f0a..1d629b86c9 100644
--- a/mteb/models/vlm2vec_models.py
+++ b/mteb/models/vlm2vec_models.py
@@ -12,6 +12,7 @@
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
+from mteb.requires_package import requires_image_dependencies
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)
@@ -28,6 +29,7 @@ def __init__(
device: str = "cuda" if torch.cuda.is_available() else "cpu",
**kwargs,
):
+ requires_image_dependencies()
try:
import flash_attn # noqa
from peft import LoraConfig, PeftModel # noqa
@@ -119,11 +121,11 @@ def get_image_embeddings(
batch_size: int = 32,
**kwargs: Any,
):
+ import torchvision.transforms.functional as F
+
text = "<|image_1|> Represent the given image."
all_image_embeddings = []
if isinstance(images, DataLoader):
- import torchvision.transforms.functional as F
-
with torch.no_grad():
for batch in tqdm(images):
input_ids, pixel_values, image_sizes = [], [], []
@@ -253,8 +255,8 @@ def calculate_probs(self, text_embeddings, image_embeddings):
def get_fused_embeddings(
self,
- texts: list[str] = None,
- images: list[Image.Image] | DataLoader = None,
+ texts: list[str] | None = None,
+ images: list[Image.Image] | DataLoader | None = None,
*,
task_name: str | None = None,
prompt_type: PromptType | None = None,
@@ -262,6 +264,8 @@ def get_fused_embeddings(
fusion_mode="sum",
**kwargs: Any,
):
+ import torchvision.transforms.functional as F
+
if texts is None and images is None:
raise ValueError("Either texts or images must be provided")
@@ -283,8 +287,6 @@ def get_fused_embeddings(
texts = iter(texts)
all_fused_embeddings = []
if isinstance(images, DataLoader):
- import torchvision.transforms.functional as F
-
with torch.no_grad():
for batch in images:
input_ids, pixel_values, image_sizes = [], [], []
diff --git a/mteb/models/voyage_v.py b/mteb/models/voyage_v.py
index fc880347c5..96e7ff9997 100644
--- a/mteb/models/voyage_v.py
+++ b/mteb/models/voyage_v.py
@@ -1,21 +1,17 @@
from __future__ import annotations
import logging
-import os
from functools import partial
from typing import Any
import torch
from PIL import Image
from torch.utils.data import DataLoader
-from torchvision import transforms
from tqdm import tqdm
from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
-
-api_key = os.getenv("VOYAGE_API_KEY")
-tensor_to_image = transforms.Compose([transforms.ToPILImage()])
+from mteb.requires_package import requires_image_dependencies
def downsample_image(
@@ -37,15 +33,15 @@ def downsample_image(
logging.info(
f"Downsampling image from {width}x{height} to {new_width}x{new_height}"
)
- return image.resize(new_size, Image.LANCZOS)
+ return image.resize(new_size, Image.LANCZOS) # type: ignore
if width > height:
if width > 10000:
logging.error("Processing extremely wide images.")
- return image.resize((10000, height), Image.LANCZOS)
+ return image.resize((10000, height), Image.LANCZOS) # type: ignore
else:
if height > 10000:
logging.error("Processing extremely high images.")
- return image.resize((width, 10000), Image.LANCZOS)
+ return image.resize((width, 10000), Image.LANCZOS) # type: ignore
return image
@@ -67,8 +63,12 @@ def __init__(
model_name: str,
**kwargs: Any,
):
+ requires_image_dependencies()
+ from torchvision import transforms
+
self.model_name = model_name
self.vo = voyageai.Client()
+ self.tensor_to_image = transforms.Compose([transforms.ToPILImage()])
@retry(
stop=stop_after_attempt(6), # Stop after 6 attempts
@@ -132,7 +132,8 @@ def get_image_embeddings(
if index == 0:
assert len(batch) == batch_size
batch_images = [
- [downsample_image(tensor_to_image(image))] for image in batch
+ [downsample_image(self.tensor_to_image(image))]
+ for image in batch
]
embeddings = self._multimodal_embed(
batch_images, model=self.model_name, input_type=input_type
@@ -190,7 +191,8 @@ def get_fused_embeddings(
if index == 0:
assert len(batch) == batch_size
batch_images = [
- downsample_image(tensor_to_image(image)) for image in batch
+ downsample_image(self.tensor_to_image(image))
+ for image in batch
]
batch_texts = texts[
index * batch_size : (index + 1) * batch_size
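`downsample_image` expects a PIL image (it calls `.resize` with `Image.LANCZOS`), so tensors coming off a DataLoader must be converted tensor→PIL first, i.e. torchvision's `ToPILImage`, not `PILToTensor`. A dependency-light sketch of that conversion and of the long-side clamp, using numpy/PIL so it runs without torchvision (`to_pil` and `clamp_long_side` are illustrative stand-ins, not mteb's actual helpers):

```python
import numpy as np
from PIL import Image


def to_pil(chw: np.ndarray) -> Image.Image:
    # Stand-in for torchvision's ToPILImage: uint8 CHW array -> PIL image (HWC).
    return Image.fromarray(np.transpose(chw, (1, 2, 0)))


def clamp_long_side(image: Image.Image, max_side: int = 10000) -> Image.Image:
    # Sketch of the guard in downsample_image: cap the longest side at max_side.
    width, height = image.size
    if width > height and width > max_side:
        return image.resize((max_side, height), Image.LANCZOS)
    if height >= width and height > max_side:
        return image.resize((width, max_side), Image.LANCZOS)
    return image


batch = np.zeros((2, 3, 16, 16), dtype=np.uint8)  # a tiny stand-in for a DataLoader batch
pil_images = [clamp_long_side(to_pil(t)) for t in batch]
```

Going the other direction (`PILToTensor`) would hand `downsample_image` a tensor, which has no `.size` tuple or `.resize(..., Image.LANCZOS)` and fails immediately.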
diff --git a/mteb/requires_package.py b/mteb/requires_package.py
index a91c2ba093..d261acdffb 100644
--- a/mteb/requires_package.py
+++ b/mteb/requires_package.py
@@ -8,10 +8,25 @@ def _is_package_available(pkg_name: str) -> bool:
return package_exists
-def requires_package(obj, package_name: str, model_name: str) -> None:
+def requires_package(
+ obj, package_name: str, model_name: str, install_instruction: str | None = None
+) -> None:
if not _is_package_available(package_name):
+ install_instruction = (
+ f"pip install {package_name}"
+ if install_instruction is None
+ else install_instruction
+ )
name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
raise ImportError(
f"{name} requires the `{package_name}` library but it was not found in your environment. "
- + f"If you want to load {model_name} models, please `pip install {package_name}` else they will not be available."
+ + f"If you want to load {model_name} models, please run `{install_instruction}` to install the missing package."
+ )
+
+
+def requires_image_dependencies() -> None:
+ if not _is_package_available("torchvision"):
+ raise ImportError(
+ "You are trying to run the image subset of mteb without having installed the required dependency (`torchvision`). "
+ + "You can install it with `pip install 'mteb[image]'`."
)
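The extended `requires_package` signature follows a common optional-dependency pattern: probe for the package, and if absent, raise an `ImportError` that points at a custom install command (such as an extras group) rather than the bare package name. A minimal self-contained sketch of the pattern (the names mirror the diff above, but this is an illustration, not mteb's actual module):

```python
from __future__ import annotations

from importlib.util import find_spec


def _is_package_available(pkg_name: str) -> bool:
    # A package counts as available if an import spec can be resolved for it.
    return find_spec(pkg_name) is not None


def requires_package(
    obj, package_name: str, model_name: str, install_instruction: str | None = None
) -> None:
    # Raise a helpful ImportError when an optional dependency is missing,
    # preferring a caller-supplied install command (e.g. an extras group).
    if not _is_package_available(package_name):
        install_instruction = install_instruction or f"pip install {package_name}"
        name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
        raise ImportError(
            f"{name} requires the `{package_name}` library but it was not found in your environment. "
            f"If you want to load {model_name} models, please run `{install_instruction}`."
        )
```

A wrapper's `__init__` can then call, for example, `requires_package(self, "openai", "OpenAI text embedding", install_instruction="pip install mteb[openai]")`, so users are told about the extras group that pulls in every needed package at once.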
diff --git a/pyproject.toml b/pyproject.toml
index e1804ddd91..6e83a1b2d0 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,6 @@ dependencies = [
"typing_extensions>=0.0.0",
"eval_type_backport>=0.0.0",
"polars>=0.20.22",
- "torchvision>0.0.0",
]
@@ -53,6 +52,7 @@ homepage = "https://github.com/embeddings-benchmark/mteb"
mteb = "mteb.cli:main"
[project.optional-dependencies]
+image = ["torchvision>0.0.0"]
dev = [
"ruff==0.9.7", # locked so we don't get PRs which fail only due to a lint update
"pytest>=8.3.4",