Merged

Update #1578

45 commits
f1fe91f
[MIEB] Adding DataComp CLIP models (#1283)
isaac-chung Oct 8, 2024
b0bc4e2
[mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-be…
gowitheflow-1998 Oct 11, 2024
1b70f6d
[mieb] adding 10 tasks (#1290)
gowitheflow-1998 Oct 15, 2024
6e7dd3d
[mieb] Adding MOCOv3 models (#1293)
isaac-chung Oct 18, 2024
053b5be
[mieb] Add more Any2AnyRetrieval datasets (#1285)
Jamie-Stirling Oct 20, 2024
a3ec14d
[mieb] Add any2any multiple choice evaluator and abstask (and one tas…
Jamie-Stirling Oct 20, 2024
b73a133
[mieb] Fix FORB dataset (#1306)
isaac-chung Oct 22, 2024
a6f306f
[mieb] run tasks fix (#1302)
gowitheflow-1998 Oct 22, 2024
22751ca
[mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, me…
Jamie-Stirling Oct 22, 2024
8065568
[mieb] run tasks small fix (#1310)
gowitheflow-1998 Oct 22, 2024
2011aa1
[mieb] Add VLM2vec (#1323)
isaac-chung Oct 25, 2024
93260cb
feat: Merge main into MIEB (#1329)
KennethEnevoldsen Oct 27, 2024
6979b2a
[mieb] Add OpenCLIP models (#1335)
isaac-chung Oct 28, 2024
45ffa44
[mieb] new version with downsampled train split to 32 per class (#1327)
isaac-chung Oct 28, 2024
8054607
[mieb] Fix Jina CLIP (#1349)
isaac-chung Oct 28, 2024
874c1bc
fix: Add clevr license (#1356)
KennethEnevoldsen Oct 29, 2024
cf8ea1f
Add BLINK as multi-choice tasks (#1348)
Jamie-Stirling Oct 29, 2024
6652e56
[mieb] add Eva CLIP models (#1369)
isaac-chung Oct 31, 2024
9b178e6
[mieb] add siglip, cohere multimodal & some fixes for final run (#1357)
gowitheflow-1998 Oct 31, 2024
4b0facc
[mieb] fixes for final run (#1374)
gowitheflow-1998 Nov 1, 2024
a449b24
Update run_vista.md
gowitheflow-1998 Nov 1, 2024
3a18fbd
[mieb] Fix torch no grad (#1378)
Muennighoff Nov 4, 2024
1ef93e4
[mieb] Fix vlm2vec (#1380)
isaac-chung Nov 5, 2024
34094ea
[mieb] Remove null entries from corpus of ROxford, RParis (#1371)
Jamie-Stirling Nov 5, 2024
2b56317
[mieb] fixes (#1390)
Muennighoff Nov 5, 2024
2862323
[MIEB] Remove non-existent method for blip (#1394)
imenelydiaker Nov 5, 2024
8a8b8b7
[mieb] fix ALIGN; update Winoground revision id; update run script (#…
gowitheflow-1998 Nov 6, 2024
01b7f28
[mieb] Fix open clip for cv bench count (#1397)
isaac-chung Nov 7, 2024
cdb92c6
[mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoi…
Jamie-Stirling Nov 7, 2024
a06227e
[mieb] Fix EVA CLIP for CV Bench (#1414)
isaac-chung Nov 10, 2024
f757892
[mieb] Add calculate probs for vlm2vec (#1418)
isaac-chung Nov 10, 2024
f60465a
[mieb] Fix siglip bug & add retrieval datasets (#1424)
gowitheflow-1998 Nov 10, 2024
f0dd6f6
[mieb] use Logistic Regression classifier for AbsTaskImageMultilabelC…
isaac-chung Nov 10, 2024
66176a0
[mieb] mieb scripts (siglip rerun & linear probing ablation & params …
gowitheflow-1998 Nov 10, 2024
7e0779a
[MIEB] Change Flickr30k to test split (#1449)
Jamie-Stirling Nov 15, 2024
1429cce
[mieb] Fix VLM2vec dtype (#1462)
isaac-chung Nov 18, 2024
2fc19e7
[mieb] run script for missing results (#1472)
gowitheflow-1998 Nov 18, 2024
fab0b82
[mieb] Fix Moco model on CIFAR10Clustering (#1487)
isaac-chung Nov 22, 2024
67a035d
[mieb] Fix Flickr30k I2T and T2I (#1505)
isaac-chung Nov 27, 2024
ff34ff6
[MIEB] add missing siglip models (#1533)
SaitejaUtpala Nov 30, 2024
dc35ce3
fix typo (#1535)
SaitejaUtpala Nov 30, 2024
c77b923
[mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO…
izhx Dec 4, 2024
d45fbb2
Add Voyage's multimodal embedding (#1555)
gowitheflow-1998 Dec 5, 2024
5f0b9c0
[mieb] update script for final re-run (#1576)
gowitheflow-1998 Dec 10, 2024
d2bb0ac
fix: no longer using same query text for all of BLINKIT2TMultiChoice …
Jamie-Stirling Dec 10, 2024
4 changes: 2 additions & 2 deletions .github/pull_request_template.md
@@ -34,6 +34,6 @@ see also https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducibl

- [ ] I have filled out the ModelMeta object to the extent possible
- [ ] I have ensured that my model can be loaded using
- [ ] `mteb.get_model(model_name, revision_id)` and
- [ ] `mteb.get_model_meta(model_name, revision_id)`
- [ ] `mteb.get_model(model_name, revision)` and
- [ ] `mteb.get_model_meta(model_name, revision)`
- [ ] I have tested the implementation works on a representative set of tasks.
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -15,7 +15,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/mmteb.yml
@@ -16,7 +16,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
@@ -38,7 +38,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,11 +16,11 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest] #, macos-latest, windows-latest]
python-version: ["3.8", "3.9", "3.10"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
include:
# Add Windows with Python 3.8 only to avoid tests taking too long
- os: windows-latest
python-version: "3.8"
python-version: "3.9"

steps:
- uses: actions/checkout@v3
155 changes: 106 additions & 49 deletions README.md
@@ -18,7 +18,7 @@
<h4 align="center">
<p>
<a href="#installation">Installation</a> |
<a href="#usage">Usage</a> |
<a href="#usage-documentation">Usage</a> |
<a href="https://huggingface.co/spaces/mteb/leaderboard">Leaderboard</a> |
<a href="#documentation">Documentation</a> |
<a href="#citing">Citing</a>
@@ -36,9 +36,9 @@
pip install mteb
```

## Usage
## Example Usage

* Using a python script (see [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py) and [mteb/mtebscripts](https://github.com/embeddings-benchmark/mtebscripts) for more):
* Using a Python script:

```python
import mteb
@@ -55,6 +55,37 @@ evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```

<details>
<summary> Running a SentenceTransformer model with prompts </summary>

Prompts can be passed to the SentenceTransformer model using the `prompts` parameter. The following code shows how to use prompts with SentenceTransformer:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_komninos", prompts={"query": "Query:", "passage": "Passage:"})
evaluation = mteb.MTEB(tasks=tasks)
```

In `prompts`, the key can be:
1. A prompt type (`passage`, `query`) - these prompts will be used in reranking and retrieval tasks
2. A task type - these prompts will be used in all tasks of the given type
    1. `BitextMining`
    2. `Classification`
    3. `MultilabelClassification`
    4. `Clustering`
    5. `PairClassification`
    6. `Reranking`
    7. `Retrieval`
    8. `STS`
    9. `Summarization`
    10. `InstructionRetrieval`
3. A pair of task type and prompt type, like `Retrieval-query` - these prompts will be used for queries in all retrieval tasks
4. A task name - these prompts will be used in the specific task
5. A pair of task name and prompt type, like `NFCorpus-query` - these prompts will be used for queries in the specific task
</details>
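For illustration, the key-matching behavior described above can be sketched as a small lookup helper. The precedence used below (most specific key first) and the helper itself are assumptions for the sketch, not `mteb`'s actual implementation:

```python
from __future__ import annotations


def resolve_prompt(prompts: dict[str, str], task_name: str, task_type: str,
                   prompt_type: str | None) -> str | None:
    """Pick the most specific matching prompt for a task.

    Assumed precedence (illustrative only): task name + prompt type,
    task name, task type + prompt type, task type, prompt type.
    """
    candidates = []
    if prompt_type:
        candidates.append(f"{task_name}-{prompt_type}")
    candidates.append(task_name)
    if prompt_type:
        candidates.append(f"{task_type}-{prompt_type}")
    candidates.append(task_type)
    if prompt_type:
        candidates.append(prompt_type)
    for key in candidates:
        if key in prompts:
            return prompts[key]
    return None


# Hypothetical prompt strings, for demonstration only
prompts = {
    "query": "Query:",
    "Retrieval-query": "Find documents for:",
    "NFCorpus-query": "Find medical documents for:",
}
resolved = resolve_prompt(prompts, "NFCorpus", "Retrieval", "query")
```

A task-specific key such as `NFCorpus-query` thus shadows the more generic `Retrieval-query` and `query` entries.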

* Using CLI

```bash
@@ -71,17 +102,17 @@ mteb run -m sentence-transformers/all-MiniLM-L6-v2 \



## Advanced Usage
## Usage Documentation
Click on each section below to see the details.

<br />

<details>
<summary> Dataset selection </summary>
<summary> Task selection </summary>

### Dataset selection
### Task selection

Datasets can be selected by providing the list of datasets, but also
Tasks can be selected by providing a list of task names, but also

* by their task (e.g. "Clustering" or "Classification")

@@ -121,11 +152,33 @@ evaluation = mteb.MTEB(tasks=[
# for an example of a HF subset see "Subset" in the dataset viewer at: https://huggingface.co/datasets/mteb/bucc-bitext-mining
```

There are also presets available for certain task collections, e.g. to select the 56 English datasets that form the "Overall MTEB English leaderboard":
</details>

<details>
<summary> Running a benchmark </summary>

### Running a Benchmark

`mteb` comes with a set of predefined benchmarks. These can be fetched using `get_benchmark` and run in a similar fashion to other sets of tasks.
For instance, to select the 56 English datasets that form the "Overall MTEB English leaderboard":

```python
import mteb
benchmark = mteb.get_benchmark("MTEB(eng)")
evaluation = mteb.MTEB(tasks=benchmark)
```

The benchmark specifies not only a list of tasks, but also which splits and languages to run on. To get an overview of all available benchmarks, simply run:

```python
from mteb import MTEB_MAIN_EN
evaluation = mteb.MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])
import mteb
benchmarks = mteb.get_benchmarks()
```

Generally, we use the naming scheme `MTEB(*)` for benchmarks, where "*" denotes the target of the benchmark. In the case of a language, we use the three-letter language code. For large groups of languages, we use the group notation, e.g., `MTEB(Scandinavian)` for Scandinavian languages. External benchmarks implemented in MTEB, like `CoIR`, use their original name. When using a benchmark from MTEB, please cite `mteb` along with the citation of the benchmark, which you can access using:

```python
benchmark.citation
```

</details>
@@ -139,7 +192,7 @@ evaluation = mteb.MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])
To pass in arguments to the model's `encode` function, you can use the encode keyword arguments (`encode_kwargs`):

```python
evaluation.run(model, encode_kwargs={"batch_size": 32}
evaluation.run(model, encode_kwargs={"batch_size": 32})
```
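As a sketch of the mechanics (an assumption about the call chain, not `mteb`'s internals), `encode_kwargs` is simply forwarded as keyword arguments to the model's `encode`:

```python
class EchoModel:
    """Toy model that records the kwargs its encode() receives."""

    def __init__(self):
        self.seen_kwargs = None

    def encode(self, sentences, **kwargs):
        self.seen_kwargs = kwargs
        return [[0.0] for _ in sentences]


def run(model, sentences, encode_kwargs=None):
    # Illustrative stand-in for evaluation.run: forward encode_kwargs verbatim
    return model.encode(sentences, **(encode_kwargs or {}))


m = EchoModel()
run(m, ["hello"], encode_kwargs={"batch_size": 32})
```

So anything placed in `encode_kwargs` should be a keyword your model's `encode` actually accepts.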
</details>

@@ -167,55 +220,35 @@ Note that the public leaderboard uses the test splits for all datasets except MS
Models should implement the following interface, implementing an `encode` function taking as inputs a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.

```python
class MyModel():
from mteb.encoder_interface import PromptType

class CustomModel:
def encode(
self, sentences: list[str], **kwargs: Any
) -> torch.Tensor | np.ndarray:
self,
sentences: list[str],
task_name: str,
prompt_type: PromptType | None = None,
**kwargs,
) -> np.ndarray:
"""Encodes the given sentences using the encoder.

Args:
sentences: The sentences to encode.
task_name: The name of the task.
prompt_type: The prompt type to use.
**kwargs: Additional arguments to pass to the encoder.

Returns:
The encoded sentences.
"""
pass

model = MyModel()
model = CustomModel()
tasks = mteb.get_task("Banking77Classification")
evaluation = MTEB(tasks=tasks)
evaluation.run(model)
```
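To smoke-test the interface without a real embedding model, a hypothetical toy encoder following the same signature might look like this (the character-frequency embedding is purely illustrative):

```python
import numpy as np


class ToyCharModel:
    """Hypothetical stand-in for CustomModel: embeds each sentence as a
    deterministic 26-dim vector of lowercase letter frequencies."""

    def encode(self, sentences, task_name=None, prompt_type=None, **kwargs):
        vectors = np.zeros((len(sentences), 26), dtype=np.float32)
        for i, sent in enumerate(sentences):
            for ch in sent.lower():
                if "a" <= ch <= "z":
                    vectors[i, ord(ch) - ord("a")] += 1.0
        # L2-normalize non-empty rows so dot products act like cosine similarity
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.maximum(norms, 1e-9)


model = ToyCharModel()
emb = model.encode(["hello world", "banana"], task_name="Banking77Classification")
```

Such a stub is only useful for checking that the evaluation pipeline calls your `encode` with the expected arguments; its scores are meaningless.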

If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for `encode_queries` and `encode_corpus`. If these methods exist, they will be automatically used for those tasks. You can refer to the `DRESModel` at `mteb/evaluation/evaluators/RetrievalEvaluator.py` for an example of these functions.

```python
class MyModel():
def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
"""
Returns a list of embeddings for the given sentences.
Args:
queries: List of sentences to encode

Returns:
List of embeddings for the given sentences
"""
pass

def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
"""
Returns a list of embeddings for the given sentences.
Args:
corpus: List of sentences to encode
or list of dictionaries with keys "title" and "text"

Returns:
List of embeddings for the given sentences
"""
pass
```

</details>

<details>
@@ -297,7 +330,7 @@ from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

tasks = mteb.get_tasks( tasks=["NFCorpus"], languages=["eng"])
tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])

evaluation = MTEB(tasks=tasks)
evaluation.run(
@@ -309,7 +342,7 @@ evaluation.run(
```

CLI:
```
```bash
mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predictions
```

@@ -318,9 +351,11 @@ mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predicti
<details>
<summary> Fetching result from the results repository </summary>

Multiple models have already been run on tasks avaiable within MTEB. These results are available results [repository](https://github.com/embeddings-benchmark/results).
### Fetching results from the results repository

To make the results more easily accecible we have designed custom functionality for retrieving from the repository. For instance, you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:
Multiple models have already been run on tasks available within MTEB. These results are available in the results [repository](https://github.com/embeddings-benchmark/results).

To make the results more easily accessible, we have designed custom functionality for retrieving from the repository. For instance, if you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:

```python
import mteb
@@ -345,6 +380,26 @@ df = results_to_dataframe(results)

</details>

<details>
<summary> Caching Embeddings To Re-Use Them </summary>


### Caching Embeddings To Re-Use Them

There are times you may want to cache the embeddings so you can re-use them. This may be the case if you have multiple query sets for the same corpus (e.g. Wikipedia) or are doing some optimization over the queries (e.g. prompting, other experiments). You can set up a cache by using a simple wrapper, which will save the cache per task in the `cache_embeddings/{task_name}` folder:

```python
# define your task and model above as normal
...
# wrap the model with the cache wrapper
from mteb.models.cache_wrapper import CachedEmbeddingWrapper
model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path='path_to_cache_dir')
# run as normal
evaluation.run(model_with_cached_emb, ...)
```
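Conceptually, the wrapper memoizes embeddings per input text, so repeated inputs are encoded only once. A minimal in-memory sketch of that behavior (the real `CachedEmbeddingWrapper` persists to disk and is organized per task) could be:

```python
class InMemoryCacheWrapper:
    """Illustrative sketch: cache embeddings by text so each unique input
    is encoded at most once. Not the actual CachedEmbeddingWrapper."""

    def __init__(self, model):
        self._model = model
        self._cache = {}

    def encode(self, sentences, **kwargs):
        missing = [s for s in sentences if s not in self._cache]
        if missing:
            for s, vec in zip(missing, self._model.encode(missing, **kwargs)):
                self._cache[s] = vec
        return [self._cache[s] for s in sentences]


class CountingModel:
    """Toy model that counts how many sentences it actually encodes."""

    def __init__(self):
        self.calls = 0

    def encode(self, sentences, **kwargs):
        self.calls += len(sentences)
        return [[float(len(s))] for s in sentences]


base = CountingModel()
cached = InMemoryCacheWrapper(base)
cached.encode(["a", "bb"])
cached.encode(["a", "bb", "ccc"])  # only "ccc" reaches the underlying model
```

This is why the wrapped model, not the original one, must be passed to `evaluation.run`.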

</details>

<br />


@@ -354,6 +409,7 @@
| Documentation | |
| ------------------------------ | ---------------------- |
| 📋 [Tasks] | Overview of available tasks |
| 📐 [Benchmarks] | Overview of available benchmarks |
| 📈 [Leaderboard] | The interactive leaderboard of the benchmark |
| 🤖 [Adding a model] | Information related to how to submit a model to the leaderboard |
| 👩‍🔬 [Reproducible workflows] | Information related to how to reproduce and create reproducible workflows with MTEB |
@@ -363,6 +419,7 @@ df = results_to_dataframe(results)
| 🌐 [MMTEB] | An open-source effort to extend MTEB to cover a broad set of languages |  

[Tasks]: docs/tasks.md
[Benchmarks]: docs/benchmarks.md
[Contributing]: CONTRIBUTING.md
[Adding a model]: docs/adding_a_model.md
[Adding a dataset]: docs/adding_a_dataset.md
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/adding_a_dataset.md
@@ -77,7 +77,7 @@ evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)
```

> **Note:** for multilingual / crosslingual tasks, make sure your class also inherits from the `MultilingualTask` class like in [this](https://github.com/embeddings-benchmark/mteb-draft/blob/main/mteb/tasks/Classification/MTOPIntentClassification.py) example.
> **Note:** for multilingual / crosslingual tasks, make sure your class also inherits from the `MultilingualTask` class like in [this](https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Classification/multilingual/MTOPIntentClassification.py) example.



@@ -104,7 +104,7 @@ class VGClustering(AbsTaskClustering):
form="Written",
domains=["Academic", "Non-fiction"],
task_subtypes=["Scientific Reranking"],
license="cc-by-nc",
license="cc-by-nc-4.0",
annotations_creators="derived",
dialect=[],
text_creation="found",
29 changes: 21 additions & 8 deletions docs/adding_a_model.md
@@ -29,10 +29,7 @@ mteb run -m {model_name} -t {task_names}

These will save the results in a folder called `results/{model_name}/{model_revision}`.

For reference you can also look at [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py) for all MTEB English datasets used in the main ranking, or [scripts/run_mteb_chinese.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_chinese.py) for the Chinese ones.
Advanced scripts with different models are available in the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts).

2. **Format the results using the CLI:**
1. **Format the results using the CLI:**

```bash
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md
@@ -44,12 +41,28 @@ If readme of model exists:
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md
```

3. **Add the frontmatter to model repository:**
2. **Add the frontmatter to model repository:**

Copy the content of the `model_card.md` file to the top of a `README.md` file of your model on the Hub. See [here](https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/main/README.md) for an example.

4. **Wait for a refresh the leaderboard:**
3. **Wait for the leaderboard to refresh:**

The leaderboard [automatically refreshes daily](https://github.com/embeddings-benchmark/leaderboard/commits/main/) so once submitted you only need to wait for the automatic refresh. You can find the workflows for the leaderboard refresh [here](https://github.com/embeddings-benchmark/leaderboard/tree/main/.github/workflows). If you experience issues with the leaderboard please create an [issue](https://github.com/embeddings-benchmark/mteb/issues).

**Notes:**
- We remove models with scores that cannot be reproduced, so please ensure that your model is accessible and scores can be reproduced.
- An alternative way of submitting to the leaderboard is by opening a PR with your results [here](https://github.com/embeddings-benchmark/results) and checking that they are displayed correctly by [locally running the leaderboard](https://github.com/embeddings-benchmark/leaderboard?tab=readme-ov-file#developer-setup).

##### Using Prompts with Sentence Transformers

If your model uses Sentence Transformers and requires different prompts for encoding the queries and corpus, you can take advantage of the `prompts` [parameter](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer).

Internally, `mteb` uses the prompt named `query` for encoding the queries and `passage` as the prompt name for encoding the corpus. This is aligned with the default names used by Sentence Transformers.
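As a sketch, this convention amounts to selecting a prompt by its prompt type and prepending it to the text before encoding. The helper and prompt strings below are illustrative assumptions, not `mteb`'s actual code:

```python
# Hypothetical prompt strings mimicking the query/passage convention above
PROMPTS = {
    "query": "Represent this query: ",
    "passage": "Represent this passage: ",
}


def apply_prompt(text, prompt_type):
    # prompt_type is "query" for queries and "passage" for corpus documents;
    # unknown or missing types leave the text unchanged
    return PROMPTS.get(prompt_type, "") + text


q = apply_prompt("treatment for migraines", "query")
d = apply_prompt("Migraines are commonly treated with triptans.", "passage")
```

The important part is the naming: as long as your model's prompt dictionary uses the keys `query` and `passage`, `mteb` can pick the right one per input.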

###### Adding the prompts in the model configuration (Preferred)

You can directly add the prompts when saving and uploading your model to the Hub. For an example, refer to this [configuration file](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5/blob/3b5a16eaf17e47bd997da998988dce5877a57092/config_sentence_transformers.json).

The leaderboard will then automatically refresh daily so once submitted all you have to do is wait for the automatic refresh.
###### Instantiating the Model with Prompts

You can find the workflows for the leaderboard refresh [here](https://github.com/embeddings-benchmark/leaderboard/tree/main/.github/workflows). If you experience issues with the leaderboard please create an [issue](https://github.com/embeddings-benchmark/mteb/issues).
If you are unable to directly add the prompts in the model configuration, you can instantiate the model using the `sentence_transformers_loader` and pass `prompts` as an argument. For more details, see the `mteb/models/bge_models.py` file.