8 changes: 4 additions & 4 deletions .github/pull_request_template.md
@@ -29,12 +29,12 @@
- [ ] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`
- [ ] I have filled out the metadata object in the dataset file (find documentation on it [here](https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md#2-creating-the-metadata-object)).
- [ ] Run tests locally to make sure nothing is broken using `make test`.
- [ ] Run the formatter to format the code using `make lint`.
- [ ] Run tests locally to make sure nothing is broken using `make test`.
- [ ] Run the formatter to format the code using `make lint`.


### Adding a model checklist
<!--
<!--
When adding a model to the model registry
see also https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducible_workflow.md
-->
@@ -43,4 +43,4 @@ see also https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducibl
- [ ] I have ensured that my model can be loaded using
- [ ] `mteb.get_model(model_name, revision)` and
- [ ] `mteb.get_model_meta(model_name, revision)`
- [ ] I have tested the implementation works on a representative set of tasks.
- [ ] I have tested the implementation works on a representative set of tasks.
3 changes: 1 addition & 2 deletions .github/workflows/docs.yml
@@ -47,7 +47,7 @@ jobs:
- name: Create table
run: |
make build-docs

- name: Push table
run: |
git config --global user.email "github-actions[bot]@users.noreply.github.com"
@@ -60,4 +60,3 @@ jobs:
git commit -m "Update tasks table"
git push
fi

4 changes: 2 additions & 2 deletions .github/workflows/leaderboard_refresh.yaml
@@ -2,8 +2,8 @@ name: Daily Space Rebuild
on:
schedule:
# Runs at midnight Pacific Time (8 AM UTC)
- cron: '0 8 * * *'
workflow_dispatch: # Allows manual triggering
- cron: "0 8 * * *"
workflow_dispatch: # Allows manual triggering

jobs:
rebuild:
1 change: 0 additions & 1 deletion .github/workflows/lint.yml
@@ -25,4 +25,3 @@ jobs:
id: lint
run: |
make lint-check

22 changes: 11 additions & 11 deletions .github/workflows/model_loading.yml
@@ -3,22 +3,22 @@ name: Model Loading
on:
pull_request:
paths:
- 'mteb/models/**.py'
- "mteb/models/**.py"

jobs:
extract-and-run:
runs-on: ubuntu-latest

steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Checkout repository
uses: actions/checkout@v3

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
cache: 'pip'
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.10"
cache: "pip"

- name: Install dependencies and run tests
run: |
make model-load-test BASE_BRANCH=${{ github.event.pull_request.base.ref }}
- name: Install dependencies and run tests
run: |
make model-load-test BASE_BRANCH=${{ github.event.pull_request.base.ref }}
7 changes: 3 additions & 4 deletions .github/workflows/release.yml
@@ -20,8 +20,7 @@ jobs:
runs-on: ubuntu-latest
concurrency: release
permissions:
id-token: write # IMPORTANT: this permission is mandatory for trusted publishing using PyPI

id-token: write # IMPORTANT: this permission is mandatory for trusted publishing using PyPI

if: ${{ github.ref == 'refs/heads/main' && github.event.workflow_run.conclusion == 'success'}}
steps:
@@ -40,8 +39,8 @@ jobs:
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
if: steps.release.outputs.released == 'true'
# This action supports PyPI's trusted publishing implementation, which allows authentication to PyPI without a manually
# configured API token or username/password combination. To perform trusted publishing with this action, your project's
# This action supports PyPI's trusted publishing implementation, which allows authentication to PyPI without a manually
# configured API token or username/password combination. To perform trusted publishing with this action, your project's
# publisher must already be configured on PyPI.

- name: Publish package distributions to GitHub Releases
4 changes: 1 addition & 3 deletions .github/workflows/test.yml
@@ -2,7 +2,6 @@
# 1) install Python dependencies
# 2) run make test


name: Test
on:
push:
@@ -30,7 +29,7 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
cache: "pip"

- name: Install dependencies
shell: bash
run: |
@@ -53,4 +52,3 @@ jobs:
# if it fails again, the workflow will fail.
# If it passes the first time the test will not run again
make test || make test

2 changes: 1 addition & 1 deletion .gitignore
@@ -151,4 +151,4 @@ model_names.txt
mteb/leaderboard/__cached_results.json

# gradio
.gradio/
.gradio/
31 changes: 31 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,31 @@
fail_fast: true

repos:
- repo: https://github.com/abravalheri/validate-pyproject
rev: v0.23
hooks:
- id: validate-pyproject

- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
- id: check-json
- id: pretty-format-json
args:
- "--autofix"
- "--indent=4"
- "--no-sort-keys"
- id: end-of-file-fixer # generated a lot of changes
- id: trailing-whitespace
- id: check-toml

- repo: local
hooks:
- id: lint
name: lint
description: "Run 'make lint'"
entry: make lint
language: python
types_or: [python]
minimum_pre_commit_version: "2.9.2"
2 changes: 1 addition & 1 deletion .vscode/extensions.json
@@ -2,4 +2,4 @@
"recommendations": [
"charliermarsh.ruff"
]
}
}
2 changes: 1 addition & 1 deletion .vscode/settings.json
@@ -4,5 +4,5 @@
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.defaultFormatter": "charliermarsh.ruff"
}
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,5 +1,5 @@
## Contributing to MTEB
We welcome contributions such as new datasets to MTEB! Please see the related [issue](https://github.com/embeddings-benchmark/mteb/issues/360) for more information.
We welcome contributions such as new datasets to MTEB! Please see the related [issue](https://github.com/embeddings-benchmark/mteb/issues/360) for more information.

Once you have decided on your contribution, this document describes how to set up the repository for development.

@@ -41,4 +41,4 @@ MTEB follows [semantic versioning](https://semver.org/). This means that the ver

Any commit with one of these prefixes will trigger a version bump upon merging to the main branch as long as tests pass. A version bump will then trigger a new release on PyPI as well as a new release on GitHub.

Other prefixes, for example `docs:`, `chore:`, and `refactor:`, will not trigger a version bump; however, they will still structure the commit history and the changelog. You can find more information about this in the [python-semantic-release documentation](https://python-semantic-release.readthedocs.io/en/latest/). If you do not intend to trigger a version bump, you're not required to follow this convention when contributing to MTEB.
Other prefixes, for example `docs:`, `chore:`, and `refactor:`, will not trigger a version bump; however, they will still structure the commit history and the changelog. You can find more information about this in the [python-semantic-release documentation](https://python-semantic-release.readthedocs.io/en/latest/). If you do not intend to trigger a version bump, you're not required to follow this convention when contributing to MTEB.
14 changes: 11 additions & 3 deletions Makefile
@@ -1,6 +1,7 @@
install:
@echo "--- 🚀 Installing project dependencies ---"
pip install -e ".[dev]"
pre-commit install

install-for-tests:
@echo "--- 🚀 Installing project dependencies for test ---"
@@ -10,7 +11,7 @@ install-for-tests:
lint:
@echo "--- 🧹 Running linters ---"
ruff format . # running ruff formatting
ruff check . --fix # running ruff linting
ruff check . --fix --exit-non-zero-on-fix # running ruff linting # --exit-non-zero-on-fix is used for the pre-commit hook to work

lint-check:
@echo "--- 🧹 Check if project is linted ---"
@@ -22,9 +23,10 @@ test:
@echo "--- 🧪 Running tests ---"
pytest -n auto -m "not test_datasets"


test-with-coverage:
@echo "--- 🧪 Running tests with coverage ---"
pytest -n auto --cov-report=term-missing --cov-config=pyproject.toml --cov=mteb
pytest -n auto --cov-report=term-missing --cov-config=pyproject.toml --cov=mteb

pr:
@echo "--- 🚀 Running requirements for a PR ---"
@@ -52,4 +54,10 @@ dataset-load-test:

run-leaderboard:
@echo "--- 🚀 Running leaderboard locally ---"
python -m mteb.leaderboard.app
python -m mteb.leaderboard.app


.PHONY: check
check: ## Run code quality tools.
@echo "--- 🧹 Running code quality tools ---"
@pre-commit run -a
22 changes: 11 additions & 11 deletions README.md
@@ -67,7 +67,7 @@ evaluation = mteb.MTEB(tasks=tasks)
```

In prompts the key can be:
1. Prompt types (`passage`, `query`) - they will be used in reranking and retrieval tasks
1. Prompt types (`passage`, `query`) - they will be used in reranking and retrieval tasks
2. Task type - these prompts will be used in all tasks of the given type
1. `BitextMining`
2. `Classification`
@@ -103,7 +103,7 @@ mteb run -m sentence-transformers/all-MiniLM-L6-v2 \
## Usage Documentation
Click on each section below to see the details.

<br />
<br />

<details>
<summary> Task selection </summary>
@@ -159,7 +159,7 @@ tasks = mteb.get_tasks(modalities=["text", "image"]) # Only select tasks with te
You can also specify exclusive modality filtering to only get tasks with exactly the requested modalities (default behavior with exclusive_modality_filter=False):
```python
# Get tasks with text modality, this will also include tasks having both text and image modalities
tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=False)
tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=False)

# Get tasks that have ONLY text modality (no image or other modalities)
tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=True)
@@ -172,7 +172,7 @@ tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=True)

### Running a Benchmark

`mteb` comes with a set of predefined benchmarks. These can be fetched using `get_benchmark` and run in a similar fashion to other sets of tasks.
`mteb` comes with a set of predefined benchmarks. These can be fetched using `get_benchmark` and run in a similar fashion to other sets of tasks.
For instance, to select the 56 English datasets that form the "Overall MTEB English leaderboard":

```python
@@ -262,13 +262,13 @@ class CustomModel:
**kwargs,
) -> np.ndarray:
"""Encodes the given sentences using the encoder.

Args:
sentences: The sentences to encode.
task_name: The name of the task.
prompt_type: The prompt type to use.
**kwargs: Additional arguments to pass to the encoder.

Returns:
The encoded sentences.
"""
@@ -312,7 +312,7 @@ evaluation.run(model)

### Using a cross encoder for reranking

To use a cross encoder for reranking, you can directly use a CrossEncoder from SentenceTransformers. The following code shows a two-stage run with the second stage reading results saved from the first stage.
To use a cross encoder for reranking, you can directly use a CrossEncoder from SentenceTransformers. The following code shows a two-stage run with the second stage reading results saved from the first stage.

```python
from mteb import MTEB
@@ -468,7 +468,7 @@ model_w_contamination = ModelMeta(
### Running the Leaderboard

It is possible to completely deploy the leaderboard locally or self-host it. This can, for example, be relevant for companies that might want to
build their own benchmarks or integrate custom tasks into existing benchmarks.
build their own benchmarks or integrate custom tasks into existing benchmarks.

Running the leaderboard is quite easy. Simply run:
```py
@@ -494,12 +494,12 @@ There are times you may want to cache the embeddings so you can re-use them. Thi
from mteb.models.cache_wrapper import CachedEmbeddingWrapper
model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path='path_to_cache_dir')
# run as normal
evaluation.run(model, ...)
evaluation.run(model, ...)
```
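
For context, a slightly fuller sketch of how the wrapper can slot into a run is shown below. All of the calls appear elsewhere in this repository's documentation; the model name, task name, and cache directory are illustrative placeholders.

```python
import mteb
from mteb.models.cache_wrapper import CachedEmbeddingWrapper

# Illustrative model and task; substitute your own.
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path="path_to_cache_dir")

tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Re-running with the same cache_path reuses the stored embeddings instead of re-encoding.
evaluation.run(model_with_cached_emb)
```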

</details>

<br />
<br />



@@ -540,7 +540,7 @@ MTEB was introduced in "[MTEB: Massive Text Embedding Benchmark](https://arxiv.o
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
publisher = {arXiv},
journal={arXiv preprint arXiv:2210.07316},
journal={arXiv preprint arXiv:2210.07316},
year = {2022}
}
```
2 changes: 1 addition & 1 deletion docs/adding_a_benchmark.md
@@ -4,4 +4,4 @@ The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/lead

1. Add your benchmark to [benchmark.py](../mteb/benchmarks/benchmarks.py) as a `Benchmark` object, and select the MTEB tasks that will be in the benchmark (a rough sketch of such an entry is shown after this list). If some of the tasks do not exist in MTEB, follow the "add a dataset" instructions to add them.
2. Open a PR at https://github.com/embedding-benchmark/results with results of models on your benchmark.
3. When PRs are merged, your benchmark will be added to the leaderboard automatically after the next workflow trigger.
3. When PRs are merged, your benchmark will be added to the leaderboard automatically after the next workflow trigger.
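
As a rough illustration of the first step, a new entry in `benchmarks.py` might look like the sketch below. The field names mirror existing `Benchmark` entries but should be treated as assumptions; check the dataclass definition in that file for the exact signature. The import path, benchmark name, and task list are placeholders.

```python
import mteb
from mteb.benchmarks.benchmarks import Benchmark  # assumed import path, matching the file linked above

# Hypothetical benchmark entry; name, tasks, and metadata are placeholders.
MY_CUSTOM_BENCHMARK = Benchmark(
    name="MyBenchmark(custom)",
    tasks=mteb.get_tasks(tasks=["NFCorpus", "Banking77Classification"]),
    description="A small example benchmark with one retrieval and one classification task.",
    reference="https://example.com/my-benchmark",
)
```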
12 changes: 6 additions & 6 deletions docs/adding_a_model.md
@@ -5,7 +5,7 @@ The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/lead
1. **Add meta information about your model to [model dir](../mteb/models/)**. See the docstring of ModelMeta for meta data details.
```python
from mteb.model_meta import ModelMeta

bge_m3 = ModelMeta(
name="model_name",
languages=["model_languages"], # in format eng-Latn
@@ -31,12 +31,12 @@ The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/lead
from mteb.models.wrapper import Wrapper
from mteb.encoder_interface import PromptType
import numpy as np

class CustomWrapper(Wrapper):
def __init__(self, model_name, model_revision):
super().__init__(model_name, model_revision)
# your custom implementation here

def encode(
self,
sentences: list[str],
@@ -52,7 +52,7 @@ The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/lead
```python
your_model = ModelMeta(
loader=partial(
CustomWrapper,
CustomWrapper,
model_name="model_name",
model_revision="5617a9f61b028005a4858fdac845db406aefb181"
),
@@ -70,7 +70,7 @@ import mteb
model = mteb.get_model("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

tasks = mteb.get_tasks(...) # get specific tasks
# or
# or
tasks = mteb.get_benchmark("MTEB(eng, classic)") # or use a specific benchmark

evaluation = mteb.MTEB(tasks=tasks)
@@ -95,7 +95,7 @@ To add results to the public leaderboard you can push your results to the [resul

##### Using Prompts with Sentence Transformers

If your model uses Sentence Transformers and requires different prompts for encoding the queries and corpus, you can take advantage of the `prompts` [parameter](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer).
If your model uses Sentence Transformers and requires different prompts for encoding the queries and corpus, you can take advantage of the `prompts` [parameter](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer).

Internally, `mteb` uses `query` for encoding the queries and `passage` as the prompt names for encoding the corpus. This is aligned with the default names used by Sentence Transformers.
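
As a rough sketch of this setup, the snippet below passes query and passage prompts directly to a Sentence Transformers model before handing it to `mteb`. The model name, task, and prompt strings are placeholders rather than recommended values.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Placeholder prompts keyed by the names mteb looks up when encoding queries and corpus documents.
model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
)

tasks = mteb.get_tasks(tasks=["NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model)
```

Because the prompt names match the defaults described above, no extra wiring is needed for `mteb` to pick the right prompt for queries versus corpus documents.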

2 changes: 1 addition & 1 deletion docs/benchmarks.md
@@ -30,4 +30,4 @@ The following table gives you an overview of the benchmarks in MTEB.
| [MTEB(rus)](https://aclanthology.org/2023.eacl-main.148/) | 23 | {'Classification': 9, 'Clustering': 3, 'MultilabelClassification': 2, 'PairClassification': 1, 'Reranking': 2, 'Retrieval': 3, 'STS': 3} | [Web, Social, Academic, Written, Blog, News, Spoken, Reviews, Encyclopaedic] | rus |
| [NanoBEIR](https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6) | 13 | {'Retrieval': 13} | [Web, Academic, Social, Medical, Written, Non-fiction, News, Encyclopaedic] | eng |
| [RAR-b](https://arxiv.org/abs/2404.06347) | 17 | {'Retrieval': 17} | [Encyclopaedic, Written, Programming] | eng |
<!-- BENCHMARKS TABLE END -->
<!-- BENCHMARKS TABLE END -->