Merged

375 commits
537b974
Update tasks table
github-actions[bot] Nov 6, 2024
0c7c216
1.19.0
invalid-email-address Nov 6, 2024
b1a0ec6
fix: Add the_ugly_duckling.txt for speedtask to Python wheel (#1402)
helena-intel Nov 7, 2024
a85c550
1.19.1
invalid-email-address Nov 7, 2024
fd8b283
fix: Added the necessary trust_remote_code (#1406)
AlexeyVatolin Nov 7, 2024
7438cfa
1.19.2
invalid-email-address Nov 7, 2024
fccf034
docs: Update recommendation for pushing results (#1401)
KennethEnevoldsen Nov 9, 2024
9681eb3
docs: Fix a typo in README (#1430)
eherra Nov 11, 2024
cc7a106
fix: add logging for RetrievalEvaluator NaN values for similarity sco…
KennethEnevoldsen Nov 11, 2024
8efb4e0
1.19.3
invalid-email-address Nov 11, 2024
7f1a1d3
fix: make samples_per_label a task attribute (#1419)
isaac-chung Nov 11, 2024
f79d9ba
fix: Add Korean AutoRAGRetrieval (#1388)
yjoonjang Nov 11, 2024
a240ea0
fix: Add missing benchmarks in benchmarks.py (#1431)
KennethEnevoldsen Nov 11, 2024
d069aba
Update tasks table
github-actions[bot] Nov 11, 2024
19aefa3
1.19.4
invalid-email-address Nov 11, 2024
76c2112
Leaderboard 2.0: added performance x n_parameters plot + more benchma…
x-tabdeveloping Nov 12, 2024
3a1a470
Leaderboard: Fixed code benchmarks (#1441)
x-tabdeveloping Nov 13, 2024
dd5d226
fix: Count unique texts, data leaks in calculate metrics (#1438)
Samoed Nov 14, 2024
04ac3f2
fix: update task metadata to allow for null (#1448)
KennethEnevoldsen Nov 14, 2024
f6a49fe
Update tasks table
github-actions[bot] Nov 14, 2024
78c0e4e
1.19.5
invalid-email-address Nov 14, 2024
4e86cea
Fix: Made data parsing in the leaderboard figure more robust (#1450)
x-tabdeveloping Nov 14, 2024
039d010
Fixed task loading (#1451)
x-tabdeveloping Nov 14, 2024
feb1ab7
fix: publish (#1452)
x-tabdeveloping Nov 14, 2024
3397633
1.19.6
invalid-email-address Nov 14, 2024
14d7523
fix: Fix load external results with `None` mteb_version (#1453)
Samoed Nov 14, 2024
68eb498
1.19.7
invalid-email-address Nov 14, 2024
58c459b
WIP: Polishing up leaderboard UI (#1461)
x-tabdeveloping Nov 15, 2024
1b920ac
fix: loading pre 1.11.0 (#1460)
Samoed Nov 15, 2024
a988fef
1.19.8
invalid-email-address Nov 15, 2024
9b2aece
fix: swap touche2020 to maintain compatibility (#1469)
isaac-chung Nov 17, 2024
8bb4a29
1.19.9
invalid-email-address Nov 17, 2024
2fb6fe7
docs: Add sum per language for task counts (#1468)
isaac-chung Nov 18, 2024
fde124a
fix: pinned datasets to <3.0.0 (#1470)
Napuh Nov 19, 2024
7186e04
1.19.10
invalid-email-address Nov 19, 2024
1cc6c9e
feat: add CUREv1 retrieval dataset (#1459)
dbuades Nov 21, 2024
4408717
Update tasks table
github-actions[bot] Nov 21, 2024
3ff38ec
1.20.0
invalid-email-address Nov 21, 2024
917ad7f
fix: check if `model` attr of model exists (#1499)
Samoed Nov 26, 2024
cde720e
1.20.1
invalid-email-address Nov 26, 2024
0affa31
fix: Leaderboard demo data loading (#1507)
x-tabdeveloping Nov 27, 2024
594f643
1.20.2
invalid-email-address Nov 27, 2024
35245d3
fix: leaderboard only shows models that have ModelMeta (#1508)
x-tabdeveloping Nov 27, 2024
9282796
1.20.3
invalid-email-address Nov 27, 2024
942f212
fix: align readme with current mteb (#1493)
Samoed Nov 27, 2024
09f004c
1.20.4
invalid-email-address Nov 27, 2024
cfd43ac
docs: Add lang family mapping and map to task table (#1486)
isaac-chung Nov 28, 2024
377a63d
Update tasks table
github-actions[bot] Nov 28, 2024
e3d2b54
fix: Ensure that models match the names on embedding-benchmarks/resul…
KennethEnevoldsen Nov 29, 2024
9980c60
1.20.5
invalid-email-address Nov 29, 2024
b02ae82
fix: Adding missing metadata on models and mathcing names up with the…
x-tabdeveloping Nov 29, 2024
ba09b11
1.20.6
invalid-email-address Nov 29, 2024
8e12250
feat: Evaluate missing splits (#1525)
isaac-chung Nov 29, 2024
ee1edac
1.21.0
invalid-email-address Nov 29, 2024
343b6e0
fix: Correct typos superseeded -> superseded (#1532)
isaac-chung Nov 30, 2024
e949d2a
1.21.1
invalid-email-address Nov 30, 2024
5b6f20f
fix: Task load data error for SICK-BR-STS and XStance (#1534)
isaac-chung Dec 1, 2024
ec9413a
1.21.2
invalid-email-address Dec 1, 2024
39349ff
fix: Proprietary models now get correctly shown in leaderboard (#1530)
x-tabdeveloping Dec 2, 2024
d07c29b
1.21.3
invalid-email-address Dec 2, 2024
5fa7b7b
docs: Add Model Meta parameters and metadata (#1536)
isaac-chung Dec 2, 2024
36bab4d
fix: add more model meta (jina, e5) (#1537)
isaac-chung Dec 4, 2024
ac4a706
1.21.4
invalid-email-address Dec 4, 2024
c2f4c26
Add cohere models (#1538)
KennethEnevoldsen Dec 4, 2024
5013df8
fix: add nomic models (#1543)
KennethEnevoldsen Dec 4, 2024
97ab272
fix: Added all-minilm-l12-v2 (#1542)
KennethEnevoldsen Dec 4, 2024
df11c38
fix: Added arctic models (#1541)
KennethEnevoldsen Dec 4, 2024
37fdfa1
fix: add sentence trimming to OpenAIWrapper (#1526)
yjoonjang Dec 4, 2024
1e62184
1.21.5
invalid-email-address Dec 4, 2024
a44a46c
fix: Fixed metadata errors (#1547)
x-tabdeveloping Dec 4, 2024
d713525
1.21.6
invalid-email-address Dec 4, 2024
279a4ee
fix: remove curev1 from multlingual (#1552)
KennethEnevoldsen Dec 5, 2024
e339735
1.21.7
invalid-email-address Dec 5, 2024
2ee8d44
fix: Add Model2vec (#1546)
x-tabdeveloping Dec 6, 2024
2905813
Made result loading more permissive, changed eval splits for HotPotQA…
x-tabdeveloping Dec 6, 2024
a6ce6f9
1.21.8
invalid-email-address Dec 6, 2024
fc64791
docs: Correction of SICK-R metadata (#1558)
rafalposwiata Dec 7, 2024
611b6a1
feat(google_models): fix issues and add support for `text-embedding-0…
dbuades Dec 7, 2024
5e7e033
1.22.0
invalid-email-address Dec 7, 2024
ac44e58
fix(bm25s): search implementation (#1566)
dbuades Dec 7, 2024
b8ff89c
1.22.1
invalid-email-address Dec 7, 2024
03347eb
docs: Fix dependency library name for bm25s (#1568)
isaac-chung Dec 7, 2024
6489fca
fix: Add training dataset to model meta (#1561)
KennethEnevoldsen Dec 8, 2024
1d21818
feat: (cohere_models) cohere_task_type issue, batch requests and tqdm…
dbuades Dec 8, 2024
68bd8ac
fix(publichealth-qa): ignore rows with `None` values in `question` o…
dbuades Dec 8, 2024
2550a27
1.23.0
invalid-email-address Dec 8, 2024
ce8c175
fix: Added metadata for miscellaneous models (#1557)
x-tabdeveloping Dec 9, 2024
f9ede12
1.23.1
invalid-email-address Dec 9, 2024
c49f838
fix: Added radar chart displaying capabilities on task types (#1570)
x-tabdeveloping Dec 9, 2024
e605c7b
1.23.2
invalid-email-address Dec 9, 2024
53756ad
feat: add new arctic v2.0 models (#1574)
dbuades Dec 10, 2024
27f7d8c
1.24.0
invalid-email-address Dec 10, 2024
7b9b3c9
fix: Add namaa MrTydi reranking dataset (#1573)
omarelshehy Dec 11, 2024
1101db7
Update tasks table
github-actions[bot] Dec 11, 2024
9c0b208
1.24.1
invalid-email-address Dec 11, 2024
373db74
fix: Eval langs not correctly passed to monolingual tasks (#1587)
Samoed Dec 13, 2024
eecc9f1
1.24.2
invalid-email-address Dec 13, 2024
fdfdaef
feat: Add ColBert (#1563)
sam-hey Dec 14, 2024
b466051
1.25.0
invalid-email-address Dec 14, 2024
992b20b
doc: colbert add score_function & doc section (#1592)
sam-hey Dec 15, 2024
8e6ee46
Feat: add support for scoring function (#1594)
Samoed Dec 15, 2024
95d5ae5
Add new models nvidia, gte, linq (#1436)
AlexeyVatolin Dec 16, 2024
0c9e046
Leaderboard: Refined plots (#1601)
x-tabdeveloping Dec 16, 2024
6ecc86f
fix: Leaderboard refinements (#1603)
x-tabdeveloping Dec 16, 2024
5e9c468
1.25.1
invalid-email-address Dec 16, 2024
b81b584
Feat: Use similarity scores if available (#1602)
Samoed Dec 16, 2024
6731b94
Add NanoBEIR Datasets (#1588)
KGupta10 Dec 18, 2024
9de7f20
Update tasks table
github-actions[bot] Dec 18, 2024
48cb97d
Feat: Evaluate missing languages (#1584)
Samoed Dec 18, 2024
ad05983
Add IBM Granite Embedding Models (#1613)
aashka-trivedi Dec 19, 2024
7c8e094
fix: disable co2_tracker for API models (#1614)
dbuades Dec 20, 2024
d8c015f
1.25.2
invalid-email-address Dec 20, 2024
0c44482
fix: set `use_instructions` to True in models using prompts (#1616)
dbuades Dec 20, 2024
2024338
1.25.3
invalid-email-address Dec 20, 2024
272adb1
fix: override existing results (#1617)
Samoed Dec 22, 2024
bd782d6
1.25.4
invalid-email-address Dec 22, 2024
e1b74f2
add MSMARCO eval split in MTEB English (classic) benchmark (#1620)
KennethEnevoldsen Dec 22, 2024
748033e
fix: GermanDPR Dataset Causes Cross-Encoder Failure Due to Unexpected…
KennethEnevoldsen Dec 22, 2024
72a457e
fix: properly add mteb_model_meta to model object (#1623)
KennethEnevoldsen Dec 22, 2024
d8dd96c
1.25.5
invalid-email-address Dec 22, 2024
ef5a068
Feat: Add jasper (#1591)
Samoed Dec 23, 2024
02ae4fa
fix: Update results_to_dataframe to use BenchmarkResults class (#1628)
AlexeyVatolin Dec 24, 2024
e8e1a50
1.25.6
invalid-email-address Dec 24, 2024
1b06601
Speed up test_save_predictions (#1631)
AlexeyVatolin Dec 25, 2024
2de61b1
fix: Correction of discrepancies for gte-Qweb model (#1637)
AlexeyVatolin Dec 29, 2024
eb643a7
1.25.7
invalid-email-address Dec 29, 2024
366b2ce
fix: output_folder for co2 evaluation (#1642)
Muennighoff Dec 30, 2024
815a1b4
1.25.8
invalid-email-address Dec 30, 2024
27eb549
fix: add missing benchmark to benchmarks.py (#1641)
gowitheflow-1998 Dec 30, 2024
e4edb66
1.25.9
invalid-email-address Dec 30, 2024
fa0ed6b
fix: Cast all Model2Vec outputs as floats (#1667)
isaac-chung Jan 1, 2025
80efdb6
1.25.10
invalid-email-address Jan 1, 2025
19cbf64
fix: Update gritlm kwargs (#1643)
Muennighoff Jan 1, 2025
3d2fbf0
1.25.11
invalid-email-address Jan 1, 2025
663653e
fix: Use batch size kwargs for openai APIs (#1668)
KennethEnevoldsen Jan 1, 2025
65ef2a1
1.25.12
invalid-email-address Jan 1, 2025
f426159
fix: Pass trust_remote_code=True to CPM model (#1669)
KennethEnevoldsen Jan 1, 2025
82b6cce
1.25.13
invalid-email-address Jan 1, 2025
f99a178
fix: Updated metadata for CPM (#1670)
KennethEnevoldsen Jan 1, 2025
84f3f41
1.25.14
invalid-email-address Jan 1, 2025
5cfcc77
fix: remove model as a parameter for MulticlassClassification (#1666)
Samoed Jan 1, 2025
82e9949
fix: Use prompts instead of prompt names for voyage (#1665)
Samoed Jan 1, 2025
cf1f2d4
1.25.15
invalid-email-address Jan 1, 2025
343edc4
fix: Update BUCC dataset revision (#1674)
Muennighoff Jan 1, 2025
05c91ed
1.25.16
invalid-email-address Jan 1, 2025
c50f26c
fix: Add warning for non-retrieval tasks when using bm25s (#1678)
isaac-chung Jan 1, 2025
5bf74fc
1.25.17
invalid-email-address Jan 1, 2025
1aa08fd
fix: add check for key error in loader (#1675)
Samoed Jan 2, 2025
c8de079
1.25.18
invalid-email-address Jan 2, 2025
7b1e67b
fix: trust remote code for snowflake-arctic-embed-m-v2.0 (#1682)
Muennighoff Jan 2, 2025
8d3c917
1.25.19
invalid-email-address Jan 2, 2025
f5e6401
fix: nomic tensor return (#1683)
Samoed Jan 2, 2025
525777a
1.25.20
invalid-email-address Jan 2, 2025
ba1f022
feat: add `avsolatorio/NoInstruct-small-Embedding-v0` (#1677)
Samoed Jan 2, 2025
4a496b9
fix: arg name for openbmb/MiniCPM-Embedding (#1691)
Samoed Jan 2, 2025
7dbafab
1.26.0
invalid-email-address Jan 2, 2025
f4de307
fix: add trust_remote_code to Snowflake/snowflake-arctic-embed-m-lon…
Muennighoff Jan 3, 2025
6bfc1f2
fix: add revision for jinaai/jina-embeddings-v2-small-en (#1692)
Samoed Jan 3, 2025
d36498a
1.26.1
invalid-email-address Jan 3, 2025
43d74e1
fix: update model loader to trust remote code (#1697)
isaac-chung Jan 3, 2025
5447b5d
1.26.2
invalid-email-address Jan 3, 2025
808257c
fix: nomic prompts (#1685)
Samoed Jan 3, 2025
cff7ed8
fix: NanoBeir (#1687)
Samoed Jan 3, 2025
c0f4394
1.26.3
invalid-email-address Jan 3, 2025
0753aba
Update RerankingEvaluator.py (#1702)
Muennighoff Jan 4, 2025
6d1d9f4
fix: Register MicroLlama Text Embedding (#1644)
keeeeenw Jan 4, 2025
753d08a
fix: GermanDPR (#1703)
Samoed Jan 4, 2025
4a1c8e6
1.26.4
invalid-email-address Jan 4, 2025
222bb35
Fix: minicpmv2 (#1705)
Samoed Jan 6, 2025
25f4f61
ci: Refresh the v2 leaderboard daily (#1711)
orionw Jan 6, 2025
ab8805c
Fix: typos in adding a model (#1722)
Muennighoff Jan 8, 2025
9bcb52f
fix: rollback BUCC revision (#1706)
Samoed Jan 8, 2025
4dea042
1.26.5
invalid-email-address Jan 8, 2025
8702815
fix: Added zero shot tag to benchmark (#1710)
x-tabdeveloping Jan 8, 2025
18cefab
1.26.6
invalid-email-address Jan 8, 2025
7e16fa2
feat: reduce logging for load_results()
KennethEnevoldsen Jan 8, 2025
2ae00a2
1.27.0
invalid-email-address Jan 8, 2025
95f143a
feat: Add nomic modern bert (#1684)
Samoed Jan 9, 2025
f5962c6
fix: allow kwargs in init for RerankingWrapper (#1676)
Samoed Jan 9, 2025
3c68ea6
1.28.0
invalid-email-address Jan 9, 2025
752d2b8
Fixed result loading on leaderboard (#1739)
x-tabdeveloping Jan 9, 2025
8d033f3
test: Add script to test model loading below n_parameters threshold (…
isaac-chung Jan 9, 2025
9eff8ca
fix: Leaderboard Speedup (#1745)
x-tabdeveloping Jan 10, 2025
348b93d
1.28.1
invalid-email-address Jan 10, 2025
76bb070
fix: Fixed task_type aggregation on leaderboard (#1746)
x-tabdeveloping Jan 10, 2025
a4975fe
1.28.2
invalid-email-address Jan 10, 2025
407e205
fix: Fixed definition of zero-shot in ModelMeta (#1747)
x-tabdeveloping Jan 10, 2025
edd9d7f
1.28.3
invalid-email-address Jan 10, 2025
3fe9264
fix: fixes implementation of similarity() (#1748)
sam-hey Jan 10, 2025
75d78c1
1.28.4
invalid-email-address Jan 10, 2025
972463e
fix: Leaderboard: `K` instead of `M` (#1761)
KennethEnevoldsen Jan 11, 2025
8bc80aa
other: add script for leaderboard compare (#1758)
Samoed Jan 11, 2025
cc27c78
1.28.5
invalid-email-address Jan 11, 2025
3f093c8
fix: added annotations for training data (#1742)
KennethEnevoldsen Jan 11, 2025
c3b46b7
1.28.6
invalid-email-address Jan 11, 2025
0c5c3a5
fix: update max tokens for OpenAI (#1772)
Samoed Jan 12, 2025
71dbd61
ci: skip AfriSentiLID for now (#1785)
isaac-chung Jan 13, 2025
bad27a6
1.28.7
invalid-email-address Jan 13, 2025
9b117a8
ci: fix model loading test (#1775)
isaac-chung Jan 13, 2025
4a70e5d
feat: Update task filtering, fixing bug which included cross-lingual …
KennethEnevoldsen Jan 13, 2025
15a6812
1.29.0
invalid-email-address Jan 13, 2025
3ba7e22
fix: Added C-MTEB (#1786)
x-tabdeveloping Jan 13, 2025
48370c7
1.29.1
invalid-email-address Jan 13, 2025
e9e9118
docs: Add contact to MMTEB benchmarks (#1796)
isaac-chung Jan 14, 2025
94103e6
fix: loading pre 11 (#1798)
Samoed Jan 14, 2025
b6fb5b8
1.29.2
invalid-email-address Jan 14, 2025
a202884
fix: allow to load no revision available (#1801)
Samoed Jan 14, 2025
bcb2cd9
1.29.3
invalid-email-address Jan 14, 2025
0acc166
fix: Zero shot and aggregation on Leaderboard (#1810)
x-tabdeveloping Jan 15, 2025
3f5ee82
fix: Added `ModelMeta` for BGE, GTE Chinese and multilingual models (…
x-tabdeveloping Jan 15, 2025
217dabe
1.29.4
invalid-email-address Jan 15, 2025
c4ee9fe
fix: Add additional contacts (#1817)
KennethEnevoldsen Jan 15, 2025
e3a3df8
Update points table
github-actions[bot] Jan 15, 2025
186cc23
1.29.5
invalid-email-address Jan 15, 2025
748955c
fix: Added more Chinese models' `ModelMeta` (#1814)
x-tabdeveloping Jan 15, 2025
950f050
1.29.6
invalid-email-address Jan 15, 2025
60c4980
Add model inf-retriever-v1 (#1744)
SamuelYang1 Jan 15, 2025
d7a7791
ci: only return 1 model_name per file (#1818)
isaac-chung Jan 16, 2025
4ac59bc
fix: add bge-m3 `ModelMeta` (#1821)
Samoed Jan 16, 2025
9733d85
1.29.7
invalid-email-address Jan 16, 2025
74b495c
fix: Added Chinese Stella models (#1824)
x-tabdeveloping Jan 17, 2025
96420a2
fix: bm25s (#1827)
sam-hey Jan 17, 2025
3b2d074
fix: Added way more training dataset annotations (#1765)
KennethEnevoldsen Jan 17, 2025
9823529
fix: Added Misc Chinese models (#1819)
x-tabdeveloping Jan 17, 2025
b4d0eaa
1.29.8
invalid-email-address Jan 17, 2025
96f639b
fix: Fixed eval split for MultilingualSentiment in C-MTEB (#1804)
x-tabdeveloping Jan 17, 2025
762f729
1.29.9
invalid-email-address Jan 17, 2025
8be6b2e
fix: subsets to run (#1830)
Samoed Jan 20, 2025
0a83e38
fix: Remove default params, `public_training_data` and `memory usage`…
Samoed Jan 20, 2025
46f6abc
1.29.10
invalid-email-address Jan 20, 2025
a7a8144
fix: Add reported annotation and re-added public_training_data (#1846)
KennethEnevoldsen Jan 21, 2025
2fac8ba
1.29.11
invalid-email-address Jan 21, 2025
a8cc887
fix: Leaderboard Refinements (#1849)
x-tabdeveloping Jan 21, 2025
afd3c77
1.29.12
invalid-email-address Jan 21, 2025
889d6df
Merge branch 'main' into mieb_with_main
isaac-chung Jan 21, 2025
c72db2c
rest of the merge conflicts
isaac-chung Jan 21, 2025
7b067bc
fix merge conflicts
isaac-chung Jan 21, 2025
1a376e1
fill in model meta defaults
isaac-chung Jan 21, 2025
13aafd8
fix ModeMeta modalities
isaac-chung Jan 21, 2025
b3d3702
fix metadata pydantic errors;
isaac-chung Jan 21, 2025
b10c062
assert model.model instead since it is a wrapper
isaac-chung Jan 21, 2025
fe33061
fix: Fixed leaderboard search bar (#1852)
x-tabdeveloping Jan 22, 2025
2f8cfae
1.29.13
invalid-email-address Jan 22, 2025
4bd7328
fix: Hotfixed public_training_data type annotation (#1857)
x-tabdeveloping Jan 22, 2025
4985da9
fix: Fix zeta alpha mistral (#1736)
Samoed Jan 22, 2025
12ed9c5
Add more annotations (#1833)
Samoed Jan 22, 2025
fde446d
1.29.14
invalid-email-address Jan 22, 2025
692bd26
fix: Adding missing model meta (#1856)
x-tabdeveloping Jan 22, 2025
6ac4798
fix Encoder class
isaac-chung Jan 22, 2025
14e18e8
Merge branch 'main' into mieb_with_main
isaac-chung Jan 22, 2025
16 changes: 16 additions & 0 deletions .github/workflows/leaderboard_refresh.yaml
@@ -0,0 +1,16 @@
name: Daily Space Rebuild
on:
  schedule:
    # Runs at midnight Pacific Time (8 AM UTC)
    - cron: '0 8 * * *'
  workflow_dispatch: # Allows manual triggering

jobs:
  rebuild:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger Factory Rebuild
        run: |
          curl -X POST \
            "https://huggingface.co/api/spaces/mteb/leaderboard_2_demo/restart?factory=true" \
            -H "Authorization: Bearer ${{ secrets.HF_TOKEN }}"
24 changes: 24 additions & 0 deletions .github/workflows/model_loading.yml
@@ -0,0 +1,24 @@
name: Model Loading

on:
  pull_request:
    paths:
      - 'mteb/models/**.py'

jobs:
  extract-and-run:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'

      - name: Install dependencies and run tests
        run: |
          make model-load-test BASE_BRANCH=${{ github.event.pull_request.base.ref }}
6 changes: 5 additions & 1 deletion .gitignore
@@ -143,4 +143,8 @@ sb.ipynb
tests/create_meta/model_card.md

# removed results from mteb repo, they are now available at: https://github.com/embeddings-benchmark/results
results/
results/
uv.lock

# model loading tests
model_names.txt
9 changes: 8 additions & 1 deletion Makefile
@@ -35,4 +35,11 @@ pr:
build-docs:
	@echo "--- 📚 Building documentation ---"
	# since we do not have a documentation site, this just builds tables for the .md files
	python docs/create_tasks_table.py
	python docs/create_tasks_table.py


model-load-test:
	@echo "--- 🚀 Running model load test ---"
	pip install ".[dev, speedtask, pylate,gritlm,xformers,model2vec]"
	python scripts/extract_model_names.py $(BASE_BRANCH) --return_one_model_name_per_file
	python tests/test_models/model_loading.py --model_name_file scripts/model_names.txt
78 changes: 72 additions & 6 deletions README.md
@@ -46,17 +46,15 @@ from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"
# or directly from huggingface:
# model_name = "sentence-transformers/all-MiniLM-L6-v2"

model = SentenceTransformer(model_name)
model = mteb.get_model(model_name) # if the model is not implemented in MTEB, this is equivalent to SentenceTransformer(model_name)
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```

<details>
<summary> Running SentneceTransformermer model with prompts </summary>
<summary> Running SentenceTransformer model with prompts </summary>

Prompts can be passed to the SentenceTransformer model using the `prompts` parameter. The following code shows how to use prompts with SentenceTransformer:
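A minimal sketch of this pattern (the prompt names and strings below are illustrative placeholders, not the repository's exact example):

```python
import mteb
from sentence_transformers import SentenceTransformer

# The prompt names/texts are placeholders; adapt them to your model's training setup.
model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    prompts={
        "query": "Represent this query for retrieving relevant passages: ",
        "passage": "Represent this passage for retrieval: ",
    },
)
tasks = mteb.get_tasks(tasks=["NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```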

@@ -164,7 +162,7 @@ For instance to select the 56 English datasets that form the "Overall MTEB Engli

```python
import mteb
benchmark = mteb.get_benchmark("MTEB(eng)")
benchmark = mteb.get_benchmark("MTEB(eng, classic)")
evaluation = mteb.MTEB(tasks=benchmark)
```

@@ -211,6 +209,21 @@ Note that the public leaderboard uses the test splits for all datasets except MS

</details>


<details>
<summary> Selecting evaluation subset </summary>

### Selecting evaluation subset
You can evaluate only on selected subsets. For example, if you want to evaluate only the `subset_name_to_run` subset of all tasks, do the following:

```python
evaluation.run(model, eval_subsets=["subset_name_to_run"])
```

Monolingual tasks have a single `default` subset; other tasks have subsets that are specific to the dataset.
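If you are unsure which subsets a task defines, you can inspect its metadata; a small illustration (the task name is arbitrary):

```python
import mteb

task = mteb.get_tasks(tasks=["AmazonReviewsClassification"])[0]
# For multilingual/cross-lingual tasks, eval_langs is a dict keyed by subset name;
# monolingual tasks expose a plain language list and a single "default" subset.
print(task.metadata.eval_langs)
```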

</details>

<details>
<summary> Using a custom model </summary>

@@ -220,7 +233,10 @@ Note that the public leaderboard uses the test splits for all datasets except MS
Models should implement the following interface, with an `encode` function that takes a list of sentences as input and returns a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.
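For reference, a complete minimal sketch of such an encoder is shown below (the keyword arguments mirror the current `mteb` encoder interface, but treat the details as indicative rather than definitive):

```python
import mteb
import numpy as np
from mteb.encoder_interface import PromptType


class MinimalModel:
    """A sketch of a custom encoder; it returns random vectors purely for illustration."""

    def encode(
        self,
        sentences: list[str],
        *,
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        # Replace this with a real forward pass that produces one vector per sentence.
        return np.random.default_rng(42).normal(size=(len(sentences), 384))


model = MinimalModel()
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model)
```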

```python
import mteb
from mteb.encoder_interface import PromptType
import numpy as np


class CustomModel:
def encode(
@@ -244,7 +260,7 @@ class CustomModel:
pass

model = CustomModel()
tasks = mteb.get_task("Banking77Classification")
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = MTEB(tasks=tasks)
evaluation.run(model)
```
@@ -313,6 +329,34 @@ evaluation.run(
)
```

</details>

<details>
<summary> Late Interaction (ColBERT) </summary>

### Using Late Interaction models for retrieval

```python
from mteb import MTEB
import mteb


colbert = mteb.get_model("colbert-ir/colbertv2.0")
tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])

eval_splits = ["test"]

evaluation = MTEB(tasks=tasks)

evaluation.run(
    colbert,
    eval_splits=eval_splits,
    corpus_chunk_size=500,
)
```
This implementation employs the MaxSim operation to compute the similarity between sentences. While MaxSim provides high-quality results, it processes a larger number of embeddings, potentially leading to increased resource usage. To manage resource consumption, consider lowering the `corpus_chunk_size` parameter.
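For intuition, a toy illustration of the MaxSim operation (not the library's internal implementation): each query token is matched to its most similar document token, and the maxima are summed.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query-token embeddings, 8 dimensions each
D = rng.normal(size=(12, 8))  # 12 document-token embeddings

sim = Q @ D.T                         # pairwise token similarities, shape (4, 12)
maxsim_score = sim.max(axis=1).sum()  # best document token per query token, summed
print(maxsim_score)
```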


</details>

<details>
@@ -378,6 +422,28 @@ results = mteb.load_results(models=models, tasks=tasks)
df = results_to_dataframe(results)
```

</details>


<details>
<summary> Annotate Contamination in the training data of a model </summary>

### Annotate Contamination

Have you found contamination in the training data of a model? Please let us know, either by opening an issue or, ideally, by submitting a PR annotating the training datasets of the model:

```py
model_w_contamination = ModelMeta(
name = "model-with-contamination"
...
training_datasets: {"ArguAna": # name of dataset within MTEB
["test"]} # the splits that have been trained on
...
)
```


</details>

<details>
7 changes: 3 additions & 4 deletions docs/adding_a_dataset.md
@@ -37,15 +37,14 @@ class SciDocsReranking(AbsTaskReranking):
dataset={
"path": "mteb/scidocs-reranking",
"revision": "d3c5e1fc0b855ab6097bf1cda04dd73947d7caab",
}
},
date=("2000-01-01", "2020-12-31"), # best guess
domains=["Academic", "Non-fiction", "Domains"],
task_subtypes=["Scientific Reranking"],
license="cc-by-4.0",
annotations_creators="derived",
dialect=[],
sample_creation="found",
descriptive_stats={"n_samples": {"test": 19599}, "avg_character_length": {"test": 69.0}},
bibtex_citation="""
@inproceedings{cohan-etal-2020-specter,
title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers",
@@ -73,7 +72,7 @@ class SciDocsReranking(AbsTaskReranking):

# testing the task with a model:
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation = MTEB(tasks=[SciDocsReranking()])
evaluation.run(model)
```

@@ -109,7 +108,7 @@ class VGClustering(AbsTaskClustering):
dialect=[],
text_creation="found",
bibtex_citation= ... # removed for brevity
)
)

def dataset_transform(self):
splits = self.description["eval_splits"]
31 changes: 27 additions & 4 deletions docs/adding_a_model.md
@@ -14,8 +14,7 @@ model = mteb.get_model("sentence-transformers/paraphrase-multilingual-MiniLM-L12

tasks = mteb.get_tasks(...) # get specific tasks
# or
from mteb.benchmarks import MTEB_MAIN_EN
tasks = MTEB_MAIN_EN # or use a specific benchmark
tasks = mteb.get_benchmark("MTEB(eng, classic)") # or use a specific benchmark

evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
@@ -29,29 +28,49 @@ mteb run -m {model_name} -t {task_names}

These will save the results in a folder called `results/{model_name}/{model_revision}`.

2. **Push Results to the Leaderboard**

To add results to the public leaderboard, you can push your results to the [results repository](https://github.com/embeddings-benchmark/results) via a PR. Once merged, they will appear on the leaderboard after a day.

3. (Optional) **Add results to the model card:**

`mteb` implements a CLI for adding results to the model card:

```bash
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md
```

If readme of model exists:
To add the content to the public model simply copy the content of the `model_card.md` file to the top of a `README.md` file of your model on the Hub. See [here](https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/main/README.md) for an example.

If the readme already exists:

```bash
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md
```

Note that running the model on many tasks may lead to a very large README front matter.

4. **Wait for a refresh of the leaderboard:**

The leaderboard [automatically refreshes daily](https://github.com/embeddings-benchmark/leaderboard/commits/main/) so once submitted you only need to wait for the automatic refresh. You can find the workflows for the leaderboard refresh [here](https://github.com/embeddings-benchmark/leaderboard/tree/main/.github/workflows). If you experience issues with the leaderboard please create an [issue](https://github.com/embeddings-benchmark/mteb/issues).

**Notes:**
- We remove models with scores that cannot be reproduced, so please ensure that your model is accessible and scores can be reproduced.
- An alternative way of submitting to the leaderboard is by opening a PR with your results in the [results repository](https://github.com/embeddings-benchmark/results) and checking that they are displayed correctly by [locally running the leaderboard](https://github.com/embeddings-benchmark/leaderboard?tab=readme-ov-file#developer-setup).

- ##### Using Prompts with Sentence Transformers

@@ -65,4 +84,8 @@ The leaderboard [automatically refreshes daily](https://github.com/embeddings-be

###### Instantiating the Model with Prompts

If you are unable to directly add the prompts in the model configuration, you can instantiate the model using the `sentence_transformers_loader` and pass `prompts` as an argument. For more details, see the `mteb/models/bge_models.py` file.
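As an illustration only, a hypothetical sketch of this pattern is given below; the import path and keyword names are assumptions, so check `mteb/models/bge_models.py` for the actual usage:

```python
from functools import partial

# Assumed import path; the loader lives in the mteb.models package.
from mteb.models.sentence_transformer_wrapper import sentence_transformers_loader

# Hypothetical example: attach prompts at load time instead of in the model config.
bge_loader = partial(
    sentence_transformers_loader,
    model_name="BAAI/bge-small-en-v1.5",
    prompts={"query": "Represent this sentence for searching relevant passages: "},
)
model = bge_loader()  # the resulting wrapper can be passed to mteb.MTEB(...).run(model)
```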