Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
143 commits
Select commit Hold shift + click to select a range
32b7af8
Started the following:
sufen-f Feb 19, 2025
8eff2c6
Minor changes and linted files. #2093
sufen-f Feb 19, 2025
53a2e36
Minor changes and linted files. #2093
sufen-f Feb 20, 2025
ed93f2b
Minor changes and linted files. #2093
sufen-f Feb 20, 2025
fbab033
Refs #2068: Initial Implementation of audio-text retrieval abstask an…
imadtyx Feb 20, 2025
d39e187
Added MockAudioClustering task + MockAudioEncoder for testcase
alisartazkhan Feb 20, 2025
bcca37f
MockAudioClustering + MockAudioEncoder (#2093)
Feb 20, 2025
2a238ed
Added wav2vec model wrapper
alisartazkhan Feb 22, 2025
7816974
Added subTask with small sample of dataset for testing
Feb 22, 2025
07f53b1
Added four w2v variants
alisartazkhan Feb 23, 2025
882af38
Update wav2vec_models.py
alisartazkhan Feb 23, 2025
daeada0
Added wav2vec (5), wavlm (7), and whisper (5) models
alisartazkhan Feb 23, 2025
c1ebf2a
Added revisions from HF to wav2vec models, added silhouette score, DB…
sufen-f Feb 23, 2025
716deed
Update mteb/models/wavlm_models.py
alisartazkhan Feb 23, 2025
ce1bee9
setting up colab
sufen-f Feb 24, 2025
4cf7e6f
Merge remote-tracking branch 'origin/maeb' into maeb
sufen-f Feb 24, 2025
545b938
added a2a
Feb 24, 2025
ed978fa
PCA + hidden layer + shuffling
Feb 24, 2025
1616ba9
New task: emotion clustering
Feb 24, 2025
ac14d16
Added qwen2 model
alisartazkhan Feb 26, 2025
1302477
Added Wav2Vec model, voice clustering task, VoxCeleb dataset subset (…
sufen-f Feb 28, 2025
4f23fdf
Merge branch 'maeb' into maeb
sufen-f Feb 28, 2025
ee10191
Revert "Maeb - added voice clustering task, wav2vec model and VoxCele…
sufen-f Mar 1, 2025
f1449c0
Revert "Revert "Maeb - added voice clustering task, wav2vec model and…
sufen-f Mar 1, 2025
d731d40
Revert "Revert "Revert "Maeb - added voice clustering task, wav2vec m…
sufen-f Mar 1, 2025
a0de4fc
Add Audio (Multi Label) Classification Abstask, Baseline Audio model,…
anime-sh Mar 4, 2025
0620c58
Add ESC50 and zero-shot classification (#2133)
RahulSChand Mar 5, 2025
6d9eca3
Add unfused clap model for zero-shot (#2269)
RahulSChand Mar 6, 2025
2188585
Add new and complete version of FSD50K multi-label audio classificati…
RahulSChand Mar 8, 2025
bdefb14
added large, music and speech clap models (#2284)
RahulSChand Mar 8, 2025
2e5dc67
add AbsTaskAudioClassification, ESC50 & GunshotTriangulation datasets…
silky1708 Mar 10, 2025
bf9fe16
Add NSynth dataset (#2306)
silky1708 Mar 10, 2025
a94ea50
Add urbansound8k for zero-shot (#2292)
RahulSChand Mar 11, 2025
52a88ae
Add Emotion classification Ravdess dataset (#2320)
RahulSChand Mar 11, 2025
cd07f24
[MAEB] main merge (#2341)
isaac-chung Mar 13, 2025
ef30e3d
adding GTZAN Genre dataset (#2307)
silky1708 Mar 13, 2025
5cf3840
Adding Beijing Opera dataset (#2356)
silky1708 Mar 14, 2025
368e720
update TaskMetadata from mteb:maeb
silky1708 Mar 13, 2025
25136ba
make pr
silky1708 Mar 13, 2025
79e06fe
update ruff to 0.9.7; make lint
silky1708 Mar 13, 2025
f85627f
update TaskMetadata from mteb:maeb
silky1708 Mar 13, 2025
0cf07f4
update TaskMetadata
silky1708 Mar 14, 2025
7460a13
add Mridingham datasets
silky1708 Mar 14, 2025
d5caae6
rm comment
silky1708 Mar 14, 2025
187d7bc
Adding Libricount dataset (#2361)
silky1708 Mar 16, 2025
3bae6b6
Adding Crema-D Dataset for emotion classification [HEAR] (#2368)
silky1708 Mar 16, 2025
307aa57
Adding FSDD dataset (Free Spoken Digit Dataset) (#2371)
silky1708 Mar 16, 2025
6ad0bc2
Add VoxCelebSA, SpokenQAforIC, VehicleSoundClustering from Dynamic-SU…
diffunity Mar 17, 2025
230064a
fix FSD-50K Task Metadata, Label handling and add stratified subsampl…
anime-sh Mar 18, 2025
89ab596
Add music clustering dataset (#2232)
mina-parham Mar 26, 2025
f3a0403
[MAEB] merge main -> maeb (#2471)
isaac-chung Apr 1, 2025
5af86e5
Create AbsTask and Evaluator for audio pair classification task (#2457)
switchpiggy Apr 4, 2025
01c462d
Add Language, Gender, and Age classifcation tasks based on common-la…
anime-sh Apr 4, 2025
5acab7f
Merge main into MAEB (#2488)
isaac-chung Apr 4, 2025
31925c5
added wavlm models (#2472)
alisartazkhan Apr 4, 2025
7e57e9d
Adding SIB-FLEURS (#2357)
diffunity Apr 5, 2025
991a0fc
update wavlm models
alisartazkhan Apr 22, 2025
5fc6e4d
update wavlm models
alisartazkhan Apr 23, 2025
14f6b41
Add files via upload
mnasser3 Apr 29, 2025
9eaca21
Update whisper_models.py license format
mnasser3 Apr 29, 2025
040d5c6
Updated wavlm and whisper models to fit maeb structure (#2572)
alisartazkhan May 2, 2025
aba957c
Delete mteb/abstasks/Image/AbsTaskZeroshotClassification.py
isaac-chung May 3, 2025
2fada5b
[MAEB] Merge in main 20250503 (#2635)
isaac-chung May 3, 2025
4c53823
Added SpeechCommands Dataset (Subset) (#2645)
AdnanElAssadi56 May 6, 2025
804be31
Added ESC50 Clustering Dataset (#2652)
AdnanElAssadi56 May 7, 2025
e1bc62f
Added Qwen2-7b (#2660)
alisartazkhan May 8, 2025
41b4c45
Added the IEMOCAP Datasets (#2640)
AdnanElAssadi56 May 9, 2025
4cd81ce
Add sew-d and unispeech models
sufen-f May 17, 2025
1163e62
Add sew-d and unispeech models
sufen-f May 17, 2025
cef8d57
Merge branch 'model_development' into maeb
sufen-f May 17, 2025
2d25266
Revert "Merge branch 'model_development' into maeb"
sufen-f May 17, 2025
0fb74db
Reapply "Merge branch 'model_development' into maeb"
sufen-f May 17, 2025
a2e6cf2
Revert to 41b4c451d48ca1234b508a5972662dc0c25573fa
sufen-f May 17, 2025
390b867
Add sew-d and unispeech models #2693 #2694 (#2701)
sufen-f May 18, 2025
6f15209
Added Minds14 Dataset (#2644)
AdnanElAssadi56 May 19, 2025
17197e0
Added Hubert Models (#2689)
AdnanElAssadi56 May 23, 2025
ee8e26f
Added AST Model (#2691)
AdnanElAssadi56 May 23, 2025
95a03f7
Added Data2Vec Models (#2690)
AdnanElAssadi56 May 23, 2025
645255b
Adding BirdSet dataset
imadtyx Jun 1, 2025
e067d88
Update __init__.py to include BirdSet dataset(s)
imadtyx Jun 1, 2025
1afb4ac
MAEB: Encodec Model (#2754)
AdnanElAssadi56 Jun 2, 2025
d4b9abd
MAEB: MMS Models (#2750)
AdnanElAssadi56 Jun 2, 2025
cf51d8f
MAEB: Seamlessm4t Model (V2) (#2751)
AdnanElAssadi56 Jun 2, 2025
439ee37
[MAEB] CNN14 Model (PANNs) (#2757)
AdnanElAssadi56 Jun 2, 2025
6e434aa
Added TutAcoustic Scenes Dataset (#2647)
AdnanElAssadi56 Jun 3, 2025
88436e3
MAEB: M-CTC-T Model (#2753)
AdnanElAssadi56 Jun 3, 2025
c5d8484
Added GTZAN Clustering Dataset (#2653)
AdnanElAssadi56 Jun 3, 2025
1af8eb1
Added AmbientAcousticContext Dataset (#2642)
AdnanElAssadi56 Jun 3, 2025
69d67e4
Added Crema_d Dataset (#2651)
AdnanElAssadi56 Jun 3, 2025
cd7c6e9
Added VoxCeleb Clustering Dataset (#2654)
AdnanElAssadi56 Jun 3, 2025
eb173b9
Audio Reranking Abstask+ Evaluator + Mini/Dummy AudioCaps Subset (#2744)
AdnanElAssadi56 Jun 5, 2025
31f38f2
Added 5 datasets for audio pair classification (#2463)
kkaitlyn111 Jun 8, 2025
ece46da
Adds SpokeN-100-English (#2342)
mina-parham Jun 8, 2025
89563e1
Adds VocalSound dataset (#2337)
mina-parham Jun 8, 2025
9114dc6
Added Birdclef Subset Dataset (#2641)
AdnanElAssadi56 Jun 13, 2025
c383316
Merge branch 'maeb' of github.com:embeddings-benchmark/mteb into maeb
isaac-chung Jun 14, 2025
a81eec3
lint
isaac-chung Jun 14, 2025
e990850
Added VoxPopuli Datasets (#2648)
AdnanElAssadi56 Jun 20, 2025
6bc4c5a
added SpeechCommand dataset and Keyword spotting task (#2329)
RahulSChand Jun 21, 2025
bdbe51f
[MAEB] Merge from main up to 1.38.30 (#2840)
isaac-chung Jun 22, 2025
5510897
Added Yamnet and VGGish models (#2687)
ayush1298 Jun 23, 2025
3c464f9
Add urbansound 8k linear probing (#2845)
isaac-chung Jun 23, 2025
a4842d5
add stratified_subsampling to Audio clustering datasets (#2854)
isaac-chung Jun 28, 2025
1453ad6
Audio Reranking Eval Update + 5 Reranking Datasets (#2849)
AdnanElAssadi56 Jun 28, 2025
73c9d2c
[MAEB] Sync with 1.38.33 (#2883)
isaac-chung Jul 6, 2025
8a8a101
MAEB Classification Datasets Downsampling/Formatting + MTEB UPLOAD (#…
AdnanElAssadi56 Jul 9, 2025
c7b8542
Merge main maeb 07 10 (#2894)
Samoed Jul 10, 2025
74bdc03
merge main
Samoed Jul 10, 2025
8f8577f
SibFluers Dataset Multilingual Extention (#2890)
AdnanElAssadi56 Jul 11, 2025
f1eb63c
Implemented Audio Any2AnyRetrieval + 3 Datasets for A2A, A2T, T2A (#2…
kkaitlyn111 Jul 12, 2025
ab0899c
[MAEB] encode() for audio-only models should raise error (#2914)
isaac-chung Jul 18, 2025
f619034
fix: add missing clap model handling
isaac-chung Jul 18, 2025
4e79b1a
dataset: add Clotho by creating the datasets on the fly (#2915)
isaac-chung Jul 20, 2025
6b37b71
dataset: Add SoundDescs (#2911)
isaac-chung Jul 20, 2025
a19e7b4
Audio Retrieval Dataset: UrbanSound8K (#2920)
AdnanElAssadi56 Jul 21, 2025
698500d
Audio Retrieval Dataset: MACS (#2921)
AdnanElAssadi56 Jul 21, 2025
ca4b73c
SpeechT5 Model (#2901)
AdnanElAssadi56 Jul 21, 2025
6671fcc
MAEB Model MSCLAP (#2902)
AdnanElAssadi56 Jul 21, 2025
dd6a76a
MAEB Model Wav2Clip (#2908)
AdnanElAssadi56 Jul 21, 2025
7e1fb93
Audio Retrieval Dataset: EmoVDB (#2923)
AdnanElAssadi56 Jul 21, 2025
48febd1
MAEB Model MuQ-MuLan (#2909)
AdnanElAssadi56 Jul 21, 2025
7801759
fix encode() in audio models (#2926)
isaac-chung Jul 21, 2025
7a4be45
Audio Retrieval Dataset: HiFiTTS (#2924)
AdnanElAssadi56 Jul 21, 2025
8a01d4e
Audio Retrieval Dataset: MusicCaps (#2918)
AdnanElAssadi56 Jul 21, 2025
53071b3
Audio Retrieval Dataset: CMU-Arctic (#2929)
AdnanElAssadi56 Jul 23, 2025
b087dfe
Audio Models Batch Fix (#2932)
AdnanElAssadi56 Jul 23, 2025
aadd51e
Add AudioSet and AudioSetMini (#2952)
isaac-chung Jul 28, 2025
b875aa2
[MAEB] Fix whisper model audio inference (#2954)
isaac-chung Jul 30, 2025
54561ed
Common voice (#2951)
hepengfe Aug 2, 2025
d841b33
fleurs retrieval tasks (#2976)
hepengfe Aug 4, 2025
069b294
MAEB Model Evaluation Fixes (#2956)
AdnanElAssadi56 Aug 5, 2025
671be23
Fix ClothoA2T modality (#2988)
isaac-chung Aug 5, 2025
49528b6
Revert "MAEB Model Evaluation Fixes" (#2993)
isaac-chung Aug 6, 2025
5ba74fc
Human Subsets Tasks
AdnanElAssadi56 Aug 10, 2025
24ea203
Fixed Multilingual Classification Subset
AdnanElAssadi56 Aug 10, 2025
926acb2
add google embedding variant
AdnanElAssadi56 Sep 23, 2025
e134015
loader fix
AdnanElAssadi56 Sep 23, 2025
2438bb2
license add
AdnanElAssadi56 Sep 23, 2025
5a9667b
validator check
AdnanElAssadi56 Sep 23, 2025
5c97cd4
typo
AdnanElAssadi56 Sep 23, 2025
bd3d7a3
full loader
AdnanElAssadi56 Sep 23, 2025
bcc1bb0
typo
AdnanElAssadi56 Sep 23, 2025
d0b6b10
Merge branch 'main' of https://github.com/embeddings-benchmark/mteb i…
AdnanElAssadi56 Sep 25, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 5 additions & 2 deletions .github/workflows/docs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,9 @@ on:
branches: [main]
pull_request:

permissions:
contents: write

jobs:
create-table-on-pr:
if: github.event_name == 'pull_request'
Expand All @@ -32,8 +35,6 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
with:
token: ${{ secrets.RELEASE }}

- uses: actions/setup-python@v4
with:
Expand All @@ -49,6 +50,8 @@ jobs:
make build-docs

- name: Push table
env:
GITHUB_TOKEN: ${{ github.token }}
run: |
git config --global user.email "github-actions[bot]@users.noreply.github.com"
git config --global user.name "github-actions[bot]"
Expand Down
1 change: 0 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -175,4 +175,3 @@ Some of these amazing publications include (ordered chronologically):
- Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, Kristoffer Laigaard Nielbo. "[The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding](https://arxiv.org/abs/2406.02396)" arXiv 2024
- Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nick Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee. "[ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain](https://arxiv.org/abs/2412.00532)" arXiv 2024
- Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff. "[MIEB: Massive Image Embedding Benchmark](https://arxiv.org/abs/2504.10471)" arXiv 2025

4 changes: 2 additions & 2 deletions docs/adding_a_model.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ model_meta.calculate_memory_usage_mb()
### Adding instruction models

Some models, such as the [E5 models](https://huggingface.co/intfloat/multilingual-e5-large-instruct), use instructions or prompts.
You can directly add the prompts when saving and uploading your model to the Hub. Refer to this [configuration file as an example](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5/blob/3b5a16eaf17e47bd997da998988dce5877a57092/config_sentence_transformers.json).
You can directly add the prompts when saving and uploading your model to the Hub. Refer to this [configuration file as an example](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5/blob/3b5a16eaf17e47bd997da998988dce5877a57092/config_sentence_transformers.json).

However, you can also add these directly to the model configuration:

Expand Down Expand Up @@ -142,4 +142,4 @@ When submitting you models as a PR, please copy and paste the following checklis
- [ ] `mteb.get_model(model_name, revision)` and
- [ ] `mteb.get_model_meta(model_name, revision)`
- [ ] I have tested the implementation works on a representative set of tasks.
- [ ] The model is public, i.e. is available either as an API or the wieght are publicly avaiable to download
- [ ] The model is public, i.e. is available either as an API or the weights are publicly available to download
8 changes: 4 additions & 4 deletions docs/mieb/readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ MIEB intends to extend MTEB and MMTEB to cover image representation learning and

## 🚀 Running MIEB

If you’re already familiar with how MTEB works, then run any benchmark, task, and model the same way!
If you’re already familiar with how MTEB works, then run any benchmark, task, and model the same way!


### Run MIEB in 2 lines via CLI
Expand Down Expand Up @@ -46,14 +46,14 @@ Or select tasks by categories:
tasks = mteb.get_tasks(task_types=["Compositionality"])
```

2. Load a Model:
2. Load a Model:

```python
model_name = "laion/CLIP-ViT-L-14-laion2B-s32B-b82K"
model = mteb.get_model(model_name=model_name)
```

3. Run the Evaluation:
3. Run the Evaluation:

```python
evaluation = mteb.MTEB(tasks=tasks)
Expand All @@ -71,7 +71,7 @@ There are a few ways for anyone to contribute to MIEB:
2. Add a model. This could mean either: a) The model wrapper, e.g. `OpenCLIPWrapper`, already exists, and the effort is solely in adding a filled out `ModelMeta` object, and/or b) Add a new model wrapper.
3. Add a new task type. This means that the existing task types do not cover this new task. An accompanying evaluator should also be implemented.

Let's go through an example.
Let's go through an example.

<details>
<summary> Contribution Example (click to unfold) </summary>
Expand Down
257 changes: 257 additions & 0 deletions mteb/abstasks/Audio/AbsTaskAudioClassification.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,257 @@
from __future__ import annotations

import logging
from collections import defaultdict
from typing import Any

import numpy as np
from sklearn.model_selection import KFold

from mteb.abstasks.TaskMetadata import HFSubset

from ...encoder_interface import AudioEncoder
from ...evaluation.evaluators import AudiologRegClassificationEvaluator
from ..AbsTask import AbsTask, ScoresDict

logger = logging.getLogger(__name__)


class AbsTaskAudioClassification(AbsTask):
    """Abstract class for audio classification tasks.

    Each clip is embedded with an audio encoder; a classifier (logistic
    regression when ``method == "logReg"``) is then trained on a bootstrapped
    subset of the train split and scored on the eval split. Scores are averaged
    over ``n_experiments`` bootstrap runs (or over ``n_splits`` folds when
    ``is_cross_validation`` is set).

    self.load_data() must generate a huggingface dataset with a split matching
    self.metadata_dict["eval_splits"], and assign it to self.dataset. It must
    contain the following columns:
        audio: datasets.Audio
        labels: int  (column name configurable via ``label_column_name``)
    """

    # Dataset column names; concrete tasks may override.
    audio_column_name: str = "audio"
    label_column_name: str = "labels"
    # When True, the dataset has a single split and k-fold cross-validation is
    # performed on it instead of using separate train/test splits.
    is_cross_validation: bool = False
    n_splits: int = 5  # by default: 5-fold cross-validation

    def __init__(
        self,
        method: str = "logReg",
        n_experiments: int | None = None,
        samples_per_label: int | None = None,
        k: int = 3,
        **kwargs,
    ):
        """Initialize the task.

        Args:
            method: Classifier used by the evaluator; only "logReg" is
                currently supported.
            n_experiments: Number of bootstrap experiments. Falls back to the
                task metadata value, then to 5.
            samples_per_label: Training samples drawn per label in each
                experiment. Falls back to the task metadata value, then to 16.
            k: kNN parameter forwarded to the evaluator.
            **kwargs: Forwarded to ``AbsTask.__init__``.
        """
        super().__init__(**kwargs)
        self.method = method

        # Bootstrap parameters: explicit arguments win over task metadata.
        self.n_experiments: int = (  # type: ignore
            n_experiments
            if n_experiments is not None
            else self.metadata_dict.get("n_experiments", 5)
        )
        self.samples_per_label: int = (  # type: ignore
            samples_per_label
            if samples_per_label is not None
            else self.metadata_dict.get("samples_per_label", 16)
        )

        # kNN parameters
        self.k = k

    def _add_main_score(self, scores: dict[HFSubset, ScoresDict]) -> None:
        # Mirror the metric named by the task metadata under "main_score".
        scores["main_score"] = scores[self.metadata.main_score]

    def _calculate_metrics_from_split(
        self, split: str, hf_subset: str | None = None, compute_overall: bool = False
    ):
        # Descriptive statistics are not computed for audio classification.
        pass

    def evaluate(
        self,
        model: AudioEncoder,
        eval_split: str = "test",
        train_split: str = "train",
        *,
        encode_kwargs: dict[str, Any] | None = None,
        **kwargs,
    ) -> dict[HFSubset, ScoresDict]:
        """Evaluate ``model`` on every subset of the task.

        Args:
            model: Audio encoder producing embeddings for the classifier.
            eval_split: Name of the evaluation split.
            train_split: Name of the training split (must equal ``eval_split``
                when ``is_cross_validation`` is True).
            encode_kwargs: Extra keyword arguments forwarded to the model's
                encode call. Defaults to an empty dict (a ``None`` default is
                used instead of ``{}`` to avoid a shared mutable default).
            **kwargs: Forwarded to the evaluator.

        Returns:
            Mapping of HF subset name to its averaged scores.
        """
        if encode_kwargs is None:
            encode_kwargs = {}
        if not self.data_loaded:
            self.load_data()

        scores = {}
        hf_subsets = list(self.dataset) if self.is_multilingual else ["default"]

        for hf_subset in hf_subsets:
            logger.info(
                f"\nTask: {self.metadata.name}, split: {eval_split}, subset: {hf_subset}. Running..."
            )

            # Monolingual datasets may not expose a "default" key; use the
            # dataset itself in that case.
            if hf_subset not in self.dataset and hf_subset == "default":
                ds = self.dataset
            else:
                ds = self.dataset[hf_subset]

            evaluate_fn = (
                self._evaluate_subset_cross_validation
                if self.is_cross_validation
                else self._evaluate_subset
            )
            scores[hf_subset] = evaluate_fn(
                model,
                ds,
                eval_split,
                train_split,
                encode_kwargs=encode_kwargs,
                **kwargs,
            )
            self._add_main_score(scores[hf_subset])

        return scores

    def _evaluate_subset_cross_validation(
        self,
        model: AudioEncoder,
        dataset,
        eval_split: str = "test",
        train_split: str = "train",
        encode_kwargs: dict[str, Any] | None = None,
        **kwargs,
    ) -> ScoresDict:
        """Run ``n_splits``-fold cross-validation on a single split.

        Raises:
            ValueError: If ``train_split`` and ``eval_split`` differ (the
                dataset then has real splits and cross-validation is wrong),
                or if ``self.method`` is unsupported.
        """
        if encode_kwargs is None:
            encode_kwargs = {}
        # Raised explicitly (previously an `assert`) so the check survives
        # running under `python -O`.
        if train_split != eval_split:
            raise ValueError(
                f"Performing {self.n_splits}-fold cross validation, but the dataset has a train (`{train_split}`) and test split (`{eval_split}`)! Set `is_cross_validation` to False, and retry."
            )
        logger.info(
            f"Performing {self.n_splits}-fold cross-validation on the entire dataset!"
        )

        ds = dataset[train_split]
        kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=42)

        params = {"k": self.k, **kwargs}
        scores = []
        # idxs is threaded through the folds so the undersampling shuffle is
        # performed only once, keeping the experiments reproducible.
        test_cache, idxs = None, None

        for train_idx, val_idx in kf.split(range(len(ds))):
            scores_exp, test_cache, idxs = self._run_single_experiment(
                model,
                ds.select(train_idx),
                ds.select(val_idx),
                encode_kwargs,
                params,
                test_cache,
                idxs,
            )
            scores.append(scores_exp)

        return self._average_scores(scores)

    def _evaluate_subset(
        self,
        model: AudioEncoder,
        dataset,
        eval_split: str = "test",
        train_split: str = "train",
        encode_kwargs: dict[str, Any] | None = None,
        **kwargs,
    ) -> ScoresDict:
        """Run ``n_experiments`` bootstrap experiments on train/eval splits.

        Raises:
            ValueError: If ``self.method`` is unsupported.
        """
        if encode_kwargs is None:
            encode_kwargs = {}
        train_ds = dataset[train_split]
        eval_ds = dataset[eval_split]
        params = {"k": self.k, **kwargs}

        scores = []
        # idxs is reused across experiments so the undersampling shuffle is
        # performed only once, keeping the experiments reproducible; the test
        # embedding cache is reused because the eval split never changes.
        test_cache, idxs = None, None
        for i in range(self.n_experiments):
            logger.info(
                "=" * 10 + f" Experiment {i + 1}/{self.n_experiments} " + "=" * 10
            )
            scores_exp, test_cache, idxs = self._run_single_experiment(
                model,
                train_ds,
                eval_ds,
                encode_kwargs,
                params,
                test_cache,
                idxs,
            )
            scores.append(scores_exp)

        return self._average_scores(scores)

    def _run_single_experiment(
        self,
        model: AudioEncoder,
        train_ds,
        eval_ds,
        encode_kwargs: dict[str, Any],
        params: dict[str, Any],
        test_cache,
        idxs,
    ):
        """Bootstrap the train split, run one evaluator pass, and return
        ``(scores, test_cache, idxs)``.

        Shared by the bootstrap and cross-validation paths so the evaluator
        construction is defined in exactly one place.
        """
        # Bootstrap `self.samples_per_label` samples per label.
        undersampled_train, idxs = self._undersample_data(
            train_ds,
            self.label_column_name,
            self.samples_per_label,
            idxs=idxs,
        )

        if self.method == "logReg":
            evaluator = AudiologRegClassificationEvaluator(
                undersampled_train,
                eval_ds,
                self.audio_column_name,
                self.label_column_name,
                task_name=self.metadata.name,
                encode_kwargs=encode_kwargs,
                **params,
            )
        else:
            raise ValueError(f"Method {self.method} not supported")

        scores_exp, test_cache = evaluator(model, test_cache=test_cache)
        return scores_exp, test_cache, idxs

    @staticmethod
    def _average_scores(scores: list[ScoresDict]) -> dict[str, Any]:
        """Average per-experiment metric dicts and attach the raw scores."""
        avg_scores: dict[str, Any] = {
            key: np.mean([s[key] for s in scores]) for key in scores[0]
        }
        avg_scores["scores_per_experiment"] = scores
        return avg_scores

    def _undersample_data(
        self, dataset_split, label_column_name, samples_per_label, idxs=None
    ):
        """Undersample data to have at most ``samples_per_label`` samples of
        each label without loading all audio into memory.

        Args:
            dataset_split: HF dataset split to sample from.
            label_column_name: Name of the label column.
            samples_per_label: Cap on selected samples per label.
            idxs: Shuffled index order from a previous call; passing it back in
                reuses the same order, making repeated sampling reproducible.

        Returns:
            Tuple of (undersampled dataset, shuffled index list).
        """
        if idxs is None:
            idxs = np.arange(len(dataset_split))
            np.random.shuffle(idxs)
        if not isinstance(idxs, list):
            idxs = idxs.tolist()
        label_counter = defaultdict(int)
        selected_indices = []

        # Read only the label column; audio decoding is deferred to `select`.
        labels = dataset_split[label_column_name]
        for i in idxs:
            label = labels[i]
            if label_counter[label] < samples_per_label:
                selected_indices.append(i)
                label_counter[label] += 1

        undersampled_dataset = dataset_split.select(selected_indices)
        return (
            undersampled_dataset,
            idxs,
        )
Loading
Loading