
20 MAEB Datasets #2527

Closed

AdnanElAssadi56 wants to merge 13 commits into embeddings-benchmark:maeb from AdnanElAssadi56:maeb

Conversation

@AdnanElAssadi56
Contributor

I have added and adapted 20 datasets to MAEB.

The newly added datasets:

Classification Datasets

  • BirdCLEFClassification()
  • MInDS14Classification()
  • TUTAcousticScenesClassification()
  • VoxPopuliAccentID()
  • VoxPopuliGenderID()
  • VoxPopuliLanguageID()
  • IEMOCAPEmotionClassification()
  • IEMOCAPGenderClassification()
  • SpeechCommandsClassification()
  • AmbientAcousticContextClassification()

Clustering Datasets

  • ESC50Clustering()
  • TUTAcousticScenesClustering()
  • AmbientAcousticContextClustering()
  • CREMA_DClustering()
  • GTZANGenreClustering()
  • VoxCelebClustering()
  • VoxPopuliAccentClustering()
  • VoxPopuliGenderClustering()
  • IEMOCAPEmotionClustering()
  • IEMOCAPGenderClustering()

@AdnanElAssadi56
Contributor Author

Hey @Samoed

While running the current version of the MAEB code from previous commits on my local machine, I encountered the following issues:

  1. Directory Name Error:
    A folder in the image directory is mistakenly named ZeroshotClassication instead of the correct name ZeroShotClassification.

  2. Model Metadata Evaluation Error:
    When evaluating the datasets with one of the CLAP models, the following error is observed:

    "Failed to extract metadata from model: 'ClapZeroShotWrapper' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended."

Because of the above, I couldn't get the actual evaluation results to confirm the datasets, but I have verified that the metadata adheres to the TaskMetadata class that is present in the code.

Please take a look and confirm when you have the time.

@Samoed
Member

Samoed commented Apr 9, 2025

  1. I think you can rename it in another PR
  2. How did you run it?

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Thanks for the PR. I made a few overall suggestions below

date=("2020-01-01", "2020-12-31"), # Paper publication date
domains=["Spoken", "Speech"],
task_subtypes=["Environment Sound Classification"],
license="not specified", # As specified in dataset card
Contributor

Suggested change
license="not specified", # As specified in dataset card
license="not specified", # Not specified in dataset card

?


audio_column_name: str = "audio"
label_column_name: str = "label"
samples_per_label: int = 300 # Placeholder because value varies
Contributor

This is not a descriptive stat. It states how many samples it should sample from each category to fit the classifier.

300 seems pretty high.

Contributor

Also a general observation

series = {MobileHCI '20}
}""",
descriptive_stats={
"n_samples": {"train": 70254}, # As mentioned in dataset card
Contributor

This seems too large (we will spend a lot of time downloading). I would downsample the dataset to 100 samples per label. Then I would also add the test splits (50 samples per label?)

These parameters can be tested using a couple of models to figure out a good threshold.
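The per-label downsampling the reviewer describes could be sketched roughly as follows (a minimal sketch with toy data; `downsample_per_label`, the field names, and the 100-sample threshold are illustrative, not code from the PR):

```python
import random
from collections import defaultdict

def downsample_per_label(examples, label_key="label", n_per_label=100, seed=42):
    """Keep at most n_per_label examples per class (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    kept = []
    for _, group in sorted(by_label.items(), key=lambda kv: str(kv[0])):
        rng.shuffle(group)  # random subset rather than the first N rows
        kept.extend(group[:n_per_label])
    return kept

# Toy example: 3 classes with uneven counts, capped at 100 each
data = [{"audio": f"clip_{i}.wav", "label": i % 3} for i in range(350)]
subset = downsample_per_label(data, n_per_label=100)
```

The same cap (e.g. 50 per label) could then be applied separately to a held-out test split, as suggested.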

Contributor

This seems like a general observation throughout

Comment on lines +41 to +43
descriptive_stats={
"n_samples": {"train": 5000}, # Approximate after subsampling
},
Contributor

but how much is being downloaded? I would just re-upload it in the correct format.

main_score="accuracy",
date=("2025-01-01", "2025-12-31"), # Competition year
domains=["Spoken", "Speech"],
task_subtypes=["Environment Sound Classification"],
Contributor


Suggested change
task_subtypes=["Environment Sound Classification"],
task_subtypes=["Species Classification"],

Comment on lines +21 to +24
eval_langs=[
"eng-Latn",
"spa-Latn",
], # Both English and Spanish names are provided
Contributor

This makes no sense given the description

samples_per_label: int = 50 # Approximate placeholder because value varies
is_cross_validation: bool = False

def dataset_transform(self):
Contributor

you can remove this on re-upload

Comment on lines +19 to +21
eval_langs=[
"all"
], # Evaluation supported for all language configurations (the 14 languages)
Contributor

This is invalid. You should structure it as follows:

Suggested change
eval_langs=[
"all"
], # Evaluation supported for all language configurations (the 14 languages)
eval_langs={"fr-FR": ["fra-Latn"], ...}
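The structure the suggestion points at is a mapping from dataset subset name to a list of ISO 639-3 + script codes; a minimal sketch (the subset names below are illustrative examples, not the full list of 14 MInDS-14 configurations):

```python
# Sketch of a multilingual eval_langs mapping: subset name -> language codes.
# Only a few example entries are shown; the actual task would list every subset.
eval_langs = {
    "en-US": ["eng-Latn"],
    "fr-FR": ["fra-Latn"],
    "de-DE": ["deu-Latn"],
    "es-ES": ["spa-Latn"],
}
```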

main_score="accuracy",
date=("2018-04-11", "2018-04-11"), # v0.02 release date
domains=["Speech"],
task_subtypes=["Spoken Digit Classification"],
Contributor

Doesn't match the description

from mteb.abstasks.TaskMetadata import TaskMetadata


class IEMOCAPEmotionClustering(AbsTaskAudioClustering):
Contributor

Some of these datasets seem like duplicates of the classification datasets.

I generally don't think we want to have duplicates. We should consider if a task is best suited for clustering or classification.

A general guideline is:

  1. If we expect the property to be generally represented in the embedding space (domain, thematic content, topic of conversation), then it is a clustering task
  2. If we expect the labels to be extractable from the embeddings (but not generally represented), then I would put it in Classification, e.g., intent classification

Contributor

Feel free to test this using a model if you are unsure. I would imagine that clustering performance might be very low for, e.g., intent classification.

@AdnanElAssadi56
Contributor Author

Hi @KennethEnevoldsen,

Thanks for your suggestions; I've taken them into account. I've downsampled the datasets that were easier to process and fixed other issues.

BirdCLEF took significantly long to process in a Colab notebook (to get the right number of samples per label for each species). Thus, I've decided to put that on hold. Its current size is around 28k, which seems on the order of magnitude of some existing datasets, so I left it as is for now. The same applies to Speech Commands, which is also a well-known dataset in its current form without downsampling.

Regarding the clustering tasks, you're right: there's some duplication. I initially included those because I noticed some classification datasets were being repurposed for clustering, so I assumed that was the intended direction. That said, I’ve now fixed the current metadata in the clustering datasets, and we can drop the ones that perform poorly.

For some reason, I'm currently running into errors with the audio models when trying to test the datasets. Could you let me know exactly which model you're using for evaluation, and if you're running it in a specific way?

Once we get the models running, we can test the datasets and, if we observe poor performance, drop the weaker ones from the clustering task to avoid redundancy.

Please tell me if you have anything more to add!

@KennethEnevoldsen
Contributor

Thus, I've decided to put that on hold. Its current size is around 28k, which seems on the order of magnitude of some existing datasets, so I left it as is for now

If you can't process it in a Google Colab, then I would imagine that it severely limits the practical usability of the benchmark. I would def. downsample. Same goes for Speech Commands (we can keep the test and validation set the same), but there is no need to download 84,848 samples when 2k could do.

That said, I’ve now fixed the current metadata in the clustering datasets, and we can drop the ones that perform poorly.

We need to do this validation before the merge.

For some reason, I'm currently running into errors with the audio models when trying to test the datasets. Could you let me know exactly which model you're using for evaluation, and if you're running it in a specific way?

I would use two familiar models that are currently implemented (I would choose models on the smaller side). I don't have a strong preference, but otherwise it might be worth reaching out to the model team on Slack?

I would run it as follows:

import mteb

task = mteb.get_task("NAME")
runner = mteb.MTEB(tasks=[task])
model = mteb.get_model("MODELNAME")

results = runner.run(model)

@AdnanElAssadi56
Contributor Author

If you can't process it in a Google Colab, then I would imagine that it severely limits the practical usability of the benchmark. I would def. downsample. Same goes for Speech Commands (we can keep the test and validation set the same), but there is no need to download 84,848 samples when 2k could do.

I've down-sampled the datasets.

We need to do this validation before the merge.

The audio models I have tried are still giving me errors. I am not sure if the problem is with the current version of MAEB or with the models I tested (I tried 3). Mainly, the errors are about dimensionality differences and the number of workers. If there is a responsible team on Slack, I don't mind reaching out.

@KennethEnevoldsen
Contributor

@AdnanElAssadi56 will you create (or link) an issue for the models that don't run?

@AdnanElAssadi56
Contributor Author

AdnanElAssadi56 commented Apr 19, 2025

Here's the data I obtained for the Clustering Tasks.
(Performance might be slightly lower than expected/actual since I truncated the audio samples to the first 5 seconds to speed up processing.)
This was mainly a quick sanity check; I just wanted to confirm that the results weren't trivial.
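The 5-second truncation mentioned above could be sketched as follows (a minimal sketch; `truncate_audio` and the 16 kHz toy clip are illustrative assumptions, not the actual evaluation code):

```python
def truncate_audio(waveform, sampling_rate, max_seconds=5.0):
    """Keep only the first max_seconds of a waveform (hypothetical helper).

    waveform is a sequence of samples; sampling_rate is in Hz.
    """
    max_samples = int(sampling_rate * max_seconds)
    return waveform[:max_samples]

# 10 seconds of fake 16 kHz audio, truncated to the first 5 seconds
clip = [0.0] * (16_000 * 10)
short = truncate_audio(clip, sampling_rate=16_000)
```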

{
  "dataset_revision": "360c858462b79492c6b09d5855ec4d59c87497c6",
  "task_name": "AmbientAcousticContextClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.163042,
        "nmi": 0.163042,
        "ari": 0.043129,
        "cluster_accuracy": 0.166318,
        "main_score": 0.166318,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 31.59774661064148,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "5efdda59d0d185bfe17ada9b54d233349d0e0168",
  "task_name": "GTZANGenreClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.190086,
        "nmi": 0.190086,
        "ari": 0.096,
        "cluster_accuracy": 0.292,
        "main_score": 0.292,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 97.10436391830444,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "e3e2a63ffff66b9a9735524551e3818e96af03ee",
  "task_name": "ESC50Clustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.302579,
        "nmi": 0.302579,
        "ari": 0.043225,
        "cluster_accuracy": 0.1515,
        "main_score": 0.1515,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 34.19173741340637,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "9f1696a135a65ce997d898d4121c952269a822ca",
  "task_name": "IEMOCAPEmotionClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.014234,
        "nmi": 0.014234,
        "ari": 0.006539,
        "cluster_accuracy": 0.154996,
        "main_score": 0.154996,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 185.41184282302856,
  "kg_co2_emissions": null
}


{
  "dataset_revision": "9f1696a135a65ce997d898d4121c952269a822ca",
  "task_name": "IEMOCAPGenderClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.000329,
        "nmi": 0.000329,
        "ari": 0.000571,
        "cluster_accuracy": 0.513199,
        "main_score": 0.513199,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 187.12454199790955,
  "kg_co2_emissions": null
}


{
  "dataset_revision": "554ad4367e98b7c6f4d4d9756dc6bbdf345e042e",
  "task_name": "VoxCelebClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.002534,
        "nmi": 0.002534,
        "ari": -0.002025,
        "cluster_accuracy": 0.399591,
        "main_score": 0.399591,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 104.42584371566772,
  "kg_co2_emissions": null
}


{
  "dataset_revision": "719aaef8225945c0d80b277de6c79aa42ab053d5",
  "task_name": "VoxPopuliAccentClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "test": [
      {
        "v_measure": 0.030532,
        "nmi": 0.030532,
        "ari": -0.000755,
        "cluster_accuracy": 0.118209,
        "main_score": 0.118209,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 80.55528330802917,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "719aaef8225945c0d80b277de6c79aa42ab053d5",
  "task_name": "VoxPopuliGenderClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "validation": [
      {
        "v_measure": 0.001049,
        "nmi": 0.001049,
        "ari": 0.004027,
        "cluster_accuracy": 0.543069,
        "main_score": 0.543069,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ],
    "test": [
      {
        "v_measure": 4.1e-05,
        "nmi": 4.1e-05,
        "ari": -0.002091,
        "cluster_accuracy": 0.567318,
        "main_score": 0.567318,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 164.94520568847656,
  "kg_co2_emissions": null
}

@KennethEnevoldsen
Contributor

Some of these seem quite low ("0.000329", for example), which could suggest that the task is not possible, misformatted, or similar (we don't want that either)

@AdnanElAssadi56
Contributor Author

Some of these seem quite low ("0.000329", for example), which could suggest that the task is not possible, misformatted, or similar (we don't want that either)

I am not sure about the reason; IEMOCAP seems like a clean dataset. Do we try to inspect what is happening, or do we just drop them from clustering?

@AdnanElAssadi56
Contributor Author

@KennethEnevoldsen
I think, because we are tight on time, let's just drop whichever clustering datasets seem too far off to you, so we can merge this PR and then move on to the other things we need to finish.

@KennethEnevoldsen
Contributor

Yeah, I agree. I will close this PR, and then we can submit the datasets one at a time (this is how we usually do it, and it prevents issues where one dataset blocks another).

I am not sure about the reason; IEMOCAP seems like a clean dataset. Do we try to inspect what is happening, or do we just drop them from clustering?

My intuition is that the audio doesn't cluster in the ways that these tasks suggest. I would do some investigation into how audio actually clusters (we can already get a hint from the tasks where the v-measure is higher).

To be clear, the goal is to have tasks that can meaningfully differentiate between models. So if a task can't differentiate between two seemingly different models, we either have:

  1. the task is poor
  2. the task is hard and none of the models can solve it

In this case, I am not sure that I would expect audio to cluster according to emotion (though I would imagine that it is extractable, i.e., suited for classification)

@AdnanElAssadi56
Contributor Author

Do we just add all the classification datasets now, and then open separate PRs for whatever we choose from clustering? I am not sure opening ~15 PRs at this point is a good idea.

@KennethEnevoldsen
Contributor

How so? I think it will be much faster than grouping them together in one PR.
