
20 MAEB Datasets #2527

Closed

AdnanElAssadi56 wants to merge 13 commits into embeddings-benchmark:maeb from AdnanElAssadi56:maeb

Conversation

@AdnanElAssadi56
Contributor

I have added and adapted 20 datasets to MAEB.

The newly added datasets:

Classification Datasets

  • BirdCLEFClassification()
  • MInDS14Classification()
  • TUTAcousticScenesClassification()
  • VoxPopuliAccentID()
  • VoxPopuliGenderID()
  • VoxPopuliLanguageID()
  • IEMOCAPEmotionClassification()
  • IEMOCAPGenderClassification()
  • SpeechCommandsClassification()
  • AmbientAcousticContextClassification()

Clustering Datasets

  • ESC50Clustering()
  • TUTAcousticScenesClustering()
  • AmbientAcousticContextClustering()
  • CREMA_DClustering()
  • GTZANGenreClustering()
  • VoxCelebClustering()
  • VoxPopuliAccentClustering()
  • VoxPopuliGenderClustering()
  • IEMOCAPEmotionClustering()
  • IEMOCAPGenderClustering()

@AdnanElAssadi56
Contributor Author

Hey @Samoed

While running the current version of the MAEB code from previous commits on my local machine, I encountered the following issues:

  1. Directory Name Error:
    A folder in the image directory is mistakenly named ZeroshotClassication instead of the correct name ZeroShotClassification.

  2. Model Metadata Evaluation Error:
    When evaluating the datasets with one of the CLAP models, the following error is observed:

    "Failed to extract metadata from model: 'ClapZeroShotWrapper' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended."

Because of the above, I couldn't get the actual evaluation results to confirm the datasets, but I have verified that the metadata adheres to the TaskMetadata class that is present in the code.

Please take a look and confirm when you have the time.

@Samoed
Member

Samoed commented Apr 9, 2025

  1. I think you can rename it in another PR
  2. How did you run it?

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Thanks for the PR. I made a few overall suggestions below

date=("2020-01-01", "2020-12-31"), # Paper publication date
domains=["Spoken", "Speech"],
task_subtypes=["Environment Sound Classification"],
license="not specified", # As specified in dataset card
Contributor

Suggested change
license="not specified", # As specified in dataset card
license="not specified", # Not specified in dataset card

?


audio_column_name: str = "audio"
label_column_name: str = "label"
samples_per_label: int = 300 # Placeholder because value varies
Contributor

This is not a descriptive stat. It states how many samples it should sample from each category to fit the classifier.

300 seems pretty high.

Contributor

Also a general observation

series = {MobileHCI '20}
}""",
descriptive_stats={
"n_samples": {"train": 70254}, # As mentioned in dataset card
Contributor

This seems too large (we will spend a lot of time downloading). I would downsample the dataset to 100 samples per label. Then I would also add the test splits (50 samples per label?)

These parameters can be tested using a couple of models to figure out a good threshold.
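The per-label downsampling the reviewer describes could be sketched roughly as follows (a minimal sketch with toy data; `downsample_per_label`, the field names, and the 100-sample threshold are illustrative, not code from the PR):

```python
import random
from collections import defaultdict

def downsample_per_label(examples, label_key="label", n_per_label=100, seed=42):
    """Keep at most n_per_label examples per class (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    kept = []
    for _, group in sorted(by_label.items(), key=lambda kv: str(kv[0])):
        rng.shuffle(group)  # random subset rather than the first N rows
        kept.extend(group[:n_per_label])
    return kept

# Toy example: 3 classes with uneven counts, capped at 100 each
data = [{"audio": f"clip_{i}.wav", "label": i % 3} for i in range(350)]
subset = downsample_per_label(data, n_per_label=100)
```

The same cap (e.g. 50 per label) could then be applied separately to a held-out test split, as suggested.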

Contributor

This seems like a general observation throughout

Comment on lines +41 to +43
descriptive_stats={
"n_samples": {"train": 5000}, # Approximate after subsampling
},
Contributor

but how much is being downloaded? I would just re-upload it in the correct format.

main_score="accuracy",
date=("2025-01-01", "2025-12-31"), # Competition year
domains=["Spoken", "Speech"],
task_subtypes=["Environment Sound Classification"],
Contributor


Suggested change
task_subtypes=["Environment Sound Classification"],
task_subtypes=["Species Classification"],

Comment on lines +21 to +24
eval_langs=[
"eng-Latn",
"spa-Latn",
], # Both English and Spanish names are provided
Contributor

This makes no sense given the description

samples_per_label: int = 50 # Approximate placeholder because value varies
is_cross_validation: bool = False

def dataset_transform(self):
Contributor

you can remove this on re-upload

Comment on lines +19 to +21
eval_langs=[
"all"
], # Evaluation supported for all language configurations (the 14 languages)
Contributor

This is invalid. You should structure it as follows:

Suggested change
eval_langs=[
"all"
], # Evaluation supported for all language configurations (the 14 languages)
eval_langs={"fr-FR": ["fra-Latn"], ...}
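The structure the suggestion points at is a mapping from dataset subset name to a list of ISO 639-3 + script codes; a minimal sketch (the subset names below are illustrative examples, not the full list of 14 MInDS-14 configurations):

```python
# Sketch of a multilingual eval_langs mapping: subset name -> language codes.
# Only a few example entries are shown; the actual task would list every subset.
eval_langs = {
    "en-US": ["eng-Latn"],
    "fr-FR": ["fra-Latn"],
    "de-DE": ["deu-Latn"],
    "es-ES": ["spa-Latn"],
}
```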

main_score="accuracy",
date=("2018-04-11", "2018-04-11"), # v0.02 release date
domains=["Speech"],
task_subtypes=["Spoken Digit Classification"],
Contributor

Doesn't match the description

from mteb.abstasks.TaskMetadata import TaskMetadata


class IEMOCAPEmotionClustering(AbsTaskAudioClustering):
Contributor

Some of these datasets seem like duplicates of the classification datasets.

I generally don't think we want to have duplicates. We should consider if a task is best suited for clustering or classification.

A general guideline is:

  1. If we expect the property to be generally represented in the embedding space (domain, thematic content, topic of conversation), then it is a clustering task
  2. If we expect the labels to be extractable from the embeddings (but not generally represented), then I would put it in Classification, e.g., intent classification

Contributor

Feel free to test this using a model if you are unsure. I would imagine that clustering performance might be very low for, e.g., intent classification.

@AdnanElAssadi56
Contributor Author

Hi @KennethEnevoldsen,

Thanks for your suggestions; I've taken them into account. I've downsampled the datasets that were easier to process and fixed other issues.

BirdCLEF took significantly long to process in a Colab notebook (to get the right number of samples per label for each species). Thus, I've decided to put that on hold. Its current size is around 28k, which seems on the order of magnitude of some existing datasets, so I left it as is for now. The same applies to Speech Commands, which is also a well-known dataset in its current form without downsampling.

Regarding the clustering tasks, you're right: there's some duplication. I initially included those because I noticed some classification datasets were being repurposed for clustering, so I assumed that was the intended direction. That said, I’ve now fixed the current metadata in the clustering datasets, and we can drop the ones that perform poorly.

For some reason, I'm currently running into errors with the audio models when trying to test the datasets. Could you let me know exactly which model you're using for evaluation, and if you're running it in a specific way?

Once we get the models running, we can test the datasets and, if we observe poor performance, drop the weaker ones from the clustering task to avoid redundancy.

Please tell me if you have anything more to add!

@KennethEnevoldsen
Contributor

Thus, I've decided to put that on hold. Its current size is around 28k, which seems on the order of magnitude of some existing datasets, so I left it as is for now

If you can't process it in a Google Colab, then I would imagine that it severely limits the practical usability of the benchmark. I would def. downsample. Same goes for Speech Commands (we can keep the test and validation set the same), but there is no need to download 84,848 samples when 2k could do.

That said, I’ve now fixed the current metadata in the clustering datasets, and we can drop the ones that perform poorly.

We need to do this validation before the merge.

For some reason, I'm currently running into errors with the audio models when trying to test the datasets. Could you let me know exactly which model you're using for evaluation, and if you're running it in a specific way?

I would use two familiar models that are currently implemented (I would choose models on the smaller side). I don't have a strong preference, but otherwise it might be worth reaching out to the model team on Slack?

I would run it as follows:

import mteb

task = mteb.get_task("NAME")
runner = mteb.MTEB(tasks=[task])
model = mteb.get_model("MODELNAME")

results = runner.run(model)

@AdnanElAssadi56
Contributor Author

If you can't process it in a Google Colab, then I would imagine that it severely limits the practical usability of the benchmark. I would def. downsample. Same goes for Speech Commands (we can keep the test and validation set the same), but there is no need to download 84,848 samples when 2k could do.

I've down-sampled the datasets.

We need to do this validation before the merge.

The audio models I have tried are still giving me errors. I am not sure if the problem is with the current version of MAEB or with the models I tested (I tried 3). Mainly, the errors are about dimensionality differences and the number of workers. If there is a responsible team on Slack, I don't mind reaching out.

@KennethEnevoldsen
Contributor

@AdnanElAssadi56 will you create (or link) an issue for the models that don't run?

@AdnanElAssadi56
Contributor Author

AdnanElAssadi56 commented Apr 19, 2025

Here's the data I obtained for the Clustering Tasks.
(Performance might be slightly lower than expected/actual since I truncated the audio samples to the first 5 seconds to speed up processing.)
This was mainly a quick sanity check; I just wanted to confirm that the results weren't trivial.
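The 5-second truncation mentioned above could be sketched as follows (a minimal sketch; `truncate_audio` and the 16 kHz toy clip are illustrative assumptions, not the actual evaluation code):

```python
def truncate_audio(waveform, sampling_rate, max_seconds=5.0):
    """Keep only the first max_seconds of a waveform (hypothetical helper).

    waveform is a sequence of samples; sampling_rate is in Hz.
    """
    max_samples = int(sampling_rate * max_seconds)
    return waveform[:max_samples]

# 10 seconds of fake 16 kHz audio, truncated to the first 5 seconds
clip = [0.0] * (16_000 * 10)
short = truncate_audio(clip, sampling_rate=16_000)
```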

{
  "dataset_revision": "360c858462b79492c6b09d5855ec4d59c87497c6",
  "task_name": "AmbientAcousticContextClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.163042,
        "nmi": 0.163042,
        "ari": 0.043129,
        "cluster_accuracy": 0.166318,
        "main_score": 0.166318,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 31.59774661064148,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "5efdda59d0d185bfe17ada9b54d233349d0e0168",
  "task_name": "GTZANGenreClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.190086,
        "nmi": 0.190086,
        "ari": 0.096,
        "cluster_accuracy": 0.292,
        "main_score": 0.292,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 97.10436391830444,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "e3e2a63ffff66b9a9735524551e3818e96af03ee",
  "task_name": "ESC50Clustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.302579,
        "nmi": 0.302579,
        "ari": 0.043225,
        "cluster_accuracy": 0.1515,
        "main_score": 0.1515,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 34.19173741340637,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "9f1696a135a65ce997d898d4121c952269a822ca",
  "task_name": "IEMOCAPEmotionClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.014234,
        "nmi": 0.014234,
        "ari": 0.006539,
        "cluster_accuracy": 0.154996,
        "main_score": 0.154996,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 185.41184282302856,
  "kg_co2_emissions": null
}


{
  "dataset_revision": "9f1696a135a65ce997d898d4121c952269a822ca",
  "task_name": "IEMOCAPGenderClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.000329,
        "nmi": 0.000329,
        "ari": 0.000571,
        "cluster_accuracy": 0.513199,
        "main_score": 0.513199,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 187.12454199790955,
  "kg_co2_emissions": null
}


{
  "dataset_revision": "554ad4367e98b7c6f4d4d9756dc6bbdf345e042e",
  "task_name": "VoxCelebClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "train": [
      {
        "v_measure": 0.002534,
        "nmi": 0.002534,
        "ari": -0.002025,
        "cluster_accuracy": 0.399591,
        "main_score": 0.399591,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 104.42584371566772,
  "kg_co2_emissions": null
}


{
  "dataset_revision": "719aaef8225945c0d80b277de6c79aa42ab053d5",
  "task_name": "VoxPopuliAccentClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "test": [
      {
        "v_measure": 0.030532,
        "nmi": 0.030532,
        "ari": -0.000755,
        "cluster_accuracy": 0.118209,
        "main_score": 0.118209,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 80.55528330802917,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "719aaef8225945c0d80b277de6c79aa42ab053d5",
  "task_name": "VoxPopuliGenderClustering",
  "mteb_version": "1.36.36",
  "scores": {
    "validation": [
      {
        "v_measure": 0.001049,
        "nmi": 0.001049,
        "ari": 0.004027,
        "cluster_accuracy": 0.543069,
        "main_score": 0.543069,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ],
    "test": [
      {
        "v_measure": 4.1e-05,
        "nmi": 4.1e-05,
        "ari": -0.002091,
        "cluster_accuracy": 0.567318,
        "main_score": 0.567318,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 164.94520568847656,
  "kg_co2_emissions": null
}

@KennethEnevoldsen
Contributor

Some of these seem quite low ("0.000329", for example), which could suggest that the task is not possible, misformatted, or similar (we don't want that either)

@AdnanElAssadi56
Contributor Author

Some of these seem quite low ("0.000329", for example), which could suggest that the task is not possible, misformatted, or similar (we don't want that either)

I am not sure about the reason; IEMOCAP seems like a clean dataset. Do we try to inspect what is happening, or do we just drop them from clustering?

@AdnanElAssadi56
Contributor Author

@KennethEnevoldsen
I think, because we are tight on time, let's just drop whichever clustering datasets seem too far off to you, so we can merge this PR and then move on to the other things we need to finish.

@KennethEnevoldsen
Contributor

Yeah, I agree. I will close this PR, and then we can submit the datasets one at a time (this is how we usually do it, and it prevents issues where one dataset blocks another).

I am not sure about the reason; IEMOCAP seems like a clean dataset. Do we try to inspect what is happening, or do we just drop them from clustering?

My intuition is that the audio doesn't cluster in the ways that these tasks suggest. I would do some investigation into how audio actually clusters (we can already get a hint from the tasks where the v-measure is higher).

To be clear, the goal is to have tasks that can meaningfully differentiate between models. So if a task can't differentiate between two seemingly different models, we either have:

  1. the task is poor
  2. the task is hard and none of the models can solve it

In this case, I am not sure that I would expect audio to cluster according to emotion (though I would imagine that it is extractable, i.e., suited for classification)

@AdnanElAssadi56
Contributor Author

Do we just add all the classification datasets now, and then open separate PRs for whatever we choose from clustering? I am not sure opening ~15 PRs at this point is a good idea.

@KennethEnevoldsen
Contributor

How so? I think it will be much faster than grouping them together in one PR.
