20 MAEB Datasets #2527
Conversation
Hey @Samoed While running the current version of the MAEB code from previous commits on my local machine, I encountered the following issues:
Because of the above, I couldn't get the actual evaluation results to confirm the datasets, but I have verified that the metadata adheres to the `TaskMetadata` schema. Please take a look and confirm when you have the time.
KennethEnevoldsen
left a comment
Thanks for the PR. I made a few overall suggestions below
date=("2020-01-01", "2020-12-31"),  # Paper publication date
domains=["Spoken", "Speech"],
task_subtypes=["Environment Sound Classification"],
license="not specified",  # As specified in dataset card
Suggested change:
- license="not specified",  # As specified in dataset card
+ license="not specified",  # Not specified in dataset card
?
audio_column_name: str = "audio"
label_column_name: str = "label"
samples_per_label: int = 300  # Placeholder because value varies
This is not a descriptive stat. It states how many samples should be drawn from each category to fit the classifier.
300 seems pretty high.
Also a general observation
series = {MobileHCI '20}
}""",
descriptive_stats={
    "n_samples": {"train": 70254},  # As mentioned in dataset card
This seems too large (we will spend a lot of time downloading). I would downsample the dataset to 100 samples per label, and then also add the test splits (50 samples per label?).
These parameters can be tested using a couple of models to figure out a good threshold.
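For what it's worth, a per-label downsample along these lines can be sketched in plain Python. The `rows` structure and the `downsample_per_label` helper below are illustrative, not part of the mteb/MAEB API; in practice this logic would live in the task's `dataset_transform`:

```python
import random
from collections import defaultdict

def downsample_per_label(rows, label_key="label", n_per_label=100, seed=42):
    """Keep at most n_per_label examples per class (illustrative helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sampled = []
    for items in by_label.values():
        rng.shuffle(items)
        sampled.extend(items[:n_per_label])
    rng.shuffle(sampled)  # avoid label-sorted order in the final split
    return sampled

# e.g. 3 classes x 200 examples each -> 100 per class after downsampling
rows = [{"audio": i, "label": i % 3} for i in range(600)]
train_small = downsample_per_label(rows, n_per_label=100)
```

Using a fixed seed keeps the subsample reproducible across re-uploads.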
This seems like a general observation throughout
descriptive_stats={
    "n_samples": {"train": 5000},  # Approximate after subsampling
},
but how much is being downloaded? I would just re-upload it in the correct format.
main_score="accuracy",
date=("2025-01-01", "2025-12-31"),  # Competition year
domains=["Spoken", "Speech"],
task_subtypes=["Environment Sound Classification"],
Suggested change:
- task_subtypes=["Environment Sound Classification"],
+ task_subtypes=["Species Classification"],
eval_langs=[
    "eng-Latn",
    "spa-Latn",
],  # Both English and Spanish names are provided
This makes no sense given the description
samples_per_label: int = 50  # Approximate placeholder because value varies
is_cross_validation: bool = False

def dataset_transform(self):
you can remove this on re-upload
eval_langs=[
    "all"
],  # Evaluation supported for all language configurations (the 14 languages)
This is invalid. You should structure it as follows:
- eval_langs=[
-     "all"
- ],  # Evaluation supported for all language configurations (the 14 languages)
+ eval_langs={"fr-FR": ["fra-Latn"], ...}
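For reference, this multilingual form maps each dataset config name to a list of mteb language codes (ISO 639-3 code plus script). A sketch with two illustrative entries; the actual keys must match the dataset's 14 config names:

```python
# Illustrative entries only; the real mapping must cover all 14 configs
# with the dataset's actual subset names.
eval_langs = {
    "fr-FR": ["fra-Latn"],  # HF config name -> [ISO 639-3 + ISO 15924 script]
    "de-DE": ["deu-Latn"],
}
```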
main_score="accuracy",
date=("2018-04-11", "2018-04-11"),  # v0.02 release date
domains=["Speech"],
task_subtypes=["Spoken Digit Classification"],
Doesn't match the description
from mteb.abstasks.TaskMetadata import TaskMetadata

class IEMOCAPEmotionClustering(AbsTaskAudioClustering):
Some of these datasets seem like duplicates of the classification datasets.
I generally don't think we want to have duplicates. We should consider whether a task is best suited for clustering or classification.
A general guideline:
- If we expect the property to be generally represented in the embedding space (domain, thematic content, topic of conversation), then it is a clustering task.
- If we expect the labels to be extractable from the embeddings (but not generally represented), then I would put it in Classification, e.g. intent classification.
feel free to test this using a model if you are unsure. I would imagine that clustering performance might be very low for e.g. intent classification.
Thanks for your suggestions; I've taken them into account. I've downsampled the datasets that were easier to process and fixed the other issues.

BirdCLEF took a significantly long time to process on a Colab notebook (to get the right number of samples per label for each species), so I've decided to put that on hold. Its current size is around 28k, which is on the order of magnitude of some existing datasets, so I left it as is for now. The same applies to Speech Commands, which is also a well-known dataset in its current form without downsampling.

Regarding the clustering task, you're right, there's some duplication. I initially included those because I noticed some classification datasets were being repurposed for clustering, so I assumed that was the intended direction. That said, I've now fixed the current metadata in the clustering datasets, and we can drop the ones that perform poorly.

For some reason, I'm currently running into errors with the audio models when trying to test the datasets. Could you let me know exactly which model you're using for evaluation, and whether you're running it in a specific way? Once we get the models running, we can test the datasets and, if we observe poor performance, drop the weaker ones from the clustering task to avoid redundancy. Please tell me if you have anything more to add!
If you can't process it in a Google Colab, then I would imagine that severely limits the practical usability of the benchmark. I would definitely downsample. The same goes for Speech Commands (we can keep the test and validation sets the same), but there is no need to download 84,848 samples when 2k would do.
We need to do this validation before the merge.
I would use two familiar models that are currently implemented (I would choose models on the smaller side). I don't have a strong preference, but otherwise it might be worth reaching out to the model team on Slack? I would run it as follows:

task = mteb.get_task("NAME")
runner = mteb.MTEB(tasks=[task])
model = mteb.get_model("MODELNAME")
results = runner.run(model)
I've downsampled the datasets.
The audio models I have tried are still giving me errors. I am not sure whether the problem is in the current version of MAEB or in the models I have tested with (I tried 3). The errors are mainly about dimensionality differences and the number of workers. If there is a responsible team on Slack, I don't mind reaching out.
@AdnanElAssadi56 will you create (or link) an issue on the models that don't run?
Here's the data I obtained for the clustering tasks:
{
"dataset_revision": "360c858462b79492c6b09d5855ec4d59c87497c6",
"task_name": "AmbientAcousticContextClustering",
"mteb_version": "1.36.36",
"scores": {
"train": [
{
"v_measure": 0.163042,
"nmi": 0.163042,
"ari": 0.043129,
"cluster_accuracy": 0.166318,
"main_score": 0.166318,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
]
},
"evaluation_time": 31.59774661064148,
"kg_co2_emissions": null
}
{
"dataset_revision": "5efdda59d0d185bfe17ada9b54d233349d0e0168",
"task_name": "GTZANGenreClustering",
"mteb_version": "1.36.36",
"scores": {
"train": [
{
"v_measure": 0.190086,
"nmi": 0.190086,
"ari": 0.096,
"cluster_accuracy": 0.292,
"main_score": 0.292,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
]
},
"evaluation_time": 97.10436391830444,
"kg_co2_emissions": null
}
{
"dataset_revision": "e3e2a63ffff66b9a9735524551e3818e96af03ee",
"task_name": "ESC50Clustering",
"mteb_version": "1.36.36",
"scores": {
"train": [
{
"v_measure": 0.302579,
"nmi": 0.302579,
"ari": 0.043225,
"cluster_accuracy": 0.1515,
"main_score": 0.1515,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
]
},
"evaluation_time": 34.19173741340637,
"kg_co2_emissions": null
}
{
"dataset_revision": "9f1696a135a65ce997d898d4121c952269a822ca",
"task_name": "IEMOCAPEmotionClustering",
"mteb_version": "1.36.36",
"scores": {
"train": [
{
"v_measure": 0.014234,
"nmi": 0.014234,
"ari": 0.006539,
"cluster_accuracy": 0.154996,
"main_score": 0.154996,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
]
},
"evaluation_time": 185.41184282302856,
"kg_co2_emissions": null
}
{
"dataset_revision": "9f1696a135a65ce997d898d4121c952269a822ca",
"task_name": "IEMOCAPGenderClustering",
"mteb_version": "1.36.36",
"scores": {
"train": [
{
"v_measure": 0.000329,
"nmi": 0.000329,
"ari": 0.000571,
"cluster_accuracy": 0.513199,
"main_score": 0.513199,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
]
},
"evaluation_time": 187.12454199790955,
"kg_co2_emissions": null
}
{
"dataset_revision": "554ad4367e98b7c6f4d4d9756dc6bbdf345e042e",
"task_name": "VoxCelebClustering",
"mteb_version": "1.36.36",
"scores": {
"train": [
{
"v_measure": 0.002534,
"nmi": 0.002534,
"ari": -0.002025,
"cluster_accuracy": 0.399591,
"main_score": 0.399591,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
]
},
"evaluation_time": 104.42584371566772,
"kg_co2_emissions": null
}
{
"dataset_revision": "719aaef8225945c0d80b277de6c79aa42ab053d5",
"task_name": "VoxPopuliAccentClustering",
"mteb_version": "1.36.36",
"scores": {
"test": [
{
"v_measure": 0.030532,
"nmi": 0.030532,
"ari": -0.000755,
"cluster_accuracy": 0.118209,
"main_score": 0.118209,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
]
},
"evaluation_time": 80.55528330802917,
"kg_co2_emissions": null
}
{
"dataset_revision": "719aaef8225945c0d80b277de6c79aa42ab053d5",
"task_name": "VoxPopuliGenderClustering",
"mteb_version": "1.36.36",
"scores": {
"validation": [
{
"v_measure": 0.001049,
"nmi": 0.001049,
"ari": 0.004027,
"cluster_accuracy": 0.543069,
"main_score": 0.543069,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
],
"test": [
{
"v_measure": 4.1e-05,
"nmi": 4.1e-05,
"ari": -0.002091,
"cluster_accuracy": 0.567318,
"main_score": 0.567318,
"hf_subset": "default",
"languages": [
"eng-Latn"
]
}
]
},
"evaluation_time": 164.94520568847656,
"kg_co2_emissions": null
}
Some of these seem quite low, e.g. 0.000329, which could suggest that the task is not possible, misformatted, or similar (we don't want that either).
I am not sure about the reason; IEMOCAP seems like a clean dataset. Do we try to inspect what is happening, or do we just drop them from clustering?
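One cheap sanity check before dropping a task: compare the reported cluster accuracy against the majority-class baseline. For a roughly balanced binary label like gender, a cluster accuracy near 0.5 combined with a near-zero v-measure suggests the embeddings carry essentially no signal for that label. A minimal sketch (the label counts below are illustrative, not the actual IEMOCAP distribution):

```python
def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

# Illustrative near-balanced binary label set
labels = ["m"] * 520 + ["f"] * 480
baseline = majority_baseline(labels)  # 0.52 for these counts
# A cluster_accuracy at or below this baseline means the clustering
# recovers no more label information than a constant guess would.
```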
@KennethEnevoldsen |
Yeah, I agree. I will close this PR and then let us submit the datasets one at a time (this is how we usually do it, and it prevents issues where one dataset blocks another).
My intuition is that the audio doesn't cluster in the ways these tasks suggest. I would do some investigation into how audio actually clusters (we can already get a hint from the ones where the v-measure is higher). To be clear, the goal is to have tasks that can meaningfully differentiate between models, so if a task can't differentiate between two seemingly different models, we either have:
In this case, I am not sure I would expect audio to cluster in accordance with emotion (though I would imagine that it is extractable, i.e. classification).
Do we just add all the classification datasets now, and then open separate PRs for whatever we choose from clustering? I am not sure opening ~15 PRs at this point is a good idea.
How so? I think it will be much faster than grouping them together in one PR |
I have added and adapted 20 datasets to MAEB.
The newly added datasets:

Classification Datasets
- BirdCLEFClassification()
- MInDS14Classification()
- TUTAcousticScenesClassification()
- VoxPopuliAccentID()
- VoxPopuliGenderID()
- VoxPopuliLanguageID()
- IEMOCAPEmotionClassification()
- IEMOCAPGenderClassification()
- SpeechCommandsClassification()
- AmbientAcousticContextClassification()

Clustering Datasets
- ESC50Clustering()
- TUTAcousticScenesClustering()
- AmbientAcousticContextClustering()
- CREMA_DClustering()
- GTZANGenreClustering()
- VoxCelebClustering()
- VoxPopuliAccentClustering()
- VoxPopuliGenderClustering()
- IEMOCAPEmotionClustering()
- IEMOCAPGenderClustering()