Add new and complete version of FSD50K multi-label audio classification task #2285

Merged
Samoed merged 5 commits into embeddings-benchmark:maeb from anime-sh:fsd50k_hf_upload
Mar 8, 2025

Conversation

@RahulSChand

All existing HuggingFace datasets for FSD50K were broken: they either contained corrupt data or lacked complete train and test splits. We downloaded the original data from FreeSound and uploaded it to HF. Tested with wav2vec (results in the next comment).

Code Quality

  • [✅ ] Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • [✅ ] Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • [❌ ] New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

All existing HuggingFace datasets for FSD50K were broken: they either contained corrupt data or lacked complete train and test splits. We downloaded the original data from FreeSound and uploaded it to HF. Tested with wav2vec (results in the next comment).

  • [ ✅] I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • [✅ ] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • [ ✅] If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • [✅ ] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [ ✅] Run tests locally to make sure nothing is broken using make test.
  • [✅ ] Run the formatter to format the code using make lint.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

@RahulSChand RahulSChand added the maeb Audio extension label Mar 8, 2025
@RahulSChand RahulSChand self-assigned this Mar 8, 2025
@RahulSChand
Author

Test results using the wav2vec model on the multi-label classification task:

[image: wav2vec test results]

@Samoed Samoed merged commit 2188585 into embeddings-benchmark:maeb Mar 8, 2025
8 checks passed
@diffunity
Collaborator

diffunity commented Mar 14, 2025

I have one question regarding this dataset.

I noticed that your labels are in a comma-separated string format (huggingface link). However, judging from the mteb multi-label classification tasks, it seems the labels should be in list[int] format. I think this format mismatch may cause problems.

For example,

label_counter = defaultdict(int)
for i in idxs:
    if any((label_counter[label] < samples_per_label) for label in y[i]):

My understanding is that label_counter should be counting the unique label values, but when running this task it ends up counting the individual characters of the label strings. An instance of label_counter from a run: defaultdict(<class 'int'>, {'W': 3, 'i': 19, 'n': 29, 'd': 18, '_': 21, 's': 26, 't': 15, 'r': 14, 'u': 19, 'm': 15, 'e': 16, 'a': 15, 'w': 2, 'o': 27, ',': 15, 'M': 4, 'c': 10, 'l': 5, 'K': 2, 'k': 1, 'D': 5, 'h': 5, 'T': 4, 'R': 1, 'B': 1, 'p': 2, 'g': 2, 'P': 1, 'y': 2, 'b': 1, '(': 1, ')': 1, 'H': 1, 'S': 1})
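
The character-level keys follow from how Python iterates over strings. A minimal sketch of the effect, assuming each label entry is a single comma-separated string; the sample values below are made up for illustration and are not taken from the dataset:

```python
from collections import defaultdict

# Hypothetical label entries in the comma-separated string format.
y = ["Wind,Rain", "Music"]

# Iterating over a string yields its characters, so the counter
# ends up keyed by single characters (including the comma).
char_counter = defaultdict(int)
for labels in y:
    for label in labels:
        char_counter[label] += 1

# Splitting each string first yields whole label names instead.
label_counter = defaultdict(int)
for labels in y:
    for label in labels.split(","):
        label_counter[label] += 1

print(dict(char_counter))
print(dict(label_counter))
```

In the first loop the comma itself becomes a key, which matches the ',' entry in the counter shown above.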

The same happens in

test_audio = eval_split[self.audio_column_name]
binarizer = MultiLabelBinarizer()
y_test = binarizer.fit_transform(eval_split[self.label_column_name])

Unless this was intended, I think the labels should be mapped to a list of their unique IDs.
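
One possible way to build such a mapping is sketched below using only the standard library. The helper name `encode_labels` and the sample rows are hypothetical, and this is not necessarily how the follow-up fix does it:

```python
from __future__ import annotations


def encode_labels(rows: list[str]) -> tuple[list[list[int]], dict[str, int]]:
    """Split comma-separated label strings and map each label to an integer ID."""
    # Turn "A,B" into ["A", "B"], dropping empty fragments and stray whitespace.
    split_rows = [
        [name.strip() for name in row.split(",") if name.strip()] for row in rows
    ]
    # Sort the vocabulary so the IDs are deterministic across runs.
    vocab = sorted({name for names in split_rows for name in names})
    label2id = {name: idx for idx, name in enumerate(vocab)}
    encoded = [[label2id[name] for name in names] for names in split_rows]
    return encoded, label2id


# Hypothetical rows, for illustration only.
rows = ["Wind,Rain", "Music", "Rain"]
encoded, label2id = encode_labels(rows)
print(label2id)
print(encoded)
```

A list of lists like `encoded`, with one list of IDs per sample, is the shape that scikit-learn's `MultiLabelBinarizer.fit_transform` expects, so each label becomes one column rather than each character.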

@Samoed
Member

Samoed commented Mar 14, 2025

I think the labels column should additionally be split, @RahulSChand

@RahulSChand
Author

I think the labels column should additionally be split, @RahulSChand

Yes, that makes sense. I will open a PR with the fix.

@anime-sh

#2369 should fix this
