Add new and complete version of FSD50K multi-label audio classification task #2285
|
I have one question regarding this dataset. I noticed that your labels are in comma-separated string format (huggingface link). However, judging from the For example, mteb/mteb/abstasks/Audio/AbsTaskAudioMultilabelClassification.py Lines 258 to 260 in ef30e3d My understanding is that the Same happens in mteb/mteb/abstasks/Audio/AbsTaskAudioMultilabelClassification.py Lines 213 to 215 in ef30e3d Unless this was intended, I think the labels should be mapped to a list of their unique IDs. |
|
I think the labels column should additionally be split, @RahulSChand |
Yes, makes sense. I will open a PR with the fix. |
|
#2369 should fix this |
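
The label issue discussed in this thread can be illustrated with a minimal sketch. It is not the code from the fix PR; the helper names (`split_labels`, `build_label_index`, `encode_rows`) and the example label strings are hypothetical, and it assumes the downstream code expects one sequence of labels per example. Iterating over a raw Python string yields single characters, so a comma-separated string that is not split first could end up treated as a set of characters rather than a set of labels.

```python
from typing import Dict, List, Tuple


def split_labels(label_string: str) -> List[str]:
    """Split a comma-separated label string into clean label names."""
    return [part.strip() for part in label_string.split(",") if part.strip()]


def build_label_index(rows: List[str]) -> Dict[str, int]:
    """Assign an integer ID to every unique label (sorted for determinism)."""
    unique = sorted({label for row in rows for label in split_labels(row)})
    return {label: idx for idx, label in enumerate(unique)}


def encode_rows(
    rows: List[str], label_to_id: Dict[str, int]
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return per-row lists of label IDs and the matching multi-hot vectors."""
    id_lists = [
        sorted(label_to_id[label] for label in split_labels(row)) for row in rows
    ]
    multi_hot = []
    for row_ids in id_lists:
        vector = [0] * len(label_to_id)
        for idx in row_ids:
            vector[idx] = 1  # mark each label present in this example
        multi_hot.append(vector)
    return id_lists, multi_hot


# Hypothetical label strings, for illustration only.
rows = ["Bark,Dog,Animal", "Music,Guitar", "Dog,Animal"]
label_to_id = build_label_index(rows)
id_lists, multi_hot = encode_rows(rows, label_to_id)

print(label_to_id)
print(id_lists)
print(multi_hot)

# What goes wrong without splitting: a string iterates character by character.
print(sorted(set("Dog,Animal")))
```

The same splitting step could be applied to a dataset column before the labels are binarized; whether IDs or multi-hot vectors are the better target depends on what the task's evaluation code expects.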

All existing HuggingFace datasets for FSD50K were broken: they either contained corrupt data or lacked a complete train+test split. We downloaded the original data from FreeSound and uploaded it to HF. Tested with wav2vec (results in the next comment).
Code Quality
make lint to maintain consistent style.
Documentation
Testing
make test-with-coverage.
make test or make test-with-coverage to ensure no existing functionality is broken.
Adding datasets checklist
Reason for dataset addition: ...
mteb -m {model_name} -t {task_name} command.
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
self.stratified_subsampling() under dataset_transform()
make test.
make lint.
Adding a model checklist
mteb.get_model(model_name, revision) and mteb.get_model_meta(model_name, revision)