feat: Introduce MAEB #3470

Merged: isaac-chung merged 235 commits into main from maeb on Feb 16, 2026

Samoed (Member) commented Oct 22, 2025

Fixes #2072

What: Introduces MAEB (Massive Audio Embedding Benchmark) - a comprehensive benchmark for evaluating audio embedding models

Key additions:

  • New audio task types: Classification, multilabel classification, zero-shot classification, clustering, pair classification, reranking, and audio-text retrieval (A2T/T2A)
  • Audio models: Wav2Vec2, WavLM, Whisper, CLAP, Qwen2Audio, MS-CLAP, and LCO-Embedding models
  • Datasets: 95+ audio tasks including FSD50K, ESC-50, VoxCeleb, AudioCaps, Clotho, Common Voice, FLEURS, and many others
  • Benchmark variants: MAEB (30 tasks), MAEB+ (97 tasks), MAEB(audio-only) (various subsets), and MAEB(audio-text) with lite/extended versions
  • Infrastructure: Audio encoder interface, evaluation pipeline with resampling support, task selection using correlation analysis, and leaderboard integration
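To illustrate what "resampling support" means in an evaluation pipeline, here is a minimal sketch that converts a waveform from one sample rate to another by linear interpolation. This is an assumption for illustration only: the PR description does not say which resampler MAEB uses, and the function name `resample_linear` is hypothetical. Production pipelines normally use a band-limited resampler, because plain interpolation aliases when downsampling.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a mono waveform (list of floats) by linear interpolation.

    Illustrative only: no low-pass filtering is applied, so downsampling
    can alias. `resample_linear` is a hypothetical helper, not an mteb API.
    """
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sample rates must be positive")
    if not samples or orig_sr == target_sr:
        return list(samples)

    # Number of output samples covering the same duration.
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples advanced per output sample
    last = len(samples) - 1

    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # left neighbour, clamped to the signal
        hi = min(lo + 1, last)     # right neighbour, clamped to the signal
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# Halving the rate keeps every second sample of a ramp.
print(resample_linear([0.0, 1.0, 2.0, 3.0], orig_sr=4, target_sr=2))
```

Resampling matters here because audio encoders generally expect a fixed input rate, while datasets are recorded at many different rates.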

Technical improvements:

  • Support for datasets v4+ with torchcodec for audio processing
  • Efficient data collators for batch processing
  • Descriptive statistics for all tasks
  • Task selection methodology using correlation thresholds to reduce redundancy
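The idea behind correlation-threshold task selection can be sketched as follows: if two tasks rank models almost identically, one of them adds little information and can be dropped. The sketch below assumes Pearson correlation over per-model scores and a greedy keep-or-drop pass; the PR description does not specify the coefficient, the threshold, or the procedure, and the task names and scores are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant task carries no ranking signal
    return cov / sqrt(var_x * var_y)


def select_tasks(scores, threshold=0.9):
    """Greedily keep tasks whose |correlation| with every kept task is below threshold.

    `scores` maps task name -> list of scores, one per model, in the same
    model order for every task. The threshold of 0.9 is an assumed default.
    """
    kept = []
    for name, vec in scores.items():
        if all(abs(pearson(vec, scores[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical scores for four models on three tasks.
example = {
    "TaskA": [0.1, 0.2, 0.3, 0.4],
    "TaskB": [0.2, 0.4, 0.6, 0.8],  # perfectly correlated with TaskA
    "TaskC": [0.4, 0.1, 0.3, 0.2],
}
print(select_tasks(example))
```

A greedy pass depends on the order in which tasks are visited, so a real selection procedure would also need a rule for which task of a redundant pair to prefer.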

sufen-f and others added 30 commits February 18, 2025 23:33
- define audio encoder interface
- implement abstask and evaluator for clustering
Created test_maeb_datasets.py to test AbsTask and Evaluator for clustering
…SCAN and agglomerative algorithms into clustering evaluator, added algorithm selector into VoiceGender
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
…2175)

* Added wav2vec model wrapper

* Added four w2v variants

* Update wav2vec_models.py

* Removed run.py test script

* Added subTask with small sample of dataset for testing

* Removed test portion of VoiceGender.py task

* add commit hash and bibtex

* make lint

* update models

* fix circular import

* make VoiceGender discoverable in get_tasks

* add a2a as category for clustering

* specify latest commit hash

* revert linting changes

* Based on feedback for model: updated w2v2 revisions and added torchaudio to .toml file

* Added Bibtex for dataset, set data to be test instead of training, shortened task_subtype

* Changed task from Voice Gender Clustering to Gender Clustering.

* Fixed mock audio clustering tests

* Added dataset metadata

* Linted

* Passed revision into the w2v2 loader

* passed lint check

* Linted

* Update VoiceGender.py

---------

Co-authored-by: Ali Sartaz Khan <alisartazkhan@gmail.com>
Co-authored-by: Ali Sartaz Khan <71156712+alisartazkhan@users.noreply.github.com>
Co-authored-by: mn <mn@Ms-MacBook-Pro.local>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
… VoxCeleb dataset (subset)"" (#2203)

Revert "Revert "Maeb - added voice clustering task, wav2vec model and VoxCele…"

This reverts commit ee10191.
…odel and VoxCeleb dataset (subset)""" (#2207)

Revert "Revert "Revert "Maeb - added voice clustering task, wav2vec model and…"

This reverts commit f1449c0.
… FSD50k Dataset and Task (#2082)

* init audio

* some encoder related changes

* some more abs task defs

Co-authored-by: rahulschand <rahulsc@stanford.edu>

* evaluators and classification

* remove rahul changes to generate first PR

* make lint

* add dataset/tasks skeleton

* readd changes lost in rebase

* add fsd50k

* add task categories for audio

* slight updates to fsd50k

* make lint

* wav2vec2 model

* add fsd50k metadata

* rename folder

* add metric

* add torchaudio in req

* register wav2vec2 models

* fixes

* add audio in valid task types

* mock interface changes

* make lint

* rm audio clustering

* wav2vec2 model revision update

* rm comment

* rm test.py

* add revisions to all wav2vec2 models

* rm empty abstask files

* rm empty evaluator files

* rm empty task files

* Update tests/test_tasks/test_all_abstasks.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update mteb/models/wav2vec2_models.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* rm non-logReg evaluators for audio classification

* lint

* fn name changed to convert_audio_from_numpy

* rm mock tests for audio kNN classification

* rm evaluators for audio kNN classification

* fix imports

* fix audio kNN; make lint

* rm AbsTaskAudioClassification.py for later PR

* remove commented code; reset changes to ClassificationEvaluator.py

* fix mock tasks for multilabel classification

* make lint

* inherit Wrapper class

* add all languages supported by wav2vec2

* make lint

* add script info to all languages

* make lint

* recent changes

* merge wav2vec2 + add updated logic for auto padding for fsd50k type datasets

* make lint; remove unwanted files

* remove debug lines

* remove esc50 refs

* fix mock tasks for multilabel

* fix mock tasks for multilabel

* Revert "Merge branch 'maeb' into maeb" bad direct commit made to upstream maeb branch
4f23fdf

This reverts commit 4f23fdf, reversing
changes made to 1302477.

* fix model imports

* fsd50k cleaning

* update fsd50k

* change dataset

* eval subsets correctly

* make lint and remove debug statements

* clean print statements

* make lint

* update fsd2019 dataset

* remove init in AbsTaskAudioMultilabelClassification.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* add class parameters in AbsTaskAudioMultilabelClassification

* inherit from multilingualtask for FSD2019Kaggle

* make lint

* update mock_tasks; make lint

* remove train_split from fn parameters

* define fsd2019k to be multilingual

* inherit from MultilingualTask in fsd2019K

* fix tests

* inherit correct multilingual task class

* remove MockAudioMultilabelClassificationLogRegTask

* rm other instances of MockAudioMultilabelClassificationLogRegTask

---------

Co-authored-by: rahulschand <rahulsc@stanford.edu>
Co-authored-by: Silky Singh <silky1708@gmail.com>
Co-authored-by: Silky Singh <54901747+silky1708@users.noreply.github.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* init audio

* some encoder related changes

* some more abs task defs

Co-authored-by: rahulschand <rahulsc@stanford.edu>

* evaluators and classification

* remove rahul changes to generate first PR

* make lint

* init audio

* some encoder related changes

* some more abs task defs

Co-authored-by: rahulschand <rahulsc@stanford.edu>

* evaluators and classification

* remove rahul changes to generate first PR

* make lint

* add dataset/tasks skeleton

* readd changes lost in rebase

* add fsd50k

* add task categories for audio

* slight updates to fsd50k

* make lint

* wav2vec2 model

* add fsd50k metadata

* rename folder

* add metric

* add torchaudio in req

* register wav2vec2 models

* fixes

* add audio in valid task types

* mock interface changes

* my 0 shot

* make lint

* rm audio clustering

* wav2vec2 model revision update

* rm comment

* rm test.py

* add revisions to all wav2vec2 models

* rm empty abstask files

* rm empty evaluator files

* rm empty task files

* Update tests/test_tasks/test_all_abstasks.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update mteb/models/wav2vec2_models.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* rm non-logReg evaluators for audio classification

* lint

* fn name changed to convert_audio_from_numpy

* rm mock tests for audio kNN classification

* rm evaluators for audio kNN classification

* fix imports

* fix audio kNN; make lint

* rm AbsTaskAudioClassification.py for later PR

* added zero-shot loading model and dataset checked

* remove commented code; reset changes to ClassificationEvaluator.py

* fix mock tasks for multilabel classification

* make lint

* inherit Wrapper class

* add all languages supported by wav2vec2

* make lint

* add script info to all languages

* make lint

* before cleaning comments

* ESC and clap model. Tested 81 percent zero-shot numbers

* fixed label names for ESC50-multilabel and removed comments

* recent changes

* merge wav2vec2 + add updated logic for auto padding for fsd50k type datasets

* make lint; remove unwanted files

* remove debug lines

* remove esc50 refs

* changes for debugging

* lint changes and maeb main branch merge

* fix mock tasks for multilabel

* fix mock tasks for multilabel

* Revert "Merge branch 'maeb' into maeb" bad direct commit made to upstream maeb branch
4f23fdf

This reverts commit 4f23fdf, reversing
changes made to 1302477.

* fix model imports

* fsd50k cleaning

* fixed error in Image zero shot classification

* update fsd50k

* change dataset

* eval subsets correctly

* make lint and remove debug statements

* clean print statements

* make lint

* update fsd2019 dataset

* remove init in AbsTaskAudioMultilabelClassification.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* add class parameters in AbsTaskAudioMultilabelClassification

* inherit from multilingualtask for FSD2019Kaggle

* make lint

* update mock_tasks; make lint

* remove train_split from fn parameters

* define fsd2019k to be multilingual

* inherit from MultilingualTask in fsd2019K

* fix tests

* inherit correct multilingual task class

* remove MockAudioMultilabelClassificationLogRegTask

* rm other instances of MockAudioMultilabelClassificationLogRegTask

* removed unnecessary files

* removed unnecessary files

* removed unnecessary files part 3

* deleted esc50 from multi label classification

* fixed errors

* fixed linting, added precision and recall. Removed extra comments

* fixed double loading of model

* filled in missing meta-data

* fixed linting

---------

Co-authored-by: Animesh Jha <jha.animesh01@gmail.com>
Co-authored-by: rahulschand <rahulsc@stanford.edu>
Co-authored-by: Silky Singh <silky1708@gmail.com>
Co-authored-by: Silky Singh <54901747+silky1708@users.noreply.github.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* added unfused model

* fixed lint
…on task (#2285)

* Added fsd50k dataset on huggingface

* added correct hf version of fsd50k dataset

* added correct hf version of fsd50k dataset

* removed extra imports

* removed unnecessary load_data fn
* added large, music and speech clap models

* fixed public_training_data and removed training_datasets split

* added latest revision

* lowercase mit license

* fixed issue related to training_datasets

* fixed lint
Samoed and others added 6 commits February 7, 2026 21:17
# Conflicts:
#	mteb/abstasks/task_metadata.py
#	mteb/leaderboard/app.py
#	mteb/tasks/reranking/eng/__init__.py
#	pyproject.toml
#	uv.lock
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* use collator for processing

* use collator for processing

* fix typing

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix qwen2audio

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* change warning to log

* remove `_get_resampler`

* fix clap with transformers v5

* add cast instead of ignore

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix globe v2 and add globe v3

* add descriptive stats

* fix eval splits for globe v2 age and v2 gender
[MAEB] Add audio task installation instructions to docs

Document FFmpeg and transformers>=4.57.6 requirements for users
running audio tasks with datasets>=4. The datasets library v4+ uses
torchcodec for audio processing which requires FFmpeg to be installed.

Fixes #4023

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
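The requirements in this commit message (FFmpeg plus transformers>=4.57.6 when using datasets>=4) can be checked before launching a long audio evaluation. The sketch below is a rough preflight under stated assumptions: the helper names are hypothetical, the version parser is deliberately simplified (it ignores pre-release suffixes), and finding an `ffmpeg` executable on PATH is only a heuristic for whether torchcodec can load the FFmpeg libraries.

```python
import shutil
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components, e.g. '4.57.6' -> (4, 57, 6).

    Simplified on purpose: parsing stops at the first non-numeric piece,
    so '5.0.0rc1' becomes (5, 0, 0).
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # trailing suffix such as 'rc1': stop here
            break
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def audio_preflight(min_transformers="4.57.6"):
    """Return a list of human-readable problems; empty means nothing was found."""
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg executable not found on PATH")
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if not meets_minimum(installed, min_transformers):
            problems.append(
                f"transformers {installed} is older than {min_transformers}"
            )
    return problems


for problem in audio_preflight():
    print("warning:", problem)
```

For anything beyond a quick check, a dedicated version parser handles pre-releases and local versions more reliably than the simplified parser used here.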
isaac-chung changed the title from "Maeb" to "MAEB" on Feb 15, 2026
isaac-chung marked this pull request as ready for review on February 15, 2026 18:05
# Conflicts:
#	mteb/leaderboard/app.py
#	pyproject.toml
#	uv.lock
isaac-chung (Collaborator) commented

Need to regenerate uv.lock

isaac-chung and others added 4 commits February 16, 2026 09:19
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflict in app.py by keeping both:
- Session tracking functions from main (_generate_fingerprint_session_id, on_page_load)
- skip_cache_file parameter from maeb

Also add 'Beng' (Bengali script code) to typos ignore list.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The function now only takes benchmark_name and computes
languages/task_types/domains internally from the benchmark.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Samoed (Member, Author) commented Feb 16, 2026

@isaac-chung I'm fixing tests in #4098

isaac-chung (Collaborator) commented

Great, thanks! I'll pause here then.

isaac-chung changed the title from "MAEB" to "feat: Introduce MAEB" on Feb 16, 2026
* fix pyproject

* fix torchcodec version
isaac-chung (Collaborator) commented Feb 16, 2026

Only the Windows test is failing now. @Samoed would you mind taking a look please?

* regenerate lock

* lock torchcodec to 0.9

* lock torchcodec to 0.9.1
Samoed (Member, Author) commented Feb 16, 2026

The lock file had torchcodec 0.9.0, which has a bug where it cannot find the torch version on Windows; this was fixed in 0.9.1.

isaac-chung merged commit a79aefe into main on Feb 16, 2026
13 checks passed
isaac-chung deleted the maeb branch on February 16, 2026 20:51
Labels

maeb Audio extension

Development

Successfully merging this pull request may close these issues.

MAEB Overview Issue