feat: Introduce MAEB by Samoed · Pull Request #3470 · embeddings-benchmark/mteb

Samoed · 2025-10-22T12:48:40Z

Fixes #2072

What: Introduces MAEB (Massive Audio Embedding Benchmark) - a comprehensive benchmark for evaluating audio embedding models

Key additions:

- New audio task types: Classification, multilabel classification, zero-shot classification, clustering, pair classification, reranking, and audio-text retrieval (A2T/T2A)
- Audio models: Wav2Vec2, WavLM, Whisper, CLAP, Qwen2Audio, MS-CLAP, and LCO-Embedding models
- Datasets: 95+ audio tasks including FSD50K, ESC-50, VoxCeleb, AudioCaps, Clotho, Common Voice, FLEURS, and many others
- Benchmark variants: MAEB (30 tasks), MAEB+ (97 tasks), MAEB(audio-only) (various subsets), and MAEB(audio-text) with lite/extended versions
- Infrastructure: Audio encoder interface, evaluation pipeline with resampling support, task selection using correlation analysis, and leaderboard integration

Technical improvements:

- Support for datasets v4+ with torchcodec for audio processing
- Efficient data collators for batch processing
- Descriptive statistics for all tasks
- Task selection methodology using correlation thresholds to reduce redundancy
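
The description does not spell out how the correlation thresholds are applied. As a minimal sketch, assume each task has one score per model and that a task counts as redundant when its scores correlate strongly with those of a task already kept. A greedy filter under those assumptions might look like the following. The function names, the example scores, and the 0.9 threshold are hypothetical and not taken from the PR.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long score lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_tasks(scores, threshold=0.9):
    """Greedily keep tasks that are not strongly correlated with any kept task.

    scores maps a task name to a list of scores, one per model, with the
    same model order for every task. Both the data layout and the greedy
    order are assumptions made for illustration.
    """
    kept = []
    for task, task_scores in scores.items():
        redundant = any(
            abs(pearson(task_scores, scores[other])) >= threshold
            for other in kept
        )
        if not redundant:
            kept.append(task)
    return kept


# Hypothetical scores for four models on three made-up tasks.
example_scores = {
    "TaskA": [0.10, 0.20, 0.30, 0.40],
    "TaskB": [0.12, 0.22, 0.31, 0.43],  # ranks models almost exactly like TaskA
    "TaskC": [0.40, 0.10, 0.35, 0.20],
}
selected = select_tasks(example_scores, threshold=0.9)
```

Here `TaskB` is dropped because it orders the models almost identically to `TaskA`, while `TaskC` is kept. The result depends on the iteration order of the tasks, so a real selection procedure would need a rule for which member of a correlated pair to keep.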

…2175)

* Added wav2vec model wrapper
* Added four w2v variants
* Update wav2vec_models.py
* Removed run.py test script
* Added subTask with small sample of dataset for testing
* Removed test portion of VoiceGender.py task
* add commit hash and bibtex
* make lint
* update models
* fix circular import
* make VoiceGender discoverable in get_tasks
* add a2a as category for clustering
* specify latest commit hash
* revert linting changes
* Based on feedback for model: updated w2v2 revisions and added torchaudio to .toml file
* Added Bibtex for dataset, set data to be test instead of training, shortened task_subtype
* Changed task from Voice Gender Clustering to Gender Clustering.
* Fixed mock audio clustering tests
* Added dataset metadata
* Linted
* Passed revision into the w2v2 loader
* passed lint check
* Linted
* Update VoiceGender.py

---------

Co-authored-by: Ali Sartaz Khan <alisartazkhan@gmail.com>
Co-authored-by: Ali Sartaz Khan <71156712+alisartazkhan@users.noreply.github.com>
Co-authored-by: mn <mn@Ms-MacBook-Pro.local>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

… FSD50k Dataset and Task (#2082) * init audio * some encoder related changes * some more abs task defs Co-authored-by: rahulschand <rahulsc@stanford.edu> * evaluators and classification * remove rahul changes to generate first PR * make lint * add dataset/tasks skeleton * readd changes lost in rebase * add fsd50k * add task categories for audio * slight updates to fsd50k * make lint * wav2vec2 model * add fsd50k metadata * rename folder * add metric * add torchaudio in req * reigster wav2vec2 models * fixes * add audio in valid task types * mock interface changes * make lint * rm audio clustering * wav2vec2 model revision update * rm comment * rm test.py * add revisions to all wav2vec2 models * rm empty abstask files * rm empty evaluator files * rm empty task files * Update tests/test_tasks/test_all_abstasks.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update mteb/models/wav2vec2_models.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * rm non-logReg evaluators for audio classification * lint * fn name changed to convert_audio_from_numpy * rm mock tests for audio kNN classification * rm evaluators for audio kNN classification * fix imports * fix audio kNN; make lint * rm AbsTaskAudioClassification.py for later PR * remove commented code; reset changes to ClassificationEvaluator.py * fix mock tasks for multilabel classification * make lint * inherit Wrapper class * add all languages supported by wav2vec2 * make lint * add script info to all languages * make lint * recent changes * merge wav2vec2 + add updated logic for auto padding for fqd50k type datasets * make lint remove uwanted files * remove debug lines * remove esc50 refs * fix mock tasks for multilabel * fix mock tasks for multilabel * Revert "Merge branch 'maeb' into maeb" bad direct commit made to upstream maeb branch 4f23fdf This reverts commit 4f23fdf, reversing changes made to 1302477. 
* fix model imports * fqd50k cleaning * update fsd50k * change dataset * eval subsets correctly * make lint and remove debug statements * clean print statements * make lint * update fsd2019 dataset * remove init in AbsTaskAudioMultilabelClassification.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * add class parameters in AbsTaskAudioMultilabelClassification * inherit from multilingualtask for FSD2019Kaggle * make lint * update mock_tasks; make lint * remove train_split from fn parameters * define fsd2019k to be multilingual * inherit from MultilingualTask in fsd2019K * fix tests * inherit correct multingial task class * remove MockAudioMultilabelClassificationLogRegTask * rm other instances of MockAudioMultilabelClassificationLogRegTask --------- Co-authored-by: rahulschand <rahulsc@stanford.edu> Co-authored-by: Silky Singh <silky1708@gmail.com> Co-authored-by: Silky Singh <54901747+silky1708@users.noreply.github.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* init audio * some encoder related changes * some more abs task defs Co-authored-by: rahulschand <rahulsc@stanford.edu> * evaluators and classification * remove rahul changes to generate first PR * make lint * init audio * some encoder related changes * some more abs task defs Co-authored-by: rahulschand <rahulsc@stanford.edu> * evaluators and classification * remove rahul changes to generate first PR * make lint * add dataset/tasks skeleton * readd changes lost in rebase * add fsd50k * add task categories for audio * slight updates to fsd50k * make lint * wav2vec2 model * add fsd50k metadata * rename folder * add metric * add torchaudio in req * reigster wav2vec2 models * fixes * add audio in valid task types * mock interface changes * my 0 shot * make lint * rm audio clustering * wav2vec2 model revision update * rm comment * rm test.py * add revisions to all wav2vec2 models * rm empty abstask files * rm empty evaluator files * rm empty task files * Update tests/test_tasks/test_all_abstasks.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update mteb/models/wav2vec2_models.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * rm non-logReg evaluators for audio classification * lint * fn name changed to convert_audio_from_numpy * rm mock tests for audio kNN classification * rm evaluators for audio kNN classification * fix imports * fix audio kNN; make lint * rm AbsTaskAudioClassification.py for later PR * added zero-shot loading model and dataset checked * remove commented code; reset changes to ClassificationEvaluator.py * fix mock tasks for multilabel classification * make lint * inherit Wrapper class * add all languages supported by wav2vec2 * make lint * add script info to all languages * make lint * before cleaning comments * ESC and clap model. 
Tested 81 percent zero-shot numbers * fixed label names for ESC50-multilabel and removed comments * recent changes * merge wav2vec2 + add updated logic for auto padding for fqd50k type datasets * make lint remove uwanted files * remove debug lines * remove esc50 refs * changes for debugging * lint changes and maeb main branch merge * fix mock tasks for multilabel * fix mock tasks for multilabel * Revert "Merge branch 'maeb' into maeb" bad direct commit made to upstream maeb branch 4f23fdf This reverts commit 4f23fdf, reversing changes made to 1302477. * fix model imports * fqd50k cleaning * fixed error in Image zero shot classfification * update fsd50k * change dataset * eval subsets correctly * make lint and remove debug statements * clean print statements * make lint * update fsd2019 dataset * remove init in AbsTaskAudioMultilabelClassification.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * add class parameters in AbsTaskAudioMultilabelClassification * inherit from multilingualtask for FSD2019Kaggle * make lint * update mock_tasks; make lint * remove train_split from fn parameters * define fsd2019k to be multilingual * inherit from MultilingualTask in fsd2019K * fix tests * inherit correct multingial task class * remove MockAudioMultilabelClassificationLogRegTask * rm other instances of MockAudioMultilabelClassificationLogRegTask * removed unncessary files * removed unncrssary files * removed uncrssary files part 3 * deleted esc50 from multi label classification * fixed errors * fixed lintng, added precision and recall. 
Removed extra comments * fixed double loading of model * filled in missing meta-data * fixed linting --------- Co-authored-by: Animesh Jha <jha.animesh01@gmail.com> Co-authored-by: rahulschand <rahulsc@stanford.edu> Co-authored-by: Silky Singh <silky1708@gmail.com> Co-authored-by: Silky Singh <54901747+silky1708@users.noreply.github.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* use collator for processing
* use collator for processing
* fix typing
* Apply suggestions from code review (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* fix qwen2audio
* Apply suggestions from code review (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* change warning to log
* remove `_get_resampler`
* fix clap with transformers v5
* add cast instead of ignore

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[MAEB] Add audio task installation instructions to docs

Document FFmpeg and transformers>=4.57.6 requirements for users running audio tasks with datasets>=4. The datasets library v4+ uses torchcodec for audio processing, which requires FFmpeg to be installed.

Fixes #4023

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
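
The requirements named in this commit message (FFmpeg, transformers>=4.57.6, datasets>=4) can be checked before running audio tasks. The following is a rough sketch using only the Python standard library. The helper names are hypothetical, and looking for an `ffmpeg` executable on `PATH` is only a heuristic, since torchcodec needs the FFmpeg shared libraries rather than the command-line tool itself.

```python
import shutil
from importlib import metadata


def version_tuple(version):
    """Turn '4.57.6' or '5.0.0rc1' into a tuple of its leading integer parts."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_audio_environment():
    """Report which of the documented audio requirements appear to be satisfied."""
    report = {"ffmpeg": shutil.which("ffmpeg") is not None}
    # Minimum versions as stated in the commit message above.
    for package, minimum in (("datasets", "4"), ("transformers", "4.57.6")):
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = False
    return report


if __name__ == "__main__":
    for name, ok in check_audio_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

The version comparison deliberately ignores pre-release suffixes such as `rc1`; a project that already depends on the `packaging` library could use `packaging.version.Version` instead.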

isaac-chung · 2026-02-16T09:09:31Z

Need to regenerate uv.lock

Resolve conflict in app.py by keeping both:
- Session tracking functions from main (_generate_fingerprint_session_id, on_page_load)
- skip_cache_file parameter from maeb

Also add 'Beng' (Bengali script code) to typos ignore list.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Samoed · 2026-02-16T09:52:32Z

@isaac-chung I'm fixing tests in #4098

isaac-chung · 2026-02-16T09:53:08Z

Great, thanks! I'll pause here then.

isaac-chung · 2026-02-16T11:32:41Z

Only the Windows test is failing now. @Samoed would you mind taking a look please?

Samoed · 2026-02-16T20:15:44Z

The lock file pinned torchcodec 0.9.0, which has a bug on Windows where it cannot find the torch version. This was fixed in 0.9.1.

isaac-chung · 2026-02-16T20:46:16Z

cc @KennethEnevoldsen @AdnanElAssadi56 @gowitheflow-1998

sufen-f and others added 30 commits February 18, 2025 23:33

Started the following:

32b7af8

- define audio encoder interface - implement abstask and evaluator for clustering

Minor changes and linted files. #2093

8eff2c6

Minor changes and linted files. #2093

53a2e36

Minor changes and linted files. #2093

ed93f2b

Refs #2068: Initial Implementation of audio-text retrieval abstask an…

fbab033

…d evaluator

Added MockAudioClustering task + MockAudioEncoder for testcase

d39e187

Created test_maeb_datasets.py to test AbsTask and Evaluator for clustering

MockAudioClustering + MockAudioEncoder (#2093)

bcca37f

Added wav2vec model wrapper

2a238ed

Added subTask with small sample of dataset for testing

7816974

Added four w2v variants

07f53b1

Update wav2vec_models.py

882af38

Added wav2vec (5), wavlm (7), and whisper (5) models

daeada0

Added revisions from HF to wav2vec models, added silhouette score, DB…

c1ebf2a

…SCAN and agglomerative algorithms into clustering evaluator, added algorithm selector into VoiceGender

Update mteb/models/wavlm_models.py

716deed

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

setting up colab

ce1bee9

Merge remote-tracking branch 'origin/maeb' into maeb

4cf7e6f

added a2a

545b938

PCA + hidden layer + shuffling

ed978fa

New task: emotion clustering

1616ba9

Added qwen2 model

ac14d16

Merge branch 'maeb' into maeb

4f23fdf

Revert "Maeb - added voice clustering task, wav2vec model and VoxCele…

ee10191

…b dataset (subset)" (#2202)

Revert "Revert "Maeb - added voice clustering task, wav2vec model and…

f1449c0

… VoxCeleb dataset (subset)"" (#2203) Revert "Revert "Maeb - added voice clustering task, wav2vec model and VoxCele…" This reverts commit ee10191.

Revert "Revert "Revert "Maeb - added voice clustering task, wav2vec m…

d731d40

…odel and VoxCeleb dataset (subset)""" (#2207) Revert "Revert "Revert "Maeb - added voice clustering task, wav2vec model and…" This reverts commit f1449c0.

Add unfused clap model for zero-shot (#2269)

6d9eca3

* added unfused model * fixed lint

Add new and complete version of FSD50K multi-label audio classificati…

2188585

…on task (#2285) * Added fsd50k dataset on huggingface * added correct hf version of fsd50k dataset * added correct hf version of fsd50k dataset * removed extra imports * removed unecessary load_data fn

added large, music and speech clap models (#2284)

bdefb14

* added large, music and speech clap models * fixed public_training_data and removed training_datasets split * added latest revision * lowercase mit license * fixed issue related to training_datasets * fixed lint

Samoed and others added 6 commits February 7, 2026 21:17

Merge branch 'main' into maeb

71d45e3

# Conflicts: # mteb/abstasks/task_metadata.py # mteb/leaderboard/app.py # mteb/tasks/reranking/eng/__init__.py # pyproject.toml # uv.lock

Make birdset dataset handling more efficient (#3863)

5fd436e

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

skip test audio models

f193eed

[MAEB] Add GLOBE v3 dataset (#4091)

6120b00

* fix globe v2 and add globe v3 * add descriptive stats * fix eval splits for globe v2 age and v2 gender

isaac-chung changed the title ~~Maeb~~ MAEB Feb 15, 2026

isaac-chung marked this pull request as ready for review February 15, 2026 18:05

Merge remote-tracking branch 'origin/main' into maeb

b1f55e9

# Conflicts: # mteb/leaderboard/app.py # pyproject.toml # uv.lock

isaac-chung and others added 4 commits February 16, 2026 09:19

Update uv.lock

165514f

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fix _update_description call sites to match new signature

29136c1

The function now only takes benchmark_name and computes languages/task_types/domains internally from the benchmark. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

try install runtime libraries that provide libav*.so

7ac00fa

isaac-chung changed the title ~~MAEB~~ feat: Introduce MAEB Feb 16, 2026

fix pyproject (#4098)

744bd2f

* fix pyproject * fix torchcodec version

isaac-chung reviewed Feb 16, 2026

.github/workflows/test.yml (outdated, resolved)

isaac-chung reviewed Feb 16, 2026

pyproject.toml (outdated, resolved)

remove vllm duplicate

864461a

isaac-chung reviewed Feb 16, 2026

.github/workflows/test.yml (outdated, resolved)

Update .github/workflows/test.yml

90afbff

fix windows ci (#4100)

91460f8

* regenerate lock * lock torchcodec to 0.9 * lock torchcodec to 0.9.1

isaac-chung approved these changes Feb 16, 2026

isaac-chung merged commit a79aefe into main Feb 16, 2026
13 checks passed

isaac-chung deleted the maeb branch February 16, 2026 20:51

isaac-chung merged 235 commits into main from maeb

18 participants
