Maeb - added voice clustering task, wav2vec model and VoxCeleb dataset (subset) by sufen-f · Pull Request #2136 · embeddings-benchmark/mteb

sufen-f · 2025-02-22T19:32:44Z

PR linked to #2093

@mnasser3 @alisartazkhan

Adding datasets checklist

I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
- facebook/wav2vec2-base
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
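
The subsampling item in the checklist refers to mteb's self.stratified_subsampling(); its exact signature is not shown in this thread. As a rough, library-free illustration of the idea only (the function name `stratified_subsample`, its parameters, and the toy labels are hypothetical), a stratified subsample caps the dataset size while keeping each label's share roughly constant:

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=42):
    """Return sorted indices of a subsample that roughly preserves label proportions.

    Hypothetical helper for illustration; this is not mteb's implementation.
    """
    if n_samples >= len(labels):
        return list(range(len(labels)))

    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for indices in by_label.values():
        # Proportional allocation, keeping at least one example per label.
        # Because of rounding, the total can differ slightly from n_samples.
        quota = max(1, round(n_samples * len(indices) / len(labels)))
        chosen.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(chosen)


# Toy, made-up label distribution: 75% / 25%.
labels = ["female"] * 3000 + ["male"] * 1000
subset = stratified_subsample(labels, n_samples=2048)
```

With this toy input the 75/25 split is preserved in the 2048 selected indices.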

Adding a model checklist

I have filled out the ModelMeta object to the extent possible
I have ensured that my model can be loaded using
- mteb.get_model(model_name, revision) and
- mteb.get_model_meta(model_name, revision)
I have tested the implementation works on a representative set of tasks.

Samoed · 2025-02-22T19:45:10Z

mteb/tasks/Audio/Clustering/eng/VoiceGender.py

+if __name__ == "__main__":
+    model_name = "facebook/wav2vec2-base"
+    model = mteb.get_model(model_name, revision="main")
+    print(f"Loaded model type: {type(model)}")
+    evaluation = mteb.MTEB(tasks=[VoiceGenderClustering()])
+    results = evaluation.run(model, output_folder=f"results/{model_name}")
+    print(results)


Suggested change

-if __name__ == "__main__":
-    model_name = "facebook/wav2vec2-base"
-    model = mteb.get_model(model_name, revision="main")
-    print(f"Loaded model type: {type(model)}")
-    evaluation = mteb.MTEB(tasks=[VoiceGenderClustering()])
-    results = evaluation.run(model, output_folder=f"results/{model_name}")
-    print(results)

Samoed · 2025-02-22T19:45:33Z

mteb/tasks/Audio/Clustering/eng/VoiceGender.py

+        reference="https://huggingface.co/datasets/mmn3690/voice-gender-clustering",
+        dataset={
+            "path": "mmn3690/voice-gender-clustering",
+            "revision": "main",


Please specify revision as a commit hash
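
A branch name such as "main" can move over time, while a commit hash pins one exact snapshot of the dataset. A small sketch of a check one could run on metadata before submitting (the helper name `is_commit_hash` is hypothetical and not part of mteb; it assumes full 40-character SHA-1 hashes, as used by Git):

```python
import re

# A full Git SHA-1 commit hash is 40 lowercase hexadecimal characters.
_FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def is_commit_hash(revision: str) -> bool:
    """Return True only if `revision` looks like a full commit hash."""
    return _FULL_SHA1.fullmatch(revision) is not None


print(is_commit_hash("main"))  # False: a branch name, not a pinned commit
```

Abbreviated hashes (such as the 7-character ones shown in the commit list) are rejected by this sketch on purpose, since only the full hash is unambiguous.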

Samoed · 2025-02-22T19:47:10Z

mteb/models/wav2vec_models.py

+        **kwargs
+    ) -> np.ndarray:
+
+        batch_size = kwargs.get('batch_size', 32)


Suggested change

-        **kwargs
-    ) -> np.ndarray:
-        batch_size = kwargs.get('batch_size', 32)
+        batch_size: int = 32,
+        **kwargs
+    ) -> np.ndarray:
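
The suggestion above replaces a hidden kwargs.get('batch_size', 32) lookup with an explicit keyword parameter. A minimal sketch of the resulting shape, with a placeholder standing in for the real model call (the function name `get_audio_embeddings`, the list-based inputs, and the one-number "embedding" are illustrative assumptions, not the PR's code):

```python
import inspect


def get_audio_embeddings(audio, batch_size: int = 32, **kwargs):
    """Process `audio` in chunks of `batch_size` and return one vector per clip."""
    embeddings = []
    for start in range(0, len(audio), batch_size):
        batch = audio[start:start + batch_size]
        # Placeholder for the model forward pass: a 1-dimensional "embedding"
        # holding the clip length, just so the example runs end to end.
        embeddings.extend([float(len(clip))] for clip in batch)
    return embeddings


# The default is now visible in the signature instead of buried in the body.
default_batch_size = inspect.signature(get_audio_embeddings).parameters["batch_size"].default
```

Callers, type checkers, and documentation tools can all see the default this way, and a misspelled argument name is easier to notice.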

Samoed · 2025-02-22T19:47:40Z

mteb/models/wav2vec_models.py

+        self, 
+        device: str | None = None,
+        **kwargs
+    ):
+        super().__init__(device=device, **kwargs)
+        self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
+        self.model_revision = kwargs.get('model_revision', None)


Suggested change

-        self, 
-        device: str | None = None,
-        **kwargs
-    ):
-        super().__init__(device=device, **kwargs)
-        self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
-        self.model_revision = kwargs.get('model_revision', None)
+        self,
+        model_name: str,
+        revision: str,
+        device: str | None = None,
+        **kwargs
+    ):
+        super().__init__(device=device, **kwargs)
+        self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
+        self.model_revision = kwargs.get('model_revision', None)
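
One Python detail worth noting about the suggestion above: once model_name and revision are explicit parameters, they are bound by name and no longer appear in **kwargs, so kwargs.get('model_name', ...) would always fall back to its default. A sketch of one way to store the explicit arguments instead (the class name `Wav2VecWrapperSketch` is hypothetical, and the PR's base class is left out to keep the example self-contained):

```python
from typing import Optional


class Wav2VecWrapperSketch:
    """Illustrative stand-in for the wrapper's constructor; not the PR's class."""

    def __init__(
        self,
        model_name: str,
        revision: str,
        device: Optional[str] = None,
        **kwargs,
    ):
        # Use the explicit parameters directly rather than looking them up in kwargs.
        self.model_name = model_name
        self.model_revision = revision
        self.device = device
        # Kept only to show that the named arguments do not land in kwargs.
        self.leftover_kwargs = kwargs


wrapper = Wav2VecWrapperSketch("facebook/wav2vec2-base", revision="main")
```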

Samoed · 2025-02-22T20:42:00Z

Also, you have an error in your branch:

ImportError: cannot import name 'model_meta_from_sentence_transformers' from partially initialized module 'mteb.models' (most likely due to a circular import)

mteb/models/wavlm_models.py


Samoed · 2025-02-23T19:36:42Z

mteb/evaluation/evaluators/Audio/ClusteringEvaluator.py

+        if self.cluster_algo == "Kmeans":
+            logger.info("Fitting Mini-Batch K-Means model...")
+            clustering_model = sklearn.cluster.MiniBatchKMeans(
+                n_clusters=len(set(self.labels)),
+                batch_size=self.clustering_batch_size,
+                n_init="auto",
+            )
+        elif self.cluster_algo == "DBSCAN":
+            # need to plot out the distribution of the embeddings to decide on parameters for DBSCAN
+            logger.info("Fitting DBSCAN model...")
+            clustering_model = sklearn.cluster.DBSCAN(eps=0.5, min_samples=3, metric="euclidean")
+        elif self.cluster_algo == "Agg":
+            logger.info("Fitting Agglomerative model...")
+            clustering_model = sklearn.cluster.AgglomerativeClustering(n_clusters=len(set(self.labels)))
+


Could you specify the clustering algorithm as a method in the class (maybe with partial)? That would make the results easier to reproduce; with the current approach, reproducing them could be difficult.

Thanks @Samoed, did you mean 3 methods (i.e. one for each clustering algorithm), or would 1 clustering method suffice?

I think 1 method is enough
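
One possible reading of "1 method", sketched with functools.partial so that each algorithm's hyperparameters are fixed in a single place. The function name `make_clustering_model` and the `CLUSTER_ALGOS` tuple are hypothetical; the selector strings and hyperparameter values are copied from the diff above, and scikit-learn is assumed to be installed:

```python
from functools import partial

# Selector names taken from the diff under review.
CLUSTER_ALGOS = ("Kmeans", "DBSCAN", "Agg")


def make_clustering_model(cluster_algo, n_clusters, clustering_batch_size=32):
    """Build an unfitted scikit-learn clustering model for the given selector."""
    if cluster_algo not in CLUSTER_ALGOS:
        raise ValueError(
            f"Unknown cluster_algo {cluster_algo!r}; expected one of {CLUSTER_ALGOS}"
        )

    # Imported here so the selector is validated before scikit-learn is needed.
    import sklearn.cluster

    factories = {
        "Kmeans": partial(
            sklearn.cluster.MiniBatchKMeans,
            n_clusters=n_clusters,
            batch_size=clustering_batch_size,
            n_init="auto",
        ),
        # DBSCAN infers the number of clusters itself, so n_clusters is unused.
        "DBSCAN": partial(
            sklearn.cluster.DBSCAN, eps=0.5, min_samples=3, metric="euclidean"
        ),
        "Agg": partial(sklearn.cluster.AgglomerativeClustering, n_clusters=n_clusters),
    }
    return factories[cluster_algo]()
```

An unknown selector now fails loudly instead of leaving the model undefined, which the if/elif chain in the diff would do silently.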


isaac-chung

Could you please keep this PR smaller and limit it to just the clustering abstask, evaluator, and one clustering task? (It can include one model if none exists currently.)

This is starting to balloon in scope and I'd like to draw a line somewhere. e.g. the models (each file) should be a separate PR. Each additional dataset should be a separate PR. All of these should also be tracked in issues, so please create them if they do not exist yet.

Thanks!

mnasser3 · 2025-02-25T05:05:57Z


Hi @isaac-chung, yes of course. As this is still a WIP, should we keep this as one PR where we push our code and then divide it into multiple PRs like you suggested, or do you want us to do that now?

isaac-chung · 2025-02-25T11:24:39Z

@mnasser3 please group the PRs. If this is WIP, please mark this as draft, and hold off requesting reviews in the future. Thanks!

sufen-f · 2025-02-27T08:05:47Z

@isaac-chung as requested, we are splitting up the PR. Took the first 4 commits from here and put them in PR #2175. The remaining models and datasets will be their own PR.

sufen-f · 2025-02-28T19:34:15Z

@isaac-chung it looks like all the remaining 9 commits out of the initial 13 were merged into the main maeb branch instead of just the 4 we put in #2175. Is it possible to restore the main maeb branch to its previous state?
@alisartazkhan @mnasser3

Samoed · 2025-02-28T19:44:34Z

You can revert this PR, close it, and then merge this branch into #2175 locally.

alisartazkhan and others added 2 commits February 21, 2025 23:47

Added wav2vec model wrapper

2a238ed

Added subTask with small sample of dataset for testing

7816974

sufen-f requested review from Muennighoff and isaac-chung February 22, 2025 19:32

Samoed reviewed Feb 22, 2025

alisartazkhan and others added 3 commits February 22, 2025 18:58

Added four w2v variants

07f53b1

Update wav2vec_models.py

882af38

Added wav2vec (5), wavlm (7), and whisper (5) models

daeada0

Samoed reviewed Feb 23, 2025

Added revisions from HF to wav2vec models, added silhouette score, DBSCAN and agglomerative algorithms into clustering evaluator, added algorithm selector into VoiceGender

c1ebf2a

Samoed reviewed Feb 23, 2025

alisartazkhan and others added 5 commits February 23, 2025 12:29

Update mteb/models/wavlm_models.py

716deed

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

setting up colab

ce1bee9

Merge remote-tracking branch 'origin/maeb' into maeb

4cf7e6f

added a2a

545b938

PCA + hidden layer + shuffling

ed978fa

mnasser3 force-pushed the maeb branch from 3458803 to ed978fa Compare February 24, 2025 04:53

New task: emotion clustering

1616ba9

isaac-chung requested changes Feb 24, 2025

alisartazkhan marked this pull request as draft February 26, 2025 07:43

alisartazkhan added the WIP Work In Progress label Feb 26, 2025

Added qwen2 model

ac14d16

Samoed added the maeb Audio extension label Feb 27, 2025

sufen-f merged commit 4f23fdf into embeddings-benchmark:maeb Feb 28, 2025
1 of 9 checks passed

isaac-chung mentioned this pull request Mar 1, 2025

Create audio clustering AbsTask and Evaluator #2093

Closed

sufen-f mentioned this pull request Mar 1, 2025

Revert "Maeb - added voice clustering task, wav2vec model and VoxCeleb dataset (subset)" #2202

Merged
