Maeb - added voice clustering task, wav2vec model and VoxCeleb dataset (subset)#2136

Merged
sufen-f merged 13 commits into embeddings-benchmark:maeb from sufen-f:maeb
Feb 28, 2025

Conversation

@sufen-f
Contributor

@sufen-f sufen-f commented Feb 22, 2025

PR linked to #2093

@mnasser3 @alisartazkhan

Adding datasets checklist

  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • facebook/wav2vec2-base
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
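
The subsampling item in the checklist can be illustrated with a minimal, library-independent sketch. The helper name `stratified_subsample` and its parameters are hypothetical; this is not mteb's `stratified_subsampling()` implementation, only an illustration of capping a dataset while roughly preserving label proportions.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=42):
    """Return sorted indices of at most `n_samples` items, keeping label
    proportions roughly intact.

    Hypothetical helper for illustration only; not mteb's implementation.
    """
    if n_samples >= len(labels):
        return list(range(len(labels)))

    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    total = len(labels)
    chosen = []
    for indices in by_label.values():
        # Allocate a share proportional to the label's frequency (at least 1).
        share = max(1, round(n_samples * len(indices) / total))
        chosen.extend(rng.sample(indices, min(share, len(indices))))

    # Rounding can overshoot the cap; trim at random if it does.
    if len(chosen) > n_samples:
        chosen = rng.sample(chosen, n_samples)
    return sorted(chosen)
```

With 3000 examples of one label and 1000 of another, a cap of 2048 keeps the 3:1 ratio in the selected indices.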

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

Comment on lines 32 to 38
if __name__ == "__main__":
    model_name = "facebook/wav2vec2-base"
    model = mteb.get_model(model_name, revision="main")
    print(f"Loaded model type: {type(model)}")
    evaluation = mteb.MTEB(tasks=[VoiceGenderClustering()])
    results = evaluation.run(model, output_folder=f"results/{model_name}")
    print(results)
Member

Suggested change
- if __name__ == "__main__":
-     model_name = "facebook/wav2vec2-base"
-     model = mteb.get_model(model_name, revision="main")
-     print(f"Loaded model type: {type(model)}")
-     evaluation = mteb.MTEB(tasks=[VoiceGenderClustering()])
-     results = evaluation.run(model, output_folder=f"results/{model_name}")
-     print(results)

reference="https://huggingface.co/datasets/mmn3690/voice-gender-clustering",
dataset={
    "path": "mmn3690/voice-gender-clustering",
    "revision": "main",
Member

Please specify revision as a commit hash
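
One way to guard against a floating revision such as "main" is to check that the string looks like a full commit hash before it goes into the metadata. This is only a sketch: `is_commit_hash` is a hypothetical helper, not part of mteb. The actual hash to use is the one shown in the dataset repository's commit history on the Hugging Face Hub.

```python
import re

# A full git commit SHA-1 is 40 lowercase hexadecimal characters.
_COMMIT_HASH_RE = re.compile(r"[0-9a-f]{40}")


def is_commit_hash(revision: str) -> bool:
    """Return True if `revision` looks like a full 40-character commit hash."""
    return _COMMIT_HASH_RE.fullmatch(revision) is not None


# A branch name such as "main" can move over time, so it is rejected.
print(is_commit_hash("main"))  # False
```

Pinning to a hash means the benchmark keeps loading the same data even if the dataset repository is updated later.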

Comment on lines 37 to 40
    **kwargs
) -> np.ndarray:

    batch_size = kwargs.get('batch_size', 32)
Member

Suggested change
-     **kwargs
- ) -> np.ndarray:
-     batch_size = kwargs.get('batch_size', 32)
+     batch_size: int = 32,
+     **kwargs
+ ) -> np.ndarray:
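
The effect of this suggestion can be sketched in isolation: an explicit `batch_size` parameter is visible in the signature and to tooling, whereas a value read via `kwargs.get` is hidden inside the body. The function `iter_batches` below is a hypothetical stand-in, not the PR's method.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int = 32, **kwargs) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `batch_size` elements.

    `batch_size` is an explicit keyword argument, so its default is visible in
    the signature; unrelated options still pass through `**kwargs` (unused here).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batches = list(iter_batches(list(range(70)), batch_size=32))
print([len(b) for b in batches])  # [32, 32, 6]
```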

Comment on lines 12 to 18
    self,
    device: str | None = None,
    **kwargs
):
    super().__init__(device=device, **kwargs)
    self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
    self.model_revision = kwargs.get('model_revision', None)
Member

Suggested change
-     self,
-     device: str | None = None,
-     **kwargs
- ):
-     super().__init__(device=device, **kwargs)
-     self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
-     self.model_revision = kwargs.get('model_revision', None)
+     self,
+     model_name: str,
+     revision: str,
+     device: str | None = None,
+     **kwargs
+ ):
+     super().__init__(device=device, **kwargs)
+     self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
+     self.model_revision = kwargs.get('model_revision', None)
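
A sketch of how the constructor might look once `model_name` and `revision` are explicit parameters. The class name `AudioModelWrapper` is hypothetical and the PR's base class is omitted. Storing the arguments directly, rather than reading them back from `kwargs`, is an assumption about the intent of the suggestion and is not stated in the thread.

```python
from typing import Optional


class AudioModelWrapper:
    """Hypothetical wrapper; the PR's real base class is omitted for brevity."""

    def __init__(
        self,
        model_name: str,
        revision: str,
        device: Optional[str] = None,
        **kwargs,
    ):
        # Explicit parameters are stored directly; nothing is looked up in kwargs.
        self.model_name = model_name
        self.model_revision = revision
        self.device = device
        self.extra_options = kwargs


# Mirrors the call in the PR, which passes revision="main".
wrapper = AudioModelWrapper("facebook/wav2vec2-base", revision="main")
print(wrapper.model_name, wrapper.model_revision)  # facebook/wav2vec2-base main
```

With required parameters, a missing model name or revision fails at call time instead of silently falling back to a default.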

@Samoed
Member

Samoed commented Feb 22, 2025

Also, you have an error in your branch:

ImportError: cannot import name 'model_meta_from_sentence_transformers' from partially initialized module 'mteb.models' (most likely due to a circular import)

…SCAN and agglomerative algorithms into clustering evaluator, added algorithm selector into VoiceGender
Comment on lines 50 to 64
if self.cluster_algo == "Kmeans":
    logger.info("Fitting Mini-Batch K-Means model...")
    clustering_model = sklearn.cluster.MiniBatchKMeans(
        n_clusters=len(set(self.labels)),
        batch_size=self.clustering_batch_size,
        n_init="auto",
    )
elif self.cluster_algo == "DBSCAN":
    # need to plot out the distribution of the embeddings to decide on parameters for DBSCAN
    logger.info("Fitting DBSCAN model...")
    clustering_model = sklearn.cluster.DBSCAN(eps=0.5, min_samples=3, metric="euclidean")
elif self.cluster_algo == "Agg":
    logger.info("Fitting Agglomerative model...")
    clustering_model = sklearn.cluster.AgglomerativeClustering(n_clusters=len(set(self.labels)))

Member

Could you specify the clustering algorithm as a method in the class (maybe partial)? That would make results easier to reproduce; with the current approach, reproduction could be difficult.

Contributor Author

Thanks @Samoed, did you mean 3 methods (i.e., one for each clustering algorithm), or would 1 clustering method suffice?

Member

I think 1 method is enough
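
A sketch of what a single selector method could look like, using `functools.partial` to bind each algorithm's fixed parameters in one place. The function name `build_clustering_model` and the registry layout are hypothetical; the scikit-learn estimators and parameter values are the ones from the snippet under review. Raising on an unknown name is an addition, since the reviewed snippet has no fallback branch.

```python
from functools import partial

import sklearn.cluster


def build_clustering_model(cluster_algo: str, n_clusters: int, batch_size: int = 32):
    """Return an unfitted scikit-learn clustering estimator for `cluster_algo`."""
    registry = {
        "Kmeans": partial(
            sklearn.cluster.MiniBatchKMeans,
            n_clusters=n_clusters,
            batch_size=batch_size,
            n_init="auto",
        ),
        # DBSCAN infers the number of clusters itself, so n_clusters is not passed.
        "DBSCAN": partial(
            sklearn.cluster.DBSCAN, eps=0.5, min_samples=3, metric="euclidean"
        ),
        "Agg": partial(sklearn.cluster.AgglomerativeClustering, n_clusters=n_clusters),
    }
    if cluster_algo not in registry:
        raise ValueError(
            f"Unknown cluster_algo {cluster_algo!r}; expected one of {sorted(registry)}"
        )
    # Calling the partial constructs the estimator with its bound parameters.
    return registry[cluster_algo]()
```

A caller would then fit the returned estimator on the embeddings, for example `build_clustering_model("Agg", n_clusters=2).fit_predict(embeddings)`, where `embeddings` is an array of shape `(n_samples, n_features)`.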

Collaborator

@isaac-chung isaac-chung left a comment

Could you please keep this PR smaller and limit this to just the clustering abstask, evaluator, and one clustering task? (Can include one model if none exists currently)

This is starting to balloon in scope and I'd like to draw a line somewhere. e.g. the models (each file) should be a separate PR. Each additional dataset should be a separate PR. All of these should also be tracked in issues, so please create them if they do not exist yet.

Thanks!

@mnasser3

> Could you please keep this PR smaller and limit this to just the clustering abstask, evaluator, and one clustering task? (Can include one model if none exists currently)
>
> This is starting to balloon in scope and I'd like to draw a line somewhere. e.g. the models (each file) should be a separate PR. Each additional dataset should be a separate PR. All of these should also be tracked in issues, so please create them if they do not exist yet.
>
> Thanks!

Hi @isaac-chung, yes of course. As this is still a WIP, should we keep pushing our code to this one PR and split it into multiple PRs later, as you suggested, or do you want us to split it now?

@isaac-chung
Collaborator

@mnasser3 please group the PRs. If this is WIP, please mark this as draft, and hold off requesting reviews in the future. Thanks!

@alisartazkhan alisartazkhan marked this pull request as draft February 26, 2025 07:43
@alisartazkhan alisartazkhan added the WIP Work In Progress label Feb 26, 2025
@Samoed Samoed added the maeb Audio extension label Feb 27, 2025
@sufen-f
Contributor Author

sufen-f commented Feb 27, 2025

@isaac-chung as requested, we are splitting up the PR. Took the first 4 commits from here and put them in PR #2175. The remaining models and datasets will be their own PR.

@sufen-f sufen-f merged commit 4f23fdf into embeddings-benchmark:maeb Feb 28, 2025
1 of 9 checks passed
@sufen-f
Contributor Author

sufen-f commented Feb 28, 2025

@isaac-chung it looks like all 13 of the initial commits were merged into the main maeb branch, including the remaining 9, instead of just the 4 we put in #2175. Is it possible to restore the main maeb branch to its previous state?
@alisartazkhan @mnasser3

@Samoed
Member

Samoed commented Feb 28, 2025

You can revert this PR, close it, and then merge this branch locally into #2175.

Labels

maeb (Audio extension), WIP (Work In Progress)
