Maeb - added voice clustering task, wav2vec model and VoxCeleb dataset (subset)#2136
sufen-f merged 13 commits into embeddings-benchmark:maeb
if __name__ == "__main__":
    model_name = "facebook/wav2vec2-base"
    model = mteb.get_model(model_name, revision="main")
    print(f"Loaded model type: {type(model)}")
    evaluation = mteb.MTEB(tasks=[VoiceGenderClustering()])
    results = evaluation.run(model, output_folder=f"results/{model_name}")
    print(results)
Suggested change (remove these lines):
- if __name__ == "__main__":
-     model_name = "facebook/wav2vec2-base"
-     model = mteb.get_model(model_name, revision="main")
-     print(f"Loaded model type: {type(model)}")
-     evaluation = mteb.MTEB(tasks=[VoiceGenderClustering()])
-     results = evaluation.run(model, output_folder=f"results/{model_name}")
-     print(results)
reference="https://huggingface.co/datasets/mmn3690/voice-gender-clustering",
dataset={
    "path": "mmn3690/voice-gender-clustering",
    "revision": "main",
Please specify revision as a commit hash rather than a branch name such as "main", so that the dataset version is pinned.
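To illustrate the request: a branch name such as "main" can move over time, while a full 40-character commit hash cannot. The sketch below is only one assumed way to check a revision string before putting it in the task metadata. `is_pinned_revision` is a hypothetical helper, not part of mteb, and the hash in the usage lines is a made-up placeholder, not the real commit of this dataset.

```python
import re

# A full Git commit SHA-1 is exactly 40 hexadecimal characters.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True if `revision` looks like a full commit hash (hypothetical helper)."""
    return _COMMIT_SHA.fullmatch(revision.lower()) is not None


if __name__ == "__main__":
    # A branch name is not a pinned revision.
    print(is_pinned_revision("main"))
    # Placeholder hash, used only to show the expected shape of the value.
    print(is_pinned_revision("0123456789abcdef0123456789abcdef01234567"))
```

A check like this only validates the shape of the string; it does not confirm that the commit exists in the dataset repository.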
mteb/models/wav2vec_models.py
Outdated
    **kwargs
) -> np.ndarray:

    batch_size = kwargs.get('batch_size', 32)
Suggested change:
-     **kwargs
- ) -> np.ndarray:
-     batch_size = kwargs.get('batch_size', 32)
+     batch_size: int = 32,
+     **kwargs
+ ) -> np.ndarray:
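The point of this suggestion is that a named parameter with a default is visible in the signature and type-checked, whereas a value pulled out of `**kwargs` is hidden from callers. A minimal sketch of the same pattern, assuming a simplified batching function; `iter_batches` is a hypothetical stand-in and not the method from wav2vec_models.py:

```python
from typing import Any, Sequence


def iter_batches(
    items: Sequence[Any],
    batch_size: int = 32,  # explicit parameter instead of kwargs.get("batch_size", 32)
    **kwargs: Any,  # remaining options are accepted but unused in this sketch
) -> list[Sequence[Any]]:
    """Split `items` into consecutive batches of at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    # 70 items with the default batch size of 32 give batches of 32, 32 and 6.
    print([len(batch) for batch in iter_batches(list(range(70)))])
```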
mteb/models/wav2vec_models.py
Outdated
    self,
    device: str | None = None,
    **kwargs
):
    super().__init__(device=device, **kwargs)
    self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
    self.model_revision = kwargs.get('model_revision', None)
Suggested change:
-     self,
-     device: str | None = None,
-     **kwargs
- ):
-     super().__init__(device=device, **kwargs)
-     self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
-     self.model_revision = kwargs.get('model_revision', None)
+     self,
+     model_name: str,
+     revision: str,
+     device: str | None = None,
+     **kwargs
+ ):
+     super().__init__(device=device, **kwargs)
+     self.model_name = kwargs.get('model_name', 'facebook/wav2vec2-base')
+     self.model_revision = kwargs.get('model_revision', None)
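For readers following the suggestion above, here is a minimal sketch of a constructor that takes the model name and revision as explicit, required parameters. It goes one step further than the suggested change by assigning the parameters directly instead of reading them back from `kwargs`; that step is an assumption about the intent, not something stated in the review. `Wav2VecWrapperSketch` is a hypothetical class with no base class, and the `"cpu"` fallback is likewise assumed.

```python
from __future__ import annotations

from typing import Any


class Wav2VecWrapperSketch:
    """Hypothetical wrapper showing explicit constructor parameters."""

    def __init__(
        self,
        model_name: str,
        revision: str,
        device: str | None = None,
        **kwargs: Any,
    ) -> None:
        # Store the explicit arguments directly; no hidden defaults in kwargs.
        self.model_name = model_name
        self.model_revision = revision
        # Assumed fallback for this sketch only.
        self.device = device if device is not None else "cpu"
        # Keep any remaining options for later use.
        self.extra_options = dict(kwargs)


if __name__ == "__main__":
    wrapper = Wav2VecWrapperSketch("facebook/wav2vec2-base", revision="main")
    print(wrapper.model_name, wrapper.model_revision, wrapper.device)
```

With this shape, omitting `model_name` or `revision` raises a `TypeError` at construction time instead of silently falling back to a default.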
Also, you have an error in your branch.
…SCAN and agglomerative algorithms into clustering evaluator, added algorithm selector into VoiceGender
if self.cluster_algo == "Kmeans":
    logger.info("Fitting Mini-Batch K-Means model...")
    clustering_model = sklearn.cluster.MiniBatchKMeans(
        n_clusters=len(set(self.labels)),
        batch_size=self.clustering_batch_size,
        n_init="auto",
    )
elif self.cluster_algo == "DBSCAN":
    # need to plot out the distribution of the embeddings to decide on parameters for DBSCAN
    logger.info("Fitting DBSCAN model...")
    clustering_model = sklearn.cluster.DBSCAN(eps=0.5, min_samples=3, metric="euclidean")
elif self.cluster_algo == "Agg":
    logger.info("Fitting Agglomerative model...")
    clustering_model = sklearn.cluster.AgglomerativeClustering(n_clusters=len(set(self.labels)))
Could you specify the clustering algorithm as a method in the class (maybe using partial)? That would make results easier to reproduce; with the current approach, reproducing them could be difficult.
Thanks @Samoed. Did you mean three methods (i.e. one for each clustering algorithm), or would one clustering method suffice?
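One possible reading of the request, sketched under assumptions: the task class fixes its clustering algorithm and hyperparameters with `functools.partial`, so the choice lives in the class definition rather than in a string passed at run time. `ThresholdClusterer` is a toy stand-in used only to keep the example dependency-free, and `ClusteringTaskSketch` is hypothetical; in the evaluator the partial would presumably wrap one of the scikit-learn estimators from the snippet above.

```python
from functools import partial


class ThresholdClusterer:
    """Toy stand-in for an sklearn-style estimator (hypothetical)."""

    def __init__(self, threshold: float = 0.0) -> None:
        self.threshold = threshold

    def fit_predict(self, values: list[float]) -> list[int]:
        # Assign label 1 to values above the threshold, 0 otherwise.
        return [int(v > self.threshold) for v in values]


class ClusteringTaskSketch:
    # The algorithm and its hyperparameters are fixed in the class body,
    # so every run of this task builds the same clustering model.
    # staticmethod prevents the partial from being bound as a method.
    clustering_factory = staticmethod(partial(ThresholdClusterer, threshold=0.5))

    def cluster(self, values: list[float]) -> list[int]:
        model = self.clustering_factory()
        return model.fit_predict(values)


if __name__ == "__main__":
    print(ClusteringTaskSketch().cluster([0.1, 0.9, 0.5]))
```

Under this design, a task that needs a different algorithm would override `clustering_factory` in its own class, which keeps one clustering method in the evaluator.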
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
isaac-chung left a comment
Could you please keep this PR smaller and limit it to just the clustering abstask, evaluator, and one clustering task? (You can include one model if none exists currently.)
This is starting to balloon in scope and I'd like to draw a line somewhere. e.g. the models (each file) should be a separate PR. Each additional dataset should be a separate PR. All of these should also be tracked in issues, so please create them if they do not exist yet.
Thanks!
Hi @isaac-chung, yes of course. As this is still a WIP, should we keep pushing our code to this one PR and split it into multiple PRs later, as you suggested, or do you want us to split it now?
@mnasser3 please group the PRs. If this is WIP, please mark this as draft, and hold off requesting reviews in the future. Thanks!
@isaac-chung as requested, we are splitting up the PR. We took the first 4 commits from here and put them in PR #2175. The remaining models and datasets will each get their own PR.
@isaac-chung it looks like all 9 remaining commits out of the initial 13 were merged into the main maeb branch, instead of just the 4 we put in #2175. Is it possible to restore the main maeb branch to its previous state?
You can revert this PR, close it, and merge this branch into #2175 locally.
PR linked to #2093
@mnasser3 @alisartazkhan
Adding datasets checklist
mteb -m {model_name} -t {task_name} command.
facebook/wav2vec2-base
self.stratified_subsampling() under dataset_transform()
make test.
make lint.
Adding a model checklist
mteb.get_model(model_name, revision) and mteb.get_model_meta(model_name, revision)