added SpeechCommand dataset and Keyword spotting task #2329
Merged: isaac-chung merged 12 commits into embeddings-benchmark:maeb from anime-sh:speech_commands on Jun 21, 2025

Commits
9b675d4 added SpeechCommand dataset and Keyword spotting task (RahulSChand)
7f71c19 some bejing (RahulSChand)
cc1e28e speech command changes + others (RahulSChand)
1e74445 removed uncessary files (RahulSChand)
a0ddd5e removed uncessary files (RahulSChand)
48c7b6c fixed label logic (RahulSChand)
753ee3c Merge branch 'maeb' into speech_commands (isaac-chung)
a9fc780 fix metadata and correct abstask (isaac-chung)
0b6b775 correct esc50 category (isaac-chung)
c645461 remove unused param (isaac-chung)
729a3ba add v2 as well (isaac-chung)
f67042b Merge branch 'maeb' into speech_commands (isaac-chung)
146 changes: 146 additions & 0 deletions
mteb/tasks/Audio/AudioZeroshotClassification/eng/SpeechCommands.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,146 @@ | ||
| from __future__ import annotations | ||
|
|
||
| from mteb.abstasks.Audio.AbsTaskAudioZeroshotClassification import ( | ||
| AbsTaskAudioZeroshotClassification, | ||
| ) | ||
| from mteb.abstasks.TaskMetadata import TaskMetadata | ||
|
|
||
|
|
||
| class SpeechCommandsZeroshotClassificationv01(AbsTaskAudioZeroshotClassification): | ||
| metadata = TaskMetadata( | ||
| name="SpeechCommandsZeroshotv0.01", | ||
| description="Sound Classification/Keyword Spotting Dataset. This is a set of one-second audio clips containing a single spoken English word or background noise. These words are from a small set of commands such as 'yes', 'no', and 'stop' spoken by various speakers. With a total of 10 labels/commands for keyword spotting and a total of 30 labels for other auxiliary tasks", | ||
| reference="https://huggingface.co/datasets/google/speech_commands", | ||
| dataset={ | ||
| "path": "google/speech_commands", | ||
| "name": "v0.01", | ||
| "revision": "57ba463ab37e1e7845e0626539a6f6d0fcfbe64a", | ||
| "trust_remote_code": True, | ||
| }, | ||
| type="AudioZeroshotClassification", | ||
| category="a2t", | ||
| eval_splits=["test"], | ||
| eval_langs=["eng-Latn"], | ||
| main_score="accuracy", | ||
| date=("2018-07-07", "2018-07-13"), | ||
| domains=["Spoken"], | ||
| task_subtypes=["Keyword Spotting"], | ||
| license="cc-by-4.0", | ||
| annotations_creators="human-annotated", | ||
| dialect=[], | ||
| modalities=["audio"], | ||
| sample_creation="found", | ||
| bibtex_citation=r""" | ||
| @article{DBLP:journals/corr/abs-1804-03209, | ||
| author = {Pete Warden}, | ||
| bibsource = {dblp computer science bibliography, https://dblp.org}, | ||
| biburl = {https://dblp.org/rec/journals/corr/abs-1804-03209.bib}, | ||
| eprint = {1804.03209}, | ||
| eprinttype = {arXiv}, | ||
| journal = {CoRR}, | ||
| timestamp = {Mon, 13 Aug 2018 16:48:32 +0200}, | ||
| title = {Speech Commands: {A} Dataset for Limited-Vocabulary Speech Recognition}, | ||
| url = {http://arxiv.org/abs/1804.03209}, | ||
| volume = {abs/1804.03209}, | ||
| year = {2018}, | ||
| } | ||
| """, | ||
| descriptive_stats={ | ||
| "n_samples": {"test": 3081}, | ||
| }, | ||
| ) | ||
|
|
||
| label_column_name: str = "label" | ||
|
|
||
| def get_candidate_labels(self) -> list[str]: | ||
| """Return the text candidates for zeroshot classification""" | ||
| return [ | ||
| "Yes", | ||
| "No", | ||
| "Up", | ||
| "Down", | ||
| "Left", | ||
| "Right", | ||
| "On", | ||
| "Off", | ||
| "Stop", | ||
| "Go", | ||
| # The dataset has 30 word labels; only the first 10 (the core commands) are used for zero-shot classification. The remaining words are treated as auxiliary labels. | ||
| ] | ||
|
|
||
| def dataset_transform(self): | ||
| """Filter the dataset to keep only examples whose label is one of the first 10 command labels""" | ||
| # Keep only examples with integer labels 0-9, which index into get_candidate_labels() | ||
| self.dataset = self.dataset.filter( | ||
| lambda x: 0 <= x[self.label_column_name] < len(self.get_candidate_labels()) | ||
| ) | ||
|
|
||
|
|
||
| class SpeechCommandsZeroshotClassificationv02(AbsTaskAudioZeroshotClassification): | ||
| metadata = TaskMetadata( | ||
| name="SpeechCommandsZeroshotv0.02", | ||
| description="Sound Classification/Keyword Spotting Dataset. This is a set of one-second audio clips containing a single spoken English word or background noise. These words are from a small set of commands such as 'yes', 'no', and 'stop' spoken by various speakers. With a total of 10 labels/commands for keyword spotting and a total of 30 labels for other auxiliary tasks", | ||
| reference="https://huggingface.co/datasets/google/speech_commands", | ||
| dataset={ | ||
| "path": "google/speech_commands", | ||
| "name": "v0.02", | ||
| "revision": "57ba463ab37e1e7845e0626539a6f6d0fcfbe64a", | ||
| "trust_remote_code": True, | ||
| }, | ||
| type="AudioZeroshotClassification", | ||
| category="a2t", | ||
| eval_splits=["test"], | ||
| eval_langs=["eng-Latn"], | ||
| main_score="accuracy", | ||
| date=("2018-07-07", "2018-07-13"), | ||
| domains=["Spoken"], | ||
| task_subtypes=["Keyword Spotting"], | ||
| license="cc-by-4.0", | ||
| annotations_creators="human-annotated", | ||
| dialect=[], | ||
| modalities=["audio"], | ||
| sample_creation="found", | ||
| bibtex_citation=r""" | ||
| @article{DBLP:journals/corr/abs-1804-03209, | ||
| author = {Pete Warden}, | ||
| bibsource = {dblp computer science bibliography, https://dblp.org}, | ||
| biburl = {https://dblp.org/rec/journals/corr/abs-1804-03209.bib}, | ||
| eprint = {1804.03209}, | ||
| eprinttype = {arXiv}, | ||
| journal = {CoRR}, | ||
| timestamp = {Mon, 13 Aug 2018 16:48:32 +0200}, | ||
| title = {Speech Commands: {A} Dataset for Limited-Vocabulary Speech Recognition}, | ||
| url = {http://arxiv.org/abs/1804.03209}, | ||
| volume = {abs/1804.03209}, | ||
| year = {2018}, | ||
| } | ||
| """, | ||
| descriptive_stats={ | ||
| "n_samples": {"test": 4890}, | ||
| }, | ||
| ) | ||
|
|
||
| label_column_name: str = "label" | ||
|
|
||
| def get_candidate_labels(self) -> list[str]: | ||
| """Return the text candidates for zeroshot classification""" | ||
| return [ | ||
| "Yes", | ||
| "No", | ||
| "Up", | ||
| "Down", | ||
| "Left", | ||
| "Right", | ||
| "On", | ||
| "Off", | ||
| "Stop", | ||
| "Go", | ||
| # The dataset contains additional word labels beyond these; only the first 10 (the core commands) are used for zero-shot classification. The remaining words are treated as auxiliary labels. | ||
| ] | ||
|
|
||
| def dataset_transform(self): | ||
| """Filter the dataset to keep only examples whose label is one of the first 10 command labels""" | ||
| # Keep only examples with integer labels 0-9, which index into get_candidate_labels() | ||
| self.dataset = self.dataset.filter( | ||
| lambda x: 0 <= x[self.label_column_name] < len(self.get_candidate_labels()) | ||
| ) | ||
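
To illustrate the label handling in this diff, here is a minimal standalone sketch. It is not the code merged in the PR and does not use mteb or the `datasets` library. It assumes that integer labels 0-9 correspond to the ten command words in the order returned by `get_candidate_labels()`, and that zero-shot scoring picks the candidate whose text embedding has the highest cosine similarity with the audio embedding. The helper names `keep_command_rows`, `cosine`, and `zeroshot_predict`, and the toy rows and embeddings, are hypothetical and exist only for this example.

```python
from __future__ import annotations

import math

# The ten command words, in the same order as get_candidate_labels() in the diff.
CANDIDATE_LABELS = [
    "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go",
]


def keep_command_rows(rows: list[dict], label_column: str = "label") -> list[dict]:
    """Keep rows whose integer label indexes into CANDIDATE_LABELS (0-9).

    Mirrors the filter predicate used in dataset_transform().
    """
    return [r for r in rows if 0 <= r[label_column] < len(CANDIDATE_LABELS)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zeroshot_predict(audio_emb: list[float], text_embs: list[list[float]]) -> int:
    """Return the index of the candidate text embedding most similar to the audio."""
    sims = [cosine(audio_emb, t) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy data (assumption): labels 12 and 29 stand in for auxiliary words.
rows = [{"label": 0}, {"label": 3}, {"label": 12}, {"label": 29}]
kept = keep_command_rows(rows)

# Toy embeddings (assumption): one-hot text vectors, one per candidate label.
text_embs = [[1.0 if i == j else 0.0 for j in range(10)] for i in range(10)]
audio_emb = [0.1] * 10
audio_emb[3] = 0.9  # an audio clip that lies closest to candidate index 3

predicted = zeroshot_predict(audio_emb, text_embs)
print([r["label"] for r in kept], CANDIDATE_LABELS[predicted])
```

Because the predicate bounds the label by `len(CANDIDATE_LABELS)`, adding or removing candidate words changes which examples are kept without touching the filter itself.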