
added SpeechCommand dataset and Keyword spotting task#2329

Merged
isaac-chung merged 12 commits into embeddings-benchmark:maeb from anime-sh:speech_commands
Jun 21, 2025

Conversation

@RahulSChand
Contributor

@RahulSChand RahulSChand commented Mar 11, 2025

Added the google/speech-commands v1 dataset, part of the larger #2319 issue list to add all CLAP models. This is a keyword spotting dataset, so a new task type was added as well. Test results and prompt logic are in the comments below.

Code Quality

  • [x] Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • [x] Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

  • [x] I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • [x] I have filled out the metadata object in the dataset file (find documentation on it here).
  • [x] Run tests locally to make sure nothing is broken using make test.
  • [x] Run the formatter to format the code using make lint.

@RahulSChand RahulSChand self-assigned this Mar 11, 2025
@RahulSChand RahulSChand marked this pull request as draft March 11, 2025 18:54
@RahulSChand
Contributor Author

RahulSChand commented Mar 11, 2025

The prompt logic for keyword spotting in the zero-shot setting is to use the label itself as the text, with no additional prefix, unlike other cases where a template such as "this is a sound of .." is used (from the CLAP paper).

Screenshot from 2025-03-11 12-26-33
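The prompt construction described above can be sketched as follows. This is a minimal illustration, not the actual mteb implementation; the function name and task-type strings are assumptions.

```python
# Hedged sketch of zero-shot prompt construction for audio classification.
# Names here are illustrative; mteb's real code is organized differently.

def build_prompts(labels, task_type):
    """Return one candidate text prompt per label."""
    if task_type == "keyword_spotting":
        # Keyword spotting: the spoken word itself is the label,
        # so the raw label text is used with no prefix.
        return list(labels)
    # Generic audio classification (as in the CLAP paper):
    # wrap the label in a natural-language template.
    return [f"this is a sound of {label}" for label in labels]

print(build_prompts(["yes", "no", "stop"], "keyword_spotting"))
print(build_prompts(["rain"], "audio_classification"))
```

Each prompt is then embedded by the text encoder, and the audio clip is assigned the label whose prompt embedding is most similar.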

@RahulSChand RahulSChand added the audio Audio extension label Mar 11, 2025
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Looks good, a few minor things.

Did you run a model on it? I could imagine that this task might be too easy.

@RahulSChand
Contributor Author

Looks good, a few minor things.

Did you run a model on it? I could imagine that this task might be too easy.

No, haven't run the model yet. It's still a draft PR. The dataset is large (~70k audio files), so it will take some time.

@RahulSChand RahulSChand changed the title added SpeechCommand dataset and Keyword spotting task added SpeechCommand dataset and Keyword spotting task (WIP) Mar 11, 2025
@KennethEnevoldsen
Contributor

No, haven't run the model yet. It's still a draft PR.

No worries, just listing what is missing. Reducing the number of samples should make it more doable.

@silky1708 silky1708 linked an issue Mar 13, 2025 that may be closed by this pull request
@RahulSChand
Contributor Author

RahulSChand commented Mar 15, 2025

No, haven't run the model yet. It's still a draft PR.

No worries, just listing what is missing. Reducing the number of samples should make it more doable.

Tested with samples_per_label=8. For additional context, the dataset has 30 labels in v1.1, but only the first 10 are considered commands; the rest are treated as auxiliary labels. Below is from the official Hugging Face repo:

In both versions, ten of them are used as commands by convention: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go". Other words are considered to be auxiliary (in current implementation it is marked by True value of "is_unknown" feature).
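The filtering described in that quote can be sketched as below. The example rows are illustrative; only the "is_unknown" feature name comes from the dataset description quoted above.

```python
# Hedged sketch: keep only the ten conventional command labels,
# dropping auxiliary words via the "is_unknown" flag.
COMMANDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

def keep_commands(examples):
    """Drop auxiliary words, keeping only the ten conventional commands."""
    return [ex for ex in examples if not ex["is_unknown"]]

# Illustrative rows standing in for real dataset examples.
rows = [
    {"label": "yes", "is_unknown": False},
    {"label": "bed", "is_unknown": True},   # auxiliary word
    {"label": "stop", "is_unknown": False},
]
print([ex["label"] for ex in keep_commands(rows)])  # → ['yes', 'stop']
```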

The v2 version has a total of 13 commands (v2 has not been added yet). The results below (on the v1.1 version of the dataset) are close to those in the official CLAP paper (the official paper reports 10.63% on the v2 version of the dataset).

Screenshot 2025-03-14 at 4 45 52 PM

I feel one thing that can be done in this PR is to go directly to the v2 version of the dataset, which is now the one used for benchmarking speech commands, and skip the v1 dataset since it is no longer used. What do you think?

@RahulSChand RahulSChand changed the title added SpeechCommand dataset and Keyword spotting task (WIP) added SpeechCommand dataset and Keyword spotting task Mar 15, 2025
@RahulSChand RahulSChand marked this pull request as ready for review March 15, 2025 00:24
@silky1708 silky1708 mentioned this pull request May 7, 2025
84 tasks
@isaac-chung
Collaborator

isaac-chung commented Jun 11, 2025

@RahulSChand the test set is ~3k, so it seems manageable without downsampling. Want to also note that the zero-shot abstask does not use samples_per_label, i.e. no training is done.

For the different versions of the dataset, we could include both v1 and v2 by naming the class and the metadata name field like SpeechCommandsZeroshotv1. The file would have 2 classes, one for each version. This way both are supported, and we do not necessarily include both in the benchmark; we'll have a choice.

So if you have the bandwidth, it would be great to add v2 as well. Otherwise, I think it is also fine to merge it as is right now.
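The two-classes-in-one-file layout suggested above might look like the following sketch. Only the class naming (SpeechCommandsZeroshotv1) comes from the comment; the base class, metadata fields, and config names are stand-ins, not the real mteb AbsTask/TaskMetadata API.

```python
# Sketch only: ZeroshotAudioTask and the metadata dicts are placeholders
# standing in for mteb's actual zero-shot audio task base class and metadata.

class ZeroshotAudioTask:
    """Stand-in for the real zero-shot audio task base class."""
    metadata: dict = {}

class SpeechCommandsZeroshotv1(ZeroshotAudioTask):
    # Dataset config name "v0.01" is an assumption, not verified.
    metadata = {"name": "SpeechCommandsZeroshotv1", "config": "v0.01"}

class SpeechCommandsZeroshotv2(ZeroshotAudioTask):
    # Dataset config name "v0.02" is an assumption, not verified.
    metadata = {"name": "SpeechCommandsZeroshotv2", "config": "v0.02"}

# Both classes live in one file; a benchmark can register either or both.
print([cls.metadata["name"] for cls in (SpeechCommandsZeroshotv1, SpeechCommandsZeroshotv2)])
```

This keeps each version selectable by its own task name without forcing both into every benchmark.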

@isaac-chung isaac-chung merged commit 6bc4c5a into embeddings-benchmark:maeb Jun 21, 2025
8 checks passed

Development

Successfully merging this pull request may close these issues.

Add Speech Commands dataset

4 participants