Add Bengali.AI Speech corpus for Kaggle Research Code Competition #1108

Merged (3 commits) on Jul 25, 2023

@yfyeung yfyeung commented Jul 24, 2023

This recipe prepares manifests for the dataset of a Kaggle Research Code Competition.

The competition dataset comprises about 1200 hours of recordings of Bengali speech.
Note that this is a Code Competition, in which the actual test set is hidden. The full test set contains about 20 hours of speech in almost 8000 MP3 audio files. All files in the test set are single-channel and encoded at a sample rate of 32 kHz and a bit rate of 48 kbit/s.
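
For a rough sense of scale, the figures quoted above can be turned into back-of-the-envelope estimates. This is only a sketch: the inputs are the approximate numbers from the description ("about 20 hours", "almost 8000" files), the bit rate is assumed to be 48 kbit/s constant, and the function name `estimate_test_set` is made up for illustration.

```python
def estimate_test_set(total_hours=20.0, num_files=8000, bit_rate_bps=48_000):
    """Rough estimates from the approximate figures in the PR description."""
    total_seconds = total_hours * 3600
    # Average utterance length if the audio were spread evenly over the files.
    avg_seconds_per_file = total_seconds / num_files
    # Compressed size assuming a constant bit rate (bits -> bytes).
    total_bytes = total_seconds * bit_rate_bps / 8
    return {
        "avg_seconds_per_file": avg_seconds_per_file,
        "total_megabytes": total_bytes / 1e6,
    }


if __name__ == "__main__":
    print(estimate_test_set())
```

With these inputs the estimate comes to roughly 9 seconds per file and about 432 MB of MP3 data in total; the real values will differ because the source figures are rounded.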

Details on the dataset are available in the dataset paper: https://arxiv.org/abs/2305.09688

@yfyeung yfyeung changed the title Add Bengali.AI Speech corpus Add Bengali.AI Speech corpus [Kaggle Research Code Competition] Jul 24, 2023
@yfyeung yfyeung changed the title Add Bengali.AI Speech corpus [Kaggle Research Code Competition] Add Bengali.AI Speech corpus for Kaggle Research Code Competition Jul 24, 2023
It is covered in more detail at https://arxiv.org/abs/2305.09688

Please download manually by
kaggle competitions download -c bengaliai-speech
Collaborator

Could we provide a download function that calls subprocess.run("kaggle competitions download ...")? It could also check whether the kaggle binary is available using lhotse.utils.is_module_available and, if not, raise an exception asking the user to pip install kaggle first.
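
A minimal sketch of what such a helper might look like, not the code that was merged. The function names `build_kaggle_command` and `download_bengaliai_speech` are hypothetical, and the availability check uses the standard library's `shutil.which` as a stand-in for `lhotse.utils.is_module_available`, since it looks for the `kaggle` executable on `PATH` rather than an importable module. The sketch also assumes the CLI writes into the current working directory, so it sets `cwd` instead of passing a path flag.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_kaggle_command(competition: str = "bengaliai-speech") -> List[str]:
    # Same command as in the recipe docstring, split into an argument list.
    return ["kaggle", "competitions", "download", "-c", competition]


def download_bengaliai_speech(target_dir=".") -> Path:
    """Hypothetical helper: download the competition archive into target_dir."""
    target_dir = Path(target_dir)
    # Fail early with an actionable message if the CLI is not installed.
    if shutil.which("kaggle") is None:
        raise RuntimeError(
            "The 'kaggle' command was not found. "
            "Please run 'pip install kaggle' first."
        )
    target_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with an error,
    # for example when no API token has been configured.
    subprocess.run(build_kaggle_command(), cwd=target_dir, check=True)
    return target_dir
```

Passing the command as a list avoids shell quoting issues; a plain string as in the comment above would need `shell=True` on POSIX systems.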

Contributor Author

@pzelasko I tried to implement the download function, but unfortunately I have to authenticate with an API token before calling subprocess.run("kaggle competitions download -c bengaliai-speech"); otherwise I get OSError: Could not find kaggle.json. Make sure it's located in /k2-dev/yangyifan/.kaggle.
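
The authentication requirement described above could, in principle, be checked before invoking the CLI. The following is a hedged sketch rather than part of the recipe: `find_kaggle_credentials` is a hypothetical name, and it only looks in the places the Kaggle client is commonly documented to read (the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables, a `kaggle.json` under `KAGGLE_CONFIG_DIR`, or `~/.kaggle/kaggle.json`). Other client versions may consult additional locations.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_kaggle_credentials(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Optional[str]:
    """Return a description of where credentials were found, or None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: credentials supplied directly through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment variables"
    # Option 2: an API token file in the config directory.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or home / ".kaggle")
    token_file = config_dir / "kaggle.json"
    return str(token_file) if token_file.is_file() else None


if __name__ == "__main__":
    location = find_kaggle_credentials()
    if location is None:
        print("No Kaggle API token found; download the dataset manually.")
    else:
        print(f"Kaggle credentials found via {location}.")
```

A check like this can only report that a token is missing; it cannot create one, which is why the recipe asks users to download the data manually.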

Collaborator

OK, it's good enough as it is. Thanks!

@pzelasko pzelasko added this to the v1.16 milestone Jul 24, 2023
@pzelasko pzelasko merged commit 601c345 into lhotse-speech:master Jul 25, 2023
@yfyeung yfyeung deleted the bengaliai_speech branch July 26, 2023 01:53