Add Bengali.AI Speech corpus for Kaggle Research Code Competition #1108

yfyeung · 2023-07-24T03:47:06Z

This recipe prepares manifests for the dataset of a Kaggle Research Code Competition.

The competition dataset comprises about 1200 hours of recordings of Bengali speech.
Note that this is a Code Competition, so the actual test set is hidden. The full test set contains about 20 hours of speech in almost 8000 MP3 audio files. All files in the test set are single-channel, encoded at a sample rate of 32 kHz and a bit rate of 48 kbps.

Details on the dataset are available in the dataset paper: https://arxiv.org/abs/2305.09688

pzelasko · 2023-07-24T13:50:50Z

lhotse/recipes/bengaliai_speech.py

+It is covered in more detail at https://arxiv.org/abs/2305.09688
+
+Please download manually by
+kaggle competitions download -c bengaliai-speech


Could we provide a download function that calls subprocess.run("kaggle competitions download ...")? It could also check whether the kaggle binary is available using lhotse.utils.is_module_available, and if not, raise an exception asking the user to pip install kaggle first.

yfyeung · Jul 25, 2023

@pzelasko I tried to implement the download function. Unfortunately, the user has to authenticate with an API token before subprocess.run("kaggle competitions download -c bengaliai-speech") can succeed; otherwise the call fails with OSError: Could not find kaggle.json. Make sure it's located in /k2-dev/yangyifan/.kaggle.

pzelasko · Jul 25, 2023

OK, it's good enough as it is. Thanks!
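
For reference, the download helper discussed in this thread could be sketched roughly as follows. This is not part of the merged recipe, which keeps the manual download step. The function names `kaggle_credentials_available` and `download_bengaliai_speech` are hypothetical, and the sketch uses `shutil.which` to look for the `kaggle` executable rather than `lhotse.utils.is_module_available`. It assumes the Kaggle CLI reads its token either from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or from a `kaggle.json` file in `KAGGLE_CONFIG_DIR` (defaulting to `~/.kaggle`); other lookup locations used by newer CLI versions are not covered.

```python
import os
import shutil
import subprocess
from pathlib import Path
from typing import Mapping, Optional


def kaggle_credentials_available(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if a Kaggle API token can be found (assumed lookup rules)."""
    env = os.environ if env is None else env
    # Option 1: credentials passed through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: kaggle.json in KAGGLE_CONFIG_DIR, or in ~/.kaggle by default.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or Path.home() / ".kaggle")
    return (config_dir / "kaggle.json").is_file()


def download_bengaliai_speech(target_dir: str = ".") -> Path:
    """Hypothetical helper: fetch the competition data with the Kaggle CLI."""
    # Fail early with an actionable message if the CLI is not installed.
    if shutil.which("kaggle") is None:
        raise RuntimeError(
            "The kaggle command was not found. Install it with: pip install kaggle"
        )
    # Fail early instead of letting the CLI raise about a missing kaggle.json.
    if not kaggle_credentials_available():
        raise RuntimeError(
            "No Kaggle API token found. Place kaggle.json in ~/.kaggle or set "
            "KAGGLE_USERNAME and KAGGLE_KEY before downloading."
        )
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the download itself fails.
    subprocess.run(
        ["kaggle", "competitions", "download", "-c", "bengaliai-speech",
         "-p", str(target)],
        check=True,
    )
    return target
```

Checking for the token up front turns the OSError reported above into a clearer message, but it does not remove the need for each user to create and install their own API token, which is why a manual download instruction remains a reasonable choice.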

Add Bengali.AI Speech

8fd41ad

yfyeung changed the title ~~Add Bengali.AI Speech corpus~~ Add Bengali.AI Speech corpus [Kaggle Research Code Competition] Jul 24, 2023

yfyeung changed the title ~~Add Bengali.AI Speech corpus [Kaggle Research Code Competition]~~ Add Bengali.AI Speech corpus for Kaggle Research Code Competition Jul 24, 2023

Update bengaliai_speech.py

7d6a1cf

pzelasko reviewed Jul 24, 2023

pzelasko added this to the v1.16 milestone Jul 24, 2023

Refactor

0daf768

pzelasko approved these changes Jul 25, 2023

pzelasko merged commit 601c345 into lhotse-speech:master Jul 25, 2023

yfyeung deleted the bengaliai_speech branch July 26, 2023 01:53

yfyeung commented Jul 24, 2023

pzelasko Jul 24, 2023

yfyeung Jul 25, 2023

pzelasko Jul 25, 2023
