CommonLanguage Dataset [download]
This dataset contains recordings from speakers of 45 languages, carefully selected from the CommonVoice database. The audio recordings total 45.1 hours. The data is already split into train, dev (validation), and test sets.
Statistic | Train | Dev | Test |
---|---|---|---|
# of utterances | 177552 | 47104 | 47704 |
# unique speakers | 11189 | 1297 | 1322 |
Total duration, hr | 30.04 | 7.53 | 7.53 |
Min duration, sec | 0.86 | 0.98 | 0.89 |
Mean duration, sec | 4.87 | 4.61 | 4.55 |
Max duration, sec | 21.72 | 105.67 | 29.83 |
Duration per language, min | ~40 | ~10 | ~10 |
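
The duration figures in the table can be cross-checked with a little arithmetic. The sketch below is only an illustration: it copies the per-split hours from the table, sums them, and derives an average duration per language by dividing evenly across the 45 languages. The names `SPLIT_HOURS`, `total_hours`, and `minutes_per_language` are made up for this example and are not part of the dataset.

```python
# Per-split total durations in hours, copied from the table above.
SPLIT_HOURS = {"train": 30.04, "dev": 7.53, "test": 7.53}
NUM_LANGUAGES = 45


def total_hours(split_hours):
    """Sum the per-split durations, rounded to two decimals."""
    return round(sum(split_hours.values()), 2)


def minutes_per_language(hours, num_languages=NUM_LANGUAGES):
    """Average minutes per language, assuming an even spread across languages."""
    return hours * 60 / num_languages


print(f"total: {total_hours(SPLIT_HOURS):.2f} h")
for split, hours in SPLIT_HOURS.items():
    print(f"{split}: {minutes_per_language(hours):.1f} min per language")
```

The sum agrees with the 45.1 hours quoted in the description, and the per-language averages land close to the ~40 and ~10 minute figures in the last table row.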

The dataset covers the following 45 languages:

- Arabic
- Basque
- Breton
- Catalan
- Chinese_China
- Chinese_Hongkong
- Chinese_Taiwan
- Chuvash
- Czech
- Dhivehi
- Dutch
- English
- Esperanto
- Estonian
- French
- Frisian
- Georgian
- German
- Greek
- Hakha_Chin
- Indonesian
- Interlingua
- Italian
- Japanese
- Kabyle
- Kinyarwanda
- Kyrgyz
- Latvian
- Maltese
- Mangolian
- Persian
- Polish
- Portuguese
- Romanian
- Romansh_Sursilvan
- Russian
- Sakha
- Slovenian
- Spanish
- Swedish
- Tamil
- Tatar
- Turkish
- Ukrainian
- Welsh

In addition to the language label, each utterance is annotated with the speaker's age and gender and with the utterance transcription.
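
One way to picture a single datapoint is as a small record holding the language label together with the age, gender, and transcription annotations. The following is a minimal sketch, not the dataset's actual schema: the class `Utterance`, its field names (including `audio_path`), the choice to make `age` and `gender` optional, and the placeholder values are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Utterance:
    """One datapoint. Field names are hypothetical, not the dataset's own."""
    audio_path: str                 # hypothetical location of the recording
    language: str                   # one of the 45 labels, e.g. "Basque"
    transcription: str              # text of what was said
    age: Optional[str] = None       # assumed optional in this sketch
    gender: Optional[str] = None    # assumed optional in this sketch


def count_by_language(utterances: Iterable[Utterance]) -> Counter:
    """Count how many utterances carry each language label."""
    return Counter(u.language for u in utterances)


# Placeholder records, invented purely to exercise the code.
examples = [
    Utterance("clip_0001.wav", "Basque", "placeholder text", "twenties", "female"),
    Utterance("clip_0002.wav", "Basque", "placeholder text"),
    Utterance("clip_0003.wav", "Welsh", "placeholder text", gender="male"),
]

print(count_by_language(examples))
```

Keeping the record immutable (`frozen=True`) is a design choice for the sketch: it makes accidental edits to labels harder when the same records are reused across the train, dev, and test splits.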