CommonLanguage Dataset [download]

This dataset contains speech recordings in 45 languages, carefully selected from the CommonVoice database. The recordings total 45.1 hours of audio and are already split into train, dev (validation), and test sets.

Statistics of CommonLanguage:

| Name | Train | Dev | Test |
| --- | --- | --- | --- |
| # of utterances | 177552 | 47104 | 47704 |
| # of unique speakers | 11189 | 1297 | 1322 |
| Total duration, hr | 30.04 | 7.53 | 7.53 |
| Min duration, sec | 0.86 | 0.98 | 0.89 |
| Mean duration, sec | 4.87 | 4.61 | 4.55 |
| Max duration, sec | 21.72 | 105.67 | 29.83 |
| Duration per language, min | ~40 | ~10 | ~10 |
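
As a quick sanity check, the per-split durations in the table can be tied back to the 45.1-hour total and to the approximate per-language figures. The sketch below assumes that audio is spread roughly evenly across the 45 languages, which is how the "~40 / ~10 / ~10" row reads; the variable names are illustrative only.

```python
# Split durations in hours, copied from the statistics table above.
DURATION_HR = {"train": 30.04, "dev": 7.53, "test": 7.53}
NUM_LANGUAGES = 45

# Total audio across all three splits.
total_hr = sum(DURATION_HR.values())

# Average minutes per language in each split,
# ASSUMING an even spread over the 45 languages.
per_language_min = {
    split: hours * 60 / NUM_LANGUAGES for split, hours in DURATION_HR.items()
}

print(f"total: {total_hr:.1f} hr")
for split, minutes in per_language_min.items():
    print(f"{split}: {minutes:.1f} min per language")
```

The splits sum to about 45.1 hours, and the averages come out near 40 minutes (train) and 10 minutes (dev, test) per language, matching the table.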

List of languages:

  • Arabic
  • Basque
  • Breton
  • Catalan
  • Chinese_China
  • Chinese_Hongkong
  • Chinese_Taiwan
  • Chuvash
  • Czech
  • Dhivehi
  • Dutch
  • English
  • Esperanto
  • Estonian
  • French
  • Frisian
  • Georgian
  • German
  • Greek
  • Hakha_Chin
  • Indonesian
  • Interlingua
  • Italian
  • Japanese
  • Kabyle
  • Kinyarwanda
  • Kyrgyz
  • Latvian
  • Maltese
  • Mangolian (Mongolian)
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Romansh_Sursilvan
  • Russian
  • Sakha
  • Slovenian
  • Spanish
  • Swedish
  • Tamil
  • Tatar
  • Turkish
  • Ukrainian
  • Welsh

Other information

In addition to the language label, each utterance is annotated with the speaker's age and gender and with a transcription of what was said.
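
One way to picture a datapoint is as a small record holding the language label plus the three extra annotations. The sketch below is only an illustration: the `Utterance` class, its field names, and the sample values are hypothetical and are not taken from the dataset's actual file layout.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record for one datapoint; field names are assumptions."""
    language: str       # language label, e.g. one of the 45 names listed above
    age: str            # speaker age annotation
    gender: str         # speaker gender annotation
    transcription: str  # text of the utterance


def count_by_language(utterances):
    """Count how many utterances carry each language label."""
    return Counter(u.language for u in utterances)


# Made-up placeholder records, for illustration only.
samples = [
    Utterance("Welsh", "twenties", "female", "placeholder text 1"),
    Utterance("Welsh", "thirties", "male", "placeholder text 2"),
    Utterance("Tamil", "forties", "female", "placeholder text 3"),
]

counts = count_by_language(samples)
print(counts["Welsh"], counts["Tamil"])
```

Keeping the annotations next to the label makes it straightforward to filter or group utterances by speaker attributes before training a language-identification model.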