CommonLanguage Dataset [download]

This dataset contains recordings from speakers of 45 languages, carefully selected from the CommonVoice database. The audio recordings total 45.1 hours. The data is already split into train, dev (validation), and test sets.

Statistics of CommonLanguage:

Name	Train	Dev	Test
# of utterances	177552	47104	47704
# unique speakers	11189	1297	1322
Total duration, hr	30.04	7.53	7.53
Min duration, sec	0.86	0.98	0.89
Mean duration, sec	4.87	4.61	4.55
Max duration, sec	21.72	105.67	29.83
Duration per language, min	~40	~10	~10
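
As a quick sanity check on the table, the per-split durations can be summed and divided across the 45 languages. The snippet below is only a sketch: the numbers are copied from the table, the variable names are illustrative, and the per-language figure is a plain average that assumes nothing about how evenly the audio is actually distributed.

```python
# Figures copied from the statistics table above (hours per split).
SPLIT_HOURS = {"train": 30.04, "dev": 7.53, "test": 7.53}
NUM_LANGUAGES = 45  # length of the language list in this README

# Total audio across all three splits, in hours.
total_hours = sum(SPLIT_HOURS.values())

# Average minutes of audio per language in each split
# (a simple mean; the real distribution may not be uniform).
minutes_per_language = {
    split: hours * 60 / NUM_LANGUAGES for split, hours in SPLIT_HOURS.items()
}

print(f"total: {total_hours:.1f} h")
for split, minutes in minutes_per_language.items():
    print(f"{split}: {minutes:.1f} min per language")
```

The total agrees with the 45.1 hours quoted in the description, and the averages land close to the ~40 / ~10 / ~10 minutes listed in the last table row.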

List of languages:

Arabic
Basque
Breton
Catalan
Chinese_China
Chinese_Hongkong
Chinese_Taiwan
Chuvash
Czech
Dhivehi
Dutch
English
Esperanto
Estonian
French
Frisian
Georgian
German
Greek
Hakha_Chin
Indonesian
Interlingua
Italian
Japanese
Kabyle
Kinyarwanda
Kyrgyz
Latvian
Maltese
Mangolian
Persian
Polish
Portuguese
Romanian
Romansh_Sursilvan
Russian
Sakha
Slovenian
Spanish
Swedish
Tamil
Tatar
Turkish
Ukrainian
Welsh

Other information

In addition to the language label, each utterance comes with the speaker's age and gender and a transcription of what was said.
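
One way to picture a single datapoint is as a small record holding these four fields. This is a hypothetical sketch, not the dataset's actual file layout or loader: the class name, field names, optional age/gender, and the sample values are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record for one datapoint; field names are assumed."""
    language: str                  # one of the labels in the language list
    transcription: str             # text of the utterance
    age: Optional[str] = None      # assumed optional; may be missing
    gender: Optional[str] = None   # assumed optional; may be missing


def count_by_language(utterances: Iterable[Utterance]) -> Counter:
    """Count how many utterances each language label has."""
    return Counter(u.language for u in utterances)


# Made-up sample values, for illustration only.
samples = [
    Utterance("Welsh", "bore da", age="twenties", gender="female"),
    Utterance("Welsh", "diolch"),
    Utterance("Basque", "kaixo", gender="male"),
]
counts = count_by_language(samples)
print(counts)
```

Grouping by the language label like this is a convenient first step when checking that a split is balanced across languages.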
