This is the repository for the CMU multilingual speech extension data set presented in the paper entitled MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible.
For copyright reasons, we are not allowed to share the audio files; however, we provide the extraction pipeline below. This pipeline can also be applied to new languages of interest. Inside the dataset folder, for each language we provide:
- Alignment textgrids (from Maus forced aligner)
- Final textual output and segments textgrids
- Mel Filterbank Spectrograms (as used in the paper's experiments)
1) Downloading audio chapters from bible.is.
1.1. The audio files used in our work are available at the following links:
- Basque dataset
- English dataset
- Finnish dataset
- French dataset
- Hungarian dataset
- Romanian dataset
- Russian dataset
- Spanish dataset
1.2. The audio files were converted from multi-channel to single-channel and force-aligned using this script.
1.3. The raw chapter text files are no longer available for download from the website, so we provide them at dataset/LANGUAGE/raw_txt/. For new languages, chapter text files can be extracted from this webpage. These chapter-level .txt files should be placed in the same folder as the audio files.
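As an illustration of the channel conversion mentioned in step 1.2, here is a minimal sketch that averages all channels of a WAV file into one. It is not the script used for the corpus: the function name `downmix_to_mono` is hypothetical, and it assumes the input is already a 16-bit PCM WAV file (audio in another format would first need to be converted with a separate tool).

```python
import array
import sys
import wave


def downmix_to_mono(src_path, dst_path):
    """Average all channels of a 16-bit PCM WAV file into a single channel.

    Hypothetical helper for illustration; assumes 16-bit PCM input.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        rate = src.getframerate()
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", src.readframes(src.getnframes()))

    # WAV data is little-endian; array("h") uses the native byte order.
    if sys.byteorder == "big":
        samples.byteswap()

    # Samples are interleaved: one frame holds n_channels consecutive values.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )

    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

Averaging keeps every value within the 16-bit range, so no clipping step is needed. The sample rate is left unchanged.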
2) Aligning the data with Maus forced aligner
For the covered languages, we make available the output from the Maus forced aligner in LANGUAGE/maus_textgrid/. For new languages, please check the Website.
For each language, the audio files were sliced into verses using the output of 1.3. and the generated textgrids (2.). More details available here.
To translate the IDs into English, we provide the simple Python script below.
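The slicing step can be pictured as cutting a chapter's sample array at the verse boundaries given by the alignment. The sketch below assumes the boundaries have already been read from the textgrids into `(label, start_seconds, end_seconds)` tuples; that tuple layout, the labels, and the function name `slice_by_intervals` are assumptions made for illustration, not the repository's actual code.

```python
def slice_by_intervals(samples, sample_rate, intervals):
    """Cut one chapter into per-verse segments.

    samples     -- sequence of mono audio samples for the chapter
    sample_rate -- samples per second
    intervals   -- iterable of (label, start_seconds, end_seconds);
                   this layout is an assumption of the sketch
    Returns a dict mapping each label to its slice of `samples`.
    """
    segments = {}
    for label, start_s, end_s in intervals:
        if end_s <= start_s:
            raise ValueError(f"empty or inverted interval for {label!r}")
        # Convert times in seconds to sample indices.
        start = int(round(start_s * sample_rate))
        end = int(round(end_s * sample_rate))
        segments[label] = samples[start:end]
    return segments
```

For example, with a sample rate of 10 and the intervals `("v1", 0.0, 1.5)` and `("v2", 1.5, 3.0)`, each segment holds 15 samples and the two segments are contiguous.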
python3 scripts/fetch_data.py <language folder> <output folder> <language code>
Use this script to generate a CSV file listing the verses available for each language. Since not every verse available in one language exists in the others, this CSV file can be used to obtain the list of verses common to all languages.
The speech-to-speech retrieval baseline model proposed in the paper is available here.
If you use this corpus in your experiments, please cite it using the following BibTeX entry:
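Finding the verses shared by all languages amounts to a set intersection over the per-language verse lists. The sketch below assumes one CSV file per language with a header row and a column named `verse_id`; that column name and the one-file-per-language layout are assumptions, so adapt them to the CSV actually produced by the script.

```python
import csv


def read_verse_ids(csv_path, id_column="verse_id"):
    """Return the set of verse IDs in one CSV file.

    `id_column` is a hypothetical column name; adjust it to the real header.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[id_column] for row in csv.DictReader(f)}


def common_verses(csv_paths, id_column="verse_id"):
    """Return the sorted verse IDs present in every CSV file given."""
    id_sets = [read_verse_ids(path, id_column) for path in csv_paths]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))
```

Calling `common_verses` with the CSV paths for the languages of interest returns only the IDs that appear in all of them, which is the subset usable for parallel experiments.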
@inproceedings{zanon-boito-etal-2020-mass,
title = {{M}a{SS}: {A} {L}arge and {C}lean {M}ultilingual {C}orpus of {S}entence-aligned {S}poken {U}tterances {E}xtracted from the {B}ible},
author = {Zanon Boito*, Marcely and Havard*, William and Garnerin, Mahault and Le Ferrand, Éric and Besacier, Laurent},
booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
month = may,
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
url = {https://aclanthology.org/2020.lrec-1.799},
pages = {6486--6493},
language = {English},
isbn = {979-10-95546-34-4},
}
The people behind the (325) project:
- Marcely ZANON BOITO
- William N. HAVARD
- Mahault GARNERIN
- Eric Le FERRAND
- Laurent BESACIER