This is a modularized text-to-speech framework that aims to support fast research and product development. Main features include:
- all modules are configurable via yaml,
- speaker embedding, prosody embedding, and multi-stream text embedding are supported and configurable,
- various vocoders (VocGAN, hifi-GAN, waveglow, melGAN) are supported through adapters, so that different vocoders can be compared easily,
- duration/pitch/energy variance predictors are supported, and other variances can be added easily,
- and more on the road-map.
Contributions are welcome.
- Interesting audio samples for aishell3 have been added here.
- The github page also hosts some samples for biaobei and aishell3 datasets.
git clone https://github.com/ranchlai/mandarin-tts.git
cd mandarin-tts
git submodule update --force --recursive --init --remote
pip install -e .
Two examples are provided here: biaobei and aishell3.
To train your own models, first make a copy of an existing example, then prepare the melspectrogram features using wav2mel.py:
cd examples
python wav2mel.py -c ./aishell3/config.yaml -w <aishell3_wav_folder> -m <mel_folder> -d cpu
Then prepare the scp files necessary for training:
cd examples/aishell3
python prepare.py --wav_folder <aishell3_wav_folder> --mel_folder <mel_folder> --dst_folder ./train/
This will generate the scp files required by config.yaml (in the dataset/train section). You should also check that everything in the config file is correct. Usually you don't need to change the code.
Now you can start training:
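Before training, it can help to sanity-check the generated scp files. The following is a minimal sketch, not part of the project: the helper name `check_scp` is hypothetical, and it assumes each non-empty scp line has the form `<utterance_id> <value>`, which should be verified against the files that prepare.py actually writes.

```python
import os


def check_scp(scp_path, values_are_paths=True):
    """Return a list of (line_number, problem) tuples for one scp file.

    ASSUMPTION: each non-empty line is '<utterance_id> <value>', separated
    by whitespace. This layout is not confirmed by the project docs.
    """
    problems = []
    seen = set()
    with open(scp_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            parts = line.split(maxsplit=1)
            if len(parts) != 2:
                problems.append((lineno, "missing value"))
                continue
            key, value = parts
            if key in seen:
                problems.append((lineno, f"duplicate id {key}"))
            seen.add(key)
            # Only check the filesystem for scp files that list file paths
            # (e.g. mel features), not for text-valued scp files.
            if values_are_paths and not os.path.exists(value):
                problems.append((lineno, f"file not found: {value}"))
    return problems
```

Call it once per scp file in ./train/, passing `values_are_paths=False` for files whose values are text (such as pinyin sequences) rather than paths. An empty result means no problems were detected under the stated assumption.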
cd examples/aishell3
python ../../mtts/train.py -c config.yaml -d cuda
For the biaobei dataset, the workflow is the same, except that there is no speaker embedding, although you can add prosody embedding.
More examples will be added. Please stay tuned.
Currently two examples are provided, and the corresponding checkpoints/configs are summarized as follows.
dataset | checkpoint | config |
---|---|---|
aishell3 | link | link |
biaobei | link | link |
Vocoders convert melspectrograms to waveforms. They are added as submodules and will not be trained in this project, so you should download their checkpoints before synthesizing. Vocoders are not necessary for training, as you can monitor the training process from the generated melspectrograms and the loss curve. Currently we support the following vocoders:
Vocoder | checkpoint | github |
---|---|---|
Waveglow | link | link |
hifi-gan | link | link |
VocGAN | link link | link |
MelGAN | link | link |
All vocoders will be ready after running git submodule update --force --recursive --init --remote. However, you have to download the checkpoint manually and properly set the path in the config.yaml file.
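Since a wrong checkpoint path is easy to miss, a quick check of the loaded config can catch it early. The sketch below is illustrative only: `find_missing_paths` is a hypothetical helper, the suffix list is a guess, and it makes no assumption about the real key names in config.yaml; it simply walks whatever nested dict it is given.

```python
import os


def find_missing_paths(config, suffixes=(".pt", ".pth", ".tar", ".scp")):
    """Walk a nested dict/list (e.g. a loaded config.yaml) and return
    (dotted_key, value) pairs for string values that end with one of
    `suffixes` but do not exist on disk.

    ASSUMPTION: the suffix list is a guess at what checkpoint and scp
    paths look like; adjust it to your own config.
    """
    suffixes = tuple(suffixes)  # str.endswith needs a tuple
    missing = []

    def walk(node, prefix):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{prefix}.{key}" if prefix else str(key))
        elif isinstance(node, (list, tuple)):
            for i, value in enumerate(node):
                walk(value, f"{prefix}[{i}]")
        elif isinstance(node, str) and node.endswith(suffixes):
            if not os.path.exists(node):
                missing.append((prefix, node))

    walk(config, "")
    return missing
```

With PyYAML installed, the config can be loaded with `yaml.safe_load` and passed to this function; any returned pair points at a key whose file still needs to be downloaded or whose path needs fixing.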
The input.txt should be consistent with your settings of emb_type1 to emb_type_n in the config file, i.e., the same types in the same order.
To facilitate transcription of hanzi to pinyin, you can try:
cd examples/aishell3/
python ../../mtts/text/gp2py.py -t "为适应新的网络传播方式和读者阅读习惯"
>> sil wei4 shi4 ying4 xin1 de5 wang3 luo4 chuan2 bo1 fang1 shi4 he2 du2 zhe3 yue4 du2 xi2 guan4 sil|sil 为 适 应 新 的 网 络 传 播 方 式 和 读 者 阅 读 习 惯 sil
Now you can copy the text to input.txt; remember to add a self-defined name and the speaker id, separated by '|'.
With the above checkpoints and text ready, you can finally run the synthesis:
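Assembling the line by hand is error-prone, so a small helper can do it. This is a sketch under a loud assumption: the field order `name|pinyin|hanzi|speaker_id` is a guess, not confirmed by the text above, and must match the order of emb_type1 to emb_type_n in your config. The function name `make_input_line` is hypothetical.

```python
def make_input_line(name, gp2py_output, speaker_id):
    """Build one input.txt line from the text printed by gp2py.py.

    ASSUMPTION: the line layout is 'name|pinyin|hanzi|speaker_id'.
    Check this against your config before use.
    """
    text = gp2py_output.strip()
    # Drop the leading '>>' marker if the output was copied with it.
    if text.startswith(">>"):
        text = text[2:].strip()
    fields = [part.strip() for part in text.split("|")]
    if len(fields) != 2:
        raise ValueError("expected '<pinyin>|<hanzi>' from gp2py.py")
    pinyin, hanzi = fields
    return "|".join([name, pinyin, hanzi, str(speaker_id)])
```

For example, `make_input_line("demo", ">> sil ni3 hao3 sil|sil 你 好 sil", 0)` yields `demo|sil ni3 hao3 sil|sil 你 好 sil|0`, which can then be appended to input.txt.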
python ../../mtts/synthesize.py -d cuda -c config.yaml --checkpoint ./checkpoints/checkpoint_1240000.pth.tar -i input.txt
Please check the config.yaml file for the vocoder settings.
If everything goes well, the synthesized audio can be found in the output folder.