
LightningFastSpeech

WARNING: This is a work in progress; until version 0.1 (which will be out very soon), it might be hard to get running on your own machine. Thanks for your patience.

Large Pretrained TTS

In the NLP community, and more recently in speech recognition, large pre-trained models, and how they can be adapted to downstream tasks, have become an exciting area of research.

In TTS, however, little comparable work exists. With this project, I hope to take a first step toward bringing pre-trained models to TTS. The original FastSpeech 2 model has 27M parameters and models a single speaker. Our version models more than 2,000 speakers, and would have almost 1B parameters without the improvements from LightSpeech, which bring its size down to a manageable 76M.

A big upside of this implementation is that it is based on PyTorch Lightning, which makes it easy to do multi-GPU training, load pre-trained models, and more.

LightningFastSpeech couldn't exist without the amazing open source work of many others, for a full list see Attribution.

Current Status

This library is a work in progress, and until v1.0, updates might break things occasionally.

Goals

v0.1

0.1 is right around the corner! The core functionality is already there; what remains are mostly quality-of-life improvements that we should get out of the way now.

  • Replicate original FastSpeech 2 architecture
  • Include depth-wise separable convolutions found in LightSpeech
  • Dataloader which computes prosody features online
  • Synthesis of both individual utterances and whole datasets
  • Configurable training script
  • Configurable synthesis script
  • First large pre-trained model (LibriTTS, 2k speakers, 76M parameters)
  • Documentation & tutorials
  • Allow reporting other than wandb
  • Configurable metrics
  • LJSpeech support
  • PyPI package
  • HiFi-GAN fine-tuning (during and after training)
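The depth-wise separable convolutions mentioned above are the key LightSpeech trick for shrinking the model. A minimal sketch in PyTorch, with illustrative channel counts and kernel size (not the values used in this repo):

```python
# A depth-wise separable 1-D convolution: a per-channel (depth-wise) conv
# followed by a 1x1 (point-wise) conv, replacing one dense convolution.
import torch
from torch import nn


class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # Depth-wise: one filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels,
        )
        # Point-wise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv1d(in_channels, out_channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


separable = DepthwiseSeparableConv1d(256, 256, 9)
regular = nn.Conv1d(256, 256, 9, padding=4)

x = torch.randn(1, 256, 100)
out = separable(x)  # torch.Size([1, 256, 100]), same shape as regular(x)
```

The separable version produces the same output shape with far fewer parameters (roughly 68K versus 590K for these sizes), which is where much of the 1B-to-76M reduction comes from.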

v1.0

It will take a while to get to 1.0 -- the goal for this version is to let everyone easily fine-tune our models and do controllable synthesis of utterances.

  • Allow models to be loaded from the Huggingface hub.
  • Streamlit interface for synthesising utterances and generating datasets.
  • Tract and tractjs integration to export models for on-device and web use.
  • Make it easy to add new datasets and to fine-tune models with them.
  • Add HiFi-GAN fine-tuning to the pipeline.
  • A range of pre-trained models with different domains and sizes (e.g. multi-lingual, noisy/clean)

Attribution

This would not be possible without the many amazing open source projects already present in the TTS space -- please cite their work when appropriate!
