
LightningFastSpeech

WARNING: This is a work in progress; until version 0.1 (which will be out very soon), it might be hard to get running on your own machine. Thanks for your patience.

Large Pretrained TTS

In the NLP community, and more recently in speech recognition, large pre-trained models, and how they can be adapted to downstream tasks, have become an exciting area of research.

In TTS, however, little comparable work exists. With this project, I hope to take a first step toward bringing pre-trained models to TTS. The original FastSpeech 2 model has 27M parameters and models a single speaker. Our version models more than 2,000 speakers, and would have almost 1B parameters without the improvements from LightSpeech, which bring its size down to a manageable 76M.

A big upside of this implementation is that it is based on PyTorch Lightning, which makes it easy to do multi-GPU training, load pre-trained models, and more.

LightningFastSpeech couldn't exist without the amazing open source work of many others, for a full list see Attribution.

Current Status

This library is a work in progress, and until v1.0, updates might break things occasionally.

Goals

v0.1

0.1 is right around the corner! The core functionality is already there; what remains are mostly quality-of-life improvements that we should get out of the way now.

  • Replicate original FastSpeech 2 architecture
  • Include depth-wise separable convolutions found in LightSpeech
  • Dataloader which computes prosody features online
  • Synthesis of both individual utterances and whole datasets
  • Configurable training script
  • Configurable synthesis script
  • First large pre-trained model (LibriTTS, 2k speakers, 76M parameters)
  • Documentation & tutorials
  • Allow reporting other than wandb
  • Configurable metrics
  • LJSpeech support
  • PyPI package
  • HiFi-GAN fine-tuning (during and after training)
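The depth-wise separable convolutions mentioned above are the key LightSpeech trick for shrinking the model. A minimal sketch in PyTorch, with illustrative channel counts and kernel size (not the values used in this repo):

```python
# A depth-wise separable 1-D convolution: a per-channel (depth-wise) conv
# followed by a 1x1 (point-wise) conv, replacing one dense convolution.
import torch
from torch import nn


class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # Depth-wise: one filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels,
        )
        # Point-wise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv1d(in_channels, out_channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


separable = DepthwiseSeparableConv1d(256, 256, 9)
regular = nn.Conv1d(256, 256, 9, padding=4)

x = torch.randn(1, 256, 100)
out = separable(x)  # torch.Size([1, 256, 100]), same shape as regular(x)
```

The separable version produces the same output shape with far fewer parameters (roughly 68K versus 590K for these sizes), which is where much of the 1B-to-76M reduction comes from.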

v1.0

It will take a while to get to 1.0 -- the goal for this version is to let everyone easily fine-tune our models and do controllable synthesis of utterances.

  • Allow models to be loaded from the Huggingface hub.
  • Streamlit interface for synthesising utterances and generating datasets.
  • Tract and tractjs integration to export models for on-device and web use.
  • Make it easy to add new datasets and to fine-tune models with them.
  • Add HiFi-GAN fine-tuning to the pipeline.
  • A range of pre-trained models with different domains and sizes (e.g. multi-lingual, noisy/clean)

Attribution

This would not be possible without the many amazing open source projects already present in the TTS space -- please cite their work when appropriate!
