Add VITS and GlowTTS class docs 🗒️
erogol committed Aug 9, 2021
1 parent 7a37404 commit 0601825
Showing 3 changed files with 61 additions and 1 deletion.
7 changes: 6 additions & 1 deletion TTS/tts/models/vits.py
```diff
@@ -304,7 +304,12 @@ def __init__(self, config: Coqpit):

     if args.use_sdp:
         self.duration_predictor = StochasticDurationPredictor(
-            args.hidden_channels, 192, 3, args.dropout_p_duration_predictor, 4, cond_channels=self.embedded_speaker_dim
+            args.hidden_channels,
+            192,
+            3,
+            args.dropout_p_duration_predictor,
+            4,
+            cond_channels=self.embedded_speaker_dim,
         )
     else:
         self.duration_predictor = DurationPredictor(
```
22 changes: 22 additions & 0 deletions docs/source/models/glow_tts.md
@@ -0,0 +1,22 @@
# Glow TTS

Glow TTS is a normalizing flow model for text-to-speech. It is built on the generic Glow model previously
used in computer vision and vocoder models. It uses Monotonic Alignment Search (MAS) to find the text-to-speech alignment
and uses that alignment to train a separate duration predictor network for faster inference at run-time.
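To give a rough feel for the idea, here is a minimal pure-Python sketch of MAS-style dynamic programming. This is a simplified illustration, not the GlowTTS implementation (which is vectorized and scores mel frames against the flow's latent distributions): given a matrix of per-frame log-likelihoods, it finds the monotonic path that maximizes the total score.

```python
def monotonic_alignment_search(log_probs):
    """Find the monotonic alignment maximizing total log-likelihood.

    log_probs[i][j]: log-likelihood of text token i generating mel frame j.
    Returns a list `path` where path[j] is the text index aligned to frame j.
    The path starts at token 0, ends at the last token, and per frame either
    stays on the same token or advances by one (monotonicity).
    """
    T = len(log_probs)       # number of text tokens
    N = len(log_probs[0])    # number of mel frames
    NEG_INF = float("-inf")
    # Q[i][j]: best cumulative score of a path ending at token i, frame j
    Q = [[NEG_INF] * N for _ in range(T)]
    Q[0][0] = log_probs[0][0]
    for j in range(1, N):
        for i in range(min(j + 1, T)):  # token index cannot exceed frame index
            stay = Q[i][j - 1]
            move = Q[i - 1][j - 1] if i > 0 else NEG_INF
            Q[i][j] = max(stay, move) + log_probs[i][j]
    # Backtrack from the last token at the last frame
    path = [0] * N
    i = T - 1
    path[N - 1] = i
    for j in range(N - 1, 0, -1):
        if i > 0 and Q[i - 1][j - 1] >= Q[i][j - 1]:
            i -= 1
        path[j - 1] = i
    return path
```

Per-token durations for the duration predictor then fall out of the path simply by counting how many frames map to each token.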

## Important resources & papers
- GlowTTS: https://arxiv.org/abs/2005.11129
- Glow (Generative Flow with invertible 1x1 Convolutions): https://arxiv.org/abs/1807.03039
- Normalizing Flows: https://blog.evjang.com/2018/01/nf1.html

## GlowTTS Config
```{eval-rst}
.. autoclass:: TTS.tts.configs.glow_tts_config.GlowTTSConfig
:members:
```

## GlowTTS Model
```{eval-rst}
.. autoclass:: TTS.tts.models.glow_tts.GlowTTS
:members:
```
33 changes: 33 additions & 0 deletions docs/source/models/vits.md
@@ -0,0 +1,33 @@
# VITS

VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an
end-to-end (encoder and vocoder trained together) TTS model that takes advantage of SOTA deep learning techniques such as
GANs, VAEs, and Normalizing Flows. It does not require external alignment annotations and learns the text-to-audio alignment
using MAS, as explained in the paper. The model architecture combines the GlowTTS encoder and the HiFiGAN vocoder.
It is a feed-forward model with an x67.12 real-time factor on a GPU.

## Important resources & papers
- VITS: https://arxiv.org/pdf/2106.06103.pdf
- Neural Spline Flows: https://arxiv.org/abs/1906.04032
- Variational Autoencoder: https://arxiv.org/pdf/1312.6114.pdf
- Generative Adversarial Networks: https://arxiv.org/abs/1406.2661
- HiFiGAN: https://arxiv.org/abs/2010.05646
- Normalizing Flows: https://blog.evjang.com/2018/01/nf1.html

## VitsConfig
```{eval-rst}
.. autoclass:: TTS.tts.configs.vits_config.VitsConfig
:members:
```

## VitsArgs
```{eval-rst}
.. autoclass:: TTS.tts.models.vits.VitsArgs
:members:
```

## Vits Model
```{eval-rst}
.. autoclass:: TTS.tts.models.vits.Vits
:members:
```
