Speaker Encoder

TTS has a subproject called Speaker Encoder. It is an implementation of https://arxiv.org/abs/1710.10467. A released model, trained on the LibriTTS dataset with ~1000 speakers, is also available on the Released Models page.

You can use this model for various purposes:

Training a multi-speaker model using voice embeddings as speaker features.
- Compute embedding vectors with compute_embedding.py and feed them to your TTS network. (The TTS side still needs to be implemented, but it should be straightforward.)
Pruning bad examples from your TTS dataset.
- Compute embedding vectors and plot them using the notebook provided. Thanks to @nmstoker for this!
Use as a speaker classification or verification system.
Speaker diarization for ASR systems.
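
The verification and pruning use cases above both come down to comparing embedding vectors. Below is a minimal sketch of that idea in plain Python, assuming the embeddings have already been computed and loaded as lists of floats. The helper names (`same_speaker`, `flag_outliers`), the thresholds, and the toy 3-dimensional vectors are all hypothetical and chosen only for illustration; they are not part of this project, and real thresholds would need to be tuned on your own data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise mean of a list of embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def same_speaker(a, b, threshold=0.75):
    """Hypothetical verification rule: accept if similarity >= threshold."""
    return cosine_similarity(a, b) >= threshold

def flag_outliers(vectors, min_similarity=0.7):
    """Hypothetical pruning rule: return indices of utterances whose
    embedding is far (in cosine terms) from the speaker's centroid."""
    c = centroid(vectors)
    return [i for i, v in enumerate(vectors)
            if cosine_similarity(v, c) < min_similarity]

# Toy embeddings for one speaker; the third utterance is deliberately off.
speaker_utterances = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

suspect_indices = flag_outliers(speaker_utterances)
is_match = same_speaker(speaker_utterances[0], speaker_utterances[1])
```

Here `suspect_indices` lists utterances worth inspecting before training, and `is_match` is a simple yes/no verification decision. Comparing against a per-speaker centroid rather than against every other utterance keeps the check linear in the number of utterances.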

The model provided here is half the size of the baseline model. I found that it is easier to train, and its final performance does not differ much from that of the larger version.
