🤖 AUTO TTS #696

Closed
wants to merge 132 commits
88d782e
Skip phoneme cache pre-compute if the path exists
erogol Jul 27, 2021
66bcb8f
Fix json formatting error
Moldoteck Aug 2, 2021
cf55494
Update distribute.py (#685)
astricks Aug 2, 2021
168a9a6
Merge pull request #694 from Moldoteck/patch-1
erogol Aug 2, 2021
c6e5121
nice way to train complete recipes with three lines of code. currentl…
loganhart02 Aug 3, 2021
8c2a2c5
added command line training file for ljspeech tacotron2 models.
loganhart02 Aug 6, 2021
1de4995
added SCGlowTts to model hub
loganhart02 Aug 6, 2021
55fbcdb
added recipe for sc glow tts training on vctk and tacotron2 models fo…
loganhart02 Aug 8, 2021
42bd0dc
hifi gan and wavegrad configs for and ljspeech vocoder recipe
loganhart02 Aug 12, 2021
b84301d
tacotron2 config
loganhart02 Aug 14, 2021
ae22b23
Merge branch 'main' of https://github.com/coqui-ai/TTS into recipe_api
loganhart02 Aug 15, 2021
7d9029d
Merge branch 'coqui-ai:main' into recipe_api
loganhart02 Aug 15, 2021
807d432
Merge remote-tracking branch 'origin/recipe_api' into recipe_api
loganhart02 Aug 15, 2021
42fb3fd
made ljspeech tts trainer for all models( define models by string ins…
loganhart02 Aug 15, 2021
55d9fc3
made ljspeech tts and vocoder trainer.
loganhart02 Aug 16, 2021
68975bd
vocoder command line file
loganhart02 Aug 19, 2021
dd80696
added vits(forgot to push this in my last commit)
loganhart02 Aug 23, 2021
88e2e39
added vits default model
loganhart02 Sep 1, 2021
e5c3c27
Merge branch 'coqui-ai:main' into recipe_api
loganhart02 Sep 1, 2021
e4b08ce
nice way to train complete recipes with three lines of code. currentl…
loganhart02 Aug 3, 2021
391348d
added command line training file for ljspeech tacotron2 models.
loganhart02 Aug 6, 2021
d4f1551
added SCGlowTts to model hub
loganhart02 Aug 6, 2021
71e9ea3
added recipe for sc glow tts training on vctk and tacotron2 models fo…
loganhart02 Aug 8, 2021
17a8dd6
hifi gan and wavegrad configs for and ljspeech vocoder recipe
loganhart02 Aug 12, 2021
6f0ab09
tacotron2 config
loganhart02 Aug 14, 2021
3a16744
Update default ja vocoder
erogol Aug 11, 2021
bc967eb
made ljspeech tts trainer for all models( define models by string ins…
loganhart02 Aug 15, 2021
3af309e
made ljspeech tts and vocoder trainer.
loganhart02 Aug 16, 2021
fb4cf15
vocoder command line file
loganhart02 Aug 19, 2021
fe301dc
added vits(forgot to push this in my last commit)
loganhart02 Aug 23, 2021
3e862d8
Merge branch 'recipe_api' of https://github.com/loganhart420/TTS into…
loganhart02 Sep 2, 2021
6bbd811
Merge branch 'coqui-ai:main' into recipe_api
loganhart02 Sep 6, 2021
c5d2334
nice way to train complete recipes with three lines of code. currentl…
loganhart02 Aug 3, 2021
51380ec
added command line training file for ljspeech tacotron2 models.
loganhart02 Aug 6, 2021
4275b96
added recipe for sc glow tts training on vctk and tacotron2 models fo…
loganhart02 Aug 8, 2021
e47f763
hifi gan and wavegrad configs for and ljspeech vocoder recipe
loganhart02 Aug 12, 2021
a7f1ac0
made ljspeech tts trainer for all models( define models by string ins…
loganhart02 Aug 15, 2021
2cdaf08
made ljspeech tts and vocoder trainer.
loganhart02 Aug 16, 2021
0b5f397
vocoder command line file
loganhart02 Aug 19, 2021
40b9eaa
changed camel case to _. created single speaker and multispeaker auto…
loganhart02 Sep 2, 2021
9155412
made single speaker vocoder trainer function
loganhart02 Sep 7, 2021
95e978d
nice way to train complete recipes with three lines of code. currentl…
loganhart02 Aug 3, 2021
190b308
added command line training file for ljspeech tacotron2 models.
loganhart02 Aug 6, 2021
5aa9002
added SCGlowTts to model hub
loganhart02 Aug 6, 2021
fac98ca
added recipe for sc glow tts training on vctk and tacotron2 models fo…
loganhart02 Aug 8, 2021
ff2f974
hifi gan and wavegrad configs for and ljspeech vocoder recipe
loganhart02 Aug 12, 2021
ed38be6
tacotron2 config
loganhart02 Aug 14, 2021
68f7357
Update default ja vocoder
erogol Aug 11, 2021
8ab2967
made ljspeech tts trainer for all models( define models by string ins…
loganhart02 Aug 15, 2021
d6f5da8
made ljspeech tts and vocoder trainer.
loganhart02 Aug 16, 2021
485daec
vocoder command line file
loganhart02 Aug 19, 2021
e82dc31
added vits(forgot to push this in my last commit)
loganhart02 Aug 23, 2021
7b67e79
nice way to train complete recipes with three lines of code. currentl…
loganhart02 Aug 3, 2021
d452288
Update Japanese phonemizer (#758)
kaiidams Sep 1, 2021
85cfbe6
Compute F0 using librosa
erogol Jul 6, 2021
cfcba2a
Add FastPitchLoss
erogol Jul 7, 2021
1b67fe2
Add comput_f0 field
erogol Jul 7, 2021
f3a3893
Fix `compute_attention_masks.py`
erogol Jul 12, 2021
5f7f383
Fix configs
erogol Jul 12, 2021
ac9ac9b
Fix `FastPitchLoss`
erogol Jul 12, 2021
bda1409
Fix `base_tacotron` `aux_input` handling
erogol Jul 12, 2021
3af4eda
Cache pitch features
erogol Jul 12, 2021
0da267b
Set BaseDatasetConfig for tests
erogol Jul 12, 2021
43b3ff7
Don't print computed phonemes
erogol Jul 13, 2021
306de95
Compute mean and std pitch
erogol Jul 14, 2021
a478f68
Add FastPitch LJSpeech recipe
erogol Jul 14, 2021
8978751
Add yin based pitch computation
erogol Jul 14, 2021
5a4e98a
Add FastPitch model and FastPitchconfig
erogol Jul 14, 2021
5beb0b3
Use absolute paths of the attention masks
erogol Jul 16, 2021
6ccbf9f
Fix SpeakerManager usage in `synthesize.py`
erogol Jul 16, 2021
a6082ba
Restore `last_epoch` of the scheduler
erogol Jul 16, 2021
3933208
Make optional to detach duration predictor input
erogol Jul 20, 2021
407bb78
Update docstring format
erogol Jul 22, 2021
69d7807
Update FastPitch config
erogol Jul 22, 2021
d27ef8d
Update docstring format
erogol Jul 22, 2021
39897f0
Update docstrings
erogol Jul 22, 2021
85e98d2
Update FastPitchLoss
erogol Jul 22, 2021
d7042b6
Don't use align_score for models with duration predictor
erogol Jul 22, 2021
2b06370
Format style of the recipes
erogol Jul 22, 2021
d51f84f
Refactor FastPitch model
erogol Jul 22, 2021
73d0bb8
Update FastPitch don't detach duration network inputs
erogol Jul 24, 2021
302ed58
Enable aligner for FastPitch
erogol Jul 24, 2021
ff6725b
Refactor FastPitchv2
erogol Jul 26, 2021
f125e0d
Disable autcast for criterions
erogol Sep 1, 2021
a8ac1c4
Add `sort_by_audio_len` option
erogol Sep 3, 2021
9f43989
Implement binary alignment loss
erogol Sep 3, 2021
ec7c77b
Add `PitchExtractor` and return dict by `collate`
erogol Sep 3, 2021
80bf855
Add `AlignerNetwork`
erogol Sep 3, 2021
d8ef0f5
FastPitch refactor and commenting
erogol Sep 3, 2021
d13d5bc
Update `generic.FFTransformer`
erogol Sep 3, 2021
af26a5d
Add tests for certain FastPitch functions
erogol Sep 3, 2021
8f7cc47
Update `PositionalEncoding`
erogol Sep 3, 2021
8e581de
Integrate Scarf pixel
erogol Sep 4, 2021
8311558
Update README.md format
erogol Sep 6, 2021
c46e987
Refactor TTSDataset
erogol Sep 6, 2021
00aa649
Fix attn mask reading bug
erogol Sep 6, 2021
b575e85
Fix loader setup in `base_tts`
erogol Sep 6, 2021
181a781
Plot unnormalized pitch by `FastPitch`
erogol Sep 6, 2021
eb4717e
Reformat multi-speaker handling in GlowTTS
erogol Sep 6, 2021
d0d8fd2
Plot pitch over spectrogram
erogol Sep 6, 2021
09cc932
Use pyworld for pitch
erogol Sep 6, 2021
6c8184b
Update loader tests for dict return
erogol Sep 6, 2021
6cb6034
Fix linter issues
erogol Sep 6, 2021
85c3c26
Add FastPitch model to `.models.json`
erogol Sep 6, 2021
1e118c2
Bump up to v0.2.2
erogol Sep 6, 2021
d155e97
added command line training file for ljspeech tacotron2 models.
loganhart02 Aug 6, 2021
dc69fc1
added recipe for sc glow tts training on vctk and tacotron2 models fo…
loganhart02 Aug 8, 2021
97b94ff
hifi gan and wavegrad configs for and ljspeech vocoder recipe
loganhart02 Aug 12, 2021
b85160b
made ljspeech tts trainer for all models( define models by string ins…
loganhart02 Aug 15, 2021
fd63352
made ljspeech tts and vocoder trainer.
loganhart02 Aug 16, 2021
2672bb5
vocoder command line file
loganhart02 Aug 19, 2021
16ff386
changed camel case to _. created single speaker and multispeaker auto…
loganhart02 Sep 2, 2021
ee19679
made single speaker vocoder trainer function
loganhart02 Sep 7, 2021
026249d
Merge branch 'recipe_api' of https://github.com/loganhart420/TTS into…
loganhart02 Sep 7, 2021
7c4dbf7
refactored model configs
loganhart02 Sep 12, 2021
13f91aa
refactored model configs
loganhart02 Sep 12, 2021
b40e0ba
refactored model configs
loganhart02 Sep 12, 2021
12fe775
refactored model configs
loganhart02 Sep 12, 2021
a9945d4
refactored model configs
loganhart02 Sep 12, 2021
fdc9220
refactored model configs
loganhart02 Sep 12, 2021
591df9a
Merge branch 'coqui-ai:main' into recipe_api
loganhart02 Sep 13, 2021
05e443e
added args and usage docs
loganhart02 Sep 18, 2021
e1e70a0
Merge branch 'coqui-ai:main' into recipe_api
loganhart02 Sep 18, 2021
3062c80
documentation for how to train
loganhart02 Sep 21, 2021
49986d2
added tacotron2 multispeaker model
loganhart02 Oct 18, 2021
77dda0f
added pretrained model loading and ljspeech fast pitch model config. …
loganhart02 Oct 21, 2021
40c17b2
Merge pull request #891 from coqui-ai/dev
erogol Oct 26, 2021
b31d147
Merge branch 'main' of https://github.com/coqui-ai/TTS into recipe_api
loganhart02 Oct 28, 2021
90cb11a
This is just a class to download some public tts public datasets. st…
loganhart02 Oct 28, 2021
450d450
make style
loganhart02 Oct 28, 2021
33aa27e
Merge pull request #901 from coqui-ai/dev
erogol Oct 29, 2021
5d5ed65
Merge branch 'coqui-ai:main' into recipe_api
loganhart02 Nov 2, 2021
309 changes: 309 additions & 0 deletions TTS/auto_tts/complete_recipes.py
@@ -0,0 +1,309 @@
import zipfile

import requests
import tqdm

from TTS.auto_tts.model_hub import TtsModels, VocoderModels
from TTS.auto_tts.utils import data_loader
from TTS.trainer import Trainer, TrainingArgs, init_training


class TtsAutoTrainer(TtsModels):
"""
Args:

data_path (str):
The path to the dataset. Defaults to None.

dataset (str):
The dataset identifier. ex: ljspeech would be "ljspeech". Defaults to None.
See auto_tts utils for specific dataset names.

batch_size (int):
The size of the batches you pass to the model. This will depend on GPU memory;
less than 32 is not recommended. Defaults to 32.

output_path (str):
The path where you want to save the model config and model weights. If it is None it will
use your current directory. Defaults to None.

mixed_precision (bool):
Enables mixed precision training, which can allow bigger batch sizes and faster training,
but can also make some trainings unstable. Defaults to False.

learning_rate (float):
The learning rate for the model. Defaults to 1e-3.

epochs (int):
How many times you want the model to go through the entire dataset. This usually doesn't need changing.
Defaults to 1000.

Usage:
Python:
from TTS.auto_tts.complete_recipes import TtsAutoTrainer
trainer = TtsAutoTrainer(data_path='DEFINE THIS', dataset="DEFINE THIS", batch_size=32, learning_rate=0.001,
mixed_precision=False, output_path='DEFINE THIS', epochs=1000)
model = trainer.single_speaker_autotts("tacotron2", tacotron2_model_type="double decoder consistency")
model.fit()

command line:
python single_speaker_autotts.py --data_path ../LJSpeech-1.1 --dataset ljspeech --batch_size 32 --mixed_precision
--model tacotron2 --tacotron2_model_type "double decoder consistency" --forward_attention
--location_attention

"""

def __init__(
self,
data_path=None,
dataset=None,
batch_size=32,
output_path=None,
mixed_precision=False,
learning_rate=1e-3,
epochs=1000,
):

super().__init__(batch_size, mixed_precision, learning_rate, epochs, output_path)
self.data_path = data_path
self.dataset_name = dataset

def single_speaker_autotts(  # im actually going to change this to autotts_recipes and i'm making a more generic

[Review comment, Member]: Even for personal notes

# single_speaker_autotts cause it's gonna get too clunky when implementing fine tuning
# all in the same function. it'll be finished in the next commit
self,
model_name,
stats_path=None,
tacotron2_model_type=None,
glow_tts_encoder=None,

[Review comment, Member]: I think you can add encoder and decoder config for ForwardTTS here

forward_attention=False,
location_attention=True,
pretrained=False,
):
"""

Args:
model_name (str):
Name of the model you want to train.


stats_path (str):
Optional, Stats path for the audio config if the model uses it. Defaults to None.


tacotron2_model_type (str):
Optional, Type of tacotron2 model you want to train, either double decoder consistency
or dynamic convolution attention. Defaults to None.


glow_tts_encoder (str):
Optional, Type of encoder to train glow tts with. either transformer, gated,
residual_bn, or time_depth. Defaults to None.


forward_attention (bool):
Optional, Whether to use forward attention on tacotron2 models.
Usually makes the model align faster. Defaults to False.


location_attention (bool):
Optional, Whether to use location attention on Tacotron2 models. Defaults to True.


pretrained (bool):
Whether to start from a pretrained model. This is recommended if you are training on
custom data. Defaults to False.

"""

audio, dataset = data_loader(name=self.dataset_name, path=self.data_path, stats_path=stats_path)
if self.dataset_name == "ljspeech":
if model_name == "tacotron2":
if tacotron2_model_type == "double decoder consistency":
model_config = self._single_speaker_tacotron2_DDC(
audio, dataset, forward_attn=forward_attention, location_attn=location_attention
)
elif tacotron2_model_type == "dynamic convolution attention":
model_config = self._single_speaker_tacotron2_DCA(
audio, dataset, forward_attn=forward_attention, location_attn=location_attention
)
else:
model_config = self._single_speaker_tacotron2_base(
audio, dataset, forward_attn=forward_attention, location_attn=location_attention
)
elif model_name == "glow tts":
model_config = self._single_speaker_glow_tts(audio, dataset, encoder=glow_tts_encoder)
elif model_name == "vits tts":
model_config = self._single_speaker_vits_tts(audio, dataset)
elif model_name == "fast pitch":
model_config = self._ljspeech_fast_fastpitch(audio, dataset)
elif self.dataset_name == "baker":
if model_name == "tacotron2":
if tacotron2_model_type == "double decoder consistency":
model_config = self._single_speaker_tacotron2_DDC(
audio,
dataset,
pla=0.5,
dla=0.5,
ga=0.0,
forward_attn=forward_attention,
location_attn=location_attention,
)
elif tacotron2_model_type == "dynamic convolution attention":
model_config = self._single_speaker_tacotron2_DCA(
audio,
dataset,
pla=0.5,
dla=0.5,
ga=0.0,
forward_attn=forward_attention,
location_attn=location_attention,
)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), model_config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
return trainer
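One fragile spot in the branching above: if `model_name` (or `self.dataset_name`) matches no branch, `model_config` is never assigned and the `init_training` call dies with an opaque `UnboundLocalError`. A minimal fail-fast guard could look like the sketch below; the `KNOWN_MODELS` set and `check_model_name` helper are illustrative, not part of this PR:

```python
# Names accepted by single_speaker_autotts in this sketch (illustrative subset).
KNOWN_MODELS = {"tacotron2", "glow tts", "vits tts", "fast pitch"}

def check_model_name(model_name):
    """Raise a readable error up front instead of an UnboundLocalError later."""
    if model_name not in KNOWN_MODELS:
        raise ValueError(
            f"Unknown model name {model_name!r}; expected one of {sorted(KNOWN_MODELS)}"
        )
    return model_name
```

Calling such a guard at the top of the method would turn a confusing traceback into an actionable message.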

def multi_speaker_autotts(
[Review comment, Member]: You can add ForwardTTS model too for multi-speaker

self, model_name, speaker_file, glowtts_encoder=None, r=2, forward_attn=True, location_attn=False
):
"""

Args:
location_attn (bool):
Enable location attention for the tacotron2 model. Defaults to False.


r (int):
Set the reduction factor (r) for the tacotron2 model. Defaults to 2.


forward_attn (bool):
Set forward attention for the tacotron2 model. Defaults to True.


model_name (str):
Name of the model you want to train.


speaker_file (str):
Path to either the d_vector file for glow_tts or the speaker ids file for vits.


glowtts_encoder (str):
Optional, which encoder you want the glow tts model to use. Defaults to None.

"""
audio, dataset = data_loader(name=self.dataset_name, path=self.data_path, stats_path=None)
if self.dataset_name == "vctk":
if model_name == "glow tts":
model_config = self._sc_glow_tts(audio, dataset, speaker_file, encoder=glowtts_encoder)
elif model_name == "vits tts":
model_config = self._vctk_vits_tts(audio, dataset, speaker_file)
elif model_name == "tacotron2":
model_config = self._multi_speaker_vctk_tacotron2(
audio, dataset, speaker_file, r=r, forward_attn=forward_attn, location_attn=location_attn
)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), model_config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
return trainer
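The branching in both trainer methods is a name-to-builder dispatch; a dict of callables can express the same mapping with less nesting and a built-in unknown-name check. A sketch under the assumption that each builder is a zero-argument callable here (the real `_sc_glow_tts` / `_vctk_vits_tts`-style builders take `audio`, `dataset`, and more):

```python
def build_config(model_name, builders):
    """Dispatch to a config builder by model name; builders maps name -> callable."""
    try:
        builder = builders[model_name]
    except KeyError:
        raise ValueError(
            f"Unknown model: {model_name!r}; expected one of {sorted(builders)}"
        ) from None
    return builder()

# Illustrative stand-ins for the private builder methods:
builders = {
    "glow tts": lambda: {"model": "glow_tts"},
    "vits tts": lambda: {"model": "vits"},
    "tacotron2": lambda: {"model": "tacotron2"},
}
```

Adding a new model then means adding one dict entry instead of another `elif` branch.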


class VocoderAutoTrainer(VocoderModels):
"""
Args:

data_path (str):
The path to the dataset. Defaults to None.

dataset (str):
The dataset identifier. ex: ljspeech would be "ljspeech". Defaults to None.
See auto_tts utils for specific dataset names.

batch_size (int):
The size of the batches you pass to the model. This will depend on GPU memory;
less than 32 is not recommended. Defaults to 32.

output_path (str):
The path where you want to save the model config and model weights. If it is None it will
use your current directory. Defaults to None.

mixed_precision (bool):
Enables mixed precision training, which can allow bigger batch sizes and faster training,
but can also make some trainings unstable. Defaults to False.

learning_rate (List[float, float]):
The learning rates for the model. This should be a list with the generator rate first
and the discriminator rate second. Defaults to [1e-3, 1e-3].

epochs (int):
How many times you want the model to go through the entire dataset. This usually doesn't need changing.
Defaults to 1000.

Usage:
Python:
from TTS.auto_tts.complete_recipes import VocoderAutoTrainer
trainer = VocoderAutoTrainer(data_path='DEFINE THIS', dataset="DEFINE THIS",
batch_size=32, learning_rate=[1e-3, 1e-3],
mixed_precision=False, output_path='DEFINE THIS', epochs=1000)
model = trainer.single_speaker_autotts("hifigan")
model.fit()

command line:
python vocoder_autotts.py --data_path ../LJSpeech-1.1 --dataset ljspeech --batch_size 32 --mixed_precision
--model hifigan

"""

def __init__(
self,
data_path=None,
dataset=None,
batch_size=32,
output_path=None,
mixed_precision=False,
learning_rate=None,
epochs=1000,
):
if learning_rate is None:
learning_rate = [0.001, 0.001]
super().__init__(
batch_size,
mixed_precision,
generator_learning_rate=learning_rate[0],
discriminator_learning_rate=learning_rate[1],
epochs=epochs,
output_path=output_path,
)
self.data_path: str = data_path
self.dataset_name: str = dataset
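The `learning_rate=None` default followed by the in-body `[0.001, 0.001]` assignment is the standard Python idiom for avoiding a shared mutable default argument: a list literal in the signature is created once at function definition and reused across every call. A small self-contained illustration (function names are made up for the demo):

```python
def rates_shared(rates=[0.001, 0.001]):
    # BUG pattern: the same list object is reused on every call,
    # so mutations leak between calls.
    rates[0] *= 2
    return rates

def rates_fresh(rates=None):
    # Idiom used in VocoderAutoTrainer.__init__: build a fresh list per call.
    if rates is None:
        rates = [0.001, 0.001]
    rates[0] *= 2
    return rates
```

With the buggy pattern, a second call sees the first call's mutated list; with the `None` idiom, every call starts from a clean default.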

def single_speaker_autotts(self, model_name, stats_path=None):
"""
Args:

model_name (str):
name of the model you want to train.

stats_path (str):
Optional, Path to the stats file for the audio config. Defaults to None.

"""
if self.dataset_name == "ljspeech":
audio, _ = data_loader(name="ljspeech", path=self.data_path, stats_path=stats_path)
if model_name == "hifigan":
model_config = self._hifi_gan(audio, self.data_path)
elif model_name == "wavegrad":
model_config = self._wavegrad(audio, self.data_path)
elif model_name == "univnet":
model_config = self._univnet(audio, self.data_path)
elif model_name == "multiband melgan":
model_config = self._multiband_melgan(audio, self.data_path)
elif model_name == "wavernn":
model_config = self._wavernn(audio, self.data_path)
args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), model_config)
trainer = Trainer(args, config, output_path, c_logger, tb_logger)
return trainer

def from_pretrained(self, model_name):
pass
12 changes: 12 additions & 0 deletions TTS/auto_tts/example.py
@@ -0,0 +1,12 @@
from TTS.utils.manage import ModelManager

manager = ModelManager()
model_path, config_path, x = manager.download_model("tts_models/en/ljspeech/tacotron2-DCA")

print(model_path)

print(config_path)

print(x)

manager.list_models()