🤖 AUTO TTS #696
Closed
Changes from all commits (132 commits):

88d782e  Skip phoneme cache pre-compute if the path exists (erogol)
66bcb8f  Fix json formatting error (Moldoteck)
cf55494  Update distribute.py (#685) (astricks)
168a9a6  Merge pull request #694 from Moldoteck/patch-1 (erogol)
c6e5121  nice way to train complete recipes with three lines of code. currentl… (loganhart02)
8c2a2c5  added command line training file for ljspeech tacotron2 models. (loganhart02)
1de4995  added SCGlowTts to model hub (loganhart02)
55fbcdb  added recipe for sc glow tts training on vctk and tacotron2 models fo… (loganhart02)
42bd0dc  hifi gan and wavegrad configs for and ljspeech vocoder recipe (loganhart02)
b84301d  tacotron2 config (loganhart02)
ae22b23  Merge branch 'main' of https://github.com/coqui-ai/TTS into recipe_api (loganhart02)
7d9029d  Merge branch 'coqui-ai:main' into recipe_api (loganhart02)
807d432  Merge remote-tracking branch 'origin/recipe_api' into recipe_api (loganhart02)
42fb3fd  made ljspeech tts trainer for all models( define models by string ins… (loganhart02)
55d9fc3  made ljspeech tts and vocoder trainer. (loganhart02)
68975bd  vocoder command line file (loganhart02)
dd80696  added vits(forgot to push this in my last commit) (loganhart02)
88e2e39  added vits default model (loganhart02)
e5c3c27  Merge branch 'coqui-ai:main' into recipe_api (loganhart02)
e4b08ce  nice way to train complete recipes with three lines of code. currentl… (loganhart02)
391348d  added command line training file for ljspeech tacotron2 models. (loganhart02)
d4f1551  added SCGlowTts to model hub (loganhart02)
71e9ea3  added recipe for sc glow tts training on vctk and tacotron2 models fo… (loganhart02)
17a8dd6  hifi gan and wavegrad configs for and ljspeech vocoder recipe (loganhart02)
6f0ab09  tacotron2 config (loganhart02)
3a16744  Update default ja vocoder (erogol)
bc967eb  made ljspeech tts trainer for all models( define models by string ins… (loganhart02)
3af309e  made ljspeech tts and vocoder trainer. (loganhart02)
fb4cf15  vocoder command line file (loganhart02)
fe301dc  added vits(forgot to push this in my last commit) (loganhart02)
3e862d8  Merge branch 'recipe_api' of https://github.com/loganhart420/TTS into… (loganhart02)
6bbd811  Merge branch 'coqui-ai:main' into recipe_api (loganhart02)
c5d2334  nice way to train complete recipes with three lines of code. currentl… (loganhart02)
51380ec  added command line training file for ljspeech tacotron2 models. (loganhart02)
4275b96  added recipe for sc glow tts training on vctk and tacotron2 models fo… (loganhart02)
e47f763  hifi gan and wavegrad configs for and ljspeech vocoder recipe (loganhart02)
a7f1ac0  made ljspeech tts trainer for all models( define models by string ins… (loganhart02)
2cdaf08  made ljspeech tts and vocoder trainer. (loganhart02)
0b5f397  vocoder command line file (loganhart02)
40b9eaa  changed camel case to _. created single speaker and multispeaker auto… (loganhart02)
9155412  made single speaker vocoder trainer function (loganhart02)
95e978d  nice way to train complete recipes with three lines of code. currentl… (loganhart02)
190b308  added command line training file for ljspeech tacotron2 models. (loganhart02)
5aa9002  added SCGlowTts to model hub (loganhart02)
fac98ca  added recipe for sc glow tts training on vctk and tacotron2 models fo… (loganhart02)
ff2f974  hifi gan and wavegrad configs for and ljspeech vocoder recipe (loganhart02)
ed38be6  tacotron2 config (loganhart02)
68f7357  Update default ja vocoder (erogol)
8ab2967  made ljspeech tts trainer for all models( define models by string ins… (loganhart02)
d6f5da8  made ljspeech tts and vocoder trainer. (loganhart02)
485daec  vocoder command line file (loganhart02)
e82dc31  added vits(forgot to push this in my last commit) (loganhart02)
7b67e79  nice way to train complete recipes with three lines of code. currentl… (loganhart02)
d452288  Update Japanese phonemizer (#758) (kaiidams)
85cfbe6  Compute F0 using librosa (erogol)
cfcba2a  Add FastPitchLoss (erogol)
1b67fe2  Add comput_f0 field (erogol)
f3a3893  Fix `compute_attention_masks.py` (erogol)
5f7f383  Fix configs (erogol)
ac9ac9b  Fix `FastPitchLoss` (erogol)
bda1409  Fix `base_tacotron` `aux_input` handling (erogol)
3af4eda  Cache pitch features (erogol)
0da267b  Set BaseDatasetConfig for tests (erogol)
43b3ff7  Don't print computed phonemes (erogol)
306de95  Compute mean and std pitch (erogol)
a478f68  Add FastPitch LJSpeech recipe (erogol)
8978751  Add yin based pitch computation (erogol)
5a4e98a  Add FastPitch model and FastPitchconfig (erogol)
5beb0b3  Use absolute paths of the attention masks (erogol)
6ccbf9f  Fix SpeakerManager usage in `synthesize.py` (erogol)
a6082ba  Restore `last_epoch` of the scheduler (erogol)
3933208  Make optional to detach duration predictor input (erogol)
407bb78  Update docstring format (erogol)
69d7807  Update FastPitch config (erogol)
d27ef8d  Update docstring format (erogol)
39897f0  Update docstrings (erogol)
85e98d2  Update FastPitchLoss (erogol)
d7042b6  Don't use align_score for models with duration predictor (erogol)
2b06370  Format style of the recipes (erogol)
d51f84f  Refactor FastPitch model (erogol)
73d0bb8  Update FastPitch don't detach duration network inputs (erogol)
302ed58  Enable aligner for FastPitch (erogol)
ff6725b  Refactor FastPitchv2 (erogol)
f125e0d  Disable autcast for criterions (erogol)
a8ac1c4  Add `sort_by_audio_len` option (erogol)
9f43989  Implement binary alignment loss (erogol)
ec7c77b  Add `PitchExtractor` and return dict by `collate` (erogol)
80bf855  Add `AlignerNetwork` (erogol)
d8ef0f5  FastPitch refactor and commenting (erogol)
d13d5bc  Update `generic.FFTransformer` (erogol)
af26a5d  Add tests for certain FastPitch functions (erogol)
8f7cc47  Update `PositionalEncoding` (erogol)
8e581de  Integrate Scarf pixel (erogol)
8311558  Update README.md format (erogol)
c46e987  Refactor TTSDataset (erogol)
00aa649  Fix attn mask reading bug (erogol)
b575e85  Fix loader setup in `base_tts` (erogol)
181a781  Plot unnormalized pitch by `FastPitch` (erogol)
eb4717e  Reformat multi-speaker handling in GlowTTS (erogol)
d0d8fd2  Plot pitch over spectrogram (erogol)
09cc932  Use pyworld for pitch (erogol)
6c8184b  Update loader tests for dict return (erogol)
6cb6034  Fix linter issues (erogol)
85c3c26  Add FastPitch model to `.models.json` (erogol)
1e118c2  Bump up to v0.2.2 (erogol)
d155e97  added command line training file for ljspeech tacotron2 models. (loganhart02)
dc69fc1  added recipe for sc glow tts training on vctk and tacotron2 models fo… (loganhart02)
97b94ff  hifi gan and wavegrad configs for and ljspeech vocoder recipe (loganhart02)
b85160b  made ljspeech tts trainer for all models( define models by string ins… (loganhart02)
fd63352  made ljspeech tts and vocoder trainer. (loganhart02)
2672bb5  vocoder command line file (loganhart02)
16ff386  changed camel case to _. created single speaker and multispeaker auto… (loganhart02)
ee19679  made single speaker vocoder trainer function (loganhart02)
026249d  Merge branch 'recipe_api' of https://github.com/loganhart420/TTS into… (loganhart02)
7c4dbf7  refactored model configs (loganhart02)
13f91aa  refactored model configs (loganhart02)
b40e0ba  refactored model configs (loganhart02)
12fe775  refactored model configs (loganhart02)
a9945d4  refactored model configs (loganhart02)
fdc9220  refactored model configs (loganhart02)
591df9a  Merge branch 'coqui-ai:main' into recipe_api (loganhart02)
05e443e  added args and usage docs (loganhart02)
e1e70a0  Merge branch 'coqui-ai:main' into recipe_api (loganhart02)
3062c80  documentation for how to train (loganhart02)
49986d2  added tacotron2 multispeaker model (loganhart02)
77dda0f  added pretrained model loading and ljspeech fast pitch model config. … (loganhart02)
40c17b2  Merge pull request #891 from coqui-ai/dev (erogol)
b31d147  Merge branch 'main' of https://github.com/coqui-ai/TTS into recipe_api (loganhart02)
90cb11a  This is just a class to download some public tts public datasets. st… (loganhart02)
450d450  make style (loganhart02)
33aa27e  Merge pull request #901 from coqui-ai/dev (erogol)
5d5ed65  Merge branch 'coqui-ai:main' into recipe_api (loganhart02)
First new file, +309 lines (the import path in its own usage docs suggests TTS/auto_tts/complete_recipes.py):

```python
import zipfile

import requests
import tqdm

from TTS.auto_tts.model_hub import TtsModels, VocoderModels
from TTS.auto_tts.utils import data_loader
from TTS.trainer import Trainer, TrainingArgs, init_training

class TtsAutoTrainer(TtsModels):
    """
    Args:
        data_path (str):
            The path to the dataset. Defaults to None.

        dataset (str):
            The dataset identifier, e.g. LJSpeech would be "ljspeech". Defaults to None.
            See the auto_tts utils for the specific dataset names.

        batch_size (int):
            The size of the batches you pass to the model. This will depend on GPU memory;
            less than 32 is not recommended. Defaults to 32.

        output_path (str):
            The path where you want the model config and model weights written. If it is
            None, your current directory is used. Defaults to None.

        mixed_precision (bool):
            Enables mixed-precision training, which can allow bigger batch sizes and make
            training faster, but can also make some trainings unstable. Defaults to False.

        learning_rate (float):
            The learning rate for the model. Defaults to 1e-3.

        epochs (int):
            How many times you want the model to go through the entire dataset. This
            usually doesn't need changing. Defaults to 1000.

    Usage:
        Python:
            from TTS.auto_tts.complete_recipes import TtsAutoTrainer
            trainer = TtsAutoTrainer(data_path='DEFINE THIS', dataset="DEFINE THIS", batch_size=32,
                                     learning_rate=0.001, mixed_precision=False,
                                     output_path='DEFINE THIS', epochs=1000)
            model = trainer.single_speaker_autotts("tacotron2", tacotron2_model_type="double decoder consistency")
            model.fit()

        command line:
            python single_speaker_autotts.py --data_path ../LJSpeech-1.1 --dataset ljspeech --batch_size 32 --mixed_precision
                --model tacotron2 --tacotron2_model_type "double decoder consistency" --forward_attention
                --location_attention
    """

    def __init__(
        self,
        data_path=None,
        dataset=None,
        batch_size=32,
        output_path=None,
        mixed_precision=False,
        learning_rate=1e-3,
        epochs=1000,
    ):
        super().__init__(batch_size, mixed_precision, learning_rate, epochs, output_path)
        self.data_path = data_path
        self.dataset_name = dataset

    def single_speaker_autotts(
        # NOTE (author): this is going to change to autotts_recipes, with a more generic
        # single_speaker_autotts, because it gets too clunky when implementing fine-tuning
        # all in the same function; it'll be finished in the next commit.
        self,
        model_name,
        stats_path=None,
        tacotron2_model_type=None,
        glow_tts_encoder=None,
        # review comment: "I think you can add encoder and decoder config for ForwardTTS here"
        forward_attention=False,
        location_attention=True,
        pretrained=False,
    ):
""" | ||
|
||
Args: | ||
model_name (str): | ||
name of the model you want to train. Defaults to None. | ||
|
||
|
||
stats_path (str): | ||
Optional, Stats path for the audio config if the model uses it. Defaults to None. | ||
|
||
|
||
tacotron2_model_type (str): | ||
Optional, Type of tacotron2 model you want to train, either double deocder consistency, | ||
or dynamic convolution attention. Defaults to None. | ||
|
||
|
||
glow_tts_encoder (str): | ||
Optional, Type of encoder to train glow tts with. either transformer, gated, | ||
residual_bn, or time_depth. Defaults to None. | ||
|
||
|
||
forward_attention: | ||
Optional, Whether to use forward attention or not on tacotron2 models, | ||
Usaully makes the model allign faster. Defaults to False. | ||
|
||
|
||
location_attention: | ||
Optional, Whether to use location attention or not on Tacotron2 models. Defaults to True. | ||
|
||
|
||
pretrained (str): | ||
whether to use a pre trained model or not, This is recommended if you are training on | ||
custom data. Defaults to False | ||
|
||
""" | ||
|
||
        audio, dataset = data_loader(name=self.dataset_name, path=self.data_path, stats_path=stats_path)
        if self.dataset_name == "ljspeech":
            if model_name == "tacotron2":
                if tacotron2_model_type == "double decoder consistency":
                    model_config = self._single_speaker_tacotron2_DDC(
                        audio, dataset, forward_attn=forward_attention, location_attn=location_attention
                    )
                elif tacotron2_model_type == "dynamic convolution attention":
                    model_config = self._single_speaker_tacotron2_DCA(
                        audio, dataset, forward_attn=forward_attention, location_attn=location_attention
                    )
                else:
                    model_config = self._single_speaker_tacotron2_base(
                        audio, dataset, forward_attn=forward_attention, location_attn=location_attention
                    )
            elif model_name == "glow tts":
                model_config = self._single_speaker_glow_tts(audio, dataset, encoder=glow_tts_encoder)
            elif model_name == "vits tts":
                model_config = self._single_speaker_vits_tts(audio, dataset)
            elif model_name == "fast pitch":
                model_config = self._ljspeech_fast_fastpitch(audio, dataset)
        elif self.dataset_name == "baker":
            if model_name == "tacotron2":
                if tacotron2_model_type == "double decoder consistency":
                    model_config = self._single_speaker_tacotron2_DDC(
                        audio,
                        dataset,
                        pla=0.5,
                        dla=0.5,
                        ga=0.0,
                        forward_attn=forward_attention,
                        location_attn=location_attention,
                    )
                elif tacotron2_model_type == "dynamic convolution attention":
                    model_config = self._single_speaker_tacotron2_DCA(
                        audio,
                        dataset,
                        pla=0.5,
                        dla=0.5,
                        ga=0.0,
                        forward_attn=forward_attention,
                        location_attn=location_attention,
                    )
        # NOTE: model_config is only bound for the dataset/model combinations handled
        # above; an unsupported combination raises a NameError on the next line.
        args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), model_config)
        trainer = Trainer(args, config, output_path, c_logger, tb_logger)
        return trainer

    def multi_speaker_autotts(
        # review comment: "You can add ForwardTTS model too for multi-speaker"
        self, model_name, speaker_file, glowtts_encoder=None, r=2, forward_attn=True, location_attn=False
    ):
        """
        Args:
            model_name (str):
                Name of the model you want to train.

            speaker_file (str):
                Path to either the d_vector file for glow_tts or the speaker ids file for vits.

            glowtts_encoder (str):
                Optional. Which encoder you want the glow tts model to use. Defaults to None.

            r (int):
                Set r for the tacotron2 model. Defaults to 2.

            forward_attn (bool):
                Enable forward attention for the tacotron2 model. Defaults to True.

            location_attn (bool):
                Enable location attention for the tacotron2 model. Defaults to False.
        """
        audio, dataset = data_loader(name=self.dataset_name, path=self.data_path, stats_path=None)
        if self.dataset_name == "vctk":
            if model_name == "glow tts":
                model_config = self._sc_glow_tts(audio, dataset, speaker_file, encoder=glowtts_encoder)
            elif model_name == "vits tts":
                model_config = self._vctk_vits_tts(audio, dataset, speaker_file)
            elif model_name == "tacotron2":
                model_config = self._multi_speaker_vctk_tacotron2(
                    audio, dataset, speaker_file, r=r, forward_attn=forward_attn, location_attn=location_attn
                )
        args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), model_config)
        trainer = Trainer(args, config, output_path, c_logger, tb_logger)
        return trainer


class VocoderAutoTrainer(VocoderModels):
    """
    Args:
        data_path (str):
            The path to the dataset. Defaults to None.

        dataset (str):
            The dataset identifier, e.g. LJSpeech would be "ljspeech". Defaults to None.
            See the auto_tts utils for the specific dataset names.

        batch_size (int):
            The size of the batches you pass to the model. This will depend on GPU memory;
            less than 32 is not recommended. Defaults to 32.

        output_path (str):
            The path where you want the model config and model weights written. If it is
            None, your current directory is used. Defaults to None.

        mixed_precision (bool):
            Enables mixed-precision training, which can allow bigger batch sizes and make
            training faster, but can also make some trainings unstable. Defaults to False.

        learning_rate (List[float]):
            The learning rates for the model, as a list with the generator rate first and
            the discriminator rate second. Defaults to [1e-3, 1e-3].

        epochs (int):
            How many times you want the model to go through the entire dataset. This
            usually doesn't need changing. Defaults to 1000.

    Usage:
        Python:
            from TTS.auto_tts.complete_recipes import VocoderAutoTrainer
            trainer = VocoderAutoTrainer(data_path='DEFINE THIS', dataset="DEFINE THIS",
                                         batch_size=32, learning_rate=[1e-3, 1e-3],
                                         mixed_precision=False, output_path='DEFINE THIS', epochs=1000)
            model = trainer.single_speaker_autotts("hifigan")
            model.fit()

        command line:
            python vocoder_autotts.py --data_path ../LJSpeech-1.1 --dataset ljspeech --batch_size 32 --mixed_precision
                --model hifigan
    """

    def __init__(
        self,
        data_path=None,
        dataset=None,
        batch_size=32,
        output_path=None,
        mixed_precision=False,
        learning_rate=None,
        epochs=1000,
    ):
        if learning_rate is None:
            learning_rate = [0.001, 0.001]
        super().__init__(
            batch_size,
            mixed_precision,
            generator_learning_rate=learning_rate[0],
            discriminator_learning_rate=learning_rate[1],
            epochs=epochs,
            output_path=output_path,
        )
        self.data_path: str = data_path
        self.dataset_name: str = dataset

    def single_speaker_autotts(self, model_name, stats_path=None):
        """
        Args:
            model_name (str):
                Name of the model you want to train.

            stats_path (str):
                Optional. Path to the stats file for the audio config. Defaults to None.
        """
        if self.dataset_name == "ljspeech":
            audio, _ = data_loader(name="ljspeech", path=self.data_path, stats_path=stats_path)
            if model_name == "hifigan":
                model_config = self._hifi_gan(audio, self.data_path)
            elif model_name == "wavegrad":
                model_config = self._wavegrad(audio, self.data_path)
            elif model_name == "univnet":
                model_config = self._univnet(audio, self.data_path)
            elif model_name == "multiband melgan":
                model_config = self._multiband_melgan(audio, self.data_path)
            elif model_name == "wavernn":
                model_config = self._wavernn(audio, self.data_path)
        args, config, output_path, _, c_logger, tb_logger = init_training(TrainingArgs(), model_config)
        trainer = Trainer(args, config, output_path, c_logger, tb_logger)
        return trainer

    def from_pretrained(self, model_name):  # stub; `self` added so this is a valid instance method
        pass
```
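Taken together, the two trainers are the "three lines of code" recipe API this PR advertises. Below is a minimal, hypothetical end-to-end sketch built only from the signatures above; the LJSpeech path is a placeholder, not part of the diff:

```python
# Hypothetical end-to-end run of the auto_tts recipe API from this PR.
# Paths and dataset names are placeholders; adjust them to your setup.
from TTS.auto_tts.complete_recipes import TtsAutoTrainer, VocoderAutoTrainer

# Train a single-speaker Tacotron2 with double decoder consistency on LJSpeech.
tts = TtsAutoTrainer(data_path="../LJSpeech-1.1", dataset="ljspeech", batch_size=32)
tts_trainer = tts.single_speaker_autotts("tacotron2", tacotron2_model_type="double decoder consistency")
tts_trainer.fit()

# Train a matching HiFi-GAN vocoder on the same data.
vocoder = VocoderAutoTrainer(data_path="../LJSpeech-1.1", dataset="ljspeech", batch_size=32)
vocoder_trainer = vocoder.single_speaker_autotts("hifigan")
vocoder_trainer.fit()
```

Both methods return a `Trainer`, so `.fit()` starts training directly, matching the `model.fit()` pattern in the usage docstrings.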
Second new file, +12 lines (a short example using the model manager):

```python
from TTS.utils.manage import ModelManager


manager = ModelManager()
# download_model returns three values; the third (`x` here) is presumably the
# model metadata item, while the two paths are what training/inference need.
model_path, config_path, x = manager.download_model("tts_models/en/ljspeech/tacotron2-DCA")

print(model_path)
print(config_path)
print(x)

manager.list_models()
```
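As a hedged sketch of how those returned paths could be consumed for inference; the `Synthesizer` argument order below is an assumption about the TTS API of this version, not something shown in the diff:

```python
# Hedged sketch: wiring download_model's outputs into inference.
# Assumes TTS.utils.synthesizer.Synthesizer takes (tts_checkpoint, tts_config_path)
# as its first two positional arguments; check your installed version's signature.
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

manager = ModelManager()
model_path, config_path, _ = manager.download_model("tts_models/en/ljspeech/tacotron2-DCA")

synthesizer = Synthesizer(model_path, config_path)
wav = synthesizer.tts("Hello from the model hub.")
synthesizer.save_wav(wav, "hello.wav")
```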
Review comments:

"For docstrings use this format: https://numpydoc.readthedocs.io/en/latest/format.html"

"Even for personal notes."
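For reference, a minimal sketch of the requested numpydoc format, applied to one of the methods in this diff (types and descriptions taken from its existing docstring):

```python
def single_speaker_autotts(self, model_name, stats_path=None):
    """Build a single-speaker vocoder training recipe.

    Parameters
    ----------
    model_name : str
        Name of the model you want to train, e.g. "hifigan".
    stats_path : str, optional
        Path to the stats file for the audio config.

    Returns
    -------
    Trainer
        A trainer ready for ``.fit()``.
    """
```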