Commit 6fdb88f

Authored by loganhart02, loganhart420, and erogol
Add Delightful-TTS implementation (#2095)
* add configs
* Update config file
* Add model configs
* Add model layers
* Add layer files
* Add layer modules
* change config names
* Add emotion manager
* Fix missing ap bug
* Fix missing ap bug
* Add base TTS e2e class
* Fix wrong variable name in load_tts_samples
* Add training script
* Remove range predictor and gaussian upsampling
* Add helper function
* Add vctk recipe
* Add conformer docs
* Fix linting in conformer.py
* Add Docs
* remove duplicate import
* refactor args
* Fix bugs
* Remove emotion embedding
* remove unused arg
* Remove emotion embedding arg
* Remove emotion embedding arg
* fix style issues
* Fix bugs
* Fix bugs
* Add unittests
* make style
* fix formatter bug
* fix test
* Add pyworld compute pitch func
* Update requirements.txt
* Fix dataset Bug
* Change layer norm to instance norm
* Add missing import
* Remove emotions.py
* remove ssim loss
* Add init layers func to aligner
* refactor model layers
* remove audio_config arg
* Rename loss func
* Rename to delightful-tts
* Rename loss func
* Remove unused modules
* refactor imports
* replace audio config with audio processor
* Add change sample rate option
* remove broken resample func
* update recipe
* fix style, add config docs
* fix tests and multispeaker embd dim
* remove pyworld
* Make style and fix inference
* Split tts tests
* Fixup
* Fixup
* Fixup
* Add argument names
* Set "random" speaker in the model Tortoise/Bark
* Use a diff f0_cache path for delightful tts
* Fix delightful speaker handling
* Fix lint
* Make style

---------

Co-authored-by: loganhart420 <[email protected]>
Co-authored-by: Eren Gölge <[email protected]>
1 parent f24c5e0 commit 6fdb88f

41 files changed, +5202 -5 lines changed

Diff for: .github/workflows/tts_tests2.yml

+53
@@ -0,0 +1,53 @@
+name: tts-tests2
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    types: [opened, synchronize, reopened]
+jobs:
+  check_skip:
+    runs-on: ubuntu-latest
+    if: "! contains(github.event.head_commit.message, '[ci skip]')"
+    steps:
+      - run: echo "${{ github.event.head_commit.message }}"
+
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: [3.9, "3.10", "3.11"]
+        experimental: [false]
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+          architecture: x64
+          cache: 'pip'
+          cache-dependency-path: 'requirements*'
+      - name: check OS
+        run: cat /etc/os-release
+      - name: set ENV
+        run: export TRAINER_TELEMETRY=0
+      - name: Install dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y --no-install-recommends git make gcc
+          sudo apt-get install espeak
+          sudo apt-get install espeak-ng
+          make system-deps
+      - name: Install/upgrade Python setup deps
+        run: python3 -m pip install --upgrade pip setuptools wheel
+      - name: Replace scarf urls
+        run: |
+          sed -i 's/https:\/\/coqui.gateway.scarf.sh\//https:\/\/github.com\/coqui-ai\/TTS\/releases\/download\//g' TTS/.models.json
+      - name: Install TTS
+        run: |
+          python3 -m pip install .[all]
+          python3 setup.py egg_info
+      - name: Unit tests
+        run: make test_tts2
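
Note on the "Replace scarf urls" step: the sed command is a prefix substitution over TTS/.models.json, swapping the scarf gateway host for the GitHub releases download path. A minimal Python sketch of the same rewrite follows. The helper name `replace_scarf_urls` and the sample JSON entry are illustrative assumptions, not part of the commit.

```python
# Sketch of the substitution done by the workflow's "Replace scarf urls" step.
# The function name and the sample entry below are hypothetical.

SCARF_PREFIX = "https://coqui.gateway.scarf.sh/"
GITHUB_PREFIX = "https://github.com/coqui-ai/TTS/releases/download/"


def replace_scarf_urls(text: str) -> str:
    """Replace every occurrence of the scarf prefix with the GitHub releases prefix."""
    return text.replace(SCARF_PREFIX, GITHUB_PREFIX)


# Hypothetical entry in the style of a model registry file.
sample = '{"github_rls_url": "https://coqui.gateway.scarf.sh/v0.0.0/model.zip"}'
rewritten = replace_scarf_urls(sample)
print(rewritten)
```

One difference worth knowing: the sed pattern leaves the dots in the host name unescaped, so each one matches any character, whereas `str.replace` here matches the prefix literally.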

Diff for: Makefile

+3
@@ -19,6 +19,9 @@ test_vocoder: ## run vocoder tests.
 test_tts: ## run tts tests.
 	nose2 -F -v -B --with-coverage --coverage TTS tests.tts_tests
 
+test_tts2: ## run tts tests.
+	nose2 -F -v -B --with-coverage --coverage TTS tests.tts_tests2
+
 test_aux: ## run aux tests.
 	nose2 -F -v -B --with-coverage --coverage TTS tests.aux_tests
 	./run_bash_tests.sh

Diff for: TTS/bin/synthesize.py

+3 -3
@@ -430,9 +430,9 @@ def main():
     if tts_path is not None:
         wav = synthesizer.tts(
             args.text,
-            args.speaker_idx,
-            args.language_idx,
-            args.speaker_wav,
+            speaker_name=args.speaker_idx,
+            language_name=args.language_idx,
+            speaker_wav=args.speaker_wav,
             reference_wav=args.reference_wav,
             style_wav=args.capacitron_style_wav,
             style_text=args.capacitron_style_text,
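
Note on the positional-to-keyword change: the commit does not state its motivation, but one general benefit of keyword arguments is that a call keeps binding correctly if a parameter is later inserted into the signature. The sketch below shows the effect with a hypothetical function `tts_sketch`. It is not the real `Synthesizer.tts` signature, which this diff does not show, and `extra_option` is an invented parameter.

```python
# Hypothetical stand-in for a synthesis function; NOT the real Synthesizer.tts.
# `extra_option` is an invented parameter inserted between speaker_name and
# language_name to show how positional calls can silently shift.


def tts_sketch(text, speaker_name=None, extra_option=None, language_name=None, speaker_wav=None):
    """Return the bound arguments so the binding can be inspected."""
    return {
        "speaker_name": speaker_name,
        "extra_option": extra_option,
        "language_name": language_name,
        "speaker_wav": speaker_wav,
    }


# Positional call written against an older (text, speaker, language, wav) order:
# "en" now lands in extra_option and "ref.wav" in language_name.
positional = tts_sketch("hello", "spk1", "en", "ref.wav")

# Keyword call: each value reaches the intended parameter regardless of order.
keyword = tts_sketch("hello", speaker_name="spk1", language_name="en", speaker_wav="ref.wav")

print(positional)
print(keyword)
```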

Diff for: TTS/tts/configs/delightful_tts_config.py

+170
@@ -0,0 +1,170 @@
+from dataclasses import dataclass, field
+from typing import List
+
+from TTS.tts.configs.shared_configs import BaseTTSConfig
+from TTS.tts.models.delightful_tts import DelightfulTtsArgs, DelightfulTtsAudioConfig, VocoderConfig
+
+
+@dataclass
+class DelightfulTTSConfig(BaseTTSConfig):
+    """
+    Configuration class for the DelightfulTTS model.
+
+    Attributes:
+        model (str): Name of the model ("delightful_tts").
+        audio (DelightfulTtsAudioConfig): Configuration for audio settings.
+        model_args (DelightfulTtsArgs): Configuration for model arguments.
+        use_attn_priors (bool): Whether to use attention priors.
+        vocoder (VocoderConfig): Configuration for the vocoder.
+        init_discriminator (bool): Whether to initialize the discriminator.
+        steps_to_start_discriminator (int): Number of steps to start the discriminator.
+        grad_clip (List[float]): Gradient clipping values.
+        lr_gen (float): Learning rate for the gan generator.
+        lr_disc (float): Learning rate for the gan discriminator.
+        lr_scheduler_gen (str): Name of the learning rate scheduler for the generator.
+        lr_scheduler_gen_params (dict): Parameters for the learning rate scheduler for the generator.
+        lr_scheduler_disc (str): Name of the learning rate scheduler for the discriminator.
+        lr_scheduler_disc_params (dict): Parameters for the learning rate scheduler for the discriminator.
+        scheduler_after_epoch (bool): Whether to schedule after each epoch.
+        optimizer (str): Name of the optimizer.
+        optimizer_params (dict): Parameters for the optimizer.
+        ssim_loss_alpha (float): Alpha value for the SSIM loss.
+        mel_loss_alpha (float): Alpha value for the mel loss.
+        aligner_loss_alpha (float): Alpha value for the aligner loss.
+        pitch_loss_alpha (float): Alpha value for the pitch loss.
+        energy_loss_alpha (float): Alpha value for the energy loss.
+        u_prosody_loss_alpha (float): Alpha value for the utterance prosody loss.
+        p_prosody_loss_alpha (float): Alpha value for the phoneme prosody loss.
+        dur_loss_alpha (float): Alpha value for the duration loss.
+        char_dur_loss_alpha (float): Alpha value for the character duration loss.
+        binary_align_loss_alpha (float): Alpha value for the binary alignment loss.
+        binary_loss_warmup_epochs (int): Number of warm-up epochs for the binary loss.
+        disc_loss_alpha (float): Alpha value for the discriminator loss.
+        gen_loss_alpha (float): Alpha value for the generator loss.
+        feat_loss_alpha (float): Alpha value for the feature loss.
+        vocoder_mel_loss_alpha (float): Alpha value for the vocoder mel loss.
+        multi_scale_stft_loss_alpha (float): Alpha value for the multi-scale STFT loss.
+        multi_scale_stft_loss_params (dict): Parameters for the multi-scale STFT loss.
+        return_wav (bool): Whether to return audio waveforms.
+        use_weighted_sampler (bool): Whether to use a weighted sampler.
+        weighted_sampler_attrs (dict): Attributes for the weighted sampler.
+        weighted_sampler_multipliers (dict): Multipliers for the weighted sampler.
+        r (int): Value for the `r` override.
+        compute_f0 (bool): Whether to compute F0 values.
+        f0_cache_path (str): Path to the F0 cache.
+        attn_prior_cache_path (str): Path to the attention prior cache.
+        num_speakers (int): Number of speakers.
+        use_speaker_embedding (bool): Whether to use speaker embedding.
+        speakers_file (str): Path to the speaker file.
+        speaker_embedding_channels (int): Number of channels for the speaker embedding.
+        language_ids_file (str): Path to the language IDs file.
+    """
+
+    model: str = "delightful_tts"
+
+    # model specific params
+    audio: DelightfulTtsAudioConfig = field(default_factory=DelightfulTtsAudioConfig)
+    model_args: DelightfulTtsArgs = field(default_factory=DelightfulTtsArgs)
+    use_attn_priors: bool = True
+
+    # vocoder
+    vocoder: VocoderConfig = field(default_factory=VocoderConfig)
+    init_discriminator: bool = True
+
+    # optimizer
+    steps_to_start_discriminator: int = 200000
+    grad_clip: List[float] = field(default_factory=lambda: [1000, 1000])
+    lr_gen: float = 0.0002
+    lr_disc: float = 0.0002
+    lr_scheduler_gen: str = "ExponentialLR"
+    lr_scheduler_gen_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
+    lr_scheduler_disc: str = "ExponentialLR"
+    lr_scheduler_disc_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
+    scheduler_after_epoch: bool = True
+    optimizer: str = "AdamW"
+    optimizer_params: dict = field(default_factory=lambda: {"betas": [0.8, 0.99], "eps": 1e-9, "weight_decay": 0.01})
+
+    # acoustic model loss params
+    ssim_loss_alpha: float = 1.0
+    mel_loss_alpha: float = 1.0
+    aligner_loss_alpha: float = 1.0
+    pitch_loss_alpha: float = 1.0
+    energy_loss_alpha: float = 1.0
+    u_prosody_loss_alpha: float = 0.5
+    p_prosody_loss_alpha: float = 0.5
+    dur_loss_alpha: float = 1.0
+    char_dur_loss_alpha: float = 0.01
+    binary_align_loss_alpha: float = 0.1
+    binary_loss_warmup_epochs: int = 10
+
+    # vocoder loss params
+    disc_loss_alpha: float = 1.0
+    gen_loss_alpha: float = 1.0
+    feat_loss_alpha: float = 1.0
+    vocoder_mel_loss_alpha: float = 10.0
+    multi_scale_stft_loss_alpha: float = 2.5
+    multi_scale_stft_loss_params: dict = field(
+        default_factory=lambda: {
+            "n_ffts": [1024, 2048, 512],
+            "hop_lengths": [120, 240, 50],
+            "win_lengths": [600, 1200, 240],
+        }
+    )
+
+    # data loader params
+    return_wav: bool = True
+    use_weighted_sampler: bool = False
+    weighted_sampler_attrs: dict = field(default_factory=lambda: {})
+    weighted_sampler_multipliers: dict = field(default_factory=lambda: {})
+
+    # overrides
+    r: int = 1
+
+    # dataset configs
+    compute_f0: bool = True
+    f0_cache_path: str = None
+    attn_prior_cache_path: str = None
+
+    # multi-speaker settings
+    # use speaker embedding layer
+    num_speakers: int = 0
+    use_speaker_embedding: bool = False
+    speakers_file: str = None
+    speaker_embedding_channels: int = 256
+    language_ids_file: str = None
+    use_language_embedding: bool = False
+
+    # use d-vectors
+    use_d_vector_file: bool = False
+    d_vector_file: str = None
+    d_vector_dim: int = None
+
+    # testing
+    test_sentences: List[str] = field(
+        default_factory=lambda: [
+            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
+            "Be a voice, not an echo.",
+            "I'm sorry Dave. I'm afraid I can't do that.",
+            "This cake is great. It's so delicious and moist.",
+            "Prior to November 22, 1963.",
+        ]
+    )
+
+    def __post_init__(self):
+        # Pass multi-speaker parameters to the model args as `model.init_multispeaker()` looks for it there.
+        if self.num_speakers > 0:
+            self.model_args.num_speakers = self.num_speakers
+
+        # speaker embedding settings
+        if self.use_speaker_embedding:
+            self.model_args.use_speaker_embedding = True
+        if self.speakers_file:
+            self.model_args.speakers_file = self.speakers_file
+
+        # d-vector settings
+        if self.use_d_vector_file:
+            self.model_args.use_d_vector_file = True
+        if self.d_vector_dim is not None and self.d_vector_dim > 0:
+            self.model_args.d_vector_dim = self.d_vector_dim
+        if self.d_vector_file:
+            self.model_args.d_vector_file = self.d_vector_file
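
Note on `__post_init__`: the config copies its top-level multi-speaker and d-vector settings into `model_args`, and it builds nested objects with `field(default_factory=...)`. The sketch below reproduces that pattern with plain standard-library dataclasses so it runs without the TTS package. `ModelArgsSketch` and `ConfigSketch` are hypothetical stand-ins that keep only the fields `__post_init__` touches. The real classes derive from `BaseTTSConfig` and may behave differently in other respects.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArgsSketch:
    # Hypothetical stand-in for DelightfulTtsArgs (only the propagated fields).
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: Optional[str] = None
    use_d_vector_file: bool = False
    d_vector_dim: Optional[int] = None
    d_vector_file: Optional[str] = None


@dataclass
class ConfigSketch:
    # default_factory builds a fresh ModelArgsSketch per config instance.
    model_args: ModelArgsSketch = field(default_factory=ModelArgsSketch)
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: Optional[str] = None
    use_d_vector_file: bool = False
    d_vector_dim: Optional[int] = None
    d_vector_file: Optional[str] = None

    def __post_init__(self):
        # Same propagation rules as in the diff: copy only set/truthy values.
        if self.num_speakers > 0:
            self.model_args.num_speakers = self.num_speakers
        if self.use_speaker_embedding:
            self.model_args.use_speaker_embedding = True
        if self.speakers_file:
            self.model_args.speakers_file = self.speakers_file
        if self.use_d_vector_file:
            self.model_args.use_d_vector_file = True
        if self.d_vector_dim is not None and self.d_vector_dim > 0:
            self.model_args.d_vector_dim = self.d_vector_dim
        if self.d_vector_file:
            self.model_args.d_vector_file = self.d_vector_file


config = ConfigSketch(num_speakers=4, use_speaker_embedding=True)
print(config.model_args.num_speakers, config.model_args.use_speaker_embedding)  # 4 True

# Two configs do not share one model_args object.
a, b = ConfigSketch(), ConfigSketch()
print(a.model_args is b.model_args)  # False
```

Two properties of this pattern follow from the code as written. The copy is one-way and happens once, at construction, so changing `config.num_speakers` afterwards does not update `model_args`. And because only truthy or positive values are copied, a top-level `use_speaker_embedding=False` leaves whatever `model_args` already holds.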

Diff for: TTS/tts/datasets/dataset.py

+1
@@ -686,6 +686,7 @@ def __init__(
         self,
         samples: Union[List[List], List[Dict]],
         ap: "AudioProcessor",
+        audio_config=None,  # pylint: disable=unused-argument
         verbose=False,
         cache_path: str = None,
         precompute_num_workers=0,

Diff for: TTS/tts/layers/delightful_tts/__init__.py

Whitespace-only changes.
