
Speaker manager for multi-speaker handling #441

Merged · 70 commits · Apr 27, 2021
993f4ae
This snippet was trying to load the model as the config file
AXKuhta Apr 15, 2021
d0d7eae
Merge pull request #430 from AXKuhta/main
erogol Apr 16, 2021
48ae52a
handle multi speaker and gst in Synthetizer class
kirianguiller Mar 1, 2021
83aa415
add usage of new Synthetizer class in the chinese model notebook
kirianguiller Mar 1, 2021
25328aa
refactoring to allow defining the speaker file externally
erogol Apr 16, 2021
47e356c
code styling
erogol Apr 16, 2021
1038fd4
fix a mistake from rebase
erogol Apr 16, 2021
d9612a4
set the default layer size compatible with scglow
erogol Apr 16, 2021
d2fa8ad
add ```unique``` param to keep scglow models compatible (they are dup…
erogol Apr 16, 2021
d0786be
remove matrix link
erogol Apr 19, 2021
9bccee9
update synthesize.py for multi-speaker setting
erogol Apr 21, 2021
37cad38
update argument name in server.py
erogol Apr 21, 2021
8b40720
add load_chekpoint to speaker encoder
erogol Apr 21, 2021
8764d02
update argument name external_speaker_embedding_dim -> speaker_embedd…
erogol Apr 21, 2021
09890c7
fix the glow-tts in setup_model
erogol Apr 21, 2021
ab31381
initial SpeakerManager implementation
erogol Apr 21, 2021
790946f
formating speakers.py
erogol Apr 21, 2021
04b6881
add ```unique``` argument to make_symbols to fix the incompat. issue …
erogol Apr 21, 2021
e1d960d
use SpeakerManager in Synthesizer
erogol Apr 21, 2021
757dfb9
add `SpeakerManager` tests
erogol Apr 21, 2021
39ceb3f
Update README.md
erogol Apr 21, 2021
0ee3eee
[ci skip] update CONTRIBUTING.md
erogol Apr 21, 2021
ef37633
[ci skip] use prenet_dropout by default with Tacotron models
erogol Apr 22, 2021
f5fd7f7
server: also listen to ipv6
Mic92 Apr 16, 2021
a6cd044
Merge branch 'dev' of https://github.com/coqui-ai/TTS into dev
erogol Apr 22, 2021
c125b71
fix windows support
WeberJulian Apr 22, 2021
355e1f4
fix dumb mistake
WeberJulian Apr 22, 2021
a264981
Change back the default value
WeberJulian Apr 22, 2021
4205284
Change name of the functions
WeberJulian Apr 23, 2021
cc4efb4
Merge pull request #446 from WeberJulian/fix-windows
erogol Apr 23, 2021
7dccbfd
handle multi speaker and gst in Synthetizer class
kirianguiller Mar 1, 2021
f393c08
add usage of new Synthetizer class in the chinese model notebook
kirianguiller Mar 1, 2021
af7baa3
refactoring to allow defining the speaker file externally
erogol Apr 16, 2021
aadb210
code styling
erogol Apr 16, 2021
3ace244
fix a mistake from rebase
erogol Apr 16, 2021
c955a12
set the default layer size compatible with scglow
erogol Apr 16, 2021
99dc07a
add ```unique``` param to keep scglow models compatible (they are dup…
erogol Apr 16, 2021
af2d36f
update synthesize.py for multi-speaker setting
erogol Apr 21, 2021
1229ccb
update argument name in server.py
erogol Apr 21, 2021
2da81f5
add load_chekpoint to speaker encoder
erogol Apr 21, 2021
d427480
update argument name external_speaker_embedding_dim -> speaker_embedd…
erogol Apr 21, 2021
7a7aeb3
fix the glow-tts in setup_model
erogol Apr 21, 2021
df42222
initial SpeakerManager implementation
erogol Apr 21, 2021
d08888e
formating speakers.py
erogol Apr 21, 2021
e971263
add ```unique``` argument to make_symbols to fix the incompat. issue …
erogol Apr 21, 2021
6d0f5e0
use SpeakerManager in Synthesizer
erogol Apr 21, 2021
32e6afc
add `SpeakerManager` tests
erogol Apr 21, 2021
10c988a
update server.py
erogol Apr 22, 2021
f9f3d04
remove moved function
erogol Apr 22, 2021
ad047c8
html formatting, enable multi-speaker model on the server with a drop…
erogol Apr 22, 2021
c80d21f
load speaker_encoder_ap and compute x_vector directly from the input …
erogol Apr 23, 2021
dfa415a
small refactor in server.py
erogol Apr 23, 2021
179722e
new arguments to synthesize.py for loading speaker encoder and speake…
erogol Apr 23, 2021
f691957
let speaker manager compute mean x_vector from multiple wav files
erogol Apr 23, 2021
7eb0c60
let synthesizer to pass speaker encoder file paths to speaker manager
erogol Apr 23, 2021
a878d8f
update tests
erogol Apr 23, 2021
4cf2113
styling and linting
erogol Apr 23, 2021
b82daa5
style and linter fixes
erogol Apr 26, 2021
f37b488
Merge branch 'speaker-manager' of https://github.com/coqui-ai/TTS int…
erogol Apr 26, 2021
b531fa6
remove conflicy noise
erogol Apr 26, 2021
2f07160
enable multi-speaker CoquiTTS models for synthesize.py
erogol Apr 26, 2021
6bdd816
place holders for sc-glow and hifigan models
erogol Apr 26, 2021
734e6a5
bug fix
erogol Apr 27, 2021
8f0519d
bump up numpy version
erogol Apr 27, 2021
add97cd
move function and remove import
erogol Apr 27, 2021
4719414
remove imports
erogol Apr 27, 2021
19d9f58
create dummy model on the fly
erogol Apr 27, 2021
1235e54
test for synthesize.py
erogol Apr 27, 2021
628abfe
remove test
erogol Apr 27, 2021
6353e87
fix test
erogol Apr 27, 2021
58 changes: 26 additions & 32 deletions TTS/bin/synthesize.py
@@ -102,15 +102,20 @@ def main():
parser.add_argument("--vocoder_config_path", type=str, help="Path to vocoder model config file.", default=None)

# args for multi-speaker synthesis
parser.add_argument("--speakers_json", type=str, help="JSON file for multi-speaker model.", default=None)
parser.add_argument("--speakers_file_path", type=str, help="JSON file for multi-speaker model.", default=None)
parser.add_argument(
"--speaker_idx",
type=str,
help="if the tts model is trained with x-vectors, then speaker_idx is a file present in speakers.json else speaker_idx is the speaker id corresponding to a speaker in the speaker embedding layer.",
default=None,
)
parser.add_argument("--gst_style", help="Wav path file for GST stylereference.", default=None)

parser.add_argument(
"--list_speaker_idxs",
help="List available speaker ids for the defined multi-speaker model.",
default=False,
type=str2bool,
)
# aux args
parser.add_argument(
"--save_spectogram",
@@ -131,6 +136,7 @@ def main():

model_path = None
config_path = None
speakers_file_path = None
vocoder_path = None
vocoder_config_path = None

@@ -139,54 +145,42 @@ def main():
manager.list_models()
sys.exit()

# CASE2: load pre-trained models
if args.model_name is not None:
# CASE2: load pre-trained model paths
if args.model_name is not None and not args.model_path:
model_path, config_path, model_item = manager.download_model(args.model_name)
args.vocoder_name = model_item["default_vocoder"] if args.vocoder_name is None else args.vocoder_name

if args.vocoder_name is not None:
if args.vocoder_name is not None and not args.vocoder_path:
vocoder_path, vocoder_config_path, _ = manager.download_model(args.vocoder_name)

# CASE3: load custome models
# CASE3: set custome model paths
if args.model_path is not None:
model_path = args.model_path
config_path = args.config_path
speakers_file_path = args.speakers_file_path

if args.vocoder_path is not None:
vocoder_path = args.vocoder_path
vocoder_config_path = args.vocoder_config_path

# RUN THE SYNTHESIS
# load models
synthesizer = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, args.use_cuda)
synthesizer = Synthesizer(
model_path, config_path, speakers_file_path, vocoder_path, vocoder_config_path, args.use_cuda
)

print(" > Text: {}".format(args.text))
# query speaker ids of a multi-speaker model.
if args.list_speaker_idxs:
print(
" > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model."
)
print(synthesizer.speaker_manager.speaker_ids)
return

# # handle multi-speaker setting
# if not model_config.use_external_speaker_embedding_file and args.speaker_idx is not None:
# if args.speaker_idx.isdigit():
# args.speaker_idx = int(args.speaker_idx)
# else:
# args.speaker_idx = None
# else:
# args.speaker_idx = None

# if args.gst_style is None:
# if 'gst' in model_config.keys() and model_config.gst['gst_style_input'] is not None:
# gst_style = model_config.gst['gst_style_input']
# else:
# gst_style = None
# else:
# # check if gst_style string is a dict, if is dict convert else use string
# try:
# gst_style = json.loads(args.gst_style)
# if max(map(int, gst_style.keys())) >= model_config.gst['gst_style_tokens']:
# raise RuntimeError("The highest value of the gst_style dictionary key must be less than the number of GST Tokens, \n Highest dictionary key value: {} \n Number of GST tokens: {}".format(max(map(int, gst_style.keys())), model_config.gst['gst_style_tokens']))
# except ValueError:
# gst_style = args.gst_style
# RUN THE SYNTHESIS
print(" > Text: {}".format(args.text))

# kick it
wav = synthesizer.tts(args.text)
wav = synthesizer.tts(args.text, args.speaker_idx)

# save the results
print(" > Saving output to {}".format(args.out_path))
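The new `--list_speaker_idxs` flag above is parsed with a `str2bool` type so `--list_speaker_idxs true` works from the shell. A minimal sketch of such a helper (the real one lives elsewhere in the TTS codebase and may differ):

```python
import argparse

def str2bool(v):
    # Map common truthy/falsy strings to bool for argparse flags.
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("boolean value expected")

parser = argparse.ArgumentParser()
parser.add_argument("--list_speaker_idxs", default=False, type=str2bool)
print(parser.parse_args(["--list_speaker_idxs", "true"]).list_speaker_idxs)  # True
```

With this in place, running `synthesize.py --model_name <multi-speaker model> --list_speaker_idxs true` prints the available speaker ids and exits before synthesis.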
13 changes: 11 additions & 2 deletions TTS/server/server.py
@@ -7,6 +7,7 @@

from flask import Flask, render_template, request, send_file

from TTS.utils.generic_utils import style_wav_uri_to_dict
from TTS.utils.io import load_config
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer
@@ -81,12 +82,16 @@ def convert_boolean(x):
args.tts_checkpoint, args.tts_config, args.vocoder_checkpoint, args.vocoder_config, args.use_cuda
)

use_speaker_embedding = synthesizer.tts_config.get("use_external_speaker_embedding_file", False)
use_gst = synthesizer.tts_config.get("use_gst", False)
app = Flask(__name__)


@app.route("/")
def index():
return render_template("index.html", show_details=args.show_details)
return render_template(
"index.html", show_details=args.show_details, use_speaker_embedding=use_speaker_embedding, use_gst=use_gst
)


@app.route("/details")
@@ -109,8 +114,12 @@ def details():
@app.route("/api/tts", methods=["GET"])
def tts():
text = request.args.get("text")
speaker_idx = request.args.get("speaker", "")
style_wav = request.args.get("style-wav", "")

style_wav = style_wav_uri_to_dict(style_wav)
print(" > Model input: {}".format(text))
wavs = synthesizer.tts(text)
wavs = synthesizer.tts(text, speaker_idx=speaker_idx, style_wav=style_wav)
out = io.BytesIO()
synthesizer.save_wav(wavs, out)
return send_file(out, mimetype="audio/wav")
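The `/api/tts` route now funnels the `style-wav` query value through `style_wav_uri_to_dict`, imported from `TTS.utils.generic_utils`. The diff does not show that helper's body; a hypothetical sketch of what such a function might do — interpret the value either as a JSON dict of GST token weights or as a path to a reference wav — could look like this (not the actual implementation):

```python
import json

def style_wav_uri_to_dict(style_wav):
    # Empty query value -> no style conditioning.
    if not style_wav:
        return None
    try:
        d = json.loads(style_wav)
        if isinstance(d, dict):
            return d  # e.g. {"0": 0.1} -> per-GST-token weights
    except json.JSONDecodeError:
        pass
    return style_wav  # otherwise treat it as a path to a reference wav

print(style_wav_uri_to_dict('{"0": 0.1}'))  # {'0': 0.1}
```

Accepting both forms lets the same query parameter drive token-level GST control and reference-audio style transfer.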
25 changes: 21 additions & 4 deletions TTS/server/templates/index.html
@@ -60,6 +60,14 @@

<ul class="list-unstyled">
</ul>
{%if use_speaker_embedding%}
<input id="speaker-json-key" placeholder="speaker json key.." size=45 type="text" name="speaker-json-key">
{%endif%}

{%if use_gst%}
<input value='{"0": 0.1}' id="style-wav" placeholder="style wav (dict or path ot wav).." size=45 type="text" name="style-wav">
{%endif%}

<input id="text" placeholder="Type here..." size=45 type="text" name="text">
<button id="speak-button" name="speak">Speak</button><br/><br/>
{%if show_details%}
@@ -73,15 +81,24 @@

<!-- Bootstrap core JavaScript -->
<script>
function getTextValue(textId) {
const container = q(textId)
if (container) {
return container.value
}
return ""
}
function q(selector) {return document.querySelector(selector)}
q('#text').focus()
function do_tts(e) {
text = q('#text').value
const text = q('#text').value
const speakerJsonKey = getTextValue('#speaker-json-key')
const styleWav = getTextValue('#style-wav')
if (text) {
q('#message').textContent = 'Synthesizing...'
q('#speak-button').disabled = true
q('#audio').hidden = true
synthesize(text)
synthesize(text, speakerJsonKey, styleWav)
}
e.preventDefault()
return false
@@ -92,8 +109,8 @@
do_tts(e)
}
})
function synthesize(text) {
fetch('/api/tts?text=' + encodeURIComponent(text), {cache: 'no-cache'})
function synthesize(text, speakerJsonKey="", styleWav="") {
fetch(`/api/tts?text=${encodeURIComponent(text)}&speaker=${encodeURIComponent(speakerJsonKey)}&style-wav=${encodeURIComponent(styleWav)}` , {cache: 'no-cache'})
.then(function(res) {
if (!res.ok) throw Error(res.statusText)
return res.blob()
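The `synthesize()` function in the template now appends `speaker` and `style-wav` to the `/api/tts` query string. The same request can be built from Python; the base URL is an assumption (the demo server's default port), not something fixed by this diff:

```python
from urllib.parse import urlencode

def build_tts_url(text, speaker_json_key="", style_wav="",
                  base="http://localhost:5002"):
    # Mirrors the query string assembled in index.html's synthesize().
    qs = urlencode({"text": text, "speaker": speaker_json_key,
                    "style-wav": style_wav})
    return f"{base}/api/tts?{qs}"

print(build_tts_url("hello", speaker_json_key="p225"))
# http://localhost:5002/api/tts?text=hello&speaker=p225&style-wav=
```

`urlencode` handles the percent-escaping that the template does with `encodeURIComponent`, so spaces and JSON braces in `style-wav` survive the round trip.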
8 changes: 8 additions & 0 deletions TTS/speaker_encoder/model.py
@@ -108,3 +108,11 @@ def batch_compute_embedding(self, x, seq_lens, num_frames=160, overlap=0.5):
else:
embed[cur_iter <= num_iters, :] += self.inference(frames[cur_iter <= num_iters, :, :])
return embed / num_iters

# pylint: disable=unused-argument, redefined-builtin
def load_checkpoint(self, config: dict, checkpoint_path: str, eval: bool = False):
state = torch.load(checkpoint_path, map_location=torch.device("cpu"))
self.load_state_dict(state["model"])
if eval:
self.eval()
assert not self.training
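The `load_checkpoint` added above follows a simple contract: the checkpoint dict stores weights under a `"model"` key, and the caller can request eval mode at load time. A toy stand-in that exercises the same pattern without PyTorch (pickle plays the role of `torch.load`; the class is illustrative, not the real encoder):

```python
import pickle
import tempfile

class TinyEncoder:
    # Minimal stand-in for the speaker encoder's checkpoint handling.
    def __init__(self):
        self.state = {}
        self.training = True

    def load_state_dict(self, state):
        self.state.update(state)

    def eval(self):
        self.training = False

    def load_checkpoint(self, checkpoint_path, eval=False):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)  # analogue of torch.load(..., map_location="cpu")
        self.load_state_dict(state["model"])
        if eval:
            self.eval()
            assert not self.training

# round-trip: save a checkpoint dict, then load it in eval mode
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump({"model": {"w": 1.5}}, f)
    path = f.name

enc = TinyEncoder()
enc.load_checkpoint(path, eval=True)
print(enc.state, enc.training)  # {'w': 1.5} False
```

Loading to CPU first (as the real method does with `map_location`) keeps checkpoints usable on machines without the GPU they were trained on.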
25 changes: 14 additions & 11 deletions TTS/tts/models/glow_tts.py
@@ -33,7 +33,7 @@ class GlowTTS(nn.Module):
mean_only (bool): if True, encoder only computes mean value and uses constant variance for each time step.
encoder_type (str): encoder module type.
encoder_params (dict): encoder module parameters.
external_speaker_embedding_dim (int): channels of external speaker embedding vectors.
speaker_embedding_dim (int): channels of external speaker embedding vectors.
"""

def __init__(
@@ -45,6 +45,7 @@ def __init__(
hidden_channels_dp,
out_channels,
num_flow_blocks_dec=12,
inference_noise_scale=0.33,
kernel_size_dec=5,
dilation_rate=5,
num_block_layers=4,
@@ -58,7 +59,7 @@
mean_only=False,
encoder_type="transformer",
encoder_params=None,
external_speaker_embedding_dim=None,
speaker_embedding_dim=None,
):

super().__init__()
@@ -79,18 +80,20 @@ def __init__(
self.sigmoid_scale = sigmoid_scale
self.mean_only = mean_only
self.use_encoder_prenet = use_encoder_prenet
self.inference_noise_scale = inference_noise_scale

# model constants.
self.noise_scale = 0.33 # defines the noise variance applied to the random z vector at inference.
self.length_scale = 1.0 # scaler for the duration predictor. The larger it is, the slower the speech.
self.external_speaker_embedding_dim = external_speaker_embedding_dim
self.speaker_embedding_dim = speaker_embedding_dim

# if is a multispeaker and c_in_channels is 0, set to 256
if num_speakers > 1:
if self.c_in_channels == 0 and not self.external_speaker_embedding_dim:
self.c_in_channels = 512
elif self.external_speaker_embedding_dim:
self.c_in_channels = self.external_speaker_embedding_dim
if self.c_in_channels == 0 and not self.speaker_embedding_dim:
# TODO: make this adjustable
self.c_in_channels = 256
elif self.speaker_embedding_dim:
self.c_in_channels = self.speaker_embedding_dim

self.encoder = Encoder(
num_chars,
@@ -119,7 +122,7 @@ def __init__(
c_in_channels=self.c_in_channels,
)

if num_speakers > 1 and not external_speaker_embedding_dim:
if num_speakers > 1 and not speaker_embedding_dim:
# speaker embedding layer
self.emb_g = nn.Embedding(num_speakers, self.c_in_channels)
nn.init.uniform_(self.emb_g.weight, -0.1, 0.1)
@@ -149,7 +152,7 @@ def forward(self, x, x_lengths, y=None, y_lengths=None, attn=None, g=None):
y_max_length = y.size(2)
# norm speaker embeddings
if g is not None:
if self.external_speaker_embedding_dim:
if self.speaker_embedding_dim:
g = F.normalize(g).unsqueeze(-1)
else:
g = F.normalize(self.emb_g(g)).unsqueeze(-1) # [b, h, 1]
@@ -179,7 +182,7 @@ def forward(self, x, x_lengths, y=None, y_lengths=None, attn=None, g=None):
@torch.no_grad()
def inference(self, x, x_lengths, g=None):
if g is not None:
if self.external_speaker_embedding_dim:
if self.speaker_embedding_dim:
g = F.normalize(g).unsqueeze(-1)
else:
g = F.normalize(self.emb_g(g)).unsqueeze(-1) # [b, h]
@@ -198,7 +201,7 @@ def inference(self, x, x_lengths, g=None):
attn = generate_path(w_ceil.squeeze(1), attn_mask.squeeze(1)).unsqueeze(1)
y_mean, y_log_scale, o_attn_dur = self.compute_outputs(attn, o_mean, o_log_scale, x_mask)

z = (y_mean + torch.exp(y_log_scale) * torch.randn_like(y_mean) * self.noise_scale) * y_mask
z = (y_mean + torch.exp(y_log_scale) * torch.randn_like(y_mean) * self.inference_noise_scale) * y_mask
# decoder pass
y, logdet = self.decoder(z, y_mask, g=g, reverse=True)
attn = attn.squeeze(1).permute(0, 2, 1)
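The GlowTTS change above turns the hard-coded `0.33` into a configurable `inference_noise_scale`: at inference, the latent is sampled as `z = mean + exp(log_scale) * eps * noise_scale` with `eps ~ N(0, 1)`. A scalar sketch of that sampling step (list-based here instead of tensors):

```python
import math
import random

def sample_latent(y_mean, y_log_scale, inference_noise_scale=0.33):
    # z = mean + exp(log_scale) * eps * noise_scale, eps ~ N(0, 1).
    # Smaller noise_scale -> flatter, more deterministic prosody.
    return [m + math.exp(s) * random.gauss(0.0, 1.0) * inference_noise_scale
            for m, s in zip(y_mean, y_log_scale)]

# With noise scale 0 the sample collapses to the predicted mean.
print(sample_latent([0.5, -0.2], [0.0, 0.0], inference_noise_scale=0.0))  # [0.5, -0.2]
```

Exposing the scale lets users trade prosody variation for stability per run instead of retraining or patching the model.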
3 changes: 2 additions & 1 deletion TTS/tts/utils/generic_utils.py
@@ -119,6 +119,7 @@ def setup_model(num_chars, num_speakers, c, speaker_embedding_dim=None):
encoder_type=c.encoder_type,
encoder_params=c.encoder_params,
use_encoder_prenet=c["use_encoder_prenet"],
inference_noise_scale=c.get("inference_noise_scale", 0.33),
num_flow_blocks_dec=12,
kernel_size_dec=5,
dilation_rate=1,
@@ -130,7 +131,7 @@
num_squeeze=2,
sigmoid_scale=False,
mean_only=True,
external_speaker_embedding_dim=speaker_embedding_dim,
speaker_embedding_dim=speaker_embedding_dim,
)
elif c.model.lower() == "speedy_speech":
model = MyModel(
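Note how `setup_model` reads the new key with `c.get("inference_noise_scale", 0.33)`: older configs predate the field, so the previous hard-coded value serves as the fallback. The same backward-compatibility pattern applies to any newly introduced config field:

```python
def resolve_inference_noise_scale(config):
    # Older configs lack the key; fall back to the previously
    # hard-coded default so existing models keep working.
    return config.get("inference_noise_scale", 0.33)

print(resolve_inference_noise_scale({}))                              # 0.33
print(resolve_inference_noise_scale({"inference_noise_scale": 0.8}))  # 0.8
```

Keeping the default equal to the old constant means pre-existing checkpoints behave identically before and after this PR.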