Global Style Tokens #167
It seems that this one has implemented it.
@tsungruihon thanks for the pointer! Will have a look at it 👍
I think we can test the GST module independently before merging it into the whole model architecture. So my idea is to train the GST encoder with computed spectrograms and see if it learns any speech traits in its token vectors. This can be checked visually by projecting the final GST vectors with something like the t-SNE algorithm. Before going deeper, I need to read the paper once again.
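A minimal sketch of such a check, assuming the trained GST layer exposes its token bank as a tensor (the attribute name `gst_layer.style_tokens` is hypothetical and depends on the actual implementation):

```python
# Project the learned style token vectors to 2D with t-SNE and plot them.
# `gst_layer.style_tokens` is assumed to be an (n_tokens, token_dim) parameter.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tokens = gst_layer.style_tokens.detach().cpu().numpy()  # (n_tokens, token_dim)

# perplexity must be smaller than the number of samples; the GST paper uses 10 tokens.
proj = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(tokens)

plt.scatter(proj[:, 0], proj[:, 1])
for i, (x, y) in enumerate(proj):
    plt.annotate(f"token {i}", (x, y))
plt.title("t-SNE projection of GST token embeddings")
plt.show()
```

The same projection can also be run on per-utterance style embeddings to see whether utterances with similar prosody cluster together.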
I am getting first results with GST and Tacotron. Gonna play around with it a little more and then post some audio. Another thought I had was that it's kind of crazy to use the linear spectrograms for the loss in Tacotron 1. First, half the linear spectrogram is just fairly random noise in the high frequencies. Second, linear spectrograms are a real memory drain: batch_size (32) x spec length (can easily be 500) x 1024 -> about 16 million floats just for the linear specs... Just started another experiment with Taco 1, style tokens, and mel specs with a downsized Taco 2 postnet. Let's see what comes out of this.
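A quick back-of-the-envelope check of that memory argument, using the numbers quoted above (the 80 mel bins are an assumption for comparison):

```python
# Memory cost of one batch of linear vs. mel spectrograms, stored as float32.
batch_size, n_frames = 32, 500
linear_bins, mel_bins = 1024, 80

linear_floats = batch_size * n_frames * linear_bins   # 16,384,000 floats
mel_floats = batch_size * n_frames * mel_bins         # 1,280,000 floats

print(f"linear: {linear_floats / 1e6:.1f}M floats, "
      f"{linear_floats * 4 / 2**20:.0f} MiB")
print(f"mel:    {mel_floats / 1e6:.1f}M floats, "
      f"{mel_floats * 4 / 2**20:.0f} MiB")
```

So a linear-spec batch is roughly 60 MiB against about 5 MiB for mels, before any intermediate activations.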
Cool, thanks for the update. It's interesting that you're running these experiments.
So far I didn't get really robust results with GST and Common Voice; training and eval work reasonably well, but inference is very unstable. I think it's really the wrong dataset to use... I switched gears a bit and am now using GST and speaker embeddings with the entire German M-AILABS dataset with 5 different speakers. Just started the first training for that. I also have implemented support for multiple datasets now: dataset type, language and speaker id can be specified per dataset (a rough sketch of what that looks like is below). It would be easier to contribute these things if smaller refactorings also got pulled in :D -> #192. Then I don't have to work on some old state that I have already changed on my fork.
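Something like the following is meant by a per-dataset configuration; the field names here are purely illustrative and not the actual config keys of the fork:

```python
# Hypothetical sketch of specifying dataset type, language and speaker id per dataset.
datasets = [
    {"dataset_type": "ljspeech", "path": "/data/ljspeech",      "language": "en", "speaker_id": 0},
    {"dataset_type": "mailabs",  "path": "/data/mailabs/de_DE", "language": "de", "speaker_id": 1},
]
```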
I tried to train karlsson from M-AILABS but so far no success. Its clips are noisy and of poor quality. Let me know if you get anything better.
Good stuff, which type of attention did you use for GST? I went back to the summation of tokens instead of multi-head attention since it seemed easier to control during inference. But things are really a bit fickle: sometimes the alignment just breaks off for certain token combinations or certain speakers. I have seen some effects similar to yours, where the tokens influence the length of pauses or commas, or even which commas are attended to and which aren't (e.g. first or last). And then I once had a certain style make one speaker sound like another speaker. I tried so many things recently, I'm sure there's something I can contribute back.
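For reference, a minimal sketch (not the repo's actual code) of the "summation of tokens" variant: the reference embedding predicts softmax weights over a learned token bank, and the style embedding is the weighted sum of the tokens. At inference the weights can simply be set by hand, which is what makes it easier to control than multi-head attention.

```python
import torch
import torch.nn as nn

class TokenSummationGST(nn.Module):
    """Style embedding as a softmax-weighted sum of learned tokens (sketch)."""

    def __init__(self, n_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.3)
        self.to_weights = nn.Linear(ref_dim, n_tokens)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim) from the reference encoder
        weights = torch.softmax(self.to_weights(ref_embedding), dim=-1)  # (batch, n_tokens)
        style = weights @ torch.tanh(self.tokens)                        # (batch, token_dim)
        return style, weights
```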
I used multi-head attention as in the paper. I think to target speaker IDs, we can add an embedding layer to the style encoder.
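A short sketch of that idea, i.e. concatenating a learned speaker embedding with the reference-encoder output before the style attention (names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

speaker_embedding = nn.Embedding(num_embeddings=5, embedding_dim=64)  # e.g. 5 M-AILABS speakers

def style_encoder_input(ref_embedding, speaker_ids):
    # ref_embedding: (batch, ref_dim), speaker_ids: (batch,) long tensor of speaker indices
    spk = speaker_embedding(speaker_ids)            # (batch, 64)
    return torch.cat([ref_embedding, spk], dim=-1)  # (batch, ref_dim + 64)
```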
So it is implemented. I'll close this, and we can open a new issue if there are more experiments.
Hi @erogol, I'm trying to understand how you implemented inference with single tokens. I cannot find it in the code and I don't know how to integrate it with multi-head attention. Thanks.
I took the model into a notebook and ran it manually. There I rewrote the inference to use only the token I chose. Unfortunately, it is not here in the library, but it is easy to replicate.
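Something along these lines reproduces the idea: skip the reference encoder and attention at inference and inject a single (tanh-squashed, scaled) token as the style embedding. The attribute `gst_layer.style_tokens` and the way the style embedding is combined with the text encoder outputs are assumptions and depend on the actual model.

```python
import torch

def style_from_single_token(gst_layer, token_index, scale=0.3):
    """Build a style embedding from one learned token instead of attention (sketch)."""
    token = torch.tanh(gst_layer.style_tokens[token_index])  # (token_dim,)
    return scale * token.unsqueeze(0)                        # (1, token_dim)

style = style_from_single_token(gst_layer, token_index=3)
# e.g. encoder_outputs = encoder_outputs + style.unsqueeze(1)  # broadcast over time steps
```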
Can you share your notebook?
I don't keep it anymore, unfortunately.
Global Style Tokens are embeddings that capture prosodic styles across the training set. They allow the system to explicitly specify the desired prosody of a generated sequence, i.e. essentially how the sentence is spoken, e.g. with a certain emotion, whispering, etc. Additionally, they should help training, because the text of an example contains no hints about prosody; the TTS system currently has to guess the prosody or factor it into the character/phoneme embeddings. A rough sketch of the mechanism is given below.
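As a sketch of the mechanism described in the GST paper (not a proposal for the exact code), the style token layer is multi-head attention in which a reference/prosody embedding queries a small bank of learned tokens; the attention output is the style embedding that conditions the Tacotron encoder. Dimensions and names below are illustrative.

```python
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    """Reference embedding attends over a learned token bank (sketch)."""

    def __init__(self, n_tokens=10, token_dim=256, ref_dim=128, n_heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.3)
        self.query_proj = nn.Linear(ref_dim, token_dim)
        self.attention = nn.MultiheadAttention(embed_dim=token_dim, num_heads=n_heads,
                                               batch_first=True)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim) from the reference encoder
        query = self.query_proj(ref_embedding).unsqueeze(1)             # (batch, 1, token_dim)
        keys = torch.tanh(self.tokens).unsqueeze(0).expand(query.size(0), -1, -1)
        style, _ = self.attention(query, keys, keys)                    # (batch, 1, token_dim)
        return style.squeeze(1)                                         # (batch, token_dim)
```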
The main papers for this line of work are
To implement this in Mozilla TTS I think the following steps are necessary:
Thoughts?