Global Style Tokens #167
It seems that this one has implemented it.
@tsungruihon thanks for the pointer! Will have a look at it 👍
I think we can test the GST module independently before merging it into the whole model architecture. So my idea is to train the GST encoder with computed spectrograms and see if it learns any speech traits in its token vectors. This can be checked visually by projecting the final GST vectors with something like the t-SNE algorithm. Before going deeper, I need to read the paper once again.
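A minimal sketch of such a check, assuming the trained GST layer exposes its token bank as a tensor (the attribute name `gst_layer.style_tokens` is hypothetical and depends on the actual implementation):

```python
# Project the learned style token vectors to 2D with t-SNE and plot them.
# `gst_layer.style_tokens` is assumed to be an (n_tokens, token_dim) parameter.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tokens = gst_layer.style_tokens.detach().cpu().numpy()  # (n_tokens, token_dim)

# perplexity must be smaller than the number of samples; the GST paper uses 10 tokens.
proj = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(tokens)

plt.scatter(proj[:, 0], proj[:, 1])
for i, (x, y) in enumerate(proj):
    plt.annotate(f"token {i}", (x, y))
plt.title("t-SNE projection of GST token embeddings")
plt.show()
```

The same projection can also be run on per-utterance style embeddings to see whether utterances with similar prosody cluster together.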
I am getting first results with GST and Tacotron. Gonna play around with it a little more and then post some audio. Another thought I had was that it's kind of crazy to use the linear spectrograms for the loss in Tacotron 1. First, half the linear spectrogram is just fairly random noise in the high frequencies. Second, linear spectrograms are a real memory drain: batch_size (32) x spec length (can easily be 500) x 1024 -> about 16 million floats just for the linear specs... Just started another experiment with Taco 1, style tokens, and mel specs with a downsized Taco 2 postnet. Let's see what comes out of this.
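A quick back-of-the-envelope check of that memory argument, using the numbers quoted above (the 80 mel bins are an assumption for comparison):

```python
# Memory cost of one batch of linear vs. mel spectrograms, stored as float32.
batch_size, n_frames = 32, 500
linear_bins, mel_bins = 1024, 80

linear_floats = batch_size * n_frames * linear_bins   # 16,384,000 floats
mel_floats = batch_size * n_frames * mel_bins         # 1,280,000 floats

print(f"linear: {linear_floats / 1e6:.1f}M floats, "
      f"{linear_floats * 4 / 2**20:.0f} MiB")
print(f"mel:    {mel_floats / 1e6:.1f}M floats, "
      f"{mel_floats * 4 / 2**20:.0f} MiB")
```

So a linear-spec batch is roughly 60 MiB against about 5 MiB for mels, before any intermediate activations.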
Cool, thanks for the update. It's interesting that you're running these experiments.
So far I didn't get really robust results with GST and Common Voice; training and eval work reasonably well, but inference is very unstable. I think it's really the wrong dataset to use... I switched gears a bit and am now using GST and speaker embeddings with the entire German M-AILABS dataset with 5 different speakers. Just started the first training for that. I also have implemented support for multiple datasets now: dataset type, language and speaker id can be specified per dataset (a rough sketch of what that looks like is below). It would be easier to contribute these things if smaller refactorings also got pulled in :D -> #192. Then I don't have to work on some old state that I have already changed on my fork.
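Something like the following is meant by a per-dataset configuration; the field names here are purely illustrative and not the actual config keys of the fork:

```python
# Hypothetical sketch of specifying dataset type, language and speaker id per dataset.
datasets = [
    {"dataset_type": "ljspeech", "path": "/data/ljspeech",      "language": "en", "speaker_id": 0},
    {"dataset_type": "mailabs",  "path": "/data/mailabs/de_DE", "language": "de", "speaker_id": 1},
]
```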
I tried to train karlsson from M-AILABS but so far no success. Its clips are noisy and of poor quality. Let me know if you get anything better.
Good stuff, which type of attention did you use for GST? I went back to the summation of tokens instead of multi-head attention since it seemed easier to control during inference. But things are really a bit fickle: sometimes the alignment just breaks off for certain token combinations or certain speakers. I have seen some effects similar to yours, where the tokens influence the length of pauses or commas, or even which commas are attended to and which aren't (e.g. first or last). And then I once had a certain style make one speaker sound like another speaker. I tried so many things recently, I'm sure there's something I can contribute back.
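For reference, a minimal sketch (not the repo's actual code) of the "summation of tokens" variant: the reference embedding predicts softmax weights over a learned token bank, and the style embedding is the weighted sum of the tokens. At inference the weights can simply be set by hand, which is what makes it easier to control than multi-head attention.

```python
import torch
import torch.nn as nn

class TokenSummationGST(nn.Module):
    """Style embedding as a softmax-weighted sum of learned tokens (sketch)."""

    def __init__(self, n_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.3)
        self.to_weights = nn.Linear(ref_dim, n_tokens)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim) from the reference encoder
        weights = torch.softmax(self.to_weights(ref_embedding), dim=-1)  # (batch, n_tokens)
        style = weights @ torch.tanh(self.tokens)                        # (batch, token_dim)
        return style, weights
```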
I used multi-head attention as in the paper. I think to target speaker IDs, we can add an embedding layer to the style encoder.
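A short sketch of that idea, i.e. concatenating a learned speaker embedding with the reference-encoder output before the style attention (names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

speaker_embedding = nn.Embedding(num_embeddings=5, embedding_dim=64)  # e.g. 5 M-AILABS speakers

def style_encoder_input(ref_embedding, speaker_ids):
    # ref_embedding: (batch, ref_dim), speaker_ids: (batch,) long tensor of speaker indices
    spk = speaker_embedding(speaker_ids)            # (batch, 64)
    return torch.cat([ref_embedding, spk], dim=-1)  # (batch, ref_dim + 64)
```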
So it is implemented. I'll close this, and we can open a new issue if there are more experiments.
Hi @erogol, I'm trying to understand how you implemented inference with single tokens. I cannot find it in the code and I don't know how to integrate it with multi-head attention. Thanks.
I took the model into a notebook and ran it manually. There I rewrote the inference to use only the token I chose. Unfortunately, it is not here in the library, but it is easy to replicate.
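Something along these lines reproduces the idea: skip the reference encoder and attention at inference and inject a single (tanh-squashed, scaled) token as the style embedding. The attribute `gst_layer.style_tokens` and the way the style embedding is combined with the text encoder outputs are assumptions and depend on the actual model.

```python
import torch

def style_from_single_token(gst_layer, token_index, scale=0.3):
    """Build a style embedding from one learned token instead of attention (sketch)."""
    token = torch.tanh(gst_layer.style_tokens[token_index])  # (token_dim,)
    return scale * token.unsqueeze(0)                        # (1, token_dim)

style = style_from_single_token(gst_layer, token_index=3)
# e.g. encoder_outputs = encoder_outputs + style.unsqueeze(1)  # broadcast over time steps
```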
Can you share your notebook?
I don't keep it anymore, unfortunately.
Global Style Tokens are embeddings that capture prosodic styles across the training set. They allow the system to explicitly specify the desired prosody of a generated sequence, i.e. essentially how the sentence is spoken, e.g. with a certain emotion, whispering, etc. Additionally, they should help training, because the text of an example contains no hints about prosody; the TTS system currently has to guess the prosody or factor it into the character/phoneme embeddings. A rough sketch of the mechanism is given below.
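As a sketch of the mechanism described in the GST paper (not a proposal for the exact code), the style token layer is multi-head attention in which a reference/prosody embedding queries a small bank of learned tokens; the attention output is the style embedding that conditions the Tacotron encoder. Dimensions and names below are illustrative.

```python
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    """Reference embedding attends over a learned token bank (sketch)."""

    def __init__(self, n_tokens=10, token_dim=256, ref_dim=128, n_heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.3)
        self.query_proj = nn.Linear(ref_dim, token_dim)
        self.attention = nn.MultiheadAttention(embed_dim=token_dim, num_heads=n_heads,
                                               batch_first=True)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim) from the reference encoder
        query = self.query_proj(ref_embedding).unsqueeze(1)             # (batch, 1, token_dim)
        keys = torch.tanh(self.tokens).unsqueeze(0).expand(query.size(0), -1, -1)
        style, _ = self.attention(query, keys, keys)                    # (batch, 1, token_dim)
        return style.squeeze(1)                                         # (batch, token_dim)
```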
The main papers for this line of work are
To implement this in Mozilla TTS I think the following steps are necessary:
Thoughts?