Model path and hparams #9
Comments
I've been playing around with the TED-LIUM dataset and after some quick spot checks I noticed most of the talks had more than 1 person talking (the speaker + Chris Anderson) or a panel discussion with multiple speakers. Figured I would take a look at the projections using your model and one I have trained up to 1.2M steps so far.
[Projection plots: Elon Musk, Bono, Bill Gates]
Tested two languages it wasn't trained on; the colors used for each speaker are the same across both models. To be clear, there were 40 Swedish speakers among the 25,668 unique speakers used to train my model. It doesn't appear to generalize very well, and the 40 Swedish speakers in my training set didn't seem to make much of a difference.
[Projection plots: Swedish (speakers 0 - 40, speakers 40 - 80); Norwegian (speakers 0 - 40, speakers 40 - 80)]
Interesting examples! They probably explain why I'm having such varied results with the 256 pretrained encoder and the synthesizer on the Swedish recordings. I will do some tests later in the week!

One thing I can think of is that the first sound in the Scandinavian datasets is background noise. Some of them are just single words as well. That might have an effect. They're all recorded in a studio with good quality and little background noise. In a lot of the examples different speakers are saying the exact same thing, so if you're picking the first 10 there's a high chance they're saying the exact same phrase in the same studio environment.

I've been playing around a bit with word vectors with t-SNE and UMAP, and as soon as I start adding a lot of examples the results become very cluttered if just a few vectors are "off". Even on small samples of vectors it sometimes acts up, but that's more on t-SNE. Even when I can do man/king/woman/queen etc., it can look distorted when reduced to 2D space. What are the results like if you limit the sample size?

It does seem very odd, but I haven't gotten around to reading up on how the encoder really works, so I'm just speculating based on nothing. The most likely answer, I guess, is that it doesn't generalize well and a lot of the differences it picks up are more based on recording quality and background noise. How does it compare if you add other speakers into the mix with Bill Gates? Does he group up, or does he stay in separate clusters, or are the other clusters other people?

It seems to be based on the OK Google dataset, which probably has the same voice in a lot of different settings, compared to how the datasets look now: one voice in one setting, split into multiple files. If there are duplicates, it's the same voice as two sets, which forces it to look for other differences.

Again based on nothing: maybe the encoder already does this, but could reducing noise and adding different noises, so that the same speaker's recordings differ more while the voice stays the same, increase accuracy? Or maybe it just needs a lot more data and a lot more steps.
@ViktorAlm the clusters look a lot better when you only project 10 utterances per speaker in all the models. What is surprising to me is that the English-only model trained up to 260k steps is performing very well across the board. It uses a 768 hidden/embedding size vs the default of 256. Bill Gates is actually multiple speakers, so the clusters are accurate. It shows a single name because it was from a single folder. In TED Talks there is usually a moderator who introduces the speaker, or a panel with a minimum of 2 speakers on stage. I need to verify that each cluster is truly a different speaker, but after scanning through the files quickly I remember hearing at least 4 different speakers across all of his talks.
[Projection plots: Swedish, Norwegian]
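For anyone wanting to reproduce the "only project 10 utterances per speaker" step, here is a minimal sketch of the subsampling. It is not code from this thread: the function name `sample_per_speaker` and the `(speaker_id, utterance_path)` input format are assumptions made for illustration, and the embedding and UMAP projection steps are left out.

```python
import random
from collections import defaultdict


def sample_per_speaker(utterances, max_per_speaker=10, seed=0):
    """Keep at most `max_per_speaker` randomly chosen utterances per speaker.

    `utterances` is an iterable of (speaker_id, utterance_path) pairs.
    This helper and its input format are hypothetical, for illustration only.
    """
    # Group utterance paths by speaker.
    by_speaker = defaultdict(list)
    for speaker_id, path in utterances:
        by_speaker[speaker_id].append(path)

    # A seeded generator keeps the selection reproducible between runs.
    rng = random.Random(seed)
    sampled = {}
    for speaker_id, paths in by_speaker.items():
        if len(paths) > max_per_speaker:
            paths = rng.sample(paths, max_per_speaker)
        sampled[speaker_id] = list(paths)
    return sampled
```

Sampling at random rather than taking the first 10 files may also help with the concern raised earlier that the first files of a speaker can all contain the same phrase in the same studio environment.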
That seems like the way to go! Have you looked at how much the embedding differs? I mean, does it use all 768 values, with clearly activated single values? When I tested it on my small dataset (256 size, 150k steps) it only "activated" a few of the values, leaving the embedding looking very flat even though it was very good at separating in the UMAP plot.
I think that's what needs to be monitored more than the differentiation, unless you have very similar evaluation data. I guess the great results come when that thing glows. That's probably why it needs that many steps.
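One rough way to put a number on how "flat" an embedding looks is to count the share of dimensions that are clearly active. The sketch below is only an illustration: the name `active_fraction` and the 10%-of-peak threshold are assumptions, not something the encoder or this thread defines.

```python
def active_fraction(embedding, rel_threshold=0.1):
    """Return the fraction of dimensions whose magnitude exceeds
    `rel_threshold` times the largest magnitude in the embedding.

    Hypothetical helper: the threshold is an arbitrary choice, so compare
    values between models rather than reading them as absolute.
    """
    magnitudes = [abs(v) for v in embedding]
    peak = max(magnitudes)
    if peak == 0.0:
        # An all-zero vector has no active dimensions.
        return 0.0
    cutoff = rel_threshold * peak
    return sum(m > cutoff for m in magnitudes) / len(magnitudes)
```

A value near 0 would mean only a few dimensions carry the embedding; a value near 1 would mean most of them do. Averaging it over many utterances should be more informative than looking at a handful of plots.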
Honestly I'm not sure what makes more of a difference. I haven't trained the synthesizer or vocoder yet. It's hard to tell from only 4 utterances how the embedding is utilized. Obviously there are more activations in the default model, which was trained to 1M steps. The mixed model is trained to 1.7M steps but with about 3x as many speakers, and it is trained using the Swedish and Norwegian speakers from nasjonal-bank. The English model is a true 768 hidden/embedding size, while the mixed model is still training with a hidden size of 256, which I'm sure is not ideal.
I tried running the synthesizer and vocoder for about 100k steps on an encoder which had about 150k steps done on 4k examples; it was separating the clusters great but the embedding was very flat.
[Figures: Astrid Lindgren (Swedish author, Pippi Longstocking); Annie Lööf (Swedish politician)]
By the way, you will have different clusters if you project a single speaker with UMAP. It's going to try to cluster by the most distinguishing feature, so it's natural that it clusters by speakers with multiple speakers; and it's also natural that it clusters by different recording environments with a single speaker.
Thanks @CorentinJ. It makes complete sense that it clusters by different recording environments, which would explain Bill Gates' TED talks perfectly. They are recorded across different years at different venues. It also explains the two large clusters for Elon Musk across two different years and most likely different recording environments. What do you think would happen if I used all the utterances for Bill Gates across different environments and years as part of the training data for a single person? Multiple years and environments will be a somewhat common occurrence if I include TED-LIUM in the training. Or would you use each talk by the same person as a separate speaker for encoder training?
What's your goal with this idea? Training on a single speaker makes little sense, as the speaker encoder is trained on a speaker verification task. You're free to make it an "environment verification" task but I don't see anything fruitful coming out of that.
Anyway, to answer your original question, I have a good module for handling hyperparameters but I certainly won't use it in this package because it's meant for production. The same goes for changing the model path: most users won't have custom models to use it with. You're free to modify the source code to meet your needs, of course.
I guess I didn't explain it very well. I was wondering what would happen if the encoder is trained with data from mixed environments for the same speaker, will UMAP still cluster the same speaker by recording environments? Will the clusters for a single speaker still be separated?
I'm guessing this is a module you have written and it is not open source? I have changed the code for my needs, so I'll close this issue, as it doesn't sound like you want to make the hparams and model path configurable for Resemblyzer.
What are your thoughts on allowing the model path to be passed in (defaulting to None) and the hyperparameters to be overridden? Or do you think the best route is subclassing and overriding the init function?
Resemblyzer/resemblyzer/voice_encoder.py
Line 12 in cdd51df
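To make the first option concrete, here is a minimal sketch of a constructor that takes an optional model path and keyword overrides for the hyperparameters. Everything in it is hypothetical: `ConfigurableEncoder`, `EncoderHParams`, `DEFAULT_WEIGHTS`, the parameter names and the default values are placeholders, not Resemblyzer's actual API, and no model is built or loaded.

```python
from dataclasses import dataclass, replace
from pathlib import Path
from typing import Optional, Union


@dataclass(frozen=True)
class EncoderHParams:
    # Illustrative defaults only; not necessarily the package's real values.
    hidden_size: int = 256
    embedding_size: int = 256
    num_layers: int = 3


# Hypothetical location of the bundled checkpoint.
DEFAULT_WEIGHTS = Path("pretrained.pt")


class ConfigurableEncoder:
    """Sketch of an encoder wrapper with an overridable path and hparams."""

    def __init__(self, weights_fpath: Optional[Union[str, Path]] = None,
                 **hparam_overrides):
        # None falls back to the bundled checkpoint, so existing callers
        # that pass no arguments keep their current behaviour.
        self.weights_fpath = (
            Path(weights_fpath) if weights_fpath is not None else DEFAULT_WEIGHTS
        )
        # replace() raises TypeError on an unknown field name, so a typo in
        # an override fails early instead of being silently ignored.
        self.hparams = replace(EncoderHParams(), **hparam_overrides)
        # A real encoder would build its layers from self.hparams and load
        # the checkpoint from self.weights_fpath here.
```

Usage would look like `ConfigurableEncoder("my_model.pt", hidden_size=768, embedding_size=768)`. Compared with subclassing and overriding the init function, this keeps the defaults in one place and leaves the no-argument call unchanged.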