
Update on maintaining this project #364

Closed
CorentinJ opened this issue Jun 19, 2020 · 47 comments

@CorentinJ
Owner

We're one year on from the initial publication of this project. I've been busy with both exams and work since, and it was only last week that I passed my last exam. During that year, I have received SO many messages from people asking for help in setting up the repo, and I just had no time to allocate to any of that.
I kinda wished that the popularity of this repo would die down, but new people keep coming in at a fairly constant rate.
I have no intention of starting to develop this repo again, but I hope I can answer some questions and possibly review some PRs. Use this issue to ask me questions and to bring light to things that you believe need to be improved, and we'll see what can be done.

@CorentinJ CorentinJ pinned this issue Jun 19, 2020
@CorentinJ
Owner Author

First things first: the biggest issue for me with this project is the hecking tensorflow code. Tensorflow sucks, and it sucks just as much to install it, let alone an older version of it.

I believe it would lower the entry barrier for new users if that package were upgraded to a recent version. I've seen a PR for that, but it seems to cover only the Colab version. A PR for the entire repo would be appreciated.

Ideally, we'd replace all of the synthesizer code with pytorch code (there are several open source pytorch synthesizers out there), but that's a lot of work.

If anybody is willing to pick up on either of these things, let me know.

@CorentinJ
Owner Author

Second thing: webrtcvad. That package is hell to install on Windows. There are alternatives for noise removal out there. There's also the possibility of not using it at all, but for both LibriSpeech and LibriTTS I would recommend it.
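For context, here is a minimal sketch of VAD-based silence trimming with webrtcvad, in the spirit of what the repo's encoder preprocessing does; the frame length and aggressiveness values are illustrative, not the repo's exact settings:

```python
import struct

import numpy as np
import webrtcvad

def trim_silence(wav, sample_rate=16000, frame_ms=30, aggressiveness=3):
    """Keep only the frames that webrtcvad flags as speech."""
    vad = webrtcvad.Vad(aggressiveness)  # 0 (lenient) to 3 (aggressive)
    frame_len = sample_rate * frame_ms // 1000
    # webrtcvad expects 16-bit mono PCM bytes, so convert from float [-1, 1].
    pcm = struct.pack(f"{len(wav)}h", *np.round(wav * 32767).astype(np.int16))
    voiced = []
    for start in range(0, len(wav) - frame_len + 1, frame_len):
        frame = pcm[start * 2:(start + frame_len) * 2]
        if vad.is_speech(frame, sample_rate):
            voiced.append(wav[start:start + frame_len])
    return np.concatenate(voiced) if voiced else wav
```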

@ghost

ghost commented Jun 19, 2020

I'd like #331 merged to enable CPU support by default. It also simplifies the install process for those with a goal of running demo_cli.py for evaluation purposes.

Some kind of API or improved CLI would be a worthwhile and easy enhancement for the community to pursue. Good usability will help keep this repo as the focal point for development of open-source SV2TTS. This is really neat stuff, many thanks for sharing your code and pre-trained models under a permissive license.
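A hedged sketch of the device fallback such a change implies: prefer CUDA when a GPU is visible, otherwise run on the CPU so demo_cli.py still works for evaluation. The function name is illustrative, not the PR's actual code:

```python
import torch

def pick_device():
    """Prefer CUDA when a GPU is visible, otherwise fall back to CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = pick_device()
print(f"Running inference on: {device}")
```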

@CorentinJ
Owner Author

CorentinJ commented Jun 20, 2020

I'll review #331 tomorrow and will probably make some changes as well.

@ghost

ghost commented Jun 22, 2020

Thank you for reviewing #331. In response, I have submitted #366, which addresses your comments and carefully removes all unnecessary changes from the PR. When you have time, please review and merge that one instead.


@ghost ghost mentioned this issue Jun 22, 2020
@ghost

ghost commented Jun 24, 2020

Opened #375 to propose a workaround for webrtcvad.

@JakubKoralewski

> I kinda wished that the popularity of this repo would die down, but new people keep coming in at a fairly constant rate.

> Use this issue to ask me questions and to bring light to things that you believe need to be improved, and we'll see what can be done.

> there are many other better open-source implementations of neural TTS out there, and new ones keep coming every day.

It would be awesome if you could point out some alternatives; maybe people would start using them instead. I'm not knowledgeable at all in this field, so I don't know how to find anything on my own or how to compare which repos are good and which ones work best for what.

I think having a free GUI app that you just load and click is the appeal of your software.

@CorentinJ
Owner Author

> I think having a free GUI app that you just load and click is the appeal of your software.

Yeah, I can't say I expected it to have that big of an impact on the popularity of this repo when I wrote it. Too bad it only looks easy, but is still out of reach for most people with little programming experience.

Prior to becoming my colleague, fatchord wrote not only WaveRNN but also a Tacotron 1 implementation (which, by the way, has not been proven inferior to Tacotron 2): https://github.com/fatchord/WaveRNN

NVIDIA has a Tacotron 2 implementation: https://github.com/NVIDIA/tacotron2

Mozilla as well, with more frequent updates & features: https://github.com/mozilla/TTS

I would also check paperswithcode.com and ignore my repo and the ones above if you're looking for something else, perhaps something more recent, as neural TTS is still very much a growing field: https://paperswithcode.com/task/text-to-speech-synthesis

@cantrell

cantrell commented Jul 2, 2020

Hi, @CorentinJ. This is a fantastic project which I've had a lot of fun playing around with.

The biggest challenge with using other projects seems to be datasets. All the other projects I've found are most easily trained on the LJSpeech dataset, whereas this one can generate unique results from a small sample of audio. Are you aware of any other projects that can be used to clone speech with small audio samples? Thanks!

@CorentinJ
Owner Author

@cantrell You've got to understand how voice cloning works in this repo. The Tacotron 2 architecture in my repo barely differs from the usual Tacotron 2. The only thing added is a way to condition it on a speaker's voice, which is a very minor addition. Hence, it should be simple to transfer that over to an existing Tacotron 2 implementation. The Mozilla repo has ongoing (or maybe finished?) work on that, so that's one alternative.

Do understand that it's not a matter of training the model on only 5 seconds of audio; it's an entirely different procedure that does not involve any training.
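For reference, this is roughly how demo_cli.py wires the three models together; the cloning step is just computing an embedding, with no training involved. The checkpoint paths below are examples, so adjust them to your checkout:

```python
from pathlib import Path

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three pretrained models (example paths).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-pretrained/taco_pretrained"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# 1. Derive a speaker embedding from ~5 seconds of reference audio.
preprocessed_wav = encoder.preprocess_wav(Path("reference.wav"))
embed = encoder.embed_utterance(preprocessed_wav)

# 2. Condition the synthesizer on that embedding to get a mel spectrogram.
specs = synthesizer.synthesize_spectrograms(["Hello world."], [embed])

# 3. Vocode the spectrogram into a waveform.
generated_wav = vocoder.infer_waveform(specs[0])
```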

@cantrell

cantrell commented Jul 2, 2020

Got it. Thanks, @CorentinJ. I'll take a closer look (and/or check out the Mozilla implementation).

@dathudeptrai

@CorentinJ @cantrell Can you guys take a look at our recent TTS framework (https://github.com/TensorSpeech/TensorflowTTS)? We support Tacotron 2, FastSpeech, FastSpeech 2, and Multi-band MelGAN in a native TensorFlow implementation. We also plan to support other languages, TFLite for mobile, and TensorRT for server deployment. Almost all supported models run in real time now.

audio samples: https://tensorspeech.github.io/TensorflowTTS/
colab demo: https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing

I can make a pull request if you want :D

@CorentinJ
Owner Author

@dathudeptrai Cool, do go ahead, but remember that you'll have to ensure that data compatibility between WaveRNN and the synthesizer is maintained, and that you will have to provide new pretrained weights for both of these models.

@dathudeptrai

@CorentinJ I think it's not hard to convert the pretrained Tacotron 2 here to my TensorFlow 2 implementation, since my implementation is based on the Tacotron 2 code used here.

@ghost

ghost commented Jul 5, 2020

@CorentinJ can you please take a quick look at #227 (the synthesizer produces large gaps when processing very short texts) and give us a clue where that issue might come from, or where to start if we want to fix it?

Edit: @macriluke says it results from the training dataset. Is it really because the models are trained on medium-to-long utterances? #291 (comment)

@macriluke

macriluke commented Jul 5, 2020

> @CorentinJ can you please take a quick look at #227 (the synthesizer produces large gaps when processing very short texts) and give us a clue where that issue might come from, or where to start if we want to fix it?
>
> Edit: @macriluke says it results from the training dataset. Is it really because the models are trained on medium-to-long utterances? #291 (comment)

I was going off of this bit of the thesis:

> The prosody is however sometimes unnatural, with pauses at unexpected locations in the sentence, or the lack of pauses where they are expected. This is particularly noticeable with the embedding of some speakers who talk slowly, showing that the speaker encoder does capture some form of prosody. The lack of punctuation in LibriSpeech is partially responsible for this, forcing the model to infer punctuation from the text alone. This issue was highlighted by the authors as well, and can be heard on some of their samples of LibriSpeech speakers. The limits we imposed on the duration of utterances in the dataset (1.6s - 11.25s) are likely also problematic. Sentences that are too short will be stretched out with long pauses, and for those that are too long the voice will be rushed.

It looks like I may have made the wrong assumption about the meaning of the word "pauses" here, as I see in #53 it's mentioned that this issue is introduced through the code.

EDIT: I will say that while the whooshing and long pauses aren't as common in other pretrained Tacotrons, I have heard them in mid-training evaluations of different synthesis models, so the real cause could potentially be both the training and the code here.

@CorentinJ
Owner Author

CorentinJ commented Jul 6, 2020

This issue of large gaps is something that also occurred at Resemble.AI, and that I have worked on and fixed. It's a serious amount of work; I'll give you the broad strokes:

  • Use LibriTTS instead of LibriSpeech in order to have punctuation.
  • LibriTTS needs to be curated to remove speakers with bad prosody.
  • You can lower the upper bound I put on utterance duration, which I suspect has the effect of removing long utterances that are more likely to contain pauses (I formally evaluated models trained this way, and they generate long pauses less frequently). It also trains faster and has no drawbacks (with a good attention paradigm, the model can generate sentences longer than those seen in training). A sketch of this filtering follows after this list.
  • The attention paradigm needs to be replaced; forward attention is poor.
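On the third point, here is a minimal sketch of duration-based filtering when building the synthesizer dataset; the 1.6 s lower bound is the one from the thesis, while the 7 s cap is only an example of a lowered upper bound:

```python
import librosa

MIN_DURATION_S = 1.6  # lower bound from the thesis
MAX_DURATION_S = 7.0  # example of a lowered upper bound (was 11.25 s)

def keep_utterance(wav_path, sample_rate=16000):
    """Return True if the utterance's duration falls inside the bounds."""
    wav, _ = librosa.load(wav_path, sr=sample_rate)
    duration = len(wav) / sample_rate
    return MIN_DURATION_S <= duration <= MAX_DURATION_S
```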

@Liujingxiu23

@CorentinJ
It's a pity that you've decided not to update this project any more.
I have followed your work since the latter half of 2019.
For the encoder part, I removed the ReLU activation of the last linear layer and trained with 18k speakers (Chinese + English) for about 2-3 months (see the sketch after the questions below). I used the Resemblyzer tool, as well as my own tooling, to analyze the embeddings generated by the model. I guess the encoder is ready.
For the synthesizer, can you help me and give me some advice?

  1. My target language is Chinese, and I do not have enough TTS corpus to train the synthesizer; only ASR corpora can be found, for example AISHELL, but the quality is not so good. Do you have any suggestions for preprocessing the wavs?
  2. Given a target wav, the end-to-end synthesized wavs have some characteristics of the target timbre, but they are only similar at a low level. Do you have any suggestions for improving the similarity? How are your results? Could you share some of your best results?
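For clarity, a minimal sketch of the ReLU change described above: the repo's SpeakerEncoder applies a ReLU to the final linear projection before L2-normalizing, and the modification drops it so embedding components can go negative. The function name is illustrative:

```python
import torch

def embed_from_hidden(linear, hidden_last):
    """Project the last LSTM hidden state into an L2-normalized embedding."""
    embeds_raw = linear(hidden_last)  # ReLU removed here
    return embeds_raw / (torch.norm(embeds_raw, dim=1, keepdim=True) + 1e-5)
```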

@CorentinJ
Owner Author

@Liujingxiu23 I don't have any suggestions regarding your data. As for the audio quality, you can improve it by finetuning both Tacotron and the vocoder on a single speaker. To improve the quality of voice cloning in general, there's a lot more work, starting with the list I gave above.

@mueller91

Dear @CorentinJ, thank you for your amazing work and your continued support here. I have a few questions:
a) Would you still apply denoising to LibriTTS? I find that the samples are high quality, and the data itself has already been cleaned.
b) Can I train on both LibriTTS and VCTK? If so, what should I look out for?
c) When training the speaker encoder (SE), I find that the datasets differ in difficulty: VCTK, LibriTTS, and Mozilla Common Voice are 'easy' for the SE, which achieves low loss and low EER quickly. However, VoxCeleb{1,2} are much harder.
-> Should I train on each dataset separately and, once the model has 'trained out' on the easier datasets, skip them in favor of more iterations on VoxCeleb?

@CorentinJ
Owner Author

a) Yes I would. Having manually curated LibriTTS myself, I can definitely say that a lot of speakers are very noisy. Do a little data exploration to convince yourself of that: pick 100 random samples and listen to all of them. There are still many issues with this improved version of LibriSpeech: inconsistent volume, background noise, poor mic quality, mic bumps, ... Regarding denoising alone, here's a sample from LibriTTS and its denoised version:
https://puu.sh/G839T.wav
https://puu.sh/G839Y.wav

b) Yes you can; some gotchas (a minimal sketch follows after this list):

  • Ensure that your preprocessed data is resampled to the same sample rate
  • Ensure you normalize volume
  • Beware of balance: compare the size of LibriTTS vs. that of VCTK, and compare the numbers of speakers. You might need to prune away some data from LibriTTS
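A minimal sketch of the first two gotchas: load every clip at one target sample rate and normalize its volume before merging the datasets. The target values here are examples:

```python
import librosa
import numpy as np

TARGET_SR = 16000  # example target sample rate

def load_normalized(wav_path, target_dbfs=-30.0):
    """Load a clip at the target sample rate and normalize its RMS volume."""
    wav, _ = librosa.load(wav_path, sr=TARGET_SR)  # librosa resamples on load
    rms = np.sqrt(np.mean(wav ** 2))
    if rms > 0:
        wav = wav * (10 ** (target_dbfs / 20) / rms)
    return np.clip(wav, -1.0, 1.0)
```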

c) I don't know if it's worth the effort. The voice encoder is a nice example of "throw more resources at it and it'll keep improving": if you merge your datasets (although, again, balance might be an issue given the size of VoxCeleb) and train for long enough, it should perform well anyway.

@mueller91

mueller91 commented Jul 17, 2020

Thank you so much for your answer. Two follow-up questions:
a) Why would dataset balance be an issue? Assume I have 10 times more samples from LibriTTS than from VCTK: if the input format, sampling rate, and preprocessing are the same, why should this imbalance matter (provided the clips are of somewhat similar quality w.r.t. noise)? Same for SP.
b) You mentioned manually curating LibriTTS. Could you elaborate on what you did in a bit more detail? Are there any papers, tools, etc. you can point me to? Did you listen to all the audio files? (I cannot imagine this.)

Again, thank you so much for your answers. At my university (Munich, Germany), nobody is doing speech synthesis; I'm a bit on my own here.

@CorentinJ
Owner Author

a) It's a matter of what you want. If you want to reach VCTK quality, then LibriTTS samples vastly outnumbering VCTK samples is going to cancel that out, because sampling is uniform. In a classical multispeaker model with a speaker table (i.e. an embedding layer), it would still make sense to have a 10-to-1 ratio if your goal were only to encode a voice for these speakers in the speaker table. (A sketch of one way to rebalance uniform sampling follows below.)
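A minimal sketch of one workaround for the uniform-sampling problem: weight each sample inversely to its dataset's size, so both datasets are drawn from equally often regardless of how many clips each contributes. This is an illustration, not what the repo does:

```python
import torch
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(dataset_labels):
    """dataset_labels: one label per sample, e.g. "libritts" or "vctk"."""
    counts = {name: dataset_labels.count(name) for name in set(dataset_labels)}
    # Inverse-frequency weights: each dataset contributes ~equally per epoch.
    weights = torch.tensor([1.0 / counts[name] for name in dataset_labels])
    return WeightedRandomSampler(weights, num_samples=len(weights))

# Example: 10x more LibriTTS than VCTK still yields roughly 50/50 draws.
sampler = balanced_sampler(["libritts"] * 1000 + ["vctk"] * 100)
```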

b) I can't elaborate too much, no. Just know that some of the data is of poor quality, and some is great. A bit of data exploration should give you an idea.

@ghost ghost mentioned this issue Aug 9, 2020
@DanChristos

> We're one year on from the initial publication of this project. I've been busy with both exams and work since, and it was only last week that I passed my last exam. During that year, I have received SO many messages from people asking for help in setting up the repo, and I just had no time to allocate to any of that.
> I kinda wished that the popularity of this repo would die down, but new people keep coming in at a fairly constant rate.
> I have no intention of starting to develop this repo again, but I hope I can answer some questions and possibly review some PRs. Use this issue to ask me questions and to bring light to things that you believe need to be improved, and we'll see what can be done.

You wanted the popularity of this repo to go down because you couldn't handle the requests? That's kinda absurd; people's interest is a good thing, and more developers means less work on one person's shoulders. ;)

@mbdash
Collaborator

mbdash commented Sep 3, 2020

@CorentinJ

> If tensorflow is entirely removed from this repo, I will change that message for sure.

I still get a lot of feedback from people who have spent hours trying to set things up.

In my opinion, it is actually not that hard to set up on Ubuntu.
On Windows... well... good luck (for now).

I hope this will help reduce complaints:

WIKI

Installation - Ubuntu 20.04

Installation - Windows 10 (TODO)

Repository owner deleted a comment from steven850 Sep 13, 2020
@CodingRox82

I want to implement something like this for voice-to-voice. Basically, I want to record a voice and then use it as a basis for masking N voices, where N >> 1. Some questions:

  1. Regarding "If you're planning to work on a serious project, my strong advice: find another TTS repo.": @CorentinJ, would this comment still apply if I don't need the part that reads and creates audio from a given text?
  2. I understand that the impressive part of this repo is that it can clone a voice given only 5 seconds of audio, but in general, does the output improve with training on more (and more diverse) data? What if I wanted a professional speaker to record hours of data to serve as input audio? Would the output improve in quality?

@mbdash
Collaborator

mbdash commented Sep 28, 2020

@CodingRox82 Hi,
if you are seriously interested in voice-to-voice / voice changer / voice transfer / "insert any other description that involves converting audio from one speaker to another without passing through TTS":

Would you be interested in joining a small group with a common interest?
We are currently working on creating a polished dataset.
Our small group has different but overlapping interests, for the good of this repo and others that can provide voice-to-voice, bypassing TTS.

If you are interested, leave a comment in #474

Repository owner deleted a comment from XCanG Sep 28, 2020
@ghost

ghost commented Oct 12, 2020

@CorentinJ Thanks for providing the statement of direction in #543 (comment).

In that context, it's not worth my time to continue providing technical support as I have for the last few months. It was initially helpful for identifying common pain points, but now it mainly comes down to getting rid of tensorflow, and people asking for an exe. To help potential developers, I suggest disallowing the use of the issues board for tech support and requests for help with projects, since it dilutes the development effort. I've donated a lot of my time trying to build some sense of community, but unfortunately it is not attracting and retaining the type of people who can push this project forward.

Tensorflow has the following issue policy, and it could help to implement something similar. I realize this will be unpopular because a lot of individuals want help and tech support, but it needs to be understood that you get what you pay for with open source.

> If you open a GitHub Issue, here is our policy: 1. It must be a bug/performance issue or a feature request or a build issue or a documentation issue (for small doc fixes please send a PR instead). 2. Make sure the Issue Template is filled out. 3. The issue should be related to the repo it is created in.
>
> Here's why we have this policy: We want to focus on the work that benefits the whole community, e.g., fixing bugs and adding features. Individual support should be sought on Stack Overflow or other non-GitHub channels. It helps us to address bugs and feature requests in a timely manner.

@CodingRox82

Sorry for the late reply, @blue-fish. I'm definitely interested in using this. I like your idea of creating a pre-compiled version to give people to test out. I'm going to start tinkering with this to try to get it working, and if I find the time to learn how to create a distributable precompiled version, I'll give it a shot.

@CorentinJ
Owner Author

@blue-fish Thanks a lot for your valuable help and time. I did come to the same conclusions as you. A lot of the users coming through are highly inexperienced.

I have been wanting to make things simpler just for the sake of reducing the number of technical support requests, but my awkward position makes it hard for me to stay involved.

@macriluke

Let me know if I'm up to date on this:

  • blue-fish finished the effort to implement and train in pytorch in his fork.

  • On review, it was decided that the tensorflow model was still better in overall quality.

  • Sometimes with the tensorflow model, the stop token prediction fails and results in large gaps in the synthesis.

  • Sometimes the pytorch model will quit in the middle of synthesis; something to do with the attention model?

@CorentinJ
Owner Author

The stop token prediction (whether the model knows when to end the generation) of the tensorflow model is usually good; the long pauses are more of a dataset/data-representation and attention-mechanism issue.

The pytorch model is the one that fails at predicting stop tokens, indeed due to its attention mechanism, and that is why it stops mid-generation.
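For readers unfamiliar with stop tokens, here is a minimal sketch of how they drive a Tacotron-style decoding loop; `decoder_step` is a hypothetical stand-in for the model's decoder, which emits one mel frame plus a stop logit per step:

```python
import torch

def decode(decoder_step, first_frame, max_steps=1000, stop_threshold=0.5):
    """Generate mel frames until the stop token fires or max_steps is hit."""
    frames, frame = [], first_frame
    for _ in range(max_steps):
        frame, stop_logit = decoder_step(frame)
        frames.append(frame)
        if torch.sigmoid(stop_logit).item() > stop_threshold:
            break  # the model predicts the utterance is finished
    return torch.stack(frames)
```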

@macriluke

Ah okay, I had it almost exactly backwards.

So following blue-fish's instructions in #538 to retrain tensorflow on LibriTTS/LibriSpeech should resolve the long pauses, and it also won't have the stop token issue?

@CorentinJ
Owner Author

Recent similar projects:
https://github.com/Tomiinek/Multilingual_Text_to_Speech
https://github.com/espnet/espnet


@eyewebs

eyewebs commented Jan 15, 2021

> Recent similar projects:
> https://github.com/Tomiinek/Multilingual_Text_to_Speech
> https://github.com/espnet/espnet

Can I also clone voices with these repos using a small audio clip of 3-5 minutes? This repo needs a 5-second audio clip, but for Resemble.AI a larger voice sample is better. Now Resemble asks for voice verification, something I can't do.

Are there repos that can also use a longer voice sample of, for example, 5 minutes, and that sound better than this repo? If so, which ones have the best results?

I would like to pay the person who can help me make good voice clones from 3-5 minute samples. I really need it. blue-fish, I see you're very active here. Help me? :)

Repository owner deleted a comment Jan 17, 2021
@ghost ghost mentioned this issue Jan 20, 2021
@pablodz

pablodz commented Jan 24, 2021

Could you add some maintainers to the repo, create an announcement, and ask for help? This has happened before with other repositories.

Repository owner deleted a comment from spaesleme Jan 25, 2021
Repository owner deleted a comment from pablodz Jan 25, 2021
@BrentonBadGoy

That's very good work, congrats.
I don't know if this is the right place to post this, but the system gives an American accent to the cloned voice even though the speaker I want to clone has a British accent. Is it the encoder, the synthesizer, the vocoder, or all three? Is there a way to change this without having an Nvidia GPU to train the models? Or are there already models trained with a British accent available?
Also, I noticed the pronunciation is sometimes wrong, and some words are even missed entirely. Is there a way to change this? Maybe it's due to punctuation not being taken into account?

Repository owner deleted a comment from mennatallah644 Feb 5, 2021
@CorentinJ CorentinJ unpinned this issue Feb 5, 2021
Repository owner locked and limited conversation to collaborators Feb 5, 2021