
Training Multi-Language and Multi-Speaker Model #119

Open
see2run opened this issue Nov 28, 2024 · 7 comments
@see2run commented Nov 28, 2024

Hi, I want to ask about multi-language and multi-speaker training. Can I do that? Maybe the dataset format could be like this: file_wav|text|lang_id|speaker_id

@shivammehta25 (Owner) commented

Hello! Yes, you can definitely do that, but you will need to devise a way to merge the two streams of conditioning information. Below are some naive approaches that do not guarantee disentanglement between language and speaker IDs.

  1. Change the filelists the way you've suggested.
  2. Adjust the number of values returned from the parsed filelist in this method:

         def get_datapoint(self, filepath_and_text):
             if self.n_spks > 1:
                 filepath, spk, text = (
                     filepath_and_text[0],
                     int(filepath_and_text[1]),
                     filepath_and_text[2],
                 )

  3. Manage phonemization:

         global_phonemizer = phonemizer.backend.EspeakBackend(
             language="en-us",
             preserve_punctuation=True,
             with_stress=True,
             language_switch="remove-flags",
             logger=critical_logger,
         )

     One approach, which @jimregan used, is to change this to a dict, initialise a backend per language globally, and call the matching one whenever that language id is encountered (a sketch follows below).
  4. Pass language information here:

         diff_loss, _ = self.decoder.compute_loss(x1=y, mask=y_mask, mu=mu_y, spks=spks, cond=cond)

  5. Merge it into the decoder; a naive way is similar to this:

         if spks is not None:
             spks = repeat(spks, "b c -> b c t", t=x.shape[-1])
             x = pack([x, spks], "b * t")[0]

And then you should be able to train it multi-language and multi-speaker! Hope this helps.
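
A minimal sketch of the dict-of-backends idea from step 3, assuming a hypothetical SUPPORTED_LANGUAGES list and a critical_logger configured roughly as in the repository (both names are illustrative, not existing Matcha-TTS code):

    import logging

    import phonemizer

    critical_logger = logging.getLogger("phonemizer")
    critical_logger.setLevel(logging.CRITICAL)

    # Hypothetical set of language ids appearing in the filelists.
    SUPPORTED_LANGUAGES = ["en-us", "es", "de"]

    # One EspeakBackend per language, initialised once at import time.
    global_phonemizers = {
        lang: phonemizer.backend.EspeakBackend(
            language=lang,
            preserve_punctuation=True,
            with_stress=True,
            language_switch="remove-flags",
            logger=critical_logger,
        )
        for lang in SUPPORTED_LANGUAGES
    }

    def phonemize_text(text, lang):
        # Look up the pre-built backend for this language id.
        return global_phonemizers[lang].phonemize([text], strip=True, njobs=1)[0]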

@see2run (Author) commented Dec 3, 2024

Thank you for the explanation @shivammehta25. I will summarize it as follows; is it correct?

  1. Format the dataset: file_wav|text|lang_id|speaker_id
  2. Adjust the number of values returned from the parsed filelist:

         if self.n_spks > 1:
             filepath, lang, spk, text = (
                 filepath_and_text[0],
                 filepath_and_text[2],
                 int(filepath_and_text[3]),
                 filepath_and_text[1],
             )
         else:
             filepath, text = filepath_and_text[0], filepath_and_text[1]
             spk = None
             lang = None

         text, cleaned_text = self.get_text(text, add_blank=self.add_blank)
         mel = self.get_mel(filepath)

         durations = self.get_durations(filepath, text) if self.load_durations else None

         return {"x": text, "y": mel, "spk": spk, "lang": lang, "filepath": filepath, "x_text": cleaned_text, "durations": durations}
  3. Manage phonemization:

         def cleaners_multi(text, lang):
             text = convert_to_ascii(text)
             text = lowercase(text)
             text = expand_abbreviations(text)
             global_phonemizer = phonemizer.backend.EspeakBackend(
                 language=lang,
                 preserve_punctuation=True,
                 with_stress=True,
                 language_switch="remove-flags",
                 logger=critical_logger,
             )
             phonemes = global_phonemizer.phonemize([text], strip=True, njobs=1)[0]
             # Added because in some cases espeak does not remove brackets
             phonemes = remove_brackets(phonemes)
             phonemes = collapse_whitespace(phonemes)
             return phonemes
  4. Pass language information here:

         def forward(self, x, x_lengths, y, y_lengths, spks=None, lang=None, out_size=None, cond=None, durations=None):
             ...
             diff_loss, _ = self.decoder.compute_loss(x1=y, mask=y_mask, mu=mu_y, spks=spks, lang=lang, cond=cond)

  5. I don't understand this step. How do I merge it?

@tomschelsen commented Dec 4, 2024

It would be great if this repository allowed training on a non-English language without changing the Python code, only the config files. Although I have not reviewed the whole repository, it doesn't seem to be there yet; e.g. if you use basic_cleaner in your config file, you end up training on raw text, not on phonemes, which will certainly cause problems.

@shivammehta25 I would indeed suggest changing global_phonemizer to an empty dict. Then, using global global_phonemizer, any of the cleaner/text-preprocessing functions could add a key (the language id) with the value phonemizer.backend.EspeakBackend(language=language_id, ... other sane defaults that work well with the Matcha model), or skip this if the key is already in the dict. That way each EspeakBackend (one per language) gets initialized only once, and we can all train new Matcha models without having to fork the code and re-port our changes every time a new Matcha-TTS version is released :) (Of course I only touched on the phonemization part, but you seem to know where the other things should be changed already ;) )
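
This differs from the eager sketch after the first comment only in when the backends are built; a minimal lazy version (get_backend is an illustrative name, not existing code):

    import phonemizer

    global_phonemizer = {}  # starts empty, filled on demand

    def get_backend(lang):
        # Create and cache an EspeakBackend the first time `lang` is seen.
        if lang not in global_phonemizer:
            global_phonemizer[lang] = phonemizer.backend.EspeakBackend(
                language=lang,
                preserve_punctuation=True,
                with_stress=True,
                language_switch="remove-flags",
            )
        return global_phonemizer[lang]

Each cleaner would then call get_backend(lang).phonemize(...) instead of touching a single module-level backend.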

@shivammehta25 (Owner) commented

@see2run Hey, sorry for the delay in responding. Some thoughts/suggestions:

In point 3:

    global_phonemizer = phonemizer.backend.EspeakBackend(
        language=lang,
        preserve_punctuation=True,
        with_stress=True,
        language_switch="remove-flags",
        logger=critical_logger,
    )

Move this out to global scope or create a singleton for it, because this initialisation takes some time and would slow training down if it runs on every call.

For point 5:

Now that you have a language id in the input, which is an integer (an ordinal in this case), you need to convert it into a vector. One of the easiest ways to do this is an nn.Embedding layer, so you would pass the integer through it.

In the __init__, create a new embedding layer, similar to:

    if n_spks > 1:
        self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim)

    if n_languages > 1:
        self.language_emb = nn.Embedding(n_languages, lang_emb_dim)

Then just follow the path: wherever having a speaker id changes the shapes of the layers and is added or concatenated, also change the shapes there and concatenate or add language_emb in the same way (see the sketch below).
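
A minimal sketch of that path, assuming n_languages and lang_emb_dim are new hyperparameters and the integer ids reach the module as LongTensors; the class and method names are illustrative, not existing Matcha-TTS code:

    import torch.nn as nn
    from einops import pack, repeat

    class LangAwareConditioning(nn.Module):
        # Illustrative stub: a language embedding riding along with the speaker embedding.

        def __init__(self, n_spks, spk_emb_dim, n_languages, lang_emb_dim):
            super().__init__()
            if n_spks > 1:
                self.spk_emb = nn.Embedding(n_spks, spk_emb_dim)
            if n_languages > 1:
                self.language_emb = nn.Embedding(n_languages, lang_emb_dim)
            # NOTE: any downstream layer whose input channels previously grew by
            # spk_emb_dim must now grow by spk_emb_dim + lang_emb_dim instead.

        def forward(self, x, spks=None, lang=None):
            # x: (batch, channels, time); spks, lang: (batch,) integer ids.
            if spks is not None:
                spk_vec = repeat(self.spk_emb(spks), "b c -> b c t", t=x.shape[-1])
                x = pack([x, spk_vec], "b * t")[0]
            if lang is not None:
                lang_vec = repeat(self.language_emb(lang), "b c -> b c t", t=x.shape[-1])
                x = pack([x, lang_vec], "b * t")[0]
            return x

The other naive option is adding the embeddings instead of concatenating them (when spk_emb_dim and lang_emb_dim both equal the channel size), which leaves the downstream layer shapes unchanged.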

@shivammehta25 (Owner) commented

@tomschelsen That is a good suggestion; however, Matcha-TTS was released as supplementary code for a research paper whose evaluations were run in English, so we didn't add any native support for multilingual text. I don't have the bandwidth for it right now, but I do welcome PRs. :)

@anarucu commented Dec 12, 2024

Hi @shivammehta25, what code changes are needed to train a Matcha model in Spanish? I don't mean multi-language TTS.
Would it be possible to start from a checkpoint? How?

@shivammehta25 (Owner) commented

@anarucu Hello, I have information about training on other languages here: https://github.com/shivammehta25/Matcha-TTS/wiki/Training-%F0%9F%8D%B5-Matcha%E2%80%90TTS-with-different-dataset-&-languages

As long as you have not changed the phonemizer, you can start with a pretrained checkpoint. However, further experimentation would be needed to measure its effectiveness.
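
Assuming the repository still follows the lightning-hydra-template layout it is built on (where configs/train.yaml exposes a ckpt_path field that is passed to trainer.fit), resuming from a pretrained checkpoint would look roughly like this, where experiment=your_dataset is a hypothetical experiment config:

    python matcha/train.py experiment=your_dataset ckpt_path=/path/to/pretrained.ckpt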
