
Training Multi-Language and Multi-Speaker Model #119

Open
see2run opened this issue Nov 28, 2024 · 7 comments
@see2run commented Nov 28, 2024

Hi, I want to ask about multi-language and multi-speaker training. Can I do that? Maybe the dataset format could be like this: file_wav|text|lang_id|speaker_id

@shivammehta25 (Owner) commented

Hello! Yes, you can definitely do that, but you will need to devise a way to merge the two streams of conditioning information. Below are some naive approaches that do not guarantee disentanglement between language and speaker IDs.

  1. Change the filelists the way you've suggested.
  2. Adjust the number of values returned from the parsed filelist in this method:

         def get_datapoint(self, filepath_and_text):
             if self.n_spks > 1:
                 filepath, spk, text = (
                     filepath_and_text[0],
                     int(filepath_and_text[1]),
                     filepath_and_text[2],
                 )

  3. Manage phonemization:

         global_phonemizer = phonemizer.backend.EspeakBackend(
             language="en-us",
             preserve_punctuation=True,
             with_stress=True,
             language_switch="remove-flags",
             logger=critical_logger,
         )

     One approach, which @jimregan used, is to change this to a dict, initialise a backend per language globally, and call the matching one whenever that language id is encountered (a sketch follows below).
  4. Pass language information here:

         diff_loss, _ = self.decoder.compute_loss(x1=y, mask=y_mask, mu=mu_y, spks=spks, cond=cond)

  5. Merge it into the decoder; a naive way is similar to this:

         if spks is not None:
             spks = repeat(spks, "b c -> b c t", t=x.shape[-1])
             x = pack([x, spks], "b * t")[0]

And then you should be able to train it multi-language and multi-speaker! Hope this helps.
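
A minimal sketch of the dict-of-backends idea from step 3, assuming a hypothetical SUPPORTED_LANGUAGES list and a critical_logger configured roughly as in the repository (both names are illustrative, not existing Matcha-TTS code):

    import logging

    import phonemizer

    critical_logger = logging.getLogger("phonemizer")
    critical_logger.setLevel(logging.CRITICAL)

    # Hypothetical set of language ids appearing in the filelists.
    SUPPORTED_LANGUAGES = ["en-us", "es", "de"]

    # One EspeakBackend per language, initialised once at import time.
    global_phonemizers = {
        lang: phonemizer.backend.EspeakBackend(
            language=lang,
            preserve_punctuation=True,
            with_stress=True,
            language_switch="remove-flags",
            logger=critical_logger,
        )
        for lang in SUPPORTED_LANGUAGES
    }

    def phonemize_text(text, lang):
        # Look up the pre-built backend for this language id.
        return global_phonemizers[lang].phonemize([text], strip=True, njobs=1)[0]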

@see2run (Author) commented Dec 3, 2024

Thank you for the explanation @shivammehta25. I will summarize it as follows; is it correct?

  1. Format the dataset: file_wav|text|lang_id|speaker_id
  2. Adjust the number of values returned from the parsed filelist:

         if self.n_spks > 1:
             filepath, lang, spk, text = (
                 filepath_and_text[0],
                 filepath_and_text[2],
                 int(filepath_and_text[3]),
                 filepath_and_text[1],
             )
         else:
             filepath, text = filepath_and_text[0], filepath_and_text[1]
             spk = None
             lang = None

         text, cleaned_text = self.get_text(text, add_blank=self.add_blank)
         mel = self.get_mel(filepath)

         durations = self.get_durations(filepath, text) if self.load_durations else None

         return {"x": text, "y": mel, "spk": spk, "lang": lang, "filepath": filepath, "x_text": cleaned_text, "durations": durations}
  3. Manage phonemization:

         def cleaners_multi(text, lang):
             text = convert_to_ascii(text)
             text = lowercase(text)
             text = expand_abbreviations(text)
             global_phonemizer = phonemizer.backend.EspeakBackend(
                 language=lang,
                 preserve_punctuation=True,
                 with_stress=True,
                 language_switch="remove-flags",
                 logger=critical_logger,
             )
             phonemes = global_phonemizer.phonemize([text], strip=True, njobs=1)[0]
             # Added because in some cases espeak does not remove brackets
             phonemes = remove_brackets(phonemes)
             phonemes = collapse_whitespace(phonemes)
             return phonemes
  4. Pass language information here:

         def forward(self, x, x_lengths, y, y_lengths, spks=None, lang=None, out_size=None, cond=None, durations=None):
             ...
             diff_loss, _ = self.decoder.compute_loss(x1=y, mask=y_mask, mu=mu_y, spks=spks, lang=lang, cond=cond)

  5. I don't understand this step. How do I merge it?

@tomschelsen commented Dec 4, 2024

It would be great if this repository allowed training on a non-English language without changing the Python code, only the config files. Although I have not reviewed the whole repository, it doesn't seem to be there yet; e.g. if you use basic_cleaner in your config file, you end up training on raw text, not on phonemes, which will certainly cause problems.

@shivammehta25 I would indeed suggest changing global_phonemizer to an empty dict. Then, using global global_phonemizer, any of the cleaner/text-preprocessing functions could add a key (the language id) with the value phonemizer.backend.EspeakBackend(language=language_id, ... other sane defaults that work well with the Matcha model), or skip this if the key is already in the dict. That way each EspeakBackend (one per language) gets initialized only once, and we can all train new Matcha models without having to fork the code and re-port our changes every time a new Matcha-TTS version is released :) (Of course I only touched on the phonemization part, but you seem to know where the other things should be changed already ;) )
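
This differs from the eager sketch after the first comment only in when the backends are built; a minimal lazy version (get_backend is an illustrative name, not existing code):

    import phonemizer

    global_phonemizer = {}  # starts empty, filled on demand

    def get_backend(lang):
        # Create and cache an EspeakBackend the first time `lang` is seen.
        if lang not in global_phonemizer:
            global_phonemizer[lang] = phonemizer.backend.EspeakBackend(
                language=lang,
                preserve_punctuation=True,
                with_stress=True,
                language_switch="remove-flags",
            )
        return global_phonemizer[lang]

Each cleaner would then call get_backend(lang).phonemize(...) instead of touching a single module-level backend.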

@shivammehta25 (Owner) commented

@see2run Hey, sorry for the delay in responding. Some thoughts/suggestions:

In point 3:

    global_phonemizer = phonemizer.backend.EspeakBackend(
        language=lang,
        preserve_punctuation=True,
        with_stress=True,
        language_switch="remove-flags",
        logger=critical_logger,
    )

Move this out to global scope or create a singleton for it, because this initialisation takes some time and would slow training down if it runs on every call.

For point 5:

Now that you have a language id in the input, which is an integer (an ordinal in this case), you need to convert it into a vector. One of the easiest ways to do this is an nn.Embedding layer, so you would pass the integer through it.

In the __init__, create a new embedding layer, similar to:

    if n_spks > 1:
        self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim)

    if n_languages > 1:
        self.language_emb = nn.Embedding(n_languages, lang_emb_dim)

Then just follow the path: wherever having a speaker id changes the shapes of the layers and is added or concatenated, also change the shapes there and concatenate or add language_emb in the same way (see the sketch below).
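
A minimal sketch of that path, assuming n_languages and lang_emb_dim are new hyperparameters and the integer ids reach the module as LongTensors; the class and method names are illustrative, not existing Matcha-TTS code:

    import torch.nn as nn
    from einops import pack, repeat

    class LangAwareConditioning(nn.Module):
        # Illustrative stub: a language embedding riding along with the speaker embedding.

        def __init__(self, n_spks, spk_emb_dim, n_languages, lang_emb_dim):
            super().__init__()
            if n_spks > 1:
                self.spk_emb = nn.Embedding(n_spks, spk_emb_dim)
            if n_languages > 1:
                self.language_emb = nn.Embedding(n_languages, lang_emb_dim)
            # NOTE: any downstream layer whose input channels previously grew by
            # spk_emb_dim must now grow by spk_emb_dim + lang_emb_dim instead.

        def forward(self, x, spks=None, lang=None):
            # x: (batch, channels, time); spks, lang: (batch,) integer ids.
            if spks is not None:
                spk_vec = repeat(self.spk_emb(spks), "b c -> b c t", t=x.shape[-1])
                x = pack([x, spk_vec], "b * t")[0]
            if lang is not None:
                lang_vec = repeat(self.language_emb(lang), "b c -> b c t", t=x.shape[-1])
                x = pack([x, lang_vec], "b * t")[0]
            return x

The other naive option is adding the embeddings instead of concatenating them (when spk_emb_dim and lang_emb_dim both equal the channel size), which leaves the downstream layer shapes unchanged.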

@shivammehta25 (Owner) commented

@tomschelsen That is a good suggestion; however, Matcha-TTS was released as supplementary code for a research paper whose evaluations were run in English, so we didn't add any native support for multilingual text. I don't have the bandwidth for it right now, but I do welcome PRs. :)

@anarucu commented Dec 12, 2024

Hi @shivammehta25, what code changes are needed to train a Matcha model in Spanish? I don't mean multi-language TTS.
Would it be possible to start from a checkpoint? How?

@shivammehta25 (Owner) commented

@anarucu Hello, I have information about training on other languages here: https://github.com/shivammehta25/Matcha-TTS/wiki/Training-%F0%9F%8D%B5-Matcha%E2%80%90TTS-with-different-dataset-&-languages

As long as you have not changed the phonemizer, you can start with a pretrained checkpoint. However, further experimentation would be needed to measure its effectiveness.
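
Assuming the repository still follows the lightning-hydra-template layout it is built on (where configs/train.yaml exposes a ckpt_path field that is passed to trainer.fit), resuming from a pretrained checkpoint would look roughly like this, where experiment=your_dataset is a hypothetical experiment config:

    python matcha/train.py experiment=your_dataset ckpt_path=/path/to/pretrained.ckpt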
