[Draft][TTS] FastPitch multi-speaker pre-train and adapter fine-tune #6389
Conversation
else:
    spk_emb = self.speaker_emb(speaker).unsqueeze(1)

if self.speaker_encoder is not None:
    spk_emb = self.speaker_encoder(
@rlangman @redoctopus We need your opinion on this. Currently, speakers are represented as lookup-table embeddings (line 318). In the Adapters project we represent speakers as GST embeddings + Titanet embeddings + lookup-table embeddings. self.speaker_encoder (line 321) is the class that represents speakers; it holds the GST and Titanet representations. We were wondering whether we should also include the lookup table in this speaker_encoder class:
class SpeakerEncoder(torch.nn.Module):
This has only one downside: older models did not use the SpeakerEncoder class to represent speakers; their lookup table was part of the FastPitch class. To solve this, we can either upload new versions of the models with the SpeakerEncoder class to NGC, or write a wrapper that moves the lookup table from older model checkpoints into the SpeakerEncoder class.
What do you think?
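The checkpoint-compatibility wrapper mentioned above could be sketched roughly like this (the class name and interface are hypothetical, not the actual NeMo API):

```python
import torch

class LookupTableWrapper(torch.nn.Module):
    """Hypothetical adapter: exposes an old checkpoint's lookup table
    through a SpeakerEncoder-style interface."""

    def __init__(self, lookup_table: torch.nn.Embedding):
        super().__init__()
        self.speaker_emb = lookup_table

    def forward(self, speaker: torch.Tensor) -> torch.Tensor:
        # Same output shape as the legacy FastPitch path: (batch, 1, emb_dim)
        return self.speaker_emb(speaker).unsqueeze(1)

# Usage: lift the embedding table out of an old FastPitch checkpoint
old_table = torch.nn.Embedding(4, 8)
wrapped = LookupTableWrapper(old_table)
emb = wrapped(torch.tensor([1, 3]))
```

This would let old checkpoints load into the new code path without retraining, at the cost of a small compatibility shim.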
We can raise a separate PR for this, I would like to not make this PR more complicated.
After thinking about it, I don't think we should merge the speaker embeddings with the speaker encoder. The speaker encoder is very specific to these custom-voice experiments and unlikely to be used by most other TTS problems.
But now that we are supporting utterance-level Titanet embeddings, we should really support all 3 workflows:
- Condition only on the lookup table.
- Condition only on utterance-level embeddings.
- Condition on both together using the speaker encoder.
All 3 could be represented as instances of SpeakerEncoder, but just having a branching if-statement will be clearer.
So how about we add a configuration for whether to learn the embedding (defaulting to True for backwards compatibility):
if self.learn_speaker_embedding and n_speakers > 1:
self.speaker_emb = torch.nn.Embedding(n_speakers, symbols_embedding_dim)
else:
self.speaker_emb = None
You can then manually set it to False when you provide the speaker encoder, and we can cleanly support all 3 common training workflows:
def get_speaker_embedding(self, speaker, reference_speaker_embedding, reference_spec, reference_spec_lens):
    if self.speaker_encoder is not None:
        spk_emb = self.speaker_encoder(speaker=speaker, ...)
    elif self.speaker_emb is not None:
        if speaker is None:
            raise ValueError("speaker ids are required when conditioning on the lookup table")
        spk_emb = self.speaker_emb(speaker)
    elif reference_speaker_embedding is not None:
        spk_emb = reference_speaker_embedding
    else:
        spk_emb = None
    return spk_emb
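To make the dispatch concrete, here is a runnable toy version of that priority order (the class name and dimensions are made up for illustration; only the lookup-table path is configured):

```python
import torch

class _Demo(torch.nn.Module):
    """Minimal stand-in with only the lookup-table path configured."""

    def __init__(self, n_speakers=4, dim=8):
        super().__init__()
        self.speaker_encoder = None
        self.speaker_emb = torch.nn.Embedding(n_speakers, dim)

    def get_speaker_embedding(self, speaker=None, reference_speaker_embedding=None):
        # Priority: speaker encoder > lookup table > precomputed reference embedding
        if self.speaker_encoder is not None:
            return self.speaker_encoder(speaker=speaker)
        if self.speaker_emb is not None:
            if speaker is None:
                raise ValueError("speaker ids are required for lookup-table conditioning")
            return self.speaker_emb(speaker)
        return reference_speaker_embedding  # may be None

m = _Demo()
emb = m.get_speaker_embedding(speaker=torch.tensor([2]))
```

Checking the inputs with `is not None` (rather than plain truthiness) matters here: truth-testing a multi-element tensor raises a RuntimeError in PyTorch.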
I'm in favor of including the lookup table in the speaker encoder class if it makes things simpler and improves the abstraction/organization. A little pain in re-training in the short term is okay!
@rlangman If you look at the SpeakerEncoder class in submodules.py, it has the same logic: you can mix and match whichever speaker representation you want to use. It takes (speaker_embedding_from_lookup_table, ref_audio, ref_titanet_speaker_embedding) as input, and based on the config we use whichever we want.
I also favor having one single speaker-representation class that we initialize based on the config and then pass to the FastPitch module, instead of initializing the lookup table in the constructor of the FastPitch module. We already do this for all our attribute predictors; only the speaker table is fixed in the FastPitch module constructor, which forces us to pass parameters like n_speakers.
This would give us uniformity: all our submodules (attribute predictors, aligner, etc.) are first initialized based on the config in models/fastpitch.py and then the objects are passed to modules/fastpitch.py, except the speaker representation.
Thoughts? @redoctopus @rlangman
Either way technically works. It looks like we all agree the existing SpeakerEncoder should have an embedding table internally.
If we don't want to go with my suggestion above and do want to have a SpeakerEncoder as the only mechanism later down the line, then we should create and finalize an interface in this PR to avoid breaking the adapter models later.
The current SpeakerEncoder in submodules.py is the "spaghetti code" approach, with tons of if-statements. A cleaner way to implement it would be something like:
class SpeakerEncoder(ABC):
@abstractmethod
def get_speaker_embedding(self, speaker: Tensor, reference_speaker_embedding: Tensor, ...) -> Tensor:
raise NotImplementedError
Then have one implementation with the current GST logic as a single flow, with the if/else statements removed, and another simple one:
class LookupTableSpeakerEncoder(SpeakerEncoder):
    def __init__(self, num_speakers, embedding_dim):
        super().__init__()
        self.speaker_emb = torch.nn.Embedding(num_speakers, embedding_dim)

    def get_speaker_embedding(self, speaker: Tensor, ...):
        if speaker is None:
            raise ValueError("speaker ids are required for the lookup-table encoder")
        return self.speaker_emb(speaker)
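Filled out as runnable code (the base-class name and signatures here are my guesses, not the final interface), the sketch might look like:

```python
from abc import ABC, abstractmethod
import torch

class SpeakerEncoderBase(torch.nn.Module, ABC):
    """Hypothetical abstract interface for all speaker representations."""

    @abstractmethod
    def get_speaker_embedding(self, speaker=None, reference_speaker_embedding=None):
        raise NotImplementedError

class LookupTableSpeakerEncoder(SpeakerEncoderBase):
    """Simplest implementation: a learned per-speaker lookup table."""

    def __init__(self, num_speakers: int, embedding_dim: int):
        super().__init__()
        self.speaker_emb = torch.nn.Embedding(num_speakers, embedding_dim)

    def get_speaker_embedding(self, speaker=None, reference_speaker_embedding=None):
        if speaker is None:
            raise ValueError("speaker ids are required for the lookup-table encoder")
        return self.speaker_emb(speaker)

enc = LookupTableSpeakerEncoder(num_speakers=10, embedding_dim=16)
out = enc.get_speaker_embedding(speaker=torch.tensor([0, 3]))
```

Inheriting from both torch.nn.Module and ABC keeps parameter registration working while making the interface enforceable.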
Then, to deprecate the old system, we can have a branch that uses the existing lookup table when a speaker encoder is not provided, and does not create a lookup table in the FastPitch model when one is.
Then we take our time to retrain all of the FastPitch models on NGC with the new LookupTableSpeakerEncoder, and delete the old branch when we are ready.
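That deprecation branch could look roughly like this (the helper name and defaults are hypothetical):

```python
import torch

def build_speaker_conditioning(speaker_encoder=None, n_speakers=1, symbols_embedding_dim=384):
    """Hypothetical constructor-time branch: keep the legacy lookup table
    only while no speaker encoder is supplied."""
    if speaker_encoder is not None:
        # New path: speaker conditioning lives entirely in the encoder.
        return speaker_encoder, None
    # Legacy path: create the in-model lookup table, as older checkpoints expect.
    speaker_emb = torch.nn.Embedding(n_speakers, symbols_embedding_dim) if n_speakers > 1 else None
    return None, speaker_emb

encoder, table = build_speaker_conditioning(n_speakers=8, symbols_embedding_dim=32)
```

Once all NGC checkpoints carry their own encoder, the legacy branch can be deleted without a breaking change.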
I know this is a draft, but we had a similar PR before, and the first main comment was that the PR can be broken into independent commits/features:
- Adapter fine-tuning
- Reference-audio handling and GST
- External speaker embeddings / Titanet support
- Auxiliary speaker-encoder system
- Tutorial(s)
What does this PR do?
Adds FastPitch speaker adaptation with adapters.
Collection: TTS
Changelog
- nemo/collections/tts/data/tts_dataset.py
- nemo/collections/tts/torch/tts_data_types.py
- nemo/collections/tts/models/fastpitch.py
- nemo/collections/tts/modules/fastpitch.py
- nemo/collections/tts/modules/transformer.py
- nemo/collections/tts/modules/submodules.py
- nemo/collections/tts/modules/aligner.py
- nemo/collections/tts/parts/mixins/__init__.py
- nemo/collections/tts/parts/mixins/fastpitch_adapter_mixins.py
- examples/tts/conf/fastpitch_speaker_adaptation.yaml
- examples/tts/fastpitch_finetune_adapters.py
- tutorials/tts/FastPitch_SpeakerAdaptation.ipynb
- scripts/dataset_processing/tts/add_normalized_text.py
- nemo/collections/tts/losses/aligner_loss.py (fix NaN loss bug)
Usage
tutorials/tts/FastPitch_SpeakerAdaptation.ipynb
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs to various areas.
Additional Information