
Add [Draft][TTS] FastPitch multi-speaker pre-train and adapter fine-tune #6389

Status: Closed · 28 commits

Conversation

@hsiehjackson (Collaborator) commented on Apr 7, 2023

What does this PR do?

Adds FastPitch speaker adaptation with adapters.

Collection: TTS

Changelog

  • Add multi-speaker FastPitch pre-training with (1) lookup-table speaker embeddings, (2) speaker-verification-model (TitaNet) speaker embeddings, (3) Global Style Tokens, and (4) Conditional Layer Normalization (see the sketch after this changelog)
    • nemo/collections/tts/data/tts_dataset.py
    • nemo/collections/tts/torch/tts_data_types.py
    • nemo/collections/tts/models/fastpitch.py
    • nemo/collections/tts/modules/fastpitch.py
    • nemo/collections/tts/modules/transformer.py
    • nemo/collections/tts/modules/submodules.py
  • Add adapter modules for FastPitch fine-tuning
    • nemo/collections/tts/models/fastpitch.py
    • nemo/collections/tts/modules/fastpitch.py
    • nemo/collections/tts/modules/transformer.py
    • nemo/collections/tts/modules/aligner.py
    • nemo/collections/tts/parts/mixins/__init__.py
    • nemo/collections/tts/parts/mixins/fastpitch_adapter_mixins.py
  • Add config and fine-tuning python script
    • examples/tts/conf/fastpitch_speaker_adaptation.yaml
    • examples/tts/fastpitch_finetune_adapters.py
  • Add tutorial
    • tutorials/tts/FastPitch_SpeakerAdaptation.ipynb
  • Add fast text normalization script
    • scripts/dataset_processing/tts/add_normalized_text.py
  • Fix aligner nan loss bug
    • nemo/collections/tts/losses/aligner_loss.py
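
Of the conditioning options above, conditional layer normalization is the least self-explanatory. A minimal sketch of the idea (illustrative names, not this PR's implementation): the LayerNorm scale and bias are predicted from the speaker embedding instead of being fixed learned parameters.

import torch

class ConditionalLayerNorm(torch.nn.Module):
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        # Normalize without fixed affine parameters...
        self.norm = torch.nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # ...and predict scale/bias from the conditioning (speaker) embedding.
        self.scale_proj = torch.nn.Linear(cond_dim, hidden_dim)
        self.bias_proj = torch.nn.Linear(cond_dim, hidden_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [B, T, hidden_dim], cond: [B, cond_dim]
        scale = self.scale_proj(cond).unsqueeze(1)  # [B, 1, hidden_dim]
        bias = self.bias_proj(cond).unsqueeze(1)
        return self.norm(x) * scale + bias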

Usage

  • Follow tutorials/tts/FastPitch_SpeakerAdaptation.ipynb (a sketch of the adapter fine-tuning step follows this list):
    • Pre-train multi-speaker FastPitch
    • Fine-tune HiFiGAN with mel-spectrograms from the pre-trained FastPitch
    • Fine-tune FastPitch with adapters
    • Fine-tune HiFiGAN with mel-spectrograms from the fine-tuned FastPitch
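
A minimal sketch of the adapter step, assuming the new FastPitch adapter mixin exposes the same add_adapter / set_enabled_adapters API as NeMo's existing adapter mixins; the checkpoint path, adapter name, and dimensions are placeholders:

from nemo.collections.common.parts.adapter_modules import LinearAdapterConfig
from nemo.collections.tts.models import FastPitchModel

# Hypothetical pre-trained multi-speaker checkpoint.
model = FastPitchModel.restore_from("multispeaker_fastpitch.nemo")

# Add a small linear adapter and enable it.
adapter_cfg = LinearAdapterConfig(in_features=384, dim=256)  # placeholder dims
model.add_adapter(name="speaker_adapter", cfg=adapter_cfg)
model.set_enabled_adapters("speaker_adapter", enabled=True)

# Freeze the base model so only the adapter weights are trained.
model.freeze()
model.unfreeze_enabled_adapters()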

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

    else:
        spk_emb = self.speaker_emb(speaker).unsqueeze(1)

    if self.speaker_encoder is not None:
        spk_emb = self.speaker_encoder(
Contributor:

@rlangman @redoctopus We need your opinion on this. Currently, speakers are represented as lookup-table embeddings (Line 318). In the Adapters project we represent speakers as GST embeddings + TitaNet embeddings + lookup-table embeddings. self.speaker_encoder (Line 321) is the class that represents speakers; it holds the GST and TitaNet representations. We were wondering whether we should also include the lookup table in this speaker_encoder class (class SpeakerEncoder(torch.nn.Module): in submodules.py). That would leave just one speaker representation class, which would be much neater than having separate classes/methods and then somehow combining them.

This has only one downside: older models do not use the SpeakerEncoder class to represent speakers; their lookup table was part of the FastPitch class. To solve this, we could either upload new versions of the models with the SpeakerEncoder class to NGC, or write a wrapper that puts the lookup table from older checkpoints into the SpeakerEncoder class.

What do you think?
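
A minimal sketch of the wrapper option (all names hypothetical, not from this PR): copy the legacy lookup table out of an old FastPitch checkpoint into a SpeakerEncoder-style module.

import torch

class LegacyLookupSpeakerEncoder(torch.nn.Module):
    """Hypothetical wrapper that folds an old checkpoint's lookup table in."""

    def __init__(self, n_speakers: int, embedding_dim: int):
        super().__init__()
        self.lookup_emb = torch.nn.Embedding(n_speakers, embedding_dim)

    @classmethod
    def from_legacy_fastpitch(cls, fastpitch_module: torch.nn.Module):
        # Old checkpoints store the table directly on the FastPitch module
        # as `speaker_emb`; copy its weights across.
        old_emb = fastpitch_module.speaker_emb
        encoder = cls(old_emb.num_embeddings, old_emb.embedding_dim)
        encoder.lookup_emb.weight.data.copy_(old_emb.weight.data)
        return encoder

    def forward(self, speaker: torch.Tensor) -> torch.Tensor:
        return self.lookup_emb(speaker).unsqueeze(1)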

Contributor:

We can raise a separate PR for this; I would prefer not to make this PR more complicated.

Collaborator:

After thinking about it, I don't think we should merge the speaker embeddings into the speaker encoder. The speaker encoder is very specific to these custom-voice experiments and is unlikely to be used by most other TTS problems.

But now that we are supporting utterance-level TitaNet embeddings, we should really support all three workflows:

  1. Condition only on the lookup table.
  2. Condition only on utterance-level embeddings.
  3. Condition on both together using the speaker encoder.

All three could be represented as instances of SpeakerEncoder, but a simple branching if-statement will be clearer.

Collaborator:

So how about we add a configuration for whether to learn the embedding (defaulting to True for backwards compatibility):

    if self.learn_speaker_embedding and n_speakers > 1:
        self.speaker_emb = torch.nn.Embedding(n_speakers, symbols_embedding_dim)
    else:
        self.speaker_emb = None

Then you can manually set it to False when you provide the speaker encoder, and we can cleanly support all three common training workflows:

def get_speaker_embedding(self, speaker, reference_speaker_embedding, reference_spec, reference_spec_lens):
    # Use explicit `is not None` checks: truthiness of a multi-element
    # tensor raises a RuntimeError in PyTorch.
    if self.speaker_encoder is not None:
        spk_emb = self.speaker_encoder(speaker=speaker, ...)
    elif self.speaker_emb is not None:
        if speaker is None:
            raise ValueError("a speaker id is required when using the lookup table")
        spk_emb = self.speaker_emb(speaker)
    elif reference_speaker_embedding is not None:
        spk_emb = reference_speaker_embedding
    else:
        spk_emb = None

    return spk_emb

Collaborator:

I'm in favor of including the lookup table in the speaker encoder class if it makes things simpler and improves the abstraction/organization. A little pain in re-training in the short term is okay!

Contributor:

@rlangman If you look at the SpeakerEncoder class in submodules.py, it has the same logic: you can mix and match whichever speaker representations you want to use. It takes as input (speaker_embedding_from_lookup_table, ref_audio, ref_titanet_speaker_embedding), and then, based on the config, we use whichever we want.

I also favor having a single speaker representation class that we initialize based on the config and then pass to the FastPitch module, instead of initializing the lookup table in the FastPitch module's constructor. We already do this for all our attribute predictors; only the speaker table is fixed in the FastPitch module constructor, which forces us to pass parameters like n_speakers.

This would give us uniformity: all our submodules (attribute predictors, aligner, etc.) are first initialized from the config in models/fastpitch.py and then passed to modules/fastpitch.py, except the speaker representation (see the sketch below).
Thoughts? @redoctopus @rlangman
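
In sketch form, the pattern being proposed (hypothetical helper, assuming a Hydra-style config as used elsewhere in NeMo examples):

from hydra.utils import instantiate
from omegaconf import DictConfig

def build_speaker_encoder(cfg: DictConfig):
    # Build the speaker representation from config in models/fastpitch.py
    # (or skip it), mirroring how the attribute predictors are built,
    # then pass the resulting object into modules/fastpitch.py.
    if cfg.get("speaker_encoder") is not None:
        return instantiate(cfg.speaker_encoder)
    return None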

@rlangman (Collaborator) commented on Apr 11, 2023:

Either way technically works. It looks like we all agree the existing SpeakerEncoder should have an embedding array internally.

If we don't want to go with my suggestion above and do want SpeakerEncoder to be the only mechanism later on, then we should create and finalize an interface in this PR to avoid breaking the adapter models later.

The current SpeakerEncoder in submodules.py is the "spaghetti code" approach, with tons of if statements. A clean way to implement it would be:

from abc import ABC, abstractmethod

from torch import Tensor

class SpeakerEncoder(ABC):

    @abstractmethod
    def get_speaker_embedding(self, speaker: Tensor, reference_speaker_embedding: Tensor, ...) -> Tensor:
        raise NotImplementedError

Then have one implementation containing the current GST logic, with a single logic flow and the if/else statements removed, and another simple one:

import torch

class LookupTableSpeakerEncoder(SpeakerEncoder, torch.nn.Module):
    def __init__(self, num_speakers, embedding_dim):
        super().__init__()  # needed so the Embedding registers as a parameter
        self.speaker_emb = torch.nn.Embedding(num_speakers, embedding_dim)

    def get_speaker_embedding(self, speaker: Tensor, ...):
        if speaker is None:
            raise ValueError("a speaker id is required for the lookup-table encoder")
        return self.speaker_emb(speaker)
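
For illustration, and continuing the sketch above, the other implementation mentioned might look roughly like this (a hypothetical shape only; the real GST logic lives in submodules.py):

class GSTSpeakerEncoder(SpeakerEncoder, torch.nn.Module):
    def __init__(self, gst_module: torch.nn.Module):
        super().__init__()
        self.gst = gst_module  # encodes a reference spectrogram into a style embedding

    def get_speaker_embedding(self, reference_spec: Tensor, reference_spec_lens: Tensor, **kwargs):
        if reference_spec is None:
            raise ValueError("a reference spectrogram is required for the GST encoder")
        return self.gst(reference_spec, reference_spec_lens)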

Collaborator:

Then, to deprecate the old system, we can add a branch that uses the existing lookup table when no speaker encoder is provided, and does not create a lookup table in the FastPitch model when one is.

Then we take our time retraining all of the FastPitch models on NGC with the new LookupTableSpeakerEncoder, and delete the old branch when we are ready (see the sketch below).
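
A sketch of that deprecation branch in the FastPitch constructor (hypothetical shape, assuming the constructor receives an optional speaker_encoder):

# Legacy lookup table only when no speaker encoder is provided;
# otherwise the encoder owns the speaker representation.
if speaker_encoder is not None:
    self.speaker_encoder = speaker_encoder
    self.speaker_emb = None
elif n_speakers > 1:
    self.speaker_encoder = None
    self.speaker_emb = torch.nn.Embedding(n_speakers, symbols_embedding_dim)
else:
    self.speaker_encoder = None
    self.speaker_emb = None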

@rlangman (Collaborator) left a review comment:

I know this is a draft, but we had a similar PR before, and the first main comment was that it should be broken into independent commits/features:

  1. Adapter finetuning
  2. Reference audio handling and GST
  3. External speaker embeddings/titanet support
  4. Auxiliary speaker encoder system
  5. Tutorial(s)

Resolved review threads:
  • examples/tts/conf/fastpitch_speaker_adaptation.yaml
  • nemo/collections/tts/data/dataset.py
  • examples/tts/fastpitch_finetune_adapters.py
  • nemo/collections/tts/modules/fastpitch.py
hsiehjackson and others added 20 commits April 11, 2023 20:22
Signed-off-by: hsiehjackson <[email protected]>
* Fix prompt template unescaping

Signed-off-by: MaximumEntropy <[email protected]>

* Fix for when prompt template is None

Signed-off-by: MaximumEntropy <[email protected]>

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
pre-commit-ci bot and others added 4 commits April 12, 2023 03:23
* Add support for Megatron GPT Untied Embd TP PP Change

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for Megatron GPT Untied Embd TP PP Change

Signed-off-by: smajumdar <[email protected]>

* Update support for model_file to be None when passing model_extracted_dir

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Yi Dong <[email protected]>
github-actions bot added the ASR label on Apr 12, 2023
@hsiehjackson (Collaborator, Author):

This PR was split into two parts:
#6416
#6417
