[Draft][TTS] FastPitch multi-speaker pre-train and adapter fine-tune #6389
Conversation
else:
    spk_emb = self.speaker_emb(speaker).unsqueeze(1)

if self.speaker_encoder is not None:
    spk_emb = self.speaker_encoder(
@rlangman @redoctopus We need your opinion on this. Currently, speakers are represented as lookup-table embeddings (line 318). In the Adapters project we represent speakers as GST embeddings + Titanet embeddings + lookup-table embeddings. self.speaker_encoder (line 321) is the class that represents speakers; it holds the GST and Titanet representations. We were wondering whether we should also include the lookup table in this speaker_encoder class:
class SpeakerEncoder(torch.nn.Module):
This has only one downside: older models did not use the SpeakerEncoder class to represent speakers; their lookup table was part of the FastPitch class. To solve this, we can either upload new versions of the models with the SpeakerEncoder class to NGC, or write a wrapper that moves the lookup table from older model checkpoints into the SpeakerEncoder class.
What do you think?
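The checkpoint-compatibility wrapper mentioned above could be sketched roughly like this (the class name and interface are hypothetical, not the actual NeMo API):

```python
import torch

class LookupTableWrapper(torch.nn.Module):
    """Hypothetical adapter: exposes an old checkpoint's lookup table
    through a SpeakerEncoder-style interface."""

    def __init__(self, lookup_table: torch.nn.Embedding):
        super().__init__()
        self.speaker_emb = lookup_table

    def forward(self, speaker: torch.Tensor) -> torch.Tensor:
        # Same output shape as the legacy FastPitch path: (batch, 1, emb_dim)
        return self.speaker_emb(speaker).unsqueeze(1)

# Usage: lift the embedding table out of an old FastPitch checkpoint
old_table = torch.nn.Embedding(4, 8)
wrapped = LookupTableWrapper(old_table)
emb = wrapped(torch.tensor([1, 3]))
```

This would let old checkpoints load into the new code path without retraining, at the cost of a small compatibility shim.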
We can raise a separate PR for this, I would like to not make this PR more complicated.
After thinking about it, I don't think we should merge the speaker embeddings with the speaker encoder. The speaker encoder is very specific to these custom-voice experiments and unlikely to be used by most other TTS problems.
But now that we are supporting utterance-level Titanet embeddings, we should really support all 3 workflows:
- Condition only on the lookup table.
- Condition only on utterance-level embeddings.
- Condition on both together using the speaker encoder.
All 3 could be represented as instances of SpeakerEncoder, but just having a branching if-statement will be clearer.
So how about we add a configuration for whether to learn the embedding (defaulting to True for backwards compatibility):
if self.learn_speaker_embedding and n_speakers > 1:
self.speaker_emb = torch.nn.Embedding(n_speakers, symbols_embedding_dim)
else:
self.speaker_emb = None
You can then manually set it to False when you provide the speaker encoder, and we can cleanly support all 3 common training workflows:
def get_speaker_embedding(self, speaker, reference_speaker_embedding, reference_spec, reference_spec_lens):
    if self.speaker_encoder is not None:
        spk_emb = self.speaker_encoder(speaker=speaker, ...)
    elif self.speaker_emb is not None:
        if speaker is None:
            raise ValueError("speaker ids are required when conditioning on the lookup table")
        spk_emb = self.speaker_emb(speaker)
    elif reference_speaker_embedding is not None:
        spk_emb = reference_speaker_embedding
    else:
        spk_emb = None
    return spk_emb
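To make the dispatch concrete, here is a runnable toy version of that priority order (the class name and dimensions are made up for illustration; only the lookup-table path is configured):

```python
import torch

class _Demo(torch.nn.Module):
    """Minimal stand-in with only the lookup-table path configured."""

    def __init__(self, n_speakers=4, dim=8):
        super().__init__()
        self.speaker_encoder = None
        self.speaker_emb = torch.nn.Embedding(n_speakers, dim)

    def get_speaker_embedding(self, speaker=None, reference_speaker_embedding=None):
        # Priority: speaker encoder > lookup table > precomputed reference embedding
        if self.speaker_encoder is not None:
            return self.speaker_encoder(speaker=speaker)
        if self.speaker_emb is not None:
            if speaker is None:
                raise ValueError("speaker ids are required for lookup-table conditioning")
            return self.speaker_emb(speaker)
        return reference_speaker_embedding  # may be None

m = _Demo()
emb = m.get_speaker_embedding(speaker=torch.tensor([2]))
```

Checking the inputs with `is not None` (rather than plain truthiness) matters here: truth-testing a multi-element tensor raises a RuntimeError in PyTorch.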
I'm in favor of including the lookup table in the speaker encoder class if it makes things simpler and improves the abstraction/organization. A little pain in re-training in the short term is okay!
@rlangman If you look at the SpeakerEncoder class in submodules.py, it has the same logic: you can mix and match whichever speaker representation you want to use. It takes (speaker_embedding_from_lookup_table, ref_audio, ref_titanet_speaker_embedding) as input, and based on the config we use whichever we want.
I also favor having one single speaker-representation class that we initialize based on the config and then pass to the FastPitch module, instead of initializing the lookup table in the constructor of the FastPitch module. We already do this for all our attribute predictors; only the speaker table is fixed in the FastPitch module constructor, which forces us to pass parameters like n_speakers.
This would give us uniformity: all our submodules (attribute predictors, aligner, etc.) are first initialized based on the config in models/fastpitch.py and then the objects are passed to modules/fastpitch.py, except the speaker representation.
Thoughts? @redoctopus @rlangman
Either way technically works. It looks like we all agree the existing SpeakerEncoder should have an embedding table internally.
If we don't want to go with my suggestion above and do want to have a SpeakerEncoder as the only mechanism later down the line, then we should create and finalize an interface in this PR to avoid breaking the adapter models later.
The current SpeakerEncoder in submodules.py is the "spaghetti code" approach, with tons of if-statements. A cleaner way to implement it would be something like:
class SpeakerEncoder(ABC):
@abstractmethod
def get_speaker_embedding(self, speaker: Tensor, reference_speaker_embedding: Tensor, ...) -> Tensor:
raise NotImplementedError
Then have one implementation with the current GST logic as a single flow, with the if/else statements removed, and another simple one:
class LookupTableSpeakerEncoder(SpeakerEncoder):
    def __init__(self, num_speakers, embedding_dim):
        super().__init__()
        self.speaker_emb = torch.nn.Embedding(num_speakers, embedding_dim)

    def get_speaker_embedding(self, speaker: Tensor, ...):
        if speaker is None:
            raise ValueError("speaker ids are required for the lookup-table encoder")
        return self.speaker_emb(speaker)
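Filled out as runnable code (the base-class name and signatures here are my guesses, not the final interface), the sketch might look like:

```python
from abc import ABC, abstractmethod
import torch

class SpeakerEncoderBase(torch.nn.Module, ABC):
    """Hypothetical abstract interface for all speaker representations."""

    @abstractmethod
    def get_speaker_embedding(self, speaker=None, reference_speaker_embedding=None):
        raise NotImplementedError

class LookupTableSpeakerEncoder(SpeakerEncoderBase):
    """Simplest implementation: a learned per-speaker lookup table."""

    def __init__(self, num_speakers: int, embedding_dim: int):
        super().__init__()
        self.speaker_emb = torch.nn.Embedding(num_speakers, embedding_dim)

    def get_speaker_embedding(self, speaker=None, reference_speaker_embedding=None):
        if speaker is None:
            raise ValueError("speaker ids are required for the lookup-table encoder")
        return self.speaker_emb(speaker)

enc = LookupTableSpeakerEncoder(num_speakers=10, embedding_dim=16)
out = enc.get_speaker_embedding(speaker=torch.tensor([0, 3]))
```

Inheriting from both torch.nn.Module and ABC keeps parameter registration working while making the interface enforceable.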
Then, to deprecate the old system, we can have a branch that uses the existing lookup table when a speaker encoder is not provided, and does not create a lookup table in the FastPitch model when one is.
Then we take our time to retrain all of the FastPitch models on NGC with the new LookupTableSpeakerEncoder, and delete the old branch when we are ready.
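That deprecation branch could look roughly like this (the helper name and defaults are hypothetical):

```python
import torch

def build_speaker_conditioning(speaker_encoder=None, n_speakers=1, symbols_embedding_dim=384):
    """Hypothetical constructor-time branch: keep the legacy lookup table
    only while no speaker encoder is supplied."""
    if speaker_encoder is not None:
        # New path: speaker conditioning lives entirely in the encoder.
        return speaker_encoder, None
    # Legacy path: create the in-model lookup table, as older checkpoints expect.
    speaker_emb = torch.nn.Embedding(n_speakers, symbols_embedding_dim) if n_speakers > 1 else None
    return None, speaker_emb

encoder, table = build_speaker_conditioning(n_speakers=8, symbols_embedding_dim=32)
```

Once all NGC checkpoints carry their own encoder, the legacy branch can be deleted without a breaking change.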
I know this is a draft, but we had a similar PR before, and the first main comment was that the PR can be broken into independent commits/features:
- Adapter fine-tuning
- Reference-audio handling and GST
- External speaker embeddings / Titanet support
- Auxiliary speaker-encoder system
- Tutorial(s)
What does this PR do?
Adds FastPitch speaker adaptation with adapters.
Collection: TTS
Changelog
- nemo/collections/tts/data/tts_dataset.py
- nemo/collections/tts/torch/tts_data_types.py
- nemo/collections/tts/models/fastpitch.py
- nemo/collections/tts/modules/fastpitch.py
- nemo/collections/tts/modules/transformer.py
- nemo/collections/tts/modules/submodules.py
- nemo/collections/tts/modules/aligner.py
- nemo/collections/tts/parts/mixins/__init__.py
- nemo/collections/tts/parts/mixins/fastpitch_adapter_mixins.py
- examples/tts/conf/fastpitch_speaker_adaptation.yaml
- examples/tts/fastpitch_finetune_adapters.py
- tutorials/tts/FastPitch_SpeakerAdaptation.ipynb
- scripts/dataset_processing/tts/add_normalized_text.py
- nemo/collections/tts/losses/aligner_loss.py (fix NaN loss bug)
Usage
tutorials/tts/FastPitch_SpeakerAdaptation.ipynb
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs to various areas.
Additional Information