
[TTS] Update AudioCodec API #7310

Merged: 1 commit merged into NVIDIA:main on Aug 28, 2023

Conversation

@anteju (Collaborator) commented on Aug 24, 2023

What does this PR do?

This PR extends the AudioCodec API, adds comments, and makes small fixes.

Collection: TTS

Changelog

  • Added encode to convert a time-domain signal into a discrete representation
  • Added decode to convert a discrete (quantized) representation to audio
  • Use quantize to convert a continuous encoded representation into a quantized discrete representation
  • Use dequantize to convert a discrete representation to a continuous encoded representation
  • Use batch-first format for quantized discrete representation
  • Updated input/output type definitions
  • Added docstrings in audio_codec modules
  • Added logging in audio_codec modules and callbacks

Usage

Example of usage:

# instantiate a model
model = AudioCodecModel(…)

# convert `audio_in` to a discrete representation
# `tokenized` has shape (batch_size, num_codebooks, num_frames)
tokenized, tokenized_len = model.encode(audio=audio_in, audio_len=audio_in_len)

# convert a discrete representation to time-domain audio
audio_out, audio_out_len = model.decode(tokens=tokenized, tokens_len=tokenized_len)
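Beyond encode/decode, the changelog also adds quantize and dequantize; the following is a sketch of how they compose, continuing the example above. Argument and return names are inferred from the type definitions and snippets in this thread, so treat the exact signatures as assumptions rather than the definitive API.

```
# continuous encoded representation
# `encoded` has shape (batch_size, encoded_dim, num_frames)
encoded, encoded_len = model.encode_audio(audio=audio_in, audio_len=audio_in_len)

# quantize: continuous representation -> discrete tokens,
# batch-first shape (batch_size, num_codebooks, num_frames)
tokenized = model.quantize(encoded=encoded, encoded_len=encoded_len)

# dequantize: discrete tokens -> continuous representation
dequantized = model.dequantize(tokens=tokenized, tokens_len=encoded_len)
```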

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the TTS label Aug 24, 2023
@anteju anteju marked this pull request as ready for review August 24, 2023 16:43
@anteju anteju requested a review from rlangman August 24, 2023 16:43
if not self.vector_quantizer:
raise ValueError("Cannot dequantize without quantizer")

quantized = self.vector_quantizer.decode(indices=indices, input_len=encoded_len)
return quantized
# [D, B, T], where D is the number of codebooks
Collaborator:
Wouldn't it be better to have BDT, so that you can consistently index, for each sample, D codebooks over T time? Right now you'll have to check how many samples there are for each codebook.

anteju (Collaborator, Author):

Exactly, this PR changes the AudioCodecModel API to use the BDT format for discrete tokens.
Check the output type of AudioCodecModel.quantize and the input type of AudioCodecModel.dequantize.

However, the underlying vector quantizer uses a DBT layout, and the line above (211) is preparing the input for the quantizer.
In this PR we don't touch the implementation of the vector quantizer, but we may change that in the future as well.

Collaborator:

Makes sense

@anteju anteju requested a review from titu1994 August 24, 2023 22:08
Comment on lines 186 to 189
# [D, B, T], where D is the number of codebooks
indices = self.vector_quantizer.encode(inputs=encoded, input_len=encoded_len)
# [B, D, T], use batch first
indices = indices.transpose(0, 1)
Collaborator:

Nitpick: can more succinctly document the shapes and transpose as rearrange(inputs, "D B T -> B D T")
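The axis swap discussed here can be illustrated standalone. This sketch uses NumPy purely to show the permutation (`np.swapaxes(x, 0, 1)` performs the same swap as `indices.transpose(0, 1)` on a PyTorch tensor), with the suggested einops form shown as a comment since einops may not be installed:

```python
import numpy as np

# tokens in the quantizer-native [D, B, T] layout:
# D codebooks, B batch samples, T frames
D, B, T = 4, 2, 8
tokens_dbt = np.arange(D * B * T).reshape(D, B, T)

# batch-first [B, D, T]; torch's `indices.transpose(0, 1)`
# performs the same swap of the first two dimensions
tokens_bdt = np.swapaxes(tokens_dbt, 0, 1)
assert tokens_bdt.shape == (B, D, T)

# with batch-first layout, a single index yields all
# codebooks over time for one sample
sample0 = tokens_bdt[0]  # shape (D, T)
assert np.array_equal(sample0, tokens_dbt[:, 0, :])

# the einops equivalent suggested in the review would be:
# tokens_bdt = rearrange(tokens_dbt, "D B T -> B D T")
```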

Comment on lines 167 to 170
"encoded": NeuralType(('B', 'D', 'T_encoded'), EncodedRepresentation()),
"encoded_len": NeuralType(tuple('B'), LengthsType()),
},
-output_types={"indices": NeuralType(('N', 'B', 'T_encoded'), Index())},
+output_types={"indices": NeuralType(('B', 'D', 'T_encoded'), Index())},
@rlangman (Collaborator) commented on Aug 24, 2023:

I think it is confusing for all of the documentation to refer to both the codebook dimension, and number of codebooks, as "D". If we don't want to use "N", which I guess is supposed to refer to batch dimensions in NeMo (https://github.com/NVIDIA/NeMo/blob/main/nemo/core/neural_types/axes.py#L62), then could we use "C" for the number of codebooks?

anteju (Collaborator, Author):

Yes, B and N are the same.
We can use C instead of D.

log_quantized: true
log_dequantized: true
Collaborator:

The final codebook embeddings are a quantized version of the encoder output, so calling them "dequantized" might be misleading. I would favor keeping the convention in EnCodec and DAC and referring to the codebook indices as "codes" (instead of just "indices") and the corresponding embeddings as "quantized".

anteju (Collaborator, Author):

Re: dequantized
We have the following

continuous encoded representation --quantize--> discrete/quantized representation --dequantize-->  continuous representation

With log_dequantized, we log the final continuous representation after dequantization.
Using dequantized is the correct way to refer to the continuous (e.g., float) output of the _dequantize method.

In this PR I did not want to touch RVQ a lot, but it would be nice to change naming there as well to be consistent.


Re: indices
I absolutely agree that we should change the name.
I originally changed the name to codes in this PR, but decided to scrap it because self.codes in EuclideanCodebook denotes the embedding, and I wanted to avoid confusion.
Another option, which I think would be very appropriate would be to use tokens to denote the discrete representation instead of indices.

Collaborator:

After talking about it, I agree it makes the most sense to refer to the indices as "tokens" and the codebook embeddings can either be called "codes" or "dequantized" depending on the context. Some of the renaming and convention changes can be left for a future PR.

@anteju anteju requested a review from rlangman August 25, 2023 17:19
@anteju anteju force-pushed the pr/audio-codec-api-update branch 2 times, most recently from 3d4039c to 8955274 Compare August 25, 2023 21:51
@@ -286,7 +286,7 @@ def remove_weight_norm(self):
         res_block.remove_weight_norm()

     def forward(self, inputs, input_len):
-        audio_len = input_len
+        audio_len = input_len.detach().clone()
anteju (Collaborator, Author):

This resulted in in-place updates to input_len, and caused some confusion. For example, running:

encoded, encoded_len = model.encode_audio(audio=audio, audio_len=audio_len)
output_audio, output_audio_len = model.decode_audio(inputs=encoded, input_len=encoded_len)

Without this fix, encoded_len ends up equal to output_audio_len (they are the same tensor object).

@rlangman (Collaborator) commented on Aug 25, 2023:

Is it updating it in-place because it is written audio_len *= up_sample_rate instead of audio_len = audio_len * up_sample_rate?

anteju (Collaborator, Author):

Yeah, that seems to be the culprit.
We can change that instead.
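The aliasing discussed above can be reproduced in a few lines. This sketch uses NumPy arrays, whose augmented-assignment (`*=`) semantics mirror PyTorch tensors here; the function names are hypothetical, for illustration only:

```python
import numpy as np

def upsample_len_inplace(input_len, rate):
    # mimics the bug: `audio_len *= rate` mutates the
    # caller's array/tensor in place and returns the same object
    audio_len = input_len
    audio_len *= rate
    return audio_len

def upsample_len_safe(input_len, rate):
    # `audio_len = audio_len * rate` allocates a new array,
    # leaving the caller's data untouched
    audio_len = input_len
    audio_len = audio_len * rate
    return audio_len

encoded_len = np.array([100, 80])
out_len = upsample_len_inplace(encoded_len, 2)
assert encoded_len[0] == 200   # caller's lengths were mutated
assert out_len is encoded_len  # same object, as seen in the PR

encoded_len = np.array([100, 80])
out_len = upsample_len_safe(encoded_len, 2)
assert encoded_len[0] == 100       # caller's lengths are intact
assert out_len is not encoded_len  # a fresh array
```

Either the `.detach().clone()` in the diff above or rewriting the augmented assignment avoids the shared-tensor surprise.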

@anteju anteju force-pushed the pr/audio-codec-api-update branch 2 times, most recently from 309909c to 9596479 Compare August 28, 2023 18:59
Signed-off-by: Ante Jukić <[email protected]>
@anteju anteju merged commit 90feee2 into NVIDIA:main Aug 28, 2023
10 of 11 checks passed
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024