[TTS] Update AudioCodec API #7310
Conversation
Force-pushed c48e1c2 to b395966
Force-pushed b395966 to d5491d2
if not self.vector_quantizer:
    raise ValueError("Cannot dequantize without quantizer")

quantized = self.vector_quantizer.decode(indices=indices, input_len=encoded_len)
return quantized
# [D, B, T], where D is the number of codebooks
Wouldn't it be better to have `BDT`? So that you can consistently index, for each sample, D codebooks over T time? Right now you'll have to check how many samples there are for each codebook.
Exactly, this PR is changing the `AudioCodecModel` API to use `BDT` format for discrete tokens. Check the output type of `AudioCodecModel.quantize` and the input type of `AudioCodecModel.dequantize`.
However, the underlying vector quantizer is using the `DBT` layout -- and the line above (211) is preparing the input for the quantizer. In this PR we don't touch the implementation of the vector quantizer, but we may change that in the future as well.
Makes sense
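For intuition, a minimal sketch of the indexing difference (the tensor names here are illustrative, not from the PR):

```python
import torch

B, D, T = 4, 8, 100  # batch size, number of codebooks, time frames
tokens_bdt = torch.randint(0, 1024, (B, D, T))  # batch-first (BDT) layout

# BDT: one index yields all codebooks over time for a single sample.
sample_tokens = tokens_bdt[0]  # [D, T]

# DBT: the same slice requires indexing the middle (batch) axis.
tokens_dbt = tokens_bdt.transpose(0, 1)  # [D, B, T]
sample_tokens_dbt = tokens_dbt[:, 0]     # [D, T]
```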
# [D, B, T], where D is the number of codebooks
indices = self.vector_quantizer.encode(inputs=encoded, input_len=encoded_len)
# [B, D, T], use batch first
indices = indices.transpose(0, 1)
Nitpick: can more succinctly document the shapes and transpose as `rearrange(inputs, "D B T -> B D T")`.
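For reference, a sketch of the suggested style (assuming the `einops` package; the tensor here is synthetic):

```python
import torch
from einops import rearrange

indices = torch.randint(0, 1024, (8, 4, 100))  # [D, B, T]

# Both produce a [B, D, T] tensor; rearrange self-documents the axis order.
batch_first = indices.transpose(0, 1)
batch_first_einops = rearrange(indices, "D B T -> B D T")

assert torch.equal(batch_first, batch_first_einops)
```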
"encoded": NeuralType(('B', 'D', 'T_encoded'), EncodedRepresentation()), | ||
"encoded_len": NeuralType(tuple('B'), LengthsType()), | ||
}, | ||
output_types={"indices": NeuralType(('N', 'B', 'T_encoded'), Index())}, | ||
output_types={"indices": NeuralType(('B', 'D', 'T_encoded'), Index())}, |
I think it is confusing for all of the documentation to refer to both the codebook dimension and the number of codebooks as "D". If we don't want to use "N", which I guess is supposed to refer to batch dimensions in NeMo (https://github.com/NVIDIA/NeMo/blob/main/nemo/core/neural_types/axes.py#L62), then could we use "C" for the number of codebooks?
Yes, `B` and `N` are the same. We can use `C` instead of `D`.
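A sketch of what the agreed renaming might look like (an assumption, not the final diff; it presumes NeMo's axis parsing accepts `C` for the channel/codebook axis):

```python
from nemo.core.neural_types import Index, NeuralType

# Hypothetical: 'C' denotes the number of codebooks, leaving 'D' free
# to mean the codebook (embedding) dimension elsewhere in the docs.
output_types = {
    "indices": NeuralType(('B', 'C', 'T_encoded'), Index()),
}
```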
log_quantized: true
log_dequantized: true
The final codebook embeddings are a quantized version of the encoder output, so calling them "dequantized" might be misleading. I would favor keeping the convention in EnCodec and DAC and referring to the codebook indices as "codes" (instead of just "indices") and the corresponding embeddings as "quantized".
Re: dequantized
We have the following:
continuous encoded representation --quantize--> discrete/quantized representation --dequantize--> continuous representation
With `log_dequantized`, we log the final continuous representation after dequantization. Using `dequantized` is the correct way to refer to the continuous (e.g., `float`) output of the `dequantize` method. In this PR I did not want to touch RVQ a lot, but it would be nice to change the naming there as well to be consistent.
Re: indices
I absolutely agree that we should change the name. I originally changed the name to `codes` in this PR, but decided to scrap it because `self.codes` in `EuclideanCodebook` denotes the embeddings, and I wanted to avoid confusion. Another option, which I think would be very appropriate, would be to use `tokens` to denote the discrete representation instead of `indices`.
After talking about it, I agree it makes the most sense to refer to the indices as "tokens" and the codebook embeddings can either be called "codes" or "dequantized" depending on the context. Some of the renaming and convention changes can be left for a future PR.
Force-pushed 3d4039c to 8955274
@@ -286,7 +286,7 @@ def remove_weight_norm(self):
         res_block.remove_weight_norm()

     def forward(self, inputs, input_len):
-        audio_len = input_len
+        audio_len = input_len.detach().clone()
This resulted in in-place updates to `input_len`, and caused some confusion. For example, running:
encoded, encoded_len = model.encode_audio(audio=audio, audio_len=audio_len)
output_audio, output_audio_len = model.decode_audio(inputs=encoded, input_len=encoded_len)
Now `encoded_len` is equal to `output_audio_len` (they're the same tensor).
Is it updating it in-place because it is written `audio_len *= up_sample_rate` instead of `audio_len = audio_len * up_sample_rate`?
Yeah, that seems to be the culprit.
We can change that instead.
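A minimal sketch of the aliasing behavior under discussion (variable names are illustrative):

```python
import torch

up_sample_rate = 2
input_len = torch.tensor([100, 80])

audio_len = input_len        # audio_len aliases the caller's tensor
audio_len *= up_sample_rate  # in-place multiply: mutates input_len too
print(input_len)             # tensor([200, 160]) -- caller's tensor changed

# Either fix avoids mutating the caller's tensor:
input_len = torch.tensor([100, 80])
audio_len = input_len.detach().clone()  # the fix in this commit
audio_len *= up_sample_rate
# ...or simply use an out-of-place multiply:
audio_len = input_len * up_sample_rate
print(input_len)             # tensor([100, 80]) -- unchanged
```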
Force-pushed 309909c to 9596479
Signed-off-by: Ante Jukić <[email protected]>
Force-pushed 9596479 to 1f8f715
Signed-off-by: Ante Jukić <[email protected]>
What does this PR do?
This PR extends the API for `AudioCodec`, adds comments, and includes small fixes.
Collection: TTS
Changelog
- `encode` to convert a time-domain signal into a discrete representation
- `decode` to convert a discrete (quantized) representation to audio
- `quantize` to convert a continuous encoded representation into a quantized discrete representation
- `dequantize` to convert a discrete representation to a continuous encoded representation
Usage
Example of usage:
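A minimal sketch of the updated API, based on the changelog above (the checkpoint name is hypothetical, and the exact argument names should be checked against the diff):

```python
import torch
from nemo.collections.tts.models import AudioCodecModel

# Hypothetical checkpoint name; substitute a real pretrained audio codec.
model = AudioCodecModel.from_pretrained(model_name="audio_codec_model").eval()

audio = torch.randn(1, 16000)      # [B, T_audio]
audio_len = torch.tensor([16000])  # [B]

with torch.no_grad():
    # Audio -> discrete tokens in batch-first [B, D, T] layout, and back.
    tokens, tokens_len = model.encode(audio=audio, audio_len=audio_len)
    output_audio, output_len = model.decode(tokens=tokens, tokens_len=tokens_len)

    # Or step through the intermediate continuous representation:
    encoded, encoded_len = model.encode_audio(audio=audio, audio_len=audio_len)
    indices = model.quantize(encoded=encoded, encoded_len=encoded_len)
    dequantized = model.dequantize(indices=indices, encoded_len=encoded_len)
```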