[TTS] Update AudioCodec API #7310
Conversation
Force-pushed c48e1c2 to b395966
Force-pushed b395966 to d5491d2
if not self.vector_quantizer:
    raise ValueError("Cannot dequantize without quantizer")

quantized = self.vector_quantizer.decode(indices=indices, input_len=encoded_len)
return quantized
# [D, B, T], where D is the number of codebooks
Wouldn't it be better to have `BDT`? So that you can consistently index, for each sample, D codebooks over T time? Right now you'll have to check how many samples there are for each codebook.
Exactly, this PR is changing the `AudioCodecModel` API to use `BDT` format for discrete tokens. Check the output type of `AudioCodecModel.quantize` and the input type of `AudioCodecModel.dequantize`.
However, the underlying vector quantizer is using the `DBT` layout -- and the line above (211) is preparing the input for the quantizer. In this PR we don't touch the implementation of the vector quantizer, but we may change that in the future as well.
Makes sense
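For intuition, a minimal sketch of the indexing difference (the tensor names here are illustrative, not from the PR):

```python
import torch

B, D, T = 4, 8, 100  # batch size, number of codebooks, time frames
tokens_bdt = torch.randint(0, 1024, (B, D, T))  # batch-first (BDT) layout

# BDT: one index yields all codebooks over time for a single sample.
sample_tokens = tokens_bdt[0]  # [D, T]

# DBT: the same slice requires indexing the middle (batch) axis.
tokens_dbt = tokens_bdt.transpose(0, 1)  # [D, B, T]
sample_tokens_dbt = tokens_dbt[:, 0]     # [D, T]
```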
# [D, B, T], where D is the number of codebooks
indices = self.vector_quantizer.encode(inputs=encoded, input_len=encoded_len)
# [B, D, T], use batch first
indices = indices.transpose(0, 1)
Nitpick: can more succinctly document the shapes and transpose as `rearrange(inputs, "D B T -> B D T")`.
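For reference, a sketch of the suggested style (assuming the `einops` package; the tensor here is synthetic):

```python
import torch
from einops import rearrange

indices = torch.randint(0, 1024, (8, 4, 100))  # [D, B, T]

# Both produce a [B, D, T] tensor; rearrange self-documents the axis order.
batch_first = indices.transpose(0, 1)
batch_first_einops = rearrange(indices, "D B T -> B D T")

assert torch.equal(batch_first, batch_first_einops)
```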
"encoded": NeuralType(('B', 'D', 'T_encoded'), EncodedRepresentation()), | ||
"encoded_len": NeuralType(tuple('B'), LengthsType()), | ||
}, | ||
output_types={"indices": NeuralType(('N', 'B', 'T_encoded'), Index())}, | ||
output_types={"indices": NeuralType(('B', 'D', 'T_encoded'), Index())}, |
I think it is confusing for all of the documentation to refer to both the codebook dimension and the number of codebooks as "D". If we don't want to use "N", which I guess is supposed to refer to batch dimensions in NeMo (https://github.com/NVIDIA/NeMo/blob/main/nemo/core/neural_types/axes.py#L62), then could we use "C" for the number of codebooks?
Yes, `B` and `N` are the same. We can use `C` instead of `D`.
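A sketch of what the agreed renaming might look like (an assumption, not the final diff; it presumes NeMo's axis parsing accepts `C` for the channel/codebook axis):

```python
from nemo.core.neural_types import Index, NeuralType

# Hypothetical: 'C' denotes the number of codebooks, leaving 'D' free
# to mean the codebook (embedding) dimension elsewhere in the docs.
output_types = {
    "indices": NeuralType(('B', 'C', 'T_encoded'), Index()),
}
```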
log_quantized: true
log_dequantized: true
The final codebook embeddings are a quantized version of the encoder output, so calling them "dequantized" might be misleading. I would favor keeping the convention in EnCodec and DAC and referring to the codebook indices as "codes" (instead of just "indices") and the corresponding embeddings as "quantized".
Re: dequantized
We have the following:
continuous encoded representation --quantize--> discrete/quantized representation --dequantize--> continuous representation
With `log_dequantized`, we log the final continuous representation after dequantization. Using `dequantized` is the correct way to refer to the continuous (e.g., `float`) output of the `dequantize` method. In this PR I did not want to touch RVQ a lot, but it would be nice to change the naming there as well to be consistent.
Re: indices
I absolutely agree that we should change the name. I originally changed the name to `codes` in this PR, but decided to scrap it because `self.codes` in `EuclideanCodebook` denotes the embeddings, and I wanted to avoid confusion. Another option, which I think would be very appropriate, would be to use `tokens` to denote the discrete representation instead of `indices`.
After talking about it, I agree it makes the most sense to refer to the indices as "tokens" and the codebook embeddings can either be called "codes" or "dequantized" depending on the context. Some of the renaming and convention changes can be left for a future PR.
Force-pushed 3d4039c to 8955274
@@ -286,7 +286,7 @@ def remove_weight_norm(self):
         res_block.remove_weight_norm()

     def forward(self, inputs, input_len):
-        audio_len = input_len
+        audio_len = input_len.detach().clone()
This resulted in in-place updates to `input_len`, and caused some confusion. For example, running:
encoded, encoded_len = model.encode_audio(audio=audio, audio_len=audio_len)
output_audio, output_audio_len = model.decode_audio(inputs=encoded, input_len=encoded_len)
Now `encoded_len` is equal to `output_audio_len` (they're the same tensor).
Is it updating it in-place because it is written `audio_len *= up_sample_rate` instead of `audio_len = audio_len * up_sample_rate`?
Yeah, that seems to be the culprit.
We can change that instead.
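A minimal sketch of the aliasing behavior under discussion (variable names are illustrative):

```python
import torch

up_sample_rate = 2
input_len = torch.tensor([100, 80])

audio_len = input_len        # audio_len aliases the caller's tensor
audio_len *= up_sample_rate  # in-place multiply: mutates input_len too
print(input_len)             # tensor([200, 160]) -- caller's tensor changed

# Either fix avoids mutating the caller's tensor:
input_len = torch.tensor([100, 80])
audio_len = input_len.detach().clone()  # the fix in this commit
audio_len *= up_sample_rate
# ...or simply use an out-of-place multiply:
audio_len = input_len * up_sample_rate
print(input_len)             # tensor([100, 80]) -- unchanged
```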
Force-pushed 309909c to 9596479
Signed-off-by: Ante Jukić <[email protected]>
Force-pushed 9596479 to 1f8f715
Signed-off-by: Ante Jukić <[email protected]>
What does this PR do?
This PR extends the API for `AudioCodec`, adds comments, and includes small fixes.
Collection: TTS
Changelog
- `encode` to convert a time-domain signal into a discrete representation
- `decode` to convert a discrete (quantized) representation to audio
- `quantize` to convert a continuous encoded representation into a quantized discrete representation
- `dequantize` to convert a discrete representation to a continuous encoded representation
Usage
Example of usage:
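A minimal sketch of the updated API, based on the changelog above (the checkpoint name is hypothetical, and the exact argument names should be checked against the diff):

```python
import torch
from nemo.collections.tts.models import AudioCodecModel

# Hypothetical checkpoint name; substitute a real pretrained audio codec.
model = AudioCodecModel.from_pretrained(model_name="audio_codec_model").eval()

audio = torch.randn(1, 16000)      # [B, T_audio]
audio_len = torch.tensor([16000])  # [B]

with torch.no_grad():
    # Audio -> discrete tokens in batch-first [B, D, T] layout, and back.
    tokens, tokens_len = model.encode(audio=audio, audio_len=audio_len)
    output_audio, output_len = model.decode(tokens=tokens, tokens_len=tokens_len)

    # Or step through the intermediate continuous representation:
    encoded, encoded_len = model.encode_audio(audio=audio, audio_len=audio_len)
    indices = model.quantize(encoded=encoded, encoded_len=encoded_len)
    dequantized = model.dequantize(indices=indices, encoded_len=encoded_len)
```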