Question about Codec #12

Open
Paulmzr opened this issue Sep 5, 2024 · 2 comments

Comments


Paulmzr commented Sep 5, 2024

Hi, thanks for your great efforts. I noticed that you wrote "Meta's Encodec 24K version was also tested, but it could not be trained." Does that mean that using Meta's Encodec leads to poor performance?

CODEJIN (Owner) commented Sep 5, 2024

Dear @Paulmzr,

Hello,

The training itself was not successful. Afterward, I ran a few tests independently, and I have personally drawn the following conclusions.

  1. Combining the NaturalSpeech2 code from this repository with Encodec does not train properly in my current environment.

  2. The possible causes of this could be the following:

  • The written code is incomplete.
  • When a codec trained on a much wider range of external audio is used, the codec latent becomes too complex for the diffusion model to handle.
  • As the number of RVQ stacks increases, the complexity of the final latent increases, which likewise makes it difficult for the diffusion model to handle.
  • Learning the relationship between text and the codec latent may not converge without a very large batch size.

  3. Regarding the first and second causes listed above, considering that a certain level of training is possible with HiFi-Codec, I believe they are unlikely to be the main reasons, even if they contribute to the issue.

  4. The increase in complexity due to many RVQ stacks could be a real cause of the problem. HiFi-Codec, which does train, uses only 4 VQs and even splits the dimension in half, with 2 stacks for each half, which is a much simpler structure (see the first sketch after this list for how Encodec's codebook count grows with bandwidth).

  5. The need for a large batch size may be linked to that complexity and could also be a cause. However, it is difficult to verify with the time and GPU resources I have; even with gradient accumulation (sketched after this list), it is not easy to fully validate given the time constraints.
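To make the RVQ point concrete, here is a minimal sketch, assuming Meta's `encodec` PyTorch package, that prints how many residual codebooks the 24 kHz model uses at each target bandwidth and the shape of the continuous encoder latent that a NaturalSpeech2-style diffusion model would have to predict. The variable names are illustrative only and are not code from this repository.

```python
# Sketch: inspect how many RVQ codebooks Encodec 24 kHz uses per target
# bandwidth, and the continuous encoder latent a diffusion model would predict.
# Assumes the `encodec` package (facebookresearch/encodec); dummy audio only.
import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
wav = torch.randn(1, 1, 24000)  # 1 second of dummy mono audio at 24 kHz

for bandwidth in (1.5, 3.0, 6.0, 12.0, 24.0):
    model.set_target_bandwidth(bandwidth)
    with torch.no_grad():
        frames = model.encode(wav)      # list of (codes, scale) tuples
    codes = frames[0][0]                # shape: [batch, n_q, time]
    print(f"{bandwidth:>4} kbps -> {codes.shape[1]} RVQ codebooks")

# Continuous latent before quantization (roughly [batch, 128, 75] for 1 s).
with torch.no_grad():
    latent = model.encoder(wav)
print("encoder latent shape:", latent.shape)
```

The higher bandwidths add more residual codebooks, which is the growth in latent complexity referred to in point 4.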
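On the batch-size point, gradient accumulation is the usual workaround when the GPU cannot hold the batch size that text-to-latent training seems to need. Below is a minimal, self-contained sketch of that pattern; the model and data are dummies, and only the accumulation loop itself is the point.

```python
# Sketch: gradient accumulation to emulate a larger effective batch size
# when GPU memory is limited. Model and data are stand-ins, not repo code.
import torch
import torch.nn as nn

model = nn.Linear(128, 128)                       # stand-in for the diffusion network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                                   # effective batch = micro-batch * 8

model.train()
optimizer.zero_grad()
for step in range(32):                            # 32 micro-batches -> 4 optimizer steps
    x = torch.randn(4, 128)                       # micro-batch of latent frames
    target = torch.randn(4, 128)
    loss = nn.functional.mse_loss(model(x), target)
    (loss / accum_steps).backward()               # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```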

If you have any feedback on this matter, I would greatly appreciate it.

Thank you.

Paulmzr commented Sep 6, 2024

@CODEJIN Thank you for your detailed response! I will try to train it and share my findings!
