
Some questions about the reproduction of this paper: from newcomer #25

Closed
wangxuanji opened this issue Nov 6, 2023 · 7 comments
Labels
documentation Improvements or additions to documentation

Comments

@wangxuanji

wangxuanji commented Nov 6, 2023

Hello, I am a novice in the field of speech and there is a lot I don't understand yet. I hope you don't mind taking the trouble to answer my questions. Thank you in advance.

Firstly, for the dataset, I used the LJSpeech dataset from the paper, with a total of 13,100 wavs. I divided it into training, validation, and test sets in a 7:2:1 ratio. Is there any problem with my approach, and how did you split the data in your experiments?

Secondly, I have some doubts about the code. Is the best model saved according to epoch (monitor: epoch # name of the logged metric which determines when the model is improving)? Does that mean the more iterations the model trains for, the better it gets? I also didn't find any other evaluation metrics in the code, such as RTF or WER computed after a run; perhaps my coding ability is just too limited. I don't quite understand this point.

Thirdly, there is a section of code in train.py:

if logger:
    log.info("Logging hyperparameters!")
    utils.log_hyperparameters(object_dict)

if cfg.get("train"):
    log.info("Starting training!")
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))

train_metrics = trainer.callback_metrics

if cfg.get("test"):
    log.info("Starting testing!")
    ckpt_path = trainer.checkpoint_callback.best_model_path
    if ckpt_path == "":
        log.warning("Best ckpt not found! Using current weights for testing...")
        ckpt_path = None
    trainer.test(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
    log.info(f"Best ckpt path: {ckpt_path}")

test_metrics = trainer.callback_metrics

# merge train and test metrics
metric_dict = {**train_metrics, **test_metrics}

return metric_dict, object_dict

When running this code, I only ever saw "Starting training!" and never "Starting testing!", and the testing section never ran. When I tried running for a small number of epochs instead of the 50k iterations in the paper and then stopped it, I encountered:

    raise MisconfigurationException(f"No {step_name}() method defined to run Trainer.{trainer_method}.")
    lightning.fabric.utilities.exceptions.MisconfigurationException: No test_step() method defined to run Trainer.test.

What is the reason for this?

These questions may seem childish to you, but they are really important to me. Could you please answer them? Thank you once again.

@shivammehta25
Owner

shivammehta25 commented Nov 6, 2023

Hello! @wangxuanji,

Thank you for your interest in 🍵 Matcha-TTS. These are a great, hands-on set of questions, which I really enjoyed answering.

Firstly, for the dataset, I used the LJSpeech dataset from the paper, with a total of 13,100 wavs. I divided it into training, validation, and test sets in a 7:2:1 ratio. Is there any problem with my approach, and how did you split the data in your experiments?

I used a split similar to Tacotron 2. However, since diffusion-type losses often do not correlate very well with perceived model performance, I did not use the test set to evaluate any metrics during training.
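If you do want explicit filelists, a minimal sketch of splitting the LJSpeech metadata into train/validation lists is below. This is not the paper's exact protocol; the paths and split size are illustrative assumptions.

# Minimal sketch for splitting LJSpeech metadata.csv into train/validation
# filelists; paths and split size are illustrative, not the paper's protocol.
import random

random.seed(1234)

with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    lines = f.read().strip().split("\n")

random.shuffle(lines)
n_val = 100  # size of the held-out validation set

with open("filelists/ljs_val_filelist.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:n_val]) + "\n")
with open("filelists/ljs_train_filelist.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[n_val:]) + "\n")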

Is the best model saved according to epoch (monitor: epoch # name of the logged metric which determines when the model is improving)? Does that mean the more iterations the model trains for, the better it gets? I also didn't find any other evaluation metrics in the code, such as RTF or WER computed after a run; perhaps my coding ability is just too limited. I don't quite understand this point.

Perhaps; it is a heuristic of diffusion-type models that the longer you train them, the better they generally get, and we saw the same during our preliminary experiments. As for RTF, it doesn't change even if the model is untrained; it is an architectural property rather than a training one. For WER, I computed it by loading the Whisper model separately, offline. I do not suggest doing it online during training, as Whisper is a bulky model in itself and it significantly reduced training speed when I tried transcribing in parallel.
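For reference, a rough sketch of such an offline WER computation could look like the following, assuming the openai-whisper and jiwer packages; this is not the paper's actual evaluation script, and the paths and transcripts are placeholders.

# Rough offline WER sketch (not the paper's evaluation script).
# Assumes the openai-whisper and jiwer packages are installed;
# paths and transcripts below are placeholders.
import whisper
from jiwer import wer

asr = whisper.load_model("medium.en")  # any Whisper size works; larger = slower

# (path to synthesised wav, reference transcript) pairs for the held-out set
eval_pairs = [
    ("synth/utt_0001.wav", "reference transcript of the first utterance"),
    ("synth/utt_0002.wav", "reference transcript of the second utterance"),
]

references, hypotheses = [], []
for wav_path, ref_text in eval_pairs:
    hypotheses.append(asr.transcribe(wav_path)["text"].lower())
    references.append(ref_text.lower())

print(f"WER: {wer(references, hypotheses):.3f}")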

Thirdly, there is a section of code in train.py:

I used pytorch-hydra-template, which I found really amazing during my R&D iterations. Since my aim was not to evaluate the test set (similar to diffusion, the CFM loss is very noisy), I removed the eval file, which you can put back in case you need to evaluate.

And then in baselightningmodule.py you would need to define a def test_step(self): to run your testing evaluations.
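For anyone putting this back, here is a hypothetical sketch of what such a test_step could look like. It is not part of the repository and assumes the module already computes its losses for a batch via a helper (called get_losses here, mirroring how training/validation steps are usually written in this template); adjust the names to match the actual module.

# Hypothetical sketch for baselightningmodule.py, not part of the repository.
# Assumes a get_losses(batch) helper exists (as in the training/validation
# steps of this template); rename to match the actual module.
def test_step(self, batch, batch_idx):
    loss_dict = self.get_losses(batch)
    self.log_dict(
        {f"test/{name}": value for name, value in loss_dict.items()},
        on_step=False,
        on_epoch=True,
        prog_bar=True,
        sync_dist=True,
    )
    return sum(loss_dict.values())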

These questions may seem childish to you, but they are really important to me. Could you please answer them? Thank you once again.

Please, these are great questions; thank you very much for coming forward and asking them. It will improve the experience for others too.

Hope this answers your questions. Feel free to continue the discussion in case you have any more.

Regards,
Shivam

@shivammehta25 added the documentation (Improvements or additions to documentation) label on Nov 6, 2023
@wangxuanji
Author

Hello, seeing your reply made me very happy. Once again, thank you. I have a few more minor questions; could you please answer them as well? I hope this isn't too much trouble.
Firstly, for RTF: it is only related to the model, which means that if the model changes, the RTF will change accordingly, and it has nothing to do with training.
Secondly, when calculating RTF, is it necessary to use ONNX inference to synthesise the test set one utterance at a time and then take the mean and interval?
Thank you again for replying amidst your busy schedule.

@shivammehta25
Owner

Firstly, for RTF: it is only related to the model, which means that if the model changes, the RTF will change accordingly, and it has nothing to do with training.

This is correct. What I meant by model changes is a change in the model architecture or in hyperparameters such as network sizes, number of layers, number of ODE solver steps, etc.

Secondly, when calculating RTF, is it necessary to use ONNX inference to synthesise the test set one utterance at a time and then take the mean and interval?

No, not at all. Actually, all the numbers in the paper are without ONNX export. The CLI Arguments section has commands that do this for you. Specifically, if you pass a file using

matcha-tts --file <PATH TO FILE>

You will have it synthesised one by one and get individual + mean RTF values.

However, there is also a faster way to do it: just pass --batched

matcha-tts --file <PATH TO FILE> --batched

and it will do batched synthesis, which can be significantly faster if you have multiple utterances.
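If you then want the mean and an interval yourself, you can aggregate the per-utterance RTF values printed by the CLI; a small sketch (the numbers below are placeholders, not measured values):

# Aggregate per-utterance RTF values into a mean and an approximate 95%
# confidence interval; the numbers below are placeholders, substitute the
# RTFs reported by the CLI for your own test utterances.
import numpy as np

rtfs = np.array([0.031, 0.028, 0.035, 0.030])  # placeholder per-utterance RTFs

mean = rtfs.mean()
half_width = 1.96 * rtfs.std(ddof=1) / np.sqrt(len(rtfs))  # normal approximation
print(f"RTF: {mean:.4f} ± {half_width:.4f}")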

Hope this helps.

@wangxuanji
Author

Thank you for your reply. It has been very helpful to me. I wish you success in your work and a happy life!

@shivammehta25
Owner

Thank you for your kind words! I wish you the same 😄

@kin0303

kin0303 commented Dec 4, 2024

And then in baselightningmodule.py you would need to define a def test_step(self): to run your testing evaluations.

Hi @shivammehta25, can you provide the def test_step(self) function?

@shivammehta25
Owner

We didn't use the test_step functionality; instead, we just ran synthesis using the provided checkpoint and cli.py.
