
[TTS] Vits Available Model List needs to be updated #5878

Closed
dustinjoe opened this issue Jan 27, 2023 · 5 comments

Labels: bug (Something isn't working), TTS

Comments

@dustinjoe
Hello. I want to try finetuning the newly updated VITS model in the TTS module.
I am trying to follow the instructions at:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_lj_vits

But I got this error:

```
FileNotFoundError: Model tts_en_lj_vits was not found. Check cls.list_available_models() for the list of all available models.
```

because print(VitsModel.list_available_models()) currently returns an empty list.
This seems similar to the error in Tacotron2 I encountered last time here:
#5714

Thank you!

@dustinjoe dustinjoe added the bug Something isn't working label Jan 27, 2023
@dustinjoe (Author)

One additional question: I manually downloaded the .nemo format model file and restored from it, but I am hitting an error in the finetuning trial.

I made a separate vits_finetune.py based on fastpitch_finetune.py and vits.py, as follows:

```python
import pytorch_lightning as pl

from nemo.collections.common.callbacks import LogEpochTimeCallback
from nemo.collections.tts.models.vits import VitsModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager


@hydra_runner(config_path="conf", config_name="vits")
def main(cfg):
    if hasattr(cfg.model.optim, 'sched'):
        logging.warning("You are using an optimizer scheduler while finetuning. Are you sure this is intended?")
    if cfg.model.optim.lr > 1e-3 or cfg.model.optim.lr < 1e-5:
        logging.warning("The recommended learning rate for finetuning is 2e-4")

    # VITS uses its own batch sampler, so DDP sampler replacement is disabled.
    trainer = pl.Trainer(replace_sampler_ddp=False, **cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))

    # Swapped in VitsModel for the FastPitchModel used in fastpitch_finetune.py.
    model = VitsModel(cfg=cfg.model, trainer=trainer)
    model.maybe_init_from_pretrained_checkpoint(cfg=cfg)

    lr_logger = pl.callbacks.LearningRateMonitor()
    epoch_time_logger = LogEpochTimeCallback()
    trainer.callbacks.extend([lr_logger, epoch_time_logger])
    trainer.fit(model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
```

Here is the command I am using:

```shell
!(python vits_finetune.py --config-name=vits.yaml \
  train_dataset=./data_train80.json \
  validation_datasets=./data_val20.json \
  phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.08 \
  heteronyms_path=tts_dataset_files/heteronyms-052722 \
  whitelist_path=tts_dataset_files/lj_speech.tsv \
  exp_manager.exp_dir=./ljspeech_to_target_no_mixing \
  +init_from_nemo_model=../vits_ljspeech_fp16_full.nemo \
  trainer.max_epochs=1000 \
  trainer.check_val_every_n_epoch=1 \
  model.train_ds.batch_sampler.batch_size=32 model.validation_ds.dataloader_params.batch_size=32 \
)
```

Here is the error I got:

```
RuntimeError: Error(s) in loading state_dict for VitsModel:
	size mismatch for net_g.enc_p.emb.weight: copying a param with shape torch.Size([96, 192]) from checkpoint, the shape in current model is torch.Size([57, 192]).
```

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I checked the audio files: all of them are single channel and 22050 Hz, so I am not sure what I am missing. Any suggestions on VITS finetuning? Thank you!

@treacker (Contributor)

treacker commented Feb 8, 2023

Hi, thanks for the notice. I've made a PR with an update for list_available_models().

```
RuntimeError: Error(s) in loading state_dict for VitsModel:
	size mismatch for net_g.enc_p.emb.weight: copying a param with shape torch.Size([96, 192]) from checkpoint, the shape in current model is torch.Size([57, 192]).
```

net_g.enc_p.emb.weight is the first layer of the TextEncoder, and its shape corresponds to the number of tokens the model uses. I could not reproduce your error. Are you sure you are using the same vocabulary?
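A mismatch like this can be diagnosed by comparing parameter shapes between the checkpoint's state_dict and the freshly built model's before loading. A minimal sketch, using plain tuples in place of torch.Size objects (find_shape_mismatches is a hypothetical helper, not part of NeMo):

```python
def find_shape_mismatches(ckpt_shapes, model_shapes):
    """Return parameter names whose shapes differ between checkpoint and model.

    Both arguments map parameter names to shape tuples, e.g. the result of
    {k: tuple(v.shape) for k, v in state_dict.items()} on a torch state_dict.
    """
    mismatches = {}
    for name, ckpt_shape in ckpt_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches[name] = (ckpt_shape, model_shape)
    return mismatches


# Shapes as reported in the error above: the embedding's first dimension is
# the tokenizer vocabulary size, so 96 vs 57 means the checkpoint and the
# config are using different token sets.
ckpt = {"net_g.enc_p.emb.weight": (96, 192)}
model = {"net_g.enc_p.emb.weight": (57, 192)}
print(find_shape_mismatches(ckpt, model))
# → {'net_g.enc_p.emb.weight': ((96, 192), (57, 192))}
```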

@dustinjoe (Author)

Hi, thanks for your reply. I am training on a set of English wav audios. For the command, I am reusing the file paths from the FastPitch finetuning procedure:

phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.08
heteronyms_path=tts_dataset_files/heteronyms-052722
whitelist_path=tts_dataset_files/lj_speech.tsv

Is this the vocabulary you mean? Are there other relevant settings I need to change? Or could it have something to do with the compute environment? I am training on a single GPU, so I changed trainer.devices=2 to trainer.devices=1 in the yaml config file.
Thank you!

@treacker (Contributor)

The model was trained with IPA tokens. Can you please use "scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.10.txt"? I think this is the issue.
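Assuming the same Hydra override style as the finetuning command earlier in the thread, swapping in the IPA dictionary would look roughly like this (the other overrides from the original command stay as they were; any IPA-specific tokenizer settings in vits.yaml would also need to match the checkpoint):

```shell
python vits_finetune.py --config-name=vits.yaml \
  phoneme_dict_path=scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.10.txt \
  heteronyms_path=tts_dataset_files/heteronyms-052722 \
  whitelist_path=tts_dataset_files/lj_speech.tsv \
  +init_from_nemo_model=../vits_ljspeech_fp16_full.nemo
```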

@dustinjoe (Author)

Oh, thanks for your reply. This was indeed the issue; I have successfully trained and tested the finetuned model now.
I was wondering if you could give some general suggestions on VITS finetuning, for example how many epochs to use? I currently used 1000 in my experiment.
My current impression is that the audio from VITS finetuning sounds livelier than FastPitch's, but also seems to contain more small errors.
Thank you!
