
[TTS] Vits Available Model List needs to be updated #5878

Closed
dustinjoe opened this issue Jan 27, 2023 · 5 comments

Labels: bug (Something isn't working), TTS

Comments

@dustinjoe
Hello. I want to try finetuning the newly updated VITS model in the TTS module.
I am trying to follow the instructions at:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_lj_vits

But I got this error:

```
FileNotFoundError: Model tts_en_lj_vits was not found. Check cls.list_available_models() for the list of all available models.
```

because print(VitsModel.list_available_models()) currently returns an empty list.
This seems similar to the error in Tacotron2 I encountered last time here:
#5714

Thank you!

@dustinjoe dustinjoe added the bug Something isn't working label Jan 27, 2023
@dustinjoe (Author)

One additional question: I manually downloaded the .nemo format model file and restored from it, but I am hitting an error in the finetuning trial.

I made a separate vits_finetune.py based on fastpitch_finetune.py and vits.py, as follows:

```python
import pytorch_lightning as pl

from nemo.collections.common.callbacks import LogEpochTimeCallback
from nemo.collections.tts.models.vits import VitsModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager


@hydra_runner(config_path="conf", config_name="vits")
def main(cfg):
    if hasattr(cfg.model.optim, 'sched'):
        logging.warning("You are using an optimizer scheduler while finetuning. Are you sure this is intended?")
    if cfg.model.optim.lr > 1e-3 or cfg.model.optim.lr < 1e-5:
        logging.warning("The recommended learning rate for finetuning is 2e-4")

    # VITS uses its own batch sampler, so DDP sampler replacement is disabled.
    trainer = pl.Trainer(replace_sampler_ddp=False, **cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))

    # Swapped in VitsModel for the FastPitchModel used in fastpitch_finetune.py.
    model = VitsModel(cfg=cfg.model, trainer=trainer)
    model.maybe_init_from_pretrained_checkpoint(cfg=cfg)

    lr_logger = pl.callbacks.LearningRateMonitor()
    epoch_time_logger = LogEpochTimeCallback()
    trainer.callbacks.extend([lr_logger, epoch_time_logger])
    trainer.fit(model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
```

Here is the command I am using:

```shell
!(python vits_finetune.py --config-name=vits.yaml \
  train_dataset=./data_train80.json \
  validation_datasets=./data_val20.json \
  phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.08 \
  heteronyms_path=tts_dataset_files/heteronyms-052722 \
  whitelist_path=tts_dataset_files/lj_speech.tsv \
  exp_manager.exp_dir=./ljspeech_to_target_no_mixing \
  +init_from_nemo_model=../vits_ljspeech_fp16_full.nemo \
  trainer.max_epochs=1000 \
  trainer.check_val_every_n_epoch=1 \
  model.train_ds.batch_sampler.batch_size=32 model.validation_ds.dataloader_params.batch_size=32 \
)
```

Here is the error I got:

```
RuntimeError: Error(s) in loading state_dict for VitsModel:
	size mismatch for net_g.enc_p.emb.weight: copying a param with shape torch.Size([96, 192]) from checkpoint, the shape in current model is torch.Size([57, 192]).
```

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I checked the audio files: all of them are single channel and 22050 Hz, so I am not sure what I am missing. Any suggestions on VITS finetuning? Thank you!

@treacker (Contributor)

treacker commented Feb 8, 2023

Hi, thanks for the notice. I've made a PR with an update for list_available_models().

```
RuntimeError: Error(s) in loading state_dict for VitsModel:
	size mismatch for net_g.enc_p.emb.weight: copying a param with shape torch.Size([96, 192]) from checkpoint, the shape in current model is torch.Size([57, 192]).
```

net_g.enc_p.emb.weight is the first layer of the TextEncoder, and its shape corresponds to the number of tokens the model uses. I could not reproduce your error. Are you sure you are using the same vocabulary?
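A mismatch like this can be diagnosed by comparing parameter shapes between the checkpoint's state_dict and the freshly built model's before loading. A minimal sketch, using plain tuples in place of torch.Size objects (find_shape_mismatches is a hypothetical helper, not part of NeMo):

```python
def find_shape_mismatches(ckpt_shapes, model_shapes):
    """Return parameter names whose shapes differ between checkpoint and model.

    Both arguments map parameter names to shape tuples, e.g. the result of
    {k: tuple(v.shape) for k, v in state_dict.items()} on a torch state_dict.
    """
    mismatches = {}
    for name, ckpt_shape in ckpt_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches[name] = (ckpt_shape, model_shape)
    return mismatches


# Shapes as reported in the error above: the embedding's first dimension is
# the tokenizer vocabulary size, so 96 vs 57 means the checkpoint and the
# config are using different token sets.
ckpt = {"net_g.enc_p.emb.weight": (96, 192)}
model = {"net_g.enc_p.emb.weight": (57, 192)}
print(find_shape_mismatches(ckpt, model))
# → {'net_g.enc_p.emb.weight': ((96, 192), (57, 192))}
```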

@dustinjoe (Author)

Hi, thanks for your reply. I am training on a set of English wav audios. For the command, I am reusing the file paths from the FastPitch finetuning procedure:

phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.08
heteronyms_path=tts_dataset_files/heteronyms-052722
whitelist_path=tts_dataset_files/lj_speech.tsv

Is this the vocabulary you mean? Are there other relevant settings I need to change? Or could it have something to do with the compute environment? I am training on a single GPU, so I changed trainer.devices=2 to trainer.devices=1 in the yaml config file.
Thank you!

@treacker (Contributor)

The model was trained with IPA tokens. Can you please use "scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.10.txt"? I think this is the issue.
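Assuming the same Hydra override style as the finetuning command earlier in the thread, swapping in the IPA dictionary would look roughly like this (the other overrides from the original command stay as they were; any IPA-specific tokenizer settings in vits.yaml would also need to match the checkpoint):

```shell
python vits_finetune.py --config-name=vits.yaml \
  phoneme_dict_path=scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.10.txt \
  heteronyms_path=tts_dataset_files/heteronyms-052722 \
  whitelist_path=tts_dataset_files/lj_speech.tsv \
  +init_from_nemo_model=../vits_ljspeech_fp16_full.nemo
```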

@dustinjoe (Author)

Oh, thanks for your reply. This was indeed the issue; I have successfully trained and tested the finetuned model now.
I was wondering if you could give some general suggestions on VITS finetuning, for example how many epochs to use? I currently used 1000 in my experiment.
My current impression is that the audio from VITS finetuning sounds livelier than FastPitch's, but also seems to contain more small errors.
Thank you!
