
[Bug] Error occurs when resuming training of an XTTS model #3131

Closed
yiliu-mt opened this issue Nov 1, 2023 · 7 comments · Fixed by coqui-ai/Trainer#131 or coqui-ai/Trainer#135 · May be fixed by idiap/coqui-ai-TTS#270
Labels: bug (Something isn't working)

Comments

yiliu-mt commented Nov 1, 2023

Describe the bug

An error occurs when I try to resume training of an XTTS model. Details are described below.

To Reproduce

First, I train an XTTS model using the official script:

cd TTS/recipes/ljspeech/xtts_v1
CUDA_VISIBLE_DEVICES="0" python train_gpt_xtts.py

Then the training was interrupted, so I tried to resume it using:

CUDA_VISIBLE_DEVICES="0" python train_gpt_xtts.py --continue_path run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000

It failed to resume the training process.

Expected behavior

No response

Logs

>> DVAE weights restored from: /nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/run/training/XTTS_v1.1_original_model_files/dvae.pth
 | > Found 13100 files in /nfs2/speech/data/tts/Datasets/LJSpeech-1.1
fatal: detected dubious ownership in repository at '/nfs2/yi.liu/src/TTS'
To add an exception for this directory, call:

        git config --global --add safe.directory /nfs2/yi.liu/src/TTS
 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 64
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/
 > Restoring from checkpoint_1973.pth ...
 > Restoring Model...
 > Restoring Optimizer...
 > Model restored from step 1973

 > Model has 543985103 parameters
 > Restoring best loss from best_model_1622.pth ...
--- Logging error ---
Traceback (most recent call last):
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: must be real number, not dict
Call stack:
  File "/nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/train_gpt_xtts.py", line 182, in <module>
    main()
  File "/nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/train_gpt_xtts.py", line 178, in main
    trainer.fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1746, in _fit
    self._restore_best_loss()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1710, in _restore_best_loss
    logger.info(" > Starting with loaded last best loss %f", self.best_loss)
Message: ' > Starting with loaded last best loss %f'
Arguments: {'train_loss': 0.03659261970647744, 'eval_loss': None}
--- Logging error ---
Traceback (most recent call last):
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: must be real number, not dict
Call stack:
  File "/nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/train_gpt_xtts.py", line 182, in <module>
    main()
  File "/nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/train_gpt_xtts.py", line 178, in main
    trainer.fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1746, in _fit
    self._restore_best_loss()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1710, in _restore_best_loss
    logger.info(" > Starting with loaded last best loss %f", self.best_loss)
Message: ' > Starting with loaded last best loss %f'
Arguments: {'train_loss': 0.03659261970647744, 'eval_loss': None}

 > EPOCH: 0/1000
 --> run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/
 ! Run is kept in run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/
Traceback (most recent call last):
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1762, in _fit
    self.eval_epoch()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1610, in eval_epoch
    self.get_eval_dataloader(
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 976, in get_eval_dataloader
    return self._get_loader(
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 895, in _get_loader
    loader = model.get_data_loader(
  File "/nfs2/yi.liu/src/TTS/TTS/tts/layers/xtts/trainer/gpt_trainer.py", line 337, in get_data_loader
    dataset = XTTSDataset(self.config, samples, self.xtts.tokenizer, config.audio.sample_rate, is_eval)
  File "/nfs2/yi.liu/src/TTS/TTS/tts/layers/xtts/trainer/dataset.py", line 83, in __init__
    self.debug_failures = model_args.debug_loading_failures
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/coqpit/coqpit.py", line 626, in __getattribute__
    value = super().__getattribute__(arg)
AttributeError: 'XttsArgs' object has no attribute 'debug_loading_failures'

Environment

TTS: v0.19.1
pytorch: 2.0.1+cu117
python: 3.9.18

Additional context

Please inform me if any other information is needed.

yiliu-mt added the bug label Nov 1, 2023
manmay-nakhashi (Collaborator) commented:

use --restore_path

yiliu-mt (Author) commented Nov 1, 2023

Thanks for the advice!
However, I wonder whether "continue_path" and "restore_path" behave the same way. As I understand it, restore_path is mostly used for a pretrained model: the optimizer states start from scratch and the step counter starts at 0. With continue_path, the optimizer states are restored and training continues from the last step before the interruption.
So does it work the same if I use --restore_path instead?
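
One way to sanity-check what either flag can actually restore is to inspect the checkpoint directly. This is a minimal sketch, assuming the checkpoint is a plain torch-saved dict as written by the Trainer (the filename is taken from the logs above; the exact key names may vary by version):

import torch

# Load the checkpoint produced by the interrupted run (path from the logs above).
ckpt = torch.load(
    "run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/checkpoint_1973.pth",
    map_location="cpu",
)

# Trainer checkpoints typically store the model weights, optimizer state, and step.
print(sorted(ckpt.keys()))   # e.g. "model", "optimizer", "step", ... (if saved that way)
print(ckpt.get("step"))      # global step at save time, if stored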

Edresson self-assigned this Nov 7, 2023
Edresson (Contributor) commented Nov 7, 2023

Hi @yiliu-mt,

As @manmay-nakhashi said, you should use --restore_path or set the XTTS_checkpoint path.

--restore_path restores the optimizer as well when possible: https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L791-L850

Currently, --continue_path does not work for most models.
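
For concreteness, a resume invocation along these lines should work. This is a sketch rather than the exact recipe command: it assumes the recipe script forwards the Trainer's --restore_path flag just as it does --continue_path above, and the checkpoint filename is taken from the logs in this issue:

CUDA_VISIBLE_DEVICES="0" python train_gpt_xtts.py \
    --restore_path run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/checkpoint_1973.pth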

yiliu-mt (Author) commented Nov 7, 2023

I see. Thanks for the advice!

yiliu-mt closed this as completed Nov 7, 2023
shine-xia commented:

> Hi @yiliu-mt,
>
> As @manmay-nakhashi said, you should use --restore_path or set the XTTS_checkpoint path.
>
> --restore_path restores the optimizer as well when possible: https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L791-L850
>
> Currently, --continue_path does not work for most models.

Well, I think it would be better to tell users that "--continue_path" does not work for most models.
I just followed the instructions in https://tts.readthedocs.io/en/dev/tutorial_for_nervous_beginners.html, and the tutorial tells me to use "--continue_path" to continue a previous training run.

mengting7tw (Contributor) commented Dec 11, 2023

> Hi @yiliu-mt,
>
> As @manmay-nakhashi said, you should use --restore_path or set the XTTS_checkpoint path.
>
> --restore_path restores the optimizer as well when possible: https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L791-L850
>
> Currently, --continue_path does not work for most models.

Hi @Edresson,
I'm wondering whether "--restore_path" behaves any differently from the previously used "--continue_path".
Thanks for clarifying!

NiHaoUCAS commented:

The root cause of the failure is that some fields in model_args are discarded when config.json is reloaded (https://github.com/coqui-ai/Trainer/blob/main/trainer/trainer.py#L737).

The model_args passed to the trainer are of type GPTArgs, but the config defines model_args as XttsArgs. XttsArgs (https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/configs/xtts_config.py#L70) is missing some GPTArgs fields (e.g. debug_loading_failures, as seen in the traceback above).

There are two possible solutions (a hedged sketch of option 1 follows below):
1. Complete the missing fields in XttsArgs.
2. Replace config.load_json(args.config_path) with pass at https://github.com/coqui-ai/Trainer/blob/main/trainer/trainer.py#L737.
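
As an illustration of option 1 without editing the library, one could back-fill the missing fields on the loaded config before the Trainer is built. This is a minimal sketch, assuming GPTArgs is importable from TTS.tts.layers.xtts.trainer.gpt_trainer as in the traceback above (exact import path and defaults may differ by version); backfill_gpt_args is a hypothetical helper, not part of the TTS API:

from dataclasses import fields

from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs

def backfill_gpt_args(model_args):
    # Copy default values for any GPTArgs fields that the re-created
    # XttsArgs/model_args object is missing (e.g. debug_loading_failures).
    defaults = GPTArgs()
    for f in fields(GPTArgs):
        if not hasattr(model_args, f.name):
            setattr(model_args, f.name, getattr(defaults, f.name))
    return model_args

# Illustrative usage inside the recipe, after the config is loaded and before
# Trainer(...) is constructed:
# config.model_args = backfill_gpt_args(config.model_args)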
