
[Bug] Error occurs when resuming training of an XTTS model #3131

Closed
yiliu-mt opened this issue Nov 1, 2023 · 7 comments · Fixed by coqui-ai/Trainer#131 or coqui-ai/Trainer#135 · May be fixed by idiap/coqui-ai-TTS#270
Labels: bug (Something isn't working)

Comments

yiliu-mt commented Nov 1, 2023

Describe the bug

An error occurs when I try to resume training of an XTTS model. Details are described below.

To Reproduce

First, I train an XTTS model using the official script:

cd TTS/recipes/ljspeech/xtts_v1
CUDA_VISIBLE_DEVICES="0" python train_gpt_xtts.py

Then the training was interrupted, so I tried to resume it using:

CUDA_VISIBLE_DEVICES="0" python train_gpt_xtts.py --continue_path run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000

It failed to resume the training process.

Expected behavior

No response

Logs

>> DVAE weights restored from: /nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/run/training/XTTS_v1.1_original_model_files/dvae.pth
 | > Found 13100 files in /nfs2/speech/data/tts/Datasets/LJSpeech-1.1
fatal: detected dubious ownership in repository at '/nfs2/yi.liu/src/TTS'
To add an exception for this directory, call:

        git config --global --add safe.directory /nfs2/yi.liu/src/TTS
 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 64
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/
 > Restoring from checkpoint_1973.pth ...
 > Restoring Model...
 > Restoring Optimizer...
 > Model restored from step 1973

 > Model has 543985103 parameters
 > Restoring best loss from best_model_1622.pth ...
--- Logging error ---
Traceback (most recent call last):
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: must be real number, not dict
Call stack:
  File "/nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/train_gpt_xtts.py", line 182, in <module>
    main()
  File "/nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/train_gpt_xtts.py", line 178, in main
    trainer.fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1746, in _fit
    self._restore_best_loss()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1710, in _restore_best_loss
    logger.info(" > Starting with loaded last best loss %f", self.best_loss)
Message: ' > Starting with loaded last best loss %f'
Arguments: {'train_loss': 0.03659261970647744, 'eval_loss': None}
--- Logging error ---
Traceback (most recent call last):
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/root/miniconda3/envs/xtts/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: must be real number, not dict
Call stack:
  File "/nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/train_gpt_xtts.py", line 182, in <module>
    main()
  File "/nfs2/yi.liu/src/TTS/recipes/ljspeech/xtts_v1/train_gpt_xtts.py", line 178, in main
    trainer.fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1746, in _fit
    self._restore_best_loss()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1710, in _restore_best_loss
    logger.info(" > Starting with loaded last best loss %f", self.best_loss)
Message: ' > Starting with loaded last best loss %f'
Arguments: {'train_loss': 0.03659261970647744, 'eval_loss': None}

 > EPOCH: 0/1000
 --> run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/
 ! Run is kept in run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/
Traceback (most recent call last):
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1762, in _fit
    self.eval_epoch()
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 1610, in eval_epoch
    self.get_eval_dataloader(
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 976, in get_eval_dataloader
    return self._get_loader(
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/trainer/trainer.py", line 895, in _get_loader
    loader = model.get_data_loader(
  File "/nfs2/yi.liu/src/TTS/TTS/tts/layers/xtts/trainer/gpt_trainer.py", line 337, in get_data_loader
    dataset = XTTSDataset(self.config, samples, self.xtts.tokenizer, config.audio.sample_rate, is_eval)
  File "/nfs2/yi.liu/src/TTS/TTS/tts/layers/xtts/trainer/dataset.py", line 83, in __init__
    self.debug_failures = model_args.debug_loading_failures
  File "/root/miniconda3/envs/xtts/lib/python3.9/site-packages/coqpit/coqpit.py", line 626, in __getattribute__
    value = super().__getattribute__(arg)
AttributeError: 'XttsArgs' object has no attribute 'debug_loading_failures'

Environment

TTS: v0.19.1
pytorch: 2.0.1+cu117
python: 3.9.18

Additional context

Please inform me if any other information is needed.

yiliu-mt added the bug label Nov 1, 2023
manmay-nakhashi (Collaborator) commented:

use --restore_path

yiliu-mt (Author) commented Nov 1, 2023

Thanks for the advice!
However, I wonder whether "continue_path" and "restore_path" behave the same way. As I understand it, restore_path is mostly used for a pretrained model: the optimizer states start from scratch and the step counter starts at 0. With continue_path, the optimizer states are restored and training continues from the last step before the interruption.
So does it work the same if I use --restore_path instead?
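
One way to sanity-check what either flag can actually restore is to inspect the checkpoint directly. This is a minimal sketch, assuming the checkpoint is a plain torch-saved dict as written by the Trainer (the filename is taken from the logs above; the exact key names may vary by version):

import torch

# Load the checkpoint produced by the interrupted run (path from the logs above).
ckpt = torch.load(
    "run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/checkpoint_1973.pth",
    map_location="cpu",
)

# Trainer checkpoints typically store the model weights, optimizer state, and step.
print(sorted(ckpt.keys()))   # e.g. "model", "optimizer", "step", ... (if saved that way)
print(ckpt.get("step"))      # global step at save time, if stored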

Edresson self-assigned this Nov 7, 2023
Edresson (Contributor) commented Nov 7, 2023

Hi @yiliu-mt,

As @manmay-nakhashi said, you should use --restore_path or set the XTTS_checkpoint path.

--restore_path restores the optimizer as well when possible: https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L791-L850

Currently, --continue_path does not work for most models.
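
For concreteness, a resume invocation along these lines should work. This is a sketch rather than the exact recipe command: it assumes the recipe script forwards the Trainer's --restore_path flag just as it does --continue_path above, and the checkpoint filename is taken from the logs in this issue:

CUDA_VISIBLE_DEVICES="0" python train_gpt_xtts.py \
    --restore_path run/training/GPT_XTTS_LJSpeech_FT-November-01-2023_08+42AM-0000000/checkpoint_1973.pth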

yiliu-mt (Author) commented Nov 7, 2023

I see. Thanks for the advice!

yiliu-mt closed this as completed Nov 7, 2023
shine-xia commented:

> Hi @yiliu-mt,
>
> As @manmay-nakhashi said, you should use --restore_path or set the XTTS_checkpoint path.
>
> --restore_path restores the optimizer as well when possible: https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L791-L850
>
> Currently, --continue_path does not work for most models.

Well, I think it would be better to tell users that "--continue_path" does not work for most models.
I just followed the instructions in https://tts.readthedocs.io/en/dev/tutorial_for_nervous_beginners.html, and the tutorial tells me to use "--continue_path" to continue a previous training run.

mengting7tw (Contributor) commented Dec 11, 2023

> Hi @yiliu-mt,
>
> As @manmay-nakhashi said, you should use --restore_path or set the XTTS_checkpoint path.
>
> --restore_path restores the optimizer as well when possible: https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L791-L850
>
> Currently, --continue_path does not work for most models.

Hi @Edresson,
I'm wondering whether "--restore_path" behaves any differently from the previously used "--continue_path".
Thanks for clarifying!

NiHaoUCAS commented:

The root cause of the failure is that some fields in model_args are discarded when config.json is reloaded (https://github.com/coqui-ai/Trainer/blob/main/trainer/trainer.py#L737).

The model_args passed to the trainer are of type GPTArgs, but the config defines model_args as XttsArgs. XttsArgs (https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/configs/xtts_config.py#L70) is missing some GPTArgs fields (e.g. debug_loading_failures, as seen in the traceback above).

There are two possible solutions (a hedged sketch of option 1 follows below):
1. Complete the missing fields in XttsArgs.
2. Replace config.load_json(args.config_path) with pass at https://github.com/coqui-ai/Trainer/blob/main/trainer/trainer.py#L737.
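
As an illustration of option 1 without editing the library, one could back-fill the missing fields on the loaded config before the Trainer is built. This is a minimal sketch, assuming GPTArgs is importable from TTS.tts.layers.xtts.trainer.gpt_trainer as in the traceback above (exact import path and defaults may differ by version); backfill_gpt_args is a hypothetical helper, not part of the TTS API:

from dataclasses import fields

from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs

def backfill_gpt_args(model_args):
    # Copy default values for any GPTArgs fields that the re-created
    # XttsArgs/model_args object is missing (e.g. debug_loading_failures).
    defaults = GPTArgs()
    for f in fields(GPTArgs):
        if not hasattr(model_args, f.name):
            setattr(model_args, f.name, getattr(defaults, f.name))
    return model_args

# Illustrative usage inside the recipe, after the config is loaded and before
# Trainer(...) is constructed:
# config.model_args = backfill_gpt_args(config.model_args)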
