
Error when training model: AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown' #33

Closed
Subarasheese opened this issue Aug 29, 2023 · 14 comments

@Subarasheese
Contributor

Greetings,

As seen on #10 (comment), someone successfully trained models, so I decided to try it myself.

I used the following command:

python train.py -c configs/vits2_voice_training.json -m mydataset

However, the following happens:

INFO:mydataset:{'train': {'log_interval': 867, 'eval_interval': 867, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_voice_1_filelist_v4.txt', 'validation_files': 'filelists/val_voice_1_filelist_v4.txt', 'text_cleaners': ['basic_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': False, 'n_speakers': 0, 'cleaned_text': True, 'use_mel_spec_posterior': False}, 'model': {'use_mel_posterior_encoder': False, 'use_transformer_flows': True, 'transformer_flow_type': 'pre_conv', 'use_spk_conditioned_encoder': False, 'use_noise_scaled_mas': True, 'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'max_text_len': 500, 'model_dir': './logs/mydataset'}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
Using lin posterior encoder for VITS1
Using transformer flows pre_conv for VITS2
Using normal encoder for VITS1
Using noise scaled MAS for VITS2
NOT using any duration discriminator like VITS1
Loading train data:   0%|                                                                                                                                 | 0/4 [00:00<?, ?it/s]
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f0f9a049240>
Traceback (most recent call last):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1397, in _shutdown_workers
    if not self._shutdown:
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown'
Traceback (most recent call last):
  File "/vits2_pytorch/train.py", line 417, in <module>
    main()
  File "/vits2_pytorch/train.py", line 54, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/vits2_pytorch/train.py", line 196, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc], [scheduler_g, scheduler_d, scheduler_dur_disc], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/vits2_pytorch/train.py", line 225, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(loader):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 988, in __init__
    super(_MultiProcessingDataLoaderIter, self).__init__(loader)
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 598, in __init__
    self._sampler_iter = iter(self._index_sampler)
  File "/vits2_pytorch/data_utils.py", line 400, in __iter__
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
ZeroDivisionError: integer division or modulo by zero

What could be the problem here, and what could I try to fix it?

@p0p4k
Owner

p0p4k commented Aug 30, 2023

Is this your private dataset? I suspect the text lengths might be below the minimum value, or something like that. You can copy the dataloader part into a .ipynb file and try to debug your data loading. Load the hps for the dataloader from the config file using the function in utils.py.
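
A minimal sketch of that debugging flow (assuming utils.get_hparams_from_file, TextAudioLoader, and DistributedBucketSampler match the original VITS layout, and that the boundaries below are what train.py passes; adjust names and values if this repo differs):

# Hedged sketch: inspect the dataset and the bucket sampler outside of training.
import utils
from data_utils import TextAudioLoader, DistributedBucketSampler

hps = utils.get_hparams_from_file("configs/vits2_voice_training.json")
train_dataset = TextAudioLoader(hps.data.training_files, hps.data)
print(len(train_dataset), "training samples")

# Boundaries assumed to match the original VITS defaults used in train.py.
sampler = DistributedBucketSampler(
    train_dataset,
    hps.train.batch_size,
    [32, 300, 400, 500, 600, 700, 800, 900, 1000],
    num_replicas=1,
    rank=0,
    shuffle=True,
)
print([len(b) for b in sampler.buckets])  # a 0 here reproduces the ZeroDivisionError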

@hildazzz

Or you can decrease the values of boundaries in DistributedBucketSampler and see what happens.

@Subarasheese
Contributor Author

Subarasheese commented Aug 30, 2023

Is this your private dataset? I suspect the text lengths might be below the minimum value, or something like that. You can copy the dataloader part into a .ipynb file and try to debug your data loading. Load the hps for the dataloader from the config file using the function in utils.py.

Yes, it is a custom dataset. My dataset looks normal, and the texts are pretty long. The code errors out after this line, and I am not sure why, as the error logs are not clear. What do I need to inspect to find out what is wrong?

[screenshots attached]

Or you can decrease the values of boundaries in DistributedBucketSampler and see what happens.

Where can I do that?

By the way, this is the config file I am using for training:


{
    "train": {
      "log_interval": 867,
      "eval_interval": 867,
      "seed": 1234,
      "epochs": 20000,
      "learning_rate": 2e-4,
      "betas": [0.8, 0.99],
      "eps": 1e-9,
      "batch_size": 16,
      "fp16_run": false,
      "lr_decay": 0.999875,
      "segment_size": 8192,
      "init_lr_ratio": 1,
      "warmup_epochs": 0,
      "c_mel": 45,
      "c_kl": 1.0
    },
    "data": {
      "training_files":"filelists/train_voice_1_filelist_v4.txt",
      "validation_files":"filelists/val_voice_1_filelist_v4.txt",
      "text_cleaners":["basic_cleaners"],
      "max_wav_value": 32768.0,
      "sampling_rate": 22050,
      "filter_length": 1024,
      "hop_length": 256,
      "win_length": 1024,
      "n_mel_channels": 80,
      "mel_fmin": 0.0,
      "mel_fmax": null,
      "add_blank": false,
      "n_speakers": 0,
      "cleaned_text": true,
      "use_mel_spec_posterior": false
    },
    "model": {
      "use_mel_posterior_encoder": false,
      "use_transformer_flows": true,
      "transformer_flow_type": "pre_conv",
      "use_spk_conditioned_encoder": false,
      "use_noise_scaled_mas": true,
      "inter_channels": 192,
      "hidden_channels": 192,
      "filter_channels": 768,
      "n_heads": 2,
      "n_layers": 6,
      "kernel_size": 3,
      "p_dropout": 0.1,
      "resblock": "1",
      "resblock_kernel_sizes": [3,7,11],
      "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
      "upsample_rates": [8,8,2,2],
      "upsample_initial_channel": 512,
      "upsample_kernel_sizes": [16,16,4,4],
      "n_layers_q": 3,
      "use_spectral_norm": false
    },
    "max_text_len": 500
  }

@beqabeqa473

Hi all.

By the way, can I turn off the validation and test lists? I would be able to validate the model myself. My dataset is not large enough to spare sentences for validation and test sets.

@p0p4k
Owner

p0p4k commented Aug 31, 2023

Just pass the same list to both train and validation. For the validation loader, use torch.utils.data.Subset and pass only 4-5 samples so that you still get evaluation output while training. If you want to turn off evaluation completely, just comment out the evaluate() call in train.py.
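
A minimal sketch of that (class and loader names assumed to follow the original VITS layout; adjust if they differ in this repo):

from torch.utils.data import DataLoader, Subset
from data_utils import TextAudioLoader, TextAudioCollate

# Point validation_files at the same filelist as training, then evaluate on
# only a handful of samples via Subset.
eval_dataset = TextAudioLoader(hps.data.validation_files, hps.data)
eval_subset = Subset(eval_dataset, list(range(4)))  # first 4 samples only
eval_loader = DataLoader(
    eval_subset,
    num_workers=0,
    shuffle=False,
    batch_size=hps.train.batch_size,
    pin_memory=True,
    drop_last=False,
    collate_fn=TextAudioCollate(),
)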

@beqabeqa473

beqabeqa473 commented Aug 31, 2023 via email

@p0p4k
Owner

p0p4k commented Aug 31, 2023

The test split is the evaluation split in this repo.

@beqabeqa473

beqabeqa473 commented Aug 31, 2023 via email

@p0p4k
Owner

p0p4k commented Aug 31, 2023

Yes, we just pass two lists in the config: train and val.

@Subarasheese
Contributor Author

Any clue what the problem might be?

What do you suggest I do with the dataset?

@p0p4k
Owner

p0p4k commented Aug 31, 2023

Print out the "len_bucket" in data_utils and try to debug from there.
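Roughly, the print goes inside DistributedBucketSampler.__iter__ in data_utils.py, just above the line from the traceback (the loop below is paraphrased from the original VITS sampler and may not match this repo line for line):

for i in range(len(self.buckets)):
    bucket = self.buckets[i]
    len_bucket = len(bucket)
    print(f"bucket {i}: len_bucket = {len_bucket}")  # temporary debug print
    # the traceback line then divides by len_bucket:
    # ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]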

@Subarasheese
Contributor Author

Subarasheese commented Aug 31, 2023

Print out the "len_bucket" in data_utils and try to debug from there.

These were the outputs:

buckets from line 371:
0
8
34

[screenshot of the printed bucket lengths]

So the first bucket has length 0, the second has length 8, and the last has length 34.

Is there anything wrong with that?

@p0p4k
Owner

p0p4k commented Aug 31, 2023

So the bucket with length 0 is causing the issue: you cannot divide by 0. You can just disable this function in the dataloader entirely, as a temporary workaround.
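
One way to do that temporarily (a sketch, assuming train.py builds train_loader with batch_sampler=train_sampler as in the original VITS; names may differ here) is to drop the bucket sampler and let a plain DataLoader batch and shuffle:

from torch.utils.data import DataLoader

# Instead of batch_sampler=train_sampler, let the DataLoader handle batching.
train_loader = DataLoader(
    train_dataset,
    num_workers=8,
    shuffle=True,
    pin_memory=True,
    collate_fn=collate_fn,
    batch_size=hps.train.batch_size,
    drop_last=True,
)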

@Subarasheese
Contributor Author

@p0p4k I managed to get the training working by making a few changes, including:

1 - Skipping the length-0 bucket in that for-loop to avoid the exception (a sketch follows this list)
2 - Editing the symbols file (non-English language)
3 - Fixing a bug in the mel_processing file; I needed to replace the mel assignment with this:
mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
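
For reference, a minimal sketch of the first workaround, inside DistributedBucketSampler.__iter__ in data_utils.py (the surrounding loop is paraphrased from the original VITS sampler and may differ slightly in this repo):

for i in range(len(self.buckets)):
    bucket = self.buckets[i]
    len_bucket = len(bucket)
    if len_bucket == 0:
        continue  # skip empty buckets so rem // len_bucket cannot divide by zero
    ids_bucket = indices[i]
    num_samples_bucket = self.num_samples_per_bucket[i]
    rem = num_samples_bucket - len_bucket
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]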

And now it has started training, but the bucket issue is strange and I believe it was not supposed to happen. Whether it will compromise the model remains to be seen.

But I am going ahead and closing the issue. Please, if you can, look into what could cause this, or at least improve the logging to make the problem clearer.
