Multiples bug fixes and add on_train_epoch_start callback #129

Edresson · 2023-11-13T17:54:34Z

What does it do?

Fixes the KeyError: 'avg_loss_1' raised when start_with_eval=True and target_loss is set. The error occurs because the training keep_avg_target.avg_values is still an empty dictionary at that point. It is related to "[Bug] KeyError: 'avg_loss_1' crash when training model" (TTS#2862). The same error also occurs during training when a checkpoint is saved before self.keep_avg_train or self.keep_avg_eval has been updated. To fix it, this PR also makes _pick_target_avg_loss safe: if keep_avg_target.avg_values is empty, it returns None instead of raising. This also prevents issues like "multiband_melgan Vocoder Fails on Step 10000 With KeyError: 'avg_loss_0'" (TTS#1608).
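The guard described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `pick_target_avg_loss` is a hypothetical free function standing in for `_pick_target_avg_loss`, and it takes the `avg_values` dictionary and the `target_loss` name directly as arguments.

```python
from typing import Dict, Optional


def pick_target_avg_loss(avg_values: Dict[str, float], target_loss: str) -> Optional[float]:
    """Return the running average for `target_loss`, or None if it is unavailable.

    Hypothetical stand-in for the trainer's `_pick_target_avg_loss`.
    """
    # Nothing has been averaged yet, e.g. a checkpoint is saved before the
    # first update or evaluation runs before any training step.
    if not avg_values:
        return None
    # dict.get returns None for a missing key instead of raising KeyError.
    return avg_values.get(f"avg_{target_loss}")


empty_result = pick_target_avg_loss({}, "loss_1")
filled_result = pick_target_avg_loss({"avg_loss_1": 0.25}, "loss_1")
```

With this shape, callers need to treat a `None` result as "no target loss available yet" rather than as a numeric value.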
It also raises an error when a multiple-optimizer setup is combined with gradient accumulation and the model does not define a custom optimize method. This prevents users from training with our default implementation, which leaves dangling gradients in a multiple-optimizer setup with gradient accumulation (I did this accidentally once; it is really bad because the training time is lost).
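A check of this kind might look like the sketch below. The function name `check_optimizer_setup`, its parameters, and the error message are assumptions made for illustration; the actual check in the PR may be structured differently.

```python
from typing import Any


def check_optimizer_setup(optimizer: Any, grad_accum_steps: int, model: Any) -> None:
    """Fail fast on a configuration that would leave dangling gradients.

    Hypothetical helper: `optimizer` is either a single optimizer or a
    list/tuple of optimizers, and a custom optimize method is detected as a
    callable `optimize` attribute on the model.
    """
    multiple_optimizers = isinstance(optimizer, (list, tuple)) and len(optimizer) > 1
    has_custom_optimize = callable(getattr(model, "optimize", None))
    if multiple_optimizers and grad_accum_steps > 1 and not has_custom_optimize:
        raise ValueError(
            "Multiple optimizers with gradient accumulation require a custom "
            "optimize method on the model."
        )
```

Raising at setup time means the problem surfaces before any training time is spent.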
It adds on_train_epoch_start and on_train_epoch_end callbacks. Currently, the only way to put modules in eval mode during training is the on_train_step_start callback, which is called on every train_step and is really slow. With the new callback, this can be done only once per epoch. It should decrease the step time for XTTS GPT and XTTS decoder training.
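The difference in call frequency can be illustrated with a toy loop. Everything here is a stand-in: `FrozenModule`, `ToyModel`, and `run_training` are hypothetical, and the callback signature `on_train_epoch_start(self, trainer)` is an assumption about how the trainer invokes it.

```python
class FrozenModule:
    """Stand-in for a submodule that must stay in eval mode during training."""

    def __init__(self) -> None:
        self.training = True
        self.eval_calls = 0

    def train(self) -> None:
        self.training = True

    def eval(self) -> None:
        self.training = False
        self.eval_calls += 1


class ToyModel:
    def __init__(self) -> None:
        self.frozen = FrozenModule()

    def on_train_epoch_start(self, trainer) -> None:
        # Runs once per epoch instead of once per training step.
        self.frozen.eval()


def run_training(model: ToyModel, epochs: int, steps_per_epoch: int) -> None:
    for _ in range(epochs):
        # Mimics the trainer switching the whole model to train mode.
        model.frozen.train()
        if hasattr(model, "on_train_epoch_start"):
            model.on_train_epoch_start(trainer=None)
        for _ in range(steps_per_epoch):
            pass  # a real train step would run here with `frozen` in eval mode


model = ToyModel()
run_training(model, epochs=3, steps_per_epoch=100)
```

In this toy run the mode switch happens 3 times, where a per-step callback would perform it 300 times.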


Edresson added 2 commits November 13, 2023 14:45

Fix key error on target loss when start_with_eval=True

cc05f6f

Disable start_with_eval when run_eval is False

ebd9173

Edresson requested a review from erogol November 13, 2023 17:56

Edresson added 3 commits November 13, 2023 15:01

Make style

8eddf02

Raise an error if multiple-optimizer setup with grad accumulation and…

787fdef

… without custom optimize method

Add on_train_epoch_start and on_train_epoch_end callbacks

d12b503

Edresson changed the title ~~Fix key error on target loss when start_with_eval=True~~ Multiples bug fixes and add on_train_epoch_start callback Nov 13, 2023

Fix worflows

5b3cb63

erogol merged commit 385cced into main Nov 16, 2023
8 checks passed

erogol deleted the fix_eval branch November 16, 2023 10:03
