Multiple bug fixes and add on_train_epoch_start callback #129
What does it do?

- Fixes the `KeyError: 'avg_loss_1'` error raised when `start_with_eval=True` and `target_loss` is set. The error occurs because the training `keep_avg_target.avg_values` is an empty dictionary at that point. It is related to "[Bug] KeyError: 'avg_loss_1' crash when training model" (TTS#2862). The same error also occurs during training when a checkpoint is saved before `self.keep_avg_train` or `self.keep_avg_eval` has been updated. To solve it, this PR also makes `_pick_target_avg_loss` safe: if `keep_avg_target.avg_values` is empty, it returns `None`. This also prevents issues like "multiband_melgan Vocoder Fails on Step 10000 With KeyError: 'avg_loss_0'" (TTS#1608).
- `optimize` method: this keeps users from training a model with our implementation, which leaves dangling gradients in a multiple-optimizer setup with gradient accumulation. (I have already done this accidentally, and it is really bad because training time can be lost.)
- Adds the `on_train_epoch_start` and `on_train_epoch_end` callbacks. Currently, the only way to put modules in eval mode during training is via the `on_train_step_start` callback, which is called on every train step and is really slow. With the new callback this can be done once per epoch, which should decrease the step time for XTTS GPT and XTTS decoder training.
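The empty-dictionary guard described for `_pick_target_avg_loss` can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `KeepAverage` class below is a hypothetical stand-in that only exposes an `avg_values` dict, and the standalone function `pick_target_avg_loss` and its signature are assumptions made for the example.

```python
from typing import Dict, Optional


class KeepAverage:
    """Hypothetical stand-in: only holds a dict of running averages."""

    def __init__(self) -> None:
        self.avg_values: Dict[str, float] = {}


def pick_target_avg_loss(
    keep_avg_target: Optional[KeepAverage], target_loss: str
) -> Optional[float]:
    """Return the tracked average for `target_loss`, or None if it is not available."""
    # Guard: nothing has been averaged yet, for example when evaluation
    # runs (or a checkpoint is saved) before any average was updated.
    if keep_avg_target is None or not keep_avg_target.avg_values:
        return None
    # dict.get returns None instead of raising KeyError for a missing loss.
    return keep_avg_target.avg_values.get(f"avg_{target_loss}")
```

With this shape, a caller can treat `None` as "no loss to compare yet" and skip the best-model comparison instead of crashing.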
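To illustrate why a per-epoch callback is cheaper than a per-step one, here is a toy sketch in plain Python. Only the callback names `on_train_epoch_start`, `on_train_epoch_end`, and `on_train_step_start` come from this PR; the training loop, the hook signatures (no arguments), and the `FrozenModule`, `PerStepModel`, and `PerEpochModel` classes are hypothetical and exist only to count how often a module is switched to eval mode.

```python
class FrozenModule:
    """Toy sub-module that tracks its mode and counts eval() calls."""

    def __init__(self) -> None:
        self.training = True
        self.eval_calls = 0

    def train(self) -> None:
        self.training = True

    def eval(self) -> None:
        self.training = False
        self.eval_calls += 1


class PerStepModel:
    """Switches the frozen module to eval mode on every train step."""

    def __init__(self) -> None:
        self.frozen = FrozenModule()

    def train(self) -> None:
        # Putting the model in train mode also flips the sub-module.
        self.frozen.train()

    def on_train_step_start(self) -> None:
        self.frozen.eval()


class PerEpochModel:
    """Switches the frozen module to eval mode once per epoch."""

    def __init__(self) -> None:
        self.frozen = FrozenModule()

    def train(self) -> None:
        self.frozen.train()

    def on_train_epoch_start(self) -> None:
        self.frozen.eval()


def _call_hook(model, name: str) -> None:
    """Call the hook if the model defines it; otherwise do nothing."""
    hook = getattr(model, name, None)
    if callable(hook):
        hook()


def run_training(model, epochs: int, steps_per_epoch: int) -> int:
    """Hypothetical loop; returns how many times eval() was called."""
    for _ in range(epochs):
        model.train()
        _call_hook(model, "on_train_epoch_start")
        for _ in range(steps_per_epoch):
            _call_hook(model, "on_train_step_start")
            # A real train step would run here.
        _call_hook(model, "on_train_epoch_end")
    return model.frozen.eval_calls
```

In this toy setup the per-step model calls `eval()` once per step, while the per-epoch model calls it once per epoch, and the module stays in eval mode for all steps in between.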