Validation loss #1864
base: sd3
Conversation
We only want to enable grad if we are training.
… to calculate validation loss. Balances the influence of different time steps on training performance (without affecting actual training results)
Hi, thank you for your work! The primary reason for using debiased_estimation_loss is to rescale the loss at different timesteps during SD training, allowing high-timestep signals to be reflected better. However, after comparing the results with simple averaging, I found that apart from the magnitude, the linear progression didn't show significant differences. Perhaps fixing the timestep (t=500) could be sufficient; this would save time and allow more samples per run, potentially resulting in smoother outcomes. The motivation behind designing a separate function for the validation batch is to lock in the hyperparameters, facilitating direct comparisons across different hyperparameter configurations during training. Additionally, I'm not quite sure about the purpose of modifying train_util's timestep and noise parts, as it is a library used by many training scripts; changing the function names might cause other functionality to stop working. In fact, I don't really agree with modifying too much shared code, as it raises the barrier to merging with other PRs.
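As a rough illustration of the rescaling being discussed: debiased estimation weights the per-sample loss by roughly 1/sqrt(SNR(t)) so low-SNR (high-timestep) samples are not drowned out. The sketch below shows that idea; the helper name, the SNR clamp, and the SNR schedule are illustrative assumptions, not sd-scripts' actual implementation.

```python
import torch

def debiased_estimation_weighting(loss_per_sample, timesteps, all_snr, max_snr=1000.0):
    # Divide each sample's loss by sqrt(SNR(t)); clamp avoids huge SNR near t=0.
    snr_t = all_snr[timesteps].clamp(max=max_snr)
    return loss_per_sample / torch.sqrt(snr_t)

# Toy usage with a made-up, monotonically decreasing SNR schedule.
all_snr = torch.linspace(1000.0, 0.01, steps=1000)
timesteps = torch.tensor([10, 500, 900])
loss = torch.tensor([0.10, 0.10, 0.10])
print(debiased_estimation_weighting(loss, timesteps, all_snr))
```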
For the loss calculations I think keeping them consistent with the training makes sense, even if they may be more or less impactful at different timesteps. I feel this could be iterated on later if a separation turns out to be necessary.

process_batch was made to unify training and validation so they stay consistent in their process. This prevents having to keep two places updated for batch processing. Note #914 (comment) where this was discussed.

For the inconsistencies in train_util, there was the renaming of some functions; I added more abstractions of the different features of the larger functions. At this time there is code I want to revert: it extracts a few things, including timesteps, which makes it hard to apply our own timesteps here. I implemented the components around the following call:

noise_pred, target, timesteps, weighting = self.get_noise_pred_and_target(
    args,
    accelerator,
    noise_scheduler,
    latents,
    batch,
    text_encoder_conds,
    unet,
    network,
    weight_dtype,
    train_unet,
)

and utilize this function for the various loss adjustment parts:

loss = self.post_process_loss(loss, args, timesteps, noise_scheduler)

In terms of modifying the code, if you could reference any PRs you know of that would conflict with these changes (after reverting/fixing the two functions), that would help. Let me know what you think of these.
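To make the intended flow concrete, here is a minimal sketch of how a single shared batch path could use those two hooks. Only get_noise_pred_and_target and post_process_loss come from the snippets above; the function shape, the is_train flag, and the loss reduction are illustrative assumptions, not the PR's exact code.

```python
import torch
import torch.nn.functional as F

def process_batch_sketch(trainer, args, accelerator, noise_scheduler, latents, batch,
                         text_encoder_conds, unet, network, weight_dtype,
                         is_train: bool, train_unet: bool) -> torch.Tensor:
    # Gradients are only tracked when this batch is actually used for training.
    with torch.set_grad_enabled(is_train and train_unet):
        noise_pred, target, timesteps, weighting = trainer.get_noise_pred_and_target(
            args, accelerator, noise_scheduler, latents, batch,
            text_encoder_conds, unet, network, weight_dtype, train_unet,
        )

    # Identical loss shaping for training and validation keeps the curves comparable.
    loss = F.mse_loss(noise_pred.float(), target.float(), reduction="none")
    if weighting is not None:
        loss = loss * weighting
    loss = loss.mean(dim=[1, 2, 3])  # per-sample loss over the latent dimensions
    loss = trainer.post_process_loss(loss, args, timesteps, noise_scheduler)
    return loss.mean()
```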
I'm working through some bugs with the process, but one additional concern is how wandb does its step calculations. If you provide a step that is not sequential (after the current stated step), it will not record that information at all. It states a warning but nothing gets logged. For example, if we are stepping through the training steps and then want to set the step for a log within an epoch, it will not record this.

The "fix" is to create wandb metrics: https://docs.wandb.ai/support/log_metrics_two_different_time_scales_example_log_training/ . This would allow the different metrics, but for other logging like tensorboard it wouldn't work right. Setting step= to anything that is not higher than the last recorded step will fail. If we remove the step value we can be a little more flexible, since wandb will set a larger step each time it logs, but this makes the graphs pretty non-usable.

Ultimately I am not sure what will solve all the options without testing each tracker to make sure it is working, as well as updating all the tracking. I spoke with wandb about this, but they do not seem to want to make it flexible enough to work with accelerate in the way we are currently trying to use it. wandb instructions for accelerate: https://docs.wandb.ai/guides/integrations/accelerate/
At this point train_network.py should be in parity with the sd3 upstream, with the associated validation/training going through process_batch. I reverted/refactored the names of functions in train_util to keep them as they were in the current sd3 upstream. There are added functions that were decoupled from the larger noise_noise_latents_timesteps function. I set all accelerator.log() calls in train_network to drop the steps but left the originals there for testing. See the comment above about the accelerator.log issues; these need to be fixed before release.
The timesteps are random for training. For validation I pass a fixed set of timesteps and average over the timesteps presented (these links require expanding the train_network.py diff): https://github.com/kohya-ss/sd-scripts/pull/1864/files#diff-62cf7de156b588b9acd7af26941d7bb189368221946c8b5e63f69df5cda56f39R457-R459

Regular training will produce one random timestep, which is averaged the same way to produce the expected result. I do not think debiased estimation needs to be specific to validation; whatever the user is training with will be used in validation, which should keep them consistent. Since validation runs over a range of timesteps, it should average out in line with the chosen post-process loss function. Is that the confusion, or is there something else I'm missing? The loss variations seem to be appropriate for different timesteps; the batch should be processed together for each of the fixed timesteps and averaged out. It seems to produce an appropriate result in the testing I have done so far.

What are you referring to as t=500? Fixing the validation timesteps to only 500? Also, I think it might be good for the fixed timesteps to be configurable with a default, so the user can set them, or set the number of timesteps to test and we distribute them across the whole 1-1000 range. Let me know what you think. Thanks for the charts.
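As a sketch of the configurable, evenly distributed fixed timesteps described above (the names are illustrative, and loss_at_timestep stands in for processing the validation batch at one timestep):

```python
import torch

def validation_timesteps(num: int = 4, max_timestep: int = 1000) -> torch.Tensor:
    # Evenly spaced, deterministic timesteps across the scheduler range,
    # e.g. num=4 -> tensor([  0, 333, 666, 999])
    return torch.linspace(0, max_timestep - 1, steps=num).long()

def averaged_validation_loss(loss_at_timestep, timesteps: torch.Tensor) -> torch.Tensor:
    # Process the same validation batch at each fixed timestep and average,
    # mirroring how random training timesteps average out over many steps.
    return torch.stack([loss_at_timestep(int(t)) for t in timesteps]).mean()
```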
@rockerBOO I pulled in the original PR to my fork (https://github.com/67372a/sd-scripts) a while ago and made some enhancements that might be worth considering:
I also implemented logic to allow explicitly setting subsets as validation via is_val, but that is spread out a bit more, as I have a bad habit of just committing things as I go into the primary branch.
@67372a Thanks for sharing these. As an overall note, I think part of the confusion is what validation is supposed to do, or at least the definition I am using here. Validation tests the difference in loss between the training dataset and a separate validation dataset, in order to highlight overfitting of the training dataset. Validation is the exact same process as training except that we do not train on it (no gradients or backward pass), which gives us a clear comparison with minimal differences between the two.

I think the confusion is that "validation" sounds like it should make sure everything is correct. There is usually another pass that evaluates the results in a very consistent manner and produces a metric that can be compared between runs: an eval or test run on a dataset separate from both training and validation. For generative image AI that is usually FID, CLIP score, or some other available option. Maybe that is the confusion between the goals of training vs validation and validation vs eval/test.

That is why validation needs to be consistent with the training: it is supposed to highlight overfitting of the training dataset, so it needs to follow the training process as closely as possible. The validation set can also "overfit" eventually, which is why a third dataset can show how much the validation dataset itself has been overfit to. For many fine-tuning purposes the dataset can be very small, so we want to be flexible with what people decide to do here. Eval/test runs can also be done using model inference and comparing prompts/samples, without requiring another dataset or splitting the current one. I have wanted to add an eval/test run using these metrics to be able to concretely compare different runs, which should include a lot of the suggested behavior.
The validation seed has been used for shuffling the dataset. It's a minor change, but it keeps the validation dataset consistent if your training runs use different seeds. This is important because the training seed is also used in random latent generation and other random factors which you may modify for other reasons. It may be a good idea to apply the validation seed to the dataset-specific random options as you suggested: caption dropout, caption shuffling, and so on.
The model is set to not have a gradient, but because I unified training and validation into a single, consistent process, it is a little more nuanced in how it is approached. torch.set_grad_enabled() allows us to toggle gradients based on whether we are training or validating, and on whether we are training specific parts of the model. Dropout is applied in the back propagation, so it wouldn't be a factor for the forward inference done in validation.
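A toy demonstration of that toggle (the flag names are illustrative, not the PR's code): torch.set_grad_enabled() accepts a bool, so one shared batch path can switch gradient tracking per pass.

```python
import torch

is_train = False   # validation pass
train_unet = True  # parts of the model that would receive gradients when training

with torch.set_grad_enabled(is_train and train_unet):
    x = torch.randn(2, 4, requires_grad=True)
    y = (x * 2).sum()

print(y.requires_grad)  # False on the validation pass; True when both flags are True
```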
The loss modifications are kept consistent between training and validation for comparison, and remaining consistent is very important for highlighting overfitting of the training dataset.
As mentioned, in the current implementation the gradients are turned off for validation. Note that train_network.py has my intended implementation; I haven't updated train_db yet, to make sure we are all on the same page about the implementation process first.
Looking at the code in LyCORIS, if the module is in training mode, dropout is applied during the forward pass: https://github.com/KohakuBlueleaf/LyCORIS/blob/main/lycoris/modules/locon.py. nn.Dropout also behaves this way, so any reference to it in a forward pass has to be considered: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout
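A quick demonstration of that nn.Dropout behavior:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: dropout is applied in the forward pass
print(drop(x))  # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()     # eval mode: the forward pass is an identity op
print(drop(x))  # all ones
```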
Ahh, that seems accurate about dropout being applied in the forward pass as well; I thought it was backward only. I still think it's relevant to keep it at the same value though, because it is specifically aligned with overfitting: your validation results would be different if the dropout weren't the same amount as used on the training dataset.
One thing I'm noting here to resolve: if the timesteps are decoupled, this would allow us to utilize this abstraction while independently creating the timesteps where appropriate. This would allow higher compatibility with the sd3 and flux scripts, which may have different implementations.
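A minimal sketch of what that decoupling could look like (the helper name and signature are hypothetical): training samples timesteps randomly, while validation or an sd3/flux script can inject its own.

```python
from typing import Optional

import torch

def resolve_timesteps(batch_size: int, num_train_timesteps: int, device: torch.device,
                      fixed_timesteps: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Callers that need specific timesteps (validation, schedulers with their own
    # sampling) pass them in; otherwise fall back to uniform random sampling.
    if fixed_timesteps is not None:
        return fixed_timesteps.to(device)
    return torch.randint(0, num_train_timesteps, (batch_size,), device=device)
```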
To allow this to be completed and to support SD3 and Flux, I decided to drop fixed_timesteps for validation. It requires a bunch of refactoring to support properly, but that can be added afterwards. This limits the updated/refactored code so we can get this released and iterate on it after. I also mostly reverted train_db.py to limit the changes in this PR. If the process_batch change is accepted, then each training script will need to be updated, and we can iterate on that.
Added
Last bit is the
Related #1856 #1858 #1165 #914
Original implementation by @rockerBOO
Timestep validation implementation by @gesen2egee
Updated implementation for sd3/flux by @hinablue
I went through and tried to merge the different PRs together. I probably messed up some things in the process.
One thing I wanted to note is that process_batch was made to limit duplication of the code between validation and training, to keep them consistent. I implemented the timestep processing so it could work for both. I noted that other PRs were using only debiased_estimation, but I didn't know why it was like that.

I did not update train_db.py appropriately to my goal of a unified process_batch, as I do not have a good way to test it. I will try to get it into an acceptable state and we can refine it.

I'm posting this a little early so others can view and give me feedback. I am still working on some issues with the code, so let me know before you dive in to fix anything. I'm open to commits to this PR; they can be posted to this branch on my fork.
Testing
--network_train_text_encoder_only
--network_train_unet_only
Parameters
Validation dataset is for dreambooth datasets (text/image pairs) and will split the dataset into 2 parts, train_dataset and validation_dataset, depending on the split.

--validation_seed
Validation seed for shuffling validation dataset, training --seed used otherwise / 検証データセットをシャッフルするための検証シード、それ以外の場合はトレーニング --seed を使用する

--validation_split
Split for validation images out of the training dataset / 学習画像から検証画像に分割する割合

--validate_every_n_steps
Run validation on validation dataset every N steps. By default, validation will only occur every epoch if a validation dataset is available / 検証データセットの検証をNステップごとに実行します。デフォルトでは、検証データセットが利用可能な場合にのみ、検証はエポックごとに実行されます

--validate_every_n_epochs
Run validation dataset every N epochs. By default, validation will run every epoch if a validation dataset is available / 検証データセットをNエポックごとに実行します。デフォルトでは、検証データセットが利用可能な場合、検証はエポックごとに実行されます

--max_validation_steps
Max number of validation dataset items processed. By default, validation will run the entire validation dataset / 処理される検証データセット項目の最大数。デフォルトでは、検証は検証データセット全体を実行します

validation_seed and validation_split can be set inside the dataset_config.toml.

I'm open to feedback about this approach and if anything needs to be fixed in the code to be accurate.
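For concreteness, an illustrative sketch (not the PR's actual implementation) of the split behavior described above: shuffle deterministically with validation_seed, then carve off a validation_split fraction of the images as the validation subset.

```python
import random

def split_train_validation(image_keys, validation_split: float, validation_seed: int):
    keys = list(image_keys)
    random.Random(validation_seed).shuffle(keys)  # deterministic shuffle, independent of --seed
    n_val = int(len(keys) * validation_split)
    if n_val == 0:
        return keys, []
    return keys[:-n_val], keys[-n_val:]

train_keys, val_keys = split_train_validation([f"img_{i}.png" for i in range(10)], 0.2, validation_seed=42)
print(len(train_keys), len(val_keys))  # 8 2
```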