Validation loss #1864

Open · wants to merge 64 commits into base: sd3

Conversation

@rockerBOO commented Jan 3, 2025

Related #1856 #1858 #1165 #914

Original implementation by @rockerBOO
Timestep validation implementation by @gesen2egee
Updated implementation for sd3/flux by @hinablue

I went through and tried to merge the different PRs together. I probably messed up some things in the process.

One thing I want to note is that process_batch was created to limit code duplication between validation and training and keep them consistent. I implemented the timestep processing so it works for both. Note that the other PRs used only debiased_estimation, but I didn't know why it was done that way.

I did not update train_db.py to match my goal of a unified process_batch, as I do not have a good way to test it. I will try to get it into an acceptable state and we can refine it.

I'm posting this a little early so others can review it and give feedback. I am still working on some issues with the code, so let me know before you dive in to fix anything. I'm open to commits to this PR; they can be posted to this branch on my fork.

Testing

  • Test training code is actually training
  • Test validation epoch (Test validation every epoch)
  • Test validate per n steps (After n steps it will run a validation run)
  • Test validate per n epochs (After n epochs will run validation epochs)
  • Test max validation steps
  • Test validation split (The validation split should be split accordingly, 0.2 should produce 20% dataset of the primary dataset)
  • Test validation split from train_network.py arguments (--validation_split) as well as dataset_config.toml (validation_split=0.1)
  • Test validation seed (Seed is used for dataset shuffling only right now)
  • Test image latent caching (validation and training datasets)
  • Test tokenizing strategy (SD, SDXL, SD3, Flux)
  • Test text encoding strategy (SD, SDXL, SD3, Flux)
  • Test --network_train_text_encoder_only
  • Test --network_train_unet_only
  • Test training some text encoders (I think this is a feature?)
  • Test on SD1.5, SDXL, SD3, Flux LoRAs

Parameters

The validation dataset applies to dreambooth datasets (text/image pairs): the dataset is split into two parts, train_dataset and validation_dataset, according to the split ratio.

  • --validation_seed Validation seed for shuffling validation dataset, training --seed used otherwise / 検証データセットをシャッフルするための検証シード、それ以外の場合はトレーニング --seed を使用する
  • --validation_split Split for validation images out of the training dataset / 学習画像から検証画像に分割する割合
  • --validate_every_n_steps Run validation on validation dataset every N steps. By default, validation will only occur every epoch if a validation dataset is available / 検証データセットの検証をNステップごとに実行します。デフォルトでは、検証データセットが利用可能な場合にのみ、検証はエポックごとに実行されます
  • --validate_every_n_epochs Run validation dataset every N epochs. By default, validation will run every epoch if a validation dataset is available / 検証データセットをNエポックごとに実行します。デフォルトでは、検証データセットが利用可能な場合、検証はエポックごとに実行されます
  • --max_validation_steps Max number of validation dataset items processed. By default, validation will run the entire validation dataset / 処理される検証データセット項目の最大数。デフォルトでは、検証は検証データセット全体を実行します

validation_seed and validation_split can also be set inside dataset_config.toml.
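To make the split behavior concrete, here is a rough sketch of how a 0.2 split is intended to behave (a simplified illustration only, not the dataset code in this PR; the function and variable names are placeholders):

import random

def split_train_val(image_paths, validation_split=0.2, validation_seed=47):
    # Shuffle deterministically with the validation seed so the split is reproducible
    paths = list(image_paths)
    random.Random(validation_seed).shuffle(paths)

    # e.g. 100 images with validation_split=0.2 -> 20 validation items, 80 training items
    num_val = int(len(paths) * validation_split)
    return paths[num_val:], paths[:num_val]  # train_items, validation_items

train_items, validation_items = split_train_val([f"img_{i}.png" for i in range(100)])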

I'm open to feedback on this approach, and on anything in the code that needs to be fixed for accuracy.

@gesen2egee commented Jan 3, 2025

Hi, thank you for your work!

The primary reason for using debiased_estimation_loss is to rescale the loss at different timesteps during SD training, allowing high-timestep signals to be better reflected. However, after comparing the results with simple averaging, I found that apart from the magnitude, the linear progression didn't show significant differences. Perhaps fixing the timestep (t=500) could be sufficient. This could save time and allow more samples per run, potentially resulting in smoother outcomes.

The motivation behind designing a separate function for the validation batch is to lock in the hyperparameters, facilitating direct comparisons across different hyperparameter configurations during training.

Additionally, I'm not quite sure about the purpose of modifying train_util's timestep and noise part, as it is a library used by many training scripts. Changing the function names might cause other functionalities to stop working. In fact, I don't really agree with modifying too much shared code, as it raises the barrier to merging with other PRs.

@rockerBOO

For the loss calculations, I think it's worth keeping them consistent with training, even if the loss may be more or less impactful at different timesteps. I feel this could be iterated on later if a separation becomes necessary.

process_batch was meant to unify training and validation so they stay consistent in their processing. This prevents having to keep two places updated for batch processing. See #914 (comment) where this was discussed.
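To illustrate the intent (a minimal toy sketch with a stand-in model, not the actual code in this PR): one function handles a batch for both paths, and only the gradient/backward behavior differs.

import torch
import torch.nn.functional as F

def process_batch(model, batch, is_train: bool) -> torch.Tensor:
    inputs, targets = batch
    # Gradients are only tracked when the batch is used for training
    with torch.set_grad_enabled(is_train):
        loss = F.mse_loss(model(inputs), targets)
    if is_train:
        loss.backward()  # validation skips the backward pass entirely
    return loss.detach()

model = torch.nn.Linear(4, 4)
batch = (torch.randn(2, 4), torch.randn(2, 4))
train_loss = process_batch(model, batch, is_train=True)
val_loss = process_batch(model, batch, is_train=False)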

Regarding the inconsistencies in train_util, the changes involved get_timesteps and get_huber_threshold_if_needed.

get_timesteps did not exist previously, so my similar implementation, get_random_timesteps, was meant to provide that behavior. We can use the original, or rename it to align better with the args that give context to the different timestep behaviors. Also, we convert the timesteps with long(), but the implementations expect an IntTensor; I'm not sure why, other than it was perhaps set that way historically. Probably not a big deal ultimately, but not having to convert might be simpler.

get_huber_threshold_if_needed also didn't exist previously and is similar to the get_huber_c I created. I think we can use the upstream version and remove get_huber_c.

I broke larger functions like get_noise_noisy_latents_and_timesteps into smaller abstractions for their individual components. This lets us use them individually, as I have done in process_batch for timesteps, keeping behavior consistent and limiting similar logic being duplicated and having to be updated in multiple places.

At this time, I want to revert the following code in process_batch, which should make it a little cleaner.

This code extracts a few things, including timesteps, which makes it hard to apply our own timesteps here. I implemented all the components in process_batch, but some compromise in the approach here might be good.

noise_pred, target, timesteps, weighting = self.get_noise_pred_and_target(
    args,
    accelerator,
    noise_scheduler,
    latents,
    batch,
    text_encoder_conds,
    unet,
    network,
    weight_dtype,
    train_unet,
)

This function is used for the various loss adjustment parts:

loss = self.post_process_loss(loss, args, timesteps, noise_scheduler)

In terms of modifying the code, if you could point to any PRs you know of that would conflict with these changes (after reverting/fixing the two functions get_timesteps and get_huber_threshold_if_needed), I can help update or align the changes in this PR. They should mostly be new additions and type signatures on some of the functions.

Let me know what you think of these.

@rockerBOO commented Jan 3, 2025

I'm working through some bugs with the process, but one additional concern is how wandb does its step calculations. If you provide a step that is not sequential (i.e., not after the current recorded step), it will not record that information at all. It shows a warning, but nothing gets logged.

[screenshot: wandb warning when logging a non-sequential step]

For example, if we are stepping through:

accelerator.log({'epoch_loss': 0.1}, step=global_step)

And then we want to set the step for an epoch-level log, it will not be recorded:

accelerator.log({'epoch_loss': 0.1}, step=epoch)

The "fix" is to create wandb metrics. https://docs.wandb.ai/support/log_metrics_two_different_time_scales_example_log_training/ . This would allow the different metrics but for other logging like tensorboard it wouldn't work right. Setting the step= to anything besides a higher value than the last recorded loss will fail.

If we remove the step value, we can be a little more flexible by letting the tracker set a larger step each time it logs. But this makes the graphs pretty unusable...
[screenshot: Weights & Biases run graphs]

Ultimately I am not sure what will cover all the options without testing each tracker to make sure it works, and without updating all the accelerator.log tracking calls to accommodate these cases. Without those updates, wandb won't get validation results, epoch, or validation epoch loss graphs.

I spoke with wandb about this but they do not seem to want to make it flexible to work with accelerate in how we are currently trying to use it.

wandb instructions for accelerate https://docs.wandb.ai/guides/integrations/accelerate/

@rockerBOO

At this point train_network.py should be at parity with the sd3 upstream, with the associated validation/training going through process_batch.

I reverted/refactored the function names in train_util to keep them as they are in the current sd3 upstream. There are added functions that decouple pieces of the larger get_noise_noisy_latents_and_timesteps function.

I set all accelerator.log() calls in train_network to drop the steps, but left the originals there for testing. See the comment above about the accelerator.log issues; these need to be fixed before release.

@gesen2egee

I'm not entirely sure if it's necessary for the time steps of the training and validation sets to be consistent, since they should be independent, right?
On the contrary, I'm more concerned about the loss variations caused by different batch samples being randomly assigned to different time steps, as this can be significant at different SNR levels.
[chart: loss variation at different timesteps]

That's why I previously used fixed, averaged, and debiased methods.

By the way, I tested it yesterday with t=500, and it seems to work?
[chart: results with t=500]

@rockerBOO
Copy link
Contributor Author

rockerBOO commented Jan 4, 2025

The timesteps are random for training.

For validation, I pass a timesteps_list, which makes them fixed.

(These links require expanding train_network.py diff)
https://github.com/kohya-ss/sd-scripts/pull/1864/files#diff-62cf7de156b588b9acd7af26941d7bb189368221946c8b5e63f69df5cda56f39R378-R388

I have them averaging over the timesteps presented: https://github.com/kohya-ss/sd-scripts/pull/1864/files#diff-62cf7de156b588b9acd7af26941d7bb189368221946c8b5e63f69df5cda56f39R457-R459

Regular training produces one random timestep per sample, and the loss is averaged the same way to produce the expected result. I do not think debiased estimation needs to be handled specially for validation; whatever the user is training with will also be used in validation, which should keep them consistent. Since validation runs a range of timesteps, it should average out in line with the chosen post-process loss function. Is that the confusion, or is there something else I'm missing?

The loss variation seems appropriate for the different timesteps? The batch is processed together for each of the fixed timesteps and averaged out, which seems to produce an appropriate result in the testing I have done so far.

What are you referring to as t=500? Fixing the validation timestep to only 500? Also, I think it might be good for the fixed timesteps to be configurable with a default, so the user can either set them explicitly or set the number of timesteps to test and have us distribute them across the whole 1-1000 range.
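As a rough sketch of what I mean by distributing a configurable number of fixed timesteps (the helper name and the 0-999 range are assumptions for illustration, not code from this PR):

import torch

def make_validation_timesteps(num_timesteps: int, max_timestep: int = 1000) -> torch.Tensor:
    # Spread the requested number of fixed timesteps evenly across the schedule
    return torch.linspace(0, max_timestep - 1, num_timesteps).long()

timesteps_list = make_validation_timesteps(4)
print(timesteps_list)  # tensor([  0, 333, 666, 999])

# The whole validation batch would be noised at each fixed timestep in turn, run
# through the same forward pass as training, and the per-timestep losses averaged:
# val_loss = torch.stack([loss_at(t) for t in timesteps_list]).mean()  # loss_at is hypothetical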

Let me know what you think. Thanks for the charts.

@67372a commented Jan 4, 2025

@rockerBOO I pulled in the original PR to my fork (https://github.com/67372a/sd-scripts) a while ago and made some enhancements that might be worth considering:

I also implemented logic to allow explicitly setting subsets as validation via is_val, but that is spread out a bit more, as I have a bad habit of just committing things as I go into the primary branch.

@rockerBOO commented Jan 4, 2025

@67372a Thanks for sharing these.

As an overall note, I think part of the confusion is about what validation is supposed to do, or at least the definition I am using here. Validation tests the difference in loss between the training dataset and a separate validation dataset, in order to highlight overfitting of the training dataset.

Validation is then the exact same process as training, except we do not train on it (no gradients or backward pass). This gives us a clear distinction with minimal differences between the two runs.

I think the confusion is that "validation" sounds like it should make sure everything is correct. There is usually another pass that evaluates results in a very consistent manner and produces a metric that can be compared between runs, like an eval or test run on a dataset separate from training and validation. For generative image AI that is usually FID, CLIP score, or other available options.

Maybe that is the confusion between the goals of validation vs. training and validation vs. eval/test. Validation needs to be consistent with training because it is supposed to highlight overfitting of the training dataset, and it needs to be as consistent as possible to do that. The validation set can also "overfit" eventually, which is why a third dataset can show how much the validation dataset itself has overfit. For many fine-tuning purposes the dataset can be very small, so we want to be flexible about what people decide to do here. Also, eval/test runs can be done with model inference, comparing prompts/samples, without requiring another dataset or splitting the current one.

I have wanted to add an eval/test run using these metrics to be able to concretely compare different runs, which should cover a lot of the suggested behavior.

@rockerBOO commented Jan 4, 2025

> Always start from the same validation seed when a validation sequence runs, preserving rng states before, and reapplying after. The logic to allow randomization of the dataloader seed based on global step is not required, it was implemented to allow a niche use case (caption modification, shuffling, dropout, etc) that I allowed for, but don't recommend.
> https://github.com/67372a/sd-scripts/blob/ab6a33ac6de50efbdcc347644d0ce27f83201e44/train_network.py#L605
> Why: To make val loss more consistent, otherwise RNG can cause it to vary more, AND to avoid impacting non-val rng

The validation seed is currently used for shuffling the dataset. It's a minor change, but it keeps the validation dataset consistent when training runs use different seeds. This is important because the training seed is used for random latent generation and other random factors you may modify for other reasons.

It may be a good idea to apply the validation seed to the dataset-specific random options as you suggested: caption dropout, caption shuffling, and so on.
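A minimal sketch of the RNG-bracketing idea (not the code from either branch; the helper and the validation callable are hypothetical):

import random
import torch

def run_validation_with_seed(validation_seed, validation_fn):
    # Save RNG states so validation doesn't perturb the training run's randomness
    py_state = random.getstate()
    torch_state = torch.get_rng_state()
    cuda_states = torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None

    try:
        random.seed(validation_seed)
        torch.manual_seed(validation_seed)
        return validation_fn()
    finally:
        # Restore everything so training continues as if validation never ran
        random.setstate(py_state)
        torch.set_rng_state(torch_state)
        if cuda_states is not None:
            torch.cuda.set_rng_state_all(cuda_states)

val_loss = run_validation_with_seed(47, lambda: torch.rand(()).item())  # toy validation_fn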

> Always make sure to flip the network to eval before running val loss, and train after. You can do that in the function if you like, I did it outside to cover samples as well.
> https://github.com/67372a/sd-scripts/blob/ab6a33ac6de50efbdcc347644d0ce27f83201e44/train_network.py#L1600
> Why: If the network has dropout and it is set to train, dropout will be applied, which is not desirable.

The model is set not to accumulate gradients, but because I unified training/validation into one consistent process, the approach is a little more nuanced. torch.set_grad_enabled() lets us toggle gradients based on whether we are training or validating, and on whether we are training specific parts of the model.

Dropout is applied in backpropagation, so it wouldn't be a factor for the forward inference done in validation.
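A small sketch of the torch.set_grad_enabled() toggle described above (toy network; the flag names are placeholders for the real training arguments):

import torch

def forward_batch(network, latents, is_train: bool, train_network: bool) -> torch.Tensor:
    # Gradients only when this is a training pass for a part we are actually training
    with torch.set_grad_enabled(is_train and train_network):
        return network(latents)

net = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)
print(forward_batch(net, x, is_train=True, train_network=True).requires_grad)   # True
print(forward_batch(net, x, is_train=False, train_network=True).requires_grad)  # False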

> Set to statically use L2 loss and NOT apply any loss modification during process_val_batch (e.g. min snr gamma, debiased, etc), the exception being that v-pred and ztsnr should still apply if enabled. I also decided not to apply loss masking, but I could see that going either way. I also disable any noise modifications (offset, multires, ip noise gamma, etc) for the val sequence, which I did by going through the function calls and adding a train parameter, which is true by default.
> https://github.com/67372a/sd-scripts/blob/ab6a33ac6de50efbdcc347644d0ce27f83201e44/library/train_util.py#L6259
> Why: To be able to compare val loss across training loss modification settings, different loss types, and noise modification settings, we need val loss to remain unmodified. v-pred and ztsnr are the exception, at least from my testing, as they fundamentally change the training dynamic, so if they aren't set, val loss seems to break.

The loss modifications should remain consistent between training and validation; keeping them consistent is important for highlighting overfitting of the training dataset.

> Use torch inference mode context, and make sure anything called is not later enabling grads (using train parameter as mentioned before).
> https://github.com/67372a/sd-scripts/blob/ab6a33ac6de50efbdcc347644d0ce27f83201e44/train_network.py#L477C9-L477C65
> Why: We don't want to generate gradients or apply autograd, this disables both, and may provide a performance benefit. See https://pytorch.org/docs/stable/notes/autograd.html#locally-disable-grad-doc

As mentioned, in the current implementation the gradients are turned off for validation. Note that train_network.py has my intended implementation; I haven't updated train_db yet, so that we can all get on the same page about the implementation approach first.

@67372a commented Jan 4, 2025

@rockerBOO

Looking at the code in LyCORIS, if the module is in training mode, dropout is applied during the forward pass: https://github.com/KohakuBlueleaf/LyCORIS/blob/main/lycoris/modules/locon.py. nn.Dropout also behaves this way, so any use of it in a forward pass has to be considered: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout.

@rockerBOO

Ah, that seems accurate about dropout being applied in the forward pass; I thought it was backward only.

I still think it's relevant for validation to use the same dropout value, though, because dropout is specifically tied to overfitting, so your validation results would be different if it weren't the same amount as used for the training dataset.

@rockerBOO commented Jan 5, 2025

One thing I'm noting here to resolve is that self.get_noise_pred_and_target() also produces and returns the timesteps:

return noise_pred, target, timesteps, None

If the timesteps were decoupled, we could use this abstraction while creating the timesteps independently where appropriate. This would allow better compatibility with the sd3 and flux scripts, which may have different implementations.
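A hypothetical sketch of the decoupling (names and signatures are illustrative only and do not match the real scripts): timesteps are created by the caller and passed in, so validation can supply a fixed list while training keeps using random ones.

import torch

def get_random_timesteps(batch_size: int, min_timestep: int = 0, max_timestep: int = 1000) -> torch.Tensor:
    # Created outside the prediction call so the caller controls random vs. fixed timesteps
    return torch.randint(min_timestep, max_timestep, (batch_size,), dtype=torch.long)

def get_noise_pred_and_target(model, noisy_latents, timesteps):
    # Receives timesteps instead of generating them internally; in the real scripts
    # the timesteps would condition the UNet call, which this toy model ignores
    noise_pred = model(noisy_latents)      # stand-in for the real UNet call
    target = torch.zeros_like(noise_pred)  # stand-in target
    return noise_pred, target

model = torch.nn.Identity()
latents = torch.randn(2, 4)
noise_pred, target = get_noise_pred_and_target(model, latents, get_random_timesteps(2))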

@rockerBOO

To allow this to be completed and to support SD3 and Flux, I decided to drop fixed_timesteps for validation. It requires a fair amount of refactoring to support properly, but support can be added later. This limits the amount of updated/refactored code so we can get this released and iterate on it afterward.

I also reverted train_db.py mostly to limit the changes in this PR. If the process_batch change is accepted, then each training script will need to be updated, and we can iterate on that.

@rockerBOO

  • Added --validate_every_n_epochs
  • Changed --validation_every_n_step to --validate_every_n_steps

@rockerBOO

The last remaining piece is the accelerator.log() issue, but otherwise this should be in a good state now.
