Make timesteps work in the standard way when Huber loss is used #1628
Conversation
```diff
 def conditional_loss(
-    model_pred: torch.Tensor, target: torch.Tensor, reduction: str = "mean", loss_type: str = "l2", huber_c: float = 0.1
+    model_pred: torch.Tensor, target: torch.Tensor, reduction: str, loss_type: str, huber_c: Optional[torch.Tensor]
```
Callers I could find are explicitly passing all arguments, so it should be safe to remove default values
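For illustration, here is a minimal sketch of the new signature in use, with every argument passed explicitly and huber_c supplied as a batch-sized tensor. The body shown is an assumption (the pseudo-Huber form discussed later in this thread), not the repository's exact implementation:

```python
from typing import Optional

import torch
import torch.nn.functional as F

def conditional_loss(
    model_pred: torch.Tensor, target: torch.Tensor, reduction: str, loss_type: str, huber_c: Optional[torch.Tensor]
) -> torch.Tensor:
    if loss_type == "l2":
        return F.mse_loss(model_pred, target, reduction=reduction)
    if loss_type == "huber":
        # reshape (B,) -> (B, 1, 1, 1) so each sample's huber_c broadcasts over its latent dims
        c = huber_c.view(-1, *([1] * (model_pred.dim() - 1)))
        loss = 2 * c * (torch.sqrt((model_pred - target) ** 2 + c**2) - c)  # pseudo-Huber form (assumed)
        return loss.mean() if reduction == "mean" else loss.sum()
    raise NotImplementedError(loss_type)

# Example call: all arguments explicit, huber_c has shape (batch_size,)
pred, tgt = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
print(conditional_loss(pred, tgt, reduction="mean", loss_type="huber", huber_c=torch.full((2,), 0.1)))
```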
I think we can make some additional changes to more easily integrate Huber loss into the upcoming SD3/Flux work.
@kohya-ss Should this be merged as-is, or can I implement the above changes?
After spending an afternoon going through a rabbit hole, I concluded that the current Huber loss code is simply incorrect.

The textbook definition of Huber loss (from Wikipedia) is a piecewise-defined function: L_delta(a) = 0.5 * a^2 if |a| <= delta, and delta * (|a| - 0.5 * delta) otherwise.

What we have in the code is the smooth form 2 * delta * (sqrt(a^2 + delta^2) - delta), which is not the same and appears to be an approximation, at best. The code seems to be identical to the pseudo-Huber formulation from the original implementation. Looking at the PyTorch internal implementation (torch.nn.functional.huber_loss), it matches the textbook definition.

I've been getting some inconsistent results in Flux using Huber loss when following the recommendations in the original paper, and I suspect this might be one of the causes. Smooth L1 loss seems to have the same problem outlined above. I am currently testing changes to use the same logic as PyTorch currently does.
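For reference, a small sketch contrasting the two formulations (function names and the delta argument are illustrative, not from the repository); the piecewise version should agree with PyTorch's built-in F.huber_loss:

```python
import torch
import torch.nn.functional as F

def pseudo_huber(pred: torch.Tensor, target: torch.Tensor, delta: float) -> torch.Tensor:
    # smooth approximation: 2 * delta * (sqrt(a^2 + delta^2) - delta)
    a = pred - target
    return 2 * delta * (torch.sqrt(a ** 2 + delta ** 2) - delta)

def piecewise_huber(pred: torch.Tensor, target: torch.Tensor, delta: float) -> torch.Tensor:
    # textbook definition: 0.5 * a^2 for |a| <= delta, delta * (|a| - 0.5 * delta) otherwise
    a = (pred - target).abs()
    return torch.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

pred, target = torch.randn(4, 8), torch.randn(4, 8)
print(pseudo_huber(pred, target, 0.1).mean())
print(piecewise_huber(pred, target, 0.1).mean())
print(F.huber_loss(pred, target, delta=0.1))  # matches the piecewise version
```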
After some additional measurements I think the current formulation and defaults for the Huber schedule are not suitable for use in Flux. The primary reason is that the mean latent loss term (and its variance) is much higher than what we find in prior models (like SDXL). A default like huber_c = 0.1 therefore leaves most of the error in the linear region. If I understand the math correctly, given that MAE grows linearly, this means we'll get smaller gradient magnitudes, on average, when compared to MSE (which has quadratic behavior). As a result the learning gets severely dampened. I've done a few test runs with different huber_c values.

When it comes to the exponential schedule we have an additional problem. The current formulation (per the paper) is huber_c(t) = exp(-alpha * t), where alpha is chosen so that the schedule decays from 1 at t = 0 down to the configured huber_c at the final timestep. We need a way to control the upper bound of this function, probably with an additional scale parameter.
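As a sketch of what that could look like (the scale argument below is hypothetical, not an existing option, and the alpha computation reflects my reading of the current exponential schedule):

```python
import math
import torch

def exponential_huber_c(timesteps: torch.Tensor, num_train_timesteps: int,
                        huber_c: float = 0.1, scale: float = 1.0) -> torch.Tensor:
    # alpha is chosen so the unscaled schedule decays from 1 at t=0 to huber_c at t=T;
    # the hypothetical `scale` knob then controls the upper bound at t=0.
    alpha = -math.log(huber_c) / num_train_timesteps
    return scale * torch.exp(-alpha * timesteps.float())

t = torch.randint(0, 1000, (8,))  # per-sample timesteps
print(exponential_huber_c(t, 1000, huber_c=0.1, scale=0.25))  # shape (8,), bounded above by 0.25
```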
Thank you for this PR.
I think this PR can be merged as is.
I don't fully understand the mathematical background, but kabachuha, the author of #1228, said "Pseudo-Huber Loss" in huggingface/diffusers#7527, so I think this difference is intentional. Are the results significantly different, and can we be sure that the Wikipedia definition is better?
In the last few days I've been testing both approaches, and both seem to work well enough. The problems I was facing (in Flux testing) seem to be related to the current Huber schedules and default values, which are inadequate for the new model. I have another set of proposed changes, but I'll leave that for a later PR.
Thank you! I will merge this PR as is.
Tested with SDXL LoRA.
Currently, when Huber loss is selected, the sampled timestep is the same for the entire batch. This PR changes the behavior to match what is used for L2 loss, where each sample in the batch gets its own timestep.
Instead of passing a scalar huber_c around, turn it into a batch-sized Tensor object.
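As an illustration of the change (a sketch, not the repository's exact training code):

```python
import torch

batch_size, num_train_timesteps = 4, 1000

# Previous behavior when Huber loss was selected: one timestep shared by the whole batch,
# with huber_c as a single Python float.
shared_t = torch.randint(0, num_train_timesteps, (1,)).repeat(batch_size)

# Behavior after this PR (matching the L2 path): an independent timestep per sample,
# with huber_c carried as a tensor of shape (batch_size,).
per_sample_t = torch.randint(0, num_train_timesteps, (batch_size,))
per_sample_huber_c = torch.full((batch_size,), 0.1)  # placeholder values; any per-timestep schedule fits here

print(shared_t)                  # e.g. tensor([712, 712, 712, 712])
print(per_sample_t)              # e.g. tensor([ 33, 907, 254, 618])
print(per_sample_huber_c.shape)  # torch.Size([4])
```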