how to properly skip samples that cause inf/nan gradients/loss #4956
Comments
@levhaikin Thanks for the proposal. I think it should be fine to use. However, I am not sure we want to have this as an option within Lightning.
In the case of invalid losses, you can return
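The value to return is cut off in the text above. Lightning's `training_step` documentation describes returning `None` as a way to skip to the next batch under automatic optimization, so the sketch below assumes that is what was meant. `SkipInvalidLoss` and `compute_loss` are hypothetical names, and the class is a plain-Python stand-in rather than a real `LightningModule`, so it runs without any dependencies.

```python
import math
from typing import Optional, Sequence


class SkipInvalidLoss:
    """Hypothetical stand-in for a LightningModule; only the return logic matters."""

    def compute_loss(self, batch: Sequence[float]) -> float:
        # Toy loss: the batch mean. NaN or inf in the batch propagates to the result.
        return sum(batch) / len(batch)

    def training_step(self, batch: Sequence[float], batch_idx: int) -> Optional[float]:
        loss = self.compute_loss(batch)
        if not math.isfinite(loss):
            # Assumption: returning None tells the training loop to skip this batch.
            return None
        return loss
```

In a real module the loss would be a tensor, and the check would presumably be `torch.isfinite(loss)` instead of `math.isfinite`.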
thanks @carmocca, this is definitely a much simpler way!
I'm not sure. We don't have a test for it using DDP. I'll try it and report back. If it doesn't work, this could be fixed with #3325 cc @rohan-varma
See the returns of Do you still want to support skipping invalid gradients, or is skipping losses enough?
If it works with DDP, then I guess it should be enough.
Is there a way to change the result of
No. The optimization procedure is completely managed by the loops calling the
It seems returning Should I submit a bug report?
No. Returning
The code presented in the first comment does not work for me. I'm using mixed precision with pytorch_lightning
However, with My guess is that with the float16 datatype, lower gradient values would also count as inf. However, while updating the weights, the value of the gradient is used. I'm using
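To make the float16 guess concrete, here is a small sketch of how narrow the half-precision range is. It uses only the standard library's `struct` half-float format; `to_float16` and `FP16_MAX` are hypothetical names introduced for illustration, and mapping the overflow case to inf is an assumption that mirrors IEEE 754 round-to-nearest conversion.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_float16(x: float) -> float:
    """Round-trip x through half precision, mapping overflow to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; a float16 cast would give inf.
        return math.copysign(math.inf, x)


print(to_float16(FP16_MAX))  # still finite
print(to_float16(70000.0))   # overflows, although tiny by float32 standards
print(to_float16(1e-8))      # underflows to zero
```

A gradient of 70000 is unremarkable in float32 but is already inf in float16. With loss scaling, such overflows are expected from time to time, and PyTorch's `GradScaler` is documented to skip `optimizer.step()` itself when it finds inf or NaN gradients, so an extra inf check on scaled gradients may fire more often than intended.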
FWIW, I encountered a similar problem and it seems to have been resolved by switching from
@ashesh-0 I use it like this:

    def optimizer_step(
        self,
        epoch,
        batch_idx,
        optimizer,
        optimizer_idx,
        optimizer_closure,
        on_tpu=False,
        using_lbfgs=False,
    ):
        """
        Skipping updates in case of unstable gradients
        https://github.com/Lightning-AI/lightning/issues/4956
        """
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                valid_gradients = not (torch.isnan(param.grad).any())
                if not valid_gradients:
                    break
        if not valid_gradients:
            print("detected inf or nan values in gradients. not updating model parameters")
            self.zero_grad()
        optimizer.step(closure=optimizer_closure)

and precision 16, gradient_clip_val 1.0. I just can have gradient nan problem
@YooSungHyun can I put this code in self.training_step, or do I need to create self.optimizer_step so that it runs after the training_step happens?
@unlugi I just override optimizer_step, and it is called on each global step in the training loop.
Hi, @YooSungHyun! When you mentioned -
do you mean that it is still possible that
FYI, the suggested TypeError: optimizer_step() missing 1 required positional argument: 'optimizer_closure' for me in lightning 2.0. Removing
I had "MisconfigurationException: When

    def optimizer_step(
        self,
        *args, **kwargs
    ):
        """
        Skipping updates in case of unstable gradients
        https://github.com/Lightning-AI/lightning/issues/4956
        """
        valid_gradients = True
        for name, param in self.named_parameters():
            if param.grad is not None:
                # valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any())
                valid_gradients = not (torch.isnan(param.grad).any())
                if not valid_gradients:
                    break
        if not valid_gradients:
            print("detected inf or nan values in gradients. not updating model parameters")
            self.zero_grad()
        pl.LightningModule.optimizer_step(self, *args, **kwargs)

I'm not sure if this is enough when using gradient accumulation + ddp + amp.
tl;dr
does the approach in the code snippet below look ok, or is there a better alternative for automatically skipping a few "bad" samples in the data that cause inf/nan gradients/loss? (is it good practice at all?)
details
sometimes, there is a small percentage (but annoyingly large in absolute value) of "dirty" samples in the data that cause the loss to be nan, although the neural-network architecture itself is fine and stable in terms of numerical stability.
one approach is to automatically stop training (use terminate_on_nan) and then somehow isolate all these samples and remove them from the data permanently. but sometimes we simply want to automatically skip these samples as if they never existed (perhaps with a warning), and continue training.
I couldn't find any documentation about how to do that, nor anyone who asked this question. so i decided to ask and offer a solution I found, for others that might need it as well.
in the end, i came up with the following approach - override the on_after_backward method in my lightning-module with the following code:
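The original snippet is not preserved in the text above. As a rough, framework-free sketch of the idea (check every gradient after the backward pass and zero them all if any value is NaN or inf), something like the following could serve as a starting point. The names `first_invalid_gradient` and `zero_if_invalid` are hypothetical, and gradients are modelled as plain lists of floats rather than tensors.

```python
import math
from typing import Dict, List, Optional


def first_invalid_gradient(grads: Dict[str, List[float]]) -> Optional[str]:
    """Return the name of the first parameter whose gradient has NaN/inf, else None."""
    for name, values in grads.items():
        if any(not math.isfinite(v) for v in values):
            return name
    return None


def zero_if_invalid(grads: Dict[str, List[float]]) -> bool:
    """Zero all gradients in place if any is invalid; return True if they were valid."""
    bad = first_invalid_gradient(grads)
    if bad is None:
        return True
    print(f"detected inf or nan values in gradient of {bad!r}; zeroing gradients")
    for name in grads:
        grads[name] = [0.0] * len(grads[name])
    return False
```

In an actual `on_after_backward` override, the loop would presumably run over `self.named_parameters()`, test `torch.isfinite(param.grad).all()`, and call `self.zero_grad()` in place of the list reset. Note that zeroed gradients do not necessarily mean no update: optimizers with momentum or weight decay can still change the parameters.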
pros
cons
final question
is it worth having such functionality integrated into lightning as a simple command-line-switch/parameter?