Getting NaN when training using FP16 #12510
Unanswered
cmlakhan asked this question in code help: NLP / ASR / TTS
I am trying to train a transformer model using FP16 precision, but the loss eventually goes to NaN after around 1,000 steps. I set the detect-anomaly flag to True and get the error shown below, and I'm wondering whether others have encountered this issue as well. I have tried a variety of learning rates; the current error comes from a low learning rate (1e-5). I have hit the same error with CNN models too. When I run the same model with FP32 precision there are no issues, so is it something beyond a model problem?
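For reference, a minimal sketch of the kind of setup described above, with a toy model and random data standing in for the actual transformer (flag names follow PyTorch Lightning's Trainer API; newer releases spell the precision value "16-mixed"):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Stand-in for the actual transformer from the question."""

    def __init__(self, lr: float = 1e-5):
        super().__init__()
        self.lr = lr
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


train_loader = DataLoader(
    TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
)

trainer = pl.Trainer(
    max_steps=2000,
    precision=16,         # FP16 mixed precision (the setting that eventually produces NaNs)
    detect_anomaly=True,  # report the operation that produced a NaN/Inf in the backward pass
    accelerator="gpu",    # FP16 autocast requires a CUDA device
    devices=1,
)
trainer.fit(ToyModel(lr=1e-5), train_loader)
```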
Replies: 1 comment

Hi @cmlakhan! Generally speaking, training with fp16 is unstable because of its small dynamic range compared to fp32 and can lead to NaNs at some point. Instead, I'd try bf16, whose range is much wider than fp16's: Trainer(..., precision="bf16")
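A sketch of the suggested switch, assuming the same Lightning Trainer setup as above (newer Lightning releases spell the value "bf16-mixed", and bf16 needs hardware support such as Ampere-generation NVIDIA GPUs or TPUs):

```python
import pytorch_lightning as pl

# bf16 keeps fp32's exponent range (8 bits) but has fewer mantissa bits,
# so it is far less prone to the overflow-driven NaNs seen with fp16.
trainer = pl.Trainer(
    precision="bf16",   # was: precision=16
    accelerator="gpu",
    devices=1,
)
```

Everything else in the training setup stays the same; only the precision flag changes.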