Getting NaN when training using FP16 #12510
Unanswered
cmlakhan asked this question in code help: NLP / ASR / TTS
I am trying to train a transformer model using FP16 precision, but the loss eventually goes to NaN after around 1,000 steps. I set the detect-anomaly flag to True and get the error shown below, and I'm wondering whether others have encountered this issue as well. I have tried a variety of learning rates; the current error comes from a low learning rate (1e-5). I have hit the same error with CNN models too. When I run the same model with FP32 precision there are no issues, so is it something beyond a model problem?
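For reference, a minimal sketch of the kind of setup described above, with a toy model and random data standing in for the actual transformer (flag names follow PyTorch Lightning's Trainer API; newer releases spell the precision value "16-mixed"):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Stand-in for the actual transformer from the question."""

    def __init__(self, lr: float = 1e-5):
        super().__init__()
        self.lr = lr
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


train_loader = DataLoader(
    TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
)

trainer = pl.Trainer(
    max_steps=2000,
    precision=16,         # FP16 mixed precision (the setting that eventually produces NaNs)
    detect_anomaly=True,  # report the operation that produced a NaN/Inf in the backward pass
    accelerator="gpu",    # FP16 autocast requires a CUDA device
    devices=1,
)
trainer.fit(ToyModel(lr=1e-5), train_loader)
```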
Replies: 1 comment

Hi @cmlakhan! Generally speaking, training with fp16 is unstable because of its small dynamic range compared to fp32 and can lead to NaNs at some point. Instead, I'd try bf16, whose range is much wider than fp16's: Trainer(..., precision="bf16")
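A sketch of the suggested switch, assuming the same Lightning Trainer setup as above (newer Lightning releases spell the value "bf16-mixed", and bf16 needs hardware support such as Ampere-generation NVIDIA GPUs or TPUs):

```python
import pytorch_lightning as pl

# bf16 keeps fp32's exponent range (8 bits) but has fewer mantissa bits,
# so it is far less prone to the overflow-driven NaNs seen with fp16.
trainer = pl.Trainer(
    precision="bf16",   # was: precision=16
    accelerator="gpu",
    devices=1,
)
```

Everything else in the training setup stays the same; only the precision flag changes.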