[T5] enable T5 fp16 #9487
Conversation
why the -1000?
Just to be on the safe side: setting it to the exact max value might again lead to inf values in subsequent layers.
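For context, the clamping under discussion looks roughly like the sketch below (the standalone helper is illustrative; in the PR the check lives inside the layers' forward passes). The 1000 of headroom below the fp16 max (65504) is what this comment is about:

```python
import torch

def clamp_inf(hidden_states: torch.Tensor) -> torch.Tensor:
    # Clamp inf values to enable fp16 training. Clamping to max - 1000
    # rather than the exact fp16 max (65504) leaves headroom so that
    # subsequent layers don't immediately overflow back to inf.
    if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any():
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```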
Okay, just noticed that we do the same in Bart as well.
maybe improve comment slightly:

```diff
- # clamp inf values
+ # clamp inf values to enable fp16 training
```
This is great!

Dear @patil-suraj, can you tell me whether your code should fix fp16 on the google/t5-v1_1-xl model? Update: I ran my code on a Transformers branch from your current PR #9487 merged with PR #9211, which is needed for the deepspeed integration.
Hey @exelents, can you include a code snippet to reproduce your error, as well as the full stack trace?
As stated in #9432, this fix works for the following models and versions, with apex.
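For reference, a hedged sketch of an apex mixed-precision training step like the one referred to above (the `opt_level` and the toy model are assumptions, not from this thread):

```python
import torch
from apex import amp  # NVIDIA apex: https://github.com/NVIDIA/apex

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# "O1" casts selected ops to fp16 while keeping master weights in fp32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 10, device="cuda")
loss = model(x).pow(2).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # apex scales the loss and may skip a step on overflow
optimizer.step()
```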
Just did a small experiment with this as well. @exelents, by overflow error do you mean the gradient overflow warning thrown by deepspeed?
Ah ok, we still see it.
Here is the error stack:
I'm again trying to locate where exactly in the model this happens, in case it's the same as above.
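A generic debugging sketch for pinning down the first offending layer (not code from this thread; the helper name is made up). The first module name printed is the first one, in forward execution order, whose output went non-finite:

```python
import torch

def register_nonfinite_detectors(model: torch.nn.Module):
    # Attach a forward hook to every submodule and report any module
    # whose output contains inf or nan.
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                    print(f"non-finite output in: {name}")
        return hook

    # Keep the handles around; call handle.remove() on each when done.
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
```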
I have checked the loss value, and it seems it is not NaN. It takes values like "48.7500" or "40.9688", which are valid. Despite that, I see messages like "OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0", which seems to mean that something bad happened with the model's loss.
Those warnings don't mean anything went wrong; with dynamic loss scaling it is expected that some loss scale values are too big at the beginning of training.
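To illustrate why those messages are benign, here is a minimal sketch of dynamic loss scaling using PyTorch's `GradScaler` (deepspeed, which emits the warning above, implements the same idea internally; the toy model and the `init_scale` here are illustrative). The scaler deliberately starts high, skips any step whose gradients overflow, and halves the scale, which is exactly what the "Skipping step ... reducing to 512.0" message reports:

```python
import torch

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler(init_scale=1024.0)  # deliberately high start

for step in range(100):
    inputs = torch.randn(8, 10, device="cuda")
    targets = torch.randn(8, 10, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skipped if the scaled grads contain inf/nan
    scaler.update()         # halves the scale after a skip, grows it otherwise
```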
sgugger
left a comment
LGTM, thanks for fixing this!
LysandreJik
left a comment
Very cool! Thanks for working on this @patil-suraj!
What does this PR do?
This PR enables fp16 for T5 models by clamping the hidden states to the max value of the current data type.
As detailed in #9295, T5 produces large (`inf`) activations at 3 places:

- `T5LayerFF`
- `T5LayerSelfAttention`
- `T5LayerCrossAttention`

To avoid these `inf` activations, this PR clamps the `hidden_states` after the above 3 outputs.
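A quick way to sanity-check the fix (an illustrative snippet, not part of the PR; `t5-small` and the example sentences are arbitrary choices):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").half().cuda().eval()

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").to("cuda")
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    out = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels)

# Before this PR, fp16 activations could overflow to inf and the logits/loss
# would come out as nan; with the clamping they should stay finite.
assert torch.isfinite(out.logits).all()
print("fp16 loss:", out.loss.item())
```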