[T5] enable T5 fp16 #9487
Conversation
why the -1000?
Just to be on the safe side: setting it to the exact max value might again lead to inf values in subsequent layers.
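For context, the clamping under discussion looks roughly like the sketch below (the standalone helper is illustrative; in the PR the check lives inside the layers' forward passes). The 1000 of headroom below the fp16 max (65504) is what this comment is about:

```python
import torch

def clamp_inf(hidden_states: torch.Tensor) -> torch.Tensor:
    # Clamp inf values to enable fp16 training. Clamping to max - 1000
    # rather than the exact fp16 max (65504) leaves headroom so that
    # subsequent layers don't immediately overflow back to inf.
    if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any():
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```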
Okay, just noticed that we do the same in Bart as well.
maybe improve comment slightly:

```diff
- # clamp inf values
+ # clamp inf values to enable fp16 training
```
This is great!

Dear @patil-suraj, can you tell me whether your code should fix fp16 on the google/t5-v1_1-xl model? Update: I ran my code on a Transformers branch from your current PR #9487 merged with PR #9211, which is needed for the deepspeed integration.
Hey @exelents, can you include a code snippet to reproduce your error, as well as the full stack trace?
As stated in #9432, this fix works for the following models and versions, with apex.
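For reference, a hedged sketch of an apex mixed-precision training step like the one referred to above (the `opt_level` and the toy model are assumptions, not from this thread):

```python
import torch
from apex import amp  # NVIDIA apex: https://github.com/NVIDIA/apex

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# "O1" casts selected ops to fp16 while keeping master weights in fp32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 10, device="cuda")
loss = model(x).pow(2).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # apex scales the loss and may skip a step on overflow
optimizer.step()
```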
Just did a small experiment with this as well. @exelents, by overflow error do you mean the gradient overflow warning thrown by deepspeed?
Ah ok, we still see it.
Here is the error stack:
I'm again trying to locate where exactly in the model this happens, in case it's the same as above.
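A generic debugging sketch for pinning down the first offending layer (not code from this thread; the helper name is made up). The first module name printed is the first one, in forward execution order, whose output went non-finite:

```python
import torch

def register_nonfinite_detectors(model: torch.nn.Module):
    # Attach a forward hook to every submodule and report any module
    # whose output contains inf or nan.
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                    print(f"non-finite output in: {name}")
        return hook

    # Keep the handles around; call handle.remove() on each when done.
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
```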
I have checked the loss value, and it seems it is not NaN. It takes values like "48.7500" or "40.9688", which are valid. Despite that, I see messages like "OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0", which seems to mean that something bad happened with the model's loss.
Those warnings don't mean anything went wrong; with dynamic loss scaling it is expected that some loss scale values are too big at the beginning of training.
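To illustrate why those messages are benign, here is a minimal sketch of dynamic loss scaling using PyTorch's `GradScaler` (deepspeed, which emits the warning above, implements the same idea internally; the toy model and the `init_scale` here are illustrative). The scaler deliberately starts high, skips any step whose gradients overflow, and halves the scale, which is exactly what the "Skipping step ... reducing to 512.0" message reports:

```python
import torch

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler(init_scale=1024.0)  # deliberately high start

for step in range(100):
    inputs = torch.randn(8, 10, device="cuda")
    targets = torch.randn(8, 10, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skipped if the scaled grads contain inf/nan
    scaler.update()         # halves the scale after a skip, grows it otherwise
```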
sgugger
left a comment
LGTM, thanks for fixing this!
LysandreJik
left a comment
Very cool! Thanks for working on this @patil-suraj!
What does this PR do?
This PR enables fp16 for T5 models by clamping the hidden states to the max value of the current data type.
As detailed in #9295, T5 produces large (`inf`) activations at 3 places:

- `T5LayerFF`
- `T5LayerSelfAttention`
- `T5LayerCrossAttention`

To avoid these `inf` activations, this PR clamps the `hidden_states` after the above 3 outputs.
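A quick way to sanity-check the fix (an illustrative snippet, not part of the PR; `t5-small` and the example sentences are arbitrary choices):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").half().cuda().eval()

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").to("cuda")
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    out = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels)

# Before this PR, fp16 activations could overflow to inf and the logits/loss
# would come out as nan; with the clamping they should stay finite.
assert torch.isfinite(out.logits).all()
print("fp16 loss:", out.loss.item())
```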