Conversation

@JingyaHuang (Contributor) commented Jul 4, 2022

What does this PR do?

Fixes onnxruntime issue #11279.

Context

Optimum users reported that mixed-precision training of GPT-2 with optimum.onnxruntime.ORTTrainer has been broken since transformers>4.16.0. After investigation, the breakage comes from the removal of the float() cast in the GPT-2 modeling code in PR #14321.

Reproduction

Run the Optimum onnxruntime training example run_glue.py with:

python run_glue.py \
    --model_name_or_path gpt2 \
    --task_name sst2 \
    --do_train \
    --do_eval \
    --fp16 \
    --output_dir /tmp/ort-gpt2-sst2/

Error Message

RuntimeError: /onnxruntime_src/orttraining/orttraining/python/orttraining_pybind_state.cc:752 
onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, 
onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const 
pybind11::bytes&, const onnxruntime::training::OrtModuleGraphBuilderConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : 
Type Error: Type parameter (T) of Optype (Where) bound to different types (tensor(float) and tensor(float16) in node 
(Where_201).

As the error message indicates, the forward pass with the onnxruntime InferenceSession fails on a Where node in the graph, which corresponds to the torch.where call in the GPT-2 modeling code.

The problem is that, with the float() cast removed, the two value inputs of Where end up with different dtypes during fp16 training (one fp32, one fp16), which violates the ONNX operator definition and causes the failure.
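
For reference, here is a paraphrased sketch of the GPT2Attention._attn pattern that produces this Where node (names follow the modeling code, but this is an illustration, not the exact source):

import torch

def masked_scaled_attn(attn_weights, value, causal_mask):
    # Paraphrased sketch of GPT2Attention._attn around the failing node.
    attn_weights = attn_weights / (value.size(-1) ** 0.5)  # Python-float divisor
    mask_value = torch.finfo(attn_weights.dtype).min
    mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
    # The ONNX Where schema binds both value inputs to the same type parameter T,
    # so the exported node becomes invalid once one side is fp32 and the other fp16.
    return torch.where(causal_mask, attn_weights, mask_value)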

Who can review?

@michaelbenayoun, @patrickvonplaten, @LysandreJik

Fix

Ensure attn_weights and value have the same dtype in the exported ONNX IR.
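
One way to do this (a sketch of the approach, assuming the culprit is the attn_weights scaling line quoted later in this thread; not necessarily the exact merged diff) is to build the divisor as a tensor in attn_weights' dtype instead of a Python float, so no stray fp32 constant is baked into the exported graph:

import torch

def scale_attn_weights(attn_weights, value):
    # Sketch only: create a 0-d divisor tensor in attn_weights' dtype, so the
    # exported Div (and everything downstream, including Where) can stay in one
    # dtype after onnxruntime's mixed-precision cast pass.
    divisor = torch.full(
        [], value.size(-1) ** 0.5, dtype=attn_weights.dtype, device=attn_weights.device
    )
    return attn_weights / divisor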

@HuggingFaceDocBuilderDev commented Jul 4, 2022

The documentation is not available anymore as the PR was closed or merged.

@michaelbenayoun (Member) left a comment

LGTM (once all the tests pass)

@ydshieh (Collaborator) commented Jul 7, 2022

Hello @JingyaHuang. Looking at these two lines:

mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
attn_weights = torch.where(causal_mask, attn_weights, mask_value)

mask_value is already of type attn_weights.dtype. Does the issue only occur with ONNX (i.e., does the model work when run in plain PyTorch with FP16)? This seems strange. Do you happen to know which argument gets fp32 and which one gets fp16?

@JingyaHuang (Contributor, Author) commented Jul 7, 2022

Hi @ydshieh, yes, this issue only occurs with ONNX. When exporting the ONNX IR, mask_value is exported as a constant initializer (the fp32 minimum) with dtype float32. Thus, during mixed-precision training with onnxruntime, attn_weights is in fp16 while mask_value, as a constant, stays fp32 -> two inputs with different dtypes -> training fails.
Here is the ONNX IR which illustrates what happened with the Where op:
[screenshot: exported ONNX graph around the Where node]

[EDIT] I made a mistake here: according to the training graph, mask_value is actually successfully cast to fp16, but attn_weights is not. See the locally exported IR below.
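
As a side note, one way to confirm this kind of mismatch on an exported graph is to dump the dtypes of the initializers feeding each Where node. A hypothetical inspection sketch (the file name is illustrative, not from this PR):

import onnx

# Load an exported (training) graph and report, for each Where node, which
# inputs are constant initializers and what their element types are.
model = onnx.load("gpt2_training.onnx")
init_dtypes = {
    init.name: onnx.TensorProto.DataType.Name(init.data_type)
    for init in model.graph.initializer
}
for node in model.graph.node:
    if node.op_type == "Where":
        print(node.name, [init_dtypes.get(name, "<activation>") for name in node.input])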

@JingyaHuang (Contributor, Author)

And if we run the model with the PyTorch backend, there is no problem with tracing or op definitions; it works fine.

@ydshieh (Collaborator) commented Jul 7, 2022

@JingyaHuang Thank you!

@JingyaHuang (Contributor, Author) commented Jul 7, 2022

Hi @ydshieh, I've just double-checked the debug export of the training ONNX graph. mask_value is in fact cast to fp16 before the Where node; it was attn_weights that stayed in fp32, and the fix inserts another Cast op to cast attn_weights from fp32 to fp16.

The IR below corresponds to this line:

attn_weights = attn_weights / (value.size(-1) ** 0.5)

The IR before the fix:
[screenshot: exported training graph before the fix]
The IR after the fix:
[screenshot: exported training graph after the fix]

So this is exactly what we want for fp16 training.

@JingyaHuang requested review from LysandreJik and patrickvonplaten and removed the request for LysandreJik and patrickvonplaten, July 13, 2022 16:24
@JingyaHuang (Contributor, Author)

Gently pinging @patrickvonplaten and @LysandreJik for a review.

@LysandreJik (Member) left a comment

Thanks, merging!

@LysandreJik merged commit 2844c5d into huggingface:main, Jul 26, 2022
@TXacs commented Jul 28, 2022

I got the same error, but I use torch.fx and AMP to train the GPT-2 model. I fixed the error by adding attn_weights.to(attn_weights.dtype) inside torch.where. I don't know why this fixes it, but it does.
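
A sketch of what that workaround looks like (hedged; the exact line and placement depend on the transformers version being traced):

import torch

def masked_where_workaround(causal_mask, attn_weights, mask_value):
    # Workaround as described above (sketch only): the explicit .to() on
    # attn_weights is a no-op in eager mode; why it unblocks the traced
    # fp16 graph is not explained in the comment above.
    return torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)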

@ydshieh (Collaborator) commented Jul 28, 2022

Hi @TXacs, which version did you try? Could you try installing the latest version from main:

pip install git+https://github.com/huggingface/accelerate

and see if you still have the issue (without your fix)? Thanks!

Development

Successfully merging this pull request may close these issues.

Type Error when training Hugging Face Transformers GPT2 with fp16 enabled
