
Conversation

@JingyaHuang
Contributor

What does this PR do?

With PR #20061, tracing fails during mixed-precision training because the inputs of a where node no longer share the same dtype, which makes the exported ONNX model invalid when it is reused for inference.

The failing node (a dtype-consistent sketch follows the error message below):

attn_weights = torch.where(causal_mask, attn_weights, mask_value)

Error message:

======================================================================
ERROR: test_ort_trainer (__main__.TestORTTrainer) (model_name='gpt2', dataset_name='sst2', inference_with_ort=False)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_onnxruntime_train.py", line 131, in test_ort_trainer
    train_result = trainer.train()
  File "/workspace/optimum/onnxruntime/trainer.py", line 349, in train
    return inner_training_loop(
  File "/workspace/optimum/onnxruntime/trainer.py", line 615, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2523, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2555, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 371, in _forward
    return ortmodule._torch_module.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 351, in _forward
    return torch_module_ort._execution_manager(torch_module_ort.is_training()).forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 273, in forward
    self._fallback_manager.handle_exception(
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_fallback.py", line 162, in handle_exception
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 210, in forward
    self._initialize_graph_builder()
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 478, in _initialize_graph_builder
    self._graph_builder.initialize(self._onnx_models.exported_model.SerializeToString(), grad_builder_config)
RuntimeError: /onnxruntime_src/orttraining/orttraining/python/orttraining_pybind_state.cc:731 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const pybind11::bytes&, const onnxruntime::training::OrtModuleGraphBuilderConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Type Error: Type parameter (T) of Optype (Where) bound to different types (tensor(float) and tensor(float16) in node (Where_223).
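For illustration, here is a minimal sketch of one way to keep both value inputs of the Where node in the same dtype under mixed precision. The helper name masked_attn_weights is hypothetical, and this is not necessarily the exact change made in this PR:

import torch

def masked_attn_weights(attn_weights, causal_mask):
    # Build the fill value with the same dtype/device as attn_weights so that
    # both value inputs of torch.where share one dtype. Under fp16 autocast,
    # attn_weights is float16, while a plain float32 constant would otherwise
    # be traced into a Where node with mismatched input types.
    mask_value = torch.finfo(attn_weights.dtype).min
    mask_value = torch.full([], mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
    return torch.where(causal_mask, attn_weights, mask_value)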

@JingyaHuang
Contributor Author

JingyaHuang commented Dec 7, 2022

A bit more context on the issue: I previously fixed the tracing problem in #18017, but that fix hurt performance due to host<->device synchronization. PR #20061 addressed the performance issue, but it made tracing fail again.

It seems we can't guarantee both tracing correctness and PyTorch inference performance with the same line of code, which is why this PR distinguishes two cases (a hypothetical sketch follows the list):

  • Case 1: Tracing
  • Case 2: Inference with PyTorch
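
Purely as an illustration of the two-case idea (not the code ultimately kept in this PR, since the branch was removed after the review below), such a split could look like the following, assuming causal_mask is a boolean tensor and torch.jit.is_tracing() is used to detect the tracing path:

import torch

def apply_causal_mask(attn_weights, causal_mask, mask_value):
    # Hypothetical sketch of the two-case branch described above.
    if torch.jit.is_tracing():
        # Case 1 (tracing / ONNX export): keep every input of torch.where in
        # attn_weights' dtype so the exported graph stays valid.
        fill = torch.full([], mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
        return torch.where(causal_mask, attn_weights, fill)
    # Case 2 (eager PyTorch inference): pass the Python scalar directly to
    # masked_fill, avoiding the construction of a device tensor from a
    # host-side float.
    return attn_weights.masked_fill(~causal_mask, mask_value)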

@JingyaHuang
Contributor Author

Also @michaelbenayoun, I saw this: #18017 (comment). Will the current modeling have any issue with mixed-precision training for torch.fx?

Collaborator

@sgugger sgugger left a comment

This is the kind of if/else we try to avoid in the modeling code as it will become completely unreadable if we add support for all optimizations/exports like this. Let's forego the optimized path here and only do what works for ONNX/tracing.

@JingyaHuang
Contributor Author

Feel the same; if/else removed!

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Dec 7, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment

Thanks! Let's just wait for @michaelbenayoun and then we can merge!

Member

@michaelbenayoun michaelbenayoun left a comment

LGTM

@sgugger sgugger merged commit 521da65 into huggingface:main Dec 8, 2022
mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022