Fix gpt2 fp16 training when tracing is enabled #20656
A bit more context on the issue: I previously fixed the tracing issue in #18017, but that fix hurt performance because of host<->device synchronization. #20061 addressed the performance problem, but it caused tracing to fail once again. It seems that we can't guarantee both tracing correctness and inference performance with the same line of PyTorch code, which is why in this PR I distinguish two cases to solve it:
Also @michaelbenayoun, I saw this: #18017 (comment). Will the current modeling code have any issue with mixed-precision training for torch.fx?
sgugger
left a comment
This is the kind of if/else we try to avoid in the modeling code as it will become completely unreadable if we add support for all optimizations/exports like this. Let's forego the optimized path here and only do what works for ONNX/tracing.
I feel the same. If/else removed!
sgugger
left a comment
Thanks! Let's just wait for @michaelbenayoun and then we can merge!
michaelbenayoun
left a comment
LGTM
* ONNX tracing fix
* Remove conditional
What does this PR do?
Since PR #20061, tracing fails during mixed-precision training because the inputs of a Where node do not share the same dtype, which makes the exported ONNX model invalid when it is reused for inference.
The node:
transformers/src/transformers/models/gpt2/modeling_gpt2.py
Line 201 in 3ac040b
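To illustrate the dtype constraint described above, here is a minimal sketch rather than the actual diff of this PR. It assumes the mask value is built as a 0-dim tensor with the same dtype and device as the attention weights, so that both value inputs of torch.where match. The helper name mask_attn_weights and the toy shapes are hypothetical and only serve the example.

```python
import torch


def mask_attn_weights(attn_weights: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: fill masked positions with the dtype's minimum value."""
    # Smallest finite value representable in the dtype of the attention weights.
    min_value = torch.finfo(attn_weights.dtype).min
    # Build the fill value as a 0-dim tensor with the SAME dtype and device as
    # attn_weights, so both value inputs of torch.where agree on dtype.
    mask_value = torch.full(
        [], min_value, dtype=attn_weights.dtype, device=attn_weights.device
    )
    return torch.where(causal_mask, attn_weights, mask_value)


# Toy fp16 attention weights with a lower-triangular (causal) mask.
attn_weights = torch.zeros(1, 1, 3, 3, dtype=torch.float16)
causal_mask = torch.tril(torch.ones(3, 3, dtype=torch.bool)).view(1, 1, 3, 3)
masked = mask_attn_weights(attn_weights, causal_mask)
```

With fp16 inputs, the result stays in fp16 and the positions above the diagonal hold the fp16 minimum, so no float32 constant ends up feeding the where operation.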
Error message: