Mixed precision training #451
Comments
It's interesting that you got a significant speedup. The last time I tried it I didn't see much of one because the LSTM layers can't be run at half precision. So yes, sure, we can have an AMP switch (and set it to 16 bits by default when running on CUDA).
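For illustration, a minimal sketch of what such an AMP-enabled training step looks like in plain PyTorch (not kraken code; `model`, `optimizer`, `criterion` and `loader` are placeholders):

```python
import torch

def train_amp(model, optimizer, criterion, loader, device="cuda", use_amp=True):
    # GradScaler rescales the loss so small fp16 gradients don't underflow.
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    model.train()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        # autocast picks a per-op precision: matmuls/convs run in fp16,
        # ops without fp16 kernels stay in fp32.
        with torch.autocast(device_type="cuda", enabled=use_amp):
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```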
Have you tried the new pytorch JIT compiler?
I've only tried the old torchscript JIT. It is in feature/jit. Back then it was somewhat slower than the uncompiled version, but the primary reason I wanted to implement it was Core ML's torchscript conversion routine (which, as it turns out, doesn't work reliably with scripted models) to get away from the somewhat limiting VGSL layer.
I am wondering if AMP might also help lower the memory footprint of the model at inference (at least for segmentation).
While it's about diffusion, it might still be interesting: https://huggingface.co/docs/diffusers/optimization/fp16#memory-and-speed
I was under the impression that just casting to half precision won't provide any significant benefit, except on GPUs with tensor cores. There's also quantization, but that usually causes a loss in accuracy, the serialized models are hardware-dependent, and it didn't work particularly well with LSTMs.
@colibrisson @mittagessen I'll keep an eye on the results of the PR to see whether adding this to prediction makes a difference as well.
I just ran a couple of tests with autocasting enabled during inference with the default segmentation model. No difference in resource consumption, although there's a ~10% performance penalty with the autocaster.
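For reference, the kind of inference-time autocasting described here boils down to something like the following (a sketch, not the actual benchmark; `model` and `batch` are placeholders):

```python
import torch

model = model.eval().to("cuda")
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    # weights stay fp32; individual ops are downcast on the fly,
    # which adds some per-op casting overhead
    output = model(batch.to("cuda"))
```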
I'm not that surprised that a model trained in fp32 would perform worse in fp16. It kind of makes sense, no? 😅🤔
No, I'm talking about the running time. I haven't even looked at the quality of the output.
OH. That's weird O_O
https://huggingface.co/docs/diffusers/optimization/fp16#half-precision-weights
Yeah, we can't do that as the recurrent layers are not fp16-compatible. Unless there's an easy way to do mixed precision that I'm not aware of.
I'm not an expert, but I have used AMP with LSTMs in the past and the speedup is mind-blowing, especially on Ada GPUs.
In general, AMP has benefited from a lot of work thanks to the LLM world. Lightning-AI/pytorch-lightning#14356 (reply in thread) seems to indicate that the PL Trainer can move to 16-bit precision easily (see the Trainer sketch below). Compatibility with Torch native AMP is listed here: https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float16
What I can't correctly figure out is whether the use of AMP at prediction time should come from the model call (i.e. at inference time), from model loading (loading fp16 weights would only produce fp16 vars), or both (which would require some kind of new metadata in the MLModel for an FP16 switch, à la Huggingface Diffusers).
Nota bene: Huggingface recommends two other small code bits to speed up inference and training in the link I gave a few comments up.
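A minimal sketch of the Lightning route mentioned above (the exact precision string depends on the installed Lightning version; `lightning_module` and `loader` are placeholders):

```python
from pytorch_lightning import Trainer

# "16-mixed" / "bf16-mixed" in Lightning 2.x; 16 / "bf16" in older 1.x releases.
trainer = Trainer(accelerator="gpu", devices=1, precision="16-mixed")
trainer.fit(lightning_module, train_dataloaders=loader)
```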
Then there might be something else going on. I didn't look into it too closely as I was just fiddling around with all the pytorch-lightning options. But it might be that the Turing GPUs I've been trying it on don't provide much benefit (and the system with Ampere ones is I/O limited).
You can do either, but personally I prefer casting the model weights to FP16 after loading as it doesn't require us to have different model files for different quantizations. And it should be equivalent to instantiating the model directly with FP16 weights as the diffusers library does. BUT in both cases you need to make sure that everything in the net is FP16-compatible; otherwise AMP is the way to go.
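Roughly, the two options look like this (a sketch against a generic torch.nn.Module, not kraken's model-loading code; `model` and `batch` are placeholders):

```python
import torch

# Option 1: cast the fp32 weights to fp16 after loading; inputs must be cast
# too, and every layer in the net needs an fp16 kernel.
model_fp16 = model.half().to("cuda")
with torch.inference_mode():
    out = model_fp16(batch.half().to("cuda"))

# Option 2: keep fp32 weights and let autocast (AMP) downcast per op; safer
# when some layers are not fp16-compatible.
with torch.inference_mode(), torch.autocast(device_type="cuda"):
    out = model(batch.to("cuda"))
```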
FYI I get this warning -- which is fine -- but it could open the door to a little more optimization:
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
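The setting from the warning is a single process-wide call made before training or inference starts; on hardware without the faster matmul paths it is simply ignored:

```python
import torch

# Trade fp32 matmul precision for speed on Tensor Core GPUs
# ("highest" is the default, "medium" allows the fastest path).
torch.set_float32_matmul_precision("medium")  # or "high"
```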
I was going to bring this up. I ran some tests using torch.set_float32_matmul_precision('medium') and didn't notice any impact on accuracy. However, I don't know what the behavior is on older GPUs.
Do you still get the warning when you set --precision to bf16?
From the nature of the flag it shouldn't produce any degradation on older GPUs (or CPUs, or MPS, or ...). If no higher-performance implementation is available it will just be ignored. Do you see any performance gain when reducing the matmul precision, especially in fp16 mode? Because the majority of matrix multiplications should then be in fp16, so the benefit will most likely be marginal.
When setting precision to bf16, pytorch throws an error:
I ran some tests. When using full precision, setting float32_matmul_precision to medium provides a moderate speedup. I didn't notice any improvement when using mixed precision.
I trained segmentation and recognition models using CUDA's Automatic Mixed Precision (AMP). I noticed a significant speedup in training (as much as 2x) and a lower memory footprint, with zero impact on accuracy. I know you want to reduce the number of options, but this would be a very useful feature.
If you are interested, I can create a pull request today.