
Mixed precision training #451

Closed
colibrisson opened this issue Feb 22, 2023 · 23 comments

@colibrisson
Contributor

colibrisson commented Feb 22, 2023

I trained segmentation and recognition models using CUDA's Automatic Mixed Precision (AMP). I noticed a significant speedup in training (as much as 2×) and a lower memory footprint, with zero impact on accuracy. I know you want to reduce the number of options, but this would be a very useful feature.

If you are interested I can create a pull request today.
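For reference, a minimal sketch of what enabling CUDA AMP looks like in a plain PyTorch training loop (not kraken's actual trainer; `model`, `optimizer`, `loader`, and `criterion` are placeholders):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # Ops on the autocast allowlist (convolutions, linear, LSTM cells, ...)
    # run in float16; numerically sensitive ops stay in float32.
    with torch.cuda.amp.autocast():
        output = model(inputs)
        loss = criterion(output, targets)
    # The scaler rescales the loss to avoid gradient underflow in float16.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```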

@mittagessen
Owner

It's interesting that you got a significant speedup. The last time I tried it, I didn't see much of one because the LSTM layers can't be run with half precision. So, yes, sure, we can have an AMP switch (and set it to 16-bit by default when running on CUDA).
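Since training already runs through pytorch-lightning, the switch could presumably just map onto the Trainer's precision argument. A hedged sketch (argument values as accepted by the Lightning 1.x releases; `use_amp` stands in for whatever CLI flag gets added):

```python
from pytorch_lightning import Trainer

use_amp = True  # hypothetical value coming from the new CLI switch
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    # precision=16 enables native CUDA AMP (autocast + GradScaler) under the
    # hood; precision=32 keeps the current full-precision behaviour.
    precision=16 if use_amp else 32,
)
```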

@colibrisson
Contributor Author

Have you tried the new PyTorch JIT compiler?

@mittagessen
Owner

I've only tried the old TorchScript JIT. It is in feature/jit. Back then it was somewhat slower than the uncompiled version, but the primary reason I wanted to implement it was Core ML's TorchScript conversion routine (which, as it turns out, doesn't work reliably with scripted models), to get away from the somewhat limiting VGSL layers.

@PonteIneptique
Contributor

I am wondering whether AMP might also help lower the model's memory footprint at inference time (at least for segmentation).

@PonteIneptique
Contributor

While it's about diffusion, it might still be interesting: https://huggingface.co/docs/diffusers/optimization/fp16#memory-and-speed

@mittagessen
Owner

I was under the impression that just casting to half precision won't provide any significant benefit, except on GPUs with Tensor Cores. There's also quantization, but that usually causes a loss in accuracy and the serialized models are hardware-dependent (and it didn't work particularly well with LSTMs).

@PonteIneptique
Contributor

PonteIneptique commented Feb 22, 2023

@colibrisson @mittagessen I'll keep an eye on the results of the PR, so we can add this to prediction as well if it makes a difference.

@mittagessen
Owner

I just ran a couple of tests with autocasting enabled during inference with the default segmentation model. No difference in resource consumption, although there's a ~10% performance penalty with the autocaster.
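For context, a rough sketch of what such a comparison looks like (timing a forward pass with and without the autocast context; `model` and `batch` are placeholders rather than kraken's segmentation API):

```python
import time
import torch

def timed_forward(model, batch, use_autocast):
    # Synchronize around the forward pass so the timing reflects GPU work.
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        if use_autocast:
            with torch.cuda.amp.autocast():
                model(batch)
        else:
            model(batch)
    torch.cuda.synchronize()
    return time.perf_counter() - start
```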

@PonteIneptique
Contributor

PonteIneptique commented Feb 22, 2023 via email

@mittagessen
Owner

No, I'm talking about the running time. I haven't even looked at the quality of the output.

@PonteIneptique
Contributor

OH. That's weird O_O

@PonteIneptique
Contributor

> It is strongly discouraged to make use of torch.autocast in any of the pipelines as it can lead to black images and is always slower than using pure float16 precision.

https://huggingface.co/docs/diffusers/optimization/fp16#half-precision-weights
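For illustration, the distinction the diffusers docs draw is roughly the following (a generic sketch, not kraken code; whether option B is safe depends on every layer in the net tolerating float16):

```python
import torch

# Option A: mixed precision via autocast -- weights stay float32, eligible ops
# run in float16, incompatible ops fall back to float32 automatically.
with torch.inference_mode(), torch.cuda.amp.autocast():
    out = model(x)

# Option B: pure float16 -- cast weights and inputs once. The diffusers docs
# report this as faster than autocast, but every layer must handle float16.
model_fp16 = model.half().cuda()
out = model_fp16(x.half().cuda())
```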

@mittagessen
Owner

Yeah, we can't do that as the recurrent layers are not fp16-compatible. Unless there's an easy way to do mixed precision that I'm not aware of.

@colibrisson
Contributor Author

I'm not an expert, but I have used AMP with LSTMs in the past and the speedup is mind-blowing, especially on Ada GPUs.

@PonteIneptique
Contributor

In general, AMP has benefited from a lot of work thanks to the LLM world.

Lightning-AI/pytorch-lightning#14356 (reply in thread) seems to indicate that the PL Trainer can move to 16-bit precision easily.

Listed compatibility with Torch native AMP:

https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float16

CUDA Ops that can autocast to float16
matmul, addbmm, addmm, addmv, addr, baddbmm, bmm, chain_matmul, multi_dot, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, GRUCell, linear, LSTMCell, matmul, mm, mv, prelu, RNNCell
CUDA Ops that can autocast to float32
pow, rdiv, rpow, rtruediv, acos, asin, binary_cross_entropy_with_logits, cosh, cosine_embedding_loss, cdist, cosine_similarity, cross_entropy, cumprod, cumsum, dist, erfinv, exp, expm1, group_norm, hinge_embedding_loss, kl_div, l1_loss, layer_norm, log, log_softmax, log10, log1p, log2, margin_ranking_loss, mse_loss, multilabel_margin_loss, multi_margin_loss, nll_loss, norm, normalize, pdist, poisson_nll_loss, pow, prod, reciprocal, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softmax, softmin, softplus, sum, renorm, tan, triplet_margin_loss

What I can't quite figure out is whether the use of AMP at prediction time should come from the model call (i.e. at inference time), from model loading (loading fp16 weights would only produce fp16 variables), or both (which would require some kind of new metadata in the MLModel for an FP16 switch, à la Hugging Face Diffusers).

@PonteIneptique
Contributor

Nota bene: Hugging Face recommends two other small code bits to speed up inference and training in the link I gave a few comments above.

@mittagessen
Owner

> I'm not an expert, but I have used AMP with LSTMs in the past and the speedup is mind-blowing, especially on Ada GPUs.

Then there might be something else going on. I didn't look into it too closely as I was just fiddling around with all the pytorch-lightning options. But it might be that the Turing GPUs I've been trying it on don't provide much benefit (and the system with Ampere ones is I/O limited).

> What I can't quite figure out is whether the use of AMP at prediction time should come from the model call (i.e. at inference time), from model loading (loading fp16 weights would only produce fp16 variables), or both (which would require some kind of new metadata in the MLModel for an FP16 switch, à la Hugging Face Diffusers).

You can do either, but personally I prefer casting the model weights to FP16 after loading, as it doesn't require us to keep different model files for different quantizations. And it should be equivalent to instantiating the model directly with FP16 weights as the diffusers library does. BUT in both cases you need to make sure that everything in the net is FP16-compatible; otherwise AMP is the way to go.
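A hedged sketch of that cast-after-loading approach, with a crude guard for the recurrent layers discussed above (the helper is made up for illustration, not kraken API):

```python
import torch
from torch import nn

def to_fp16_if_safe(model: nn.Module) -> nn.Module:
    """Cast weights to float16 after loading, but only if the net contains no
    recurrent layers; otherwise keep float32 and rely on AMP at call time."""
    has_rnn = any(isinstance(m, (nn.RNN, nn.GRU, nn.LSTM)) for m in model.modules())
    return model if has_rnn else model.half()
```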

@PonteIneptique
Contributor

FYI I get this warning -- which is fine -- but it could open the door to a little more optimization:

> You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
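The setting the warning refers to is a single global call, e.g.:

```python
import torch

# Allow lower-precision matmul kernels on Tensor Core GPUs; 'high' keeps
# roughly TF32 accuracy, 'medium' trades more precision for speed.
torch.set_float32_matmul_precision("high")
```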

@colibrisson
Contributor Author

colibrisson commented Feb 24, 2023 via email

@colibrisson
Contributor Author

colibrisson commented Feb 24, 2023 via email

@mittagessen
Owner

From the nature of the flag it shouldn't produce any degradation on older GPUs (or CPUs, or MPS, or ...). If no higher-performance implementation is available, it will just be ignored. Do you see any performance gain when reducing the matmul precision, especially in fp16 mode? The majority of matrix multiplications should then already be in fp16, so the benefit will most likely be marginal.

@colibrisson
Contributor Author

When setting precision to bf16, PyTorch throws an error: `"_thnn_fused_lstm_cell_cuda" not implemented for 'BFloat16'`
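For reference, a minimal sketch of the kind of call that triggers this, assuming a PyTorch/CUDA build where the fused LSTM kernel has no BFloat16 implementation:

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=64, hidden_size=64).cuda()
x = torch.randn(10, 1, 64, device="cuda")

# Under bf16 autocast the LSTM inputs are cast to bfloat16, which hits the
# unimplemented fused kernel and raises:
#   RuntimeError: "_thnn_fused_lstm_cell_cuda" not implemented for 'BFloat16'
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    lstm(x)
```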

@colibrisson
Contributor Author

colibrisson commented Feb 24, 2023 via email
