
Mixed precision training #451

Closed
colibrisson opened this issue Feb 22, 2023 · 23 comments

@colibrisson
Contributor

colibrisson commented Feb 22, 2023

I trained segmentation and recognition models using CUDA's Automatic Mixed Precision (AMP). I noticed a significant speedup in training (as much as 2×) and a lower memory footprint, with zero impact on accuracy. I know you want to reduce the number of options, but this would be a very useful feature.

If you are interested I can create a pull request today.
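For reference, a minimal sketch of what enabling CUDA AMP looks like in a plain PyTorch training loop (not kraken's actual trainer; `model`, `optimizer`, `loader`, and `criterion` are placeholders):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # Ops on the autocast allowlist (convolutions, linear, LSTM cells, ...)
    # run in float16; numerically sensitive ops stay in float32.
    with torch.cuda.amp.autocast():
        output = model(inputs)
        loss = criterion(output, targets)
    # The scaler rescales the loss to avoid gradient underflow in float16.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```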

@mittagessen
Owner

It's interesting that you got a significant speedup. The last time I tried it, I didn't see much of one because the LSTM layers can't be run with half precision. So, yes, sure, we can have an AMP switch (and set it to 16-bit by default when running on CUDA).
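Since training already runs through pytorch-lightning, the switch could presumably just map onto the Trainer's precision argument. A hedged sketch (argument values as accepted by the Lightning 1.x releases; `use_amp` stands in for whatever CLI flag gets added):

```python
from pytorch_lightning import Trainer

use_amp = True  # hypothetical value coming from the new CLI switch
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    # precision=16 enables native CUDA AMP (autocast + GradScaler) under the
    # hood; precision=32 keeps the current full-precision behaviour.
    precision=16 if use_amp else 32,
)
```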

@colibrisson
Contributor Author

Have you tried the new PyTorch JIT compiler?

@mittagessen
Owner

I've only tried the old TorchScript JIT. It is in feature/jit. Back then it was somewhat slower than the uncompiled version, but the primary reason I wanted to implement it was Core ML's TorchScript conversion routine (which, as it turns out, doesn't work reliably with scripted models), to get away from the somewhat limiting VGSL layers.

@PonteIneptique
Contributor

I am wondering whether AMP might also help lower the model's memory footprint at inference time (at least for segmentation).

@PonteIneptique
Contributor

While it's about diffusion, it might still be interesting: https://huggingface.co/docs/diffusers/optimization/fp16#memory-and-speed

@mittagessen
Owner

I was under the impression that just casting to half precision won't provide any significant benefit, except on GPUs with Tensor Cores. There's also quantization, but that usually causes a loss in accuracy and the serialized models are hardware-dependent (and it didn't work particularly well with LSTMs).

@PonteIneptique
Contributor

PonteIneptique commented Feb 22, 2023

@colibrisson @mittagessen I'll keep an eye on the results of the PR, so we can add this to prediction as well if it makes a difference.

@mittagessen
Owner

I just ran a couple of tests with autocasting enabled during inference with the default segmentation model. No difference in resource consumption, although there's a ~10% performance penalty with the autocaster.
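For context, a rough sketch of what such a comparison looks like (timing a forward pass with and without the autocast context; `model` and `batch` are placeholders rather than kraken's segmentation API):

```python
import time
import torch

def timed_forward(model, batch, use_autocast):
    # Synchronize around the forward pass so the timing reflects GPU work.
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        if use_autocast:
            with torch.cuda.amp.autocast():
                model(batch)
        else:
            model(batch)
    torch.cuda.synchronize()
    return time.perf_counter() - start
```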

@PonteIneptique
Contributor

PonteIneptique commented Feb 22, 2023 via email

@mittagessen
Owner

No, I'm talking about the running time. I haven't even looked at the quality of the output.

@PonteIneptique
Contributor

OH. That's weird O_O

@PonteIneptique
Contributor

> It is strongly discouraged to make use of torch.autocast in any of the pipelines as it can lead to black images and is always slower than using pure float16 precision.

https://huggingface.co/docs/diffusers/optimization/fp16#half-precision-weights
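For illustration, the distinction the diffusers docs draw is roughly the following (a generic sketch, not kraken code; whether option B is safe depends on every layer in the net tolerating float16):

```python
import torch

# Option A: mixed precision via autocast -- weights stay float32, eligible ops
# run in float16, incompatible ops fall back to float32 automatically.
with torch.inference_mode(), torch.cuda.amp.autocast():
    out = model(x)

# Option B: pure float16 -- cast weights and inputs once. The diffusers docs
# report this as faster than autocast, but every layer must handle float16.
model_fp16 = model.half().cuda()
out = model_fp16(x.half().cuda())
```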

@mittagessen
Owner

Yeah, we can't do that as the recurrent layers are not fp16-compatible. Unless there's an easy way to do mixed precision that I'm not aware of.

@colibrisson
Contributor Author

I'm not an expert, but I have used AMP with LSTMs in the past and the speedup is mind-blowing, especially on Ada GPUs.

@PonteIneptique
Contributor

In general, AMP has benefited from a lot of work thanks to the LLM world.

Lightning-AI/pytorch-lightning#14356 (reply in thread) seems to indicate that the PL Trainer can move to 16-bit precision easily.

Listed compatibility with Torch native AMP:

https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float16

CUDA Ops that can autocast to float16
matmul, addbmm, addmm, addmv, addr, baddbmm, bmm, chain_matmul, multi_dot, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, GRUCell, linear, LSTMCell, matmul, mm, mv, prelu, RNNCell
CUDA Ops that can autocast to float32
pow, rdiv, rpow, rtruediv, acos, asin, binary_cross_entropy_with_logits, cosh, cosine_embedding_loss, cdist, cosine_similarity, cross_entropy, cumprod, cumsum, dist, erfinv, exp, expm1, group_norm, hinge_embedding_loss, kl_div, l1_loss, layer_norm, log, log_softmax, log10, log1p, log2, margin_ranking_loss, mse_loss, multilabel_margin_loss, multi_margin_loss, nll_loss, norm, normalize, pdist, poisson_nll_loss, pow, prod, reciprocal, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softmax, softmin, softplus, sum, renorm, tan, triplet_margin_loss

What I can't quite figure out is whether the use of AMP at prediction time should come from the model call (i.e. at inference time), from model loading (loading fp16 weights would only produce fp16 variables), or both (which would require some kind of new metadata in the MLModel for an FP16 switch, à la Hugging Face Diffusers).

@PonteIneptique
Contributor

Nota bene: Hugging Face recommends two other small code bits to speed up inference and training in the link I gave a few comments above.

@mittagessen
Owner

> I'm not an expert, but I have used AMP with LSTMs in the past and the speedup is mind-blowing, especially on Ada GPUs.

Then there might be something else going on. I didn't look into it too closely as I was just fiddling around with all the pytorch-lightning options. But it might be that the Turing GPUs I've been trying it on don't provide much benefit (and the system with Ampere ones is I/O limited).

> What I can't quite figure out is whether the use of AMP at prediction time should come from the model call (i.e. at inference time), from model loading (loading fp16 weights would only produce fp16 variables), or both (which would require some kind of new metadata in the MLModel for an FP16 switch, à la Hugging Face Diffusers).

You can do either, but personally I prefer casting the model weights to FP16 after loading, as it doesn't require us to keep different model files for different quantizations. And it should be equivalent to instantiating the model directly with FP16 weights as the diffusers library does. BUT in both cases you need to make sure that everything in the net is FP16-compatible; otherwise AMP is the way to go.
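A hedged sketch of that cast-after-loading approach, with a crude guard for the recurrent layers discussed above (the helper is made up for illustration, not kraken API):

```python
import torch
from torch import nn

def to_fp16_if_safe(model: nn.Module) -> nn.Module:
    """Cast weights to float16 after loading, but only if the net contains no
    recurrent layers; otherwise keep float32 and rely on AMP at call time."""
    has_rnn = any(isinstance(m, (nn.RNN, nn.GRU, nn.LSTM)) for m in model.modules())
    return model if has_rnn else model.half()
```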

@PonteIneptique
Contributor

FYI I get this warning -- which is fine -- but it could open the door to a little more optimization:

> You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
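The setting the warning refers to is a single global call, e.g.:

```python
import torch

# Allow lower-precision matmul kernels on Tensor Core GPUs; 'high' keeps
# roughly TF32 accuracy, 'medium' trades more precision for speed.
torch.set_float32_matmul_precision("high")
```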

@colibrisson
Contributor Author

colibrisson commented Feb 24, 2023 via email

@colibrisson
Contributor Author

colibrisson commented Feb 24, 2023 via email

@mittagessen
Owner

From the nature of the flag it shouldn't produce any degradation on older GPUs (or CPUs, or MPS, or ...). If no higher-performance implementation is available, it will just be ignored. Do you see any performance gain when reducing the matmul precision, especially in fp16 mode? The majority of matrix multiplications should then already be in fp16, so the benefit will most likely be marginal.

@colibrisson
Contributor Author

When setting precision to bf16, PyTorch throws an error: `"_thnn_fused_lstm_cell_cuda" not implemented for 'BFloat16'`
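For reference, a minimal sketch of the kind of call that triggers this, assuming a PyTorch/CUDA build where the fused LSTM kernel has no BFloat16 implementation:

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=64, hidden_size=64).cuda()
x = torch.randn(10, 1, 64, device="cuda")

# Under bf16 autocast the LSTM inputs are cast to bfloat16, which hits the
# unimplemented fused kernel and raises:
#   RuntimeError: "_thnn_fused_lstm_cell_cuda" not implemented for 'BFloat16'
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    lstm(x)
```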

@colibrisson
Contributor Author

colibrisson commented Feb 24, 2023 via email
