cuDNN: Store convolution algorithm choice to disk. #1947
Comments
I hope the compile times don't make it feel like it is doing on-the-fly benchmarking! It uses a cost model to choose register tile sizes. I am working on a replacement, where I plan on having a later release support both tiling for cache sizes and auto-tuning, on top of using a cost model by default. I also intend to have the rewrite support targeting accelerators, so long as those accelerators are supported by LLVM.
Instead of creating a mapping from cuDNN struct pointers to Julia-side info, it would probably be better to extract all the relevant info from the descriptor(s) and use some subset of that as the cache key. Some previous discussion in #1461 (comment) and #1461 (comment) (which references microsoft/CNTK#3362).
You can check this using Nsight. I find the benchmarking phase is pretty easy to distinguish from actual execution.
Thanks @chriselrod, I also had another look in the LoopVectorization docs, and indeed it does mention being model-based; I must have missed it the first time ;) As an aside: here's an API example of Mojo's approach to determining the tile size, which seems to be (for now) basically defining a search space and benchmarking. I'm not sure whether they're doing anything further at this point, @lattner? I suppose we won't know for some time.
That would certainly be exciting. If something like this can become competitive with vendor implementations of common kernels, I think it would be a very impressive result, and certainly a show-piece for the Julia community.
Thanks @ToucheSir for the feedback! Here's what the keys in the pull request example actually look like:

```julia
# cudnnFilterDescriptor
(cuDNN.CUDNN_DATA_FLOAT, cuDNN.CUDNN_TENSOR_NCHW, 4, Int32[32, 64, 3, 3])

# cudnnTensorDescriptor
(cuDNN.CUDNN_TENSOR_NCHW, cuDNN.CUDNN_DATA_FLOAT, 4, Int32[7, 32, 30, 30])

# cudnnConvDescriptor
# (padding, stride, dilation, ..., beta) where beta is some scaling factor(?)
(Int32[0, 0], Int32[1, 1], Int32[1, 1], cuDNN.CUDNN_CONVOLUTION, cuDNN.CUDNN_DATA_FLOAT, cuDNN.CUDNN_TENSOR_OP_MATH, cuDNN.CUDNN_DEFAULT_REORDER, 1)
```

I don't think it would be wise to strip much away from this. Even for the cudnnConvDescriptor, I could only imagine (maybe) stripping the padding, but otherwise I think there will always be cases where you end up needing different execution implementations. Therefore, IMO, using the descriptors as-is is quite reasonable. For the case where the batch size varies all the time, one could consider filling the dictionary with copies of the same keys (but with different batch sizes), or providing an algorithm override.
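One nice property of such content-based keys is that plain Julia tuples (even ones containing arrays) hash by value, so they work directly as `Dict` keys across descriptor re-creations. A minimal sketch of this idea, with the cuDNN enum fields replaced by symbols and the cache name and value type purely illustrative (this is not the actual cuDNN.jl cache layout):

```julia
# Hypothetical cache: descriptor-content tuples => (dummy) perf results.
# :CUDNN_* symbols stand in for the real cuDNN enum values.
const AlgoPerfCache = Dict{Tuple,Vector{Int}}()

# Keys built from descriptor *contents*, not from C struct pointers:
filter_key = (:CUDNN_DATA_FLOAT, :CUDNN_TENSOR_NCHW, 4, Int32[32, 64, 3, 3])
tensor_key = (:CUDNN_TENSOR_NCHW, :CUDNN_DATA_FLOAT, 4, Int32[7, 32, 30, 30])

key = (filter_key, tensor_key)
AlgoPerfCache[key] = [1, 2, 3]               # store some (dummy) results

# Lookup succeeds even with a structurally equal but distinct key object,
# because tuples and arrays hash by content:
@assert haskey(AlgoPerfCache, deepcopy(key))
```

Pointer-based keys, by contrast, only match while the exact same C struct is alive, which is what breaks persistence across sessions.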
That code is a simple example meant to motivate built-in language autotuning. Our production matmul algorithms use that functionality, but their implementation is quite different and more complicated.
Thanks @lattner, excited to see more details as they become available, especially for the gpu/tpu/xpu autotuning! |
Me too! We're seeing amazing results, can't wait to share them. |
I think my point was unclear, sorry. The main observation is that it is not necessary to cache the descriptors themselves but just their contents, which should make deserialization much easier. Will go into more detail on #1948. |
Problem description.
In deep learning (DL), we usually define a computation using high-level instructions (e.g. with Flux.jl) and then map the computation over many elements.
It's the compiler's job to figure out how to actually execute this in a performant manner, which has led to the invention of e.g. XLA, and is also at the core of the motivation of the recently announced Mojo programming language.
Usually, the only way to find out the details of the execution is to run micro-benchmarks.
In the CPU world, for example, ATLAS has long been a way to automatically tune execution details to new hardware.
(PS: I believe LoopVectorization.jl does something similar, but looking at the code I didn't actually find any on-the-fly benchmarking? @chriselrod)
cuDNN.jl currently does something similar only for convolutions, delegating the specifics to the actual cuDNN library (e.g. here).
Performance results are cached in a local dictionary and discarded at the end of the session.
Currently, I'm not sure how long the microbenchmarks take compared to the startup time in general (I found it a bit hard to measure).
In the future, however, I think we may see more involved performance optimizations, and I think it would be interesting to (optionally) store the benchmark results to disk.
Solution proposal.
We already have three cache variables: cudnnConvolutionFwdAlgoPerfCache, cudnnConvolutionBwdDataAlgoPerfCache, and cudnnConvolutionBwdFilterAlgoPerfCache.
Conceptually, they map from a tuple of descriptors (basically structs describing the convolution parameters, weight tensor, and data tensor) to performance results for different implementations.
In general, it should be relatively straightforward to repopulate them from disk.
However, the key is currently not the Julia struct/tuple but a pointer to a C struct, which dies at the end of the Julia session.
This should be an easy fix, though: change the key from the pointer to the actual descriptor contents.
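Once the keys are plain Julia values, persisting the caches becomes a matter of writing them out at session end and reading them back at startup. A minimal sketch using the Serialization stdlib; the cache layout, value type, and file path here are hypothetical, not the actual cuDNN.jl internals:

```julia
using Serialization

# Hypothetical cache: descriptor-content tuples => chosen algorithm id.
# The real caches store cuDNN algorithm-perf results rather than an Int.
cache = Dict{Tuple,Int}()
cache[((:CUDNN_DATA_FLOAT, Int32[32, 64, 3, 3]),
       (Int32[0, 0], Int32[1, 1]))] = 7

path = joinpath(mktempdir(), "conv_algo_cache.bin")
serialize(path, cache)          # write at session end (or after each insert)
restored = deserialize(path)    # repopulate at session start
@assert restored == cache
```

Note that Serialization's format is not guaranteed to be stable across Julia versions, so a real implementation would likely want to version the file and fall back to re-benchmarking on a mismatch.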
A proof-of-concept implementation is provided at #1948 .
Broader impact.
I think if Julia wants to stay competitive in this domain, especially with the Mojo language coming up, more optimizations for GPU execution must be available without paying a high compile-time cost.
In the future I hope that there can be further research projects related to different ways of executing on the hardware.
I'm happy for any feedback on the topic in general, and also for this issue in particular.