cuDNN: Store convolution algorithm choice to disk. #1947
Comments
I hope the compile times don't make it feel like it is doing on-the-fly benchmarking! It uses a cost model to choose register tile sizes. I am working on a replacement, where I plan on having a later release support both tiling for cache sizes and auto-tuning, on top of using a cost model by default. I also intend to have the rewrite support targeting accelerators, so long as those accelerators are supported by LLVM.
Instead of creating a mapping from cuDNN struct pointers to Julia-side info, it would probably be better to extract all the relevant info from the descriptor(s) and use some subset of that as the cache key. Some previous discussion in #1461 (comment) and #1461 (comment) (which references microsoft/CNTK#3362).
You can check this using Nsight. I find the benchmarking phase is pretty easy to distinguish from actual execution.
Thanks @chriselrod, I also had another look in the LoopVectorization docs, and indeed it does mention being model-based; I must have missed it the first time ;) As an aside: here's an API example of Mojo's approach to determining the tile size, which seems to be (for now) basically defining a search space and benchmarking. I'm not sure whether they're doing anything further at this point, @lattner? I suppose we won't know for some time.
That would certainly be exciting. If something like this can become competitive with vendor implementations of common kernels, I think it would be a very impressive result, and certainly a show-piece for the Julia community.
Thanks @ToucheSir for the feedback! Here's what the keys in the pull request example actually look like:

```julia
# cudnnFilterDescriptor
(cuDNN.CUDNN_DATA_FLOAT, cuDNN.CUDNN_TENSOR_NCHW, 4, Int32[32, 64, 3, 3])

# cudnnTensorDescriptor
(cuDNN.CUDNN_TENSOR_NCHW, cuDNN.CUDNN_DATA_FLOAT, 4, Int32[7, 32, 30, 30])

# cudnnConvDescriptor
# (padding, stride, dilation, ..., beta) where beta is some scaling factor(?)
(Int32[0, 0], Int32[1, 1], Int32[1, 1], cuDNN.CUDNN_CONVOLUTION, cuDNN.CUDNN_DATA_FLOAT, cuDNN.CUDNN_TENSOR_OP_MATH, cuDNN.CUDNN_DEFAULT_REORDER, 1)
```

I don't think it would be wise to strip much away from this. Even for the cudnnConvDescriptor, I could only imagine (maybe) stripping the padding, but otherwise I think there will always be cases where you end up needing different execution implementations. Therefore, IMO, using the descriptors as-is is quite reasonable. For the case where the batch size varies all the time, one could consider filling the dictionary with copies of the same keys (but with different batch sizes), or providing an algorithm override.
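One nice property of such content-based keys is that plain Julia tuples (even ones containing arrays) hash by value, so they work directly as `Dict` keys across descriptor re-creations. A minimal sketch of this idea, with the cuDNN enum fields replaced by symbols and the cache name and value type purely illustrative (this is not the actual cuDNN.jl cache layout):

```julia
# Hypothetical cache: descriptor-content tuples => (dummy) perf results.
# :CUDNN_* symbols stand in for the real cuDNN enum values.
const AlgoPerfCache = Dict{Tuple,Vector{Int}}()

# Keys built from descriptor *contents*, not from C struct pointers:
filter_key = (:CUDNN_DATA_FLOAT, :CUDNN_TENSOR_NCHW, 4, Int32[32, 64, 3, 3])
tensor_key = (:CUDNN_TENSOR_NCHW, :CUDNN_DATA_FLOAT, 4, Int32[7, 32, 30, 30])

key = (filter_key, tensor_key)
AlgoPerfCache[key] = [1, 2, 3]               # store some (dummy) results

# Lookup succeeds even with a structurally equal but distinct key object,
# because tuples and arrays hash by content:
@assert haskey(AlgoPerfCache, deepcopy(key))
```

Pointer-based keys, by contrast, only match while the exact same C struct is alive, which is what breaks persistence across sessions.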
That code is a simple example meant to motivate built-in language autotuning. Our production matmul algorithms use that functionality, but their implementation is quite different and more complicated.
Thanks @lattner, excited to see more details as they become available, especially for the gpu/tpu/xpu autotuning! |
Me too! We're seeing amazing results, can't wait to share them. |
I think my point was unclear, sorry. The main observation is that it is not necessary to cache the descriptors themselves but just their contents, which should make deserialization much easier. Will go into more detail on #1948. |
Problem description.
In deep learning (DL), we usually define a computation using high-level instructions (e.g. with Flux.jl) and then map the computation over many elements.
It's the compiler's job to figure out how to actually execute this in a performant manner, which has led to the invention of e.g. XLA, and is also at the core of the motivation of the recently announced Mojo programming language.
Usually, the only way to find out the details of the execution is to run micro-benchmarks.
In the CPU world, for example, ATLAS has long been a way to automatically tune execution details to new hardware.
(PS: I believe LoopVectorization.jl does something similar, but looking at the code I didn't actually find any on-the-fly benchmarking? @chriselrod)
cuDNN.jl currently does something similar only for convolutions, delegating the specifics to the actual cuDNN library (e.g. here).
Performance results are cached in a local dictionary and discarded at the end of the session.
Currently, I'm not sure how long the microbenchmarks take compared to the startup time in general (I found it a bit hard to measure).
In the future, however, I think we may see more involved performance optimizations, and I think it would be interesting to (optionally) store the benchmark results to disk.
Solution proposal.
We already have three cache variables: cudnnConvolutionFwdAlgoPerfCache, cudnnConvolutionBwdDataAlgoPerfCache, and cudnnConvolutionBwdFilterAlgoPerfCache.
Conceptually, they map from a tuple of descriptors (basically structs describing the convolution parameters, weight tensor, and data tensor) to performance results for different implementations.
In general, it should be relatively straightforward to repopulate them from disk.
However, the key is currently not the Julia struct/tuple but a pointer to a C struct, which dies at the end of the Julia session.
This should be an easy fix, though: change the key from the pointer to the actual descriptor contents.
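Once the keys are plain Julia values, persisting the caches becomes a matter of writing them out at session end and reading them back at startup. A minimal sketch using the Serialization stdlib; the cache layout, value type, and file path here are hypothetical, not the actual cuDNN.jl internals:

```julia
using Serialization

# Hypothetical cache: descriptor-content tuples => chosen algorithm id.
# The real caches store cuDNN algorithm-perf results rather than an Int.
cache = Dict{Tuple,Int}()
cache[((:CUDNN_DATA_FLOAT, Int32[32, 64, 3, 3]),
       (Int32[0, 0], Int32[1, 1]))] = 7

path = joinpath(mktempdir(), "conv_algo_cache.bin")
serialize(path, cache)          # write at session end (or after each insert)
restored = deserialize(path)    # repopulate at session start
@assert restored == cache
```

Note that Serialization's format is not guaranteed to be stable across Julia versions, so a real implementation would likely want to version the file and fall back to re-benchmarking on a mismatch.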
A proof-of-concept implementation is provided at #1948 .
Broader impact.
I think if Julia wants to stay competitive in this domain, especially with the Mojo language coming up, more optimizations for GPU execution must be available without paying a high compile-time cost.
In the future I hope that there can be further research projects related to different ways of executing on the hardware.
I'm happy for any feedback on the topic in general, and also for this issue in particular.