
Nervana's Neon and Winograd #93

Open
soumith opened this issue Mar 4, 2016 · 68 comments

@soumith
Owner

soumith commented Mar 4, 2016

After the serious perf improvements across the board in NVIDIA's CuDNN R4, I suppose Nervana weren't too happy to be left behind.
They've just released (as part of Neon) their Winograd-based kernels, which bring a non-trivial improvement in performance.
Their blog post can be found here; Scott will be going into full detail there about the technical implementation, the challenges, and data points showing no side effects of using these kernels in terms of convergence. The implementation seems very sophisticated, and quite a challenge.

I've benchmarked them independently, and here are the numbers (times in ms, lower is better; speedup = CuDNN time / Neon time):

FP-16

| Network Type | Nervana Neon | CuDNN R4 (Torch) | Speedup |
| --- | --- | --- | --- |
| AlexNet | 78 | 71 | 0.91x |
| Overfeat | 176 | 242 | 1.37x |
| VGG-A | 254 | 471 | 1.85x |
| Googlenet-1 | 230 | 462 | 2.00x |

FP-32

| Network Type | Nervana Neon | CuDNN R4 (Torch) | Speedup |
| --- | --- | --- | --- |
| AlexNet | 87 | 81 | 0.93x |
| Overfeat | 211 | 268 | 1.27x |
| VGG-A | 320 | 529 | 1.65x |
| Googlenet-1 | 270 | 470 | 1.74x |

It's really cool that they're still squeezing more performance out of this generation of hardware.
They seem to have real wins when the network uses 3x3 convolutions.

At this point, I expect this to be the last round of software optimizations for small convolutions, considering that they're hitting the peak limits of the GPU, but I'm happy to be surprised :)

Full logs are checked in to the nervana/ folder.

@scott-gray

Sorry, the full blog post isn't quite done yet, so we put up a quick performance overview post instead. But do feel free to download the new neon and try out the kernels (and even browse the source if you're curious). I'm almost done with a new Winograd kernel that should speed things up quite a bit more for smaller 3x3 layers (like in googlenet).

@andravin

andravin commented Mar 4, 2016

@soumith Why have the cuDNN R4 Googlenet-1 numbers changed?

@soumith
Owner Author

soumith commented Mar 4, 2016

@andravin my copy-paste screwup. fixed.

@andravin

andravin commented Mar 4, 2016

OK, thanks, I wasn't expecting the Neon Googlenet-1 speedup to decrease. ;-) So these numbers make more sense.

@scott-gray

I should point out that Andrew here not only worked through all the math for Winograd, but our long discussions were also pretty integral to the successful design of these kernels.

Oh, and while we're giving credit, we were just discussing that these two guys probably deserve as much credit as Shmuel Winograd for working out the original math:

https://en.wikipedia.org/wiki/Andrei_Toom
https://en.wikipedia.org/wiki/Stephen_Cook

I'll go into a bit more detail on this in the full blog.

@jdemouth

jdemouth commented Mar 4, 2016

Thanks Scott and Andrew. We also see good gains for 3x3 with Winograd.

@andravin

andravin commented Mar 4, 2016

That's great, Julien. Can't wait to see Winograd/Cook/Toom in cuDNN. :-)

It has been almost a year since I discovered this new approach to convnet acceleration, and it is great to see these ideas having a real impact on performance now. Everybody should check out Scott's F(4x4,3x3) implementation, it is extremely clever.

@jdemouth

jdemouth commented Mar 4, 2016

It is indeed amazing that you were able to get F(4x4, 3x3) to work. I'm really impressed because I know for a fact that F(2x2, 3x3) is already super hard :). I am really looking forward to making it work in cuDNN.

@jdemouth

jdemouth commented Mar 5, 2016

@scott-gray Awesome work! The way you specialize warps in the F(4x4,3x3) kernel is just brilliant! I'm super excited and it's going to be fun to implement such a scheme for cuDNN :) and bring that speedup to the different frameworks.

@andravin

andravin commented Mar 5, 2016

Now if you guys could put your heads together and figure out a way to end the NCHW vs CHWN vs NHWC wars. There must be some way to equip these kernels with pluggable front-ends and back-ends that understand the tensor order for load / store and leave the computation pipeline unchanged.

@scott-gray

I already have a fully fused version of that kernel that I should finish debugging this weekend. I'm hoping it will bring fprop/bprop performance for fp32 much closer to the fp16 level, as well as perform much more consistently across sizes of the C/K dimensions. On the weight-update side, fusion probably isn't possible due to the extremely strided memory access pattern required and no shared memory left to mitigate it. But 2 out of 3 fast operations isn't bad. In NHWC it would be the update operation that is fast and the other two that are slower.

I guess there's a chance that in NCHW the overlaps in the super-tiling might make full fusion possible in update, but on the downside you're slower on fprop/bprop for smallish HW, because your effective tile size needs to be much bigger and you end up with a lot of zero overlap. At very small N, NCHW and CHWN are pretty equivalent. But I'm confident that CHWN is fastest overall for Winograd.

For direct conv, I'm starting to think that NHWC might be best for good performance across all minibatch sizes. There's plenty of shared memory around to efficiently transpose in place at no cost and having C as the inner dimension means that you minimize the slicing logic for all values of N and not just larger ones. But CHWN is just as good for N bigger than about 8 or so.

Also, having HWN contiguous means that you can do 1x1 conv super efficiently in a basic gemm kernel.
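
Roughly, in NumPy terms (just a sketch of the shape math with made-up sizes, not actual neon code):

```python
import numpy as np

# Hypothetical sizes: C input channels, K output channels, H x W spatial, N minibatch.
C, K, H, W, N = 64, 128, 28, 28, 32

x = np.random.randn(C, H, W, N).astype(np.float32)   # CHWN activations
f = np.random.randn(C, K).astype(np.float32)         # 1x1 filters, C x K

# With H, W, N contiguous, a 1x1 conv is a single plain GEMM over the C axis:
y = (f.T @ x.reshape(C, H * W * N)).reshape(K, H, W, N)
print(y.shape)   # (128, 28, 28, 32)
```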

If I had to pick one, I'd stick with what I have: CHWN. But longer term it probably makes sense to have them all implemented to best suit the needs of the task.

Speaking of longer term, it would be nice if the community migrated to a fully open sourced implementation for all of this. This stuff is just too important to the progress of the field for it to be locked away in proprietary implementations. The more people working together on this the better for everyone. There's plenty of room to compete on the hardware implementation side.

@benanne

benanne commented Mar 5, 2016

Also, having HWN contiguous means that you can do 1x1 conv super efficiently in a basic gemm kernel.

The same goes for NHWC though, right? If I'm not mistaken, the order of these dimensions doesn't matter as long as they are contiguous. This is the TensorFlow default; I don't know if any other frameworks use it though. I think CHWN might not get adopted very easily, because everyone is used to having the leading dimension be the batch dimension nowadays (the only established framework I know of that deviates from this is cuda-convnet, which isn't used much anymore).

@jdemouth

jdemouth commented Mar 5, 2016

@benanne: You're right, NHWC works fine with 1x1 (as does CHWN). The issue we're having in cuDNN with 1x1 is that NCHW has the C "in the middle". Today, our direct convolution treats 1x1 like 3x3 or 5x5, and we have complex logic that we could simplify for 1x1. Scott's CHWN is "easier" to deal with in many cases. We also suffer from the fact that our filters are KCRS when CRSK (used by Scott) would be better.

On paper, NHWC has advantages over NCHW thanks to the fact that data is partly contiguous in memory. I'm only worried that NHWC could have a bad impact on the behavior of the TEX cache, as fetching 8xFP16 (8 is the unrolling factor of the main loop - except for Scott's new F(4x4,3x3)) is only 16B, which is not so great with respect to the cache line size.

@andravin, @scott-gray: Indeed, I think we should sit together and find a way to get the awesome performance of Scott's implementations for CHWN available to popular frameworks. We'll all be at GTC, for example. Making it open-source is a long discussion ;)

@scott-gray

Right, I just meant grouped. So that's a plus for both NHWC and CHWN. You're right in that there is a lot of cuda code written for the cuDNN layout, and migrating away from that will likely be painful. But for some writing fresh code to a different layout might be a good option if they know they'll get a bit more speed. As I said, ideally you have the option for any layout.

Anyway, if you guys are happy sticking with NCHW, then we'll be happy to continue topping you on the benchmarks :) The neon framework isn't burdened by any legacy code and everything is being built from the ground up for speed. And with the new graph backend we're working on hopefully we can substantially improve on the ease of use as well (not that it's too bad right now).

@scott-gray

@jdemouth You're not thinking creatively enough with leveraging shared memory to read deeper than 8 lines. You can cast any gemm or conv operation as a batched gemm and sum the results prior to writing out. The batch dimension in this case is just alternating groups of 8 rows.
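
To illustrate the idea in plain NumPy (a toy sketch with arbitrary sizes, not the actual kernel): split the reduction dimension into groups of 8 rows, treat each group as one entry of a batched gemm, and sum the partial products before writing out.

```python
import numpy as np

M, K, N, G = 64, 64, 64, 8           # G = group depth (the 8-row unroll)
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)

# Batched view: alternating groups of G rows of the reduction dimension.
A_b = A.reshape(M, K // G, G).transpose(1, 0, 2)   # (K//G, M, G)
B_b = B.reshape(K // G, G, N)                      # (K//G, G, N)

# Each batch entry is an independent (M,G) x (G,N) gemm; summing the
# partial products over the batch recovers the full gemm.
C_full = np.einsum('bmg,bgn->mn', A_b, B_b)
assert np.allclose(C_full, A @ B, atol=1e-3)
```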

Happy to chat at GTC. I was looking forward to attending your talk. And I guess I should advertise my own talk here. I was scheduled for an hour, but that was mysteriously shortened to just 25 minutes, so I won't be able to go into as much depth as I'd like. But on the other hand it makes preparing for it a lot easier, which means more time for writing kernels.

@jdemouth

jdemouth commented Mar 5, 2016

@scott-gray: Funny that you just mentioned that because I was thinking along those lines when you posted your comment :). I already have batched GEMM code for some scenarios. Not to mention my Winograd implementation for NCHW.

The DL track is pretty packed, that's the reason why your slot was shortened from 50 to 25 minutes. Like all the other talks. My talk was even cancelled.

@scott-gray

Yah, I thought of the technique while developing the first Winograd kernel. It's what I meant above about being able to leverage shared memory for an in-place transpose. I actually already have some fp16 32x32 gemm tiles that use this and get over 5 Tflops with a minibatch of 32. The TN tile (col major) can even outperform the 128x128 tile because it halves the overall number of strided accesses to DDR at any one time.

I haven't had a chance to release these yet since I still need to finish the complete set. Hopefully I'll get to that in the next week or so. I have new direct conv kernels I want to build first. The cuDNN advantage at small minibatch (for non-3x3s1) will soon be going away :) The goal is end-to-end training of convnets at very small minibatches at full utilization.

@benanne

benanne commented Mar 5, 2016

The issue we're having with cuDNN with 1x1 is that NCHW has the C "in the middle". Today, our direct convolution is similar to 3x3 or 5x5 for 1x1 and we are having a complex logic that we could simplify for 1x1. Scott's CHWN is "easier" to deal with in many cases.

Right -- what I was saying is that, if NHWC is almost as good as CHWN, the former might be adopted much more quickly. Because TensorFlow already uses it, and because many people would find it "more natural" to have the batch size as the leading dimension.

@scott-gray

I actually really like NHWC a lot, but it means I can't use my fancy new fully fused fprop/bprop F(4x4,3x3) with it. And I have a feeling the performance with it will be too good to throw away. The current partially fused kernels are trivial to convert to any layout. Just modify the external transform cuda-c code and then tweak a few lines in the batched gemm assembly for setting up the output pointers.

@jdemouth

jdemouth commented Mar 5, 2016

@benanne: I agree with you... We want a layout which can be adopted by the community and which is good for performance. NCHW is widely adopted (and we are making it faster at each new release of cuDNN). CHWN is easier to deal with in many cases and Scott is pushing its performance to awesome levels but it is a somewhat weird layout. Maybe NHWC brings the best of both worlds together :).

@jdemouth

jdemouth commented Mar 5, 2016

@scott-gray: What prevents you from using your new fused kernel with NHWC (except for the time to write it)? Is it a fetch issue due to your need for LDG.128?

Btw, were the numbers quoted in the benchmark all obtained using F(4x4,3x3) or do you use F(2x2,3x3) for some of the layers?

@benanne

benanne commented Mar 5, 2016

Here's another approach for tackling this "optimal layout" issue in frameworks using the computational graph paradigm (such as Theano and TensorFlow): stick with the canonical NCHW or NHWC layout on the surface, but have optimizations that insert alternative implementations using more efficient layouts, as well as the necessary reshape operations at the input and output side. Since many convolution and pooling operations usually follow each other (and elementwise nonlinearities are not affected by the layout), spurious reshapes can then be eliminated quite easily.
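
A toy sketch of such a pass (illustrative Python only, not Theano or TensorFlow code): represent layout changes as explicit transpose nodes, and let the optimizer drop adjacent pairs that cancel.

```python
# Each op is (name, perm): perm is the axis permutation it applies,
# or None for layout-agnostic ops like elementwise nonlinearities.
def cancel_transposes(ops):
    out = []
    for name, perm in ops:
        if perm is not None and out and out[-1][1] is not None:
            prev_perm = out[-1][1]
            # Drop the pair if this transpose undoes the previous one.
            if [perm[i] for i in prev_perm] == [0, 1, 2, 3]:
                out.pop()
                continue
        out.append((name, perm))
    return out

# NCHW -> CHWN before each fast kernel, CHWN -> NCHW after it; the
# back-to-back pair between conv1 and conv2 cancels out.
graph = [("to_chwn", (1, 2, 3, 0)), ("conv1", None),
         ("to_nchw", (3, 0, 1, 2)), ("to_chwn", (1, 2, 3, 0)),
         ("conv2", None), ("to_nchw", (3, 0, 1, 2))]
print(cancel_transposes(graph))
# [('to_chwn', ...), ('conv1', None), ('conv2', None), ('to_nchw', ...)]
```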

@andravin

andravin commented Mar 5, 2016

An API is also an important requirement for adoption. That is the real reason cuDNN has been so successful: it defined a low-level C API for deep learning primitives, and nobody else did. cuDNN is both the standard API and its only implementation.

If Neon kernels were wrapped in the cuDNN API then it would be trivial to support them in your favorite framework (provided they are sane about allowing different tensor formats).

Maybe the cuDNN API is not ideal, maybe we could do better. But coding to a standard API is key to providing fast kernels that framework maintainers can actually use.

@scott-gray

@jdemouth: I pick the best of both. Most of the time it's the 4x4 numbers. But for small HWN the 2x2 can be faster. Or I guess for small C/K too when the external transform isn't well amortized. But the fully fused kernel will solve that.

The fused kernel has no shared memory left over to do the in-place transpose required for NHWC in fprop/bprop. For update, the data is laid out fine, and that kernel could be fused instead. But that's just 1 of 3 ops instead of 2 of 3. Plus, you want fprop to be the fastest for use in inference. Also, fusing update is much more problematic because there are a lot of predicates that need to be recomputed any time you change x or y.

@benanne That's basically what I recommended to the TF guys. But you definitely want to avoid dimshuffles between every op. But I guess your point is that the graph optimizer should be smart enough to eliminate dimshuffles that cancel each other out.

@andravin: I've wanted to put together an API but all of my time is devoted to writing kernels and just when I start to think things have stabilized enough to do this, someone comes along and asks you to implement some new fancy algorithm :)

@scott-gray

To elaborate on NHWC in fprop/bprop, the 2 image load warps are making 32*36 loads to distinct addresses, only 2 channels deep. That's way more than can fit in L1 so you end up fetching the same transaction 4 times and saturating both L2 and DDR traffic.

@andravin

andravin commented Mar 5, 2016

A standard API for deep learning primitives would also mean that frameworks would be able to support any GPU or hardware platform that implements the API. The fact that none of us are even thinking about that is another symptom of our dangerous monoculture.

@scott-gray

An API has definitely been on my mind. I just wanted to finish a complete set of kernels first. The only problem is that I keep changing the definition of complete. Anyway, I need to get some sleep. It's been nice chatting with you guys.

@jdemouth

jdemouth commented Mar 5, 2016

Indeed, it was great chatting with all of you. Thanks.

@scott-gray

Oh, and another interesting constraint is batch norm. Reducing over HWN is rather straightforward and fast with CHWN: it's just a reshape(C,-1).sum(axis=1). NCHW isn't too bad (but probably annoying). NHWC is a bit trickier to optimize, as axis=0 reductions lead to expensive strided memory access patterns if done naively.
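
In NumPy terms, roughly (an illustrative sketch with made-up sizes, not kernel code):

```python
import numpy as np

C, H, W, N = 64, 28, 28, 32                          # made-up sizes
x_chwn = np.random.randn(C, H, W, N).astype(np.float32)

# CHWN: the per-channel mean is a contiguous row reduction.
mean_chwn = x_chwn.reshape(C, -1).sum(axis=1) / (H * W * N)

# NHWC: the same reduction is an axis=0 sum, which is strided if done naively.
x_nhwc = x_chwn.transpose(3, 1, 2, 0)                # (N, H, W, C)
mean_nhwc = x_nhwc.reshape(-1, C).sum(axis=0) / (H * W * N)

assert np.allclose(mean_chwn, mean_nhwc, atol=1e-4)
```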

Another interesting point about the Nervana kernels is that they all have the "mean" component of batchnorm optionally compounded directly inside the conv kernel at no cost. Currently this is done with atomics, but I have a deterministic approach I want to switch to that should be just as fast. Many other common operations can be compounded inside gemm/conv kernels. Incidentally, all the kernels can now be run in full deterministic mode with virtually no change in performance.

Anyway, I'm looking forward to Soumith's new set of benchmarks. There are a ton of optimizations that we've made in neon that the current set just doesn't expose.

@scott-gray

Oh, and another thought. I recently wrote a very fast generalized dimshuffle (src) routine for neon that implements the full numpy.transpose spec. So if there is some custom kernel you want to write that is more natural in one format than another, it's now easy to get that. And as long as you're not doing it on every layer, there's negligible impact on speed. For example, ROI pooling for RCNN networks is far easier to implement with NHWC. But even if you are using it a lot, it's about as fast as an fprop_relu op.
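
Since it follows the numpy.transpose spec, the call pattern is just an axis permutation; for example (NumPy shown purely for illustration, with made-up sizes):

```python
import numpy as np

C, H, W, N = 64, 14, 14, 32                          # made-up sizes
x_chwn = np.arange(C * H * W * N, dtype=np.float32).reshape(C, H, W, N)

# numpy.transpose spec: pass the permutation of axes you want.
x_nhwc = np.transpose(x_chwn, (3, 1, 2, 0))          # CHWN -> NHWC
x_nchw = np.transpose(x_chwn, (3, 0, 1, 2))          # CHWN -> NCHW
print(x_nhwc.shape, x_nchw.shape)                    # (32, 14, 14, 64) (32, 64, 14, 14)
```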

@bhack

bhack commented Mar 19, 2016

@jdemouth Yes, I'm only wondering whether a vendor-neutral "lowest common denominator" exists and what it would be. It could still leave hardware vendors a margin to compete, while giving developers interested in a vendor-neutral solution some chance at a unified API.

@jdemouth

I get your point. So far I do not see the solution, but having a higher-level solution - if it is interesting for vendor neutrality - would surely help with development time and innovation.

@bhack

bhack commented Mar 19, 2016

OpenVX was a good example of interface collaboration that involved many stakeholders (Nvidia included). But it has completely missed the opportunity to cover deep learning needs in its current release. In the meantime, Google is trying to push an LLVM subgroup for StreamExecutor with "deep learning" canned operations. See the last messages in tensorflow/tensorflow#22 and @henline's bootstrap doc at https://github.com/henline/streamexecutordoc

@scott-gray

It is simply not possible to develop efficient dense linear algebra kernels with the current intermediate representations available (like ptx). That's not to say that an IR couldn't be developed that would make it possible. Pascal will be largely binary compatible with the Maxwell ISA, but when Volta rolls around I may adapt my assembler to let you target both architectures with one language. Though maybe it's not worth the effort since new hardware always frees up more resources making kernel design decisions very different.

I guess the real key would be to have an IR that can target both Nvidia and AMD. I've read through the GCN spec and there's a lot of overlap, but again the differences are big enough to make a large impact in how you would design kernels. But still, having a common language would make development for both targets much easier and perhaps allow some code sharing.

@bhack

bhack commented Mar 19, 2016

The common IR that targets both NVIDIA and AMD is SPIR-V (and it was co-designed). But it seems that it is not enough to achieve this level of optimization. So if the biggest multi-stakeholder effort on a common IR is not enough, I think it is better to extend a standard API at a higher level - for example by pushing, in the next version of OpenVX, support for a unified tensor operations API that fits deep learning needs.

@jdemouth

Yes, the high level API approach looks more promising.

@scott-gray

I think there's still room for better abstractions at the lowest level. Deep learning is currently somewhere between 60-90% dense linear algebra of some kind. And the parts that aren't will soon largely be merged into the dense kernels for better efficiency. That doesn't leave a lot left for higher level gpu languages.

Aside from simple operation compounding, there may be other innovations we'll want to make with these dense kernels. We know the brain's connections are far from the simple feed-forward nets currently in use.

@bhack

bhack commented Mar 19, 2016

@jdemouth If you like the idea you can try to talk about this with Thierry Lepley. He is the Nvidia representative in the Khronos OpenVX standardization group.

@jdemouth

Sure. Thierry and I are both in France :)

@bhack

bhack commented Mar 20, 2016

And I think that we could also start to consider the interesting results from binarization.

@scott-gray

I plan on implementing a set of fast binary gemm and conv kernels as soon as I'm done with the small minibatch floating point kernels I'm working on.

@bhack

bhack commented Mar 20, 2016

This is starting to become interesting on CPU as well. /cc @xianyi

@hengck23

@scott-gray
"I plan on implementing a set of fast binary gemm and conv kernels"
I am interested in this work. Are you planning for:
i) real input + binary weights, or
ii) binary input + binary weights (XNOR-Net)?

@bhack

bhack commented Mar 23, 2016

@xianyi On AVX-512 I think there is an XNOR instruction, __mmask16 _mm512_kxnor. Do you think this kind of operator could be included in OpenBLAS?

@scott-gray

@hengck23 The XNOR/POPC variety. It should be pretty easy to implement, with the only real challenge being implementing padding in convolution. It will all just be cuda-c. I don't think there's a need for assembly optimization until we have a full-throughput integrated xnor/popc/accumulate instruction.
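
For reference, the arithmetic being replaced, sketched in NumPy (not the planned CUDA kernels): with ±1 values packed as bits, a dot product becomes xnor plus popcount.

```python
import numpy as np

n = 256
a = np.random.randint(0, 2, n).astype(np.uint8)   # bit 1 encodes +1, bit 0 encodes -1
b = np.random.randint(0, 2, n).astype(np.uint8)

# Reference dot product on the +/-1 values.
ref = int(np.dot(2 * a.astype(np.int32) - 1, 2 * b.astype(np.int32) - 1))

# XNOR/POPC version: (# matching bits) - (# mismatching bits).
matches = int((1 - (a ^ b)).sum())                # xnor, then popcount
assert ref == 2 * matches - n
```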

@bhack

bhack commented Mar 23, 2016

@scott-gray

@bhack yup, I'm aware of that and am actually working with Matthieu on a much faster and more flexible version of that code. I just need to push through this other work first. Which, btw, is working out pretty well. I'm getting really good L1 cache utilization in CHWN for small N. This keeps the power levels lower and lets the clock run faster. I'm starting to think NHWC is really only good for N<4. Below 4, your slicing logic starts to eat into compute without NHWC.

@hughperkins
Contributor

hughperkins commented May 25, 2016

bhack wrote:

The common IR that target both NVIDIA and AMD is SPIR-V (and was co-designed).

Interesting. Seems I should take a look at this...

Andrew wrote:

An API is also an important requirement for adoption. That is the real reason cuDNN has been so successful, it defined a low level C API for deep learning primitives, and nobody else did. cuDNN is both the standard API and its only implementation.
If Neon kernels were wrapped in the cuDNN API then it would be trivial to support them in your favorite framework (provided they are sane about allowing different tensor formats).
Maybe the cuDNN API is not ideal, maybe we could do better. But coding to a standard API is key to providing fast kernels that framework maintainers can actually use.

Agree with these points.

Note: if you can allow me to pass a cl_mem to your API, I'd be happy to figure out a way to use the neon convolutional kernels from my OpenCL frameworks, i.e. clTorch and DeepCL. I'd load it in some pluggable way, so it would be used when running on a CUDA GPU.

Scott wrote:

I recently wrote a very fast generalized dimshuffle (src) routine for neon that implements the full numpy.transpose spec. [...] it's about as fast as an fprop_relu op.

That sounds pretty fast. So on the whole the layout doesn't matter too much, and one can just shuffle before calling the convolutional implementation, if necessary?

Scott wrote:

It is simply not possible to develop efficient dense linear algebra kernels with the current intermediate representations available (like ptx).

Ok, you mean that SPIR-V plausibly targets the PTX level, rather than the SASS level, and therefore can't get results much better than the current CUDA and OpenCL high-level languages?

@bhack

bhack commented May 25, 2016

Khronos maintains a bidirectional LLVM to SPIR-V translator. There are also some efforts inside Khronos, on the OpenVX roadmap, to support DNNs.

@hughperkins
Contributor

bhack wrote:

Khronos maintains a bidirectional LLVM to SPIR-V translator.

If I understand correctly, that will translate between OpenCL and SPIR-V? What advantage(s) do you see in doing this? At the moment, the source code I want to use is all written in SASS. I was thinking of porting it to OpenCL (admittedly, saying 'only 1300 lines of sass' is a bit like saying NIPS only needs 8 pages ...), but maybe I should port it to SPIR-V instead???

Thoughts on how usable/mature/portable SPIR-V is right now? I don't know much beyond the name and a one-line description...

@bhack
Copy link

bhack commented May 25, 2016

It will also translate compatible LLVM bitcode to SPIR-V.
SPIR-V is also the machine model for Vulkan.

@bhack

bhack commented May 25, 2016

@hughperkins If you want a general overview of these different IRs, you can see https://www.linkedin.com/pulse/era-intermediate-languages-vincent-hindriksen. I think SASS is more at the GCN level in the AMD world. But the majority opinion here is that we cannot go so fast with an IR, because GPU assembly ninjas are still better than compilers at optimizing 😃

@hughperkins
Contributor

If you want a general overview of these different IRs, you can see https://www.linkedin.com/pulse/era-intermediate-languages-vincent-hindriksen.

Good information. Thanks!

@hughperkins
Contributor

hughperkins commented May 25, 2016

If you want a general overview of these different IRs, you can see https://www.linkedin.com/pulse/era-intermediate-languages-vincent-hindriksen.

Probably I should target AMD IL or HSAIL. However, I lack an AMD GPU. So ...
