
Calculate utilization #10

Closed
ozabluda opened this issue Mar 29, 2016 · 18 comments

Comments

@ozabluda

I would like to calculate utilization, but I can't find the AVX2 frequency for the i7-6700K. For example, if the AVX2 frequency were 4.0 GHz (which it isn't), the max FLOP/s would be:

32 FLOP/clock * 4.0 GHz * 4 cores = 512 GFLOP/s (which it isn't)

See discussion, starting here:
soumith/convnet-benchmarks#59 (comment)
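The peak number above comes from straightforward arithmetic; a minimal sketch, assuming Haswell/Skylake-style AVX2 with two FMA ports per core (the 4.0 GHz AVX2 clock is exactly the assumption under discussion):

```python
# Hypothetical peak SP FLOP/s for an i7-6700K running AVX2/FMA code.
fma_units = 2          # FMA execution ports per core
sp_lanes = 8           # single-precision lanes per 256-bit AVX2 register
flops_per_fma = 2      # one multiply + one add
cores = 4
freq_ghz = 4.0         # assumed AVX2 clock (the open question in this issue)

flop_per_clock = fma_units * sp_lanes * flops_per_fma   # 32 FLOP/clock/core
peak_gflops = flop_per_clock * freq_ghz * cores
print(peak_gflops)     # 512.0
```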

@Maratyszcza (Owner)

The CPU frequency for tests was set at 4.0 GHz (fixed in EFI menu). So, you can assume 512 SP GFLOPS peak.

@ozabluda (Author)

"AVX2 clock" is different from the CPU frequency; it is ~40% lower. Please search the reference above for "AVX" to find the discussion.

Even after we figure out the AVX2 clock, there is still ambiguity between "AVX Max All Core Turbo" and "AVX base clock" (unless it can be fixed in the BIOS as well; please check). In that thread, @andravin was able to track it down for Haswell E5 CPUs (even the Intel people participating in that thread had trouble figuring it out for other CPUs).

@Maratyszcza (Owner)

IIRC, the AVX2 clock applies only to Xeon CPUs, and the Core i7-6700K is a desktop part. I'll run a benchmark later today to measure the practical peak for FMA3 code.

@ozabluda (Author)

That would be great. Please measure with 1, 2, and 4 cores. If it indeed turns out to be the same as the CPU clock, it may mean that Haswell Xeon E5 has a mysterious, bizarre AVX2 bottleneck (not simply total power/TDP, etc.) that is not present anywhere else.

@Maratyszcza (Owner)

@ozabluda It turned out I was wrong about the frequency being fixed: the timings in the readme were benchmarked with TurboBoost enabled. However, I didn't find any evidence of reduced AVX2 clocks. Here are the FLOPS I get on a synthetic benchmark:

| Threads | SP GFLOPS |
|--------:|----------:|
| 8       | 510       |
| 4       | 507       |
| 2       | 254       |
| 1       | 134       |

The benchmark runs for about 15 seconds, should be enough for AVX2 underclocking to kick in, if the CPU had it.
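Dividing those measurements by the assumed 512 GFLOP/s all-core peak gives the utilization per thread count (a sketch; note the single-thread figure exceeds the 128 GFLOP/s per-core peak at 4.0 GHz, consistent with TurboBoost being enabled):

```python
# Utilization implied by the synthetic-benchmark numbers above,
# against a 512 GFLOP/s all-core peak at 4.0 GHz.
peak_gflops = 512.0
measured = {8: 510, 4: 507, 2: 254, 1: 134}  # threads -> SP GFLOPS

for threads in sorted(measured):
    pct = 100.0 * measured[threads] / peak_gflops
    print(f"{threads} threads: {pct:.1f}% of all-core peak")
```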

@Maratyszcza (Owner)

Checking frequency while the synthetic benchmark is running shows that CPU downclocks to base frequency (4.0 GHz), but not below it.

@andravin

Here is the best response I got for my question about AVX2 clock rates on "client" CPUs: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/596383

Basically, the consensus seems to be that there is no AVX2 clock rate for i7 or other client parts.

@andravin

andravin commented Apr 2, 2016

But the AVX clock has returned on the new Intel Xeon E5 v4 CPUs, i.e., big Broadwell. So the pattern seems to be that the AVX clock is a limitation peculiar to the many-core Xeon CPUs.

http://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-review/3

@ozabluda (Author)

ozabluda commented Apr 6, 2016

Yes, the strange and mysterious AVX throttling on many-core Xeon CPUs is back. As I explained before, I think it's neither total power nor TDP. Now that we know it's not present on 4-core desktop parts, it's even more mysterious, as it can't be a per-core local bottleneck.

> On Haswell, one AVX instruction on one core forced all cores on the same socket to slow down their clockspeed by around 2 to 4 speed bins (-200, -400 MHz) for at least 1 ms, [...]. On Broadwell, only the cores that run AVX code will be reducing their clockspeed, allowing the other cores to run at higher speeds.

http://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-review/3

@jeffhammond

See Table 3 in Intel® Xeon® Processor E5 v3 Product Family Processor Specification Update for Xeon processor AVX frequency info.

@ozabluda (Author)

ozabluda commented Jun 28, 2016

Here is a long-promised utilization calculation. I assumed 512 GFLOP/s peak.

[attached chart: nnpack_utilization]

FFT for 3x3 filters, 16x16 tile is plausible:
3x3 filter gives 14x14 outputs per tile.
1.5 mults/input x (16/14)^2 inputs/output = 1.96 mults/output
9 mults/output / 1.96 mults/output = 4.59 theoretical speedup for 3x3 convolution using 16x16 tile

FFT for 3x3 filters, 8x8 tile is plausible:
3x3 filter gives 6x6 outputs per tile.
1.5 mults/input x (8/6)^2 inputs/output = 2.67 mults/output
9 mults/output / 2.67 mults/output = 3.38 theoretical speedup for 3x3 convolution using 8x8 tile

Winograd is also well within theory (300-400%, depending on the layer).
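The per-tile arithmetic above generalizes to a small helper (a sketch; `mults_per_input` is 1.5 assuming Gauss's trick for complex multiplication, which is questioned later in this thread):

```python
# Max theoretical asymptotic speedup of FFT-based convolution over direct
# convolution, for a filt x filt filter computed with tile x tile FFT tiles.
def fft_speedup(tile, filt=3, mults_per_input=1.5):
    out = tile - filt + 1                             # outputs per tile side
    mults_per_output = mults_per_input * (tile / out) ** 2
    return (filt * filt) / mults_per_output           # direct mults / FFT mults

print(fft_speedup(16))  # ~4.59, the 16x16-tile number above
print(fft_speedup(8))   # ~3.38 (exactly 27/8 = 3.375)
```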

@andravin

andravin commented Jul 1, 2016

Thanks for the numbers @ozabluda. What was the batch size in these experiments? If NNPACK is intended for inference, then it would be useful to see speed as a function of batch size, as a typical inference application will use a small batch size (possibly a single image) or a range of small batch sizes (i.e., however many images are in a request queue). Edit: So it would be interesting to see speed tests for batch sizes N = 1, 2, 4, 8, ...

Are we sure NNPACK is doing Gauss's complex multiplication now? Pretty sure it wasn't originally, so that would make FFT 2 mults/input instead of 1.5.

So the NNPACK Winograd algorithm was F(6x6,3x3) last I checked, with theoretical speedup 5.07. Is that what you used for the Winograd utilization calculations? Looks like the effective utilization for Winograd on AlexNet and VGG is in the 80%-140% range... that is pretty low if the max is 507%.

These ideas should help with performance:

  • For small CPUs (or small NxHxW), use smaller-tile algorithms (F(4x4,3x3) or F(2x2,3x3)) to reduce transform data per tile or filter, and process a chunk of tiles/filters at a time, small enough to fit in L3 cache.
  • Store data/filters in fp16, int16, or int8 format to reduce memory bandwidth and to fit more transform data in cache.
  • Use the integrated GPU and fp16 arithmetic.

@Maratyszcza (Owner)

@andravin IIUC, in this analysis @ozabluda used numbers from NNPACK README. The batch sizes are 128 for AlexNet & OverFeat and 64 for VGG-A.

NNPACK does use complex multiplication with 2 macs/input. The efficiency on complex matrix multiplication is higher than on real matrix multiplication, so using fast complex multiplication would deliver a smaller speedup than you probably expect. While it is an enhancement to consider, I have more rewarding optimizations to try first.

About the other suggestions:

  • Smaller tiles wouldn't help unless NNPACK processed all tiles of an image consecutively, which it doesn't do now. I'm working on processing all tiles right now, but I'm not sure I'll finish during the weekend, and I have little time during the week.
  • I'm considering supporting FP16 activations & weights. It would reduce bandwidth requirements, but data in cache would still have to be stored in FP32, because FP16<->FP32 conversions aren't free on CPU.
  • No plans for using the integrated GPU in NNPACK. FP16 arithmetic is not natively supported on x86 (only FP16<->FP32 conversion), but I'm thinking about providing inference functions for 8-bit fixed-point weights & activations.

@ozabluda (Author)

ozabluda commented Jul 1, 2016

@Maratyszcza:
> in this analysis @ozabluda used numbers from NNPACK README. The batch sizes are 128 for AlexNet & OverFeat and 64 for VGG-A.

The max minibatch of 64 for VGG-A is an unfortunate atavism of running benchmarks on a Titan, which couldn't fit a larger minibatch into RAM at some point. Recently I was able to fit 96, but it doesn't really matter: @soumith's VGG-A is just a benchmark, and in practice one would be using more RAM anyway. RAM is not a problem on a CPU.

@andravin:
> Are we sure NNPACK is doing Gauss's complex multiplication now? Pretty sure it wasn't originally, so that would make FFT 2 mults/input instead of 1.5.

@Maratyszcza:
> NNPACK does use complex multiplication with 2 macs/input.

In this case we have:

FFT for 3x3 filters, 16x16 tile is plausible:
9/(2*(16/14)^2)=3.44 max theoretic asymptotic speedup

FFT for 3x3 filters, 8x8 tile is plausible:
9/(2*(8/6)^2)=2.53 max theoretic asymptotic speedup

How does it all work with Alexnet-conv5 (13x13 pad 1), and VGG-A:conv5 (14x14 pad 1)?

@andravin:
> So the NNPACK Winograd algorithm was F(6x6,3x3) last I checked, with theoretical speedup 5.07.

My numbers were for F(4x4,3x3) for @scott-gray's tiling:
soumith/convnet-benchmarks#59 (comment)

For F(6x6,3x3) theoretical speedup is:
6*6*9/(8*8) = 5.06
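For reference, the general Winograd count behind these numbers (a sketch, not NNPACK code): 2-D F(m x m, r x r) needs (m+r-1)^2 multiplies per tile versus m^2 * r^2 for direct convolution.

```python
# Theoretical speedup of 2-D Winograd F(m x m, r x r) over direct convolution.
def winograd_speedup(m, r=3):
    return (m * m * r * r) / float((m + r - 1) ** 2)

print(winograd_speedup(6))  # 5.0625, i.e. the 5.06 above
print(winograd_speedup(4))  # 4.0, F(4x4,3x3)
print(winograd_speedup(2))  # 2.25, F(2x2,3x3)
```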

@Maratyszcza (Owner)

Maratyszcza commented Jul 1, 2016

@ozabluda convnet-benchmarks have been around for a long time, so it makes sense to use the same parameters just to make it easy to compare. On the test machine I have 64 GB of RAM, so I could fit pretty large batch sizes.

> How does it all work with Alexnet-conv5 (13x13 pad 1), and VGG-A:conv5 (14x14 pad 1)?

If the tile is larger than image + padding, some pixels of the tile remain unused, but we still have to do computations on them.

@ozabluda (Author)

ozabluda commented Jul 1, 2016

> How does it all work with Alexnet-conv5 (13x13 pad 1), and VGG-A:conv5 (14x14 pad 1)?
>
> If the tile is larger than image + padding, some pixels of the tile remain unused, but we still have to do computations on them.

Let's see if I got it right: Alexnet-conv5 (13x13 pad 1 = 15x15 input) produces 13x13 output:
with FFT 16x16: 9/(2*(16/13)^2)=2.97 max theoretic asymptotic speedup

but what happens with FFT 8x8, which produces 6x6 outputs per tile? Do we really need to use 9 tiles?
9 tiles * (2*8^2) MAC = 1152 MAC: FFT
9 MAC * 13^2 pixels = 1521 MAC: direct
1521 MAC / 1152 MAC = 1.32 max theoretic asymptotic speedup
If this is correct, I'll do the calculations for VGG-A:conv5.

@Maratyszcza (Owner)

@ozabluda My calculations are here. They are, however, based on numbers measured without Caffe integration.

@ozabluda (Author)

ozabluda commented Jul 1, 2016

Awesome. It appears that my asymptotic calculations are correct; at least they check out against your great spreadsheet, which does much more precise non-asymptotic calculations. Asymptotics are easier to grasp, so here are the asymptotic calculations for VGG-A:conv5 (14x14 pad 1 = 16x16 input), producing 14x14 output:

with FFT 16x16: 9/(2*(16/14)^2)=3.45 max theoretic asymptotic speedup.

with FFT 8x8, which produces 6x6 output, we need to use 9 tiles
9 tiles * (2*8^2) MAC = 1152 MAC: FFT
9 MAC * 14^2 pixels = 1764 MAC: direct
1764 MAC / 1152 MAC = 1.53 max theoretic asymptotic speedup
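The small-image arithmetic from the last two comments can be wrapped in one helper (a sketch; the function name is hypothetical, and it assumes 2 MACs per tile point, i.e. no Gauss's trick):

```python
from math import ceil

# Speedup of tiled FFT convolution over direct convolution for a small
# out_hw x out_hw output, with a filt x filt filter and tile x tile FFT tiles.
def fft_small_image_speedup(out_hw, tile, filt=3):
    step = tile - filt + 1                    # usable outputs per tile side
    tiles = ceil(out_hw / step) ** 2          # tiles needed to cover output
    fft_macs = tiles * 2 * tile * tile        # 2 MACs per tile point
    direct_macs = filt * filt * out_hw ** 2   # 9 MACs per output pixel
    return direct_macs / fft_macs

print(round(fft_small_image_speedup(13, 16), 2))  # 2.97: Alexnet-conv5, 16x16
print(round(fft_small_image_speedup(13, 8), 2))   # 1.32: Alexnet-conv5, 8x8
print(round(fft_small_image_speedup(14, 16), 2))  # 3.45: VGG-A:conv5, 16x16
print(round(fft_small_image_speedup(14, 8), 2))   # 1.53: VGG-A:conv5, 8x8
```

It reproduces all four numbers from this exchange, including the single-tile 16x16 cases where the tile covers the whole padded image.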
