Calculate utilization #10
The CPU frequency for tests was set at 4.0 GHz (fixed in the EFI menu). So, you can assume 512 SP GFLOP/s peak.
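For reference, the 512 GFLOP/s figure follows from the i7-6700K's four Skylake cores, each with two 256-bit FMA units:

$$
4\ \text{cores} \times 4.0\ \text{GHz} \times \underbrace{2\ \text{FMA} \times 8\ \text{SP lanes} \times 2\ \text{FLOP}}_{32\ \text{FLOP/cycle per core}} = 512\ \text{GFLOP/s}
$$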
"AVX2 clock" is different from CPU frequency. It is ~40% lower. Please search the reference above for "AVX" for the discussion. Even after we figure out AVX2 clock, there is still ambiguity between "AVX Max All Core Turbo" and "AVX base clock" (unless it can be fixed in BIOS as well, please check). In that thread @andravin was able to track it down for Haswell E5 CPUs (even Intel people participating in that thread had trouble figuring it out for other CPUs). |
IIRC, the AVX2 clock applies only to Xeon CPUs, and the Core i7-6700K is a desktop part. I'll run a benchmark later today to measure the practical peak for FMA3 code.
That would be great. Please measure 1, 2, and 4 cores. If it indeed turns out to be the same as the CPU clock, it may mean that Haswell Xeon E5 has a mysterious, bizarre AVX2 bottleneck (not simply total power/TDP, etc.) that is not present anywhere else.
@ozabluda It turned out I was wrong about the frequency being fixed: the timings in the README were benchmarked with Turbo Boost enabled. However, I didn't find any evidence of reduced AVX2 clocks. Here are the FLOPS I get on a synthetic benchmark:
The benchmark runs for about 15 seconds, which should be enough for AVX2 underclocking to kick in, if the CPU had it.
Checking the frequency while the synthetic benchmark is running shows that the CPU downclocks to its base frequency (4.0 GHz), but not below it.
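For anyone who wants to reproduce this kind of measurement, below is a minimal single-threaded sketch of an FMA3 throughput microbenchmark. This is not NNPACK's benchmark; it assumes GCC or Clang on x86-64 with AVX2/FMA (compile with `-O2 -mavx2 -mfma`) and measures one core only, so 1/2/4-core numbers would need a threaded variant or multiple simultaneous instances.

```c
/*
 * Minimal sketch of an FMA3 throughput microbenchmark (single core, single
 * thread). This is NOT NNPACK's benchmark; it only illustrates the idea.
 * Eight independent accumulator chains hide FMA latency (4-5 cycles) so both
 * FMA ports stay busy.
 */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const long iters = 1000000000L;  /* increase for a longer run (~15 s) */
    __m256 acc0 = _mm256_set1_ps(1.0f), acc1 = _mm256_set1_ps(1.1f),
           acc2 = _mm256_set1_ps(1.2f), acc3 = _mm256_set1_ps(1.3f),
           acc4 = _mm256_set1_ps(1.4f), acc5 = _mm256_set1_ps(1.5f),
           acc6 = _mm256_set1_ps(1.6f), acc7 = _mm256_set1_ps(1.7f);
    const __m256 a = _mm256_set1_ps(0.999f), b = _mm256_set1_ps(0.001f);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        /* 8 fused multiply-adds per iteration = 8 * 8 lanes * 2 FLOP = 128 FLOP */
        acc0 = _mm256_fmadd_ps(acc0, a, b);
        acc1 = _mm256_fmadd_ps(acc1, a, b);
        acc2 = _mm256_fmadd_ps(acc2, a, b);
        acc3 = _mm256_fmadd_ps(acc3, a, b);
        acc4 = _mm256_fmadd_ps(acc4, a, b);
        acc5 = _mm256_fmadd_ps(acc5, a, b);
        acc6 = _mm256_fmadd_ps(acc6, a, b);
        acc7 = _mm256_fmadd_ps(acc7, a, b);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
    double gflops = (double)iters * 128.0 / seconds * 1e-9;

    /* Reduce the accumulators and print them so the loop is not optimized away. */
    __m256 s01 = _mm256_add_ps(acc0, acc1);
    __m256 s23 = _mm256_add_ps(acc2, acc3);
    __m256 s45 = _mm256_add_ps(acc4, acc5);
    __m256 s67 = _mm256_add_ps(acc6, acc7);
    __m256 sum = _mm256_add_ps(_mm256_add_ps(s01, s23), _mm256_add_ps(s45, s67));
    float sink[8];
    _mm256_storeu_ps(sink, sum);
    printf("single-core FMA throughput: %.1f GFLOP/s (sink=%f)\n", gflops, sink[0]);
    return 0;
}
```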
Here is the best response I got to my question about AVX2 clock rates on "client" CPUs: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/596383 Basically, the consensus seems to be that there is no AVX2 clock rate for the i7 or other client parts.
But the AVX clock has returned on the new Intel Xeon E5 v4 CPUs, i.e., big Broadwell. So the pattern seems to be that the AVX clock is a limitation peculiar to the many-core Xeon CPUs. http://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-review/3
Yes, the strange and mysterious AVX throttling on many-core Xeon CPUs is back. As I explained before, I think it's neither total power nor TDP. Now that we know it's not present on 4-core desktop parts, it's even more mysterious, as it can't be a per-core local bottleneck.
See Table 3 in the Intel® Xeon® Processor E5 v3 Product Family Processor Specification Update for Xeon processor AVX frequency info.
Here is the long-promised utilization calculation. I assumed a 512 GFLOP/s peak. FFT for 3x3 filters with a 16x16 tile is plausible: FFT for 3x3 filters with an 8x8 tile is plausible: Winograd is also well within theory (300-400%, depending on the layer).
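For readers following along: the utilization figures here are presumably effective utilization, i.e., the FLOPs a direct convolution would perform divided by the peak compute available over the measured time, which is why a fast algorithm can exceed 100%. In my notation (N = batch, C = input channels, K = output channels, HxW = output size, 3x3 kernel):

$$
\text{utilization} = \frac{2 \cdot N \cdot C \cdot K \cdot H \cdot W \cdot 3 \cdot 3}{t_{\text{measured}} \times 512\ \text{GFLOP/s}}
$$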
Thanks for the numbers @ozabluda. What was the batch size in these experiments? If NNPACK is intended for inference, then it would be useful to see speed as a function of batch size, as a typical inference application will use a small batch size (possibly a single image) or a range of small batch sizes (i.e., however many images are in a request queue). Edit: So it would be interesting to see speed tests for batch sizes N = 1, 2, 4, 8, ... Are we sure NNPACK is doing Gauss's complex multiplication now? Pretty sure it wasn't originally, so that would make FFT 2 mults/input instead of 1.5. The NNPACK Winograd algorithm was F(6x6,3x3) last I checked, with a theoretical speedup of 5.07. Is that what you used for the Winograd utilization calculations? Looks like the effective utilization for Winograd on AlexNet and VGG is in the 80%-140% range, which is pretty low if the max is 507%. These ideas should help with performance:
@andravin IIUC, in this analysis @ozabluda used numbers from the NNPACK README. The batch sizes are 128 for AlexNet & OverFeat and 64 for VGG-A. NNPACK does use complex multiplication with 2 MACs/input. The efficiency on complex matrix multiplication is higher than on real matrix multiplication, so using fast complex multiplication would deliver a smaller speedup than you probably expect. While it is an enhancement to consider, I have more rewarding optimizations to try first. About the other suggestions:
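For reference, here is a minimal sketch (in C, not NNPACK code) of the two complex-multiplication schemes under discussion: the schoolbook product uses 4 real multiplies per complex point, while Gauss's trick uses 3 at the cost of extra additions, which is where the 2 vs. 1.5 multiplies-per-input figures come from.

```c
#include <stdio.h>

/* Schoolbook complex multiply: 4 real multiplies, 2 real adds. */
static void cmul_naive(float ar, float ai, float br, float bi,
                       float *cr, float *ci) {
    *cr = ar * br - ai * bi;
    *ci = ar * bi + ai * br;
}

/* Gauss's trick: 3 real multiplies, 5 real adds. */
static void cmul_gauss(float ar, float ai, float br, float bi,
                       float *cr, float *ci) {
    float k1 = br * (ar + ai);
    float k2 = ar * (bi - br);
    float k3 = ai * (br + bi);
    *cr = k1 - k3;
    *ci = k1 + k2;
}

int main(void) {
    float cr, ci;
    cmul_naive(1.0f, 2.0f, 3.0f, 4.0f, &cr, &ci);
    printf("naive: %.1f + %.1fi\n", cr, ci);   /* -5.0 + 10.0i */
    cmul_gauss(1.0f, 2.0f, 3.0f, 4.0f, &cr, &ci);
    printf("gauss: %.1f + %.1fi\n", cr, ci);   /* -5.0 + 10.0i */
    return 0;
}
```

Whether the trade is worth it depends on how close the complex GEMM already is to the multiply-throughput limit, which is @Maratyszcza's point above.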
@Maratyszcza
> in this analysis @ozabluda used numbers from the NNPACK README. The batch sizes are 128 for AlexNet & OverFeat and 64 for VGG-A.

The max minibatch of 64 for VGG-A is an unfortunate atavism of running the benchmarks on a Titan, which couldn't fit a larger minibatch into RAM at some point. Recently I was able to fit 96, which doesn't really matter, because @soumith's VGG-A is just a benchmark; in practice one would be using more RAM anyway. RAM is not a problem on a CPU.

@andravin
> Are we sure NNPACK is doing Gauss's complex multiplication now? Pretty sure it wasn't originally, so that would make FFT 2 mults/input instead of 1.5.

In this case we have: FFT for 3x3 filters, 16x16 tile is plausible: FFT for 3x3 filters, 8x8 tile is plausible: How does it all work with AlexNet-conv5 (13x13, pad 1) and VGG-A conv5 (14x14, pad 1)?

@andravin
> So the NNPACK Winograd algorithm was F(6x6,3x3) last I checked, with theoretical speedup 5.07.

My numbers were for F(4x4,3x3) with @scott-gray's tiling. For F(6x6,3x3) the theoretical speedup is:
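For reference, the theoretical speedups quoted here follow from the standard tile arithmetic (my calculation, ignoring transform costs): F(m x m, 3x3) consumes an (m+2) x (m+2) input tile and replaces m·m·9 direct multiplies with (m+2)^2 pointwise multiplies, so

$$
F(4{\times}4,3{\times}3):\ \frac{4 \cdot 4 \cdot 9}{6 \cdot 6} = \frac{144}{36} = 4.0,
\qquad
F(6{\times}6,3{\times}3):\ \frac{6 \cdot 6 \cdot 9}{8 \cdot 8} = \frac{324}{64} \approx 5.06,
$$

which matches the ~5.07 figure quoted above.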
@ozabluda
If the tile is larger than the image plus padding, some pixels of the tile remain unused, but we still have to do computations on them.
Let's see if I get it right: AlexNet-conv5 (13x13, pad 1 = 15x15 input) produces a 13x13 output: but what happens with FFT 8x8, which produces a 6x6 output per tile? Do we really need to use 9 tiles?
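For what it's worth, with a straightforward tiling the answer appears to be yes: each 8x8 FFT tile yields a 6x6 block of valid outputs, so covering 13 output positions per dimension takes

$$
\lceil 13/6 \rceil^2 = 3^2 = 9\ \text{tiles},
$$

i.e., the tiles cover 18x18 output positions for a 13x13 output, and the overhang is wasted work.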
Awesome. It appears that my asymptotic calculations are correct; at least they check out against your great spreadsheet, which does much more precise non-asymptotic calculations. Asymptotics are easier to grasp, so here are asymptotic calculations for VGG-A conv5 (14x14, pad 1 = 16x16 input), which produces a 14x14 output. With FFT 16x16: 9/(2*(16/14)^2) = 3.45 max theoretical asymptotic speedup. With FFT 8x8, which produces a 6x6 output per tile, we need to use 9 tiles:
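Extrapolating with the same accounting (my arithmetic, not from the thread): the three overlapping 8x8 tiles per dimension transform 3 x 8 = 24 points to produce 14 outputs, so

$$
\frac{9}{2 \cdot (24/14)^2} \approx 1.53
$$

max theoretical asymptotic speedup, which is why the 16x16 tile looks much better for this layer, before even accounting for transform costs.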
I would like to calculate utilization, but I can't find the AVX2 frequency for the i7-6700K. For example, if the AVX2 frequency were 4.0 GHz (which it isn't), the max FLOP/s would be:
32 FLOP/clock * 4.0 GHz * 4 cores = 512 GFLOP/s (which it isn't)
See the discussion starting here:
soumith/convnet-benchmarks#59 (comment)