[October 2015] Intel are CPU magicians. But there's no one weird trick.... #59
Comments
For comparison, here's the log of Caffe + OpenBLAS numbers on the same machine (It's the Digits box ;-) ) |
More info is in the CPU branch: The alexnet-owt protobuf, with the same architecture I use for the GPU versions is here: The intel-adapted version is here: |
well, assuming i didn't mess up the analysis, and used the right inputs/etc, a runtime of 0.146s on the (non-intel) alexnet-owl prototxt you linked above, for a batch of 128 forward and backward, implies 3.77TF/s. AFAIK, haswell can do at most 32FLOPs/cycle/core. for your 6-core cpu @ 3.5 GHz, that would be 672GF/s peak. so, i guess that seems pretty fishy overall (i.e. perf ~6X peak). i might suspect benchmarking error, such as accidentally running in GPU mode with who-knows-what backend (i.e. BLAS, cudnn v?, i dunno). it's not clear that intel themselves was claiming perf anything like that in their blog post, but i didn't try to run the #s on their post. then again, i have no idea what the intel code might be doing (got scared off by the license, so didn't dig into it), but if there are some algorithmic changes and/or anything that means they're not doing the same set of FLOPS, then all bets are off. but of course such improvement might port to GPUs as well. or not; i'd believe there are algorithms that are more suited to CPUs that trade uniformity/complexity for doing less raw FLOPS. for ref, here's the #s i'm working from:
|
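The arithmetic in the comment above can be re-derived in a few lines. This is only a sketch that reuses the figures quoted there (0.146 s per batch of 128, 3.77 TF/s, 32 FLOPs/cycle/core, 6 cores at 3.5 GHz); the function names are made up for illustration, and the per-image work is backed out from those figures rather than taken from the original table.

```python
# Sanity check of "implied throughput vs. peak", using only numbers quoted in the thread.
# Function names are hypothetical helpers, not part of any benchmark code.

def peak_flops(cores, clock_hz, flops_per_cycle_per_core=32):
    """Theoretical single-precision peak in FLOP/s."""
    return cores * clock_hz * flops_per_cycle_per_core

def implied_flops_per_image(throughput_flops, batch_time_s, batch_size):
    """Work per image (forward + backward) that a claimed throughput implies."""
    return throughput_flops * batch_time_s / batch_size

claimed = 3.77e12                      # FLOP/s implied by 0.146 s per batch of 128
peak = peak_flops(cores=6, clock_hz=3.5e9)
ratio = claimed / peak                 # how far above peak the claim sits
per_image = implied_flops_per_image(claimed, 0.146, 128)

print(f"peak      = {peak / 1e9:.0f} GF/s")
print(f"claim     = {ratio:.1f}x peak")
print(f"per image = {per_image / 1e9:.2f} GF")
```

A ratio well above 1 is the whole argument: either the timing is not per batch, or the code is not doing the same set of FLOPs.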
@moskewcz 3.77TF/s doesn't hold true if you switch to FFT or Winograd based convolutions. References: |
"With these optimizations time to train AlexNet* network on full ILSVRC-2012 dataset to 80% top5 accuracy reduces from 58 days to about 5 days." The benchmark used dual E5-2699-v3 CPUs, which have 18 cores at 2.3 GHz => 2 x 18 x 32 FLOPs/cycle x 2.3 GHz = 2.65 TFLOPs. Sounds about right. TitanX running Nervanagpu probably about 1 day? I would guess Intel just implemented a more efficient direct convolution for many-core Intel CPUs. I do not see any indication they are using fast algorithms. |
So anyway the numbers Intel reported sound plausible, but your numbers don't. :-) |
again, if i got my #s right, if we assume 70M images (~65 epochs * 1.1M images/epoch, not sure if that's a good value or not) in 5 days to train alexnet_owl as per the blog post, that implies 783GF/s -- given the peak #s that andravin gave above, that would be ~35% efficiency, which is perhaps pretty impressive but believable. but it'd be good to know the actual # of epochs/images/etc to get a real value, i could easily be off by quite a bit on those guesses. corrections welcome. mwm
|
.. and having looked a bit at Caffe's CPU implementation, im2col is single-threaded, and will be a pretty nasty bottleneck in a 36-core system. |
@moskewcz your numbers sound plausible to me.. and so Intel's post really points to what a disaster out of the box Caffe performance must be on many-core CPUs. |
sounds like a plan. make sure you fire up nvidia-smi while you're running it ... ;) |
@moskewcz I've already verified that it's running on CPU and using intel code-paths, simply by collecting samples from the stack and looking at hotspots. |
hmm, well, i was mostly joking and i mostly believe you. however, i'm not sure that what you say precludes the GPU being active. in fact, if, say, the new intel layers were running on the CPU, but all/some conv layers were on the GPU, you'd probably see perf similar to what you reported. and if you look at the CPU usage/stack, it'll be pegged at 100%, and it'll always be inside the intel code if you stop it ... i'm really just suggesting that, given the fishiness of the #s, some form(s) of sanity checking are in order. in particular, for example, did you compile in CPU only mode? again, i don't really think that's the issue, but if (for example) intel ran/compiled on boxes without GPUs, then maybe something unexpected happens with their code/build on a box that has GPUs. but i'm not really fixated on the maybe-running-on-GPU idea, there are plenty of other places for errors. batch size issues, shared library wackiness, straight-up user error, etc ... on a side note, thanks for all your hard work running these benchmarks! mwm |
caffe is getting no access to the GPUs, I disabled it at the driver level. |
The 2-column AlexNet that Intel benchmarked in the announcement (different from the 1-col AlexNet "One weird trick" from Soumith's benchmark) has 1449 MFLOPs per image in the forward pass and 2x that in the backward pass, ignoring biases, LRN, activations, pooling, and loss. Taking numbers from Intel's announcement we have: Forward pass: 1449 MFLOP * 731 images / 1 sec = 1.059 TFLOP/s, which is easily believable (exact max FLOPs on those Intel CPUs to be posted later). |
@soumith>A full [forward + backward] on AlexNet on a Desktop 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz takes an average of 164ms EDIT: 268 ms. [...] I need a couple more sanity checks before I can believe this result. Look at how little time they are spending in the convolution layers, even the biggest ones: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365 The i7-5930K AVX2 clock is lower than the 3.50 GHz base clock. I don't recall the exact value, but it seems to be ~3.2 GHz. It can issue 2 AVX256 (8 operand) SP MAD (=2 FLOP) per clock, for a total of 2 * 8 * 2 = 32 FLOP/clock. 32 FLOP/clock * 3.0 GHz * 6 cores = 576 GFLOP/s. Your numbers at the url above (output from Intel's Caffe) seem to be per image for conv and per minibatch for fc, and are comfortably below that (except for fc6 backward, which must be an artifact of Caffe timing), so they are totally believable. In fact, there is a lot of room for improvement. They are also not that much better than your numbers for OpenBLAS (except for conv1).
|
@ozabluda i think your analysis of the intel #s looks good and is believable. as per an above comment, we're guessing ~2.65TFLOPs peak for the dual-socket 36-core machine intel used for the announcement. so again it comes out to ~35% or so efficiency. but, i think there are some issues with your per-layer analysis in your second comment. firstly, i don't think we can trust the per-layer #s from the caffe log too much; for example the pack+relu1 times are >> the conv1 time, so i'd assume there's some timing wonkiness there -- time and/or work being shifted among layers for example. but, perhaps more importantly (and confusingly):
PS: using 268ms / batch, and 4.2GF / image, that yields a still-implausible ~2TF/s for the 6-core digits box, and again it seems to disagree with the more-reasonable intel announced #s, so i'm still assuming benchmarking error. |
There is no such thing as an AVX2 clock. |
@moskewcz I also noticed that Intel's Caffe seems to report timings for conv layers per image and for fc per minibatch. I corrected the table above (I also realized Soumith's numbers are for 1-col AlexNet, while Intel's are for 2-col AlexNet). Please check if it makes sense to you now. AVX2 (32 SP ops/clock) can't run at the base clock frequency, so it throttles down to a lower "AVX clock". Although, maybe it is only true for AVX-512, which none of the CPUs in question have. |
@ozabluda hmm, i'm not sure what you changed, but i guess it looks more/differently wrong to me now, still as per my (1) and (2). AFAIK all the caffe timings are supposedly per batch/iteration, not per image (as per my comment section (2)). and in this case, they look like garbage, as per my comment section (1). FWIW it's been a while since i dug into the caffe timing code and it has changed over time but on the whole i've always found it hard to work with / understand; i'm mostly just looking at things here from the top level and using my own calculations, so i'm not the best one to comment on the details of the caffe reported #s. |
@moskewcz Stock Caffe timings sure are per minibatch (like Soumith's OpenBLAS timings). Intel's port timings do look like garbage (say 0.726ms for conv1), unless they are per image (except for fc), in which case they totally make sense (and approximately equal to stock Caffe/OpenBLAS). See my table above. |
@andravin> The benchmark used dual E5-2699-v3 CPUs, which have 18 cores at 2.3 GHz => 2x18x32FLOPs/cyclex2.3Ghz=2.65TFLOPs Actual AVX base clock is 1.9 Ghz (see quote below). 2 CPU * 18 cores * 32FLOPs/cycle * 1.9Ghz =2.189 TFLOP/s I am almost willing to bet that the scaling to the second CPU is extremely poor in this Intel's iteration. i.e. 2 CPUs are not that much faster than 1 CPU.
|
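To see how much the choice of clock matters, here is a rough sketch comparing the two peak estimates for the dual E5-2699-v3 box and the utilization each implies for the forward pass. It assumes the figures already quoted in this thread (32 FLOPs/cycle/core, 2.3 GHz nominal vs 1.9 GHz AVX base, 1449 MFLOP/image at 731 images/s); `peak_tflops` is a hypothetical helper.

```python
# Peak FLOP/s under the nominal clock vs. the AVX base clock, and the
# forward-pass utilization each one implies. All inputs are quoted in the thread.

def peak_tflops(sockets, cores_per_socket, clock_ghz, flops_per_cycle=32):
    """Theoretical single-precision peak in TFLOP/s."""
    return sockets * cores_per_socket * flops_per_cycle * clock_ghz / 1000.0

nominal = peak_tflops(2, 18, 2.3)      # peak at the advertised base clock
avx_base = peak_tflops(2, 18, 1.9)     # peak at the AVX base clock

forward = 1449e6 * 731 / 1e12          # measured forward throughput, TFLOP/s

for label, peak in (("nominal 2.3 GHz", nominal), ("AVX base 1.9 GHz", avx_base)):
    print(f"{label}: peak {peak:.3f} TFLOP/s, forward utilization {forward / peak:.0%}")
```

Either way the forward pass lands well under peak, so the announced numbers stay plausible under both assumptions.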
@ozabluda Ah, I did not know about this feature of Xeon processors, thanks. So it is Xeon only? soumith's Core(TM) i7-5930K will not have this? My i7-5775C seems to sustain AVX2 256-bit FMA instructions at regular turbo boost speed with liquid cooling. |
I tracked down AVX base frequency specs for haswell e5 processors here: https://www.microway.com/knowledge-center-articles/detailed-specifications-intel-xeon-e5-2600v3-haswell-ep-processors/ Would be nice to find an official Intel source. I suspect this is only a feature of the big Xeon chips. |
@soumith What command line did you use? README.txt says:
When I run that on my 4-core i7-5775C I get:
Most telling are the Total FP/BP jobs numbers, which are exactly equal to 256X the values in your log file. 256 is the batch size specified in train_val.prototxt. |
@soumith Oh I see now you are using your own prototxt file, not the one that was provided by Intel. Obviously there is something wrong that is causing your prototxt to use minibatch size 1. |
Actually I get reasonable numbers using your alexnet.prototxt too. So I am not sure what is wrong with your setup. |
I think all CPUs have it, if they overheat. Liquid cooling helps (I noticed that with my liquid-cooled Haswell as well). Can your CPU run AVX2 256-bit FMA instructions at regular turbo boost speed on all cores simultaneously or just one?
This is awesome, thank you. |
I think something caused conv layers to report time per image, while everything else is per minibatch.
My calculations are per-layer. Total Forward/Backward are also calculated from per-layer (reported numbers are all screwed up), exactly as you suggest.
I ignore 2680/268 number.
that's right.
I have 4.231 GF/image for the 'original' 2-groups version and 4.285 GF/image for the "One weird trick" 1-col version, ignoring biases, LRN, activations, pooling, and loss. Your 6.1 GF/image is probably the 'original' 2-groups version without groups, but it's not what 1-col version is (the number of filtermaps is different).
My calculated "total time" conv*128+fc comes to 4524 ms/minibatch. I ignore 268, because it doesn't correspond to anything in the per-layer I can think of. 90ms and 72ms correspond to the sum, but is incorrect because conv is per image and everything else is per minibatch. |
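The "conv*128+fc" correction can be sketched as below, assuming (as argued above) that conv layers are logged per image and everything else per minibatch. Only the 0.726 ms conv1 value is quoted in this thread; the fc6 value is a made-up placeholder, and `minibatch_time_ms` is a hypothetical helper, not something from Caffe.

```python
# Normalize mixed per-image / per-minibatch layer timings to ms per minibatch.
# ASSUMPTION: conv layers are reported per image, all other layers per minibatch.

def minibatch_time_ms(layer_ms, per_image_layers, batch_size):
    """Sum layer timings after scaling per-image entries by the batch size."""
    total = 0.0
    for name, ms in layer_ms.items():
        total += ms * batch_size if name in per_image_layers else ms
    return total

layer_ms = {
    "conv1": 0.726,   # quoted in the thread, assumed to be per image
    "fc6": 10.0,      # hypothetical placeholder, per minibatch
}

total = minibatch_time_ms(layer_ms, per_image_layers={"conv1"}, batch_size=128)
print(f"{total:.1f} ms/minibatch")
```

With the full per-layer log plugged in, this is the sum that should be compared against the reported total, rather than the raw per-layer numbers.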
@andravin thanks for the log on your side. I suppose doing pure-benchmarking instead of having that lmdb data layer before might be having side-effects on the intel caffe. I'll follow-up on Monday. |
I clarified the tiling I was using in a later post here. It's actually 32x32, not 32x8. The 32x8 is what is visible to the user, but 32x32 is how it actually works. The outer product dims of the batched gemm are K and Y/4*X/4*N. So I don't just have 8 points of N on the outer product, but 4 sets of x,y coordinates of 8 points of N arranged in a 2x2 superblock. With the 2 units of overlap in each direction, this hugely increases the utilization of the L1 cache and it's what makes it possible for this kernel to have such dense global loads (16 loads in ~256 cycles is a lot). I'm actually working on a 2x1 superblock for fp16 (2xy points of 16n) so as to eliminate the half empty 32 byte transaction size. |
I think I kinda understood a little bit the main idea how you get high L1 utilization, removing L2 bottleneck, but I don't understand how this can help with max theoretical peak FLIP/s calculation I am making. You still can't amortize filter transform over "effective" N=32, only over real N=8. Or can you? |
x and y also factor into the number of image transforms you need, not just n. So 32 is the unit you need to use when calculating redundant transforms. |
Aha! I get it now. For F(4x4,3x3), the correct formula for K32xN8, X2xY2(=4) is, for C>>3: 144/(36+156/32+72/4/8)=3.3. For the overlapped data transform, the correct number of FLIP is actually smaller than 156. The last convolutional layer of VGG has image dimension 6x6, preventing X2xY2 superblock tiling. For that layer, for C>>3: 144/(36+156/32+72/8)=2.9. For F(2x2,3x3), the correct formula for K32xN8, X2xY2(=4) is, for C>>3: 36/(16+32/32+28/4/8)=2.01 (1.63 actually achieved). For the overlapped data transform, the correct number of FLIP is actually smaller than 32, but, since it's at least 24 (by my calculation), it doesn't matter for K=32. For C>>3: 36/(16+24/32+28/4/8)=2.04 |
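The four ratios above all follow one pattern, which can be written out as a short sketch. The parameter names are my own reading of the terms (direct multiplies per output tile, Winograd multiplies per tile, data-transform cost amortized over K, filter-transform cost amortized over the superblock times N), so treat them as assumptions rather than the kernel author's terminology.

```python
# Theoretical max speedup of Winograd over direct convolution for C >> 3,
# following the formula used in the comment above. Parameter names are assumptions.

def max_speedup(direct, winograd, data_xform, filter_xform,
                k_tile=32, n_tile=8, xy_superblock=1):
    """direct / (winograd + data_xform/K + filter_xform/(superblock * N))."""
    cost = (winograd
            + data_xform / k_tile
            + filter_xform / (xy_superblock * n_tile))
    return direct / cost

f4_super = max_speedup(144, 36, 156, 72, xy_superblock=4)   # F(4x4,3x3), 2x2 superblock
f4_plain = max_speedup(144, 36, 156, 72, xy_superblock=1)   # F(4x4,3x3), no superblock
f2_super = max_speedup(36, 16, 32, 28, xy_superblock=4)     # F(2x2,3x3), 2x2 superblock
f2_overlap = max_speedup(36, 16, 24, 28, xy_superblock=4)   # same, cheaper data transform

for name, value in [("F(4x4,3x3) 2x2", f4_super), ("F(4x4,3x3) 1x1", f4_plain),
                    ("F(2x2,3x3) 2x2", f2_super), ("F(2x2,3x3) overlap", f2_overlap)]:
    print(f"{name}: {value:.2f}")
```

Dropping the superblock only changes the filter-transform term, which is why the 6x6 VGG layer loses about 0.4x of theoretical speedup.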
Initially, I thought this was a typo (as Titan-X has 6.144 real Tflops). Now I think it may mean an awesome 1.7 utilization (C=3, theoretical max utilization is 1.8, see previous comment), although it's weird, because C=3 is i/o bound. |
@ozabluda, My colleague timed IntelCaffe on 14 and 28 cores (1 and 2 sockets). Affinity setup: There are quite a few cases when the ratio is less than 2 and even some cases where it is less than 1, but the most time-consuming layers have scaled pretty well. The total ratio is 1.83.
|
@rsdubtso Taking the minimum timing of each layer rather than the average is a bit misleading and is not a standard in benchmarking. I think you should consider changing that, even though the overall difference might be minor. |
@rsdubtso, thank you, these are great. I see great scalability to 2 sockets, with ~50% utilization (either one or two sockets), exactly opposite of my earlier guesses. Next natural experiments would be to run it on 1,2,4,8 cores to see where utilization breaks down (are you using AVX2 MADD?) E5-2697v3 has: Note that you ran 2-col AlexNet with minibatch=256, while @soumith ran 1-col AlexNet with minibatch=256
|
@rsdubtso |
@rsdubtso Also, the relu forward seems much slower on two sockets. Why is that? Drop6 and Drop7 seem to still use one socket even when you have two sockets? The scaling ratio is 1. |
I asked around, and here's what I was told: AVX frequency is not SW visible. But even desktop processors have a fused 'AVX' frequency that they throttle down to when executing heavy instructions. I could not find the frequency fused for the i7 CPU mentioned above, but you can find it out using prime95 v27.9 or later, for example. However, current-related throttling may occur earlier than you hit TDP budget limit related to heavy instructions. |
Hi all, @soumith, you are right, we should've pointed out that we report timings for the fastest iteration. Though, if you use the same package for comparing 'intel_alexnet' and 'bvlc_alexnet', the comparison will be quite representative. @gujunli relu1-5 scale well, relu6-7 seem to be too small for scaling across sockets. drop6 and drop7 use rng (not parallelized), which most likely takes most of the time. We didn't optimize the drop layer, except for adding parallelization on the loop. @ozabluda, @gujunli, I reran the package on the same machine @rsdubtso did. The only change here is that I put the database on /tmp (local hard drive). @rsdubtso reported timings when the DB was on Lustre FS (distributed cluster filesystem). That was the reason why the timings were poor for the data layer. Iterations: 10
small comment on cmp columns: |
Thank you for checking. Even though Intel's documentation does list current and power limits, I think all(?) Intel CPUs are in practice limited only by TDP. For example, overclocked Intel CPUs are known to suck 400W on Prime95, likely with long-term damage, and Intel CPUs don't prevent it, if cooled: From official Asus overclocking guide: |
@emfomenk, thanks for the excellent table. Looking only at conv and fc layers, I see excellent scalability 2=>4=>8=>14=>28 in conv layers (except 2=>4=>8=>14 in conv1 forward and 14=>28 in conv1 forward, conv2 backward), and some degradation in scalability 8=>14 and 14=>28 in fc layers. Updating my utilization table for 2 cores, we see that utilization improved conv+fc forward 65%=>73% (maybe some of it is due to AVX clock boost?), while conv+fc backward didn't improve much (65%=>68%). We can see that utilization does/doesn't improve for 2 cores. Now, the only thing missing is 1 core :-) E5-2697v3 with 2 cores has:
|
Just quick update. Recently we released technical preview of Multinode Caffe. The link: https://software.intel.com/en-us/articles/caffe-training-on-multi-node-distributed-memory-systems-based-on-intel-xeon-processor-e5 The results are shown for Alexnet. We use data parallelism (for the first half of the net: from data till pool5) as well as model parallelism (for the second half: from fc6 till the end). The behavior of Multinode Caffe almost duplicates the behavior of Singlenode Caffe. This puts some limitations on scalability. Though we were able to achieve 12.3x, 19.2x and 29.4x speed-up on 16, 32 and 64 nodes respectively. |
@emfomenk, thank you for the summary. Sorry, I don't understand what you mean by
I also don't understand from the article what the effective minibatch is for, say, 64 nodes. Is it still 256, i.e. 4 per node? For multinode synchronous SGD, it's probably best to switch to the 1-col AlexNet from the "One weird Trick..." paper and follow the paper. |
@ozabluda, The only difference in Multinode version (from math point of view) is slightly modified SGD solver, which allows to apply diff right after backward step for current layer (this was made to be able to benefit from MPI parallelization in current approach). It looks like this modification doesn't affect convergence -- at least we were able to train Alexnet in the same amount of iterations as in Singlenode case. Regarding minibatch: for 16 nodes minibatch=256 was used, for 32 nodes minibatch=512, and for 64 nodes minibatch=1024. It means that each node (in 16 nodes case) took 256/16=16 images in its "local" minibatch. Yes, you are right that there are much better ways to implement multinode training (though, the math would be slightly different...), but the original idea was just to show that it is possible to implement good parallelization even for this particular model. |
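Putting the reported speed-ups next to the minibatch sizes gives a quick picture of the scaling. This is a rough sketch that assumes the 12.3x, 19.2x and 29.4x figures are relative to a single node and that "efficiency" means speed-up divided by node count; both helper functions are hypothetical.

```python
# Per-node ("local") minibatch and scaling efficiency for the reported runs.
# ASSUMPTION: speed-ups are measured against one node; efficiency = speedup / nodes.

speedup = {16: 12.3, 32: 19.2, 64: 29.4}            # from the Multinode Caffe update
global_minibatch = {16: 256, 32: 512, 64: 1024}     # from the comment above

def local_minibatch(global_mb, n_nodes):
    """Images each node processes per iteration under data parallelism."""
    if global_mb % n_nodes:
        raise ValueError("minibatch does not divide evenly across nodes")
    return global_mb // n_nodes

def scaling_efficiency(speedup_x, n_nodes):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup_x / n_nodes

for n in sorted(speedup):
    local = local_minibatch(global_minibatch[n], n)
    eff = scaling_efficiency(speedup[n], n)
    print(f"{n:2d} nodes: local minibatch {local}, efficiency {eff:.0%}")
```

The local minibatch stays at 16 images in all three cases, so the drop in efficiency is not explained by less work per node.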
I see. Does it mean it is approximately the same as single-node multi-GPU Caffe? What about the parameter update step? Is it centralized, or also distributed, just like single-node multi-GPU Caffe? Article says """reached 80% top-5 accuracy in just over 5 hours on a 64-node""". Is that 90 epochs with minibatch=1024? AlexNet from the original paper reached 81.8% after 90 epochs with minibatch=128. P.S. Graph incorrectly says "E5-2697 v3 18 cores" |
Correction: 81.8% top-5 from the paper was with averaging predictions of 5 crops plus their horizontal reflections. Standard Caffe "test" does 1 random crop with no reflections, for which 80.2-80.4% top-5 is reached in 60-70 epochs, depending. How many epochs was it with minibatch=1024? |
@ozabluda, In Multinode Caffe for the first half of the net the parameter updates are centralized (since parallelization happens on minibatch, all convolutions parameters are the same for all nodes). For the second half updates are distributed, since fully-connected layers' weights are distributed across the nodes. Just to be aligned: one epoch == one full database turn around. |
Great. I think the web article should say that explicitly, especially since it is actually faster than what could be guessed from """reached 80% top-5 accuracy""", which can mean as little as 40, as you noticed:
Thank you for the offer, knowing that it's 90 epochs is good enough for me. Off-topic part: I am actually more interested in a number more precise than 80% (precision like 80.xx% would be better) for minibatch=1024 [1], single model, single crop, top-5 and top-1 (Caffe can do both simultaneously). I am also interested in your ultimate accuracy for minibatch=256 and 512 as well. As you noticed, with the growing number of nodes you have to increase minibatch size, which negatively affects accuracy. [1] BTW, did you increase learning rate 4x, compared to minibatch=256? If yes, how did that affect accuracy? How about increasing learning rate sqrt(4)=2x? |
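For concreteness, the two learning-rate options in footnote [1] look like this. It is only a sketch: the base learning rate of 0.01 is an assumed placeholder, not a value confirmed anywhere in this thread, and `scaled_lr` is a hypothetical helper.

```python
import math

# Two common ways to rescale the learning rate when the minibatch grows.
# ASSUMPTION: base_lr = 0.01 at minibatch=256 is a placeholder value.

def scaled_lr(base_lr, base_minibatch, new_minibatch, rule="linear"):
    """Scale base_lr by the minibatch ratio (linear) or its square root (sqrt)."""
    ratio = new_minibatch / base_minibatch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

base_lr = 0.01
linear = scaled_lr(base_lr, 256, 1024, "linear")   # 4x the base rate
sqrt_rule = scaled_lr(base_lr, 256, 1024, "sqrt")  # 2x the base rate
print(f"linear: {linear:.3f}, sqrt: {sqrt_rule:.3f}")
```

Which rule (if either) preserves accuracy at minibatch=1024 is exactly the open question here; the code only pins down what "4x" and "2x" would mean.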
This somewhat explains how Intel's Multi-node Caffe works |
Please take a look at https://communities.intel.com/community/itpeernetwork/datastack/blog/2015/11/12/myth-busted-general-purpose-cpus-can-t-tackle-deep-neural-network-training-part-2 for more information on technical details of Intel Multinode Caffe tech-preview, which actually uses one weird trick... :) |
Speaking about accuracy, this could be used as baseline: |
There is now an official Intel Opencl PR at BVLC/caffe#3355. /cc @gongzg |
With F(2x2,3x3), (super)block 2x2 we have tile size of 6x6. In two other dimensions the tile size is K32xN8. Outer loop is over input channels (C). With 4-byte fp32 each 6x6 (super)block (=tile) we have: Filters: 32*3*3*4=1152 bytes
Do I understand correctly that Filters and Input go to L1 (24 KB per SM) and output is accumulated in the registers (64k 32-bit registers per SM)? Do you use Shared Memory (96 KB per SM) at all? What limits it to two blocks on an SM?
Do I understand correctly that the 4 thread blocks (256 threads each) that work on the same 2x2 superblock, really know nothing about each other, solely relying on L1 for transparent data reuse? |
You can find the latest code for F(2x2,3x3) here: This kernel uses 256 threads, 128 registers and 32kb shared memory. This means the threads and registers are limiting the occupancy to 2 blocks per SM and 4 warps per scheduler. The shared memory is mainly used for storing the computed transforms and facilitating the batched gemm. The gemm tile is 32x32 and we have 16 of them in the same block. This means we only have enough shared memory to store 4 outer product lines at a time, double buffered. So the gemm loops are unrolled 4 times. We use 2 separate loops to compute the image and filter transforms inline. When super blocking is in effect, you can get a lot of L1 cache hits, reducing the bandwidth from L2. This implementation is currently significantly more efficient than the one found in cuDNN 5.0 and up. |
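The occupancy figures in this reply can be reproduced with simple resource arithmetic. The sketch below uses the kernel numbers stated above (256 threads, 128 registers per thread, 32 KB shared memory) and the per-SM sizes mentioned earlier in the thread (64K registers, 96 KB shared memory). The 2048-thread limit and 4 schedulers per SM are my assumptions about this GPU generation, allocation granularity is ignored, and the shared-memory breakdown is my reading of "4 outer product lines, double buffered".

```python
# Which resource limits the number of resident blocks per SM, and how the
# 32 KB shared-memory figure could decompose. Hardware limits marked ASSUMPTION
# are not stated in the thread.

REGS_PER_SM = 64 * 1024        # 32-bit registers per SM (from the thread)
SMEM_PER_SM = 96 * 1024        # bytes of shared memory per SM (from the thread)
MAX_THREADS_PER_SM = 2048      # ASSUMPTION
SCHEDULERS_PER_SM = 4          # ASSUMPTION
WARP_SIZE = 32

def blocks_per_sm(threads, regs_per_thread, smem_bytes):
    """Return (resident blocks, name of the limiting resource)."""
    limits = {
        "registers": REGS_PER_SM // (threads * regs_per_thread),
        "shared_memory": SMEM_PER_SM // smem_bytes,
        "threads": MAX_THREADS_PER_SM // threads,
    }
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter

def gemm_buffer_bytes(tiles=16, tile_dim=32, lines=4, buffers=2, bytes_per_value=4):
    """Each outer-product line holds tile_dim image and tile_dim filter values."""
    return tiles * (tile_dim + tile_dim) * lines * buffers * bytes_per_value

blocks, limiter = blocks_per_sm(threads=256, regs_per_thread=128, smem_bytes=32 * 1024)
warps_per_scheduler = blocks * 256 // WARP_SIZE // SCHEDULERS_PER_SM

print(f"{blocks} blocks/SM (limited by {limiter}), {warps_per_scheduler} warps/scheduler")
print(f"gemm buffers: {gemm_buffer_bytes()} bytes")
```

Under these assumptions the register file runs out first, which matches the 2 blocks per SM and 4 warps per scheduler quoted above.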
Intel released a small blog-post recently covering that they have crazy-talk speeds for ConvNets on their Haswell CPU line.
I took their Caffe implementation, painfully installed the dependencies, and the numbers look almost too good to be true. Either someone refutes me, or these are very cool numbers.
Link to blog-post:
https://software.intel.com/en-us/articles/single-node-caffe-scoring-and-training-on-intel-xeon-e5-series-processors
A full [forward + backward] on AlexNet on a Desktop 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz takes an average of 164ms EDIT: 268 ms.
Just for comparison, the latest and greatest NVIDIA Titan-X does the same round-trip in 96 ms. An older generation GPU like Tesla K40 is slower, pegging at around 200+ ms.
I tried to get VGG working, but ran into assertions about unimplemented code pathways. Regardless, if AlexNet is this fast, the others will probably be in the ballpark.
Can someone else try the Intel stuff? I need a couple more sanity checks before I can believe this result. Look at how little time they are spending in the convolution layers, even the biggest ones: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365