
How do we compare against other toolkits #59

Open
XapaJIaMnu opened this issue Jan 28, 2020 · 6 comments

@XapaJIaMnu
Collaborator

On SSSE3 (tested on a Mac)

Arch: any
Matrix size: M: 1024 K: 1024 N: 1024 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 160.7630360000 seconds.
  dnnl u8s8s32 gemm took: 162.6090670000 seconds.
         dnnl sgemm took: 37.5122590000 seconds.
            Intgemm took: 26.6763590000 seconds.
    Intgemm Shifted took: 37.4154280000 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 256 K: 10368 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 120.7655340000 seconds.
  dnnl u8s8s32 gemm took: 131.6190160000 seconds.
         dnnl sgemm took: 44.2511700000 seconds.
            Intgemm took: 18.3821070000 seconds.
    Intgemm Shifted took: 22.4899050000 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 256 K: 5312 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 67.8882950000 seconds.
  dnnl u8s8s32 gemm took: 61.2556680000 seconds.
         dnnl sgemm took: 22.5068530000 seconds.
            Intgemm took: 9.4558490000 seconds.
    Intgemm Shifted took: 11.5820400000 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 8 K: 2048 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 5.0743890000 seconds.
  dnnl u8s8s32 gemm took: 4.9938150000 seconds.
         dnnl sgemm took: 0.3691870000 seconds.
            Intgemm took: 0.1078810000 seconds.
    Intgemm Shifted took: 0.1410670000 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 320 K: 256 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 3.9427480000 seconds.
  dnnl u8s8s32 gemm took: 3.7518470000 seconds.
         dnnl sgemm took: 1.2999140000 seconds.
            Intgemm took: 0.5302770000 seconds.
    Intgemm Shifted took: 0.8196150000 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 472 K: 256 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 5.7050050000 seconds.
  dnnl u8s8s32 gemm took: 5.4563000000 seconds.
         dnnl sgemm took: 1.9841880000 seconds.
            Intgemm took: 0.7823240000 seconds.
    Intgemm Shifted took: 1.2160360000 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 248 K: 256 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 3.3614970000 seconds.
  dnnl u8s8s32 gemm took: 2.9069520000 seconds.
         dnnl sgemm took: 1.0137430000 seconds.
            Intgemm took: 0.4135380000 seconds.
    Intgemm Shifted took: 0.6389010000 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 200 K: 256 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 2.8379860000 seconds.
  dnnl u8s8s32 gemm took: 2.4539370000 seconds.
         dnnl sgemm took: 0.8104770000 seconds.
            Intgemm took: 0.3314730000 seconds.
    Intgemm Shifted took: 0.5087140000 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 1 K: 64 N: 8 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 0.0048230000 seconds.
  dnnl u8s8s32 gemm took: 0.0047840000 seconds.
         dnnl sgemm took: 0.0005170000 seconds.
            Intgemm took: 0.0000670000 seconds.
    Intgemm Shifted took: 0.0000630000 seconds.
Alignment was: 256.

On AVX2 (tested on my laptop)

Arch: any
Matrix size: M: 1024 K: 1024 N: 1024 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 3.8168575710 seconds.
  dnnl u8s8s32 gemm took: 3.5513647600 seconds.
         dnnl sgemm took: 3.9564677720 seconds.
            Intgemm took: 3.0320555790 seconds.
    Intgemm Shifted took: 2.6607167070 seconds.
fbgemm SparseXDense took: 2.8762658060 seconds.
      fbgemm Packed took: 2.5303888970 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 256 K: 10368 N: 256 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 2.9163228270 seconds.
  dnnl u8s8s32 gemm took: 2.3117003190 seconds.
         dnnl sgemm took: 2.6101417560 seconds.
            Intgemm took: 1.9078559750 seconds.
    Intgemm Shifted took: 1.9673491580 seconds.
fbgemm SparseXDense took: 1.9887489830 seconds.
      fbgemm Packed took: 1.5653211720 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 256 K: 5312 N: 256 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 1.5031878530 seconds.
  dnnl u8s8s32 gemm took: 1.1827831940 seconds.
         dnnl sgemm took: 1.3960096210 seconds.
            Intgemm took: 0.9642166380 seconds.
    Intgemm Shifted took: 0.9947312040 seconds.
fbgemm SparseXDense took: 1.0348749650 seconds.
      fbgemm Packed took: 0.7993707530 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 8 K: 2048 N: 256 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 0.1371742300 seconds.
  dnnl u8s8s32 gemm took: 0.0215478650 seconds.
         dnnl sgemm took: 0.0280146870 seconds.
            Intgemm took: 0.0123623300 seconds.
    Intgemm Shifted took: 0.0105849290 seconds.
fbgemm SparseXDense took: 0.0355475240 seconds.
      fbgemm Packed took: 0.0122486380 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 320 K: 256 N: 256 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 0.0852186880 seconds.
  dnnl u8s8s32 gemm took: 0.0708069750 seconds.
         dnnl sgemm took: 0.0862559440 seconds.
            Intgemm took: 0.0589404700 seconds.
    Intgemm Shifted took: 0.0535070910 seconds.
fbgemm SparseXDense took: 0.0534052950 seconds.
      fbgemm Packed took: 0.0537651650 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 472 K: 256 N: 256 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 0.1192502650 seconds.
  dnnl u8s8s32 gemm took: 0.1045028840 seconds.
         dnnl sgemm took: 0.1272612120 seconds.
            Intgemm took: 0.0873322810 seconds.
    Intgemm Shifted took: 0.0791147750 seconds.
fbgemm SparseXDense took: 0.0777058490 seconds.
      fbgemm Packed took: 0.0790565330 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 248 K: 256 N: 256 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 0.0692625470 seconds.
  dnnl u8s8s32 gemm took: 0.0550089420 seconds.
         dnnl sgemm took: 0.0690275630 seconds.
            Intgemm took: 0.0454910790 seconds.
    Intgemm Shifted took: 0.0415837170 seconds.
fbgemm SparseXDense took: 0.0424727770 seconds.
      fbgemm Packed took: 0.0417065990 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 200 K: 256 N: 256 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 0.0583410730 seconds.
  dnnl u8s8s32 gemm took: 0.0445252690 seconds.
         dnnl sgemm took: 0.0549694490 seconds.
            Intgemm took: 0.0366770430 seconds.
    Intgemm Shifted took: 0.0331239330 seconds.
fbgemm SparseXDense took: 0.0342776520 seconds.
      fbgemm Packed took: 0.0334005700 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 1 K: 64 N: 8 in loop, for 200 iterations:
  dnnl s8s8s32 gemm took: 0.0005636220 seconds.
  dnnl u8s8s32 gemm took: 0.0001262460 seconds.
         dnnl sgemm took: 0.0002341430 seconds.
            Intgemm took: 0.0000137880 seconds.
    Intgemm Shifted took: 0.0000076180 seconds.
fbgemm SparseXDense took: 0.0001084460 seconds.
      fbgemm Packed took: 0.0001391700 seconds.
Alignment was: 256.

On AVX512VNNI

Arch: any
Matrix size: M: 1024 K: 1024 N: 1024 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 3.0805698930 seconds.
  dnnl u8s8s32 gemm took: 3.0335771660 seconds.
         dnnl sgemm took: 11.4418824780 seconds.
            Intgemm took: 10.4515422080 seconds.
    Intgemm Shifted took: 8.2289911150 seconds.
fbgemm SparseXDense took: 4.8233884470 seconds.
      fbgemm Packed took: 4.7012166050 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 256 K: 10368 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 2.4343342120 seconds.
  dnnl u8s8s32 gemm took: 2.4769278340 seconds.
         dnnl sgemm took: 8.3684561560 seconds.
            Intgemm took: 6.8943220870 seconds.
    Intgemm Shifted took: 6.1097037920 seconds.
fbgemm SparseXDense took: 3.3428150030 seconds.
      fbgemm Packed took: 3.1115883280 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 256 K: 5312 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 1.2665478430 seconds.
  dnnl u8s8s32 gemm took: 1.2202449190 seconds.
         dnnl sgemm took: 4.3336066470 seconds.
            Intgemm took: 3.5151793340 seconds.
    Intgemm Shifted took: 3.0669876760 seconds.
fbgemm SparseXDense took: 1.6181674400 seconds.
      fbgemm Packed took: 1.6256571530 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 8 K: 2048 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 0.0832580680 seconds.
  dnnl u8s8s32 gemm took: 0.0577219100 seconds.
         dnnl sgemm took: 0.1630447130 seconds.
            Intgemm took: 0.0543942070 seconds.
    Intgemm Shifted took: 0.0283744450 seconds.
fbgemm SparseXDense took: 0.1101626230 seconds.
      fbgemm Packed took: 0.0565417540 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 320 K: 256 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 0.0957957560 seconds.
  dnnl u8s8s32 gemm took: 0.0670620450 seconds.
         dnnl sgemm took: 0.2337507040 seconds.
            Intgemm took: 0.2188310160 seconds.
    Intgemm Shifted took: 0.1614387950 seconds.
fbgemm SparseXDense took: 0.0931338430 seconds.
      fbgemm Packed took: 0.1018407810 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 472 K: 256 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 0.1227829230 seconds.
  dnnl u8s8s32 gemm took: 0.0953531580 seconds.
         dnnl sgemm took: 0.3441788240 seconds.
            Intgemm took: 0.3257040630 seconds.
    Intgemm Shifted took: 0.2396385250 seconds.
fbgemm SparseXDense took: 0.1329304730 seconds.
      fbgemm Packed took: 0.1514016630 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 248 K: 256 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 0.0844883130 seconds.
  dnnl u8s8s32 gemm took: 0.0518359310 seconds.
         dnnl sgemm took: 0.1831681580 seconds.
            Intgemm took: 0.1700371060 seconds.
    Intgemm Shifted took: 0.1249343910 seconds.
fbgemm SparseXDense took: 0.0838161900 seconds.
      fbgemm Packed took: 0.0796740550 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 200 K: 256 N: 256 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 0.0890493030 seconds.
  dnnl u8s8s32 gemm took: 0.0437124190 seconds.
         dnnl sgemm took: 0.1508868270 seconds.
            Intgemm took: 0.1375590840 seconds.
    Intgemm Shifted took: 0.1007324120 seconds.
fbgemm SparseXDense took: 0.0744372750 seconds.
      fbgemm Packed took: 0.0644506520 seconds.
Alignment was: 256.
Arch: any
Matrix size: M: 1 K: 64 N: 8 in loop, for 1000 iterations:
  dnnl s8s8s32 gemm took: 0.0007670340 seconds.
  dnnl u8s8s32 gemm took: 0.0006232130 seconds.
         dnnl sgemm took: 0.0007033860 seconds.
            Intgemm took: 0.0000521970 seconds.
    Intgemm Shifted took: 0.0000378480 seconds.
fbgemm SparseXDense took: 0.0003777070 seconds.
      fbgemm Packed took: 0.0006217930 seconds.
Alignment was: 256.

Marian uses fbgemm Packed, which does unsigned × signed multiplication and unquantizes to floats afterwards. We should aim for those numbers. For comparison, use https://github.com/XapaJIaMnu/gemmbench
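
To make the target concrete, here is a minimal sketch of that final unquantize step: the int32 accumulator of an integer GEMM is scaled back to floats. The two-scale scheme and all names here are illustrative assumptions, not fbgemm's actual API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical unquantize step: scale an int32 GEMM accumulator back to
// floats. a_scale/b_scale are the multipliers used when quantizing the
// inputs; these names are for illustration only.
void unquantize(const int32_t* c_int32, float* c_float, std::size_t n,
                float a_scale, float b_scale) {
  const float unquant_mult = 1.0f / (a_scale * b_scale);
  for (std::size_t i = 0; i < n; ++i)
    c_float[i] = static_cast<float>(c_int32[i]) * unquant_mult;
}
```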

@pengzhao-intel

How do you test DNNL performance?
Could you post the HW/SW configuration and the DNNL verbose log?

@XapaJIaMnu
Collaborator Author

@pengzhao-intel

Thanks for the information @XapaJIaMnu.
For DNNL, you can use benchdnn from the Intel repo; the 2nd generation of Intel Xeon is the preferred platform for testing INT8, e.g. AWS EC2 c5.18xlarge or c5.24xlarge.
https://github.com/intel/mkl-dnn/tree/master/tests/benchdnn

@XapaJIaMnu
Collaborator Author

@pengzhao-intel to give you some background about this particular benchmark:
The matrix sizes chosen are the ones that constitute the biggest computational cost in our machine translation models. We also aim to run on a variety of outdated consumer-grade hardware that is still in use, which is why we benchmark on architectures spanning SSSE3 to VNNI, not just recent Xeons.

@alvations

alvations commented Feb 21, 2020

Is the losslessness of Intgemm Shifted similar to that of fbgemm Packed?

@kpu
Owner

kpu commented Feb 21, 2020

All intgemm operations are packed (though the formats are not necessarily the same). Shifted refers to adding a constant to work around Intel's unsigned * signed instruction.
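
For readers following along, a scalar sketch of that shift: since (a + 128) · b = a · b + 128 · b, one can multiply a shifted (now unsigned) A against signed B and subtract a precomputed correction of 128 times each column sum of B. This mirrors the idea only; intgemm's real implementation is SIMD and folds the correction into the bias term.

```cpp
#include <cstdint>
#include <vector>

// Scalar illustration of the shifted unsigned*signed trick; intgemm's
// actual kernels are SIMD and fold the correction into the bias.
void shifted_gemm(const int8_t* A, const int8_t* B, int32_t* C,
                  int rows, int width, int cols) {
  // 128 * column sums of B, precomputed once per weight matrix.
  std::vector<int32_t> correction(cols, 0);
  for (int k = 0; k < width; ++k)
    for (int j = 0; j < cols; ++j)
      correction[j] += 128 * B[k * cols + j];
  for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
      int32_t sum = 0;
      for (int k = 0; k < width; ++k) {
        // Shift a into [0, 255] so an unsigned*signed multiply applies.
        int32_t a_shifted = static_cast<int32_t>(A[i * width + k]) + 128;
        sum += a_shifted * static_cast<int32_t>(B[k * cols + j]);
      }
      // (a + 128) * b = a * b + 128 * b, so subtract the precomputed term.
      C[i * cols + j] = sum - correction[j];
    }
  }
}
```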

leezu pushed a commit to apache/mxnet that referenced this issue Aug 31, 2020
This pull request adds wrappers to the intgemm matrix multiplication library: https://github.com/kpu/intgemm .

A performance comparison with DNNL aka MKL-DNN is at kpu/intgemm#59

The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the efficiency task at the 2018 Workshop on Neural Generation and Translation: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The purpose of this issue is to add similar functionality to Sockeye: awslabs/sockeye#771 .

Quantized Sockeye runs 2.95x as fast. One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.

intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.
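
The intended usage pattern looks roughly like the sketch below, loosely based on intgemm's README example; exact headers and signatures may differ across versions, and the quantization multiplier here assumes inputs in [-2, 2].

```cpp
#include "intgemm/intgemm.h"

// Rough sketch of the prepare-once, multiply-many pattern. A is
// A_rows x width, B is width x B_cols; memory must be 64-byte aligned.
void translate_layer(const float* A, const float* B, float* C,
                     int A_rows, int width, int B_cols) {
  const float quant_mult = 127.0f / 2.0f;  // assumes values in [-2, 2]
  intgemm::AlignedVector<int8_t> A_prepared(A_rows * width);
  intgemm::AlignedVector<int8_t> B_prepared(width * B_cols);
  // Done once at model load: quantize and rearrange B into the custom layout.
  intgemm::Int8::PrepareB(B, B_prepared.begin(), quant_mult, width, B_cols);
  // Done per call: quantize A.
  intgemm::Int8::PrepareA(A, A_prepared.begin(), quant_mult, A_rows, width);
  // Multiply, then unquantize the int32 result back to floats in C.
  intgemm::Int8::Multiply(
      A_prepared.begin(), B_prepared.begin(), A_rows, width, B_cols,
      intgemm::callbacks::UnquantizeAndWrite(
          1.0f / (quant_mult * quant_mult), C));
}
```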

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction every multiply by exploiting the fact that most neural network parameters are near 0.
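
An illustration with AVX2 intrinsics (not intgemm's actual kernel) of the difference: the upcasting variant pays a madd_epi16 against ones on every step to widen to 32 bits, while the saturating variant stays in 16 bits with adds_epi16 and widens once at the end.

```cpp
#include <immintrin.h>

// Upcasting inner step: widen every 16-bit product to 32 bits, costing a
// madd_epi16 (against ones) plus an add_epi32 on each iteration.
static inline __m256i step_upcast(__m256i acc32, __m256i a_u8, __m256i b_s8) {
  __m256i prod16 = _mm256_maddubs_epi16(a_u8, b_s8);
  __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
  return _mm256_add_epi32(acc32, prod32);
}

// Saturating inner step: stay in 16 bits with a saturating add. Because
// most network parameters are near 0, saturation rarely triggers; the
// accumulator is widened to 32 bits once at the end instead.
static inline __m256i step_saturate(__m256i acc16, __m256i a_u8, __m256i b_s8) {
  return _mm256_adds_epi16(acc16, _mm256_maddubs_epi16(a_u8, b_s8));
}
```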

Because x86 only offers an unsigned * signed instruction and most people want signed * signed, there are two strategies one can take.

1. Add 128 to the data so it is now unsigned. But that biases the output. DNNL calculates this bias on the fly by summing weights, then subtracts it out during GEMM. intgemm calculates this bias in advance, which can then be subtracted from the bias term with no overhead at runtime. A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
2. Emulate signed * signed by normalizing the sign bit into the second argument (sketched below). This requires extra instructions in the hot loop but keeps the accumulator small, so it's less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2.
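
A sketch of strategy 2 with AVX2 intrinsics (not necessarily intgemm's exact kernel): maddubs requires its first operand to be unsigned, so use a·b = |a| · (sign(a)·b), moving a's sign onto b before the unsigned*signed multiply.

```cpp
#include <immintrin.h>

// Strategy 2 sketch: normalize a's sign bit into b, then use the
// unsigned*signed maddubs instruction.
static inline __m256i signed_madd(__m256i a_s8, __m256i b_s8) {
  __m256i a_abs = _mm256_abs_epi8(a_s8);         // |a| is valid as unsigned
  __m256i b_adj = _mm256_sign_epi8(b_s8, a_s8);  // negate b where a < 0,
                                                 // zero b where a == 0
  return _mm256_maddubs_epi16(a_abs, b_adj);     // 16-bit pairwise products
}
```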

Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.
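
For flavor, runtime selection can look roughly like the sketch below, using GCC/Clang's __builtin_cpu_supports; the kernel names are hypothetical, and both libraries ship their own dispatch machinery.

```cpp
#include <cstdio>

// Hypothetical per-architecture kernels standing in for real backends.
static void multiply_ssse3()      { /* SSSE3 kernel */ }
static void multiply_avx2()       { /* AVX2 kernel */ }
static void multiply_avx512bw()   { /* AVX512BW kernel */ }
static void multiply_avx512vnni() { /* AVX512VNNI kernel */ }

// Pick the best backend the CPU supports, newest first.
void multiply_dispatch() {
  if (__builtin_cpu_supports("avx512vnni"))    multiply_avx512vnni();
  else if (__builtin_cpu_supports("avx512bw")) multiply_avx512bw();
  else if (__builtin_cpu_supports("avx2"))     multiply_avx2();
  else if (__builtin_cpu_supports("ssse3"))    multiply_ssse3();
  else std::fprintf(stderr, "no supported backend\n");
}
```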