Add AVX-512 optimised vector distance functions for int7 on x64 #109084
ldematte merged 12 commits into elastic:main
Conversation
An initial run on my Skylake (AVX-512) shows approximately the same results as @ldematte:
First, a comment about the content of the PR. To my surprise, GCC was already doing a great job unrolling the loop - probably because the loop body is extremely simple. On the Java side, we use the value returned from
As we can see from the previous comments, JMH benchmarks for AVX-512 (which include calling into native code from Java) indicate a similar speedup to AVX2. C micro-benchmarks show a similar story: depending on the CPU, we are between 11 and 14 ns for an "op", i.e. a single call into the native function.

As part of my work, I wanted to check if we were leaving something on the table; I found some interesting points that I'm going to list here. I believe that for AVX2 we are at the limit. For AVX-512 we could expect double the performance, but only under very specific circumstances.

1 GHz is 10^9 Hz, which means 1 ns (10^-9 seconds) per CPU cycle, or 0.33 ns at 3 GHz (a common CPU frequency). We are benchmarking with vectors of 1024 elements, each one 8 bits. This means we can fit 32 elements in a 256-bit vector (AVX2), and we need 32 loop iterations to compute the complete dot product (1024 / 32). I'm leaving unrolling aside for the moment. The best we can do is 1 loop iteration in 1 CPU cycle (more on this later).
Before moving on to AVX-512, let me explain the "1 loop iteration in 1 CPU cycle" statement. I examined the assembly code produced for the main loop; unsurprisingly, it almost maps 1:1 to the intrinsics (that's their purpose). We have 2 vector mul/adds, 1 vector add, 2 vector loads (1 "disguised" in one of the mul/adds), and 1 vector move, plus 2 integer adds and 1 jump for the loop control.

I analyzed the scheduling of these instructions using llvm-mca. For Zen4: https://godbolt.org/z/d8TxzvdMs

As you may notice, this is without the 2 adds and the jump; since they share execution units, adding them worsens the RThroughput from 1 to 1.5. This is where unrolling becomes important: by unrolling x8, we get an RThroughput of almost 8 (8.4); at this level, even 2 simple adds can hurt performance.
Speaking of which (hurting performance): the C benchmarks show 11-14 ns per call ("op", e.g. the dot product of a 1024-element vector), or ~80-90 ops/us. We are talking extremely low numbers here; 32 CPU cycles total. For comparison, a function call alone can easily cost between 2 and 20 cycles. Both Chris and I will investigate this further.
For AVX-512, the story is complicated. TL;DR: in most cases AVX-512 will be identical to AVX2.

Most processors today implement AVX-512 in a "reduced" fashion: AMD does it as 2x AVX2 (so no change whatsoever for dot7), and many Intel processors have just 1 FMA unit which is AVX-512 capable - double the bits, but half the execution units.

For the Intels with 2 FMAs, we should get an RThroughput of 1.0, and therefore a nice 2x speedup (since we only need 16 loop iterations to cover 1024 elements), but a bug in GCC was preventing some optimal code generation, so at best we were getting a theoretical 1.5x. Notice the extra instruction in the generated code.

For processors with 1 FMA, an alternative may be to use only one mul/add and explicitly do an h-sum instead (or the equivalent in asm). However, this is the same: extract, move, and 2 adds are too many operations for 1 cycle; at least 2 of them will end up on the same port, making a total RThroughput of 2 on both AMD and Intel processors with 1 FMA.

A final option is to use more advanced instructions available on Icelake/Zen4: AVX-512 VNNI. That instruction set has a single "dot product" instruction, which does the job of both mul/adds in the original code.
|
Hi @ldematte, I've created a changelog YAML for you. |
|
Pinging @elastic/es-search (Team:Search) |
On Intel(R) Xeon(R) Platinum 8488C (AWS C7i/M7i, 3rd gen Xeon scalable) - nominally 2 FMA units per core (actually, I see contradictory information on this: it behaves like a CPU with 1 FMA, and clang thinks that too).

So, is it faster than AVX2?
It Depends 🙂
On a "good" processor, it is 10 to 20% faster than the AVX2 version on the same hardware. On the "wrong" one (1 FMA/instruction split into 2, e.g. Sapphire Rapids or Zen4), the performance is identical.
Still, it's worth doing: new generations keep improving on AVX-512. On the right upcoming/next-gen processors (Meteor Lake or Zen5), it promises to be 40 to 50% faster.
Further unrolling uses more registers but shows no gain, meaning we are hitting some other bound (likely execution units, maybe memory bandwidth).
Comments below report a detailed analysis.