
Add AVX-512 optimised vector distance functions for int7 on x64#109084

Merged
ldematte merged 12 commits into elastic:main from ldematte:native-vec-linux-x64-avx512
Jun 28, 2024

Conversation

@ldematte
Contributor

@ldematte ldematte commented May 27, 2024

On Intel(R) Xeon(R) Platinum 8488C (AWS C7i/M7i, 3rd gen Xeon scalable), 2 FMA units per core. Actually, I see contradictory information on this: it behaves like a CPU with 1 FMA unit, and clang thinks that too.

Benchmark                                   (dims)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.dotProductLucene        1024  thrpt    5  17.432 ± 0.119  ops/us
VectorScorerBenchmark.dotProductNative        1024  thrpt    5  33.055 ± 0.225  ops/us
VectorScorerBenchmark.dotProductScalar        1024  thrpt    5   2.767 ± 0.017  ops/us
VectorScorerBenchmark.squareDistanceLucene    1024  thrpt    5  14.693 ± 1.685  ops/us
VectorScorerBenchmark.squareDistanceNative    1024  thrpt    5  30.992 ± 1.301  ops/us
VectorScorerBenchmark.squareDistanceScalar    1024  thrpt    5   2.537 ± 0.004  ops/us

So, is it faster than AVX2?
It Depends 🙂

On a "good" processor, it is 10 to 20% faster than the AVX2 version on the same hardware. On the "wrong" one (1 FMA unit, or instructions split into 2 uops, e.g. Sapphire Rapids or Zen4), the performance is identical.
Still, it's worth doing: new generations keep improving on AVX-512. On the right upcoming/next-gen processors (Meteor Lake or Zen5), it promises to be 40 to 50% faster.

Further unrolling uses more registers, but shows no gain, meaning we are hitting some other bound (likely execution units, maybe memory bandwidth).
Comments below report a detailed analysis.

@ChrisHegarty
Contributor

An initial run on my Skylake with AVX-512 shows approximately the same as @ldematte:

VectorScorerBenchmark.dotProductLucene        1024  thrpt    5  19.814 ± 0.043  ops/us
VectorScorerBenchmark.dotProductNative        1024  thrpt    5  31.033 ± 0.706  ops/us
VectorScorerBenchmark.dotProductScalar        1024  thrpt    5   3.154 ± 0.048  ops/us
VectorScorerBenchmark.squareDistanceLucene    1024  thrpt    5  18.153 ± 0.056  ops/us
VectorScorerBenchmark.squareDistanceNative    1024  thrpt    5  26.950 ± 0.056  ops/us
VectorScorerBenchmark.squareDistanceScalar    1024  thrpt    5   2.333 ± 0.027  ops/us

@ldematte
Contributor Author

First, a comment about the content of the PR.
I switched to C++ for the native implementation to make use of templates, which makes manual unrolling easier and more pleasant to read and understand.
Since we are using intrinsics, and we do not use anything from the standard libraries etc., the produced code is completely identical (checked with objdump -- more on this later).

To my surprise, GCC was already doing a great job unrolling the loop - probably because the loop body is extremely simple.
Also, unrolling does help, but not in the way I was initially thinking (breaking dependency chains from one loop iteration to the next). More on this later too.

On the Java side, we use the value returned from vec_caps to bind either the AVX2 or the AVX-512 implementation to the dot7u$mh method handle. We discussed various options with @ChrisHegarty; this is not the most pleasant to look at, but it pays no performance penalty.

@ldematte
Contributor Author

ldematte commented May 29, 2024

As we can see from the previous comments, JMH benchmarks for AVX-512 (which include calling into native code from Java) indicate a similar speedup to AVX2.

C micro-benchmarks show a similar story:

Run on (2 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 2048 KiB (x1)
  L3 Unified 107520 KiB (x1)
Load Average: 0.00, 0.00, 0.00
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_dot7u_scalar              232 ns          232 ns      3020470    <-- auto-vectorized with g++ 14, ~2x of non-SIMD code
BM_dot7u_avx2                12.1 ns         12.1 ns     58192052   <-- #pragma GCC unroll
BM_dot7u_avx2_2              12.1 ns         12.1 ns     57370452   <-- manual unroll x8
BM_dot7u_avx512              11.9 ns         11.9 ns     59539236   <-- #pragma GCC unroll
BM_dot7u_avx512_2            11.8 ns         11.8 ns     59449448   <-- manual unroll x8

Depending on the CPU, we are between 11 and 14 ns for an "op", i.e. a call to the dot function to compute the dot product of 2 vectors of 1024 elements. Each element is a byte, even if we consider only positive values (0-127).

As part of my work, I wanted to check whether we were leaving something on the table; I found some interesting points that I'm going to list here. I believe that for AVX2 we are at the limit. For AVX-512 we could expect double the performance, but only under very specific circumstances.

1 GHz is 10^9 Hz, which means 1 ns (10^-9 seconds) per CPU cycle, or 0.33 ns @ 3 GHz (the most common CPU frequency).

We are benchmarking with vectors of 1024 elements, each one 8 bits. This means we can fit 32 elements in a 256-bit vector (AVX2), and we need 32 loop iterations to compute the complete dot product (1024 / 32). I'm leaving unrolling aside at the moment.

The best we can do is to do 1 loop iteration in 1 CPU cycle (more on this later).
This means we need 32 CPU cycles to process all 1024 elements (32 loop iterations with AVX2), which would take ~10ns @ 3GHz.
From the C microbenchmarks above we get ~12ns @ 2.4GHz , which is very, very close.

@ldematte
Contributor Author

Before moving on to AVX-512, let me explain the "1 loop iteration in 1 CPU cycle" statement.

I examined the assembly code produced for the main loop; unsurprisingly, this almost maps 1:1 with intrinsics (that's their purpose):

code_snippet:
   vmovdqu ymm3,YMMWORD PTR [rdi]	
   add    rdi,0x20	
   vpmaddubsw ymm3,ymm3,YMMWORD PTR [rsi]
   add    rsi,0x20
   vpmaddwd ymm3,ymm0,ymm3
   vpaddd ymm6,ymm3,ymm12
   vmovdqa ymm12,ymm6
   jne code_snippet

We can see we have 2 vector mul/adds, 1 vector add, 2 vector loads (1 "disguised" in one of the mul/add), 1 vector move. Plus 2 integer adds and 1 jump for the loop control.
Leaving aside the 2 integer adds and the jump for the moment, a typical modern processor can handle the other instructions across all its execution units (or more precisely, ports).
For example, vpmaddubsw and vpmaddwd (mul/add) have a typical RThroughput (reciprocal throughput, i.e. cycles per instruction) of 0.5, as most processors have 2 FMA units (see the Intel docs). So we can do both in 1 cycle, but no more.
Same for load operations: there are 2 or 3 load ports on most processors, so 2 loads in a cycle fit, but not much more.

I analyzed the scheduling of these instructions using llvm-mca, which confirmed this.
For Skylake:

Dispatch Width:    6
uOps Per Cycle:    2.73
IPC:               2.27
Block RThroughput: 1.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      7     0.50    *                   vmovdqu       ymm3, ymmword ptr [rdi]
 2      12    0.50    *                   vpmaddubsw    ymm3, ymm3, ymmword ptr [rsi]
 1      5     0.50                        vpmaddwd      ymm3, ymm0, ymm3
 1      1     0.33                        vpaddd        ymm6, ymm12, ymm3
 1      1     0.33                        vmovdqa       ymm12, ymm6

Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.36   1.37   1.00   1.00    -     1.27    -      -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -      -      -     1.00    -      -      -      -     vmovdqu   ymm3, ymmword ptr [rdi]
 -      -     0.62   0.38   1.00    -      -      -      -      -     vpmaddubsw        ymm3, ymm3, ymmword ptr [rsi]
 -      -     0.33   0.67    -      -      -      -      -      -     vpmaddwd  ymm3, ymm0, ymm3
 -      -     0.26   0.25    -      -      -     0.49    -      -     vpaddd    ymm6, ymm12, ymm3
 -      -     0.15   0.07    -      -      -     0.78    -      -     vmovdqa   ymm12, ymm6

For Zen4: https://godbolt.org/z/d8TxzvdMs

As you may notice, this is without the 2 adds and the jump; since they share execution units, adding them increases the RThroughput from 1 to 1.5. This is where unrolling becomes important: by unrolling x8, we get a RThroughput of almost 8 (8.4), so the loop-control overhead is amortized over 8 iterations' worth of work; even at this level, the 2 simple adds still cost a little (8.4 vs an ideal 8).

@ldematte
Contributor Author

ldematte commented May 29, 2024

Speaking of which (hurting performance): the C benchmarks show 11-14 ns for a call (an "op", i.e. the dot product of two 1024-element vectors), or ~80-90 ops/us.
The JMH benchmarks show ~30-33 ops/us (or ~30ns for one "op"). Why the gap?

We are talking extremely low numbers here; 32 CPU cycles total. A function call can easily be between 2 and 20 cycles, for comparison.
By examining the asm code produced by the JIT compiler with @ChrisHegarty, we saw an atomic compareAndSet before and after each call; Chris is investigating why. Meanwhile, we can already see that it translates to a lock cmpxchg instruction, which has a latency between 14 and 43 cycles; typically it takes 20 cycles (measured). This alone would account for more than half the performance gap.

Both Chris and I will investigate this further.

@ldematte
Contributor Author

For AVX-512, the story is complicated. TL;DR: in most cases AVX-512 will perform identically to AVX2.

Most processors today implement AVX-512 in a "reduced" fashion: AMD does it as 2x AVX2 (so no change whatsoever for dot7), and many Intel processors have just 1 FMA unit that is AVX-512 capable. Double the bits, but half the execution units.

For Intel processors with 2 FMA units, we should get a RThroughput of 1.0, and therefore a nice 2x speedup (since we only need 16 loop iterations to cover 1024 elements), but a bug in GCC was preventing optimal code generation, so at best we were getting a theoretical 1.5x:

code_snippet:
   vmovdqu64 zmm3,ZMMWORD PTR [rdi]
   #add    rdi,0x20
   #add    rsi,0x20
   vpmaddubsw zmm0,zmm3,ZMMWORD PTR [rsi]
   vpmaddwd zmm0,zmm2,zmm0
   vpaddd zmm0,zmm0,zmm16
   vmovdqa64 zmm16,zmm0 <---- ??

    #jne code_snippet

Notice the extra vmovdqa64?
Re-compiling with clang++ (version 18) shows no issues, and a RThroughput of 1.0 on processors with 2 FMA units (e.g. Ice Lake).

For processors with 1 FMA unit, an alternative may be to use only one mul/add, and explicitly do an h-sum (horizontal sum) instead:

const __m256i sum256 = _mm256_add_epi32(_mm512_castsi512_si256(dot), _mm512_extracti32x8_epi32(dot, 1));
acc = _mm512_add_epi32(acc, _mm512_cvtepi16_epi32(sum256));

or in asm

vmovdqu64 zmm8,ZMMWORD PTR [rdi]	
vpmaddubsw zmm8,zmm8,ZMMWORD PTR [rsi]
vextracti64x4 ymm12,zmm8,0x1
vpaddd ymm8,ymm12,ymm8
vpmovsxwd zmm8,ymm8
vpaddd zmm0,zmm0,zmm8

However, this works out the same: extract, move, and 2 adds are too many operations for 1 cycle; at least 2 of them will end up on the same port, making a total RThroughput of 2 on both AMD and Intel processors with 1 FMA unit.

A final option is to use more advanced instructions available on Ice Lake/Zen4: AVX-512 VNNI. That instruction set has a single "dot product" instruction, which does the job of both mul/adds in the original code.
In theory, this would give us a RThroughput of 1.0 on processors with 1 FMA unit, and possibly even on Zen4.
In practice, I do not see any improvement though :(

@ldematte ldematte added >enhancement :Search/Search Search-related issues that do not fall into other categories test-windows Trigger CI checks on Windows test-arm Pull Requests that should be tested against arm agents and removed WIP labels Jun 27, 2024
@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte ldematte marked this pull request as ready for review June 27, 2024 18:39
@ldematte ldematte requested a review from a team as a code owner June 27, 2024 18:39
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 27, 2024
Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

@ldematte ldematte merged commit 0bc2b19 into elastic:main Jun 28, 2024
@ldematte ldematte deleted the native-vec-linux-x64-avx512 branch June 28, 2024 09:15

Labels

>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team test-arm Pull Requests that should be tested against arm agents test-windows Trigger CI checks on Windows v8.15.0

3 participants