
Add AVX-512 optimised vector distance functions for int7 on x64#109084

Merged
ldematte merged 12 commits into elastic:main from ldematte:native-vec-linux-x64-avx512
Jun 28, 2024

Conversation

@ldematte
Contributor

@ldematte ldematte commented May 27, 2024

On Intel(R) Xeon(R) Platinum 8488C (AWS C7i/M7i, 3rd gen Xeon scalable), 2 FMA units per core. Actually, I see contradictory information on this: it behaves like a CPU with 1 FMA unit, and clang thinks that too.

Benchmark                                   (dims)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.dotProductLucene        1024  thrpt    5  17.432 ± 0.119  ops/us
VectorScorerBenchmark.dotProductNative        1024  thrpt    5  33.055 ± 0.225  ops/us
VectorScorerBenchmark.dotProductScalar        1024  thrpt    5   2.767 ± 0.017  ops/us
VectorScorerBenchmark.squareDistanceLucene    1024  thrpt    5  14.693 ± 1.685  ops/us
VectorScorerBenchmark.squareDistanceNative    1024  thrpt    5  30.992 ± 1.301  ops/us
VectorScorerBenchmark.squareDistanceScalar    1024  thrpt    5   2.537 ± 0.004  ops/us

So, is it faster than AVX2?
It Depends 🙂

On a "good" processor, it is 10 to 20% faster than the AVX2 version on the same hardware. On the "wrong" one (1 FMA unit, or instructions split into 2 uops, e.g. Sapphire Rapids or Zen4), the performance is identical.
Still, it's worth doing: new generations keep improving on AVX-512. On the right upcoming/next-gen processors (Meteor Lake or Zen5), it promises to be 40 to 50% faster.

Further unrolling uses more registers, but shows no gain, meaning we are hitting some other bound (likely execution units, maybe memory bandwidth).
Comments below report a detailed analysis.

@ChrisHegarty
Contributor

An initial run on my Skylake with AVX-512 shows approximately the same as @ldematte:

VectorScorerBenchmark.dotProductLucene        1024  thrpt    5  19.814 ± 0.043  ops/us
VectorScorerBenchmark.dotProductNative        1024  thrpt    5  31.033 ± 0.706  ops/us
VectorScorerBenchmark.dotProductScalar        1024  thrpt    5   3.154 ± 0.048  ops/us
VectorScorerBenchmark.squareDistanceLucene    1024  thrpt    5  18.153 ± 0.056  ops/us
VectorScorerBenchmark.squareDistanceNative    1024  thrpt    5  26.950 ± 0.056  ops/us
VectorScorerBenchmark.squareDistanceScalar    1024  thrpt    5   2.333 ± 0.027  ops/us

@ldematte
Contributor Author

First, a comment about the content of the PR.
I switched to C++ for the native implementation to make use of templates, which makes manual unrolling easier and more pleasant to read and understand.
Since we are using intrinsics, and we do not use anything from the standard libraries etc., the produced code is completely identical (checked with objdump -- more on this later).

To my surprise, GCC was already doing a great job unrolling the loop - probably because the loop body is extremely simple.
Also, unrolling does help, but not in the way I was initially thinking (breaking dependency chains from one loop iteration to the next). More on this later too.

On the Java side, we use the value returned from vec_caps to bind either the AVX2 or the AVX-512 implementation to the dot7u$mh method handle. We discussed various options with @ChrisHegarty; this is not the most pleasant to look at, but it pays no performance penalty.

@ldematte
Contributor Author

ldematte commented May 29, 2024

As we can see from the previous comments, JMH benchmarks for AVX-512 (which include calling into native code from Java) indicate a similar speedup to AVX2.

C micro-benchmarks show a similar story:

Run on (2 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x1)
  L1 Instruction 32 KiB (x1)
  L2 Unified 2048 KiB (x1)
  L3 Unified 107520 KiB (x1)
Load Average: 0.00, 0.00, 0.00
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_dot7u_scalar              232 ns          232 ns      3020470    <-- auto-vectorized with g++ 14, ~2x of non-SIMD code
BM_dot7u_avx2                12.1 ns         12.1 ns     58192052   <-- #pragma GCC unroll
BM_dot7u_avx2_2              12.1 ns         12.1 ns     57370452   <-- manual unroll x8
BM_dot7u_avx512              11.9 ns         11.9 ns     59539236   <-- #pragma GCC unroll
BM_dot7u_avx512_2            11.8 ns         11.8 ns     59449448   <-- manual unroll x8

Depending on the CPU, we are between 11 and 14 ns for an "op", i.e. a call to the dot function to compute the dot product of 2 vectors of 1024 elements. Each element is a byte, even if we consider only positive values (0-127).

As part of my work, I wanted to check whether we were leaving something on the table; I found some interesting points that I'm going to list here. I believe that for AVX2 we are at the limit. For AVX-512 we could expect double the performance, but only under very specific circumstances.

1 GHz is 10^9 Hz, which means 1 ns (10^-9 seconds) per CPU cycle, or 0.33 ns @ 3 GHz (the most common CPU frequency).

We are benchmarking with vectors of 1024 elements, each one 8 bits. This means we can fit 32 elements in a 256-bit vector (AVX2), and we need 32 loop iterations to compute the complete dot product (1024 / 32). I'm leaving unrolling aside at the moment.

The best we can do is to do 1 loop iteration in 1 CPU cycle (more on this later).
This means we need 32 CPU cycles to process all 1024 elements (32 loop iterations with AVX2), which would take ~10ns @ 3GHz.
From the C microbenchmarks above we get ~12ns @ 2.4GHz , which is very, very close.

@ldematte
Contributor Author

Before moving on to AVX-512, let me explain the "1 loop iteration in 1 CPU cycle" statement.

I examined the assembly code produced for the main loop; unsurprisingly, this almost maps 1:1 with intrinsics (that's their purpose):

code_snippet:
   vmovdqu ymm3,YMMWORD PTR [rdi]	
   add    rdi,0x20	
   vpmaddubsw ymm3,ymm3,YMMWORD PTR [rsi]
   add    rsi,0x20
   vpmaddwd ymm3,ymm0,ymm3
   vpaddd ymm6,ymm3,ymm12
   vmovdqa ymm12,ymm6
   jne code_snippet

We can see we have 2 vector mul/adds, 1 vector add, 2 vector loads (1 "disguised" in one of the mul/add), 1 vector move. Plus 2 integer adds and 1 jump for the loop control.
Leaving aside the 2 integer adds and the jump for the moment, a typical modern processor can handle the other instructions across all its execution units (or more precisely, ports).
For example, vpmaddubsw and vpmaddwd (mul/add) have a typical RThroughput (reciprocal throughput, i.e. cycles per instruction) of 0.5, as most processors have 2 FMA units (see the Intel docs). So we can do both in 1 cycle, but no more.
Same for load operations: there are 2 or 3 load ports on most processors, so 2 loads in a cycle fit, but not much more.

I analyzed the scheduling of these instructions using llvm-mca, which confirmed this.
For Skylake:

Dispatch Width:    6
uOps Per Cycle:    2.73
IPC:               2.27
Block RThroughput: 1.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      7     0.50    *                   vmovdqu       ymm3, ymmword ptr [rdi]
 2      12    0.50    *                   vpmaddubsw    ymm3, ymm3, ymmword ptr [rsi]
 1      5     0.50                        vpmaddwd      ymm3, ymm0, ymm3
 1      1     0.33                        vpaddd        ymm6, ymm12, ymm3
 1      1     0.33                        vmovdqa       ymm12, ymm6

Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.36   1.37   1.00   1.00    -     1.27    -      -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -      -      -     1.00    -      -      -      -     vmovdqu   ymm3, ymmword ptr [rdi]
 -      -     0.62   0.38   1.00    -      -      -      -      -     vpmaddubsw        ymm3, ymm3, ymmword ptr [rsi]
 -      -     0.33   0.67    -      -      -      -      -      -     vpmaddwd  ymm3, ymm0, ymm3
 -      -     0.26   0.25    -      -      -     0.49    -      -     vpaddd    ymm6, ymm12, ymm3
 -      -     0.15   0.07    -      -      -     0.78    -      -     vmovdqa   ymm12, ymm6

For Zen4: https://godbolt.org/z/d8TxzvdMs

As you may notice, this is without the 2 adds and the jump; since they share execution units, adding them increases the RThroughput from 1 to 1.5. This is where unrolling becomes important: by unrolling x8, we get a RThroughput of almost 8 (8.4), so the loop-control overhead is amortized over 8 iterations' worth of work; even at this level, the 2 simple adds still cost a little (8.4 vs an ideal 8).

@ldematte
Contributor Author

ldematte commented May 29, 2024

Speaking of which (hurting performance): the C benchmarks show 11-14 ns for a call (an "op", i.e. the dot product of two 1024-element vectors), or ~80-90 ops/us.
The JMH benchmarks show ~30-33 ops/us (or ~30ns for one "op"). Why the gap?

We are talking extremely low numbers here; 32 CPU cycles total. A function call can easily be between 2 and 20 cycles, for comparison.
By examining the asm code produced by the JIT compiler with @ChrisHegarty, we saw an atomic compareAndSet before and after each call; Chris is investigating why. Meanwhile, we can already see that it translates to a lock cmpxchg instruction, which has a latency between 14 and 43 cycles; typically it takes 20 cycles (measured). This alone would account for more than half the performance gap.

Both Chris and I will investigate this further.

@ldematte
Contributor Author

For AVX-512, the story is complicated. TL;DR: in most cases AVX-512 will perform identically to AVX2.

Most processors today implement AVX-512 in a "reduced" fashion: AMD does it as 2x AVX2 (so no change whatsoever for dot7), and many Intel processors have just 1 FMA unit that is AVX-512 capable. Double the bits, but half the execution units.

For Intel processors with 2 FMA units, we should get a RThroughput of 1.0, and therefore a nice 2x speedup (since we only need 16 loop iterations to cover 1024 elements), but a bug in GCC was preventing optimal code generation, so at best we were getting a theoretical 1.5x:

code_snippet:
   vmovdqu64 zmm3,ZMMWORD PTR [rdi]
   #add    rdi,0x20
   #add    rsi,0x20
   vpmaddubsw zmm0,zmm3,ZMMWORD PTR [rsi]
   vpmaddwd zmm0,zmm2,zmm0
   vpaddd zmm0,zmm0,zmm16
   vmovdqa64 zmm16,zmm0 <---- ??

    #jne code_snippet

Notice the extra vmovdqa64?
Re-compiling with clang++ (version 18) shows no issues, and a RThroughput of 1.0 on processors with 2 FMA units (e.g. Ice Lake).

For processors with 1 FMA unit, an alternative may be to use only one mul/add, and explicitly do an h-sum (horizontal sum) instead:

const __m256i sum256 = _mm256_add_epi32(_mm512_castsi512_si256(dot), _mm512_extracti32x8_epi32(dot, 1));
acc = _mm512_add_epi32(acc, _mm512_cvtepi16_epi32(sum256));

or in asm

vmovdqu64 zmm8,ZMMWORD PTR [rdi]	
vpmaddubsw zmm8,zmm8,ZMMWORD PTR [rsi]
vextracti64x4 ymm12,zmm8,0x1
vpaddd ymm8,ymm12,ymm8
vpmovsxwd zmm8,ymm8
vpaddd zmm0,zmm0,zmm8

However, this works out the same: extract, move, and 2 adds are too many operations for 1 cycle; at least 2 of them will end up on the same port, making a total RThroughput of 2 on both AMD and Intel processors with 1 FMA unit.

A final option is to use more advanced instructions available on Ice Lake/Zen4: AVX-512 VNNI. That instruction set has a single "dot product" instruction, which does the job of both mul/adds in the original code.
In theory, this would give us a RThroughput of 1.0 on processors with 1 FMA unit, and possibly even on Zen4.
In practice, I do not see any improvement though :(

@ldematte ldematte added >enhancement :Search/Search Search-related issues that do not fall into other categories test-windows Trigger CI checks on Windows test-arm Pull Requests that should be tested against arm agents and removed WIP labels Jun 27, 2024
@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte ldematte marked this pull request as ready for review June 27, 2024 18:39
@ldematte ldematte requested a review from a team as a code owner June 27, 2024 18:39
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 27, 2024
Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

@ldematte ldematte merged commit 0bc2b19 into elastic:main Jun 28, 2024
@ldematte ldematte deleted the native-vec-linux-x64-avx512 branch June 28, 2024 09:15

Labels

>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team test-arm Pull Requests that should be tested against arm agents test-windows Trigger CI checks on Windows v8.15.0

3 participants