
Add AVX-512 optimised dot product distance function for int4 on x64 #109238

Closed
ldematte wants to merge 8 commits into elastic:main from ldematte:native-vec-x64-avx512-int4

Conversation

@ldematte
Contributor

Based on #109084 -- only the last commit is relevant for this draft.

Add an int4 implementation of the dot product between an unpacked vector (one value between 0x00 and 0x0F per byte) and a packed vector (two values between 0x0 and 0xF per byte).
When compiled with clang (gcc presents the same bug as in #109084), it produces the following code:

loop:
    vmovdqu64       zmm18, zmmword ptr [r9]
    vpandq  zmm22, zmm18, zmm2
    vmovdqu64       zmm23, zmmword ptr [r8]
    vpmaddubsw      zmm22, zmm23, zmm22
    vpaddw  zmm17, zmm22, zmm17
    vmovdqu64       zmm22, zmmword ptr [r8 + 64]
    vpsrld  zmm18, zmm18, 4
    vpandq  zmm18, zmm18, zmm2
    vpmaddubsw      zmm18, zmm22, zmm18
    vpaddw  zmm17, zmm17, zmm18
    
    add     r8, 128
    add     r9, 64
    jne loop

Notice the 2 vector multiply-adds (vpmaddubsw), 2 vector ANDs, 2 vector adds, 1 vector (intra-lane) shift, and 3 loads.
Notice also that we are doing 2 multiply-add operations per loop iteration; this means we are not FMA-unit limited: a CPU with enough ports should be able to reach an RThroughput of 2.0 (perform both operations in 2 CPU cycles). Static analysis shows this should be possible on Zen 4 (link), and close to it on Intel (between 2.3 and 3.0).

TODO:

  • sqr4u
  • AVX2 variants
  • Binding on Java side
  • packed-packed (both operands) variant

@ldematte
Contributor Author

Benchmarking on cloud machines (which is not really ideal, but...) shows the opposite picture: performance of int4 and int7 is identical, around 80-90 ops/µs, on AMD machines and Xeon 4th gen (Sapphire Rapids). Performance is 50% better on Xeon 3rd gen (Ice Lake).

@ldematte
Contributor Author

Superseded by #144649

@ldematte ldematte closed this Mar 20, 2026
ldematte added a commit that referenced this pull request Mar 25, 2026
This PR introduces some smaller optimizations to the x64 int4 implementations.

Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions.

The older PR used an inner loop that was at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same scheme to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the bulk variants, and +9% to +19% on the non-bulk variants.

Also, I implemented an AVX-512 variant; this should give us an additional theoretical 2x speedup in the inner calculation loop (over the AVX2 implementation), which should translate to a 12-50% throughput increase depending on vector dimensions (higher dimensions mean more time spent in the inner loop).
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
