Add AVX-512 optimised dot product distance function for int4 on x64 #109238
Closed
ldematte wants to merge 8 commits into elastic:main from
Conversation
… for manual unrolling
Contributor
Author
Benchmarking on cloud machines (which is not really ideal, but..) shows an opposite picture: performance of int4 and int7 is identical, around 80/90 ops/us, on AMD machines and Xeon 4th gen (sapphirerapids). Performance is 50% better on Xeon 3rd gen (icelake).
Contributor
Author
Superseded by #144649
ldematte
added a commit
that referenced
this pull request
Mar 25, 2026
This PR introduces some smaller optimizations to the x64 int4 implementations. Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions. The older PR used an inner loop that was at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same schema to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the Bulk variants, and +9% to +19% on the non-bulk variants. Also, I implemented an AVX-512 variant; this should give us an additional theoretical speedup of 2x in the inner calculation loop (over the AVX2 implementation), which should translate to a 12-50% throughput increase depending on vector dimensions (higher dimensions --> more time spent in the inner loop).
seanzatzdev
pushed a commit
to seanzatzdev/elasticsearch
that referenced
this pull request
Mar 27, 2026
Based on #109084 -- only the last commit is relevant for this draft.
Add an int4 implementation for dot product, between an unpacked vector (one value between 0x00 and 0x0F per byte) and a packed vector (two values between 0x0 and 0xF per byte).
When compiled with clang (gcc presents the same bug as in #109084), it produces the following code:
Notice the 2 vector mul/adds, 2 vector ands, 2 vector adds, 1 vector (intra-lane) shift, and 3 loads.
Notice also that we are doing 2 operations per loop iteration; this means we are not FMA-unit limited: a CPU with enough ports should be able to achieve an RThroughput of 2.0 (perform both operations in 2 CPU cycles); static analysis reveals this should be possible on Zen4 (link), and close on Intel (between 2.3 and 3.0).
TODO:
sqr4u