Add AVX-512 optimised dot product distance function for int4 on x64 #109238
Closed
ldematte wants to merge 8 commits into elastic:main from
Conversation
… for manual unrolling
Contributor
Author
Benchmarking on cloud machines (which is not really ideal, but..) shows an opposite picture: performance of int4 and int7 is identical, around 80/90 ops/us, on AMD machines and Xeon 4th gen (sapphirerapids). Performance is 50% better on Xeon 3rd gen (icelake).
Contributor
Author
Superseded by #144649
ldematte
added a commit
that referenced
this pull request
Mar 25, 2026
This PR introduces some smaller optimizations to the x64 int4 implementations. Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions. The older PR used an inner loop that was at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same schema to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the Bulk variants, and +9% to +19% on the non-bulk variants. Also, I implemented an AVX-512 variant; this should give us an additional theoretical speedup of 2x in the inner calculation loop (over the AVX2 implementation), which should translate to a 12-50% throughput increase depending on vector dimensions (higher dimensions --> more time spent in the inner loop).
seanzatzdev
pushed a commit
to seanzatzdev/elasticsearch
that referenced
this pull request
Mar 27, 2026
Based on #109084 -- only the last commit is relevant for this draft.
Add an int4 implementation for dot product, between an unpacked vector (one value between 0x00 and 0x0F per byte) and a packed vector (two values between 0x0 and 0xF per byte).
When compiled with clang (gcc presents the same bug as in #109084), it produces the following code:
Notice the 2 vector mul/adds, 2 vector ands, 2 vector adds, 1 vector (intra-lane) shift, and 3 loads.
Notice also that we are doing 2 operations per loop iteration; this means we are not FMA-unit limited: a CPU with enough ports should be able to achieve an RThroughput of 2.0 (perform both operations in 2 CPU cycles); static analysis reveals this should be possible on Zen4 (link), and close on Intel (between 2.3 and 3.0).
TODO:
sqr4u