[Native] BBQ Int4 to 1-bit dot product functions #140264
ldematte merged 48 commits into elastic:main
Conversation
Benchmarks on ARM (Mac) show that the "basic" native functions are on par with the current Panama implementations: essentially the same for non-bulk, and ~20% better with bulk. That looks promising, but I'm going to benchmark on x64 before marking this PR as ready.
Performance on Intel shows a decrease with respect to Panama, confirming that vectorization is better supported on x64 than on ARM. To prevent a regression, I'm going to tweak the x64 code to use SIMD in this PR, instead of deferring it to a follow-up.
A simple AVX2 implementation brings things roughly on par with the Panama implementation; a more advanced one gives a ~15% performance boost over Panama in bulk. I think it's worth using it.
AVX-512 (on c8i) brings mixed news. The Panama version is fast: 3x on bulk and 1.7x in the single case. The native version is a bit better for bulk (~ +10%) and a bit worse for single scoring (~ -20%). I'll test on AMD; so far, no big win on x64 despite the optimizations. Panama on x64 is already quite good for these simple operations (in the end, this is just AND + POPCNT). ARM, in our experience, is a different story.
Hi @ldematte, I've updated the changelog YAML for you. |
* Produces a method handle returning the dot product of an int4 (half-byte) vector and
* a bit vector (one bit per element)
Turns out we need ARM optimizations right away: while on Mac the vanilla native implementation is enough, on Graviton it would have introduced a performance regression. (Benchmark table for c8g, Graviton 4: Scalar / Panama / Native builtin / NEON.)
Very similar results on Graviton 2 (c6g). (Benchmark table: Panama / Native builtin / NEON.)
The upside is that we can claim a performance boost on ARM of up to 20% for single scoring, and between 15% and a very nice 70% for bulk.
// Fast AVX2 popcount, based on "Faster Population Counts Using AVX2 Instructions"
// See https://arxiv.org/abs/1611.07612 and https://github.com/WojciechMula/sse-popcount
static inline __m256i dot_bit_256(const __m256i a, const int8_t* b) {
I'm surprised this isn't named popcount in some fashion, given its relation to __builtin_popcount.
It's popcount after an AND between the operands, so effectively it's a partial bitwise dot product.
thecoop left a comment:
A few questions/comments, but otherwise ok
…uct (#141047) SVE implementation of the dot product between int4 and int1 (single bit). Follows #140264. SVE is supported by e.g. Graviton 3 and 4 processors; it uses variable-length SIMD registers, so on some hardware it should give a performance boost over NEON (which has a fixed width of 128 bits). Graviton 3 has a register width of 256 bits, but Graviton 4 has a width of just 128 bits, so there we should see little to no performance gain. However, SVE is more future-proof: if future processors have significantly wider SIMD registers, this implementation should take advantage of them automatically. As expected, the SVE implementation is 40% faster than NEON on Graviton 3 (and 10% faster than the Panama version), and just 5% faster than NEON on Graviton 4 (and no faster than the Panama version). The bulk operations, which take advantage of inlining + unrolling to give the processor a strong "hint" about which data to prefetch, are between 15% and 30% faster than the Panama version (on Graviton 3 and 4 respectively). Relates to: #139750
This PR introduces the scaffolding needed for the native dot product of an int4 query vector against 1-bit vectors, in BBQ.
The native function implementations already include optimized, platform-specific versions, since the Panama versions were faster than the vanilla native implementations: vpopcntq (512-bit wide popcount; very likely the same instruction Panama translates to), plus some better prefetching. The speedup is between none and 20% for single scoring, and between 20% and 40% for bulk scoring. This is without score adjustment, which is still done on the Java side (with Panama); that will be tackled in a follow-up, and is expected to give a further performance boost.
Benchmarks for various architectures and vendors can be found below.
Closes #128523