[Native] Optimized ARM (SVE) functions for BBQ Int4 to 1-bit dot product#141047
ldematte merged 3 commits into elastic:main
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi @ldematte, I've created a changelog YAML for you.
const int8_t* a2 = a + mapper(c + 2, offsets) * pitch;
const int8_t* a3 = a + mapper(c + 3, offsets) * pitch;
int64_t subRet0_0 = 0;
This is quite verbose, but experiments showed that this 4x unroll is the best way to communicate the data access pattern to ARM processors.
Because each ARM implementation differs substantially in microarchitecture, explicit prefetch instructions might give more performance, but the effect is strongly processor dependent, with the risk of "overfitting" for a particular processor.
"Hinting" at which data we are going to access next (by unrolling) is more portable, as each processor will figure out what to do with the hardware it has available.
ChrisHegarty
left a comment
Very nice. And no tests!! Which is a good thing - existing tests cover this functionality when run on Graviton 3/4! :-)
Yes.. what we need is to ensure the tests run on the target HW, like we discussed yesterday. For now, I've run the relevant ones manually.
SVE implementation of the dot product between int4 and int1 (single-bit) vectors. Follows #140264.
SVE is supported by e.g. Graviton 3 and 4 processors; it supports variable-length SIMD registers, so on some hardware it should give a performance boost over NEON (which has a fixed width of 128 bits).
Graviton 3 has a register width of 256 bits, but Graviton 4 has a width of just 128 bits, so on the latter we should see little to no performance gain. However, SVE is more future proof: if future processors have significantly wider SIMD registers, the same SVE implementation should automatically take advantage of them.
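For reference, the scalar semantics of the int4-by-1-bit dot product can be sketched as below. This is a hypothetical illustration, not the PR's code: the function name is made up, the int4 values are shown unpacked (one per byte) for clarity, and the LSB-first bit ordering within each byte is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar reference: `q` holds one 4-bit value per
 * dimension (unpacked here, each in 0..15); `d` packs one bit per
 * dimension, LSB-first within each byte (an assumed layout). The
 * result is the sum of q[i] over dimensions whose doc bit is set. */
static int32_t dot_int4_bit(const uint8_t* q, const uint8_t* d, size_t dims) {
    int32_t sum = 0;
    for (size_t i = 0; i < dims; i++) {
        uint8_t bit = (d[i >> 3] >> (i & 7)) & 1;  /* doc bit for dim i */
        sum += q[i] * bit;                          /* add q[i] iff bit set */
    }
    return sum;
}
```

Since the multiplier is a single bit, the whole product reduces to a masked sum, which is what makes the SIMD versions (NEON, SVE, Panama) so effective here.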
As expected, the SVE implementation is 40% faster than NEON on Graviton 3 (and 10% faster than the Panama version), but only 5% faster than NEON on Graviton 4 (and no faster than the Panama version).
The bulk operations, which take advantage of inlining + unrolling to give the processor a strong "hint" about which data it could prefetch, are between 15% and 30% faster than the Panama version (on Graviton 3 and 4 respectively).
Relates to: #139750
Benchmarks: results attached for Graviton 3 and Graviton 4, comparing the Panama, NEON, and SVE implementations.