Conversation
Hi @iverase, I've created a changelog YAML for you.
```c
    return result;
}

EXPORT int64_t int4Bit(uint8_t* query, uint8_t* doc, int64_t offset, int length) {
```
If you want/need help with the x86 version, give me a shout; you can get an idea of what's needed for int4 manipulation here: #109238
I never got around to merging it, but it implements dot product and sqr for int4.
Sure, I am going to need help. I have never worked in this area, and it seems you need Docker, and I don't have a license at the moment.
If we proceed with this change for AArch64 first (then x64 later), let's add a negative error return here, so that the caller can assert a non-negative result?
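A minimal sketch of what this suggestion could look like, assuming a checked variant with a negative sentinel (the function name and the placeholder body are illustrative, not the PR's actual code):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: return a negative sentinel for invalid input so the
 * caller can assert a non-negative score. The body here is a stand-in
 * single-plane popcount dot, not the PR's real bit-sliced computation. */
int64_t int4Bit_checked(const uint8_t* query, const uint8_t* doc, int length) {
    if (query == NULL || doc == NULL || length <= 0) {
        return -1;  /* error: caller asserts result >= 0 */
    }
    int64_t dot = 0;
    for (int i = 0; i < length; i++) {
        dot += __builtin_popcount(query[i] & doc[i]);  /* GCC/Clang builtin */
    }
    return dot;
}
```

The caller side would then be a plain `assert(result >= 0)` before using the score.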
```c
    return dot_q0 + (dot_q1 << 1) + (dot_q2 << 2) + (dot_q3 << 3);
}

EXPORT int32_t int4BitBulk(uint8_t* query, uint8_t* doc, int64_t offset, float32_t* scores, int count, int length) {
```
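For reference, the combining step `dot_q0 + (dot_q1 << 1) + (dot_q2 << 2) + (dot_q3 << 3)` in the diff corresponds to a bit-sliced int4 × 1-bit dot product. A scalar sketch of that idea, assuming the int4 query is stored as four 1-bit planes of `length` bytes each (the memory layout here is an assumption; the PR's NEON/AVX versions vectorize this loop):

```c
#include <stdint.h>

/* Scalar reference: AND each query bit-plane with the 1-bit doc vector,
 * popcount, then weight plane k by 2^k, matching
 * dot_q0 + (dot_q1 << 1) + (dot_q2 << 2) + (dot_q3 << 3). */
static int64_t int4bit_dot_scalar(const uint8_t* query, const uint8_t* doc, int length) {
    int64_t dot[4] = {0, 0, 0, 0};
    for (int plane = 0; plane < 4; plane++) {
        const uint8_t* q = query + (int64_t)plane * length;  /* plane 0 = LSBs */
        for (int i = 0; i < length; i++) {
            dot[plane] += __builtin_popcount(q[i] & doc[i]);
        }
    }
    return dot[0] + (dot[1] << 1) + (dot[2] << 2) + (dot[3] << 3);
}
```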
So, I wonder if we would get even more speedups if we completely switched to also applying the corrections natively in the same function... I realize that this might switch some logic around upstream and possibly require another native method. But I would expect this to pay big dividends on larger block sizes.
I think that can be a great follow up.
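As a rough sketch of the bulk shape being discussed: score `count` contiguous 1-bit doc vectors against one bit-sliced int4 query and write the raw dot products into `scores` (the follow-up idea would fold the Java-side corrections into this same loop). The function name, plane layout, and doc layout here are assumptions for illustration, not the PR's implementation:

```c
#include <stdint.h>

/* Hedged sketch of a bulk scorer: one int4 query (four 1-bit planes of
 * `length` bytes) against `count` contiguous 1-bit doc vectors. */
static void int4bit_bulk_sketch(const uint8_t* query, const uint8_t* docs,
                                float* scores, int count, int length) {
    for (int d = 0; d < count; d++) {
        const uint8_t* doc = docs + (int64_t)d * length;
        int64_t dot[4] = {0, 0, 0, 0};
        for (int plane = 0; plane < 4; plane++) {
            const uint8_t* q = query + (int64_t)plane * length;
            for (int i = 0; i < length; i++) {
                dot[plane] += __builtin_popcount(q[i] & doc[i]);
            }
        }
        /* raw dot product; native corrections would be applied here in the
         * fused follow-up */
        scores[d] = (float)(dot[0] + (dot[1] << 1) + (dot[2] << 2) + (dot[3] << 3));
    }
}
```

Keeping the per-doc loop native also amortizes the JNI/downcall overhead over the whole block, which is likely where the bulk speedups come from.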
I think that this is an awesome direction and change - I need to do a detailed review. Just to say, we can add architectures incrementally. We can choose to only use native scoring here for ARM first, then later add x64 (this is how we did things originally, and it allows us to split into smaller PRs and make progress).
It makes sense to me.
libs/native/libraries/build.gradle
Outdated
```diff
  libs "org.elasticsearch:zstd:${zstdVersion}:linux-x86-64"
  libs "org.elasticsearch:zstd:${zstdVersion}:windows-x86-64"
- libs "org.elasticsearch:vec:${vecVersion}@zip" // temporarily comment this out, if testing a locally built native lib
+ // libs "org.elasticsearch:vec:${vecVersion}@zip" // temporarily comment this out, if testing a locally built native lib
```
As you know, this is good for local testing but will need to be reverted before merging.
This PR introduces the scaffolding needed for native dot product of an int4 query vector against 1-bit vectors, in BBQ. The native function implementations already have optimized, platform-specific versions, as the Panama versions were faster than vanilla native implementations:

- for ARM we have a NEON implementation (largely lifted from Native OSQ scoring #134623, thanks @iverase)
- for AVX2, I used a combination of lookup tables and shuffling, as described in ["Faster Population Counts Using AVX2 Instructions"](https://arxiv.org/abs/1611.07612) (see also https://github.com/WojciechMula/sse-popcount)
- for AVX-512, I used `vpopcntq`, the 512-bit wide popcount (very likely the same instruction that Panama translates to), plus some better prefetching.

The speedup ranges from none to 20% for single scoring, and from 20% to 40% for bulk scoring. This is without score adjustment, which is still done Java-side (with Panama). That would be tackled in a follow-up, and is expected to give a bit more of a performance boost.
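The core of the lookup-table trick referenced above ("Faster Population Counts Using AVX2 Instructions") is a 16-entry table of nibble popcounts; in the AVX2 version the table lookup is done 32 bytes at a time with a byte shuffle (`PSHUFB`). This scalar rendering only illustrates the table idea, and is not the PR's AVX2 code:

```c
#include <stdint.h>
#include <stddef.h>

/* Popcount of each possible 4-bit nibble value 0..15. */
static const uint8_t NIBBLE_POPCNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Sum of set bits over a byte buffer: each byte contributes the table
 * entries for its low and high nibble. The vector version performs these
 * 16-entry lookups in parallel with a byte shuffle. */
static uint64_t popcount_lut(const uint8_t* data, size_t len) {
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += NIBBLE_POPCNT[data[i] & 0x0F] + NIBBLE_POPCNT[data[i] >> 4];
    }
    return total;
}
```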
This is an initial draft to have native scoring for DiskBBQ vectors. It is currently only implemented for Mac (run locally on a laptop) but is producing very interesting results, especially for bulk scoring, where we can see speed-ups of around 10-20% for high-dimensional vectors.
Running `./gradlew -p benchmarks run --args 'OSQScorerBenchmark' -Druntime.java=21`, which uses the panamized version (native is only supported for Java 22 and higher), we get:

With native scoring:
We see that for bulk scoring we get around a 20% speed-up for high-dimensional vectors. For single scoring there might be a slowdown. It is probably not worthwhile in the case of NIOFS, where we need to read the data on-heap.
In order to move this forward, I need help with the implementation on other architectures. Maybe @benwtrent, @ChrisHegarty or @ldematte can help here, thanks!