
Native OSQ scoring#134623

Closed
iverase wants to merge 16 commits into elastic:main from iverase:nativeOSQ

Conversation

@iverase
Contributor

@iverase iverase commented Sep 12, 2025

This is an initial draft adding native scoring for DiskBBQ vectors. It is currently only implemented for macOS (run locally on a laptop) but is producing very interesting results, especially for bulk scoring, where we can see speed-ups of around 10-20% for high dimensional vectors.

Running ./gradlew -p benchmarks run --args 'OSQScorerBenchmark' -Druntime.java=21, which uses the Panama version (native is only supported on Java 22 and higher), we get:

Benchmark                                                         (dims)   Mode  Cnt   Score   Error   Units
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect             384  thrpt    5  67.940 ± 4.605  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect             782  thrpt    5  35.235 ± 0.536  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect            1024  thrpt    5  36.123 ± 0.254  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect            384  thrpt    5  12.982 ± 0.699  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect            782  thrpt    5   7.565 ± 0.129  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect           1024  thrpt    5   6.036 ± 2.088  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect      384  thrpt    5  49.656 ± 1.822  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect      782  thrpt    5  28.744 ± 1.200  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect     1024  thrpt    5  28.825 ± 0.369  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect     384  thrpt    5  10.946 ± 1.772  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect     782  thrpt    5   6.387 ± 0.133  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect    1024  thrpt    5   6.289 ± 0.794  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect          384  thrpt    5  34.670 ± 0.158  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect          782  thrpt    5  22.191 ± 0.406  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect         1024  thrpt    5  23.807 ± 1.498  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect         384  thrpt    5   9.933 ± 3.100  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect         782  thrpt    5   5.901 ± 0.098  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect        1024  thrpt    5   5.122 ± 0.523  ops/ms

With native scoring:

Benchmark                                                         (dims)   Mode  Cnt   Score   Error   Units
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect             384  thrpt    5  69.430 ± 2.969  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect             782  thrpt    5  48.346 ± 4.695  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect            1024  thrpt    5  44.576 ± 0.128  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect            384  thrpt    5  12.882 ± 1.495  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect            782  thrpt    5   7.886 ± 0.230  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect           1024  thrpt    5   6.723 ± 0.183  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect      384  thrpt    5  52.539 ± 0.194  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect      782  thrpt    5  39.930 ± 1.520  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect     1024  thrpt    5  35.051 ± 1.521  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect     384  thrpt    5  11.510 ± 0.829  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect     782  thrpt    5   6.093 ± 1.535  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect    1024  thrpt    5   6.417 ± 0.177  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect          384  thrpt    5  32.481 ± 1.043  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect          782  thrpt    5  27.422 ± 0.321  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect         1024  thrpt    5  25.294 ± 1.797  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect         384  thrpt    5  10.284 ± 0.286  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect         782  thrpt    5   6.210 ± 0.316  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect        1024  thrpt    5   5.074 ± 0.367  ops/ms

We see that for bulk scoring we get around a 20% speed-up for high dimensional vectors. For single scoring there might be a slowdown. It is probably not worthwhile in the case of NIOFS, where we need to read the data on-heap.

In order to move this forward, I need help with the implementation on other architectures. Maybe @benwtrent, @ChrisHegarty or @ldematte can help here, thanks!

@iverase iverase requested a review from a team as a code owner September 12, 2025 09:33
@iverase iverase marked this pull request as draft September 12, 2025 09:33
@elasticsearchmachine
Collaborator

Hi @iverase, I've created a changelog YAML for you.

return result;
}

EXPORT int64_t int4Bit(uint8_t* query, uint8_t* doc, int64_t offset, int length) {
Contributor


If you want/need help with the x86 version give me a shout; you can get a look at what's needed for int4 manipulation here: #109238
I never got around to merging it, but it implements dot product and square distance for int4.

Contributor Author


Sure, I am going to need help. I have never worked in this area, and it seems you need Docker, for which I don't have a license at the moment.

Contributor


If we proceed with this change for AArch64 first (then x64 later), let's add a negative error return here, so that we can assert a non-negative value from the caller.

return dot_q0 + (dot_q1 << 1) + (dot_q2 << 2) + (dot_q3 << 3);
}

EXPORT int32_t int4BitBulk(uint8_t* query, uint8_t* doc, int64_t offset, float32_t* scores, int count, int length) {
Member


So, I wonder if we would get even more speedups if we completely switched to also applying the corrections natively in the same function. I realize that this might move some logic around upstream and possibly require another native method.

I would expect this to pay big dividends on larger block sizes.

Contributor Author


I think that can be a great follow up.

@ChrisHegarty
Contributor

I think that this is an awesome direction and change - I need to do a detailed review. Just to say, we can add architectures incrementally. We can choose to only use native scoring for ARM first, then later add x64 (this is how we did things originally; it allows us to split into smaller PRs and make progress).

@iverase
Contributor Author

iverase commented Sep 12, 2025

we can add architectures incrementally. We can choose to only use native scoring here for ARM first

It makes sense to me.

libs "org.elasticsearch:zstd:${zstdVersion}:linux-x86-64"
libs "org.elasticsearch:zstd:${zstdVersion}:windows-x86-64"
libs "org.elasticsearch:vec:${vecVersion}@zip" // temporarily comment this out, if testing a locally built native lib
// libs "org.elasticsearch:vec:${vecVersion}@zip" // temporarily comment this out, if testing a locally built native lib
Contributor


As you know, this is good for local testing but will need to be reverted before merging.

ldematte added a commit that referenced this pull request Jan 21, 2026
This PR introduces the scaffolding needed for native dot product of an int4 query vector against 1-bit vectors, in BBQ.
The native function implementations already have optimized, platform-specific versions, as the Panama versions were faster than vanilla native implementations:

- for ARM we have a NEON implementation (largely lifted from Native OSQ scoring #134623, thanks @iverase)
- for AVX2, I used a combination of lookup tables and shuffling, as described in "Faster Population Counts Using AVX2 Instructions" (https://arxiv.org/abs/1611.07612; see also https://github.com/WojciechMula/sse-popcount)
- for AVX-512, I used vpopcntq, the 512-bit-wide popcount (very likely the same instruction Panama translates to), plus some better prefetching.

The speedup ranges from none to 20% for single scoring, and between 20-40% for bulk scoring. This is without score adjustment, which is still done Java-side (with Panama). That would be tackled in a follow-up and is expected to give a bit more of a performance boost.
@iverase iverase closed this Mar 30, 2026