
Native OSQ scoring#134623

Closed
iverase wants to merge 16 commits into elastic:main from iverase:nativeOSQ

Conversation

@iverase
Contributor

@iverase iverase commented Sep 12, 2025

This is an initial draft adding native scoring for DiskBBQ vectors. It is currently only implemented for macOS (run locally on a laptop) but is producing very interesting results, especially for bulk scoring, where we can see speed-ups of around 10-20% for high dimensional vectors.

Running ./gradlew -p benchmarks run --args 'OSQScorerBenchmark' -Druntime.java=21, which uses the Panama version (native is only supported on Java 22 and higher), we get:

Benchmark                                                         (dims)   Mode  Cnt   Score   Error   Units
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect             384  thrpt    5  67.940 ± 4.605  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect             782  thrpt    5  35.235 ± 0.536  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect            1024  thrpt    5  36.123 ± 0.254  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect            384  thrpt    5  12.982 ± 0.699  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect            782  thrpt    5   7.565 ± 0.129  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect           1024  thrpt    5   6.036 ± 2.088  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect      384  thrpt    5  49.656 ± 1.822  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect      782  thrpt    5  28.744 ± 1.200  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect     1024  thrpt    5  28.825 ± 0.369  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect     384  thrpt    5  10.946 ± 1.772  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect     782  thrpt    5   6.387 ± 0.133  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect    1024  thrpt    5   6.289 ± 0.794  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect          384  thrpt    5  34.670 ± 0.158  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect          782  thrpt    5  22.191 ± 0.406  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect         1024  thrpt    5  23.807 ± 1.498  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect         384  thrpt    5   9.933 ± 3.100  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect         782  thrpt    5   5.901 ± 0.098  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect        1024  thrpt    5   5.122 ± 0.523  ops/ms

With native scoring:

Benchmark                                                         (dims)   Mode  Cnt   Score   Error   Units
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect             384  thrpt    5  69.430 ± 2.969  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect             782  thrpt    5  48.346 ± 4.695  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkMmapVect            1024  thrpt    5  44.576 ± 0.128  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect            384  thrpt    5  12.882 ± 1.495  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect            782  thrpt    5   7.886 ± 0.230  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentAllBulkNiofsVect           1024  thrpt    5   6.723 ± 0.183  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect      384  thrpt    5  52.539 ± 0.194  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect      782  thrpt    5  39.930 ± 1.520  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkMmapVect     1024  thrpt    5  35.051 ± 1.521  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect     384  thrpt    5  11.510 ± 0.829  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect     782  thrpt    5   6.093 ± 1.535  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorBulkNiofsVect    1024  thrpt    5   6.417 ± 0.177  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect          384  thrpt    5  32.481 ± 1.043  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect          782  thrpt    5  27.422 ± 0.321  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorMmapVect         1024  thrpt    5  25.294 ± 1.797  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect         384  thrpt    5  10.284 ± 0.286  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect         782  thrpt    5   6.210 ± 0.316  ops/ms
OSQScorerBenchmark.scoreFromMemorySegmentOnlyVectorNiofsVect        1024  thrpt    5   5.074 ± 0.367  ops/ms

We see that for bulk scoring we get around a 20% speed-up for high dimensional vectors. For single scoring there might be a slowdown. It is probably not worthwhile in the case of NIOFS, where we need to read the data on-heap.

In order to move this forward, I need help with the implementation on other architectures. Maybe @benwtrent, @ChrisHegarty or @ldematte can help here, thanks!

@iverase iverase requested a review from a team as a code owner September 12, 2025 09:33
@iverase iverase marked this pull request as draft September 12, 2025 09:33
@elasticsearchmachine
Collaborator

Hi @iverase, I've created a changelog YAML for you.

return result;
}

EXPORT int64_t int4Bit(uint8_t* query, uint8_t* doc, int64_t offset, int length) {
Contributor


If you want/need help with the x86 version give me a shout; you can get a look at what's needed for int4 manipulation here: #109238
I never got around to merging it, but it implements dot product and square distance for int4.

Contributor Author


Sure, I am going to need help. I have never worked in this area, and it seems you need Docker, for which I don't have a license at the moment.

Contributor


If we proceed with this change for AArch64 first (then x64 later), let's add a negative error return here, so that we can assert a non-negative value from the caller.

return dot_q0 + (dot_q1 << 1) + (dot_q2 << 2) + (dot_q3 << 3);
}

EXPORT int32_t int4BitBulk(uint8_t* query, uint8_t* doc, int64_t offset, float32_t* scores, int count, int length) {
Member


So, I wonder if we would get even more speedups if we completely switched to also applying the corrections natively in the same function. I realize that this might move some logic around upstream and possibly require another native method.

I would expect this to pay big dividends on larger block sizes.

Contributor Author


I think that can be a great follow up.

@ChrisHegarty
Contributor

I think that this is an awesome direction and change - I need to do a detailed review. Just to say, we can add architectures incrementally. We can choose to only use native scoring for ARM first, then later add x64 (this is how we did things originally; it allows us to split into smaller PRs and make progress).

@iverase
Contributor Author

iverase commented Sep 12, 2025

we can add architectures incrementally. We can choose to only use native scoring here for ARM first

It makes sense to me.

libs "org.elasticsearch:zstd:${zstdVersion}:linux-x86-64"
libs "org.elasticsearch:zstd:${zstdVersion}:windows-x86-64"
libs "org.elasticsearch:vec:${vecVersion}@zip" // temporarily comment this out, if testing a locally built native lib
// libs "org.elasticsearch:vec:${vecVersion}@zip" // temporarily comment this out, if testing a locally built native lib
Contributor


As you know, this is good for local testing but will need to be reverted before merging.

ldematte added a commit that referenced this pull request Jan 21, 2026
This PR introduces the scaffolding needed for native dot product of an int4 query vector against 1-bit vectors, in BBQ.
The native function implementations already have optimized, platform-specific versions, as the Panama versions were faster than vanilla native implementations:

- for ARM we have a NEON implementation (largely lifted from Native OSQ scoring #134623, thanks @iverase)
- for AVX2, I used a combination of lookup tables and shuffling, as described in "Faster Population Counts Using AVX2 Instructions" (https://arxiv.org/abs/1611.07612; see also https://github.com/WojciechMula/sse-popcount)
- for AVX-512, I used vpopcntq, the 512-bit-wide popcount (very likely the same instruction Panama translates to), plus some better prefetching.

The speedup ranges from none to 20% for single scoring, and between 20-40% for bulk scoring. This is without score adjustment, which is still done Java-side (with Panama). That would be tackled in a follow-up and is expected to give a bit more of a performance boost.
@iverase iverase closed this Mar 30, 2026