
[Native] Using native scorers in BBQ#141762

Merged
ldematte merged 43 commits into elastic:main from ldematte:native/es-i4i1-scorers
Mar 9, 2026

Conversation

@ldematte (Contributor) commented Feb 3, 2026

This PR introduces native scorers for BBQ.
It exposes a new ES93BinaryQuantizedVectorsScorer from the simdvec library, and uses it in ES818BinaryFlatVectorsScorer to perform the scoring on the quantized data.
The approach taken by this PR differs slightly from the other HNSW scorers exposed via VectorScorerFactory; instead, it uses an approach similar to DiskBBQ. This is necessary because some of the classes involved (e.g. BinarizedByteVectorValues and its implementations, BQVectorUtils, etc.) are not yet in Lucene, but are ES-specific and implemented in server.

To avoid a big refactoring, we keep everything in server as it is today, and change the existing scorer (ES818BinaryFlatVectorsScorer) to call into the simdvec-provided implementation, passing the raw values.
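A minimal sketch of the delegation described above. All names except ES818BinaryFlatVectorsScorer and ES93BinaryQuantizedVectorsScorer are hypothetical, and the plain 1-bit dot product stands in for the real quantized scoring, which also involves correction terms:

```java
// Sketch only: the ES-side scorer keeps ownership of the ES-specific vector
// values (BinarizedByteVectorValues lives in server, not Lucene) and hands
// the raw quantized bytes to a simdvec-provided implementation for scoring.
final class BbqDelegationSketch {

    // Stand-in for the simdvec scorer (ES93BinaryQuantizedVectorsScorer);
    // the real implementation dispatches to native SIMD code when available.
    interface QuantizedScorer {
        long dotProduct(byte[] query, byte[] target);
    }

    // Scalar fallback: popcount of the AND of two packed 1-bit vectors.
    static final QuantizedScorer SCALAR = (query, target) -> {
        long sum = 0;
        for (int i = 0; i < query.length; i++) {
            sum += Integer.bitCount((query[i] & target[i]) & 0xFF);
        }
        return sum;
    };

    private final QuantizedScorer delegate;

    BbqDelegationSketch(QuantizedScorer delegate) {
        this.delegate = delegate;
    }

    // What ES818BinaryFlatVectorsScorer does conceptually: extract the raw
    // quantized values and delegate the hot inner loop.
    long score(byte[] rawQuery, byte[] rawTarget) {
        return delegate.dotProduct(rawQuery, rawTarget);
    }
}
```

The point of the pattern is that only the inner loop crosses into simdvec, so the ES-specific types never need to leave server.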

When BinarizedByteVectorValues and supporting classes are moved to Lucene, we can revisit the classes introduced by this PR and move to VectorScorerFactory.

Microbenchmarks show a small speedup in single scoring (between 0 and 1.6x on ARM, and between 1.3x and 1.6x on x64) and a very nice speedup in bulk scoring (~2.5x on ARM and between 2x and 4x on x64).

@tteofili (Contributor) left a comment

it looks good to me ;)

@ldematte (Contributor, Author) commented Feb 6, 2026

A very quick first round of benchmarks on a Mac (I'll run more extensive benchmarks on x64 later):

Benchmark                                    (dims)  (directoryType)  (implementation)  (similarityFunction)   Mode  Cnt  Score   Error   Units
VectorScorerBQBenchmark.bulkScoreRandom        1024             MMAP            SCALAR             EUCLIDEAN  thrpt    5  2.964 ± 0.066  ops/ms
VectorScorerBQBenchmark.bulkScoreRandom        1024             MMAP        VECTORIZED             EUCLIDEAN  thrpt    5  7.746 ± 0.188  ops/ms
VectorScorerBQBenchmark.bulkScoreSequential    1024             MMAP            SCALAR             EUCLIDEAN  thrpt    5  3.188 ± 0.103  ops/ms
VectorScorerBQBenchmark.bulkScoreSequential    1024             MMAP        VECTORIZED             EUCLIDEAN  thrpt    5  7.445 ± 0.132  ops/ms
VectorScorerBQBenchmark.scoreRandom            1024             MMAP            SCALAR             EUCLIDEAN  thrpt    5  2.695 ± 0.014  ops/ms
VectorScorerBQBenchmark.scoreRandom            1024             MMAP        VECTORIZED             EUCLIDEAN  thrpt    5  4.516 ± 0.190  ops/ms
VectorScorerBQBenchmark.scoreSequential        1024             MMAP            SCALAR             EUCLIDEAN  thrpt    5  3.135 ± 0.128  ops/ms
VectorScorerBQBenchmark.scoreSequential        1024             MMAP        VECTORIZED             EUCLIDEAN  thrpt    5  4.464 ± 0.681  ops/ms

That's a 2.6x speedup for the most common case (or what should be the most common case): bulk scoring with non-sequential access to the vector file.
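For context on what "bulk" means in the benchmark names above, here is a sketch (with hypothetical names; in the real code the inner loop is a single native SIMD kernel, not a Java loop): bulk scoring evaluates one query against a whole batch of candidates in one call, amortizing per-call overhead, which is why bulk gains more than single scoring.

```java
// Sketch of bulk vs single scoring over packed 1-bit quantized vectors.
final class BulkScoreSketch {

    // Single scoring: one query against one target (popcount of AND).
    static long score(byte[] query, byte[] target) {
        long sum = 0;
        for (int i = 0; i < query.length; i++) {
            sum += Integer.bitCount((query[i] & target[i]) & 0xFF);
        }
        return sum;
    }

    // Bulk scoring: one call, many targets. In the native path this whole
    // loop becomes one SIMD kernel, so the call overhead is paid once per
    // batch instead of once per vector.
    static long[] bulkScore(byte[] query, byte[][] targets) {
        long[] scores = new long[targets.length];
        for (int i = 0; i < targets.length; i++) {
            scores[i] = score(query, targets[i]);
        }
        return scores;
    }
}
```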

@ldematte ldematte marked this pull request as ready for review February 6, 2026 11:27
@benwtrent benwtrent added and then removed the `build-benchmark` (Trigger build benchmark job) label Feb 6, 2026
@ldematte (Contributor, Author)
> It might be good to wait until Chris's work is finished so that the "onheap fallback" isn't horrible due to lack of memory segments ;)

I think it's worth it, I have a TODO in the code, and I think that if Chris's work is merged soon it's better to just do it instead of revisiting it again. I will sync with him.

@ldematte (Contributor, Author)

Buildkite benchmark this with so-vector-default please

@benwtrent (Member)

Buildkite benchmark this with so-vector please

@ldematte (Contributor, Author) commented Mar 2, 2026

Buildkite benchmark this with so-vector please

@elasticmachine (Collaborator) commented Mar 2, 2026

💚 Build Succeeded

This build ran two so-vector benchmarks to evaluate the performance impact of this PR.


@ldematte (Contributor, Author) commented Mar 6, 2026

I was not able to sort out the Buildkite benchmark results, so I ran so_vector myself. Looking at the flamegraphs, I can see the native scorer being used, and I was able to verify that the Java "slow" path is not used anywhere:
(flamegraph screenshot)

However, despite the speedup, the overall impact on this benchmark is limited. What I found via profiling is that:

  • the benchmark is heavily influenced by script-score tasks, which use individual float32 scoring (and copy data from MemorySegment to heap arrays every time; I made a note to investigate in depth what's happening there and whether we can improve it somehow).
  • even excluding script-related tasks, scoring is just ~30% of KNN time: bulkScore accounts for only 7–22% of KNN search time, and another 7–22% is spent in the single-scorer path (which uses native code too). The remainder is HNSW graph traversal (HnswGraphSearcher.search), which is the bottleneck at ~62% of KNN time on average, followed by neighbor queue operations and other overhead. Even a 2x speedup in scoring would therefore only yield a ~10% improvement in KNN search latency (which is what we see here).
    • bulkScore vs score: it seems the larger the graph, the more time we spend in single scoring. FilteredHnswGraphSearcher.searchLevel and AbstractKnnVectorQuery.searchExact seem to be the main users of single-scoring.
  • within bulkScore, dot product is ~90% — the native scorer is doing its job efficiently. There is no significant overhead from corrections, memory copies, or other wrapper code.
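The latency arithmetic above can be sanity-checked with Amdahl's law (a sketch; the ~30% scoring share and the 2x speedup are the figures quoted in this comment, everything else is assumption):

```java
// Amdahl's law: overall speedup when a fraction p of total time is sped up
// by a factor s: 1 / ((1 - p) + p / s).
final class AmdahlSketch {

    static double overallSpeedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        // 2x on the full ~30% scoring share caps the gain at ~1.18x overall;
        // 2x on just a ~15% bulk-scoring share gives ~1.08x, consistent with
        // the ~5-10% end-to-end improvements reported in this comment.
        System.out.printf("2x on 30%% of time: %.3fx overall%n", overallSpeedup(0.30, 2.0));
        System.out.printf("2x on 15%% of time: %.3fx overall%n", overallSpeedup(0.15, 2.0));
    }
}
```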

Here are the results; I think we need to switch gears a bit, and understand how we can optimize (and if it is worth it) things outside scoring. It seems obvious to me that scoring in BBQ is not the bottleneck.

| Task | Branch | Main | Delta |
|---|---|---|---|
| default-match-all-fm | 245 ops/s | 232 ops/s | +5.4% |
| 10-50-match-all-fm | 249 ops/s | 226 ops/s | +10.3% |
| 100-300-match-all-fm | 117 ops/s | 111 ops/s | +5.0% |

@ldematte (Contributor, Author) commented Mar 6, 2026

I also updated the code to use @ChrisHegarty's changes from #141718.
I think this PR is good to go, and any further work should be handled separately. Wdyt? Can I have a final round of reviews?

@ChrisHegarty (Contributor) left a comment

LGTM

@ldematte ldematte merged commit 631eab3 into elastic:main Mar 9, 2026
36 checks passed
@ldematte ldematte deleted the native/es-i4i1-scorers branch March 9, 2026 08:41

Labels

  • >enhancement
  • :Search Relevance/Vectors (Vector search)
  • Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
  • v9.4.0

7 participants