Add bulk off-heap scoring for float32 vectors #14980

Merged
ChrisHegarty merged 42 commits into apache:main from ChrisHegarty:bulk_vector_scoring
Aug 7, 2025

@ChrisHegarty
Contributor

@ChrisHegarty ChrisHegarty commented Jul 22, 2025

This commit adds bulk off-heap scoring for float32 vectors.

The bulk scorer scores 4 vectors against the query vector at a time. The general idea is to structure the work so that the CPU can overlap memory loads, which hides some of the memory latency.

Initial results from the micro benchmarks show good potential improvement. The benchmark creates a flat vector index with 128,000 float32 vectors with 1024 dimensions (~500MB), and times how long it takes to score 20,000 random vectors against a query vector (lower times are better):

Benchmark                                            (size)  Mode  Cnt  Score   Error  Units
VectorScorerFloat32Benchmark.dotProductDefault         1024  avgt   15  8.505 ± 0.256  ms/op
VectorScorerFloat32Benchmark.dotProductNewBulkScore    1024  avgt   15  3.717 ± 0.158  ms/op
VectorScorerFloat32Benchmark.dotProductNewScorer       1024  avgt   15  7.287 ± 0.181  ms/op

Notes:

  • Just dot product for now, but other distance functions can be added as a follow up.
  • The bulk scorer just does 4 vectors at a time, since the implementation in Lucene is more straightforward, but this could be adjusted.
  • We seem to suffer pollution of the query vector type, so for now I just added two separate, independent, almost identical versions of the vector dot op.
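
For readers following along, here is a minimal sketch of the "4 vectors at a time" idea. It is not the code from this PR: the class and method names (`BulkDotProductSketch`, `bulkDotProduct`) are hypothetical, and it uses plain on-heap `float[]` arrays and scalar arithmetic rather than off-heap memory and vectorized ops, purely to stay self-contained. The point it illustrates is that four independent accumulators read four unrelated candidate vectors in the same loop, so their memory loads can be in flight together instead of being serialized one candidate after another.

```java
/** Illustrative sketch only; names and layout are assumptions, not the PR's code. */
final class BulkDotProductSketch {

  /** Baseline: scores a single candidate against the query. */
  static float dotProduct(float[] q, float[] v) {
    float sum = 0f;
    for (int d = 0; d < q.length; d++) {
      sum += q[d] * v[d];
    }
    return sum;
  }

  /**
   * Scores the candidates selected by {@code ords} against {@code q}, four per pass.
   * Results are written to {@code scores} in the same order as {@code ords}.
   */
  static void bulkDotProduct(float[] q, float[][] vectors, int[] ords, float[] scores) {
    int n = ords.length;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
      // Four candidates that may live far apart in memory.
      float[] v0 = vectors[ords[i]];
      float[] v1 = vectors[ords[i + 1]];
      float[] v2 = vectors[ords[i + 2]];
      float[] v3 = vectors[ords[i + 3]];
      // Independent accumulators: no data dependency between the four sums.
      float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
      for (int d = 0; d < q.length; d++) {
        float x = q[d]; // the query element is loaded once and reused four times
        s0 += x * v0[d];
        s1 += x * v1[d];
        s2 += x * v2[d];
        s3 += x * v3[d];
      }
      scores[i] = s0;
      scores[i + 1] = s1;
      scores[i + 2] = s2;
      scores[i + 3] = s3;
    }
    // Tail: fewer than four candidates left, fall back to the single-vector path.
    for (; i < n; i++) {
      scores[i] = dotProduct(q, vectors[ords[i]]);
    }
  }
}
```

A caller would pass the ordinals of the candidates it wants scored, for example `bulkDotProduct(query, vectors, new int[] {17, 3, 42, 8, 11}, new float[5])`; the first four go through the bulk loop and the fifth through the tail.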

@ChrisHegarty
Contributor Author

/cc @mccullocht

@ChrisHegarty ChrisHegarty force-pushed the bulk_vector_scoring branch 2 times, most recently from 9275012 to 9c8d9d1 Compare July 25, 2025 13:25
@ChrisHegarty ChrisHegarty force-pushed the bulk_vector_scoring branch from 9c8d9d1 to d67772d Compare July 25, 2025 13:28
@benwtrent
Member

I switched the benchmark to measure throughput (ops/s). I get similar results: JMH indicates a 2x+ improvement on an ARM MacBook on the vector ops side of things:

Benchmark                                            (size)   Mode  Cnt    Score    Error  Units
VectorScorerFloat32Benchmark.dotProductDefault         1024  thrpt   15  106.205 ± 10.434  ops/s
VectorScorerFloat32Benchmark.dotProductDefaultBulk     1024  thrpt   15  119.524 ±  1.094  ops/s
VectorScorerFloat32Benchmark.dotProductOptBulkScore    1024  thrpt   15  283.673 ±  2.391  ops/s
VectorScorerFloat32Benchmark.dotProductOptScorer       1024  thrpt   15  140.599 ±  0.449  ops/s

Member

@benwtrent benwtrent left a comment

I left some minor comments. Seems like jars were updated by accident?

However, the change looks great!

Member

@benwtrent benwtrent left a comment

Love it! Now to bulk score ALL THE THINGS!!!!

@ChrisHegarty ChrisHegarty merged commit 84e5df3 into apache:main Aug 7, 2025
8 checks passed
@rmuir
Member

rmuir commented Aug 7, 2025

I think when I tried my hand at a similar approach with panama I ran into
similar (neutral) results on graviton2 and really the only thing that
helped there was prefetching into cpu cache. Totally believe the results
are better elsewhere, it's just been a bit of a struggle to stand up the
tests on other machines given the resources I have at hand.

When benchmarking ARM, I really recommend only worrying about Graviton3 (ARM SVE).

IMO we should stop investing any time in 128-bit Neon (Macs, Graviton2, etc). I realize Macs are convenient, but the sooner we drop vectorization support for this legacy instruction set, the better. It is not well supported by OpenJDK, so I see no path to success there.

ChrisHegarty added a commit that referenced this pull request Aug 7, 2025
@ChrisHegarty ChrisHegarty deleted the bulk_vector_scoring branch August 7, 2025 13:54
jpountz pushed a commit to shubhamvishu/lucene that referenced this pull request Aug 10, 2025
@benwtrent
Member

Yep, Lucene nightlies show a nice improvement (variance is high, but it's consistently better than before): https://benchmarks.mikemccandless.com/VectorSearch.html

Maybe 10+% better it seems.

ChrisHegarty added a commit that referenced this pull request Aug 14, 2025
This commit adds the remaining bulk float32 off-heap scoring similarities, cosine, euclidean, and max inner product.

The changes in #14980 deliberately added only dot product, to avoid additional bloat on the PR and benchmarking. This PR now refactors things a little to allow for the remaining similarities to be added. Benchmarking will be carried out on them independently, as well as consideration for not negatively affecting dot product.

relates #14980
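
As a rough illustration of what "the remaining similarities" look like in the same four-at-a-time shape, here is a hedged sketch of the raw kernels for squared Euclidean distance and cosine. The class and method names (`BulkSimilaritySketch`, `squareDistanceBulk`, `cosineBulk`) are hypothetical, the code is scalar and on-heap rather than off-heap and vectorized, and it returns raw values only; it does not attempt to reproduce how Lucene maps raw values to scores. Max inner product is omitted because its raw value is an ordinary dot product.

```java
/** Illustrative sketch only; names are assumptions, not the follow-up commit's code. */
final class BulkSimilaritySketch {

  /** Writes the squared Euclidean distance from q to each of four candidates into out[0..3]. */
  static void squareDistanceBulk(
      float[] q, float[] v0, float[] v1, float[] v2, float[] v3, float[] out) {
    float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
    for (int d = 0; d < q.length; d++) {
      float x = q[d];
      float d0 = x - v0[d];
      float d1 = x - v1[d];
      float d2 = x - v2[d];
      float d3 = x - v3[d];
      s0 += d0 * d0;
      s1 += d1 * d1;
      s2 += d2 * d2;
      s3 += d3 * d3;
    }
    out[0] = s0;
    out[1] = s1;
    out[2] = s2;
    out[3] = s3;
  }

  /**
   * Writes the cosine of the angle between q and each of four candidates into out[0..3].
   * A zero-length vector yields NaN here; real code would need to decide how to handle that.
   */
  static void cosineBulk(
      float[] q, float[] v0, float[] v1, float[] v2, float[] v3, float[] out) {
    float qq = 0f; // squared norm of the query, shared by all four candidates
    float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f; // dot products
    float n0 = 0f, n1 = 0f, n2 = 0f, n3 = 0f; // squared norms of the candidates
    for (int d = 0; d < q.length; d++) {
      float x = q[d];
      float a = v0[d], b = v1[d], c = v2[d], e = v3[d];
      qq += x * x;
      s0 += x * a;
      s1 += x * b;
      s2 += x * c;
      s3 += x * e;
      n0 += a * a;
      n1 += b * b;
      n2 += c * c;
      n3 += e * e;
    }
    out[0] = (float) (s0 / Math.sqrt((double) qq * n0));
    out[1] = (float) (s1 / Math.sqrt((double) qq * n1));
    out[2] = (float) (s2 / Math.sqrt((double) qq * n2));
    out[3] = (float) (s3 / Math.sqrt((double) qq * n3));
  }
}
```

Keeping one tight loop per similarity, rather than a single loop parameterized by a callback, is a deliberate choice in this sketch: it keeps each loop body simple enough to compare directly against the dot-product version.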
ChrisHegarty added a commit to ChrisHegarty/lucene that referenced this pull request Aug 14, 2025
@mikemccand
Member

> Yep, Lucene nightlies show a nice improvement (variance is high, but it's consistently better than before): https://benchmarks.mikemccandless.com/VectorSearch.html

Yay! I will add annotation for this, even though it is noisy. (Separately I would love to reduce this noise...)

akhilesh-k pushed a commit to akhilesh-k/lucene that referenced this pull request Aug 24, 2025