Add remaining bulk float32 off-heap scoring similarities#15037

Merged
ChrisHegarty merged 8 commits into apache:main from ChrisHegarty:bulk_scoring_more_sims
Aug 14, 2025

@ChrisHegarty ChrisHegarty commented Aug 7, 2025

This commit adds the remaining bulk float32 off-heap scoring similarities: cosine, euclidean, and max inner product.

The changes in #14980 deliberately added only dot product, to keep that PR and its benchmarking small. This PR refactors things a little so that the remaining similarities can be added. They will be benchmarked independently, with care taken not to negatively affect dot product.

relates #14980
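
For readers unfamiliar with the four similarities named above, here is a minimal plain-Java sketch of the raw computations and of the score normalizations. The normalization formulas are written from memory of Lucene's `VectorSimilarityFunction` and should be treated as an assumption, not as code from this PR. The class and method names (`SimilaritySketch`, `dotProductScore`, and so on) are hypothetical.

```java
/** Illustrative only: plain-Java float32 similarities and score normalizations. */
final class SimilaritySketch {

  /** Raw dot product of two equal-length vectors. */
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Squared euclidean distance of two equal-length vectors. */
  static float squareDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Cosine of the angle between two vectors; NaN if either vector has zero norm. */
  static float cosine(float[] a, float[] b) {
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return (float) (dot / Math.sqrt((double) normA * normB));
  }

  // Score normalizations (assumed to match VectorSimilarityFunction; higher is better).

  static float dotProductScore(float dot) {
    return Math.max((1 + dot) / 2, 0);
  }

  static float cosineScore(float cos) {
    return Math.max((1 + cos) / 2, 0);
  }

  static float euclideanScore(float squareDistance) {
    return 1 / (1 + squareDistance);
  }

  /** Maps an unbounded inner product onto a positive score. */
  static float maxInnerProductScore(float dot) {
    return dot < 0 ? 1 / (1 - dot) : dot + 1;
  }

  public static void main(String[] args) {
    float[] q = {1f, 0f};
    float[] d = {0f, 1f};
    System.out.println("dot score       = " + dotProductScore(dotProduct(q, d)));
    System.out.println("cosine score    = " + cosineScore(cosine(q, d)));
    System.out.println("euclidean score = " + euclideanScore(squareDistance(q, d)));
    System.out.println("mip score       = " + maxInnerProductScore(dotProduct(q, d)));
  }
}
```

Dot product and max inner product share the same raw kernel and differ only in how the raw value is mapped to a score, which is consistent with their near-identical benchmark numbers in this thread.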

github-actions bot commented Aug 7, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions github-actions bot added this to the 10.3.0 milestone Aug 7, 2025
@ChrisHegarty
Contributor Author

We're growing the number of vector distance implementations, which is a bit concerning. However, the implementations in this PR are straightforward (no unrolling, no vector-width-specific versions, etc.), and they are all well covered by new and existing tests. So this should be fine. We can review what we have at some future point if necessary.
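
To make "bulk off-heap scoring" with a straightforward, non-unrolled loop concrete, here is a minimal sketch. It is not the code from this PR: it uses a direct `FloatBuffer` as a stand-in for off-heap vector storage and a scalar inner loop with no SIMD. The names `BulkOffHeapSketch`, `toOffHeap`, and `bulkDotProduct` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Illustrative only: a plain, non-unrolled bulk dot product over off-heap float32 vectors. */
final class BulkOffHeapSketch {

  /** Copies the vectors into a direct (off-heap) buffer, laid out contiguously by ordinal. */
  static FloatBuffer toOffHeap(float[][] vectors, int dims) {
    FloatBuffer buf =
        ByteBuffer.allocateDirect(vectors.length * dims * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asFloatBuffer();
    for (float[] v : vectors) {
      buf.put(v); // relative bulk put; reads below use absolute indices
    }
    return buf;
  }

  /**
   * Scores the query against count vectors, identified by ordinal, in a single call.
   * Raw dot products are written to scores[0..count); vectors are read in place
   * from the buffer rather than being copied onto the heap first.
   */
  static void bulkDotProduct(
      FloatBuffer data, int dims, float[] query, int[] ords, float[] scores, int count) {
    for (int n = 0; n < count; n++) {
      int base = ords[n] * dims;
      float sum = 0f;
      for (int i = 0; i < dims; i++) {
        sum += query[i] * data.get(base + i);
      }
      scores[n] = sum;
    }
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 2f}, {3f, 4f}, {5f, 6f}};
    FloatBuffer data = toOffHeap(vectors, 2);
    float[] scores = new float[2];
    bulkDotProduct(data, 2, new float[] {1f, 1f}, new int[] {2, 0}, scores, 2);
    System.out.println(scores[0] + " " + scores[1]);
  }
}
```

The design point the sketch tries to show is the call shape: one call scores many candidate ordinals against the same query, so per-call overhead is paid once per batch rather than once per vector.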

@ChrisHegarty
Contributor Author

Note to self: verify the benchmark with
CI=true ./gradlew :lucene:benchmark-jmh:test -Dtests.iters=1000 -Dtests.vectorsize=default

@benwtrent
Member

I ran this on an AMD EPYC 7B13 (256-bit AVX, FMA enabled):

Benchmark                                            (size)  Mode  Cnt  Score   Error  Units
VectorScorerFloat32Benchmark.cosineDefault             1024  avgt   15  9.039 ± 0.374  ms/op
VectorScorerFloat32Benchmark.cosineDefaultBulk         1024  avgt   15  8.837 ± 0.630  ms/op
VectorScorerFloat32Benchmark.cosineOptBulkScore        1024  avgt   15  5.334 ± 0.185  ms/op
VectorScorerFloat32Benchmark.cosineOptScorer           1024  avgt   15  9.281 ± 0.591  ms/op
VectorScorerFloat32Benchmark.dotProductDefault         1024  avgt   15  8.726 ± 0.640  ms/op
VectorScorerFloat32Benchmark.dotProductDefaultBulk     1024  avgt   15  8.738 ± 0.442  ms/op
VectorScorerFloat32Benchmark.dotProductOptBulkScore    1024  avgt   15  4.313 ± 0.134  ms/op
VectorScorerFloat32Benchmark.dotProductOptScorer       1024  avgt   15  7.286 ± 0.626  ms/op
VectorScorerFloat32Benchmark.euclideanDefault          1024  avgt   15  8.460 ± 0.417  ms/op
VectorScorerFloat32Benchmark.euclideanDefaultBulk      1024  avgt   15  8.823 ± 0.705  ms/op
VectorScorerFloat32Benchmark.euclideanOptBulkScore     1024  avgt   15  4.833 ± 0.244  ms/op
VectorScorerFloat32Benchmark.euclideanOptScorer        1024  avgt   15  7.880 ± 0.931  ms/op
VectorScorerFloat32Benchmark.mipDefault                1024  avgt   15  8.853 ± 0.393  ms/op
VectorScorerFloat32Benchmark.mipDefaultBulk            1024  avgt   15  8.898 ± 0.509  ms/op
VectorScorerFloat32Benchmark.mipOptBulkScore           1024  avgt   15  4.304 ± 0.180  ms/op
VectorScorerFloat32Benchmark.mipOptScorer              1024  avgt   15  8.320 ± 0.407  ms/op

Yep, bulk scoring is consistently better :)

@ChrisHegarty
Contributor Author

Dumping some JMH results.
Summary:

  1. Bulk scoring is ~2x faster in all results.
  2. Off-heap scoring of a single vector is anywhere up to 50% better.

linux-arm: c7g.4xlarge, Graviton 3, Neoverse-V1, preferredBitSize=256

Benchmark                                            (size)  Mode  Cnt   Score   Error  Units
VectorScorerFloat32Benchmark.cosineDefault             1024  avgt   15  10.805 ± 0.285  ms/op
VectorScorerFloat32Benchmark.cosineDefaultBulk         1024  avgt   15  10.423 ± 0.124  ms/op
VectorScorerFloat32Benchmark.cosineOptBulkScore        1024  avgt   15   5.888 ± 0.151  ms/op
VectorScorerFloat32Benchmark.cosineOptScorer           1024  avgt   15   9.601 ± 0.633  ms/op
VectorScorerFloat32Benchmark.dotProductDefault         1024  avgt   15  10.484 ± 0.201  ms/op
VectorScorerFloat32Benchmark.dotProductDefaultBulk     1024  avgt   15  10.408 ± 0.194  ms/op
VectorScorerFloat32Benchmark.dotProductOptBulkScore    1024  avgt   15   5.856 ± 0.166  ms/op
VectorScorerFloat32Benchmark.dotProductOptScorer       1024  avgt   15   7.214 ± 0.281  ms/op
VectorScorerFloat32Benchmark.euclideanDefault          1024  avgt   15  10.572 ± 0.459  ms/op
VectorScorerFloat32Benchmark.euclideanDefaultBulk      1024  avgt   15  10.692 ± 0.335  ms/op
VectorScorerFloat32Benchmark.euclideanOptBulkScore     1024  avgt   15   5.797 ± 0.279  ms/op
VectorScorerFloat32Benchmark.euclideanOptScorer        1024  avgt   15   8.324 ± 0.504  ms/op
VectorScorerFloat32Benchmark.mipDefault                1024  avgt   15  10.664 ± 0.264  ms/op
VectorScorerFloat32Benchmark.mipDefaultBulk            1024  avgt   15  10.566 ± 0.244  ms/op
VectorScorerFloat32Benchmark.mipOptBulkScore           1024  avgt   15   5.820 ± 0.103  ms/op
VectorScorerFloat32Benchmark.mipOptScorer              1024  avgt   15   7.420 ± 0.615  ms/op

linux-x64: m6i.2xlarge, Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz, preferredBitSize=512

Benchmark                                            (size)  Mode  Cnt   Score   Error  Units
VectorScorerFloat32Benchmark.cosineDefault             1024  avgt   15  11.099 ± 0.225  ms/op
VectorScorerFloat32Benchmark.cosineDefaultBulk         1024  avgt   15  10.563 ± 0.772  ms/op
VectorScorerFloat32Benchmark.cosineOptBulkScore        1024  avgt   15   6.342 ± 0.135  ms/op
VectorScorerFloat32Benchmark.cosineOptScorer           1024  avgt   15   7.302 ± 0.141  ms/op
VectorScorerFloat32Benchmark.dotProductDefault         1024  avgt   15   9.935 ± 0.636  ms/op
VectorScorerFloat32Benchmark.dotProductDefaultBulk     1024  avgt   15  10.014 ± 0.269  ms/op
VectorScorerFloat32Benchmark.dotProductOptBulkScore    1024  avgt   15   5.627 ± 0.139  ms/op
VectorScorerFloat32Benchmark.dotProductOptScorer       1024  avgt   15   5.720 ± 0.119  ms/op
VectorScorerFloat32Benchmark.euclideanDefault          1024  avgt   15  10.751 ± 0.521  ms/op
VectorScorerFloat32Benchmark.euclideanDefaultBulk      1024  avgt   15  10.438 ± 0.718  ms/op
VectorScorerFloat32Benchmark.euclideanOptBulkScore     1024  avgt   15   6.037 ± 0.153  ms/op
VectorScorerFloat32Benchmark.euclideanOptScorer        1024  avgt   15   6.485 ± 0.147  ms/op
VectorScorerFloat32Benchmark.mipDefault                1024  avgt   15   9.672 ± 0.608  ms/op
VectorScorerFloat32Benchmark.mipDefaultBulk            1024  avgt   15   9.910 ± 0.473  ms/op
VectorScorerFloat32Benchmark.mipOptBulkScore           1024  avgt   15   5.722 ± 0.102  ms/op
VectorScorerFloat32Benchmark.mipOptScorer              1024  avgt   15   5.899 ± 0.167  ms/op

linux-amd64: m6a.4xlarge, AMD EPYC 7R13 Processor, preferredBitSize=256

Benchmark                                            (size)  Mode  Cnt  Score   Error  Units
VectorScorerFloat32Benchmark.cosineDefault             1024  avgt   15  7.643 ± 0.144  ms/op
VectorScorerFloat32Benchmark.cosineDefaultBulk         1024  avgt   15  7.709 ± 0.159  ms/op
VectorScorerFloat32Benchmark.cosineOptBulkScore        1024  avgt   15  4.592 ± 0.026  ms/op
VectorScorerFloat32Benchmark.cosineOptScorer           1024  avgt   15  8.149 ± 0.047  ms/op
VectorScorerFloat32Benchmark.dotProductDefault         1024  avgt   15  7.396 ± 0.151  ms/op
VectorScorerFloat32Benchmark.dotProductDefaultBulk     1024  avgt   15  7.460 ± 0.163  ms/op
VectorScorerFloat32Benchmark.dotProductOptBulkScore    1024  avgt   15  3.915 ± 0.045  ms/op
VectorScorerFloat32Benchmark.dotProductOptScorer       1024  avgt   15  5.920 ± 0.053  ms/op
VectorScorerFloat32Benchmark.euclideanDefault          1024  avgt   15  7.357 ± 0.157  ms/op
VectorScorerFloat32Benchmark.euclideanDefaultBulk      1024  avgt   15  7.284 ± 0.132  ms/op
VectorScorerFloat32Benchmark.euclideanOptBulkScore     1024  avgt   15  4.260 ± 0.050  ms/op
VectorScorerFloat32Benchmark.euclideanOptScorer        1024  avgt   15  6.747 ± 0.047  ms/op
VectorScorerFloat32Benchmark.mipDefault                1024  avgt   15  7.462 ± 0.142  ms/op
VectorScorerFloat32Benchmark.mipDefaultBulk            1024  avgt   15  7.347 ± 0.124  ms/op
VectorScorerFloat32Benchmark.mipOptBulkScore           1024  avgt   15  3.915 ± 0.048  ms/op
VectorScorerFloat32Benchmark.mipOptScorer              1024  avgt   15  5.839 ± 0.085  ms/op
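
As a quick check of the summary against the tables, the sketch below computes speedup ratios from the dot product averages quoted above (ms/op, lower is better). It ignores the reported error bars, and the class and method names are hypothetical.

```java
/** Illustrative arithmetic on the JMH averages quoted above (ms/op, lower is better). */
final class SpeedupSketch {

  /** How many times faster the optimized variant is than the baseline. */
  static double speedup(double baselineMsPerOp, double optimizedMsPerOp) {
    return baselineMsPerOp / optimizedMsPerOp;
  }

  /** Percentage of time per operation saved relative to the baseline. */
  static double percentTimeSaved(double baselineMsPerOp, double optimizedMsPerOp) {
    return 100.0 * (1.0 - optimizedMsPerOp / baselineMsPerOp);
  }

  public static void main(String[] args) {
    String[] platform = {"linux-arm", "linux-x64", "linux-amd64"};
    // dotProductDefault, dotProductOptBulkScore, dotProductOptScorer from the tables above.
    double[] dotDefault = {10.484, 9.935, 7.396};
    double[] dotOptBulk = {5.856, 5.627, 3.915};
    double[] dotOptScorer = {7.214, 5.720, 5.920};
    for (int i = 0; i < platform.length; i++) {
      System.out.printf(
          "%-12s bulk %.2fx faster, single-vector %.1f%% less time%n",
          platform[i],
          speedup(dotDefault[i], dotOptBulk[i]),
          percentTimeSaved(dotDefault[i], dotOptScorer[i]));
    }
  }
}
```

For dot product, the bulk speedup comes out between about 1.77x and 1.89x across the three machines, which is in line with the "~2x" summary.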

@benwtrent benwtrent left a comment

Do we want to wait until we get a bench run justifying the dot-product change first?

Intuitively, this should be better (all benchmarking indicates bulk off-heap improves things), but we should justify with end-to-end benchmarking :)

@benwtrent
Member

@ChrisHegarty The Lucene nightly benchmarks did indeed show an improvement for dot product! Looks good!

https://benchmarks.mikemccandless.com/VectorSearch.html

@ChrisHegarty ChrisHegarty merged commit 41fcc22 into apache:main Aug 14, 2025
8 checks passed
ChrisHegarty added a commit to ChrisHegarty/lucene that referenced this pull request Aug 14, 2025
akhilesh-k pushed a commit to akhilesh-k/lucene that referenced this pull request Aug 24, 2025