Add bulk off-heap scoring for float32 vectors #14980
ChrisHegarty merged 42 commits into apache:main
…ry point search as benefit would be marginal
/cc @mccullocht
I switched the benchmark to report throughput (operations per second). I get similar results: JMH indicates a 2x+ improvement on an ARM MacBook on the vector ops side of things.
benwtrent
left a comment
I left some minor comments. Seems like jars were updated by accident?
However, the change looks great!
benwtrent
left a comment
Love it! Now to bulk score ALL THE THINGS!!!!
When benchmarking ARM, I really recommend worrying only about Graviton3 (ARM SVE). IMO we should stop investing any time in 128-bit Neon (Macs, Graviton2, etc.). I realize Macs are convenient, but the sooner we drop vectorization support for this legacy instruction set, the better. It is not well supported by OpenJDK, so I see no path to success there.
This commit adds bulk off-heap scoring for float32 vectors.

The bulk scorer scores 4 vectors against the query vector at a time. The general idea is to structure things so that we can somewhat tackle memory latency, allowing the CPU to overlap memory loads.

Initial results from the micro benchmarks show good potential improvement. The benchmark creates a flat vector index with 128,000 float32 vectors with 1024 dimensions (~500MB), and times how long it takes to score 20,000 random vectors against a query vector (lower times are better):

Benchmark                                            (size)  Mode  Cnt  Score    Error  Units
VectorScorerFloat32Benchmark.dotProductDefault         1024  avgt   15  8.505 ±  0.256  ms/op
VectorScorerFloat32Benchmark.dotProductNewBulkScore    1024  avgt   15  3.717 ±  0.158  ms/op
VectorScorerFloat32Benchmark.dotProductNewScorer       1024  avgt   15  7.287 ±  0.181  ms/op

Note: Just dot product for now, but other distance functions can be added as a follow-up.
Yep, Lucene nightlies show a nice improvement (variance is high, but it's consistently better than before): https://benchmarks.mikemccandless.com/VectorSearch.html
Maybe 10+% better, it seems.
This commit adds the remaining bulk float32 off-heap scoring similarities: cosine, euclidean, and max inner product. The changes in #14980 deliberately added only dot product, to avoid additional bloat in the PR and its benchmarking. This PR now refactors things a little to allow the remaining similarities to be added. They will be benchmarked independently, with care taken not to negatively affect dot product. Relates to #14980.
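For readers unfamiliar with the four similarities named above, here is a minimal scalar sketch of the raw measures and of how each raw value can be mapped to a non-negative score. The class and method names are hypothetical, and the score mappings follow my understanding of Lucene's `VectorSimilarityFunction`; treat them as an assumption to check against the Lucene source, not as the code in this PR.

```java
// Hypothetical illustration only: scalar, on-heap versions of the four similarities.
final class SimilaritySketch {

  /** Raw dot product of two equal-length vectors. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Raw squared euclidean distance. */
  static float squareDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Raw cosine of the angle between the two vectors. */
  static float cosine(float[] a, float[] b) {
    float ab = 0f, aa = 0f, bb = 0f;
    for (int i = 0; i < a.length; i++) {
      ab += a[i] * b[i];
      aa += a[i] * a[i];
      bb += b[i] * b[i];
    }
    return (float) (ab / Math.sqrt((double) aa * bb));
  }

  // Score mappings (assumed to mirror Lucene's normalization; verify before relying on them).

  static float dotProductScore(float dot) {
    return Math.max((1 + dot) / 2, 0);
  }

  static float cosineScore(float cos) {
    return Math.max((1 + cos) / 2, 0);
  }

  static float euclideanScore(float squareDistance) {
    return 1 / (1 + squareDistance);
  }

  static float maxInnerProductScore(float dot) {
    // Negative inner products are squashed into (0, 1); non-negative ones are shifted by 1.
    return dot < 0 ? 1 / (1 - dot) : dot + 1;
  }

  public static void main(String[] args) {
    float[] a = {1f, 0f};
    float[] b = {0f, 1f};
    System.out.println("dot score:       " + dotProductScore(dot(a, b)));
    System.out.println("cosine score:    " + cosineScore(cosine(a, b)));
    System.out.println("euclidean score: " + euclideanScore(squareDistance(a, b)));
    System.out.println("max-ip score:    " + maxInnerProductScore(dot(a, b)));
  }
}
```

A bulk variant would compute the raw measure for several stored vectors in one pass and then apply the same per-score mapping to each result.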
Yay! I will add an annotation for this, even though it is noisy. (Separately, I would love to reduce this noise...)

This commit adds bulk off-heap scoring for float32 vectors.
The bulk scorer scores 4 vectors against the query vector at a time. The general idea is to structure things so that we can somewhat tackle memory latency, allowing the CPU to overlap memory loads.
Initial results from the micro benchmarks show good potential improvement. The benchmark creates a flat vector index with 128,000 float32 vectors with 1024 dimensions (~500MB), and times how long it takes to score 20,000 random vectors against a query vector (lower times are better).
Notes:
We seem to suffer from type pollution of the query vector, so for now I just added two separate, independent, almost identical versions of the vector dot op.
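The "4 vectors at a time" idea can be illustrated with a small scalar sketch. It is not the PR's implementation: the class and method names are hypothetical, and a direct `FloatBuffer` stands in for the off-heap vector data so the example runs on any recent JDK without extra flags. The point it shows is that the inner loop touches four stored vectors per iteration and keeps four independent accumulators, so the four loads do not depend on each other and the CPU is free to overlap them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

// Hypothetical illustration only: scalar bulk dot product over off-heap float32 vectors.
final class BulkDotProductSketch {

  /** Scores a single stored vector, identified by its ordinal, against the query. */
  static float dotProduct(FloatBuffer vectors, int dims, int ord, float[] query) {
    int base = ord * dims; // assumes ord * dims fits in an int
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      sum += vectors.get(base + i) * query[i];
    }
    return sum;
  }

  /** Scores ords[0..count) against the query: four at a time, then a one-at-a-time tail. */
  static void bulkDotProduct(
      FloatBuffer vectors, int dims, int[] ords, int count, float[] query, float[] scores) {
    int n = 0;
    for (; n + 4 <= count; n += 4) {
      int b0 = ords[n] * dims;
      int b1 = ords[n + 1] * dims;
      int b2 = ords[n + 2] * dims;
      int b3 = ords[n + 3] * dims;
      // Four independent accumulators: no data dependency between the four streams.
      float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
      for (int i = 0; i < dims; i++) {
        float q = query[i]; // the query element is loaded once and reused four times
        s0 += vectors.get(b0 + i) * q;
        s1 += vectors.get(b1 + i) * q;
        s2 += vectors.get(b2 + i) * q;
        s3 += vectors.get(b3 + i) * q;
      }
      scores[n] = s0;
      scores[n + 1] = s1;
      scores[n + 2] = s2;
      scores[n + 3] = s3;
    }
    for (; n < count; n++) { // fewer than four vectors left
      scores[n] = dotProduct(vectors, dims, ords[n], query);
    }
  }

  public static void main(String[] args) {
    int dims = 3, numVectors = 6;
    FloatBuffer vectors =
        ByteBuffer.allocateDirect(numVectors * dims * Float.BYTES)
            .order(ByteOrder.nativeOrder())
            .asFloatBuffer();
    for (int k = 0; k < numVectors; k++) {
      for (int i = 0; i < dims; i++) {
        vectors.put(k * dims + i, k + i); // vector k is (k, k+1, k+2)
      }
    }
    float[] query = {1f, 2f, 3f};
    int[] ords = {5, 0, 3, 1, 4, 2};
    float[] scores = new float[ords.length];
    bulkDotProduct(vectors, dims, ords, ords.length, query, scores);
    System.out.println(Arrays.toString(scores)); // [38.0, 8.0, 26.0, 14.0, 32.0, 20.0]
  }
}
```

In this sketch the query is always a heap `float[]`, so it sidesteps the type-pollution issue mentioned in the notes; a real implementation would also use SIMD operations inside the inner loop rather than scalar multiplies.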