
Add prefetching to x64 bulk vector implementations#142387

Merged
thecoop merged 13 commits intoelastic:mainfrom
thecoop:vector-bulk-op-prefetch
Mar 2, 2026

Conversation

Member

@thecoop thecoop commented Feb 12, 2026

Prefetch the next set of vectors whilst processing the current set, for sqri7u, bulki8, and 2- and 4-bit BBQ.
Remove the AVX512 BBQ implementations, leaving only AVX2, as AVX512 is often slower than the AVX2 implementations.

This causes a small performance drop for small numbers of vectors (-5%), but is substantially faster for large numbers (+70%).

e.g.

VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE           128  thrpt    5  335315.141 ±  4260.120  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE          1500  thrpt    5   27202.387 ±   305.012  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     535.276 ±    57.630  ops/s

becomes

VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE           128  thrpt    5  328152.116 ±  2151.307  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE          1500  thrpt    5   26192.751 ±   128.196  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     767.641 ±    25.418  ops/s

KnnIndexTester shows a 20% speed improvement for sqri7u searches, and 10-20% for 2/4-bit BBQ.

    f32_t* results
) {
    for (int c = 0; c < count; c++) {
        const int blk = dims & ~(STRIDE_BYTES_LEN - 1);
Member Author

@thecoop thecoop Feb 12, 2026


This bulk pattern seems to be template-able, but it needs:

  • the 'inner' method to be a template param (here sqri7u_inner)
  • the vectors tail method can be a template param (vec_sqri7u)
  • the dimension tail. This is much less obvious, as it's inline in the method, and several method variables are updated by that bit of code. How could that be templated out sensibly?

Member Author


Alternatively, we could use classes, but then I'm not sure what that does to inline-ability and (potentially virtual) method calls.

Contributor


We have tried to stay away from classes, and especially virtual methods, up to now, and to focus on what can be done with templating instead.
A template would probably not be easy on the eyes; as you point out, it would require 3 template functions.

And it would be good to look at the cost of virtual calls for bulk; the overhead for single distance/score made it infeasible, but maybe with bulk operations we are OK?

Let's split and try both; I can give a template-based version a go, you could try a class-based one.

Member

@benwtrent benwtrent left a comment


I am not sure how to interpret the benchmarks.

The way native scoring is actually used, we score only bulks of N vectors at a time, and drop to native only for those bulk sizes. Then we return to "java land", rinse and repeat.

This is true for diskbbq, hnsw, and flat (eventually).

Do the benchmarks indicate that there is only a performance gain if we do VERY large bulk sizes? Or do the benchmarks also take into account the switching between Java and Native every 16/32/64 vectors at a time?

@benwtrent
Member

Do the benchmarks indicate that there is only a performance gain if we do VERY large bulk sizes? Or do the benchmarks also take into account the switching between Java and Native every 16/32/64 vectors at a time?

Reading the benchmarks, they don't; it just assumes we are bulk scoring every vector in the set, which is not how things actually work. I will see if I can iterate on adjusting this one benchmark, but the others do seem complicated to add this logic to.

@benwtrent
Member

benwtrent commented Feb 13, 2026

So, I have a local branch that adds a bulkScore parameter; I ran it against main (so, not this branch right now).

PR: #142480

Could we see how the prefetching works with this type of thing?

Benchmark                                                    (bulkSize)  (dims)   (function)  (implementation)  (numVectors)   Mode  Cnt      Score      Error  Units
VectorScorerInt7uBulkBenchmark.scoreMultipleRandom                   32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15568.442 ±  840.138  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandom                 1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  16323.343 ±  796.789  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk               32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  25851.679 ± 2296.438  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk             1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  25972.461 ± 1751.552  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequential               32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  17561.196 ±  957.424  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequential             1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  17510.794 ± 1528.157  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequentialBulk           32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  26485.337 ±  380.044  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequentialBulk         1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  26095.281 ±  701.095  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandom              32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15366.295 ± 1316.324  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandom            1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15804.847 ±  149.808  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  27118.706 ±  892.149  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  27325.290 ±  226.967  ops/s

@ldematte
Contributor

Based on what we found, it makes sense to concentrate on smaller bulk sizes.
We should keep in mind that prefetching has very little benefit when accessing data sequentially (bbq/diskbbq). Also, I'm not sure how we should reconcile it with the striped nature of 2- and 4-bit vectors -- I think we are missing something here; we need to prefetch different areas. See my comment.

Prefetching should matter more for HNSW; the "MultipleRandomBulk" case, where we use bulk but have offsets (int[] nodes).

@thecoop
Member Author

thecoop commented Feb 24, 2026

The AVX512 BBQ implementations are slower than the AVX2 implementations, so I've removed them; we now always use the AVX2 implementations.

@thecoop thecoop removed the WIP label Feb 25, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Feb 25, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Contributor

@ldematte ldematte left a comment


Looks good. Some questions about possible alternatives for prefetching and experimenting, in case they make sense / you haven't already tried them.
Also one consideration about the BBQ stuff for AVX-512 -- looking at the code, I'm not surprised this is slower for dims < 1024.
However, I'm happy to re-introduce them with the proposed fixes in another PR.


static inline __m512i dot_bit_512(const __m512i a, const int8_t* b) {
    const __m512i q0 = _mm512_loadu_si512((const __m512i*)b);
    return _mm512_popcnt_epi64(_mm512_and_si512(q0, a));
Contributor


I'm really surprised this is slower, as it should be the optimal instruction to use. However, this path is taken only if we can fill the whole STRIDE_BYTES_LEN (512 bits), which is not a given! Dimensions < 512 will not benefit from this at all, and it probably makes sense only for 1024 or more.

    int64_t subRet2 = _mm512_reduce_add_epi64(acc2);
    int64_t subRet3 = _mm512_reduce_add_epi64(acc3);

    for (; r < length; r++) {
Contributor


A way of making this better / on par with AVX2 without removing it (I think this should still be faster for higher dimensions) is to fall back to the AVX2 code path here -- or to _mm256_popcnt_epi64, _mm_popcnt_epi64, and __builtin_popcountll progressively.

@thecoop
Member Author

thecoop commented Feb 26, 2026

I think it'll be best if we do the AVX512 D[124]Q4 implementations all at once, so we can ensure it's all done together. We also need to work out how to do the fallback (whether to just copy-paste the AVX2 code, or something more sophisticated) at the same time.

Contributor

@ldematte ldematte left a comment


Agree to do AVX512 for D[124]Q4 implementations all at once.
LGTM

@ldematte ldematte enabled auto-merge (squash) March 2, 2026 08:30
@thecoop thecoop disabled auto-merge March 2, 2026 09:39
@thecoop thecoop merged commit d913ab0 into elastic:main Mar 2, 2026
35 checks passed
@thecoop thecoop deleted the vector-bulk-op-prefetch branch March 2, 2026 10:41
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 2, 2026
tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026
Remove AVX512 BBQ implementations pending further work

Labels

>refactoring :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.4.0

4 participants