
Add prefetching to x64 bulk vector implementations#142387

Merged
thecoop merged 13 commits intoelastic:mainfrom
thecoop:vector-bulk-op-prefetch
Mar 2, 2026

Conversation

Member

@thecoop thecoop commented Feb 12, 2026

Prefetch the next set of vectors whilst processing the current set, for sqri7u, bulki8, and 2- and 4-bit BBQ.
Remove the AVX512 BBQ implementations, leaving only AVX2, as AVX512 is often slower than the AVX2 implementations.

This causes a small performance drop for small numbers of vectors (-5%), but is substantially faster for large numbers (+70%).

e.g.

VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE           128  thrpt    5  335315.141 ±  4260.120  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE          1500  thrpt    5   27202.387 ±   305.012  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     535.276 ±    57.630  ops/s

becomes

VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE           128  thrpt    5  328152.116 ±  2151.307  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE          1500  thrpt    5   26192.751 ±   128.196  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk              256    1024   EUCLIDEAN            NATIVE        130000  thrpt    5     767.641 ±    25.418  ops/s

KnnIndexTester shows a 20% speed improvement for sqri7u searches, and 10-20% for 2/4-bit BBQ.

    f32_t* results
) {
    for (int c = 0; c < count; c++) {
        const int blk = dims & ~(STRIDE_BYTES_LEN - 1);
Member Author

@thecoop thecoop Feb 12, 2026


This bulk pattern seems to be template-able, but it needs:

  • the 'inner' method to be a template param (here sqri7u_inner)
  • the vectors tail method can be a template param (vec_sqri7u)
  • the dimension tail. This is much less obvious, as it's inline in the method, and several method variables are updated by that bit of code. How could that be templated out sensibly?

Member Author


Alternatively, we could use classes, but then I'm not sure what that does to inline-ability and (potentially virtual) method calls.

Contributor


We have tried to stay away from classes, and especially virtual methods, up to now, and to focus on what can be done with templating instead.
A template would probably not be easy on the eyes; as you point out, it would require 3 template functions.

And it would be good to look at the cost of virtual calls for bulk; the overhead for single distance/score made it infeasible, but maybe with bulk operations we are OK?

Let's split and try both; I can give a template-based version a go, you could try a class-based one.

Member

@benwtrent benwtrent left a comment


I am not sure how to interpret the benchmarks.

The way native scoring is actually used, we score only bulks of N vectors at a time, and drop to native only for those bulk sizes. Then we return to "java land", rinse and repeat.

This is true for diskbbq, hnsw, and flat (eventually).

Do the benchmarks indicate that there is only a performance gain if we do VERY large bulk sizes? Or do the benchmarks also take into account the switching between Java and Native every 16/32/64 vectors at a time?

@benwtrent
Member

Do the benchmarks indicate that there is only a performance gain if we do VERY large bulk sizes? Or do the benchmarks also take into account the switching between Java and Native every 16/32/64 vectors at a time?

Reading the benchmarks, they don't; it just assumes we are bulk scoring every vector in the set, which is not how things actually work. I will see if I can iterate on adjusting this one benchmark, but the others do seem complicated to add this logic to.

@benwtrent
Member

benwtrent commented Feb 13, 2026

So, I have a local branch that adds a bulkScore parameter; I ran it against main (so, not this branch right now).

PR: #142480

Could we see how the prefetching works with this type of thing?

Benchmark                                                    (bulkSize)  (dims)   (function)  (implementation)  (numVectors)   Mode  Cnt      Score      Error  Units
VectorScorerInt7uBulkBenchmark.scoreMultipleRandom                   32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15568.442 ±  840.138  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandom                 1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  16323.343 ±  796.789  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk               32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  25851.679 ± 2296.438  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk             1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  25972.461 ± 1751.552  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequential               32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  17561.196 ±  957.424  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequential             1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  17510.794 ± 1528.157  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequentialBulk           32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  26485.337 ±  380.044  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequentialBulk         1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  26095.281 ±  701.095  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandom              32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15366.295 ± 1316.324  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandom            1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15804.847 ±  149.808  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  27118.706 ±  892.149  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  27325.290 ±  226.967  ops/s

@ldematte
Contributor

Based on what we found, it makes sense to concentrate on smaller bulk sizes.
We should keep in mind that prefetching has very little benefit when accessing data sequentially (bbq/diskbbq). Also, I'm not sure how we should reconcile it with the striped nature of 2- and 4-bit vectors -- I think we are missing something here; we need to prefetch different areas. See my comment.

Prefetching should matter more for HNSW; the "MultipleRandomBulk" case, where we use bulk but have offsets (int[] nodes).

@thecoop
Member Author

thecoop commented Feb 24, 2026

The AVX512 BBQ implementations are slower than the AVX2 implementations, so I've removed them; we now always use the AVX2 implementations.

@thecoop thecoop removed the WIP label Feb 25, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Feb 25, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Contributor

@ldematte ldematte left a comment


Looks good. Some questions about possible alternatives for prefetching and experimenting, in case they make sense / you haven't already tried them.
Also one consideration about the BBQ stuff for AVX-512 -- looking at the code, I'm not surprised this is slower for dims < 1024.
However, I'm happy to re-introduce them with the proposed fixes in another PR.


static inline __m512i dot_bit_512(const __m512i a, const int8_t* b) {
    const __m512i q0 = _mm512_loadu_si512((const __m512i*)b);
    return _mm512_popcnt_epi64(_mm512_and_si512(q0, a));
Contributor


I'm really surprised this is slower, as it should be the optimal instruction to use. However, this path is taken only if we can fill the whole STRIDE_BYTES_LEN (512 bits), which is not a given! Dimensions < 512 will not benefit from this at all, and it probably makes sense only for 1024 or more.

    int64_t subRet2 = _mm512_reduce_add_epi64(acc2);
    int64_t subRet3 = _mm512_reduce_add_epi64(acc3);

    for (; r < length; r++) {
Contributor


A way of making this better / on par with AVX2 without removing it (I think this should still be faster for higher dimensions) is to fall back to the AVX2 code path here -- or to _mm256_popcnt_epi64, _mm_popcnt_epi64, and __builtin_popcountll progressively.

@thecoop
Member Author

thecoop commented Feb 26, 2026

I think it'll be best if we do the AVX512 D[124]Q4 implementations all at once, so we can ensure it's all done together. We also need to work out how to do the fallback (whether to just copy-paste the AVX2 code, or something more sophisticated) at the same time.

Contributor

@ldematte ldematte left a comment


Agree to do AVX512 for D[124]Q4 implementations all at once.
LGTM

@ldematte ldematte enabled auto-merge (squash) March 2, 2026 08:30
@thecoop thecoop disabled auto-merge March 2, 2026 09:39
@thecoop thecoop merged commit d913ab0 into elastic:main Mar 2, 2026
35 checks passed
@thecoop thecoop deleted the vector-bulk-op-prefetch branch March 2, 2026 10:41
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 2, 2026
tballison pushed a commit to tballison/elasticsearch that referenced this pull request Mar 3, 2026
Remove AVX512 BBQ implementations pending further work

Labels

>refactoring :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.4.0

4 participants