
Adding bulkSize for benchmarking to better reflect realworld usage #142480

Merged
benwtrent merged 5 commits into elastic:main from benwtrent:add-bulksize-to-int7-benchy
Feb 25, 2026

Conversation

@benwtrent
Member

I am not 100% sure if this is needed, but this is how things are actually used. I wonder if we are over-indexing here, assuming we will scan 1000+ vectors in a row off-heap, when in practice we jump between Java and native land in standard-sized chunks.

Here is a local run

Benchmark                                                    (bulkSize)  (dims)   (function)  (implementation)  (numVectors)   Mode  Cnt      Score      Error  Units
VectorScorerInt7uBulkBenchmark.scoreMultipleRandom                   32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15568.442 ±  840.138  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandom                 1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  16323.343 ±  796.789  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk               32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  25851.679 ± 2296.438  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleRandomBulk             1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  25972.461 ± 1751.552  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequential               32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  17561.196 ±  957.424  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequential             1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  17510.794 ± 1528.157  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequentialBulk           32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  26485.337 ±  380.044  ops/s
VectorScorerInt7uBulkBenchmark.scoreMultipleSequentialBulk         1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  26095.281 ±  701.095  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandom              32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15366.295 ± 1316.324  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandom            1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  15804.847 ±  149.808  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk          32    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  27118.706 ±  892.149  ops/s
VectorScorerInt7uBulkBenchmark.scoreQueryMultipleRandomBulk        1500    1024  DOT_PRODUCT            NATIVE          1500  thrpt    5  27325.290 ±  226.967  ops/s

The difference seems very minimal, and might just be due to the new overhead of the array copy. But I don't know how to avoid that without significant refactors or a dramatic increase in heap utilization (maybe that's ok? I can adjust so that all slices are created up front....).
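The "create all slices up front" alternative mentioned here could look roughly like this (a hypothetical sketch, not the benchmark's actual code — `PreSlicedBatches` and `slice` are illustrative names): the id batches are materialized once during setup, trading heap for removing the per-batch copy from the measured loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: partition the ids into bulkSize-sized arrays once,
// up front, so the hot loop only hands pre-built batches to the scorer.
public class PreSlicedBatches {

    // Splits [0, numVectors) into batches of at most bulkSize sequential ids.
    public static List<int[]> slice(int numVectors, int bulkSize) {
        List<int[]> batches = new ArrayList<>();
        for (int i = 0; i < numVectors; i += bulkSize) {
            int n = Math.min(bulkSize, numVectors - i);
            int[] batch = new int[n];
            for (int j = 0; j < n; j++) {
                batch[j] = i + j;
            }
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args) {
        // 1500 vectors in batches of 32 -> 47 batches, the last holding 28 ids.
        List<int[]> batches = slice(1500, 32);
        System.out.println(batches.size() + " batches, last has "
                + batches.get(batches.size() - 1).length + " ids");
    }
}
```

The heap cost is roughly one int per scored id plus array headers, which is exactly the trade-off raised above.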

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Feb 13, 2026
@thecoop
Member

thecoop commented Feb 20, 2026

Looks sensible, but can we link the default size 32 to a specific codepath in ES? Can we apply this for other BulkBenchmark classes too?

@benwtrent
Member Author

can we link the default size 32 to a specific codepath in ES?

For sure

Can we apply this for other BulkBenchmark classes too?

I suppose? My main concern is that this helps the current work with native pre-fetching.

@thecoop
Member

thecoop commented Feb 23, 2026

We can reference ESNextOSQVectorsScorer.BULK_SIZE in a comment here.

I've also been looking at HNSW; that uses maxConn * 2 for its batch sizes, so 16 * 2 = 32 with the default maxConn of 16. The maximum maxConn is 512, so the absolute maximum is 1024. Exhaustive searches use a batch size of 64.

I suggest we use 16, 64, 256, and 1024 for batch sizes here.

@benwtrent benwtrent requested a review from thecoop February 23, 2026 14:05
@benwtrent
Member Author

@thecoop if this is good, I will merge and then add similar logic to our other benchmarks.

Contributor

@ldematte ldematte left a comment


LGTM, it makes sense to separate the size of the dataset to search (so as to test the behaviour in case of cache misses) from the bulk batch size.

I was wondering if we need both numVectorsToScore and bulkSize; we could see how this behaves just by scoring a single bulk, but having numVectorsToScore too is more realistic (it produces more realistic cache-access patterns).

Code under review (before):

    for (int v = 0; v < numVectorsToScore; v++) {
        scores[v] = scorer.score(v);

(after):

    int v = 0;
    while (v < numVectorsToScore) {
Contributor


This is the same as before, but I'm OK with the change as it highlights that we always have smaller batches.

Code under review:

    for (int i = 0; i < numVectorsToScore; i += bulkSize) {
        int toScoreInThisBatch = Math.min(bulkSize, numVectorsToScore - i);
        // Copy the slice of sequential IDs to the scratch array
        System.arraycopy(ids, i, toScore, 0, toScoreInThisBatch);
Contributor


This is probably not negligible in terms of impact on the benchmark, or is it?

Member Author


It has an impact. I don't know of a better way other than pre-allocating ALL the batches ahead of time. If we are ok with the heap usage, we can do that instead.
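For reference, the copy-per-batch pattern being discussed can be sketched as plain Java (hypothetical names throughout — `bulkScore` here is a toy stand-in for the native scorer, not the real API):

```java
import java.util.Arrays;
import java.util.function.IntToDoubleFunction;

// Sketch of the benchmark's batching pattern: a scratch array of at most
// bulkSize ids is refilled per iteration, mimicking the System.arraycopy
// into `toScore` before each bulk-score call.
public class BatchedScoringSketch {

    // Toy stand-in for the native bulk scorer: scores ids[0..count) into scores[0..count).
    static void bulkScore(IntToDoubleFunction scorer, int[] ids, int count, float[] scores) {
        for (int i = 0; i < count; i++) {
            scores[i] = (float) scorer.applyAsDouble(ids[i]);
        }
    }

    // Scores numVectorsToScore sequential ids in batches of bulkSize.
    public static float[] scoreBatched(IntToDoubleFunction scorer, int numVectorsToScore, int bulkSize) {
        int[] ids = new int[numVectorsToScore];
        for (int i = 0; i < numVectorsToScore; i++) {
            ids[i] = i;
        }
        int[] toScore = new int[bulkSize];          // scratch, reused per batch
        float[] batchScores = new float[bulkSize];  // scratch, reused per batch
        float[] scores = new float[numVectorsToScore];
        for (int i = 0; i < numVectorsToScore; i += bulkSize) {
            int toScoreInThisBatch = Math.min(bulkSize, numVectorsToScore - i);
            // The copy under discussion: slice ids into the scratch array per batch.
            System.arraycopy(ids, i, toScore, 0, toScoreInThisBatch);
            bulkScore(scorer, toScore, toScoreInThisBatch, batchScores);
            System.arraycopy(batchScores, 0, scores, i, toScoreInThisBatch);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy scorer with score(id) = 2 * id; 10 ids with bulkSize 4 exercises a partial final batch.
        float[] scores = scoreBatched(id -> 2.0 * id, 10, 4);
        System.out.println(Arrays.toString(scores));
    }
}
```

The two `System.arraycopy` calls per batch are the overhead being weighed against pre-allocating every slice up front.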

Contributor


How does the real code do this? Does it create a new array? Use a scratch array like the benchmark does? I think we should do the same.
Also, there is probably room for improvement here; we can avoid copies if we change the API to

void bulkScore(int[] nodes, float[] scores, int offset, int bulkSize)

(or add it, with the existing implementation calling bulkScore(nodes, scores, 0, bulkSize) or something like that)
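One possible shape for that overload (a hypothetical sketch — the interface name, the delegation from the old signature, and the choice that `offset` applies to both `nodes` and `scores` are all assumptions, not the real API):

```java
public class OffsetBulkScoreSketch {

    // Hypothetical scorer interface with the suggested offset-based overload.
    public interface BulkScorer {
        // Existing-style entry point delegates to the offset variant.
        default void bulkScore(int[] nodes, float[] scores, int bulkSize) {
            bulkScore(nodes, scores, 0, bulkSize);
        }

        // Proposed overload: score nodes[offset..offset+bulkSize) in place,
        // so callers never copy slices out of their id array.
        void bulkScore(int[] nodes, float[] scores, int offset, int bulkSize);
    }

    // Toy implementation for illustration: score(node) = node * 0.5f.
    public static final BulkScorer TOY = (nodes, scores, offset, bulkSize) -> {
        for (int i = offset; i < offset + bulkSize; i++) {
            scores[i] = nodes[i] * 0.5f;
        }
    };

    public static void main(String[] args) {
        int[] ids = {0, 1, 2, 3, 4, 5, 6, 7};
        float[] scores = new float[ids.length];
        // Score ids in two batches of 4 without any System.arraycopy.
        TOY.bulkScore(ids, scores, 0, 4);
        TOY.bulkScore(ids, scores, 4, 4);
        System.out.println(java.util.Arrays.toString(scores));
    }
}
```

The caller walks `offset` forward in `bulkSize` steps instead of copying each slice into a scratch array, which is the copy-avoidance suggested above.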

Contributor


But that's a problem for another day :)

Member Author


How does the real code do this? Does it creates a new array? Use a scratch like the benchmark? I think we should do the same.

For a query, it creates a new score & batch array, and those single arrays are used for the entire duration of the query, which means over many score runs.

However, that also means that the IDs that are USED for a batch are indeed copied in (prod is actually much slower, popping from a queue individually for HNSW).

Code under review:

    bench.dims = dims;
    bench.numVectors = 1000;
    bench.numVectorsToScore = 200;
    bench.bulkSize = 200;
Contributor


I would use a bulkSize smaller than numVectorsToScore to better exercise the two nested loops.

Member Author


I do this because the tests make so many different assumptions about the return value; changing it would be a significant rewrite.

Code under review:

    @Param({ "16", "32", "64", "256", "1024" })
    public int bulkSize;

    @Param({ "SCALAR", "LUCENE", "NATIVE" })
Contributor


Nit: unless we want to explicitly exclude an entry, a bare @Param will do (that way we don't need to worry about keeping this list updated in case we add a new implementation).

Code under review:

    // HNSW will distribute ordinal bulk sizes depending on the number of connections in the graph.
    // The default is 16, the maximum is 512, and the bottom layer uses 2x the configured setting, so 1024 is the maximum.
    // The MOST common case here is 32.
    @Param({ "16", "32", "64", "256", "1024" })
Contributor


@ChrisHegarty FYI, I think Lucene benchmarks should be updated in the same way?

@thecoop
Member

thecoop commented Feb 23, 2026

I've been seeing some memory problems with the new benchmarks, where there weren't any before. I'll check if there's anything obvious going on.

@thecoop
Member

thecoop commented Feb 24, 2026

Well, I've been unable to replicate the hangs I've seen. I suggest merging this; then we can explore further if it happens again (given this is test code).

@benwtrent benwtrent merged commit e7f09ce into elastic:main Feb 25, 2026
35 checks passed
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Feb 25, 2026
…lastic#142480)

* Adding bulkSize for benchmarking to better reflect realworld usage

* adding bulk size

Labels

>non-issue, :Search Relevance/Vectors, Team:Search Relevance, v9.4.0


4 participants