Adding bulkSize for benchmarking to better reflect realworld usage #142480
benwtrent merged 5 commits into elastic:main from
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)

Looks sensible, but can we link the default size
For sure
I suppose? My main concern is that this helps the current work with native pre-fetching.

We can reference … I've also been looking at HNSW; that uses … I suggest we use 16, 64, 256, and 1024 for batch sizes here.
.../src/main/java/org/elasticsearch/benchmark/vector/scorer/VectorScorerInt7uBulkBenchmark.java
@thecoop if this is good, I will merge and then add similar logic to our other benchmarks. |
ldematte left a comment
LGTM, it makes sense to separate the size of the dataset to search (so as to test the behaviour in case of cache misses) from the bulk batch size.
I was wondering if we need both numVectorsToScore and bulkSize; we could see how this behaves just by scoring a single bulk, but having numVectorsToScore too is more realistic (it helps produce more realistic cache-access patterns).
```java
// before:
for (int v = 0; v < numVectorsToScore; v++) {
    scores[v] = scorer.score(v);

// after:
int v = 0;
while (v < numVectorsToScore) {
```
This is the same as before, but I'm OK with the change as it highlights that we always have smaller batches.
```java
for (int i = 0; i < numVectorsToScore; i += bulkSize) {
    int toScoreInThisBatch = Math.min(bulkSize, numVectorsToScore - i);
    // Copy the slice of sequential IDs to the scratch array
    System.arraycopy(ids, i, toScore, 0, toScoreInThisBatch);
```
This is probably not negligible in terms of impact on the benchmark, or is it?
It has an impact. I don't know of a better way other than pre-allocating ALL the batches ahead of time. If we are OK with the heap usage, we can do that instead.
How does the real code do this? Does it create a new array? Use a scratch array like the benchmark? I think we should do the same.
Also, there is probably room for improvement here; we can avoid the copies if we change the API to

```java
void bulkScore(int[] nodes, float[] scores, int offset, int bulkSize)
```

(or add it, with the existing implementation calling `bulkScore(nodes, scores, 0, bulkSize)` or something like that)
But that's a problem for another day :)
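A minimal sketch of the offset-taking variant suggested above, with the existing-style entry point delegating to it. The interface and method names here are hypothetical stand-ins, not the actual Elasticsearch API:

```java
// Hypothetical sketch: an offset-taking bulkScore lets callers walk a large
// id array batch by batch without copying each slice into a scratch buffer.
interface BulkScorer {
    float score(int node);

    // Existing-style entry point: scores nodes[0..bulkSize).
    default void bulkScore(int[] nodes, float[] scores, int bulkSize) {
        bulkScore(nodes, scores, 0, bulkSize);
    }

    // Offset-taking variant: scores nodes[offset..offset+bulkSize)
    // into scores[0..bulkSize), so no per-batch System.arraycopy is needed.
    default void bulkScore(int[] nodes, float[] scores, int offset, int bulkSize) {
        for (int i = 0; i < bulkSize; i++) {
            scores[i] = score(nodes[offset + i]);
        }
    }
}
```

A real implementation would vectorize the inner loop; the point is only that the offset parameter removes the copy from the hot path.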
> How does the real code do this? Does it create a new array? Use a scratch array like the benchmark? I think we should do the same.
For a query, it creates a new score & batch array, and those single arrays are used for the entire duration of the query, which means over many score runs.
However, that also means that the IDs that are USED for a batch are indeed copied in (prod is actually much slower, popping IDs from a queue individually for HNSW).
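The pattern described above (single scratch arrays allocated once and reused over many score runs, with ids copied into the scratch per batch) can be sketched as follows. The scorer and names are illustrative stand-ins, not the actual benchmark code:

```java
import java.util.function.IntToDoubleFunction;

class BatchedScoring {
    // Scores every id in bulkSize-sized batches, reusing one scratch array.
    static float[] scoreAll(IntToDoubleFunction scorer, int[] ids, int bulkSize) {
        float[] all = new float[ids.length];
        int[] toScore = new int[bulkSize]; // scratch, reused for every batch
        for (int i = 0; i < ids.length; i += bulkSize) {
            int inThisBatch = Math.min(bulkSize, ids.length - i);
            // The copy the review is discussing: slice ids into the scratch array
            System.arraycopy(ids, i, toScore, 0, inThisBatch);
            for (int b = 0; b < inThisBatch; b++) {
                all[i + b] = (float) scorer.applyAsDouble(toScore[b]);
            }
        }
        return all;
    }
}
```

The arraycopy cost is per batch, while the array allocations are once per query, which is why the copy (not allocation) shows up in the benchmark discussion.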
```java
bench.dims = dims;
bench.numVectors = 1000;
bench.numVectorsToScore = 200;
bench.bulkSize = 200;
```
I would use a number < numVectorsToScore to better exercise the two nested loops.

I do this because the tests make so many different assumptions about the return value; changing it would be a significant rewrite.
```java
@Param({ "16", "32", "64", "256", "1024" })
public int bulkSize;
```

```java
@Param({ "SCALAR", "LUCENE", "NATIVE" })
```
Nit: unless we want to explicitly exclude an entry, just @Param will do (this way we don't need to worry about keeping this list updated, in case we add a new implementation)
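For reference, the nit could look like this, assuming the implementation selector is an enum (the names below are illustrative, not the actual benchmark field). JMH expands a bare @Param on an enum field to all of its constants, so a newly added implementation is benchmarked automatically:

```java
public enum Impl { SCALAR, LUCENE, NATIVE }

// before: every constant listed by hand, and the list can go stale
@Param({ "SCALAR", "LUCENE", "NATIVE" })
public Impl implementation;

// after: bare @Param enumerates all enum constants automatically
@Param
public Impl implementation;
```

This is a benchmark-annotation fragment, not standalone code; it needs JMH on the classpath.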
```java
// HNSW params will have the distributed ordinal bulk sizes depending on the number of connections in the graph
// The default is 16, the maximum is 512, and the bottom layer is 2x the configured setting, so 1024 is a maximum
// the MOST common case here is 32
@Param({ "16", "32", "64", "256", "1024" })
```
@ChrisHegarty FYI, I think Lucene benchmarks should be updated in the same way?
I've been seeing some memory problems with the new benchmarks, where there weren't any before. I'll check if there's anything obvious going on.

Well, I've been unable to replicate the hangs I've seen - I suggest merging this, then we can explore further if it happens again (given this is test code).
…lastic#142480)

* Adding bulkSize for benchmarking to better reflect realworld usage
* adding bulk size
I am not 100% sure if this is needed or not, but this is how things are actually used. I wonder if we are over-indexing here, thinking we will just scan 1000+ vectors in a row off-heap, when actually we will jump between Java and native land in standard chunks?
Here is a local run
The difference seems very minimal, and might just be due to the new overhead of the array copy. But I don't know how to avoid that without significant refactors or a dramatic increase in heap utilization (maybe that's OK? I can adjust so that all slices are created up front…).
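The "all slices created up front" alternative mentioned above could be sketched as follows: pre-slice the id array into all batches once per query, trading heap for copy-free iteration afterwards. Names here are hypothetical:

```java
import java.util.Arrays;

class PreSlicedBatches {
    // Splits ids into ceil(ids.length / bulkSize) batches, materializing
    // every slice up front so the scoring loop does no per-batch arraycopy.
    static int[][] preSlice(int[] ids, int bulkSize) {
        int nBatches = (ids.length + bulkSize - 1) / bulkSize;
        int[][] batches = new int[nBatches][];
        for (int b = 0; b < nBatches; b++) {
            int from = b * bulkSize;
            int to = Math.min(from + bulkSize, ids.length);
            batches[b] = Arrays.copyOfRange(ids, from, to);
        }
        return batches;
    }
}
```

The copies still happen, but once per query instead of once per score run, at the cost of holding every batch on the heap simultaneously.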