Add prefetching to x64 bulk vector implementations #142387
thecoop merged 13 commits into elastic:main from
Conversation
        f32_t* results
    ) {
        for (int c = 0; c < count; c++) {
            const int blk = dims & ~(STRIDE_BYTES_LEN - 1);
This bulk pattern seems to be template-able, but it needs:
- the 'inner' method to be a template param (here sqri7u_inner)
- the vectors tail method to be a template param (vec_sqri7u)
- the dimension tail. This one is a lot less obvious, as it's inline in the method, and several method variables are updated by that bit of code. How could that be templated out sensibly?
Alternatively, we could use classes, but then I'm not sure what that does to inline-ability and (potentially virtual) method calls.
We have tried to avoid classes, and especially virtual methods, up to now, and to focus on what can be done with templating instead.
A template would probably not be easy on the eyes; as you point out, it would require 3 template functions.
It would also be good to look at the cost of virtual calls for bulk operations; the overhead for a single distance/score made them infeasible, but maybe with bulk operations we are OK?
Let's split and try both; I can give a template-based version a go, you could try a class-based one.
benwtrent
left a comment
I am not sure how to interpret the benchmarks.
The way native scoring is actually used, we will score only bulks of N vectors at a time, and drop to native only for those bulk sizes. Then we return to "java land", rinse and repeat.
This is true for diskbbq, hnsw, and flat (eventually).
Do the benchmarks indicate that there is only a performance gain if we do VERY large bulk sizes? Or do the benchmarks also take into account the switching between Java and Native every 16/32/64 vectors at a time?
Reading the benchmarks, they don't: they just assume we are bulk scoring every vector in the set, which is not how things actually work. I will see if I can iterate on adjusting this one benchmark, but the others do seem complicated to add this logic to.
So, I have a local branch that adds PR: #142480. Could we see how the prefetching works with this type of thing?
Based on what we found, it makes sense to concentrate on smaller bulk sizes. Prefetching should matter more for HNSW; the "MultipleRandomBulk" thing, where we use bulk but we have offsets (
The AVX512 BBQ implementations are slower than the AVX2 implementations, so I've removed them; we now always use the AVX2 implementations.
Pinging @elastic/es-search-relevance (Team:Search Relevance)
ldematte
left a comment
Looks good; some questions about possible alternatives for prefetching and experimenting, if they make sense / if you haven't already tried them.
Also one consideration about the bbq stuff for AVX-512 -- looking at the code, I'm not surprised this is slower for dims < 1024.
However, I'm happy to re-introduce them with the proposed fixes in another PR.
    static inline __m512i dot_bit_512(const __m512i a, const int8_t* b) {
        const __m512i q0 = _mm512_loadu_si512((const __m512i *)b);
        return _mm512_popcnt_epi64(_mm512_and_si512(q0, a));
I'm really surprised this is slower, as it should be the optimal instruction to use. However, of course this path is taken only if we can fill the whole STRIDE_BYTES_LEN (512 bits), which is not a given! Dimensions < 512 will not benefit from this at all, and probably this makes sense only for 1024 or more.
    int64_t subRet2 = _mm512_reduce_add_epi64(acc2);
    int64_t subRet3 = _mm512_reduce_add_epi64(acc3);

    for (; r < length; r++) {
A way of making this better / on par with AVX2 without removing it (as I think this should still be faster for higher dimensions) is to fall back to the AVX2 code path here -- or to _mm256_popcnt_epi64, _mm_popcnt_epi64, and __builtin_popcountll progressively.
I think it'll be best if we do AVX512 for the D[124]Q4 implementations all at once, so we can ensure it's all done together. We also need to work out how to do the fallback (whether to just copy-paste the AVX2 code, or something more sophisticated) at the same time.
ldematte
left a comment
Agree to do AVX512 for D[124]Q4 implementations all at once.
LGTM
…cations

* upstream/main: (60 commits)
  Use batches for other bulk vector benchmarks (elastic#143167)
  Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:lookup-join.MvJoinKeyOnTheLookupIndexAfterStats} elastic#143388
  Mute org.elasticsearch.snapshots.ConcurrentSnapshotsIT testBackToBackQueuedDeletes elastic#143387
  [Inference API] Parse endpoint metadata from persisted endpoints (elastic#143081)
  Add cluster formation doc to DistributedArchitectureGuide (elastic#143318)
  Fix flattened root block loader null expectation (elastic#143238)
  Unmute ValueSourceReaderTypeConversionTests testLoadAll (elastic#143189)
  ESQL: Add split coalescing for many small files (elastic#143335)
  Unmute mixed-cluster spatial parse warning test (elastic#143186)
  Fix zero-size estimate in BytesRefBlock null test (elastic#143258)
  Make DataType and DataFormat top-level enums (elastic#143312)
  Add support for steps to change the target index name for later steps (elastic#142955)
  Set mayContainDuplicates flag to test deduplication (elastic#143375)
  ESQL: Fix Driver search load millis as nanos bug (elastic#143267)
  Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:lookup-join.LookupJoinWithMixPushableAndUnpushableFilters} elastic#143378
  ESQL: Forbid MV_EXPAND before full text functions (elastic#143249)
  ESQL: Fix unresolved name pattern (elastic#143210)
  Implement boxplot queryDSL aggregation for exponential_histograms (elastic#143026)
  Add prefetching to x64 bulk vector implementations (elastic#142387)
  Make large segment vector tests resilient to memory constraints (elastic#143366)
  ...
Remove AVX512 BBQ implementations pending further work
Prefetch the next vectors whilst processing one set of vectors for sqri7u, bulki8, and 2, 4-bit BBQ.
Remove AVX512 BBQ implementations, only using AVX2, as AVX512 is often slower than just the AVX2 impls
This causes a small performance drop for small numbers of vectors (-5%), but is substantially faster for large numbers (+70%).
KnnIndexTester shows a 20% speed improvement for sqri7u searches, and is 10-20% faster for BBQ 2/4-bit.