It is really strange that there is no instruction to horizontally sum the elements of a SIMD vector in AVX/AVX2/AVX512, as this is needed all the time. In AVX512 there is `_mm512_reduce_add_ps(x)`, but this expands to multiple instructions. E.g., from GCC-12 `immintrin.h`:
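(A sketch modeled on GCC's `__MM512_REDUCE_OP` macro from `avx512fintrin.h`, rewritten with plain intrinsics rather than copied verbatim.)

```cpp
#include <immintrin.h>

// Sketch of what _mm512_reduce_add_ps(__A) expands to in GCC-12.
static inline float mm512_reduce_add_ps(__m512 __A) {
    __m256 __T1 = (__m256)_mm512_extractf64x4_pd((__m512d)__A, 1);     // upper 8 floats
    __m256 __T2 = (__m256)_mm512_extractf64x4_pd((__m512d)__A, 0);     // lower 8 floats
    __m256 __T3 = _mm256_add_ps(__T1, __T2);                           // 16 -> 8
    __m128 __T4 = _mm256_extractf128_ps(__T3, 1);
    __m128 __T5 = _mm256_castps256_ps128(__T3);
    __m128 __T6 = _mm_add_ps(__T4, __T5);                              // 8 -> 4
    __m128 __T7 = _mm_shuffle_ps(__T6, __T6, _MM_SHUFFLE(1, 0, 3, 2));
    __m128 __T8 = _mm_add_ps(__T6, __T7);                              // 4 -> 2
    __m128 __T9 = _mm_shuffle_ps(__T8, __T8, _MM_SHUFFLE(0, 1, 0, 1));
    return _mm_cvtss_f32(_mm_add_ps(__T8, __T9));                      // 2 -> 1
}
```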
On AVX2 I have been using a sequence along these lines:
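(A sketch of the usual extract-and-shuffle ladder, modeled on llama.cpp's `hsum_float_8` helper; the exact sequence may differ.)

```cpp
#include <immintrin.h>

// Horizontally sum the 8 floats of an AVX vector.
static inline float hsum_float_8(const __m256 x) {
    __m128 res = _mm256_extractf128_ps(x, 1);          // upper 4 floats
    res = _mm_add_ps(res, _mm256_castps256_ps128(x));  // + lower 4 floats
    res = _mm_add_ps(res, _mm_movehl_ps(res, res));    // 4 -> 2
    res = _mm_add_ss(res, _mm_movehdup_ps(res));       // 2 -> 1
    return _mm_cvtss_f32(res);
}
```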
i.e., 8 instructions to sum 8 float elements. I have been wondering to what extent this affects the performance of matrix-matrix and matrix-vector multiplications, since a horizontal sum is needed for every element of the resulting matrix/vector.
In `iqk_mul_mat` most matrix-matrix multiplications are done by simultaneously processing 8 columns of the right matrix, so we end up with 8 SIMD vectors containing the dot products of a row from the left matrix with the 8 columns. In this case it is possible to have a more efficient implementation where we end up with a single SIMD vector containing the horizontal sums of the 8 SIMD vectors, as in the sketch below. I count 29 instructions, so fewer than 4 instructions per horizontal sum.
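(A sketch of such an 8x8 horizontal sum; the in-tree implementation may differ in details.)

```cpp
#include <immintrin.h>

// Reduce 8 AVX vectors to one vector whose j-th element is the horizontal
// sum of accm[j]. Stage 1 folds the two 128-bit lanes of each input while
// pairing vectors i and i+4; stages 2 and 3 interleave-and-add.
static inline __m256 hsum_float_8x8(__m256 * accm) {
    for (int i = 0; i < 4; ++i) {
        accm[i] = _mm256_set_m128(
            _mm_add_ps(_mm256_castps256_ps128(accm[i+4]), _mm256_extractf128_ps(accm[i+4], 1)),
            _mm_add_ps(_mm256_castps256_ps128(accm[i+0]), _mm256_extractf128_ps(accm[i+0], 1)));
    }
    for (int i = 0; i < 2; ++i) {
        accm[i] = _mm256_add_ps(_mm256_unpacklo_ps(accm[i+0], accm[i+2]),
                                _mm256_unpackhi_ps(accm[i+0], accm[i+2]));
    }
    return _mm256_add_ps(_mm256_unpacklo_ps(accm[0], accm[1]),
                         _mm256_unpackhi_ps(accm[0], accm[1]));
}
```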
Plugging this into `iqk_mul_mat` results in 1-2% performance improvement for basically all quantized matrix-matrix multiplications (float multiplications are done with 5x5 tiles on Zen4, so the idea is not directly or easily transferable to them). Strangely enough, on a pure AVX2 system (Ryzen-5975WX) I observe 1-2% reduced performance, hence in this PR the 8x8 sum is only used on Zen4.

One can also apply the idea to matrix-vector multiplications by simply gathering 8 dot products and then using the 8x8 horizontal sum (see the sketch at the end). This is relevant for TG (token generation). TG is severely memory-bound on x86_64 systems, so there is no benefit when using the number of threads that results in peak performance, but with just 1 thread I observe up to 10% speedup on Zen4.
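To illustrate the matrix-vector case, here is a hypothetical, simplified 8-row kernel (not the actual `iqk_mul_mat` code; the function name and signature are made up for this example):

```cpp
#include <immintrin.h>

// Hypothetical sketch (requires AVX2 + FMA): y[0..7] = A[0..7][:] * x for a
// row-major matrix A with row length n (assumed a multiple of 8). Each row
// keeps its own SIMD accumulator; a single hsum_float_8x8() call (see above)
// then produces all 8 dot products at once instead of 8 separate horizontal sums.
static void gemv_8rows(const float * A, const float * x, float * y, int n) {
    __m256 acc[8];
    for (int r = 0; r < 8; ++r) acc[r] = _mm256_setzero_ps();
    for (int j = 0; j < n; j += 8) {
        __m256 vx = _mm256_loadu_ps(x + j);
        for (int r = 0; r < 8; ++r) {
            acc[r] = _mm256_fmadd_ps(_mm256_loadu_ps(A + r*n + j), vx, acc[r]);
        }
    }
    _mm256_storeu_ps(y, hsum_float_8x8(acc));  // 8 results in one store
}
```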