The CUDA code absolutely does not like MLA. On the main branch MLA attention is in the range of 15-20% slower than the standard attention implementation. The issue is with the `wk_b x q_nope` and `wv_b x qkv_compressed` operations. For TG they require two tensor multiplications of shapes $(N_h \times N_t \times K)$ and $(N_h \times 1 \times K)$, where $N_h$ is the head size, $N_t$ is the number of tokens in the KV cache, and $K$ is the number of heads. These get computed as $K$ consecutive $(N_h \times N_t) \times (N_h \times 1)$ matrix-vector multiplications. To add insult to injury, for `wk_b x q_nope`, where `q_nope` is not contiguous, we get $K$ copies (one for each `q_nope` row) to contiguous memory, followed by quantization of a single row (when `wk_b` is quantized), followed by the actual GEMV, i.e., $3K$ CUDA kernel launches. The associated overhead by far exceeds the time needed for the actual matrix multiplications, so the computation becomes extremely slow compared to what it could be.
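To make the launch-overhead argument concrete, here is a minimal sketch of the per-head dispatch pattern described above. It is not the actual ggml CUDA code: it uses plain `float` data, a toy GEMV kernel, and invented names (`gemv`, `mla_gemv_per_head`), and it omits the per-head copy and single-row quantization launches that turn the $K$ launches into $3K$. The point is only that the host loops over the heads and pays launch overhead at every step.

```cuda
#include <cuda_runtime.h>

// Toy GEMV: y[row] = sum_c A[row*n_cols + c] * x[c], one thread per output row.
// The real ggml kernels operate on quantized blocks; this float version only
// illustrates the dispatch pattern, not the actual math.
__global__ void gemv(const float * A, const float * x, float * y, int n_rows, int n_cols) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float sum = 0.0f;
    for (int c = 0; c < n_cols; ++c) sum += A[(size_t)row * n_cols + c] * x[c];
    y[row] = sum;
}

// Per-head dispatch as on the main branch (simplified): K separate launches,
// each computing one (N_h x N_t) x (N_h x 1) product. On main there are, in
// addition, a copy and a single-row quantization launch per head, for 3*K
// kernel launches per token in total.
static void mla_gemv_per_head(const float * wk_b, const float * q_nope, float * dst,
                              int N_h, int N_t, int K, cudaStream_t stream) {
    const int threads = 256;
    const int blocks  = (N_t + threads - 1) / threads;
    for (int head = 0; head < K; ++head) {
        gemv<<<blocks, threads, 0, stream>>>(wk_b   + (size_t)head * N_t * N_h,
                                             q_nope + (size_t)head * N_h,
                                             dst    + (size_t)head * N_t,
                                             N_t, N_h);
    }
}
```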
This PR fixes the inefficiency by adding a special-purpose kernel that performs the $K$ GEMVs in one go. It is a bit of a hack and I should try to consolidate it with the regular `ggml_cuda_op_mul_mat_vec_q` implementation, but it should do for now. In addition, the PR adds a new `quantize_tensor_q8_1_cuda` method that operates on non-contiguous tensors that have a single row. This allows the `q_nope` quantization for the `wk_b x q_nope` multiplication to be done with a single call.
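For illustration, here is a minimal sketch of the batched-GEMV idea, again with plain `float` data and invented names (`gemv_batched`, `mla_gemv_fused`) rather than the kernel actually added by the PR, which works on quantized `wk_b` and `Q8_1`-quantized `q_nope`. The head index is moved into the launch grid, so all $K$ matrix-vector products run from a single kernel launch, and the non-contiguous `q_nope` rows are read through an explicit stride instead of being copied first.

```cuda
#include <cuda_runtime.h>

// Illustrative fused kernel: gridDim.y enumerates the K heads, so all K GEMVs
// of shape (N_h x N_t) x (N_h x 1) run in a single launch. q_nope may be
// non-contiguous; it is read through an explicit row stride instead of being
// copied to contiguous memory first.
__global__ void gemv_batched(const float * wk_b, const float * q_nope, float * dst,
                             int N_h, int N_t, size_t q_stride) {
    const int head = blockIdx.y;
    const int row  = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N_t) return;
    const float * A = wk_b   + (size_t)head * N_t * N_h;  // weights of this head
    const float * x = q_nope + (size_t)head * q_stride;   // strided q_nope row
    float sum = 0.0f;
    for (int c = 0; c < N_h; ++c) sum += A[(size_t)row * N_h + c] * x[c];
    dst[(size_t)head * N_t + row] = sum;
}

// One launch replaces the K-iteration host loop sketched earlier.
static void mla_gemv_fused(const float * wk_b, const float * q_nope, float * dst,
                           int N_h, int N_t, int K, size_t q_stride, cudaStream_t stream) {
    const int threads = 256;
    const dim3 grid((N_t + threads - 1) / threads, K);
    gemv_batched<<<grid, threads, 0, stream>>>(wk_b, q_nope, dst, N_h, N_t, q_stride);
}
```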
These two changes result in a significant speedup of the MLA attention computation on CUDA. For `IQ4_NL`-quantized DeepSeek-Lite with all layers processed on the GPU we get a TG-128 increase of 31%. For the hybrid calculations where the experts are computed on the CPU we get a 15% speedup. MLA is now (nearly) on par with standard attention for short contexts and outperforms it with increasing context length. Here is a table comparing standard attention to MLA attention with this PR for hybrid CPU/GPU inference as a function of context length. The CPU is a Ryzen-7950X and the GPU is an RTX-4080.