

@ikawrakow (Owner)

The CUDA code absolutely does not like MLA. On the main branch, MLA attention is 15-20% slower than the standard attention implementation. The issue is with the wk_b x q_nope and wv_b x qkv_compressed operations. For TG these require two tensor multiplications between tensors of shapes $(N_h \times N_t \times K)$ and $(N_h \times 1 \times K)$, where $N_h$ is the head size, $N_t$ is the number of tokens in the KV cache, and $K$ is the number of heads. They get computed as $K$ consecutive $(N_h \times N_t) \times (N_h \times 1)$ matrix-vector multiplications. To add insult to injury, for wk_b x q_nope, where q_nope is not contiguous, we get $K$ copies to contiguous memory (one for each q_nope row), followed by quantization of a single row (when wk_b is quantized), followed by the actual GEMV, i.e., $3K$ CUDA kernel launches. The associated overhead far exceeds the time needed for the actual matrix multiplications, so the computation becomes extremely slow compared to what it could be.
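To make the overhead concrete, here is a minimal fp32 sketch of that per-head launch pattern. The helper names and the tensor layout are illustrative assumptions, not the actual ggml CUDA entry points, and the extra per-row quantization launch of the quantized path is folded into the comment only.

```cpp
// Sketch only: assumes head h's q_nope row starts at q_nope + h with element
// stride q_stride, and uses made-up helpers instead of the real ggml kernels.
#include <cuda_runtime.h>

__global__ void gather_row(const float * src, float * dst, int n, int q_stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[(size_t)i * q_stride];     // strided q_nope row -> contiguous buffer
}

__global__ void gemv(const float * A, const float * x, float * y, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;     // one thread per output row (deliberately simple)
    if (r >= rows) return;
    float sum = 0.0f;
    for (int c = 0; c < cols; ++c) sum += A[(size_t)r * cols + c] * x[c];
    y[r] = sum;
}

// Host side: 2 launches per head here (3 with the quantization step), repeated
// for every head, token, and layer, so for TG the launch overhead dwarfs the
// tiny amount of arithmetic each GEMV actually does.
void mul_per_head(const float * wk_b, const float * q_nope, float * dst, float * tmp,
                  int n_head, int rows, int cols, int q_stride) {
    for (int h = 0; h < n_head; ++h) {
        gather_row<<<(cols + 255) / 256, 256>>>(q_nope + h, tmp, cols, q_stride);
        gemv<<<(rows + 255) / 256, 256>>>(wk_b + (size_t)h * rows * cols, tmp,
                                          dst + (size_t)h * rows, rows, cols);
    }
}
```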

This PR fixes the inefficiency by adding a special-purpose kernel that performs the $K$ GEMVs in one go. It is a bit of a hack and I should try to consolidate it with the regular ggml_cuda_op_mul_mat_vec_q implementation, but it will do for now. In addition, the PR adds a new quantize_tensor_q8_1_cuda method that operates on non-contiguous tensors with a single row. This allows the q_nope quantization for the wk_b x q_nope multiplication to be done with a single call.
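Below is a minimal sketch of the idea, under simplifying assumptions: plain fp32 instead of the quantized q8_1 path, and a naive one-thread-per-row dot product. The point is the launch structure: blockIdx.y selects the head, so all $K$ GEMVs run from a single kernel launch, and the non-contiguous activation rows are read through a stride instead of being copied first.

```cpp
// Simplified fp32 stand-in for the batched-over-heads GEMV; the kernel in this
// PR consumes quantized wk_b and q8_1-quantized activations, but the idea of
// batching all heads into one launch is the same.
#include <cuda_runtime.h>

__global__ void mul_mat_vec_all_heads(const float * A,   // [n_head][rows][cols], contiguous
                                      const float * x,   // head h's row starts at x + h, stride x_stride
                                      float       * y,   // [n_head][rows]
                                      int rows, int cols, int x_stride) {
    const int head = blockIdx.y;                          // one grid slice per head
    const int r    = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    const float * Ah = A + (size_t)head * rows * cols;
    const float * xh = x + head;                          // non-contiguous activation row
    float sum = 0.0f;
    for (int c = 0; c < cols; ++c) sum += Ah[(size_t)r * cols + c] * xh[(size_t)c * x_stride];
    y[(size_t)head * rows + r] = sum;
}

// Host side: one launch replaces the 2-3 * n_head launches of the per-head loop.
void mul_all_heads(const float * A, const float * x, float * y,
                   int n_head, int rows, int cols, int x_stride) {
    dim3 grid((rows + 255) / 256, n_head);
    mul_mat_vec_all_heads<<<grid, 256>>>(A, x, y, rows, cols, x_stride);
}
```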

These two changes result in a significant speedup of the MLA attention computation on CUDA. For IQ4_NL-quantized DeepSeek-Lite with all layers processed on the GPU we get a TG-128 increase of 31%. For hybrid inference, where the experts are computed on the CPU, we get a 15% speedup. MLA is now (nearly) on par with standard attention for short contexts and outperforms it with increasing context length. Here is a table comparing standard attention to MLA attention with this PR for hybrid CPU/GPU inference as a function of context length. The CPU is a Ryzen-7950X and the GPU an RTX-4080.

| model | test | t/s (std) | t/s (MLA, this PR) | Speedup |
| --- | --- | ---: | ---: | ---: |
| deepseek2 16B IQ4_NL | tg64@pp128 | 52.99 ± 0.03 | 52.43 ± 0.04 | 0.989 |
| deepseek2 16B IQ4_NL | tg64@pp256 | 52.77 ± 0.09 | 52.26 ± 0.07 | 0.990 |
| deepseek2 16B IQ4_NL | tg64@pp512 | 51.58 ± 1.19 | 51.93 ± 0.10 | 1.007 |
| deepseek2 16B IQ4_NL | tg64@pp1024 | 50.75 ± 0.56 | 51.73 ± 0.07 | 1.019 |
| deepseek2 16B IQ4_NL | tg64@pp2048 | 49.96 ± 0.28 | 51.29 ± 0.05 | 1.027 |
| deepseek2 16B IQ4_NL | tg64@pp4096 | 47.94 ± 0.58 | 50.23 ± 0.05 | 1.048 |
| deepseek2 16B IQ4_NL | tg64@pp8192 | 43.77 ± 0.34 | 48.04 ± 0.04 | 1.098 |
| deepseek2 16B IQ4_NL | tg64@pp16384 | 37.76 ± 0.15 | 44.62 ± 0.17 | 1.182 |

Iwan Kawrakow added 4 commits February 26, 2025 08:56
The low MLA performance on CUDA is due to
the wk_b * q_nope operation.

It turns into n_head matrix multiplications with
n_head separate quantization and GEMV steps.
The associated overhead is just too much for TG
where each GEMV is very fast (512 x 128 = 131 KFLOP
for DeepSeek-Lite, 4X that for DeepSeekV3/R1).
The way it was done, there was also a copy of each q_nope
row before quantization, which I have now eliminated.
This results in a ~2.5% speedup.
What needs to happen instead is to launch a single
computation that quantizes all heads, and then have
a kernel that does the GEMV for all heads instead of
n_head sequential GEMVs.
It is a total hack, but it works.
Remove duplicated gemv's.
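As a rough illustration of the "quantize all heads in one launch" step mentioned in the commits above, here is a simplified sketch. It uses a plain int8-plus-per-block-scale format for clarity, not ggml's actual q8_1 block layout, which is what quantize_tensor_q8_1_cuda really produces; the single-launch, strided-input structure is the point.

```cpp
// Sketch only: assumes head h's activation row starts at x + h with element
// stride x_stride, cols is a multiple of 32, and a simplified quantization format.
#include <cuda_runtime.h>
#include <stdint.h>

constexpr int QBLOCK = 32;                              // values per quantization block

__global__ void quantize_rows_strided(const float * x, int8_t * q, float * scales,
                                      int cols, int x_stride) {
    const int head  = blockIdx.y;                       // one grid slice per head
    const int block = blockIdx.x;                       // one thread block per 32-value group
    const float * xh = x + head;
    const int c0 = block * QBLOCK;

    // find max |x| in this group (serial here for clarity; a real kernel reduces in parallel)
    float amax = 0.0f;
    for (int i = 0; i < QBLOCK; ++i) amax = fmaxf(amax, fabsf(xh[(size_t)(c0 + i) * x_stride]));
    const float d  = amax / 127.0f;
    const float id = d > 0.0f ? 1.0f / d : 0.0f;

    const int i = threadIdx.x;                          // 32 threads quantize 32 values
    q[(size_t)head * cols + c0 + i] = (int8_t)roundf(xh[(size_t)(c0 + i) * x_stride] * id);
    if (i == 0) scales[(size_t)head * (cols / QBLOCK) + block] = d;
}

// Host side: a single launch quantizes every head's non-contiguous row,
// replacing the per-head copy + per-row quantization launches.
void quantize_all_heads(const float * x, int8_t * q, float * scales,
                        int n_head, int cols, int x_stride) {
    dim3 grid(cols / QBLOCK, n_head);
    quantize_rows_strided<<<grid, QBLOCK>>>(x, q, scales, cols, x_stride);
}
```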
@ikawrakow ikawrakow merged commit 51029ed into main Feb 27, 2025
@davidsyoung

@ikawrakow Seeing a significant speed increase from this, also with a transposed KV cache: from 12 t/s to 17.25 t/s, and less of a drop-off in speed at longer PP tokens. Full CUDA, 15x3090, Q2_K, MLA.

Really nice!
