Minor CUDA PP speed improvement #567

ikawrakow · 2025-07-01T14:45:37Z

I was actually trying to improve MMQ performance for quants with a block-size of 16, but ended up with a small improvement of the MMQ kernel for blocks of 32. Just 1-2% kind of improvement, so nothing earth shattering.

Here a sweep-bench graph for LlaMA-3.1-8B on RTX-4080 for Q4_0 and IQ4_KS. The IQ4_KS improvement is slightly larger because I added a tweak to the tile loading kernel in addition of taking advantage of the slightly faster tile multiplication kernel.

Nexesenex · 2025-07-02T03:05:58Z

No problem on my side on Miqu Q5_K_M (full offload w/MMQ on 3 GPUs) and Wizard 8x22b IQ3_S mix (same test) after adapting this PR to Croco.cpp (mainline's fork).
Perfs are similar, with maybe a 0.5-1% bonus (still in the margin of variation of my bench results, but not downward, upward).

Can the iq4_ks versant of that PR be valid on the other quants' MMQ kernels using currently
const int k0 = 8 * (threadIdx.x / 4) + threadIdx.x % 4;
such as iq4_xs and iq4_nl?

ikawrakow · 2025-07-02T07:11:23Z

Can the iq4_ks versant of that PR be valid on the other quants' MMQ kernels

Not sure, one needs to try.

Larger gains would come from rewriting the MMQ implementation to have the x-tiles be reused more times. Currently Q4_0 MMQ is almost 10% faster than IQ4_KS. This does not make any sense. Yes, unpacking IQ4_KS is more expensive than unpacking Q4_0, but one should be able to fully amortize the unpacking cost in large matrix multiplications. This is what happens on the CPU, where all quants using the same unpacked GEMM kernel have the same performance (to within 1-2%). I think the reason we see this on CUDA is that there all optimizations are made with Q4_0 as the main optimization target. As Q4_0 is very simple, and it costs next to nothing to unpack, the remaining MMQ logic is tailored for very cheap unpacking, to the detriment of all other quantization types.

Iwan Kawrakow added 2 commits June 27, 2025 18:45

Slightly better q8_0_q8_1 kerneel and iqk_ks tile loading

1e6e483

Minor

036b5bb

ikawrakow merged commit 46f2e5d into main Jul 2, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Minor CUDA PP speed improvement #567

Minor CUDA PP speed improvement #567

Uh oh!

ikawrakow commented Jul 1, 2025

Uh oh!

Nexesenex commented Jul 2, 2025

Uh oh!

ikawrakow commented Jul 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Minor CUDA PP speed improvement #567

Minor CUDA PP speed improvement #567

Uh oh!

Conversation

ikawrakow commented Jul 1, 2025

Uh oh!

Nexesenex commented Jul 2, 2025

Uh oh!

ikawrakow commented Jul 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants