
Conversation

@ikawrakow
Owner

I was actually trying to improve MMQ performance for quants with a block size of 16, but ended up with a small improvement of the MMQ kernel for blocks of 32. Just a 1-2% improvement, so nothing earth-shattering.

Here is a sweep-bench graph for LLaMA-3.1-8B on RTX-4080 for Q4_0 and IQ4_KS. The IQ4_KS improvement is slightly larger because, in addition to taking advantage of the slightly faster tile multiplication kernel, I added a tweak to the tile loading kernel.

[sweep-bench graph: LLaMA-3.1-8B on RTX-4080, Q4_0 and IQ4_KS]

@Nexesenex
Contributor

No problem on my side with Miqu Q5_K_M (full offload w/ MMQ on 3 GPUs) and Wizard 8x22b IQ3_S mix (same test) after adapting this PR to Croco.cpp (a fork of mainline).
Performance is similar, with maybe a 0.5-1% bonus (still within the run-to-run variation of my benchmark results, but upward rather than downward).

Can the iq4_ks part of this PR also be applied to the other quants' MMQ kernels that currently use
const int k0 = 8 * (threadIdx.x / 4) + threadIdx.x % 4;
such as iq4_xs and iq4_nl?
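
For reference, here is a tiny standalone snippet (not from the PR) that just prints what that index expression does for the 32 lanes of a warp: groups of four consecutive threads land on k0 = 0-3, 8-11, ..., 56-59, i.e. every other group of four indices, leaving a gap of four after each group (presumably so the same thread can also touch k0 + 4, as is typical for packed 4-bit layouts).

```cpp
// Standalone illustration only: print the k0 index each warp lane would compute
// with the expression quoted above.
#include <cstdio>

int main() {
    for (int t = 0; t < 32; ++t) {
        const int k0 = 8 * (t / 4) + t % 4;   // same arithmetic as the quoted kernel line
        std::printf("threadIdx.x = %2d -> k0 = %2d\n", t, k0);
    }
    return 0;
}
```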

@ikawrakow
Owner Author

Can the iq4_ks part of this PR also be applied to the other quants' MMQ kernels

Not sure, one needs to try.

Larger gains would come from rewriting the MMQ implementation so that the x-tiles are reused more times. Currently Q4_0 MMQ is almost 10% faster than IQ4_KS. This does not make any sense. Yes, unpacking IQ4_KS is more expensive than unpacking Q4_0, but one should be able to fully amortize the unpacking cost in large matrix multiplications. This is what happens on the CPU, where all quants using the same unpacked GEMM kernel have the same performance (to within 1-2%). I think the reason we see this on CUDA is that all optimizations there are made with Q4_0 as the main target. Since Q4_0 is very simple and costs next to nothing to unpack, the rest of the MMQ logic is tailored for very cheap unpacking, to the detriment of all other quantization types.
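
To put the amortization argument in code, here is a deliberately simplified sketch, not the actual MMQ kernel: the 4-bit unpack, tile size, and launch geometry are all made up. The only point it illustrates is that the x-tile is dequantized once into shared memory and then reused for every y column handled by the block, so the per-element unpack cost shrinks with the reuse count; when the reuse count is small, an expensive unpack (IQ4_KS) shows up in the timings relative to a cheap one (Q4_0).

```cuda
#include <stdint.h>
#include <cuda_runtime.h>

constexpr int TILE_K = 256;   // unpacked x elements kept in shared memory (made-up size)

// Stand-in 4-bit "unpack": two values per byte, no scales. It exists only so the
// sketch compiles; the real per-quant unpacking (cheap for Q4_0, costlier for
// IQ4_KS) is the part whose cost gets amortized.
__device__ float unpack_4bit(const uint8_t * packed, int64_t idx) {
    const uint8_t b = packed[idx / 2];
    const int     q = (idx & 1) ? (b >> 4) : (b & 0x0F);
    return (float)(q - 8);
}

// dst[row, col] = sum_k x[row, k] * y[col, k]; y stores each column contiguously.
__global__ void gemm_sketch(const uint8_t * x_packed, const float * y,
                            float * dst, int K, int N) {
    __shared__ float x_tile[TILE_K];

    const int row = blockIdx.y;                              // x row handled by this block
    const int col = blockIdx.x * blockDim.x + threadIdx.x;   // y column handled by this thread

    float sum = 0.0f;
    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        // Unpack the x-tile ONCE, cooperatively across the block.
        for (int k = threadIdx.x; k < TILE_K && k0 + k < K; k += blockDim.x) {
            x_tile[k] = unpack_4bit(x_packed, (int64_t)row * K + k0 + k);
        }
        __syncthreads();

        // Reuse the unpacked tile blockDim.x times: one dot-product strip per y column.
        if (col < N) {
            const int kmax = min(TILE_K, K - k0);
            for (int k = 0; k < kmax; ++k) {
                sum += x_tile[k] * y[(int64_t)col * K + k0 + k];
            }
        }
        __syncthreads();
    }
    if (col < N) {
        dst[(int64_t)row * N + col] = sum;
    }
}
```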

ikawrakow merged commit 46f2e5d into main on Jul 2, 2025.