
Conversation

@ikawrakow
Owner

This PR is a follow-up to #417 and (almost) completes the quantized matrix multiplication (a.k.a. MMQ) implementation for the IQX_K quants. The only one missing is IQ4_KSS, but I don't think I'll do that one, as its packing is much too complicated.

The performance gains are larger for IQ2_KS (~35%) than for IQ2_K and IQ3_K (~10%). This is because IQ2_KS uses blocks of 32 and can therefore use the more efficient GEMM kernel (see the discussion in #417).

The graph illustrates the performance improvements for the same setup as in #417.


Looking at this graph and at the graph in #417, I almost feel like adding IQ3_KS and IQ5_KS as 3- and 5-bit quants with blocks of 32.

@ubergarm
Contributor

Wow, IQ2_KS improved by around 35%!? The block-of-32 _KS variants get a nice speedup.

I'd probably try out the larger IQ3_KS, and especially IQ5_KS, in some future mixes if you decide to add them.

@ikawrakow ikawrakow merged commit 14ed9fb into main May 15, 2025