Conversation
|
(funny sidenote before any conspiracy theories abound: this quant was originally labeled as IQ3_KL back when the idea was a bit different, but Some Nice People pointed out to me that Somewhere Out There there's already a quantization scheme with that name, so I renamed it to avoid any misconceptions)
|
How are you determining the tables per tensor? Is it a greedy algorithm? Also did you do any tests against pre-existing data types to determine whether per-tensor tables are actually better than the static ones we are currently using? |
I'd be lying if I said I understand the algorithm perfectly, since I was basically doing a prototyping workshop session with various assistants, and my math background for it is, well, suboptimal :) The basis for it is a weighted Lloyd-Max (minimized mean-square error). We normalize values within subblocks, throw them into 8192 bins (this gives virtually perfect quality while still being a noticeable speedup on the biggest tensors; it could probably be decreased to something like 2048, but it's still very fast, so I didn't bother) and we do 300 iterations, stopping early on convergence. I tried to do a comparison, but because the block shape is different from existing quantization schemes, you can't really do a direct one. Note that I'm by no means close to an expert on this, so I'm just throwing this out to potentially start a discussion on new optimized quant types.
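For the curious, here is a rough sketch of what such a histogram-based weighted Lloyd-Max fit can look like (my reconstruction for illustration, not the PR's actual code; the function name, the weighting scheme, and the convergence test are all assumptions):

```python
import numpy as np

def lloyd_max_levels(values, weights, n_levels=8, n_bins=8192, max_iter=300):
    # Histogram the weighted values once, so each iteration costs O(n_bins)
    # instead of O(n_values) -- this is where the speedup on big tensors comes from.
    lo, hi = float(values.min()), float(values.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w_hist, _ = np.histogram(values, bins=edges, weights=weights)

    levels = np.linspace(lo, hi, n_levels)  # start from a uniform grid
    for _ in range(max_iter):
        # Lloyd step 1: assign each bin center to its nearest level.
        assign = np.abs(centers[:, None] - levels[None, :]).argmin(axis=1)
        new_levels = levels.copy()
        for k in range(n_levels):
            mask = assign == k
            wk = w_hist[mask].sum()
            if wk > 0:
                # Lloyd step 2: move the level to the weighted centroid of its
                # bins, which minimizes the weighted mean-square error.
                new_levels[k] = (w_hist[mask] * centers[mask]).sum() / wk
        if np.allclose(new_levels, levels):
            break  # early stop on convergence
        levels = new_levels
    return np.sort(levels)
```

Here `weights` would come from the imatrix importance values, and in the real quantizer the values would first be normalized per sub-block.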
|
However, if you're interested, I could try doing specifically one of the existing quants with a per-tensor scale construction to see what the relative KLD would be. |
Yes, I would like to know whether that would be beneficial. |
| { "Q2_K",    LLAMA_FTYPE_MOSTLY_Q2_K,    " 2.96G, +3.5199 ppl @ Llama-3-8B", },
| { "Q2_K_S",  LLAMA_FTYPE_MOSTLY_Q2_K_S,  " 2.96G, +3.1836 ppl @ Llama-3-8B", },
| { "IQ3_XXS", LLAMA_FTYPE_MOSTLY_IQ3_XXS, " 3.06 bpw quantization", },
| { "Q3_PT",   LLAMA_FTYPE_MOSTLY_Q3_PT,   " 3.25 bpw quantization", },
How can it be 3.25 bpw if the type is 3.875 bpw?
There are a ton of artifacts here because the first experiment was a completely different quant that turned out to not work :)
| } else if (strcmp(argv[arg_idx], "--keep-split") == 0) {
|     params.keep_split = true;
| } else if (strcmp(argv[arg_idx], "--keep-split") == 0) {
|     params.keep_split = true;
This duplicates an existing argument. Likely a merge artifact? Or was this meant to handle --threads?
| printf(" --threads n\n");
| printf(" number of threads to use for cross-tensor parallelization (default: 0, use same as within-tensor)\n");
| printf(" when n > 0, enables parallel quantization of multiple tensors simultaneously\n");
This argument is not handled. I assume this isn't intended?
Yeah, I abandoned this, it's also out-of-scope for this PR.
| // Determine whether this tensor will be Q3_PT (mirror the pass-2 logic)
| bool quantize = tname.rfind("weight") == tname.size() - 6;
| quantize &= (ggml_n_dims(tensor) >= 2);
| quantize &= tname.find("_norm.weight") == std::string::npos;
| quantize &= tname.find("ffn_gate_inp.weight") == std::string::npos;
| if (!quantize) { continue; }
This doesn't fully mirror the logic for excluding things to quantize, meaning it will break for recurrent models (like Mamba, RWKV, and others) where some tensors are 2D+, but shouldn't be quantized (so this will produce extra metadata and unnecessary calculations for tensors which aren't quantized to Q3_PT).
Ideally that logic should be in a single place to make it easier to modify, but I think that will be handled by #19770 with its tensor_allows_quantization function.
| case LLAMA_FTYPE_MOSTLY_IQ2_S:  return "IQ2_S - 2.5 bpw";
| case LLAMA_FTYPE_MOSTLY_IQ2_M:  return "IQ2_M - 2.7 bpw";
| case LLAMA_FTYPE_MOSTLY_IQ3_XS: return "IQ3_XS - 3.3 bpw";
| case LLAMA_FTYPE_MOSTLY_Q3_PT:  return "Q3_PT - 3.25 bpw";
The type is 3.875 bpw, no?
I think a less confusing explanation would be that the mappings of representable values of current quants are static, while you're proposing a non-linear quant with dynamic levels/steps depending on the tensor (at least from what I understand).
Yeah, the original version used the name "codebooks", which was maybe better (but also not fully correct).
|
@JohannesGaessler aight, here you go :) pure 3.74 bpw quant comparison:
I've committed the reference code for Q3_KPT as well.
| GGML_TABLE_BEGIN(uint8_t, kmask_iq2xs, 8)
|     1, 2, 4, 8, 16, 32, 64, 128
| GGML_TABLE_END()
|
| GGML_TABLE_BEGIN(uint8_t, ksigns_iq2xs, 128)
|       0, 129, 130,   3, 132,   5,   6, 135, 136,   9,  10, 139,  12, 141, 142,  15,
|     144,  17,  18, 147,  20, 149, 150,  23,  24, 153, 154,  27, 156,  29,  30, 159,
|     160,  33,  34, 163,  36, 165, 166,  39,  40, 169, 170,  43, 172,  45,  46, 175,
|      48, 177, 178,  51, 180,  53,  54, 183, 184,  57,  58, 187,  60, 189, 190,  63,
|     192,  65,  66, 195,  68, 197, 198,  71,  72, 201, 202,  75, 204,  77,  78, 207,
|      80, 209, 210,  83, 212,  85,  86, 215, 216,  89,  90, 219,  92, 221, 222,  95,
|      96, 225, 226,  99, 228, 101, 102, 231, 232, 105, 106, 235, 108, 237, 238, 111,
|     240, 113, 114, 243, 116, 245, 246, 119, 120, 249, 250, 123, 252, 125, 126, 255,
| GGML_TABLE_END()
I think we misunderstood each other. I understood your idea was to replace these tables with per-tensor tables.
Ah, now I get it 😀 aight, will try that too.
(regarding this particular table)
The ksigns don't actually require a table; the most significant bit is derived from how many other bits are set; there can only be an even number of negative numbers per block in those types. That particular table is likely for speed optimization, but isn't technically necessary.
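The parity trick described above can be checked directly against the quoted table: each entry is just the 7-bit index with bit 7 set to the parity of the lower bits (a quick sanity sketch by the editor, not llama.cpp code):

```python
def ksign(i: int) -> int:
    # bit 7 = parity of the 7 low sign bits, so each table entry is recomputable
    parity = bin(i).count("1") & 1
    return i | (parity << 7)

# First 16 entries of ksigns_iq2xs from the quoted table:
first_row = [0, 129, 130, 3, 132, 5, 6, 135, 136, 9, 10, 139, 12, 141, 142, 15]
assert [ksign(i) for i in range(16)] == first_row
```

So the table is indeed only a speed optimization; the sign bit could be derived on the fly.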
llama.cpp/ggml/src/ggml-quants.c, lines 3077 to 3096 (at ecbcb7e)
From what I understand, the per-tensor tables would replace tables like kvalues_iq4nl.
Well, the more pressing matter is that I don't understand what your "per-tensor" format is actually doing from your textual description.
All right, I'll try to explain based on the difference between Q3_K and Q3_KPT.
In Q3_K, you have (in dequantize_row_q3_K):
*y++ = dl * ((int8_t)((q[l+ 0] >> shift) & 3) - ((hm[l+ 0] & m) ? 0 : 4));
Here, q is the low bits part of the weight (2 bits) and hm is the high bits part (1 bit); this gets multiplied by the scale (which is the superblock scale times the respective small block scale).
All of this uses values from a static distribution of [-4, -3, -2, -1, 0, 1, 2, 3]. Those are the only values that can result from the internal computation ((int8_t)((q[l+ 0] >> shift) & 3) - ((hm[l+ 0] & m) ? 0 : 4)). Thus, the distribution of values inside the tensor is constrained by the interaction of the superblock scale and subblock scales. Once those are fixed, the specific quants can only pick an integer multiplier from -4 to 3.
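As a tiny illustration of the static grid (hypothetical helper name, not the llama.cpp code), the low/high bit combination can only ever decode to one of eight integers:

```python
def q3k_decode(low2: int, high_bit_set: bool) -> int:
    # mirrors ((q >> shift) & 3) - (hm & m ? 0 : 4): the high bit decides
    # whether 4 is subtracted from the 2-bit value
    return (low2 & 3) - (0 if high_bit_set else 4)

values = sorted(q3k_decode(lo, hi) for hi in (False, True) for lo in range(4))
assert values == [-4, -3, -2, -1, 0, 1, 2, 3]
```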
Q3_KPT uses this:
int k_idx = ((q[l + 0] >> shift) & 3) + ((hm[l + 0] & m) ? 4 : 0);
y[l + 0] = dl1 * (levels[k_idx] * 7.0f - 4.0f);
where levels is a float array. So, the distribution of values before scaling is also in the range [-4, 3]; however, the exact distribution is calculated per tensor to account for that tensor's specific distribution of values. Therefore, instead of [-4, -3, -2, -1, 0, 1, 2, 3], you might have for example [-3, -1, -0.6, 0, 0.3, 0.6, 1, 3] to account for a symmetric distribution that's more likely to have extreme values and values close to zero, or even [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5] for a tensor that has only zero-or-positive values, since it would be a waste of precision to represent negative values at all there.
Does this make any sense? :) (note: this is also a self-study exercise for me, I strongly believe in learning-by-doing, so please excuse me if I'm saying something wrong)
EDIT: @compilade rightly mentioned that the values are from -4 to 3, I got confused by some "super-symmetric" quant schemes :)
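To make the contrast concrete, here is a sketch of the quoted Q3_KPT decode in Python (an editorial illustration with hypothetical helper names; `levels` stands in for the learned per-tensor array of 2^bpw floats in [0, 1]). With the identity levels k/7 it collapses back to the static Q3_K grid:

```python
def q3kpt_decode(low2: int, high_bit_set: bool, levels, scale: float = 1.0):
    # mirrors: k_idx = (q & 3) + (hm & m ? 4 : 0); y = dl1 * (levels[k_idx]*7 - 4)
    k_idx = (low2 & 3) + (4 if high_bit_set else 0)
    return scale * (levels[k_idx] * 7.0 - 4.0)

identity = [k / 7.0 for k in range(8)]  # reproduces the static [-4, 3] grid
grid = [q3kpt_decode(lo, hi, identity) for hi in (False, True) for lo in range(4)]
assert [round(v) for v in grid] == [-4, -3, -2, -1, 0, 1, 2, 3]

# A learned, tensor-specific set instead bends the grid toward where the
# tensor's mass actually is (example target values from the comment above):
learned = [(v + 4.0) / 7.0 for v in (-3, -1, -0.6, 0, 0.3, 0.6, 1, 3)]
```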
Of course, as @compilade suggested, using float arrays is an efficiency problem, but you can also get better resolution by using int8 values with a divisor instead, since picking 2^bpw values out of 256 still gives better resolution than a fixed preselected set of 2^bpw values.
To be clear I suggested (elsewhere) to use int8 values for the q3pt_values, to allow using int8 dot products within the sub-blocks, like the other non-linear quants already do.
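A sketch of that int8-levels idea (function and variable names are illustrative, not from the PR): store the learned levels as int8 codes plus one shared divisor, so the per-weight lookups stay integer and only the final rescale is float:

```python
def quantize_levels_int8(levels_float):
    # pick each level from the 256 representable int8 steps; one shared divisor
    amax = max(abs(v) for v in levels_float)
    divisor = amax / 127.0 if amax > 0 else 1.0
    codes = [round(v / divisor) for v in levels_float]
    return codes, divisor

codes, div = quantize_levels_int8([-3.0, -1.0, -0.6, 0.0, 0.3, 0.6, 1.0, 3.0])
assert codes[0] == -127 and codes[-1] == 127   # full int8 range used
# round-trip error per level is bounded by half a step
assert max(abs(c * div - v) for c, v in
           zip(codes, [-3.0, -1.0, -0.6, 0.0, 0.3, 0.6, 1.0, 3.0])) < div
```

This keeps the sub-block dot products in int8, like the other non-linear quants.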
Shouldn't Q3_K and Q3_KPT then have the same size though? Or did you use a different format as the base?
Q3_K and Q3_KPT do have exactly the same size. The original quant in this thread (Q3_PT) was the result of my attempt to make a completely new quant at around 3.25 bpw, which went haywire and produced something different than planned; but then, since you asked how this would work with an existing quant, I made a "PT" version of Q3_K (it even has the same block structure; the only difference is how the dequant algorithm uses the per-tensor scales).
As in:
-a---- 27.02.2026 19:16 1887009952 Qwen_Qwen3-4B-Instruct-2507-Q3_KPT_pure.gguf
-a---- 27.02.2026 18:03 1886997184 Qwen_Qwen3-4B-Instruct-2507-Q3_K_pure.gguf
The difference in size is exactly the per-tensor distributions, which are stored in an array in the GGUF header.
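The quoted file sizes are consistent with that: a quick back-of-envelope check (editor's arithmetic, assuming 8 float32 levels per quantized tensor plus GGUF key/array overhead):

```python
size_q3kpt = 1887009952  # Qwen_Qwen3-4B-Instruct-2507-Q3_KPT_pure.gguf
size_q3k   = 1886997184  # Qwen_Qwen3-4B-Instruct-2507-Q3_K_pure.gguf
diff = size_q3kpt - size_q3k
assert diff == 12768            # ~12.5 KiB of extra header metadata in total
assert diff / size_q3k < 1e-5   # negligible next to a ~1.9 GB file

# With 8 float32 levels per tensor (32 bytes of payload each), 12768 bytes is
# on the order of a few hundred per-tensor entries once key names and GGUF
# array headers are counted in.
```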
|
Just trying to run this against an existing Q8_0 quant (which I understand would yield particularly lossy results) and seeing a segmentation fault. Any idea why that's happening? (collapsed attachments: command line, tail of output)
|
@narinishi will check (but keep in mind this is a very early draft)
|
@narinishi sorry, couldn't reproduce - you should've gotten a warning that you need an imatrix for this quantization; after making an imatrix, the quants worked fine for me.
|
@pwilkin since you're storing an additional per-tensor table in metadata, I wonder if you could make it a bit more flexible by allowing multiple tables per tensor. It could simply be a list of tables along with a stride. That way, separate tables for merged qkv or experts would be possible. Or even, say, a table per N rows…
Yes, that's the next step - but before that, I want to take a step back and see if porting reasonably fast kernels for those tensors is actually possible. No use having a quantization scheme that improves KLD/PPL if it's super slow.
I tried using Bartowski's Anything else I might be missing? @pwilkin
|
@narinishi I'll try using that exact quant and see if I can reproduce.
|
Now in the quant laboratory: Q4_DPT, which is just a fancy name for IQ4_NL with learned kvalues. I hacked together an extremely ugly CUDA kernel to compare the performance effect:

(venv) ilintar@LinuksowaJaskinia:/media/ilintar/D_SSD/models$ llama-bench -m Qwen_Qwen3-4B-Instruct-2507-IQ4_NL-pure.gguf
load_backend: loaded BLAS backend from /devel/tools/llama.cpp/build/bin/libggml-blas.so
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /devel/tools/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /devel/tools/llama.cpp/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 4B IQ4_NL - 4.5 bpw | 2.20 GiB | 4.02 B | BLAS,CUDA | 8 | pp512 | 5394.70 ± 232.97 |
| qwen3 4B IQ4_NL - 4.5 bpw | 2.20 GiB | 4.02 B | BLAS,CUDA | 8 | tg128 | 121.56 ± 0.42 |
build: e395d080c (8173)
(venv) ilintar@LinuksowaJaskinia:/media/ilintar/D_SSD/models$ llama-bench -m Qwen_Qwen3-4B-Instruct-2507-Q4_DPT-pure.gguf
load_backend: loaded BLAS backend from /devel/tools/llama.cpp/build/bin/libggml-blas.so
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /devel/tools/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /devel/tools/llama.cpp/build/bin/libggml-cpu-haswell.so
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 4B Q4_DPT - IQ4_NL with learned levels | 2.20 GiB | 4.02 B | BLAS,CUDA | 8 | pp512 | 5325.80 ± 271.60 |
| qwen3 4B Q4_DPT - IQ4_NL with learned levels | 2.20 GiB | 4.02 B | BLAS,CUDA | 8 | tg128 | 108.28 ± 0.21 |
build: e395d080c (8173)

So, not great not terrible I guess. The KLD data for the quants:

IQ4_NL:
====== Perplexity statistics ======
Mean PPL(Q) : 8.123314 ± 0.155079
Mean PPL(base) : 7.910885 ± 0.149572
Cor(ln(PPL(Q)), ln(PPL(base))): 99.22%
Mean ln(PPL(Q)/PPL(base)) : 0.026499 ± 0.002381
Mean PPL(Q)/PPL(base) : 1.026853 ± 0.002445
Mean PPL(Q)-PPL(base) : 0.212429 ± 0.019811
====== KL divergence statistics ======
Mean KLD: 0.033115 ± 0.000551
Maximum KLD: 4.302760
99.9% KLD: 1.215412
99.0% KLD: 0.304676
95.0% KLD: 0.121186
90.0% KLD: 0.077289
Median KLD: 0.011778
10.0% KLD: 0.000026
5.0% KLD: 0.000003
1.0% KLD: -0.000000
0.1% KLD: -0.000004
Minimum KLD: -0.000004
====== Token probability statistics ======
Mean Δp: -0.522 ± 0.036 %
Maximum Δp: 86.560%
99.9% Δp: 36.079%
99.0% Δp: 15.386%
95.0% Δp: 6.054%
90.0% Δp: 2.912%
75.0% Δp: 0.234%
Median Δp: -0.001%
25.0% Δp: -0.724%
10.0% Δp: -4.718%
5.0% Δp: -9.085%
1.0% Δp: -20.438%
0.1% Δp: -47.195%
Minimum Δp: -86.191%
RMS Δp : 5.767 ± 0.103 %
Same top p: 91.992 ± 0.170 %
Q4_DPT:
====== Perplexity statistics ======
Mean PPL(Q) : 8.169248 ± 0.156889
Mean PPL(base) : 7.910885 ± 0.149572
Cor(ln(PPL(Q)), ln(PPL(base))): 99.28%
Mean ln(PPL(Q)/PPL(base)) : 0.032137 ± 0.002301
Mean PPL(Q)/PPL(base) : 1.032659 ± 0.002376
Mean PPL(Q)-PPL(base) : 0.258363 ± 0.019745
====== KL divergence statistics ======
Mean KLD: 0.030403 ± 0.000630
Maximum KLD: 8.150362
99.9% KLD: 1.044685
99.0% KLD: 0.282885
95.0% KLD: 0.108604
90.0% KLD: 0.068070
Median KLD: 0.010477
10.0% KLD: 0.000022
5.0% KLD: 0.000003
1.0% KLD: -0.000000
0.1% KLD: -0.000003
Minimum KLD: -0.000004
====== Token probability statistics ======
Mean Δp: -0.354 ± 0.035 %
Maximum Δp: 96.635%
99.9% Δp: 33.628%
99.0% Δp: 14.918%
95.0% Δp: 6.130%
90.0% Δp: 3.110%
75.0% Δp: 0.296%
Median Δp: -0.000%
25.0% Δp: -0.546%
10.0% Δp: -4.105%
5.0% Δp: -7.891%
1.0% Δp: -20.029%
0.1% Δp: -48.167%
Minimum Δp: -91.041%
RMS Δp : 5.597 ± 0.110 %
Same top p: 92.459 ± 0.165 %
|
@JohannesGaessler If you have some free time, I'd appreciate you taking a look at whether there's a civilized way to pass the kvalues over to the CUDA kernels (this is the only reason I've included them for this quant specifically).
|
My opinion was and still is that, just like with FP8, the per-tensor auxiliary data should live in an optional dedicated
|
@JohannesGaessler okay, so my idea for doing that based on your input currently looks like this:

- make a new llm_graph_input_i subclass for the per-quantization-type levels

Does this make sense?
|
If you want to do prototyping, that sounds like it could work, but that approach is fundamentally incompatible with how I'm implementing tensor parallelism. That implementation assumes there are homogeneous axes along which tensor data can be split and distributed across GPUs. If you partition a tensor into several parts with a different memory layout, that will not work.
|
The levels are just a tiny piece of data needed for computation though (a [2^bpw] 1D vector), and it's marked as input, so it'll be copied to all the backends anyway, right?
|
The pointers would not be calculated correctly if you do that. The way the meta backend is supposed to work is that the underlying "simple" backends can treat their tensor slices in exactly the same way they would normally.
|
So how would one do it properly with the meta backend?
|
One tensor to hold the regular data, one tensor to hold the auxiliary data. Set the tensor for the regular data to be split by e.g. rows, and set the tensor for the auxiliary data to be mirrored (or split, if you do end up implementing scales per X rows). The point is that the assignment of data to GPUs is on a per-tensor basis, so the data layout within a tensor needs to be homogeneous.
|
Unrelated to this PR but generally to quantization: I've been looking at https://github.com/z-lab/paroquant, which claims quite nice results for 4-bit quantization.
|
With vision capabilities becoming more common (e.g. Qwen 3.5 supporting image and video input), it would be worthwhile to evaluate quantization methods against visual weights as well.
Inspired but also a bit worried by the discussions over the recent quant proposals, I've decided to try my luck and have spent the last few days chugging along with my AI workhorses to prototype a new quant for llama.cpp. The idea behind the quant was pretty simple: all of the quants in llama.cpp use static, predefined "magic scales" that were determined to be somewhat optimal for quantizing tensors. So I thought - how about we instead create per-tensor scales pre-quantization and then use them for quantizing and dequantizing? That's how the Q3_PT (or "Per Tensor") quantization scheme was born.
I've tested this quant on Qwen3-4B-Instruct-2507. The imatrix dataset is @bartowski1182's https://gist.github.com/bartowski1182/82ae9b520227f57d79ba04add13d0d0d, while the KLD and perplexity testing dataset is @EAddario's https://huggingface.co/datasets/eaddario/imatrix-calibration combined_all_small with 100 chunks.

This PR contains, per the recommendations, just the pure CPU reference code and the quantization/dequantization code, so obviously any inference using this quant will be painfully slow. However, I've run the KLD and perplexity tests, and they show this quant very nicely fills the gap between the Q3_K and IQ4_XS quant types. The "pure" quants mean they were quantized with --pure, the only exceptions being the embedding and output tensors, which are quantized in Q6_K.