WIP Compute per layer LIM Scores during imatrix #326

Conversation
*WARNING*: This is mostly vibe code. Hope I'm not wasting y'alls time.

**Compute Layer Importance Modification (LIM) Scores**

The goal of this PR is to rank layers of a given tensor in order of sensitivity to quantization error. Given that it is now possible to use a `llama-quantize --custom-q ...` regex, it may be possible to use these LIM Scores to decide which layers of a given tensor to quantize more or less, in an attempt to preserve generation quality (e.g. low perplexity) while reducing memory footprint as compared to using the same quant size across all layers of a given tensor.

This experimental PR was motivated by this comment and PR: ggml-org/llama.cpp#12718 (comment) (EDIT: fixed link to point directly to the comment)

I may force-push this after more testing and experimenting to see if it is actually doing the right thing and if the output is actually useful to improve quantization quality, e.g. PPL per GiB... This may just be a big mistake, lol.

This is built on the existing imatrix computation and assumes that the values of `x[j]` are the "activations" coming right in/out of the given tensor layer. I don't know GGML and generally work in Python or vanilla C, not so much C++. So a lot of this was vibe coded running the [ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4 quant](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ4_K_R4). So this is partially an experiment in actually trying to use an LLM instead of just enjoying the meta of manual quantization min-maxing.

**TODO**

- `Qwen/CodeQwen1.5-7B-Chat-GGUF` `q8_0`
- `ubergarm/DeepSeek-V3-0324-GGUF` `q8_0`
- `--custom-q` regex and compare PPL per GiB

**Reference**

```
@misc{dumitru2024layerwisequantizationpragmaticeffective,
      title={Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels},
      author={Razvan-Gabriel Dumitru and Vikas Yadav and Rishabh Maheshwary and Paul-Ioan Clotan and Sathwik Tejaswi Madhusudhan and Mihai Surdeanu},
      year={2024},
      eprint={2406.17415},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.17415},
      code={https://github.com/RazvanDu/LayerwiseQuant/},
}
```
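For context, here is a minimal sketch of the LIM score as the cited paper describes it: the negated cosine similarity between the input activations of two consecutive layers, so a larger change in the hidden state yields a higher score. The helper name `lim_score` and the toy vectors are illustrative only, not code from this PR.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// LIM(l) = -cos(x_l, x_{l+1}), where x_l are the activations entering layer l.
static double lim_score(const std::vector<float> & x, const std::vector<float> & y) {
    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (std::size_t j = 0; j < x.size(); ++j) {
        dot += (double)x[j] * y[j];
        nx  += (double)x[j] * x[j];
        ny  += (double)y[j] * y[j];
    }
    return -dot / (std::sqrt(nx) * std::sqrt(ny) + 1e-12);
}

int main() {
    std::vector<float> in_l   = {0.1f, -0.4f, 0.3f}; // toy activations entering layer l
    std::vector<float> in_lp1 = {0.2f, -0.3f, 0.5f}; // toy activations entering layer l+1
    std::printf("LIM = %.4f\n", lim_score(in_l, in_lp1));
    return 0;
}
```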
Do I understand the results in the quoted PR correctly? I didn't go and read the blog post, but why would cosine similarity between the inputs of two subsequent layers measure layer importance?
```cpp
// From the PR diff: copies the current row of activations into e.activations,
// overwriting whatever the previous row/token stored there.
for (int row = 0; row < (int)(src1->ne[1]*src1->ne[2]); ++row) {
    const float * x = data + row * src1->ne[0];
    for (int j = 0; j < (int)src1->ne[0]; ++j) {
        e.activations[j] = x[j];
```
So, `activations` gets overwritten each time we get called with a new set of activations. It also gets overwritten as we go over the rows of the activation matrix. At the end of the run, the `compute_lim()` function gets called, which means that the LIM is computed from just the very last token processed in the imatrix run, not from an actual statistical evaluation of cosine similarities between inputs to tensors of the same type in subsequent layers.
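If the goal is a statistical evaluation over the whole run, one hedged sketch (my own illustration, not code from this PR; `SimAccum` is a made-up name) would be to accumulate a per-token cosine similarity for each (tensor type, layer) pair and report the mean at the end, instead of overwriting the stored activations:

```cpp
#include <cmath>
#include <cstddef>

// Accumulates cosine similarities between the activations entering layer l and
// layer l+1, one value per token (row), and reports the mean over the run.
struct SimAccum {
    double      sum_cos = 0.0; // running sum of per-token cosine similarities
    std::size_t n_tok   = 0;   // number of tokens (rows) seen

    // a = activations entering layer l, b = activations entering layer l+1, n = n_embd
    void add(const float * a, const float * b, std::size_t n) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            dot += (double)a[j] * b[j];
            na  += (double)a[j] * a[j];
            nb  += (double)b[j] * b[j];
        }
        sum_cos += dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);
        n_tok   += 1;
    }

    double mean_cos() const { return n_tok ? sum_cos / n_tok : 0.0; }
};
```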
Correct, the summary of the rest of that PR thread, including the specific comment by @compilade, points out issues with that initial experiment and suggests it may be possible to implement the cosine similarity estimate of relative layer importance in the imatrix computation.
The paper that suggests using cosine similarity says:
I'll hack around some more to see if I can fix the implementation to possibly do a "running cosine similarity", given the naive first attempt is not properly doing a statistical evaluation across all the input tokens. The paper suggests another possible method of measuring relative layer sensitivity that I didn't try. Maybe one could calculate the "condition numbers" or "max stretch" for each layer's tensor and rank them; just wildly spit-balling beyond my pay grade xD... Really appreciate your time, thanks!
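For the "max stretch" spit-ball, a hedged sketch of what that could mean: estimate the largest singular value of a layer's weight matrix `W` by power iteration on `W^T W` (how much the matmul can stretch an input vector) and rank layers by it. This is purely illustrative, not wired into imatrix, and `max_stretch` is a made-up name; a full condition number would also need the smallest singular value, which power iteration alone does not give.

```cpp
#include <cmath>
#include <vector>

// Approximate the largest singular value ("max stretch") of a rows x cols
// row-major matrix W by power iteration on W^T W.
static double max_stretch(const std::vector<float> & W, int rows, int cols, int iters = 50) {
    std::vector<double> v(cols, 1.0), u(rows), w(cols);
    double sigma = 0.0;
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < rows; ++i) {            // u = W v
            double s = 0.0;
            for (int j = 0; j < cols; ++j) s += (double)W[i*cols + j] * v[j];
            u[i] = s;
        }
        for (int j = 0; j < cols; ++j) {            // w = W^T u
            double s = 0.0;
            for (int i = 0; i < rows; ++i) s += (double)W[i*cols + j] * u[i];
            w[j] = s;
        }
        double nw = 0.0;                            // normalize w -> next v
        for (double x : w) nw += x*x;
        nw = std::sqrt(nw);
        if (nw == 0.0) break;
        for (int j = 0; j < cols; ++j) v[j] = w[j] / nw;
        double nu = 0.0;                            // ||W v|| converges to sigma_max
        for (double x : u) nu += x*x;
        sigma = std::sqrt(nu);
    }
    return sigma;
}
```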
Sure. But the activations did not change due to that tensor only, they changed due to all tensors in the preceding layer. Or more precisely, activations changed due to the tensor we are considering, plus all tensors with their linear and non-linear operations that followed, before arriving at the same tensor type in the next layer. If the changes in the activations were trivially predictable, people wouldn't be doing complicated networks, and wouldn't be experimenting around with GELU's, RELU's, SILU's, variations of RoPE, different combinations of activation normalizations, and all that jazz. I can see looking at the activation change between whole layers to derive an estimate of how important the entire layer was, but claiming that the difference in activation input to a specific tensor type between two consecutive layers is a measure of how important this specific tensor type is? That's pushing it.
I agree with @ikawrakow, comparing across layers for a particular tensor seems like it would have non-intuitive results which might not necessarily be linked to the relative importance of the tensors. I think what is calculated here is the cosine similarity between the inputs of consecutive layers for each linear operation in the model(s). It's not particularly clear how this information can be used.
@ubergarm What I meant by this was to calculate LIM scores with the input and output within each linear operation (i.e. what [...]).
Can you be more specific about how you want to calculate the impact of a linear operation from the input activations and the result of the linear operation? I have used this to derive corrections for a quantized model (have not published, it is in a private repository where I experiment with stuff). But I don't really see how one can derive tensor importance scores from that.
@ikawrakow I might not have thought this through properly. I was thinking of directly calculating a dot product between the input and output of each matmul (and normalizing) to get LIM scores by negating that, but this would only work for square matrices (where the input and output have the same shape). |
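A minimal sketch of that input/output idea, under the square-matrix assumption discussed above (the names here are illustrative, not from this discussion's code): compute `y = W x` and negate the cosine similarity between `x` and `y`.

```cpp
#include <cmath>
#include <vector>

// Only meaningful when W is square (n x n, row-major), so x and y share a dimension.
static double io_lim_score(const std::vector<float> & W, const std::vector<float> & x, int n) {
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            y[i] += (double)W[i*n + j] * x[j];

    double dot = 0.0, nx = 0.0, ny = 0.0;
    for (int i = 0; i < n; ++i) {
        dot += (double)x[i] * y[i];
        nx  += (double)x[i] * x[i];
        ny  += y[i] * y[i];
    }
    return -dot / (std::sqrt(nx) * std::sqrt(ny) + 1e-12);
}
```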
Closing this in favor of the implementation in PR #328.

**Experiment**

Still more experimentation to do, and sorry no visual graphs as I'm away from my desk, but I did a quick A/B test comparing two [...]. Finally, I provide the [...].

**tl;dr**

Using PR #328 [...]
While it is within the noise, there may be room for further improvement applying the scores to attention tensor quantization as well, which I didn't do for this experiment. In retrospect, I probably should have used the layer importance scores from [...].

**Procedure**

Compute imatrix and layer similarity scores using `V3-0324` `q8_0`:

```
$ numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
--verbosity 1 \
--layer-similarity \
-m /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf \
-f calibration_data_v5_rc.txt \
-o /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-$(git rev-parse --short HEAD).dat \
--ctx-size 512 \
--numa numactl \
--threads 128
llama_model_loader: loaded meta data with 46 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = DeepSeek
llama_model_loader: - kv 5: general.size_label str = 256x21B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 16: general.file_type u32 = 7
llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 786 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 672.050 B
llm_load_print_meta: model size = 665.308 GiB (8.504 BPW)
llm_load_print_meta: repeating layers = 663.474 GiB (8.504 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek V3 0324
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: ggml ctx size = 0.47 MiB
llm_load_tensors: CPU buffer size = 681274.97 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CPU KV buffer size = 2440.00 MiB
llama_new_context_with_model: KV self size = 2440.00 MiB, K (f16): 1464.00 MiB, V (f16): 976.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 283.01 MiB
llama_new_context_with_model: graph nodes = 3724
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 309.837 ms
compute_imatrix: computing over 213 chunks with batch_size 512
compute_imatrix: 37.90 seconds per pass - ETA 2 hours 14.55 minutes
[1]60.9619,[2]10.7701,[3]5.8724,[4]3.7883,[5]2.9691,[6]2.5089,[7]2.2199,[8]2.0199,[9]1.9095,
save_imatrix: entry ' blk.60.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.60.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.60.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.25.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.26.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.25.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.25.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.26.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.26.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: stored collected data after 10 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[10]1.8219,[11]2.0296,[12]2.0839,[13]2.0978,[14]2.1403,[15]2.0365,[16]1.9492,[17]1.8786,[18]1.8160,[19]1.7743,
save_imatrix: stored collected data after 20 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[20]1.7315,[21]1.6986,[22]1.6609,[23]1.6319,[24]1.6201,[25]1.6080,[26]1.5822,[27]1.6812,[28]1.7547,[29]1.8204,
save_imatrix: stored collected data after 30 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[30]1.8188,[31]1.8323,[32]1.8317,[33]1.8091,[34]1.8457,[35]1.8217,[36]1.8215,[37]1.8106,[38]1.8208,[39]1.8070,
save_imatrix: stored collected data after 40 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[40]1.7838,[41]1.7606,[42]1.7410,[43]1.7291,[44]1.7157,[45]1.7023,[46]1.6981,[47]1.6919,[48]1.6811,[49]1.6707,
save_imatrix: stored collected data after 50 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[50]1.6650,[51]1.6623,[52]1.6625,[53]1.6672,[54]1.6812,[55]1.6781,[56]1.6683,[57]1.6764,[58]1.6796,[59]1.6906,
save_imatrix: stored collected data after 60 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[60]1.6855,[61]1.7243,[62]1.7565,[63]1.7884,[64]1.8197,[65]1.8677,[66]1.8802,[67]1.9148,[68]1.9442,[69]1.9996,
save_imatrix: stored collected data after 70 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[70]2.0525,[71]2.0832,[72]2.1136,[73]2.1258,[74]2.1407,[75]2.1702,[76]2.2011,[77]2.2185,[78]2.2164,[79]2.2313,
save_imatrix: stored collected data after 80 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[80]2.2543,[81]2.2904,[82]2.3238,[83]2.3342,[84]2.3650,[85]2.3733,[86]2.3730,[87]2.4024,[88]2.4344,[89]2.4899,
save_imatrix: stored collected data after 90 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[90]2.5102,[91]2.5125,[92]2.5192,[93]2.5349,[94]2.5452,[95]2.5779,[96]2.5670,[97]2.6058,[98]2.6319,[99]2.6214,
save_imatrix: stored collected data after 100 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[100]2.6537,[101]2.7008,[102]2.7326,[103]2.7740,[104]2.8020,[105]2.8310,[106]2.8682,[107]2.8605,[108]2.8789,[109]2.8849,
save_imatrix: stored collected data after 110 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[110]2.8910,[111]2.8878,[112]2.9177,[113]2.9435,[114]2.9520,[115]2.9363,[116]2.9104,[117]2.9044,[118]2.9147,[119]2.9003,
save_imatrix: stored collected data after 120 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[120]2.8773,[121]2.8737,[122]2.8738,[123]2.8819,[124]2.8872,[125]2.8942,[126]2.9018,[127]2.9043,[128]2.9343,[129]2.9484,
save_imatrix: stored collected data after 130 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[130]2.9241,[131]2.9003,[132]2.8771,[133]2.8544,[134]2.8563,[135]2.8567,[136]2.8828,[137]2.9150,[138]2.9340,[139]2.9389,
save_imatrix: stored collected data after 140 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[140]2.9637,[141]2.9866,[142]3.0151,[143]3.0354,[144]3.0569,[145]3.0766,[146]3.0972,[147]3.1154,[148]3.1266,[149]3.1351,
save_imatrix: stored collected data after 150 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[150]3.1395,[151]3.1572,[152]3.1761,[153]3.1759,[154]3.1834,[155]3.1945,[156]3.2035,[157]3.2148,[158]3.2209,[159]3.2300,
save_imatrix: stored collected data after 160 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[160]3.2442,[161]3.2498,[162]3.2525,[163]3.2595,[164]3.2704,[165]3.2724,[166]3.2737,[167]3.2912,[168]3.3010,[169]3.3082,
save_imatrix: stored collected data after 170 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[170]3.3258,[171]3.3403,[172]3.3354,[173]3.3417,[174]3.3424,[175]3.3575,[176]3.3691,[177]3.3818,[178]3.3768,[179]3.3734,
save_imatrix: stored collected data after 180 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[180]3.3682,[181]3.3635,[182]3.3578,[183]3.3531,[184]3.3472,[185]3.3600,[186]3.3887,[187]3.4121,[188]3.4336,[189]3.4550,
save_imatrix: stored collected data after 190 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[190]3.4850,[191]3.4990,[192]3.5134,[193]3.5036,[194]3.5210,[195]3.5145,[196]3.4953,[197]3.4747,[198]3.4946,[199]3.5110,
save_imatrix: stored collected data after 200 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[200]3.5207,[201]3.5290,[202]3.5447,[203]3.5621,[204]3.5748,[205]3.5874,[206]3.6021,[207]3.5989,[208]3.5771,[209]3.5556,
save_imatrix: stored collected data after 210 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
[210]3.5342,[211]3.5134,[212]3.4930,[213]3.4727,
save_imatrix: stored collected data after 213 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-f7c5a94e.dat
Final estimate: PPL = 3.4727 +/- 0.03300
llama_print_timings: load time = 38826.79 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 7699212.14 ms / 109056 tokens ( 70.60 ms per token, 14.16 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 7777812.63 ms / 109057 tokens
======================== sorted layer importances
0: Layer 0, <cos_sim> = 0.517453
1: Layer 60, <cos_sim> = 0.59436
2: Layer 8, <cos_sim> = 0.857555
3: Layer 3, <cos_sim> = 0.858137
4: Layer 1, <cos_sim> = 0.869657
5: Layer 59, <cos_sim> = 0.875667
6: Layer 57, <cos_sim> = 0.888417
7: Layer 5, <cos_sim> = 0.906457
8: Layer 58, <cos_sim> = 0.911674
9: Layer 7, <cos_sim> = 0.921961
10: Layer 53, <cos_sim> = 0.926514
11: Layer 22, <cos_sim> = 0.932632
12: Layer 17, <cos_sim> = 0.936935
13: Layer 24, <cos_sim> = 0.93742
14: Layer 23, <cos_sim> = 0.939419
15: Layer 4, <cos_sim> = 0.941044
16: Layer 15, <cos_sim> = 0.945621
17: Layer 25, <cos_sim> = 0.94563
18: Layer 6, <cos_sim> = 0.946055
# NOTE: i prioritized the above 17 routed expert layers [3-60] for more bpw quantization (first 0-2 layers are dense)
19: Layer 21, <cos_sim> = 0.946446
20: Layer 16, <cos_sim> = 0.947423
21: Layer 27, <cos_sim> = 0.947699
22: Layer 18, <cos_sim> = 0.948201
23: Layer 10, <cos_sim> = 0.949096
24: Layer 54, <cos_sim> = 0.949141
25: Layer 2, <cos_sim> = 0.949452
26: Layer 20, <cos_sim> = 0.949668
27: Layer 30, <cos_sim> = 0.949811
28: Layer 26, <cos_sim> = 0.951796
29: Layer 13, <cos_sim> = 0.951903
30: Layer 14, <cos_sim> = 0.952166
31: Layer 9, <cos_sim> = 0.952194
32: Layer 44, <cos_sim> = 0.952973
33: Layer 35, <cos_sim> = 0.953037
34: Layer 45, <cos_sim> = 0.953128
35: Layer 29, <cos_sim> = 0.954667
36: Layer 28, <cos_sim> = 0.954742
37: Layer 31, <cos_sim> = 0.954809
38: Layer 56, <cos_sim> = 0.955925
39: Layer 43, <cos_sim> = 0.956722
40: Layer 50, <cos_sim> = 0.958269
41: Layer 19, <cos_sim> = 0.959386
42: Layer 33, <cos_sim> = 0.95975
43: Layer 32, <cos_sim> = 0.960649
44: Layer 55, <cos_sim> = 0.960837
45: Layer 11, <cos_sim> = 0.961299
46: Layer 34, <cos_sim> = 0.961852
47: Layer 12, <cos_sim> = 0.962011
48: Layer 46, <cos_sim> = 0.962943
49: Layer 49, <cos_sim> = 0.965045
50: Layer 39, <cos_sim> = 0.96526
51: Layer 40, <cos_sim> = 0.96575
52: Layer 37, <cos_sim> = 0.967049
53: Layer 36, <cos_sim> = 0.96716
54: Layer 52, <cos_sim> = 0.967574
55: Layer 38, <cos_sim> = 0.968262
56: Layer 41, <cos_sim> = 0.968457
57: Layer 48, <cos_sim> = 0.968755
58: Layer 51, <cos_sim> = 0.968768
59: Layer 47, <cos_sim> = 0.968788
60: Layer 42, <cos_sim> = 0.971662
======================== sorted attention importances
0: Layer 0, <cos_sim> = 0.13174
1: Layer 8, <cos_sim> = 0.516951
2: Layer 11, <cos_sim> = 0.61188
3: Layer 10, <cos_sim> = 0.612091
4: Layer 12, <cos_sim> = 0.612348
5: Layer 18, <cos_sim> = 0.616718
6: Layer 16, <cos_sim> = 0.61912
7: Layer 9, <cos_sim> = 0.655522
8: Layer 13, <cos_sim> = 0.665296
9: Layer 22, <cos_sim> = 0.672061
10: Layer 6, <cos_sim> = 0.699289
11: Layer 19, <cos_sim> = 0.700966
12: Layer 20, <cos_sim> = 0.704575
13: Layer 7, <cos_sim> = 0.71001
14: Layer 14, <cos_sim> = 0.725971
15: Layer 23, <cos_sim> = 0.740926
16: Layer 25, <cos_sim> = 0.747222
17: Layer 17, <cos_sim> = 0.749419
18: Layer 15, <cos_sim> = 0.754558
19: Layer 21, <cos_sim> = 0.761675
20: Layer 24, <cos_sim> = 0.761882
21: Layer 5, <cos_sim> = 0.766086
22: Layer 2, <cos_sim> = 0.767046
23: Layer 30, <cos_sim> = 0.772412
24: Layer 1, <cos_sim> = 0.772533
25: Layer 44, <cos_sim> = 0.777696
26: Layer 29, <cos_sim> = 0.779458
27: Layer 28, <cos_sim> = 0.779721
28: Layer 37, <cos_sim> = 0.780809
29: Layer 26, <cos_sim> = 0.781589
30: Layer 4, <cos_sim> = 0.786884
31: Layer 34, <cos_sim> = 0.787128
32: Layer 36, <cos_sim> = 0.78846
33: Layer 27, <cos_sim> = 0.791454
34: Layer 31, <cos_sim> = 0.805225
35: Layer 33, <cos_sim> = 0.806554
36: Layer 57, <cos_sim> = 0.809911
37: Layer 32, <cos_sim> = 0.811714
38: Layer 38, <cos_sim> = 0.81192
39: Layer 35, <cos_sim> = 0.816966
40: Layer 41, <cos_sim> = 0.820029
41: Layer 40, <cos_sim> = 0.833644
42: Layer 3, <cos_sim> = 0.83367
43: Layer 39, <cos_sim> = 0.835849
44: Layer 42, <cos_sim> = 0.841079
45: Layer 60, <cos_sim> = 0.853526
46: Layer 45, <cos_sim> = 0.857364
47: Layer 56, <cos_sim> = 0.859897
48: Layer 59, <cos_sim> = 0.861441
49: Layer 53, <cos_sim> = 0.864087
50: Layer 46, <cos_sim> = 0.864727
51: Layer 43, <cos_sim> = 0.864848
52: Layer 51, <cos_sim> = 0.872346
53: Layer 48, <cos_sim> = 0.87434
54: Layer 52, <cos_sim> = 0.874649
55: Layer 47, <cos_sim> = 0.878183
56: Layer 58, <cos_sim> = 0.879985
57: Layer 49, <cos_sim> = 0.880846
58: Layer 55, <cos_sim> = 0.885206
59: Layer 50, <cos_sim> = 0.897436
60: Layer 54, <cos_sim> = 0.921917
======================== sorted ffn importances
0: Layer 7, <cos_sim> = 0.571293
1: Layer 10, <cos_sim> = 0.590428
2: Layer 11, <cos_sim> = 0.591834
3: Layer 17, <cos_sim> = 0.608386
4: Layer 15, <cos_sim> = 0.620593
5: Layer 0, <cos_sim> = 0.632572
6: Layer 9, <cos_sim> = 0.643826
7: Layer 12, <cos_sim> = 0.64739
8: Layer 8, <cos_sim> = 0.649753
9: Layer 21, <cos_sim> = 0.67168
10: Layer 18, <cos_sim> = 0.679443
11: Layer 19, <cos_sim> = 0.701283
12: Layer 60, <cos_sim> = 0.701407
13: Layer 13, <cos_sim> = 0.712941
14: Layer 16, <cos_sim> = 0.722858
15: Layer 24, <cos_sim> = 0.725591
16: Layer 14, <cos_sim> = 0.727539
17: Layer 22, <cos_sim> = 0.728219
18: Layer 20, <cos_sim> = 0.736531
19: Layer 6, <cos_sim> = 0.744335
20: Layer 23, <cos_sim> = 0.749712
21: Layer 29, <cos_sim> = 0.757133
22: Layer 25, <cos_sim> = 0.758496
23: Layer 5, <cos_sim> = 0.759015
24: Layer 27, <cos_sim> = 0.759242
25: Layer 28, <cos_sim> = 0.76237
26: Layer 43, <cos_sim> = 0.764705
27: Layer 36, <cos_sim> = 0.766839
28: Layer 35, <cos_sim> = 0.773264
29: Layer 26, <cos_sim> = 0.775702
30: Layer 33, <cos_sim> = 0.778872
31: Layer 32, <cos_sim> = 0.790364
32: Layer 3, <cos_sim> = 0.790503
33: Layer 30, <cos_sim> = 0.792984
34: Layer 31, <cos_sim> = 0.79496
35: Layer 37, <cos_sim> = 0.795521
36: Layer 34, <cos_sim> = 0.796573
37: Layer 56, <cos_sim> = 0.804781
38: Layer 40, <cos_sim> = 0.806738
39: Layer 59, <cos_sim> = 0.808235
40: Layer 4, <cos_sim> = 0.809825
41: Layer 1, <cos_sim> = 0.819665
42: Layer 38, <cos_sim> = 0.820409
43: Layer 39, <cos_sim> = 0.820894
44: Layer 41, <cos_sim> = 0.824874
45: Layer 44, <cos_sim> = 0.846473
46: Layer 52, <cos_sim> = 0.849335
47: Layer 42, <cos_sim> = 0.850524
48: Layer 45, <cos_sim> = 0.851349
49: Layer 55, <cos_sim> = 0.852943
50: Layer 47, <cos_sim> = 0.85862
51: Layer 50, <cos_sim> = 0.858953
52: Layer 51, <cos_sim> = 0.861418
53: Layer 58, <cos_sim> = 0.861473
54: Layer 2, <cos_sim> = 0.862156
55: Layer 57, <cos_sim> = 0.86361
56: Layer 46, <cos_sim> = 0.864787
57: Layer 48, <cos_sim> = 0.867249
58: Layer 54, <cos_sim> = 0.876651
59: Layer 49, <cos_sim> = 0.883354
60: Layer 53, <cos_sim> = 0.90793
```
**Logs**

llama-imatrix run printing out what hopefully are actually LIM scores:
numactl -N 1 -m 1 \ ./build/bin/llama-imatrix \ --verbosity 1 \ -m /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf \ -f calibration_data_v5_rc.txt \ -o imatrix.dat \ --ctx-size 512 \ --numa numactl \ --threads 128 llama_model_loader: loaded meta data with 46 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324 llama_model_loader: - kv 3: general.version str = V3-0324 llama_model_loader: - kv 4: general.basename str = DeepSeek llama_model_loader: - kv 5: general.size_label str = 256x21B llama_model_loader: - kv 6: general.license str = mit llama_model_loader: - kv 7: deepseek2.block_count u32 = 61 llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168 llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432 llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8 llama_model_loader: - kv 16: general.file_type u32 = 7 llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3 llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280 llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048 llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256 llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1 llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000 llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2 llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000 llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000 llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3 llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = [" llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3 llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = [" llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0 llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1 llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - 
kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de... llama_model_loader: - kv 45: general.quantization_version u32 = 2 llama_model_loader: - type f32: 361 tensors llama_model_loader: - type q8_0: 786 tensors llm_load_vocab: special tokens cache size = 818 llm_load_vocab: token to piece cache size = 0.8223 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = deepseek2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 129280 llm_load_print_meta: n_merges = 127741 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 163840 llm_load_print_meta: n_embd = 7168 llm_load_print_meta: n_layer = 61 llm_load_print_meta: n_head = 128 llm_load_print_meta: n_head_kv = 128 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 192 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 24576 llm_load_print_meta: n_embd_v_gqa = 16384 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 18432 llm_load_print_meta: n_expert = 256 llm_load_print_meta: n_expert_used = 8 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = yarn llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 0.025 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 671B llm_load_print_meta: model ftype = Q8_0 llm_load_print_meta: model params = 672.050 B llm_load_print_meta: model size = 665.308 GiB (8.504 BPW) llm_load_print_meta: repeating layers = 663.474 GiB (8.504 BPW, 670.196 B parameters) llm_load_print_meta: general.name = DeepSeek V3 0324 llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>' llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>' llm_load_print_meta: LF token = 131 'Ä' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_layer_dense_lead = 3 llm_load_print_meta: n_lora_q = 1536 llm_load_print_meta: n_lora_kv = 512 llm_load_print_meta: n_ff_exp = 2048 llm_load_print_meta: n_expert_shared = 1 llm_load_print_meta: expert_weights_scale = 2.5 llm_load_print_meta: expert_weights_norm = 1 llm_load_print_meta: expert_gating_func = sigmoid llm_load_print_meta: rope_yarn_log_mul = 0.1000 llm_load_tensors: ggml ctx size = 0.47 MiB llm_load_tensors: CPU buffer size = 681274.97 MiB .................................................................................................... 
llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 0.025 llama_kv_cache_init: CPU KV buffer size = 2440.00 MiB llama_new_context_with_model: KV self size = 2440.00 MiB, K (f16): 1464.00 MiB, V (f16): 976.00 MiB llama_new_context_with_model: CPU output buffer size = 0.49 MiB llama_new_context_with_model: CPU compute buffer size = 283.01 MiB llama_new_context_with_model: graph nodes = 3724 llama_new_context_with_model: graph splits = 1 system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | compute_imatrix: tokenizing the input .. compute_imatrix: tokenization took 312.531 ms compute_imatrix: computing over 213 chunks with batch_size 512 compute_imatrix: 53.45 seconds per pass - ETA 3 hours 9.73 minutes [1]60.9619,[2]10.7701,[3]5.8724,[4]3.7883,[5]2.9691,[6]2.5089,[7]2.2199,[8]2.0199,[9]1.9095, save_imatrix: entry ' blk.60.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: entry ' blk.60.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: entry ' blk.60.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: entry ' blk.25.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: entry ' blk.26.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: entry ' blk.25.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: entry ' blk.25.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: entry ' blk.26.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: entry ' blk.26.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware** save_imatrix: stored collected data after 10 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [10]1.8219,[11]2.0296,[12]2.0839,[13]2.0978,[14]2.1403,[15]2.0365,[16]1.9492,[17]1.8786,[18]1.8160,[19]1.7743, save_imatrix: stored collected data after 20 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [20]1.7315,[21]1.6986,[22]1.6609,[23]1.6319,[24]1.6201,[25]1.6080,[26]1.5822,[27]1.6812,[28]1.7547,[29]1.8204, save_imatrix: stored collected data after 30 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [30]1.8188,[31]1.8323,[32]1.8317,[33]1.8091,[34]1.8457,[35]1.8217,[36]1.8215,[37]1.8106,[38]1.8208,[39]1.8070, save_imatrix: stored collected data after 
40 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [40]1.7838,[41]1.7606,[42]1.7410,[43]1.7291,[44]1.7157,[45]1.7023,[46]1.6981,[47]1.6919,[48]1.6811,[49]1.6707, save_imatrix: stored collected data after 50 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [50]1.6650,[51]1.6623,[52]1.6625,[53]1.6672,[54]1.6812,[55]1.6781,[56]1.6683,[57]1.6764,[58]1.6796,[59]1.6906, save_imatrix: stored collected data after 60 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [60]1.6855,[61]1.7243,[62]1.7565,[63]1.7884,[64]1.8197,[65]1.8677,[66]1.8802,[67]1.9148,[68]1.9442,[69]1.9996, save_imatrix: stored collected data after 70 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [70]2.0525,[71]2.0832,[72]2.1136,[73]2.1258,[74]2.1407,[75]2.1702,[76]2.2011,[77]2.2185,[78]2.2164,[79]2.2313, save_imatrix: stored collected data after 80 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [80]2.2543,[81]2.2904,[82]2.3238,[83]2.3342,[84]2.3650,[85]2.3733,[86]2.3730,[87]2.4024,[88]2.4344,[89]2.4899, save_imatrix: stored collected data after 90 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [90]2.5102,[91]2.5125,[92]2.5192,[93]2.5349,[94]2.5452,[95]2.5779,[96]2.5670,[97]2.6058,[98]2.6319,[99]2.6214, save_imatrix: stored collected data after 100 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [100]2.6537,[101]2.7008,[102]2.7326,[103]2.7740,[104]2.8020,[105]2.8310,[106]2.8682,[107]2.8605,[108]2.8789,[109]2.8849, save_imatrix: stored collected data after 110 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [110]2.8910,[111]2.8878,[112]2.9177,[113]2.9435,[114]2.9520,[115]2.9363,[116]2.9104,[117]2.9044,[118]2.9147,[119]2.9003, save_imatrix: stored collected data after 120 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [120]2.8773,[121]2.8737,[122]2.8738,[123]2.8819,[124]2.8872,[125]2.8942,[126]2.9018,[127]2.9043,[128]2.9343,[129]2.9484, save_imatrix: stored collected data after 130 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [130]2.9241,[131]2.9003,[132]2.8771,[133]2.8544,[134]2.8563,[135]2.8567,[136]2.8828,[137]2.9150,[138]2.9340,[139]2.9389, save_imatrix: stored collected data after 140 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [140]2.9637,[141]2.9866,[142]3.0151,[143]3.0354,[144]3.0569,[145]3.0766,[146]3.0972,[147]3.1154,[148]3.1266,[149]3.1351, save_imatrix: stored collected data after 150 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [150]3.1395,[151]3.1572,[152]3.1761,[153]3.1759,[154]3.1834,[155]3.1945,[156]3.2035,[157]3.2148,[158]3.2209,[159]3.2300, save_imatrix: stored collected data after 160 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat 
[160]3.2442,[161]3.2498,[162]3.2525,[163]3.2595,[164]3.2704,[165]3.2724,[166]3.2737,[167]3.2912,[168]3.3010,[169]3.3082, save_imatrix: stored collected data after 170 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [170]3.3258,[171]3.3403,[172]3.3354,[173]3.3417,[174]3.3424,[175]3.3575,[176]3.3691,[177]3.3818,[178]3.3768,[179]3.3734, save_imatrix: stored collected data after 180 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [180]3.3682,[181]3.3635,[182]3.3578,[183]3.3531,[184]3.3472,[185]3.3600,[186]3.3887,[187]3.4121,[188]3.4336,[189]3.4550, save_imatrix: stored collected data after 190 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [190]3.4850,[191]3.4990,[192]3.5134,[193]3.5036,[194]3.5210,[195]3.5145,[196]3.4953,[197]3.4747,[198]3.4946,[199]3.5110, save_imatrix: stored collected data after 200 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [200]3.5207,[201]3.5290,[202]3.5447,[203]3.5621,[204]3.5748,[205]3.5874,[206]3.6021,[207]3.5989,[208]3.5771,[209]3.5556, save_imatrix: stored collected data after 210 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat [210]3.5342,[211]3.5134,[212]3.4930,[213]3.4727, save_imatrix: stored collected data after 213 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-0e808309.dat llama_print_timings: load time = 54390.61 ms llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) llama_print_timings: prompt eval time = 10568880.33 ms / 109056 tokens ( 96.91 ms per token, 10.32 tokens per second) llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) llama_print_timings: total time = 10644363.84 ms / 109057 tokens Final estimate: PPL = 3.4727 +/- 0.03300 === Computing Layer Importance Modification (LIM) Scores... 
Tensor: ffn_down Layer LIM Score ----- --------- 0 -0.0005 1 0.0003 Tensor: ffn_gate Layer LIM Score ----- --------- 0 -0.9435 1 -0.9339 Tensor: attn_kv_b Layer LIM Score ----- --------- 0 0.0158 1 -0.0101 2 0.1035 3 0.0725 4 0.0570 5 -0.1063 6 -0.0104 7 -0.0682 8 0.0010 9 -0.0483 10 0.0071 11 -0.0183 12 0.0444 13 -0.0155 14 -0.0235 15 -0.0039 16 -0.0144 17 0.0431 18 0.1076 19 0.0789 20 -0.0668 21 -0.0136 22 -0.0317 23 0.0152 24 0.0210 25 -0.0111 26 0.0289 27 0.0192 28 -0.0513 29 0.0366 30 0.0046 31 -0.0151 32 -0.0159 33 0.0894 34 0.0484 35 0.0126 36 0.0168 37 -0.0292 38 0.0405 39 -0.0329 40 0.0770 41 0.0044 42 0.0064 43 0.0106 44 0.0041 45 0.0120 46 -0.0012 47 -0.0506 48 -0.0222 49 0.0434 50 0.0409 51 0.0133 52 0.0315 53 0.0141 54 0.0002 55 -0.0269 56 -0.0391 57 0.0213 58 0.0365 59 -0.0249 Tensor: attn_q_a Layer LIM Score ----- --------- 0 -0.4179 1 -0.8773 2 -0.9436 3 -0.9022 4 -0.9166 5 -0.9418 6 -0.9812 7 -0.9599 8 -0.9085 9 -0.9724 10 -0.9882 11 -0.9868 12 -0.9906 13 -0.9816 14 -0.9827 15 -0.9766 16 -0.9590 17 -0.9474 18 -0.9573 19 -0.9601 20 -0.9553 21 -0.9345 22 -0.9042 23 -0.9299 24 -0.9555 25 -0.9554 26 -0.9598 27 -0.9575 28 -0.9610 29 -0.9634 30 -0.9601 31 -0.9572 32 -0.9674 33 -0.9619 34 -0.9707 35 -0.9493 36 -0.9801 37 -0.9702 38 -0.9737 39 -0.9567 40 -0.9366 41 -0.9667 42 -0.9751 43 -0.9566 44 -0.9488 45 -0.9364 46 -0.9516 47 -0.9355 48 -0.9723 49 -0.9630 50 -0.9702 51 -0.9591 52 -0.9670 53 -0.8937 54 -0.9420 55 -0.9566 56 -0.9543 57 -0.8239 58 -0.8915 59 -0.9073 Tensor: ffn_up Layer LIM Score ----- --------- 0 -0.9435 1 -0.9339 Tensor: ffn_gate_shexp Layer LIM Score ----- --------- 3 -0.9355 4 -0.9365 5 -0.9068 6 -0.9485 7 -0.9117 8 -0.8524 9 -0.9458 10 -0.9404 11 -0.9593 12 -0.9458 13 -0.9364 14 -0.9494 15 -0.8997 16 -0.9017 17 -0.8748 18 -0.8369 19 -0.9108 20 -0.8583 21 -0.8067 22 -0.8093 23 -0.8568 24 -0.8719 25 -0.8983 26 -0.9103 27 -0.8789 28 -0.9135 29 -0.9107 30 -0.8975 31 -0.9346 32 -0.9335 33 -0.9334 34 -0.9343 35 -0.9524 36 -0.9404 37 -0.9573 38 -0.9487 39 -0.8949 40 -0.9070 41 -0.9669 42 -0.9815 43 -0.9481 44 -0.9233 45 -0.9606 46 -0.9472 47 -0.9145 48 -0.9580 49 -0.9672 50 -0.9689 51 -0.9570 52 -0.9670 53 -0.9735 54 -0.9553 55 -0.9542 56 -0.9671 57 -0.9526 58 -0.9285 59 -0.9185 Tensor: attn_output Layer LIM Score ----- --------- 0 -0.0085 1 -0.0031 2 -0.0161 3 0.0021 4 -0.0048 5 -0.0054 6 -0.0048 7 0.0039 8 0.0093 9 0.0012 10 0.0088 11 0.0053 12 -0.0081 13 -0.0059 14 -0.0070 15 0.0006 16 -0.0065 17 -0.0013 18 -0.0146 19 0.0130 20 0.0002 21 0.0036 22 0.0010 23 -0.0060 24 -0.0079 25 0.0084 26 0.0084 27 0.0064 28 0.0000 29 0.0105 30 -0.0013 31 -0.0003 32 -0.0054 33 0.0022 34 -0.0029 35 -0.0028 36 0.0048 37 0.0044 38 -0.0011 39 -0.0155 40 0.0008 41 -0.0222 42 0.0034 43 0.0029 44 0.0060 45 -0.0064 46 0.0054 47 -0.0042 48 0.0226 49 -0.0025 50 -0.0013 51 -0.0026 52 -0.0077 53 -0.0047 54 0.0012 55 -0.0097 56 -0.0060 57 -0.0017 58 -0.0126 59 -0.0006 Tensor: attn_q_b Layer LIM Score ----- --------- 0 -0.0019 1 0.0326 2 -0.0428 3 0.0138 4 -0.0080 5 0.0039 6 -0.0023 7 0.0048 8 -0.0020 9 -0.0183 10 -0.0130 11 0.0098 12 -0.0203 13 0.0459 14 -0.0151 15 0.0240 16 -0.0004 17 0.0102 18 0.0228 19 -0.0027 20 0.0248 21 -0.0085 22 -0.0558 23 0.0006 24 0.0064 25 0.0101 26 0.0460 27 -0.0457 28 0.0438 29 0.0190 30 0.0018 31 -0.0275 32 0.0409 33 -0.0184 34 0.0215 35 -0.0329 36 0.0059 37 -0.0366 38 -0.0044 39 0.0191 40 -0.0017 41 -0.0191 42 -0.0314 43 -0.0303 44 0.0249 45 0.0063 46 0.0204 47 -0.0585 48 -0.0175 49 0.0103 50 -0.0059 51 -0.0109 52 -0.0188 53 -0.0267 54 -0.0126 55 0.0192 56 
-0.0573 57 -0.0073 58 0.0007 59 0.0150 Tensor: ffn_up_exps Layer LIM Score ----- --------- 3 -0.5456 4 -0.4082 5 -0.2537 6 -0.1726 7 -0.1470 8 -0.1202 9 -0.1336 10 -0.1300 11 -0.1028 12 -0.0907 13 -0.0846 14 -0.1017 15 -0.1079 16 -0.1087 17 -0.1140 18 -0.1238 19 -0.1185 20 -0.1048 21 -0.1017 22 -0.1183 23 -0.1191 24 -0.1308 25 -0.1321 26 -0.1296 27 -0.1313 28 -0.1243 29 -0.1219 30 -0.1115 31 -0.1232 32 -0.1394 33 -0.1531 34 -0.1637 35 -0.1862 36 -0.1986 37 -0.1989 38 -0.1842 39 -0.1887 40 -0.1801 41 -0.1856 42 -0.1775 43 -0.1715 44 -0.1735 45 -0.1763 46 -0.1583 47 -0.1574 48 -0.1662 49 -0.1617 50 -0.1480 51 -0.1449 52 -0.1454 53 -0.1490 54 -0.1414 55 -0.1439 56 -0.1482 57 -0.1503 58 -0.1510 59 -0.1676 Tensor: ffn_down_shexp Layer LIM Score ----- --------- 3 -0.0069 4 -0.0084 5 -0.0035 6 0.0161 7 -0.0323 8 0.0076 9 -0.0282 10 0.0427 11 0.0319 12 -0.0441 13 -0.0088 14 0.0075 15 0.0354 16 0.0322 17 0.0148 18 0.0170 19 0.0018 20 0.0105 21 -0.0051 22 0.0146 23 0.0331 24 -0.0011 25 0.0010 26 0.0267 27 -0.0100 28 0.0151 29 0.0055 30 -0.0155 31 -0.0191 32 -0.0075 33 -0.0136 34 -0.0237 35 -0.0251 36 -0.0276 37 0.0159 38 -0.0328 39 -0.0050 40 0.0141 41 -0.0140 42 -0.0111 43 0.0180 44 -0.0102 45 -0.0356 46 0.0016 47 0.0206 48 -0.0075 49 -0.0405 50 0.0422 51 -0.0146 52 -0.0320 53 0.0046 54 0.0311 55 0.0032 56 -0.0039 57 -0.0203 58 -0.0136 59 -0.0119 Tensor: ffn_up_shexp Layer LIM Score ----- --------- 3 -0.9355 4 -0.9365 5 -0.9068 6 -0.9485 7 -0.9117 8 -0.8524 9 -0.9458 10 -0.9404 11 -0.9593 12 -0.9458 13 -0.9364 14 -0.9494 15 -0.8997 16 -0.9017 17 -0.8748 18 -0.8369 19 -0.9108 20 -0.8583 21 -0.8067 22 -0.8093 23 -0.8568 24 -0.8719 25 -0.8983 26 -0.9103 27 -0.8789 28 -0.9135 29 -0.9107 30 -0.8975 31 -0.9346 32 -0.9335 33 -0.9334 34 -0.9343 35 -0.9524 36 -0.9404 37 -0.9573 38 -0.9487 39 -0.8949 40 -0.9070 41 -0.9669 42 -0.9815 43 -0.9481 44 -0.9233 45 -0.9606 46 -0.9472 47 -0.9145 48 -0.9580 49 -0.9672 50 -0.9689 51 -0.9570 52 -0.9670 53 -0.9735 54 -0.9553 55 -0.9542 56 -0.9671 57 -0.9526 58 -0.9285 59 -0.9185 Tensor: attn_kv_a_mqa Layer LIM Score ----- --------- 0 -0.4179 1 -0.8773 2 -0.9436 3 -0.9022 4 -0.9166 5 -0.9418 6 -0.9812 7 -0.9599 8 -0.9085 9 -0.9724 10 -0.9882 11 -0.9868 12 -0.9906 13 -0.9816 14 -0.9827 15 -0.9766 16 -0.9590 17 -0.9474 18 -0.9573 19 -0.9601 20 -0.9553 21 -0.9345 22 -0.9042 23 -0.9299 24 -0.9555 25 -0.9554 26 -0.9598 27 -0.9575 28 -0.9610 29 -0.9634 30 -0.9601 31 -0.9572 32 -0.9674 33 -0.9619 34 -0.9707 35 -0.9493 36 -0.9801 37 -0.9702 38 -0.9737 39 -0.9567 40 -0.9366 41 -0.9667 42 -0.9751 43 -0.9566 44 -0.9488 45 -0.9364 46 -0.9516 47 -0.9355 48 -0.9723 49 -0.9630 50 -0.9702 51 -0.9591 52 -0.9670 53 -0.8937 54 -0.9420 55 -0.9566 56 -0.9543 57 -0.8239 58 -0.8915 59 -0.9073 Tensor: ffn_gate_inp Layer LIM Score ----- --------- 3 -0.9355 4 -0.9365 5 -0.9068 6 -0.9485 7 -0.9117 8 -0.8524 9 -0.9458 10 -0.9404 11 -0.9593 12 -0.9458 13 -0.9364 14 -0.9494 15 -0.8997 16 -0.9017 17 -0.8748 18 -0.8369 19 -0.9108 20 -0.8583 21 -0.8067 22 -0.8093 23 -0.8568 24 -0.8719 25 -0.8983 26 -0.9103 27 -0.8789 28 -0.9135 29 -0.9107 30 -0.8975 31 -0.9346 32 -0.9335 33 -0.9334 34 -0.9343 35 -0.9524 36 -0.9404 37 -0.9573 38 -0.9487 39 -0.8949 40 -0.9070 41 -0.9669 42 -0.9815 43 -0.9481 44 -0.9233 45 -0.9606 46 -0.9472 47 -0.9145 48 -0.9580 49 -0.9672 50 -0.9689 51 -0.9570 52 -0.9670 53 -0.9735 54 -0.9553 55 -0.9542 56 -0.9671 57 -0.9526 58 -0.9285 59 -0.9185 Tensor: ffn_gate_exps Layer LIM Score ----- --------- 3 -0.5456 4 -0.4082 5 -0.2537 6 -0.1726 7 -0.1470 8 -0.1202 9 -0.1336 10 -0.1300 11 
-0.1028 12 -0.0907 13 -0.0846 14 -0.1017 15 -0.1079 16 -0.1087 17 -0.1140 18 -0.1238 19 -0.1185 20 -0.1048 21 -0.1017 22 -0.1183 23 -0.1191 24 -0.1308 25 -0.1321 26 -0.1296 27 -0.1313 28 -0.1243 29 -0.1219 30 -0.1115 31 -0.1232 32 -0.1394 33 -0.1531 34 -0.1637 35 -0.1862 36 -0.1986 37 -0.1989 38 -0.1842 39 -0.1887 40 -0.1801 41 -0.1856 42 -0.1775 43 -0.1715 44 -0.1735 45 -0.1763 46 -0.1583 47 -0.1574 48 -0.1662 49 -0.1617 50 -0.1480 51 -0.1449 52 -0.1454 53 -0.1490 54 -0.1414 55 -0.1439 56 -0.1482 57 -0.1503 58 -0.1510 59 -0.1676 Tensor: ffn_down_exps Layer LIM Score ----- --------- 3 -0.0001 4 0.0004 5 -0.0014 6 0.0006 7 -0.0001 8 -0.0015 9 0.0008 10 0.0013 11 0.0021 12 -0.0015 13 0.0004 14 0.0010 15 0.0022 16 -0.0002 17 -0.0001 18 -0.0021 19 0.0021 20 -0.0013 21 0.0003 22 0.0013 23 -0.0014 24 0.0006 25 0.0001 26 -0.0002 27 -0.0016 28 0.0003 29 0.0004 30 -0.0011 31 -0.0014 32 0.0021 33 -0.0017 34 -0.0005 35 -0.0011 36 -0.0006 37 -0.0007 38 0.0010 39 -0.0037 40 0.0004 41 0.0012 42 -0.0012 43 0.0018 44 -0.0005 45 0.0028 46 0.0009 47 -0.0015 48 0.0000 49 0.0013 50 -0.0012 51 0.0011 52 0.0016 53 0.0005 54 0.0007 55 -0.0021 56 0.0001 57 0.0021 58 -0.0003 59 0.0001Raw LIM Scores for all tensors and layers of `DeepSeek-V3-0324` `q8_0` GGUF
Normalized LIM Scores for all tensors and layers of `DeepSeek-V3-0324` `q8_0` GGUF