LlaMA-4 support (text only) #321


Merged 2 commits into main on Apr 10, 2025

Conversation

@ikawrakow (Owner) commented on Apr 9, 2025

It seems the initial reactions to LlaMA-4 are mostly negative. Nevertheless, quantized LlaMA-Scout is something I can run on one of my systems, so here it is.

Derived from PR 12791 in mainline, but the code bases have diverged so much by now that it took some effort to port the PR.

As with Gemma-3, I did not add the necessary modifications to convert_hf_to_gguf.py, so mainline is required to generate the model GGUF.
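
For reference, a minimal sketch of that conversion step with mainline llama.cpp, assuming the usual convert_hf_to_gguf.py options (the model directory and output path are placeholders):

# run from a mainline llama.cpp checkout; paths are placeholders
python convert_hf_to_gguf.py /path/to/Llama-4-Scout-HF --outfile Llama4-Scout-16x17B-BF16.gguf --outtype bf16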

I did a quick test with a Q6_K model (no imatrix yet, so I wanted to use more bits to avoid worrying about quantization effects) on a Ryzen-5975WX CPU with an RTX-4080 GPU, using

-ot exps=CPU -rtr -fmoe -t 32 -ngl 100

I got 221 t/s in the perplexity run, and 10.5 t/s for 128 tokens asking the standard question about the meaning of life. This is not bad at all.
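
For completeness, a full invocation along those lines might look roughly like this (the model path and prompt are placeholder assumptions; the flags are the ones listed above):

# hypothetical example; point -m at your Q6_K GGUF
./bin/llama-cli -m llama4-scout-q6_k.gguf -p "What is the meaning of life?" -n 128 -ot exps=CPU -rtr -fmoe -t 32 -ngl 100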

As mentioned in PR 12791, the model fails the ultimate AGI test:

> How many r's are there in strawberry?
There are 2 R's in the word "strawberry".

Closes #314

@ikawrakow changed the title from "LlaMA-4 support" to "LlaMA-4 support (text only)" on Apr 9, 2025
@ikawrakow (Owner Author)

So, using a single active expert as prescribed by the model parameters, I get

PPL(Q8_0, n_ctx = 512) = 9.0644

Activating 2 experts using --override-kv "llama4.expert_used_count=int:2" I get

PPL(Q8_0, n_ctx = 512) = 8.7030

It is of course slower (133 t/s vs 211 t/s with the setup described above), but it is kind of strange that 2 experts produce a lower PPL. This wasn't the case for Mixtral8x7B where 3 experts were worse than 2 (unless one was using a very low bpw quantization).
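
For reference, a perplexity run with the override might look roughly like this (model and test-file names are placeholders; the other flags are the setup described above):

# hypothetical example; wiki.test.raw is the Wikitext2 test set
./bin/llama-perplexity -m llama4-scout-q8_0.gguf -f wiki.test.raw --override-kv "llama4.expert_used_count=int:2" -ot exps=CPU -rtr -fmoe -t 32 -ngl 100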

@saood06 (Collaborator) commented on Apr 10, 2025

> So, using a single active expert as prescribed by the model parameters, I get
>
> PPL(Q8_0, n_ctx = 512) = 9.0644
>
> Activating 2 experts using --override-kv "llama4.expert_used_count=int:2" I get
>
> PPL(Q8_0, n_ctx = 512) = 8.7030
>
> It is of course slower (133 t/s vs 211 t/s with the setup described above), but it is kind of strange that 2 experts produce a lower PPL. This wasn't the case for Mixtral8x7B where 3 experts were worse than 2 (unless one was using a very low bpw quantization).

Have you tried even higher numbers? Does it peak at 2 experts?

@ikawrakow (Owner Author)

Here are some quantization experiments with LlaMA-4-Scout:

  • UD-Q2_K_XL.gguf - downloaded from Huggingface: PPL(n_ctx = 512) = 9.6535
  • Same quantization mix as UD-Q2_K_XL.gguf, but quantized with ik_llama.cpp¹: PPL(n_ctx = 512) = 9.5668
  • Replace q4_K with iq4_K for ffn_down_exps tensors: PPL(n_ctx = 512) = 9.4895
  • Strangely enough, replacing q4_K with iq4_K in the attention tensors leads to higher PPL

¹ Unsloth's Q2_K_XL mix is obtained without any code changes using

./bin/llama-quantize --imatrix $imatrix --custom-q "ffn_gate_shexp=q4_K,ffn_up_shexp=q4_K,ffn_down_shexp=q6_K,attn=q4_K,token_embd.weight=q4_K,output.weight=q6_K,blk\.[0-5]\.ffn_down_exps=q4_K,ffn_down_exps=q3_K,ffn_up_exps=q2_K,ffn_gate_exps=q2_K" $model $output_file q2_K
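
The PPL(n_ctx = 512) numbers above can then be measured with something along these lines (a sketch; file names are placeholders):

# hypothetical example; -c 512 matches the context length used for these PPL numbers
./bin/llama-perplexity -m $output_file -f wiki.test.raw -c 512 -ot exps=CPU -rtr -fmoe -t 32 -ngl 100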

@ikawrakow (Owner Author)

> Have you tried even higher numbers? Does it peak at 2 experts?

Not yet. I'm doing some quantization experiments, and things take some time on the hardware I have available. For 3 experts with Q8_0 the PPL calculation will take more than an hour.

@saood06 (Collaborator) commented on Apr 10, 2025

> Strangely enough, replacing q4_K with iq4_K in the attention tensors leads to higher PPL

Do you think this could affect other architectures?

@ikawrakow (Owner Author)

> Do you think this could affect other architectures?

I have noticed in the past that iq4_k/iq5_k/iq6_k for the attention tensors do not have a clear advantage compared to q4_K/q5_K/q6_K. They are much better for the FFN portion, and that's where the quality gains come from. But this is the first time they have actually been worse. So, in your case, if you are looking to optimize performance (and have the time/energy to experiment), you can try replacing iq4_k with q4_K in the attention tensors, as this will improve inference speed.
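
As a rough sketch of that swap with --custom-q (the iq4_k base type here is just an illustrative assumption; only the attn override matters):

# hypothetical example: keep the base mix, but force attention tensors to q4_K
./bin/llama-quantize --imatrix $imatrix --custom-q "attn=q4_K" $model $output_file iq4_k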

@ikawrakow (Owner Author)

Oh, for token embeddings I had a few cases where it was better to use the corresponding k-quant instead of the iqk quant.

@saood06 (Collaborator) commented on Apr 10, 2025

> I have noticed in the past that iq4_k/iq5_k/iq6_k for the attention tensors do not have a clear advantage compared to q4_K/q5_K/q6_K. They are much better for the FFN portion, and that's where the quality gains come from. But this is the first time they have actually been worse. So, in your case, if you are looking to optimize performance (and have the time/energy to experiment), you can try replacing iq4_k with q4_K in the attention tensors, as this will improve inference speed.
>
> Oh, for token embeddings I had a few cases where it was better to use the corresponding k-quant instead of the iqk quant.

Interesting to hear. I will take all this into account next time I make quants.

@ikawrakow (Owner Author) commented on Apr 10, 2025

> Have you tried even higher numbers? Does it peak at 2 experts?

Just tried. I did not run Wikitext2 to completion, but after 172 chunks the PPL with 3 experts is 0.1 higher than with 2 experts, so it is very unlikely it will be better at the end. Still better than a single expert, but 2 experts seem to be the sweet spot (at the expense of a performance hit).

@ikawrakow (Owner Author)

This seems solid enough, merging it.

@ikawrakow merged commit 474435f into main on Apr 10, 2025
@saood06 (Collaborator) commented on Apr 10, 2025

> Just tried. I did not run Wikitext2 to completion, but after 172 chunks the PPL with 3 experts is 0.1 higher than with 2 experts, so it is very unlikely it will be better at the end. Still better than a single expert, but 2 experts seem to be the sweet spot (at the expense of a performance hit).

If I ever try Maverick, I will see if it is replicable there.

@ikawrakow (Owner Author) commented on Apr 10, 2025

So, L4-Scout seems to quantize pretty well.

4-bit (IQ4_KS)

  • PPL = 9.0554 (better than Q8_0, so no need to go beyond that)
  • Quantized model size: 54.003 GiB
  • Recipe
./bin/llama-quantize --imatrix l4_scout_imat_512.out --custom-q "ffn_gate_shexp=iq4_ks,ffn_up_shexp=iq4_ks,ffn_down_shexp=iq5_k,attn=iq4_ks,token_embd.weight=q4_K,output.weight=q6_K,ffn_.*_exps=iq4_ks" ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf junk1.bin iq4_ks

So basically everything is IQ4_KS, except ffn_down_shexp (IQ5_K), token_embd (Q4_K), and output.weight (Q6_K); this gives a Wikitext2 PPL of 9.0554, better than Q8_0.

Beating Unsloth's UD-Q2_K_XL

  • PPL = 9.4736 vs theirs PPL = 9.6535
  • Model size: 39.090 GiB vs Unsloth's 39.654 GiB
  • Recipe
./bin/llama-quantize --imatrix l4_scout_imat_512.out --custom-q "ffn_gate_shexp=iq4_ks,ffn_up_shexp=iq4_ks,ffn_down_shexp=iq5_k,attn=iq4_ks,token_embd.weight=q4_K,output.weight=q6_K,blk\.[0-5]\.ffn_down_exps=iq4_ks,ffn_down_exps=q3_K,ffn_up_exps=q2_K,ffn_gate_exps=q2_K" ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf junk1.bin q2_K

Beating Unsloth's UD-IQ2_XXS

  • PPL = 10.1506 vs theirs PPL = 10.3454
  • Model size: 34.871 GiB vs theirs 35.904 GiB
  • Recipe:
./bin/llama-quantize --imatrix l4_scout_imat_512.out --custom-q "ffn_gate_shexp=iq4_ks,ffn_up_shexp=iq4_ks,ffn_down_shexp=iq5_k,attn=iq4_ks,token_embd.weight=q4_K,output.weight=q6_K,blk\.[0-5]\.ffn_down_exps=iq4_ks,ffn_down_exps=q3_K,ffn_up_exps=iq2_xxs,ffn_gate_exps=iq2_xxs" ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf junk1.bin iq2_xxs

Beating Unsloth's UD-IQ1_S

  • PPL = 10.9640 vs theirs PPL = 11.0173
  • Model size: 31.121 GiB vs theirs 31.510 GiB
  • Recipe:
./bin/llama-quantize --imatrix l4_scout_imat_512.out --custom-q "ffn_gate_shexp=iq4_ks,ffn_up_shexp=iq4_ks,ffn_down_shexp=iq5_k,attn=iq4_ks,token_embd.weight=q4_K,output.weight=q6_K,blk\.[0-5]\.ffn_down_exps=iq4_ks,ffn_down_exps=iq3_k,ffn_up_exps=iq1_s,ffn_gate_exps=iq1_s" ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf junk1.bin iq1_s

@ikawrakow (Owner Author)

Here is another recipe for iq3_xxs:

./bin/llama-quantize --imatrix l4_scout_imat_512.out --custom-q "ffn_gate_shexp=iq4_ks,ffn_up_shexp=iq4_ks,ffn_down_shexp=iq5_k,attn=iq4_ks,token_embd.weight=q4_K,output.weight=q6_K,ffn_down_exps=iq4_ks,ffn_.*_exps=iq3_xxs" ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf junk1.bin iq3_xxs

The model ends up being 45.05 GiB (48.38 GB), so it qualifies for the "under 50 GB" shoot-out. The final Wiki2 PPL is 9.2462 (so just 2% higher than Q8_0). PPL after 300 chunks (as used in the shoot-out) is 8.8937. If I then go through the trouble of running llama-perplexity with the --kl-divergence option (the two-step workflow is sketched after the statistics below), I get this:

====== Perplexity statistics ======
Mean PPL(Q)                   :   8.894160 ±   0.099641
Cor(ln(PPL(Q)), ln(PPL(base))):  97.61%
Mean ln(PPL(Q)/PPL(base))     :   0.030502 ±   0.002438

====== KL divergence statistics ======
Mean    KLD:   0.106186 ±   0.001075
99.0%   KLD:   1.098310
Median  KLD:   0.033228

====== Token probability statistics ======
Mean    Δp: -0.695 ± 0.033 %
90.0%   Δp:  5.221%
Median  Δp: -0.002%

RMS Δp    :  9.177 ± 0.076 %
Same top p: 87.280 ± 0.120 %

So, a different league than the shoot-out models.
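
For reference, the --kl-divergence statistics above come from a two-step workflow; a rough sketch (the logits file name is a placeholder):

# step 1 (assumption): run the unquantized model once and save its logits
./bin/llama-perplexity -m ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf -f wiki.test.raw --kl-divergence-base l4_scout_logits.kld
# step 2: evaluate the quantized model against the saved logits
./bin/llama-perplexity -m junk1.bin -f wiki.test.raw --kl-divergence-base l4_scout_logits.kld --kl-divergence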
