LlaMA-4 support (text only) #321
Conversation
So, using a single active expert as prescribed by the model parameters, I get
Activating 2 experts using
It is of course slower (133 t/s vs 211 t/s with the setup described above), but it is kind of strange that 2 experts produce a lower PPL. This wasn't the case for Mixtral-8x7B, where 3 experts were worse than 2 (unless one was using a very low bpw quantization).
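For reference, a minimal sketch of how the active-expert count can be changed at run time, assuming this fork keeps mainline's `--override-kv` option and that the metadata key is `<arch>.expert_used_count`; the `llama4.` prefix, binary name, and file names below are assumptions, not the exact command used above:

```bash
# Baseline run: the number of active experts comes from the model metadata (1 for Scout)
./llama-perplexity -m llama4-scout-q6_k.gguf -f wiki.test.raw -ngl 100

# Same run, but forcing 2 active experts via a metadata override
# (the "llama4." key prefix is an assumption; it depends on how the arch is registered)
./llama-perplexity -m llama4-scout-q6_k.gguf -f wiki.test.raw -ngl 100 \
  --override-kv llama4.expert_used_count=int:2
```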
Have you tried even higher numbers? Does it peak at 2 experts?
Here are some quantization experiments with LlaMA-4-Scout:
1. Unsloth's
Not yet. I'm doing some quantization experiments, and things take some time on the hardware I have available. For 3 experts with
Do you think this could affect other architectures?
I have noticed in the past that
Oh, for token embeddings I had a few cases where it was better to use the corresponding k-quant instead of the
Interesting to hear. I will take all this into account next time I make quants.
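To illustrate the token-embedding point, here is a hedged sketch of forcing a k-quant for the embedding tensor during quantization, assuming the quantize tool here still accepts mainline's `--token-embedding-type` option; the type choice, imatrix file, and file names are placeholders, not the recipe used above:

```bash
# Quantize to IQ4_KS, but keep token_embd.weight as Q4_K
# instead of whatever type the IQ4_KS mix would pick by default
./llama-quantize --imatrix llama4-scout.imatrix \
  --token-embedding-type q4_K \
  llama4-scout-bf16.gguf llama4-scout-iq4_ks.gguf IQ4_KS
```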
Just tried. Did not run
This seems solid enough, merging it.
If I ever try Maverick, I will see if it is replicable there.
So, L4-Scout seems to quantize pretty well. 4-bit (IQ4_KS)
(so basically everything with
Beating Unsloth's UD-Q2_K_XL
Beating Unsloth's UD-IQ2_XXS
Beating Unsloth's UD-IQ1_S
Here is another recipe for
The model ends up being 45.05 GiB (48.38 GB), so it qualifies for this "under 50 GB" shoot-out. Final Wiki2 PPL is
So, a different league than the shoot-out models.
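For completeness, the Wiki2 PPL numbers quoted in this thread are the usual wikitext-2-raw perplexity; a rough sketch of how such a number is measured, where the context length, file names, and GPU offload setting are assumptions rather than the exact settings used for the comparisons above:

```bash
# Standard perplexity evaluation over the wikitext-2-raw test split
./llama-perplexity -m llama4-scout-custom-mix.gguf \
  -f wikitext-2-raw/wiki.test.raw -c 512 -ngl 100
```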
It seems the initial reactions to LlaMA-4 are mostly negative. Nevertheless, quantized LlaMA-Scout is something I can run on one of my systems, so here it is.
Derived from PR 12791 in mainline. But the code bases have diverged so much by now that it did take some effort to port the PR.
As with Gemma-3, I did not add the necessary modifications to convert_hf_to_gguf.py, so mainline is required to generate the model GGUF.
Did a quick test with a Q6_K model (no imatrix yet, so I wanted to use more bits to not worry about quantization effects) on a Ryzen-5975WX CPU with an RTX-4080 GPU. I got 221 t/s in the perplexity run, and 10.5 t/s for 128 tokens asking the standard question about the meaning of life. This is not bad at all.
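Since the conversion changes were not ported, the GGUF has to be produced with mainline first; a rough sketch of that step, assuming a mainline llama.cpp checkout and a local copy of the Scout weights, with paths, output type, and the follow-up Q6_K step being illustrative placeholders:

```bash
# In a mainline llama.cpp checkout: convert the HF weights to a bf16 GGUF
python convert_hf_to_gguf.py /path/to/Llama-4-Scout-17B-16E-Instruct \
  --outfile llama4-scout-bf16.gguf --outtype bf16

# Back in this fork: quantize to Q6_K for a quick test like the one above
./llama-quantize llama4-scout-bf16.gguf llama4-scout-q6_k.gguf Q6_K
```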
As mentioned in PR 12791, the model fails the ultimate AGI test:
Closes #314