Add importance matrix support for legacy quants? #4932

Closed
ikawrakow opened this issue Jan 14, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@ikawrakow
Contributor

I have the implementation ready, but I'm not sure if this is what we want. Using an importance matrix does improve perplexity for all models I have tried. On the other hand, the "legacy" ggml quants Q4_0 and Q5_0 are never very good, but they are also never really bad (Q4_1 and Q5_1 behave more erratically, being better than Q4_0/Q5_0 for some models and worse for others). Hence, one may want to preserve them the way they are, as a kind of reference.

Opinions?
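
For readers unfamiliar with the technique under discussion, here is a minimal sketch of what importance-weighted quantization of a Q4_0-style block (32 values, one scale) could look like. This illustrates the general idea only, not llama.cpp's actual implementation; the function name, the initial scale guess, and the three refit iterations are all assumptions made for the sketch.

```c
#include <math.h>
#include <stdint.h>

// Sketch only: choose a scale d for one 32-value block that minimizes the
// importance-weighted squared error  sum_i w[i]*(x[i] - d*q[i])^2  with
// q[i] restricted to [-8, 7] (Q4_0-style). w[i] is the per-weight importance
// accumulated from calibration activations (the "imatrix").
static float quantize_block_weighted(const float *x, const float *w,
                                     int n, int8_t *q) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    if (amax == 0.0f) {
        for (int i = 0; i < n; ++i) q[i] = 0;
        return 0.0f;
    }
    float d = amax / 7.0f;                  // crude initial guess
    for (int iter = 0; iter < 3; ++iter) {  // alternate rounding and refitting
        float num = 0.0f, den = 0.0f;
        for (int i = 0; i < n; ++i) {
            int qi = (int)lroundf(x[i] / d);
            if (qi < -8) qi = -8;
            if (qi >  7) qi =  7;
            q[i] = (int8_t)qi;
            num += w[i] * x[i] * (float)qi;      // weighted least-squares refit:
            den += w[i] * (float)qi * (float)qi; // d = sum(w*x*q) / sum(w*q*q)
        }
        if (den > 0.0f) d = num / den;
    }
    return d;
}
```

Without an importance matrix, w[i] is effectively constant and this reduces to ordinary round-to-nearest with a refitted scale; the imatrix simply biases the fit toward the weights that matter most for the model's activations.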

ikawrakow added the enhancement label on Jan 14, 2024
@abc-nix

abc-nix commented Jan 14, 2024

Hi, ikawrakow.
First, thank you very much for the new improved quants. They have yet to be widely tested, but the initial PPL results you have shared are very good, and I really want to test the q2/q3 k-quants with very dense models like goliath-120B and the newer community MoEs when I have time.

For your question, my opinion (as only a simple user) is that the "legacy" quants should remain the same, at least for now. The new imatrix technique you have invented seems to work very well with English initially (when using an English text, as you have showcased), but I believe a longer period of testing by the community will tell us how well it works across different uses and languages. Until then, I believe keeping the "legacy" quants as a fallback is the most desirable option.

A lot of people don't use the quantization tool themselves and rely on the quants released by users like TheBloke. If an issue has been overlooked and the new quants (created using an English text for the imatrix generation) perform worse in some areas than the previous versions, having the legacy quants as an alternative would, in my opinion, be the best option.

@ikawrakow
Contributor Author

@abc-nix Does your opinion remain the same after learning that one will still be able to quantize both k-quants and legacy quants without using an importance matrix? So that, in case there are issues, one can always fall back to the existing quantization?

@abc-nix

abc-nix commented Jan 14, 2024

@ikawrakow My opinion amounts to less than a grain of sand, so don't consider it a general opinion but my own.

I did take into account that the current k-quants can optionally use the new and improved imatrix method (forced only for q2-k quants, I think), which is a great benefit. With this new method we will find in the wild as many quants as there are datasets used to compute the imatrix, and this will also bring many quants that may compete to be the best of their size. It may also improve performance in certain areas depending on the dataset used, making the new k-quants much better than generalized quants for those uses. But a normal user, from the outside, will not be able to distinguish one from another just by looking at the final file.

I think there should still be a reproducible format, the legacy format, that can be expected to perform the same whether I create it myself or download it from a Hugging Face repo. Keeping the "legacy" quants as they are (even if using the imatrix method is optional and could improve the user experience) should also make it easier for people to help resolve issues some users may experience (for example, someone complaining that something is wrong with llama.cpp, but after testing a legacy quant realizing the issue is with the specific dataset used for their k-quant, or with imatrix itself, rather than with the program in general). Sometimes more options can also lead to more chaos. Having a reference quant that is the same for all users (without the risk of "mistakenly" using a bad dataset) would make it easier to troubleshoot.

As I said, this is only my opinion. Discard it as you would a grain of sand.

@JohannesGaessler
Collaborator

I only really care about the legacy quants for development; it is much easier to prototype features for q4_0 or q8_0 than any of the k-quants due to the much simpler data structure. I don't particularly care whether the legacy quants have slightly better/worse perplexity because I usually only need to check whether it changes.
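
For reference, the legacy block layouts referred to here are roughly the following (paraphrased from ggml's headers; the field names match ggml, but take this as a sketch rather than the authoritative definition):

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // fp16 stored as raw bits in ggml

// Q4_0: one fp16 scale plus 32 values packed as 4-bit nibbles.
#define QK4_0 32
typedef struct {
    ggml_fp16_t d;              // scale (delta)
    uint8_t     qs[QK4_0 / 2];  // 32 quants, two per byte
} block_q4_0;

// Q8_0: one fp16 scale plus 32 signed 8-bit quants.
#define QK8_0 32
typedef struct {
    ggml_fp16_t d;              // scale (delta)
    int8_t      qs[QK8_0];      // quants
} block_q8_0;
```

Compare this with the k-quant formats, which pack 256 values per super-block together with multiple packed sub-block scales and minimums; hence the appeal of the legacy formats for prototyping.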

@sorasoras

> @abc-nix Does your opinion remain the same after learning that one will still be able to quantize both k-quants and legacy quants without using an importance matrix? So that, in case there are issues, one can always fall back to the existing quantization?

I think supporting legacy quants is needed.
Not all tensors in Qwen-14B support k-quants:
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 20 tensors
llama_model_loader: - type q8_0: 20 tensors
llama_model_loader: - type q4_K: 121 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 1 tensors
Some tensors have to fall back to Q5_0/Q8_0, so importance matrix support for legacy quants would indeed improve overall perplexity.
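
The reason for such fallbacks is that k-quants operate on super-blocks of 256 values, so tensors whose row size is not a multiple of 256 cannot use them. A hedged sketch of the selection logic follows (`pick_fallback` is a hypothetical name, not llama.cpp's actual function):

```c
#include <stdint.h>

#define QK_K 256  // k-quant super-block size in ggml

enum quant_kind { USE_K_QUANT, USE_LEGACY };

// Hypothetical helper: a row whose length is not a multiple of QK_K cannot
// be stored in a k-quant format and must fall back to a 32-value-block
// legacy format such as Q5_0 or Q8_0, as in the loader output above.
static enum quant_kind pick_fallback(int64_t row_size) {
    return (row_size % QK_K == 0) ? USE_K_QUANT : USE_LEGACY;
}
```

This is why imatrix support for the legacy formats matters even to users who otherwise only produce k-quants.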

@ggerganov
Owner

Optional importance matrix support for legacy quants similar to the one in #4930 would be useful.

@ikawrakow
Contributor Author

Closed via #4969
