
Feature Request: Add importance matrix / available-RAM calculations to ISQ #377

Open · psyv282j9d opened this issue Jun 4, 2024 · 3 comments
Labels: models (Additions to model or architectures), new feature (New feature or request)

@psyv282j9d commented Jun 4, 2024
Looking over ISQ (based on your previous ask), I found a few things missing that I've learned are helpful via trial and error.

imatrix: if you look at the discussion here, you can see that calculating the importance matrix prior to quantization can offset some of quantization's negative effects. In particular, this comment gives a great walkthrough of which tools to use to calculate the imatrix and then how to use it when quantizing.
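
Very roughly, the imatrix step boils down to something like this (a hypothetical Rust sketch; none of these names come from mistral.rs or llama.cpp): accumulate the mean squared activation <a_i^2> per input channel over a calibration set, then hand those averages to the quantizer as importance weights.

```rust
/// Hypothetical sketch, not an existing API: accumulate per-channel mean
/// squared activations <a_i^2> over a calibration run. These become the
/// importance weights used when quantizing the tensor they feed into.
#[derive(Default)]
struct ImportanceAccumulator {
    sum_sq: Vec<f32>, // running sum of a_i^2 per input channel
    count: usize,     // number of activation rows observed
}

impl ImportanceAccumulator {
    /// Record one row of activations flowing into the target matmul.
    fn observe(&mut self, activations: &[f32]) {
        if self.sum_sq.is_empty() {
            self.sum_sq = vec![0.0; activations.len()];
        }
        for (s, &a) in self.sum_sq.iter_mut().zip(activations) {
            *s += a * a;
        }
        self.count += 1;
    }

    /// <a_i^2>: average squared activation per channel.
    fn importances(&self) -> Vec<f32> {
        self.sum_sq.iter().map(|s| s / self.count as f32).collect()
    }
}
```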

Also, one of the key benefits of Ollama (a Go wrapper around llama.cpp) lives in llm/memory.go. The function EstimateGPULayers calculates, based on available VRAM (or system RAM for Metal), how many layers can be offloaded to the GPU. This number is then passed to llama.cpp's --n_gpu_layers option.
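
The calculation itself is roughly along these lines (a made-up sketch, not Ollama's actual code; all names below are illustrative): keep adding layers, each costing its weights plus its share of the KV cache, until the memory budget runs out. The resulting count is what would get passed down as the GPU-layer setting.

```rust
/// Hypothetical sketch in the spirit of Ollama's EstimateGPULayers;
/// the struct and parameter names are invented for illustration.
struct LayerEstimate {
    n_gpu_layers: usize, // how many layers fit on the device
    vram_used: u64,      // bytes those layers would consume
}

fn estimate_gpu_layers(
    available_vram: u64,     // bytes free on the device
    layer_size: u64,         // bytes for one quantized layer's weights
    kv_cache_per_layer: u64, // KV-cache bytes per layer at the chosen context length
    overhead: u64,           // scratch/graph/output bytes reserved up front
    total_layers: usize,
) -> LayerEstimate {
    let mut used = overhead;
    let mut n = 0;
    while n < total_layers {
        let next = layer_size + kv_cache_per_layer;
        if used + next > available_vram {
            break; // this layer no longer fits; stop offloading here
        }
        used += next;
        n += 1;
    }
    LayerEstimate { n_gpu_layers: n, vram_used: used }
}
```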

What are the chances of incorporating these ideas into ISQ? It would be great to go from safetensors / bf16 on disk to automagically optimal memory loading for inference. :-)

@psyv282j9d (Author) commented Jun 4, 2024

Oops. I should add that the imatrix calculation requires a calibration data file. The outcome of the previously referenced discussion seems to have settled on the file linked in this comment.

Direct link to groups_merged-enhancedV3.txt

@EricLBuehler added the new feature (New feature or request) and models (Additions to model or architectures) labels on Jun 5, 2024
@EricLBuehler (Owner) commented:

Hello @psyv282j9d!

This sounds like a great feature which I would love to add. I have begun work on tracking memory usage in #392. I will look into applying the imatrix quants; from what I understand, it is a different quantization standard?

@psyv282j9d (Author) commented:

My understanding (IANAMLE) is that it calculates a matrix of weights, which adjusts the quantized values of "important" tensors. This leads to a quantized model that more closely mimics the original.

From the PR:

We can then use these <a_i^2> as importances (weights) in a weighted RMSE minimization when quantizing the tensor.

So, not a different quantization method, just a really helpful tweak on top of the existing GGUF quant methods.
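
In code terms, the tweak is roughly this (a hypothetical sketch, not llama.cpp's actual quantizer; best_scale and its parameters are invented for illustration): when searching for a quantization scale, weight each channel's squared error by its <a_i^2> importance instead of treating every channel equally. Unweighted quantization is just the special case where every importance is 1.0.

```rust
/// Hypothetical sketch: pick a quantization scale for a row of weights by
/// minimizing an importance-weighted squared error, where `importances`
/// holds the per-channel <a_i^2> values from the imatrix.
fn best_scale(values: &[f32], importances: &[f32], n_levels: i32) -> f32 {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 {
        return 0.0; // nothing to quantize
    }
    let naive = max_abs / n_levels as f32;
    let mut best = (f32::MAX, naive);
    // Try a handful of candidate scales around the naive max-abs scale.
    for step in 1..=16 {
        let scale = naive * (step as f32 / 8.0);
        let err: f32 = values
            .iter()
            .zip(importances)
            .map(|(&x, &w)| {
                let q = (x / scale).round().clamp(-(n_levels as f32), n_levels as f32);
                w * (q * scale - x).powi(2) // importance-weighted squared error
            })
            .sum();
        if err < best.0 {
            best = (err, scale);
        }
    }
    best.1
}
```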

(All errors my own :-))
