Quantization support #163
Comments
8-bit weight-only quantization only supports Llama models for now.
Any examples?
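For illustration only (this is not this project's own API): a minimal sketch of 8-bit weight-only loading of a Llama checkpoint via Hugging Face transformers with bitsandbytes, assuming both packages are installed and the checkpoint is accessible.

```python
# Illustrative sketch only: 8-bit weight-only loading via Hugging Face
# transformers + bitsandbytes, NOT this project's API. The model id is a
# placeholder example.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical example checkpoint

# Store the linear-layer weights in int8; activations stay in fp16.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```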
As for the model file format, we have not tested GGML/GGUF so far. What is the motivation for using these formats?
Will GPTQ be supported?
@XHPlus There are a lot of open-source models on Hugging Face driven by https://huggingface.co/TheBloke. Many people in the open-source community run those quantized models on TGI / vLLM.
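For context, a hedged sketch of how such a checkpoint is typically served with vLLM (assuming a vLLM version with GPTQ support; the model id is just one of TheBloke's GPTQ repos used as an example):

```python
# Sketch only: running a GPTQ-quantized Hugging Face checkpoint with vLLM.
# Assumes a vLLM build that supports quantization="gptq"; the model id is an
# example repo, not a recommendation of a specific checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-GPTQ",  # example GPTQ checkpoint
    quantization="gptq",
)

outputs = llm.generate(
    ["What is weight-only quantization?"],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```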
Using this option with Llama2-13B gives this error:
I tried both. Any suggestions on how to fix this?
@XHPlus For some models, quantization is the only way to run them on smaller GPUs, e.g. Mixtral. With vLLM, I can run Mixtral quantized with 48 GB of VRAM. The unquantized model would use up to 100 GB of VRAM, I guess.
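A rough back-of-envelope check of those numbers, assuming ~46.7B parameters for Mixtral-8x7B and counting weights only (KV cache and activations add more on top):

```python
# Back-of-envelope weight-memory estimate for Mixtral-8x7B (~46.7B params).
# Weights only; KV cache and activation memory come on top of this.
params = 46.7e9

for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.0f} GiB of weights")

# fp16: ~87 GiB, int8: ~43 GiB, int4: ~22 GiB -- consistent with the
# observation that the 4-bit model fits in 48 GB while fp16 does not.
```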
How do I use 8-bit quantized models? Can I run GGML/GGUF models?