
Llama 2 70b quantizes in a way that's superior for GQA; Mistral 7b is missing that optimization #4111

Closed
kalomaze opened this issue Nov 17, 2023 · 2 comments

Comments

@kalomaze
Contributor

kalomaze commented Nov 17, 2023

This PR mentioned a while back that, since Llama 2 70b uses GQA, there is a specific k-quantization trick that allows it to be quantized more favorably with only a marginal increase in model size:


Mistral 7b, a very popular model released after this PR was made, also uses Grouped Query Attention.
Checking whether the 7b is a Mistral model and applying the same treatment should theoretically provide similar gains, unless I am mistaken.
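
For illustration, something along these lines is what I have in mind. This is a rough sketch with made-up names (`gqa_hparams`, `should_bump_attn_v`), not the actual llama.cpp tensor-type selection code; the idea is just to key the "give attn_v.weight more bits" rule off the GQA ratio instead of a hard-coded 70b check:

```cpp
// Rough sketch only -- hypothetical helper, not the actual llama.cpp implementation.
#include <cstdint>

struct gqa_hparams {      // stand-in for the relevant hyper-parameters
    uint32_t n_head;      // number of query heads
    uint32_t n_head_kv;   // number of key/value heads
};

// With GQA, attn_v.weight is (n_head / n_head_kv) times smaller than attn_q.weight,
// so quantizing it with more bits costs almost nothing in total model size.
static bool should_bump_attn_v(const gqa_hparams & hp) {
    const uint32_t n_gqa = hp.n_head / hp.n_head_kv;
    return n_gqa >= 4;    // Llama 2 70b: 64/8 = 8; Mistral 7b: 32/8 = 4
}
```

The quantizer's existing 70b-only special case for attn_v.weight could then call a check like this instead, so any GQA model picks up the same treatment.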


I think quantization optimization is sorely overlooked in general; there's a lot of low-hanging fruit there for sure.

@kalomaze kalomaze changed the title Llama 2 70b quantizes in a way that's optimal for GQA; Mistral 7b is missing that optimization → Llama 2 70b quantizes in a way that's superior for GQA; Mistral 7b is missing that optimization Nov 17, 2023
@ggerganov
Owner

The quantum mixtures currently available in llama.cpp have been mostly optimized towards the OG LLaMA models, and also for Falcon to some extent. There is no guarantee that these mixtures are optimal for any other model or finetune.

I still think that the correct way to generate per-model quantum mixtures is via #2783, but I haven't come around to implementing it yet.

@github-actions github-actions bot added the stale label Mar 19, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024