llama.cpp: add IQ3_XXS quantization models #8

ymcui · 2024-01-31T03:31:32Z

Description

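This PR introduces a new GGUF quantization type, IQ3_XXS, which was recently added to llama.cpp (ref. ggml-org/llama.cpp#5196). If you are using 3-bit quantization, consider trying IQ3_XXS: in the results below it has clearly lower perplexity than Q2_K at a slightly larger size, and perplexity close to or better than Q3_K at a smaller size.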

The following GGUF repositories have been updated with IQ3_XXS models:

Chinese-Mixtral-GGUF: https://huggingface.co/hfl/chinese-mixtral-gguf
Chinese-Mixtral-Instruct-GGUF: https://huggingface.co/hfl/chinese-mixtral-instruct-gguf
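
For readers who want to fetch the new file programmatically, here is a minimal sketch. It assumes the `huggingface_hub` package is installed and that the IQ3_XXS file has `iq3_xxs` somewhere in its name. The PR does not state the actual file names, so check the repository's file list. The helper names `pick_quant_file` and `download_quant` are hypothetical.

```python
def pick_quant_file(filenames, quant="iq3_xxs"):
    """Return the first .gguf filename that mentions the quant type, or None.

    Matching is case-insensitive. The naming pattern is an assumption,
    not something stated in this PR.
    """
    quant = quant.lower()
    for name in sorted(filenames):
        lower = name.lower()
        if lower.endswith(".gguf") and quant in lower:
            return name
    return None


def download_quant(repo_id="hfl/chinese-mixtral-gguf", quant="iq3_xxs"):
    """Download the matching GGUF file and return its local path.

    Requires network access and the huggingface_hub package, which is
    imported lazily so the helper above works without it.
    """
    from huggingface_hub import hf_hub_download, list_repo_files

    name = pick_quant_file(list_repo_files(repo_id), quant)
    if name is None:
        raise FileNotFoundError(f"no {quant} .gguf file found in {repo_id}")
    return hf_hub_download(repo_id=repo_id, filename=name)
```

Calling `download_quant()` would return a local path that can then be passed to llama.cpp; for the instruct model, pass `repo_id="hfl/chinese-mixtral-instruct-gguf"`.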

Performance

Quant	Q2_K	⭐️IQ3_XXS	Q3_K	Q4_K
Model Size	16.12 GB	17.05 GB	18.96 GB	24.62 GB
BPW	2.96	3.14	3.86	4.87
Speed (PP)	10.27	26.78	12.17	10.02
Speed (TG)	20.29	20.58	21.74	21.67
PPL@Chinese-Mixtral	5.1846 +/- 0.05533	4.5990 +/- 0.04969	4.5545 +/- 0.04893	4.4488 +/- 0.04813
PPL@Chinese-Mixtral-Instruct	4.5758 +/- 0.03959	4.0389 +/- 0.03489	4.5563 +/- 0.04126	3.9265 +/- 0.03407

Note: Speed is reported in ms/token (lower is faster), measured on an A100-40G. PP: prompt processing; TG: text generation.
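
To make the trade-offs in the table easier to compare, the sketch below computes relative differences against a baseline quant and converts ms/token into tokens per second. The numbers are copied from the table above (perplexity from the Chinese-Mixtral row); the dictionary layout and the helper names `relative_change` and `tokens_per_second` are illustrative assumptions, not part of llama.cpp.

```python
# Values copied from the Performance table (PPL taken from the Chinese-Mixtral row).
TABLE = {
    "Q2_K":    {"size_gb": 16.12, "bpw": 2.96, "pp_ms": 10.27, "tg_ms": 20.29, "ppl": 5.1846},
    "IQ3_XXS": {"size_gb": 17.05, "bpw": 3.14, "pp_ms": 26.78, "tg_ms": 20.58, "ppl": 4.5990},
    "Q3_K":    {"size_gb": 18.96, "bpw": 3.86, "pp_ms": 12.17, "tg_ms": 21.74, "ppl": 4.5545},
    "Q4_K":    {"size_gb": 24.62, "bpw": 4.87, "pp_ms": 10.02, "tg_ms": 21.67, "ppl": 4.4488},
}


def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


def tokens_per_second(ms_per_token):
    """Convert a latency in ms/token into a throughput in tokens/s."""
    return 1000.0 / ms_per_token


baseline = TABLE["Q3_K"]
for quant, row in TABLE.items():
    size_delta = relative_change(row["size_gb"], baseline["size_gb"])
    ppl_delta = relative_change(row["ppl"], baseline["ppl"])
    tg_rate = tokens_per_second(row["tg_ms"])
    print(f"{quant:8s} size {size_delta:+6.1f}%  ppl {ppl_delta:+6.1f}%  TG {tg_rate:5.1f} tok/s")
```

With Q3_K as the baseline, IQ3_XXS comes out roughly 10% smaller with about 1% higher perplexity on Chinese-Mixtral. Note that its prompt-processing time per token is the highest of the four in this measurement.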

Related Issue

None.

ymcui added 2 commits January 31, 2024 11:30

doc: add iq3_xxs perf.

795e74b

doc: add iq3_xxs perf.

7b42536

ymcui requested a review from iMountTai January 31, 2024 03:35

iMountTai approved these changes Jan 31, 2024

ymcui merged commit 448665a into main Jan 31, 2024
1 check passed
