Runtime Overhead on Quantization Group Size #16068
kimjoohyungsd asked this question in Q&A · Unanswered · 0 replies
Hi, I'm curious about quantization in llama.cpp. In the quantized data types such as q4_0, the project limits the quantization block size to at most 256 elements. In addition, I've found that Int8 SIMD instructions are used to accelerate inference.
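For reference, my understanding of the q4_0 block layout is roughly the following (a sketch based on ggml's block definitions; the ggml_half typedef is simplified to a raw 16-bit value here just so it compiles standalone):

```c
#include <stdint.h>

typedef uint16_t ggml_half;        // fp16 scale stored as raw 16 bits (simplified)

#define QK4_0 32                   // elements per quantization block

typedef struct {
    ggml_half d;                   // per-block scale
    uint8_t   qs[QK4_0 / 2];       // 32 x 4-bit quants packed two per byte
} block_q4_0;                      // 18 bytes for every 32 weights
```

As far as I can tell, the K-quants (e.g. q4_K) already group 256 elements into a super-block, which is why I referred to 256 as the current upper bound.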
I recently found your discussion from January 2024 where row-wise quantization was said not to have been adopted because of compatibility with the ggml library. So my question is: have there been any efforts to enlarge the quantization block size, e.g. to 256 or 512 elements? If so, was there any measurable difference in decoding throughput compared to the conventional 32 or 256 contiguous elements? According to GPT, enlarging the group size would hurt the cache hit rate because a block would no longer fit within a single cache line.
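To make the cache-line concern concrete, here is a back-of-the-envelope sketch I put together. The layout (one fp16 scale plus packed 4-bit quants per block) is just an assumption modeled on q4_0, and the 64-byte cache line is typical but hardware-dependent:

```c
#include <stdio.h>
#include <stdint.h>

// Hypothetical q4_0-style block: one fp16 scale + 4-bit quants packed two per byte.
static size_t block_bytes(size_t block_size) {
    return sizeof(uint16_t) + block_size / 2;   // scale + packed nibbles
}

int main(void) {
    const size_t cache_line = 64;               // typical x86/ARM cache line size
    const size_t sizes[] = {32, 256, 512};
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); ++i) {
        size_t b = block_bytes(sizes[i]);
        printf("block of %3zu weights -> %3zu bytes (~%.1f cache lines)\n",
               sizes[i], b, (double)b / cache_line);
    }
    return 0;
}
```

By this estimate a 32-element block (18 bytes) sits well within one cache line, while a 256- or 512-element block necessarily spans several, which is the effect I'd like to understand in terms of actual decoding throughput.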