Runtime Overhead on Quantization Group Size #16068
kimjoohyungsd asked this question in Q&A · Unanswered · 0 replies
Hi, I'm curious about quantization in llama.cpp. In the quantized data types such as q4_0, the project limits the quantization block size to at most 256 elements. In addition, I've found that Int8 SIMD instructions are used to accelerate inference.
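For reference, my understanding of the q4_0 block layout is roughly the following (a sketch based on ggml's block definitions; the ggml_half typedef is simplified to a raw 16-bit value here just so it compiles standalone):

```c
#include <stdint.h>

typedef uint16_t ggml_half;        // fp16 scale stored as raw 16 bits (simplified)

#define QK4_0 32                   // elements per quantization block

typedef struct {
    ggml_half d;                   // per-block scale
    uint8_t   qs[QK4_0 / 2];       // 32 x 4-bit quants packed two per byte
} block_q4_0;                      // 18 bytes for every 32 weights
```

As far as I can tell, the K-quants (e.g. q4_K) already group 256 elements into a super-block, which is why I referred to 256 as the current upper bound.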
I recently found your discussion from January 2024 where row-wise quantization was said not to have been adopted because of compatibility with the ggml library. So my question is: have there been any efforts to enlarge the quantization block size, e.g. to 256 or 512 elements? If so, was there any measurable difference in decoding throughput compared to the conventional 32 or 256 contiguous elements? According to GPT, enlarging the group size would hurt the cache hit rate because a block would no longer fit within a single cache line.
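To make the cache-line concern concrete, here is a back-of-the-envelope sketch I put together. The layout (one fp16 scale plus packed 4-bit quants per block) is just an assumption modeled on q4_0, and the 64-byte cache line is typical but hardware-dependent:

```c
#include <stdio.h>
#include <stdint.h>

// Hypothetical q4_0-style block: one fp16 scale + 4-bit quants packed two per byte.
static size_t block_bytes(size_t block_size) {
    return sizeof(uint16_t) + block_size / 2;   // scale + packed nibbles
}

int main(void) {
    const size_t cache_line = 64;               // typical x86/ARM cache line size
    const size_t sizes[] = {32, 256, 512};
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); ++i) {
        size_t b = block_bytes(sizes[i]);
        printf("block of %3zu weights -> %3zu bytes (~%.1f cache lines)\n",
               sizes[i], b, (double)b / cache_line);
    }
    return 0;
}
```

By this estimate a 32-element block (18 bytes) sits well within one cache line, while a 256- or 512-element block necessarily spans several, which is the effect I'd like to understand in terms of actual decoding throughput.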