
[Question] Static Quantization for Open-Source LLMs #1724

Open
yang-ahuan opened this issue Feb 18, 2025 · 4 comments
@yang-ahuan

Description

Hi, I am a beginner in quantization and would like to experiment with INT8 dynamic and static quantization on open-source LLMs.

  • For dynamic quantization, I found that `int8_dynamic_activation_int8_weight` is available in torchao/quantization/quant_api.py (see the sketch after this list for how I am using it).
  • For static quantization, I did not find an INT8 version; I only found `float8_static_activation_float8_weight`.
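
For context, here is roughly how I am applying the dynamic variant (a minimal sketch, assuming the `quantize_` API from torchao/quantization/quant_api.py):

```python
import torch
from torchao.quantization.quant_api import quantize_, int8_dynamic_activation_int8_weight

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024, bias=False)

    def forward(self, x):
        return self.linear(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ToyModel().eval().to(device=device, dtype=torch.bfloat16)

# Swaps each nn.Linear weight for an int8 tensor subclass; activation
# scales are recomputed per batch at inference time (hence "dynamic").
quantize_(model, int8_dynamic_activation_int8_weight())

out = model(torch.randn(1, 1024, device=device, dtype=torch.bfloat16))
```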

Questions

  • Why is only INT8 dynamic quantization provided? Is there a specific concern that prevents static INT8 quantization?
  • If I want to implement INT8 static quantization, can I follow tutorials/calibration_flow/static_quant.py as a reference?
  • For `float8_static_activation_float8_weight`, it requires a scalar `scale` parameter. What would be a recommended way to determine this parameter?

Any insights or guidance would be greatly appreciated. Thanks in advance! 😊

jcaip added the question (Further information is requested) and quantize labels on Feb 18, 2025
jcaip (Contributor) commented Feb 18, 2025

cc @jerryzh168, who probably knows best, but I think he might be on PTO currently. @HDCharles, do you know if we have an int8 static quantization flow available?

jerryzh168 (Contributor) commented Feb 18, 2025

> Why is only INT8 dynamic quantization provided? Is there a specific concern that prevents static INT8 quantization?

It's just that there's no use case yet. We can add it similarly to float8, following this example:

`def test_static_quant(target_dtype: torch.dtype, mapping_type: MappingType):`

although we don't have perf optimizations for these yet, as mentioned below.

> If I want to implement INT8 static quantization, can I follow tutorials/calibration_flow/static_quant.py as a reference?

Yes, that tutorial is the right reference.
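
Roughly, the flow there is: run calibration data through observers to fix the activation scales ahead of time, then swap in statically quantized linears. A minimal sketch of the underlying int8 static math in plain PyTorch (conceptual only, not the torchao API):

```python
import torch

class StaticQuantLinear(torch.nn.Module):
    """Conceptual int8 static-quant linear (plain PyTorch sketch): both the
    weight scale and the activation scale are frozen at calibration time
    instead of being recomputed per batch."""

    def __init__(self, linear: torch.nn.Linear, act_scale: float):
        super().__init__()
        w = linear.weight.detach()
        # Symmetric per-tensor int8 weight quantization.
        self.w_scale = w.abs().max().item() / 127.0
        self.w_int8 = torch.clamp(
            torch.round(w / self.w_scale), -127, 127).to(torch.int8)
        # Activation scale determined ahead of time from calibration data.
        self.act_scale = act_scale

    def forward(self, x):
        # Statically quantize the activation with the pre-computed scale.
        x_int8 = torch.clamp(
            torch.round(x / self.act_scale), -127, 127).to(torch.int8)
        # Integer matmul (int32 accumulation), then a single dequantize.
        acc = x_int8.to(torch.int32) @ self.w_int8.to(torch.int32).t()
        return acc.to(x.dtype) * (self.act_scale * self.w_scale)

# "Calibration": record the max activation magnitude over a few batches.
lin = torch.nn.Linear(64, 64, bias=False)
batches = [torch.randn(8, 64) for _ in range(4)]
amax = max(b.abs().max().item() for b in batches)
qlin = StaticQuantLinear(lin, act_scale=amax / 127.0)
out = qlin(batches[0])
```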

> For float8_static_activation_float8_weight, it requires a scalar parameter. What would be a recommended way to determine this parameter?

Also see

`def test_static_quant(target_dtype: torch.dtype, mapping_type: MappingType):`

for the full float8 static quant flow as well.
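
The usual recipe (roughly what that test does, I believe) is to derive the scale from calibration statistics, assuming the convention that `dequantized = float8_value * scale`:

```python
import torch

# Hypothetical calibration set: a few representative inputs in high precision.
calibration_batches = [torch.randn(8, 1024) for _ in range(4)]

# Track the max observed activation magnitude.
amax = torch.tensor(0.0)
for batch in calibration_batches:
    amax = torch.maximum(amax, batch.abs().max())

# Map the observed range onto the float8 dtype's dynamic range.
f8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
scale = amax / f8_max  # candidate value for the `scale` argument
```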

HDCharles (Contributor) commented
We haven't done much with static quantization on the int quantization front because in those situations you usually want to fuse sequences of quantized ops for better perf. E.g. linear -> relu can be fused so that you only dequantize after the relu; however, we currently don't have good kernels for doing that on GPU. You can do a statically quantized linear op on its own, but you get more quantization error and worse performance.
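
To illustrate the point (purely conceptual sketch, not our actual kernels):

```python
import torch

def int8_linear_then_relu_unfused(x_int8, w_int8, x_scale, w_scale):
    # Unfused: dequantize right after the linear, then run relu in
    # high precision -- an extra round-trip through memory.
    acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32).t()
    y = acc.to(torch.float32) * (x_scale * w_scale)
    return torch.relu(y)

def int8_linear_relu_fused(x_int8, w_int8, x_scale, w_scale):
    # Fused: relu commutes with multiplication by a positive scale, so it
    # can be applied directly to the int32 accumulator, and we only
    # dequantize once, after the relu.
    acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32).t()
    acc = torch.clamp(acc, min=0)
    return acc.to(torch.float32) * (x_scale * w_scale)
```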

yang-ahuan (Author) commented
Thank you very much for your response! I really appreciate it.

I quickly ran static_quant.py and saved the model, but it still retains its original data type (for example, with static_quant.py the weights remain bfloat16).
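
Here is roughly how I checked (sketch; the saved file name is a placeholder, and the dtype note in the comment is my assumption):

```python
import torch

# Sketch: inspect what the saved model actually contains
# ("static_quant_model.pt" is a placeholder path).
qmodel = torch.load("static_quant_model.pt", weights_only=False)
for name, param in qmodel.named_parameters():
    # torchao tensor subclasses may still report the original "external"
    # dtype (e.g. torch.bfloat16) even when the packed data is quantized,
    # so the tensor's type is the more informative field here.
    print(name, type(param.data), param.dtype)
```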

@HDCharles Is there no corresponding kernel available for the CPU as well?
