
[Question] Static Quantization for Open-Source LLMs #1724

Open
yang-ahuan opened this issue Feb 18, 2025 · 4 comments
@yang-ahuan

Description

Hi, I am a beginner in quantization and would like to experiment with INT8 dynamic and static quantization on open-source LLMs.

  • For dynamic quantization, I found that `int8_dynamic_activation_int8_weight` is available in torchao/quantization/quant_api.py (see the sketch after this list for how I am using it).
  • For static quantization, I did not find an INT8 version; I only found `float8_static_activation_float8_weight`.
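
For context, here is roughly how I am applying the dynamic variant (a minimal sketch, assuming the `quantize_` API from torchao/quantization/quant_api.py):

```python
import torch
from torchao.quantization.quant_api import quantize_, int8_dynamic_activation_int8_weight

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024, bias=False)

    def forward(self, x):
        return self.linear(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ToyModel().eval().to(device=device, dtype=torch.bfloat16)

# Swaps each nn.Linear weight for an int8 tensor subclass; activation
# scales are recomputed per batch at inference time (hence "dynamic").
quantize_(model, int8_dynamic_activation_int8_weight())

out = model(torch.randn(1, 1024, device=device, dtype=torch.bfloat16))
```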

Questions

  • Why is only INT8 dynamic quantization provided? Is there a specific concern that prevents static INT8 quantization?
  • If I want to implement INT8 static quantization, can I follow tutorials/calibration_flow/static_quant.py as a reference?
  • For `float8_static_activation_float8_weight`, it requires a scalar `scale` parameter. What would be a recommended way to determine this parameter?

Any insights or guidance would be greatly appreciated. Thanks in advance! 😊

jcaip added the question (Further information is requested) and quantize labels on Feb 18, 2025
jcaip (Contributor) commented Feb 18, 2025

cc @jerryzh168, who probably knows best, but I think he might be on PTO currently. @HDCharles, do you know if we have an int8 static quantization flow available?

jerryzh168 (Contributor) commented Feb 18, 2025

> Why is only INT8 dynamic quantization provided? Is there a specific concern that prevents static INT8 quantization?

It's just that there's no use case yet. We can add it similarly to float8, following this example:

`def test_static_quant(target_dtype: torch.dtype, mapping_type: MappingType):`

although we don't have perf optimizations for these yet, as mentioned below.

> If I want to implement INT8 static quantization, can I follow tutorials/calibration_flow/static_quant.py as a reference?

Yes, that tutorial is the right reference.
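
Roughly, the flow there is: run calibration data through observers to fix the activation scales ahead of time, then swap in statically quantized linears. A minimal sketch of the underlying int8 static math in plain PyTorch (conceptual only, not the torchao API):

```python
import torch

class StaticQuantLinear(torch.nn.Module):
    """Conceptual int8 static-quant linear (plain PyTorch sketch): both the
    weight scale and the activation scale are frozen at calibration time
    instead of being recomputed per batch."""

    def __init__(self, linear: torch.nn.Linear, act_scale: float):
        super().__init__()
        w = linear.weight.detach()
        # Symmetric per-tensor int8 weight quantization.
        self.w_scale = w.abs().max().item() / 127.0
        self.w_int8 = torch.clamp(
            torch.round(w / self.w_scale), -127, 127).to(torch.int8)
        # Activation scale determined ahead of time from calibration data.
        self.act_scale = act_scale

    def forward(self, x):
        # Statically quantize the activation with the pre-computed scale.
        x_int8 = torch.clamp(
            torch.round(x / self.act_scale), -127, 127).to(torch.int8)
        # Integer matmul (int32 accumulation), then a single dequantize.
        acc = x_int8.to(torch.int32) @ self.w_int8.to(torch.int32).t()
        return acc.to(x.dtype) * (self.act_scale * self.w_scale)

# "Calibration": record the max activation magnitude over a few batches.
lin = torch.nn.Linear(64, 64, bias=False)
batches = [torch.randn(8, 64) for _ in range(4)]
amax = max(b.abs().max().item() for b in batches)
qlin = StaticQuantLinear(lin, act_scale=amax / 127.0)
out = qlin(batches[0])
```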

> For float8_static_activation_float8_weight, it requires a scalar parameter. What would be a recommended way to determine this parameter?

Also see

`def test_static_quant(target_dtype: torch.dtype, mapping_type: MappingType):`

for the full float8 static quant flow as well.
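
The usual recipe (roughly what that test does, I believe) is to derive the scale from calibration statistics, assuming the convention that `dequantized = float8_value * scale`:

```python
import torch

# Hypothetical calibration set: a few representative inputs in high precision.
calibration_batches = [torch.randn(8, 1024) for _ in range(4)]

# Track the max observed activation magnitude.
amax = torch.tensor(0.0)
for batch in calibration_batches:
    amax = torch.maximum(amax, batch.abs().max())

# Map the observed range onto the float8 dtype's dynamic range.
f8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
scale = amax / f8_max  # candidate value for the `scale` argument
```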

HDCharles (Contributor) commented
We haven't done much with static quantization on the int quantization front because in those situations you usually want to fuse sequences of quantized ops for better perf. E.g. linear -> relu can be fused so that you only dequantize after the relu; however, we currently don't have good kernels for doing that on GPU. You can do a statically quantized linear op on its own, but you get more quantization error and worse performance.
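
To illustrate the point (purely conceptual sketch, not our actual kernels):

```python
import torch

def int8_linear_then_relu_unfused(x_int8, w_int8, x_scale, w_scale):
    # Unfused: dequantize right after the linear, then run relu in
    # high precision -- an extra round-trip through memory.
    acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32).t()
    y = acc.to(torch.float32) * (x_scale * w_scale)
    return torch.relu(y)

def int8_linear_relu_fused(x_int8, w_int8, x_scale, w_scale):
    # Fused: relu commutes with multiplication by a positive scale, so it
    # can be applied directly to the int32 accumulator, and we only
    # dequantize once, after the relu.
    acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32).t()
    acc = torch.clamp(acc, min=0)
    return acc.to(torch.float32) * (x_scale * w_scale)
```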

yang-ahuan (Author) commented
Thank you very much for your response! I really appreciate it.

I quickly ran static_quant.py and saved the model, but it still retains its original data type (for example, with static_quant.py the weights remain bfloat16).
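
Here is roughly how I checked (sketch; the saved file name is a placeholder, and the dtype note in the comment is my assumption):

```python
import torch

# Sketch: inspect what the saved model actually contains
# ("static_quant_model.pt" is a placeholder path).
qmodel = torch.load("static_quant_model.pt", weights_only=False)
for name, param in qmodel.named_parameters():
    # torchao tensor subclasses may still report the original "external"
    # dtype (e.g. torch.bfloat16) even when the packed data is quantized,
    # so the tensor's type is the more informative field here.
    print(name, type(param.data), param.dtype)
```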

@HDCharles Is there no corresponding kernel available for the CPU as well?
