[Question] Static Quantization for Open-Source LLMs #1724
Comments
cc @jerryzh168 probably knows best, but I think he might be on PTO currently. @HDCharles, do you know if we have an int8 static quantization flow available?
There's just no use case for it yet; we can add it similarly to float8, following the example in ao/tutorials/calibration_flow/static_quant.py (line 258 in f2e8f56), although we don't have perf optimizations for these yet, as mentioned below.
Yes. Also see ao/tutorials/calibration_flow/static_quant.py (line 258 in f2e8f56).
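The static_quant.py flow referenced above boils down to: observe activation ranges on calibration data, freeze a scale, then quantize activations with that fixed scale at inference. A toy sketch of that pattern (illustrative only, not the torchao API; `CalibratedLinear` and all names here are hypothetical):

```python
import torch

class CalibratedLinear(torch.nn.Module):
    """Toy static-quant linear: record the activation abs-max during
    calibration, then quantize activations with the frozen scale afterwards.
    (Simplified sketch; the real flow is in
    tutorials/calibration_flow/static_quant.py.)"""
    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        self.linear = linear
        self.amax = torch.tensor(0.0)   # running abs-max seen in calibration
        self.calibrating = True

    def forward(self, x):
        if self.calibrating:
            self.amax = torch.maximum(self.amax, x.abs().max())
            return self.linear(x)
        # Static quant: fixed scale derived from calibration, not from x.
        scale = self.amax / 127.0
        xq = torch.clamp(torch.round(x / scale), -128, 127)
        # Dequantize and run the fp matmul (no fused int8 kernel here).
        return self.linear(xq * scale)

lin = CalibratedLinear(torch.nn.Linear(8, 8))
for _ in range(4):                  # calibration passes
    lin(torch.randn(2, 8))
lin.calibrating = False
out = lin(torch.randn(2, 8))        # inference with the frozen scale
```

The key property is that the activation scale is fixed ahead of time, which is what distinguishes static from dynamic quantization.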
We haven't done much with static quantization on the int quantization front because in those situations you usually want fused sequences of quantized ops for better perf: e.g. linear -> relu can be fused so that you only dequantize after the relu. However, we currently don't have good kernels for doing that on GPU. You can do a statically quantized linear op, but you get more quantization error and worse performance.
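The fusion point above can be illustrated with plain integer arithmetic: because relu commutes with multiplication by a positive scale, it can be applied directly to the int32 accumulator, leaving a single dequantize at the very end. A minimal NumPy sketch (hypothetical helper names, not torchao kernels):

```python
import numpy as np

def quant(x, scale):
    """Symmetric int8 quantization with a given scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 8).astype(np.float32)
sx, sw = np.abs(x).max() / 127, np.abs(w).max() / 127

# Integer matmul accumulates in int32.
acc = quant(x, sx).astype(np.int32) @ quant(w, sw).astype(np.int32)

# Fused linear -> relu: relu on the int32 accumulator, one dequantize at the end.
fused = np.maximum(acc, 0) * (sx * sw)

# Unfused: dequantize after the linear, then relu in float.
unfused = np.maximum(acc * (sx * sw), 0)

# relu(a * s) == s * relu(a) for s > 0, so the two paths agree exactly.
assert np.allclose(fused, unfused)
```

Doing the relu in the integer domain saves a round trip to float between the two ops, which is where the perf win from fusion comes from; without such fused kernels, each quantized op pays its own dequantize.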
Thank you very much for your response! I really appreciate it. I quickly ran static_quant.py and saved the model, but it still reports its original data type (for example, in static_quant.py it remains bfloat16). @HDCharles Is there no corresponding kernel available for the CPU either?
Description
Hi, I am a beginner in quantization and would like to experiment with INT8 dynamic and static quantization on open-source LLMs. I see that int8_dynamic_activation_int8_weight is available in torchao/quantization/quant_api.py, but for static activation quantization I only found float8_static_activation_float8_weight.
Questions
1. For INT8 static quantization, can I use tutorials/calibration_flow/static_quant.py as a reference?
2. float8_static_activation_float8_weight requires a scale parameter. What would be a recommended way to determine this parameter?
Any insights or guidance would be greatly appreciated. Thanks in advance! 😊