SmoothQuant implements input-weight equalization. The current implementation in torchao uses module swap, but it can be refactored to use tensor subclasses, and also to use `AffineQuantizedTensor`, so that we can consolidate the performance optimizations in one place. We can use the static quantization flow as an example: #487.

The main benefits of the refactor would be: (1) aligning model-level APIs, and (2) an easier deserialization story (https://pytorch.org/ao/stable/serialization.html#what-happens-when-deserializing-an-optimized-model): you can load the quantized state dict into the original model directly and get a model ready for inference.
## Overview
Here is the top-level API for SmoothQuant: https://github.com/pytorch/ao/tree/main/torchao/quantization#to-be-moved-to-prototype-a8w8-dynamic-quantization-with-smoothquant
It follows our calibration flow (static quant flow) pretty closely: see `ao/tutorials/calibration_flow/static_quant.py`, lines 121 to 134 at commit `afde175`.
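As a rough, illustrative sketch of that pattern (the two callables here are stand-ins for the real insert/convert helpers discussed in the two steps below, not existing torchao APIs):

```python
import torch
import torch.nn as nn
from typing import Callable, Iterable

# Illustrative only: `insert_observers_` and `convert_` stand in for the
# real helpers discussed in Step 1 and Step 2 below.
def run_calibration_flow(
    model: nn.Module,
    calibration_inputs: Iterable[torch.Tensor],
    insert_observers_: Callable[[nn.Module], None],
    convert_: Callable[[nn.Module], None],
) -> nn.Module:
    insert_observers_(model)   # Step 1: attach stat-recording observers
    model.eval()
    with torch.no_grad():      # feed representative inputs to collect stats
        for x in calibration_inputs:
            model(x)
    convert_(model)            # Step 2: swap float weights for quantized tensors
    return model
```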
## How to implement it in torchao
Similar to the static quantization flow, at the high level we can have two steps.
### Step 1. Inserting Observers
The first step is to insert observers that record the running absolute max value of the input: see `ao/torchao/quantization/smoothquant.py`, lines 146 to 147 at commit `afde175`.
We can create a function `insert_smoothquant_observers_`, similar to `ao/tutorials/calibration_flow/static_quant.py`, line 37 at commit `afde175`; a sketch follows.
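As a minimal illustration, assuming a module-swap-style observed linear for now (both `SmoothQuantObservedLinear` and `insert_smoothquant_observers_` are hypothetical names, not existing torchao APIs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothQuantObservedLinear(nn.Linear):
    """Hypothetical observed linear that records the running absolute max
    of its inputs, per input channel."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        self.register_buffer("x_abs_max", torch.zeros(in_features))

    def forward(self, x):
        with torch.no_grad():
            # Reduce over all dims except the last (input-channel) dim.
            cur = x.detach().abs().reshape(-1, x.shape[-1]).amax(dim=0)
            self.x_abs_max.copy_(torch.maximum(self.x_abs_max, cur))
        return F.linear(x, self.weight, self.bias)

def insert_smoothquant_observers_(model: nn.Module) -> None:
    # Recursively swap every nn.Linear for the observed version, in place.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear) and not isinstance(
            child, SmoothQuantObservedLinear
        ):
            obs = SmoothQuantObservedLinear(
                child.in_features, child.out_features, child.bias is not None
            )
            obs.weight = child.weight
            obs.bias = child.bias
            setattr(model, name, obs)
        else:
            insert_smoothquant_observers_(child)
```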
### Step 2. Convert to AffineQuantizedTensor with a new layout

After we have collected the stats, we can convert the floating point weight to `AffineQuantizedTensor` with a new `LayoutType` and `AQTLayout`, with an extra `equalization_scale` Tensor. I think this can share the same implementation as AWQ, although with different dtypes (int8). Example conversion code: `ao/tutorials/calibration_flow/static_quant.py`, lines 46 to 63 at commit `afde175`.
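To make the conversion concrete, here is the SmoothQuant math in plain PyTorch, leaving out the `AffineQuantizedTensor`/layout packaging the real implementation would use; `smooth_and_quantize_weight` is an illustrative name, and `alpha` is the usual SmoothQuant migration-strength hyperparameter:

```python
import torch

def smooth_and_quantize_weight(
    weight: torch.Tensor,      # (out_features, in_features) float weight
    x_abs_max: torch.Tensor,   # (in_features,) running abs max from Step 1
    alpha: float = 0.5,
):
    # SmoothQuant equalization scale per input channel j:
    #   s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    w_abs_max = weight.abs().amax(dim=0).clamp(min=1e-5)
    equalization_scale = (
        x_abs_max.clamp(min=1e-5).pow(alpha) / w_abs_max.pow(1 - alpha)
    )
    # Fold the scale into the weight; at runtime the activation is divided
    # by the same scale before the int8 matmul.
    w = weight * equalization_scale
    # Symmetric per-output-channel int8 quantization of the smoothed weight.
    w_scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    w_int8 = torch.clamp(torch.round(w / w_scale), -128, 127).to(torch.int8)
    return w_int8, w_scale, equalization_scale
```

At runtime the activation is divided by `equalization_scale` (or the division is fused into the preceding op), which is exactly the extra tensor the new layout needs to carry.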
In terms of the model-level API, we can implement a helper function like the one in `ao/torchao/quantization/quant_api.py`, line 363 at commit `afde175`.
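Reusing the sketches above, a hypothetical one-shot conversion helper could look like the following; the real version would route through `quantize_` and produce `AffineQuantizedTensor` weights instead of plain buffers:

```python
import torch.nn as nn

def smoothquant_(model: nn.Module, alpha: float = 0.5) -> None:
    # Convert every observed linear in place after calibration.
    for module in model.modules():
        if isinstance(module, SmoothQuantObservedLinear):
            w_int8, w_scale, eq_scale = smooth_and_quantize_weight(
                module.weight.detach(), module.x_abs_max, alpha
            )
            # In the real refactor these three tensors would be packed into
            # an AffineQuantizedTensor with a SmoothQuant layout.
            module.register_buffer("w_int8", w_int8)
            module.register_buffer("w_scale", w_scale)
            module.register_buffer("equalization_scale", eq_scale)
```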
## Logistics (Code Location, Tests and Benchmarks)
Please create a `smoothquant` folder under https://github.com/pytorch/ao/tree/main/torchao/prototype.
The flow and layout implementations can live in separate files, e.g. `flow.py` and `layout.py` (there might be some missing extension points of `AffineQuantizedTensor`, but we'll work on those at the same time).
For testing, please create a `test_smoothquant.py` in https://github.com/pytorch/ao/tree/main/test/prototype and move the tests from `ao/test/integration/test_integration.py`, line 159 at commit `afde175`. A minimal sketch of such a test follows.
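As a starting point, the test could simply exercise the end-to-end sketches above and check that the dequantized, un-equalized weight stays close to the original float weight (names reused from the sketches; tolerances are illustrative):

```python
import unittest
import torch
import torch.nn as nn

class TestSmoothQuant(unittest.TestCase):
    def test_observe_then_convert(self):
        torch.manual_seed(0)
        model = nn.Sequential(nn.Linear(16, 8))
        insert_smoothquant_observers_(model)   # Step 1 sketch
        with torch.no_grad():
            model(torch.randn(4, 16))          # calibration records x_abs_max
        smoothquant_(model)                    # model-level helper sketch
        m = model[0]
        # Undo equalization and dequantize; should approximate the original.
        w_deq = (m.w_int8.float() * m.w_scale) / m.equalization_scale
        torch.testing.assert_close(w_deq, m.weight.detach(),
                                   atol=2e-2, rtol=2e-2)

if __name__ == "__main__":
    unittest.main()
```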
For an e2e flow demo, please add a `smoothquant.py` in https://github.com/pytorch/ao/tree/main/tutorials/calibration_flow following the static quant example, and please show the benchmarking results as well (since we are using an optimized kernel), following https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization-flow-example.
The last step is to test this with llama2/llama3 following the instructions in https://github.com/pytorch/ao/tree/main/torchao/_models/llama, and to measure the metrics in https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks if you have GPU machines. For SmoothQuant, you can also test on CPU machines and add the results to the quantization README.