Support for FP16 and BF16 is limited in many embedded processors.

Next, we'll show you how to optimize your model for mobile execution (with ExecuTorch, ET) or get the most from your server or desktop hardware (with AOT Inductor, AOTI). A basic model build for mobile surfaces two issues: models quickly run out of memory, and execution can be slow. In this section, we show you how to fit your models into the limited memory of a mobile device and how to optimize execution speed, in both cases using quantization. This is the torchchat repo, after all!

For high-performance devices such as GPUs, quantization reduces the memory bandwidth needed to compute a result and takes advantage of the massive compute capabilities of today's server-based accelerators. Beyond computing results faster by avoiding memory stalls, quantization also lets accelerators (which usually have a limited amount of memory) store and process larger models than they otherwise could.
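
To put rough numbers on the savings (a back-of-the-envelope estimate, not a measurement): a 7B-parameter model stored in 16-bit floating point needs about 14 GB for the weights alone, while the same weights quantized to 8 bits occupy roughly 7 GB, and quantized to 4 bits roughly 3.5 GB, plus a small overhead for the quantization scales.
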
We can specify quantization parameters with the `--quantize` option. The quantize option takes a JSON/dictionary with quantizers and quantization options.

`generate` and `export` (for both ET and AOTI) can both accept quantization options. We show only a subset of the combinations to avoid a combinatorial explosion.

## Quantization API
Model quantization recipes are specified by a JSON file / dict describing the quantizations to perform. Each quantization step consists of a higher-level quantization operator (a quantizer) and a dict with any parameters:
```
{
    "<quantizer1>" : {
        "<quantizer1_option1>" : value,
        "<quantizer1_option2>" : value,
        ...
    },
    "<quantizer2>" : {
        "<quantizer2_option1>" : value,
        "<quantizer2_option2>" : value,
        ...
    },
    ...
}
```

The quantization recipe may be specified either on the command line as a single JSON string with `--quantize "<json string>"`, or by specifying a filename containing the recipe as a JSON structure with `--quantize filename.json`. We recommend storing longer recipes as a JSON file, while the inline CLI variant may be more convenient for quick ad-hoc experiments.
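
For example, a recipe file might combine two of the quantizers described below. (The file name `quant_config.json` is only an illustration; the quantizer names and options are the same ones used in the inline examples later in this document.)

```
{
    "embedding"   : {"bitwidth": 8, "groupsize": 0},
    "linear:int8" : {"bitwidth": 8, "groupsize": 0}
}
```

You would then pass `--quantize quant_config.json` to the same commands shown below, instead of spelling out the JSON on the command line.
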
The simplest way to quantize embedding tables is with int8 "channelwise" (symmetric) quantization, where each value is represented by an 8-bit integer, and a floating-point scale is stored per embedding (channelwise quantization) or per group of values within an embedding (groupwise quantization).
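
To make the difference between the two schemes concrete, here is a minimal sketch (an illustration only, not torchchat's actual implementation) of symmetric int8 quantization with either one scale per embedding row or one scale per group of values:

```
# Minimal illustrative sketch -- NOT torchchat's implementation.
# groupsize == 0 -> channelwise: one scale per embedding row.
# groupsize == g -> groupwise: one scale per group of g consecutive values in a row.
import torch

def quantize_int8_symmetric(weight: torch.Tensor, groupsize: int = 0):
    rows, cols = weight.shape
    g = cols if groupsize == 0 else groupsize      # channelwise = one group spanning the row
    assert cols % g == 0, "row length must be divisible by groupsize"
    grouped = weight.reshape(rows, cols // g, g)
    # Symmetric: the scale maps the largest magnitude in each group to 127; no zero point.
    scales = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(grouped / scales), -128, 127).to(torch.int8)
    return q.reshape(rows, cols), scales.squeeze(-1)  # int8 values plus floating-point scales

# Dequantization multiplies each int8 group by its scale to recover approximate values.
```

Smaller groups track the local range of values more closely (usually better accuracy) at the cost of storing more scales.
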
We can do this in eager mode (optionally with torch.compile), using the embedding quantizer with `groupsize` set to 0, which selects channelwise quantization:

TODO: Write this so that someone can copy paste
```
python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```
Quantizing embedding tables with int4 provides even higher compression.
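
Since each value now uses 4 bits instead of 8, the quantized table takes roughly half the space of its int8 counterpart (about a quarter of the fp16 size, plus scales), typically at some cost in accuracy.
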
We can do this in eager mode (optionally with torch.compile), using the embedding quantizer with `groupsize` set to 0, which selects channelwise quantization:
```
python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
```
Now you can run your model with the same command as before.

The simplest way to quantize linear operators is with int8 groupwise quantization.

We can do this in eager mode (optionally with torch.compile), using the `linear:int8` quantizer with `groupsize` set to 0, which selects channelwise quantization:
```
python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```
Then, export as follows using ExecuTorch for mobile backends: