# {BACKEND_NAME} Quantization

Document quantization schemes and flows for the backend. This should include a description of each scheme and a code example to perform quantization. Example sections for PT2E and `quantize_` are included below, to be replaced with details for the target backend.

### Supported Quantization Schemes

The {BACKEND_NAME} delegate supports the following quantization schemes:

- {QUANTIZATION_SCHEME_1}
- {QUANTIZATION_SCHEME_2}

### {QUANTIZATION_METHOD_1} using the PT2E Flow

To perform {QUANTIZATION_METHOD_1} with the PT2E flow, perform the following steps prior to exporting the model:

1) Create an instance of the `{BackendName}Quantizer` class and set the desired quantization parameters.
2) Use `torch.export.export` to capture the model graph.
3) Call `prepare_pt2e` to prepare the captured model for quantization.
4) For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges.
5) Call `convert_pt2e` to quantize the model.
6) Export and lower the model using the standard flow.

The output of `convert_pt2e` is a PyTorch model, which can be exported and lowered using the normal flow. Because it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model with standard PyTorch techniques, as sketched after the example below.

```python
import torch
import {MODEL_IMPORT_PATH} as models
from {MODEL_WEIGHTS_IMPORT}
from executorch.backends.{backend_name}.quantizer.{backend_name}_quantizer import {BackendName}Quantizer, {get_quantization_config_function}
from executorch.backends.{backend_name}.partition.{backend_name}_partitioner import {BackendName}Partitioner
from executorch.exir import to_edge_transform_and_lower
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = models.{model_name}.{model_function}(weights={ModelWeights}.DEFAULT).eval()
sample_inputs = ({SAMPLE_INPUT_SHAPE},)

qparams = {get_quantization_config_function}({QUANTIZATION_PARAMS})  # (1)
quantizer = {BackendName}Quantizer()
quantizer.set_global(qparams)

training_ep = torch.export.export(model, sample_inputs).module()  # (2)
prepared_model = prepare_pt2e(training_ep, quantizer)  # (3)

for cal_sample in [{CALIBRATION_SAMPLE}]:  # Replace with representative model inputs
    prepared_model(cal_sample)  # (4) Calibrate

quantized_model = convert_pt2e(prepared_model)  # (5)

et_program = to_edge_transform_and_lower(  # (6)
    torch.export.export(quantized_model, sample_inputs),
    partitioner=[{BackendName}Partitioner()],
).to_executorch()
```
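
As a quick sanity check, the converted model's outputs can be compared against the float model's. The following is a minimal sketch, assuming `sample_inputs` holds a single input tensor and the model returns a single tensor; a real evaluation would use a representative dataset and a task-appropriate metric.

```python
# Compare float and quantized outputs on the sample input
# (a quick sanity check, not a full accuracy evaluation).
with torch.no_grad():
    float_out = model(*sample_inputs)
    quant_out = quantized_model(*sample_inputs)

# Mean absolute difference introduced by quantization.
print("Mean absolute error:", (float_out - quant_out).abs().mean().item())
```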

See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information.

### LLM Quantization with `quantize_`

The {BACKEND_NAME} backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) `quantize_` API. {ADVANCED_QUANTIZATION_DESCRIPTION}

Below is a simple example; a more detailed tutorial, including accuracy evaluation on popular benchmarks, can be found in the [torchao documentation]({TORCHAO_DOCS_URL}).

```python
import torch
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    quantize_,
)

# eager_model is the (unexported) torch.nn.Module to be quantized.

# Quantize embeddings with 8-bit weights, per channel
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)
quantize_(
    eager_model,
    embedding_config,
    lambda m, fqn: isinstance(m, torch.nn.Embedding),
)

# Quantize linear layers with 8-bit dynamic activations and 4-bit weights
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
)
quantize_(eager_model, linear_config)
```
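
`quantize_` modifies the eager model in place, so the model can then be exported and lowered with the same flow as in the PT2E example. The following is a minimal sketch, reusing the `{BackendName}Partitioner` placeholder from above and assuming `sample_inputs` is a tuple of example inputs matching the model's forward signature:

```python
import torch

from executorch.backends.{backend_name}.partition.{backend_name}_partitioner import {BackendName}Partitioner
from executorch.exir import to_edge_transform_and_lower

# Export and lower the quantized eager model using the standard flow.
et_program = to_edge_transform_and_lower(
    torch.export.export(eager_model, sample_inputs),
    partitioner=[{BackendName}Partitioner()],
).to_executorch()
```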