02 Sep 17:57

vkuzo

379010f

v0.13.0 Latest


Highlights

We are excited to announce the 0.13.0 release of torchao! This release adds numerous QAT improvements, faster MXFP8 pretraining, and more!

Simpler Multi-step QAT API (#2629)

We added a new, simpler, multi-step QAT API that uses only a single config. Now users can specify the target post-training quantization (PTQ) config as the base config and we will automatically infer the correct fake quantize configs to use!

from torchao.quantization import (
    quantize_,
    Int8DynamicActivationInt4WeightConfig
)
from torchao.quantization.qat import QATConfig

# prepare
base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
qat_config = QATConfig(base_config, step="prepare")
quantize_(m, qat_config)

# train (not shown)

# convert
quantize_(m, QATConfig(base_config, step="convert"))

For more advanced use cases, users can continue to specify specific FakeQuantizeConfigs as before:

import torch
from torchao.quantization.qat import IntxFakeQuantizeConfig

# prepare
activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# train and convert (not shown)
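To make the term "fake quantize" concrete, here is a minimal pure-Python sketch of the quantize-then-dequantize round trip that QAT inserts during training. It assumes symmetric per-group int4 quantization; the function name and rounding details are illustrative and are not torchao's implementation, which supports more options (asymmetric ranges, per-token granularity, and so on).

```python
def fake_quantize_int4(values, group_size=32):
    """Quantize-dequantize a flat list of floats with symmetric per-group int4.

    Illustrative only: a hypothetical helper, not part of torchao.
    """
    qmin, qmax = -8, 7  # signed 4-bit integer range
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = amax / qmax if amax > 0 else 1.0
        for v in group:
            q = max(qmin, min(qmax, round(v / scale)))  # snap to the integer grid
            out.append(q * scale)  # back to float, now with int4 precision
    return out


weights = [0.7, -0.35, 0.1, 0.0]
print(fake_quantize_int4(weights, group_size=4))
```

The output stays in floating point, so training can proceed as usual while the model sees the rounding error it will encounter after the convert step.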

(Prototype) NVFP4 and FP8 QAT (#2735, #2666)

We generalized QAT to support FP8 and NVFP4 use cases. You can try them out as follows:

from torchao.quantization import (
    quantize_,
    Float8DynamicActivationInt4WeightConfig,
    Float8DynamicActivationFloat8WeightConfig,
    Float8WeightOnlyConfig,
)
from torchao.prototype.mx_formats import NVFP4InferenceConfig
from torchao.quantization.qat import QATConfig

# Pick a base config
base_config = Float8DynamicActivationInt4WeightConfig()  # or
base_config = Float8DynamicActivationFloat8WeightConfig()  # or
base_config = NVFP4InferenceConfig()

# prepare
qat_config = QATConfig(base_config, step="prepare")
quantize_(m, qat_config)

# train (not shown)

# convert
quantize_(m, QATConfig(base_config, step="convert"))

Users can also use the more specific FakeQuantizeConfigs for more advanced use cases, e.g.:

import torch
from torchao.quantization import PerRow, quantize_
from torchao.quantization.qat import Float8FakeQuantizeConfig, QATConfig
from torchao.prototype.qat import NVFP4FakeQuantizeConfig

activation_config = Float8FakeQuantizeConfig(torch.float8_e4m3fn, PerRow())
weight_config = NVFP4FakeQuantizeConfig(use_per_tensor_scale=True)

# prepare
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# train and convert (not shown)

(Prototype) 1.2x MXFP8 dense pretraining speedups with torchtitan

We landed performance improvements (such as a faster to_mx dim1 cast) to our prototype MXFP8 training APIs, and we now achieve a 1.2x speedup vs bf16 on pretraining LLaMa 3 8B on NVIDIA B200. Please see our training benchmarks README for more information.

torchao float8 training now integrated into axolotl!

You can now use torchao.float8 directly from axolotl to achieve finetuning QPS e2e speedups of up to 1.1x on 3B parameter models (docs, release notes).

BC Breaking

Float8DynamicActivationFloat8WeightConfig and Float8WeightOnlyConfig version bump to 2 (#2650)

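We updated the implementation of the float8 Tensor, which bumps the default version from 1 to 2 for these two configs. Loading a checkpoint that was quantized with version 1 now emits warnings like the following: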

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "torchao-testing/opt-125m-Float8DynamicActivationFloat8WeightConfig-v1-0.13.dev"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",
    device_map="cuda",
)

/data/users/jerryzh/ao/torchao/core/config.py:249: UserWarning: Stored version is not the same as current default version of the config: stored_version=1, current_version=2, please check the deprecation warning
  warnings.warn(
/data/users/jerryzh/ao/torchao/dtypes/floatx/float8_layout.py:113: UserWarning: Models quantized with version 1 of Float8DynamicActivationFloat8WeightConfig is deprecated and will no longer be supported in a future release, please upgrade torchao and quantize again, or download a newer torchao checkpoint, see https://github.com/pytorch/ao/issues/2649 for more details
  warnings.warn(

Suggestion: upgrade torchao to 0.13 or later and generate the checkpoint again:

quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

Or download the checkpoint again (please let us know if the checkpoint has not been updated).

Please see #2649 for more details around the deprecation.

QAT API Changes (#2628, #2641)

On a high level, the following existing APIs are deprecated and replaced by these new ones. Although this is technically BC-breaking due to typing changes, it will not affect most users as old classes are kept around for now. They are planned to be removed in the next release, however.

IntXQuantizationAwareTrainingConfig -> QATConfig
FromIntXQuantizationAwareTrainingConfig -> QATConfig
FakeQuantizeConfig -> IntxFakeQuantizeConfig
FakeQuantizer -> IntxFakeQuantizer

Please see #2630 and the latest QAT README for more information on how to migrate.

Remove old `change_linear_weights_to_*` APIs (#2721)

The following old quantization APIs have been removed and no longer work:

change_linear_weights_to_int8_dqtensors(model)
change_linear_weights_to_int8_woqtensors(model)
change_linear_weights_to_int4_woqtensors(model)

Please use the quantize_ API with the following configs instead:

quantize_(model, Int8DynamicActivationInt8WeightConfig())
quantize_(model, Int8WeightOnlyConfig())
quantize_(model, Int4WeightOnlyConfig())

Deprecations

Deprecate old TORCH_VERSION variables (#2719)

The following variables are deprecated and will be removed in the next release:

TORCH_VERSION_AT_LEAST_2_2
TORCH_VERSION_AT_LEAST_2_3
TORCH_VERSION_AT_LEAST_2_4
TORCH_VERSION_AT_LEAST_2_5
TORCH_VERSION_AT_LEAST_2_6
TORCH_VERSION_AT_LEAST_2_7
TORCH_VERSION_AT_LEAST_2_8
TORCH_VERSION_AFTER_2_2
TORCH_VERSION_AFTER_2_3
TORCH_VERSION_AFTER_2_4
TORCH_VERSION_AFTER_2_5

Drop support for PyTorch 2.5 and before (#2720)

torchao only supports the latest 3 versions of PyTorch. Please upgrade to PyTorch 2.6.0+ if you were using an older version of PyTorch.
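If your own code relied on the deprecated `TORCH_VERSION_AT_LEAST_*` flags, one option is to compare version tuples directly. The sketch below is a self-contained illustration using only the standard library; `parse_version` and `version_at_least` are hypothetical helper names, not torchao APIs. In real code you would pass `torch.__version__` as `current`.

```python
import re


def parse_version(version):
    """Return (major, minor, patch) from strings like '2.6.0+cu124'.

    Hypothetical helper; suffixes such as '+cu124' or '.dev2025' are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def version_at_least(current, minimum):
    """True if `current` is at least `minimum`, comparing numerically."""
    return parse_version(current) >= parse_version(minimum)


# Tuple comparison handles two-digit minor versions, unlike string comparison.
print(version_at_least("2.10.0", "2.6.0"))
print(version_at_least("2.5.1+cu121", "2.6.0"))
```

Note that this sketch treats pre-release builds as equal to the corresponding release, which may or may not be what you want.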

New Features

New multi-step QAT API (#2629)
Add float8 FakeQuantizeConfig and FakeQuantizer (#2735)
(prototype) Add NVFP4 QAT (#2666)

Improvements

Add StretchedUnifTorchaoQuantizer (#2576)
Allow symmetric_no_clipping_error for KleidiAI kernels, update Readme and validate Kleidi INT4 quantization path (#2570)
Enable powers of 2 cast in float8 rowwise_with_gw_hp recipe (#2677)
Don't call erase if node is already erased in batch norm fusion. (#2716)
Generalize FakeQuantizer beyond intx (#2714)
Allow pattern replacement to ignore literals (#2519)
Replace export_for_training with torch.export.export (#2724)
Allow no quantization during QATConfig convert (#2694)
Int4 sparse marlin tensor (#2771)
Remove group_size arg in Float8DynamicActivationInt4WeightConfig (#2779)
Fix batch norm folding in prepare_pt2e for multiple conv->BN chains sharing the same conv weights (#2795)
Add Float8Tensor ([https://github.com/py...

Contributors

wdvr, abeakkas, and 7 other contributors


17 Jul 17:56

drisspg

v0.12.0

442232f


Highlights

We are excited to announce the 0.12.0 release of torchao! This release adds support for QAT + Axolotl Integration and prototype MXFP/NVFP support on Blackwell GPUs!

QAT + Axolotl Integration

TorchAO’s QAT support has been integrated into Axolotl’s fine-tuning recipes! Check out the docs here or run it yourself using the following command:

axolotl train examples/llama-3/3b-qat-fsdp2.yaml
axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml

Initial results for Llama3.2-3B by @SalmanMohammadi (axolotl-ai-cloud/axolotl#2590):

Model/Metric	hellaswag acc	hellaswag acc_norm	wikitext bits_per_byte	wikitext byte_perplexity	wikitext word_perplexity
bfloat16	0.5552	0.7315	0.6410	1.5594	10.7591
bfloat16 PTQ	0.5393	0.7157	0.6613	1.5815	11.6033
qat ptq	0.5423	0.7180	0.6567	1.5764	11.4043
Recovered (qat ptq)	18.87%	14.56%	22.66%	23.08%	23.57%
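The "Recovered" row appears consistent with the fraction of the PTQ degradation (relative to bfloat16) that QAT wins back. The formula below is inferred from the table values rather than stated by the authors, so treat it as an assumption; `recovered` is a hypothetical helper name.

```python
def recovered(baseline, ptq, qat):
    """Fraction of the PTQ degradation that QAT recovers.

    Works for both higher-is-better (accuracy) and lower-is-better
    (perplexity) metrics, because the sign cancels in the ratio.
    """
    return (ptq - qat) / (ptq - baseline)


# (bfloat16, bfloat16 PTQ, qat ptq) taken from the table above
rows = {
    "hellaswag acc": (0.5552, 0.5393, 0.5423),
    "hellaswag acc_norm": (0.7315, 0.7157, 0.7180),
    "wikitext word_perplexity": (10.7591, 11.6033, 11.4043),
}

for name, (bf16, ptq, qat) in rows.items():
    print(f"{name}: {recovered(bf16, ptq, qat) * 100:.2f}%")
```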

[Prototype | API not finalized] MXFP and NVFP support on Blackwell GPUs

TorchAO now includes prototype support for NVFP4 (NVIDIA's 4-bit floating-point format) and Microscaling (MX) formats on NVIDIA's latest Blackwell GPU architecture. These formats enable efficient inference, achieving up to 61% end-to-end performance improvement in vLLM on Qwen3 models and near 2x speedups for diffusion workloads.

To use:

from torchao.quantization import quantize_ 
from torchao.prototype.mx_formats import (
    MXFPInferenceConfig,
    NVFP4InferenceConfig,
)
# quantize_ modifies the model in place and returns None, so do not reassign it.
# Quantize model to MXFP8
quantize_(model, MXFPInferenceConfig(block_size=32))
# Or quantize model to NVFP4 (without double scaling)
quantize_(model, NVFP4InferenceConfig())

Note: This is a prototype feature with APIs subject to change. Requires NVIDIA Blackwell GPUs (B200, 5090) with CUDA 12.8+.
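For readers new to microscaling formats, the `block_size=32` argument reflects the idea that each block of 32 elements shares one power-of-two scale. The sketch below illustrates how such a shared scale can be chosen for an MXFP8 (e4m3) block, following the convention in the OCP Microscaling specification. It is a simplified pure-Python illustration under that assumption, not torchao's kernel logic, and it omits the final cast of elements to float8.

```python
import math

E4M3_EMAX = 8  # exponent of the largest e4m3 normal value (448 = 1.75 * 2**8)


def mx_block_scales(values, block_size=32):
    """Return one power-of-two scale per block of `block_size` elements.

    Hypothetical helper for illustration only.
    """
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0:
            scales.append(1.0)  # any scale works for an all-zero block
            continue
        # Shift the block so its largest exponent lines up with e4m3's largest.
        shared_exp = math.floor(math.log2(amax)) - E4M3_EMAX
        scales.append(2.0 ** shared_exp)
    return scales


block = [1.0] * 32
scale = mx_block_scales(block)[0]
# Elements are divided by the shared scale before being cast to float8;
# anything that still exceeds the e4m3 range would be clamped by the cast.
print(scale, max(abs(v) / scale for v in block))
```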

BC Breaking

Remove preserve_zero and zero_point_domain from choose_qparams_affine (#2149)
Rename qparams for tinygemm (#2344)
Convert quant_primitives methods private (#2350)
Delete Galore (#2397)
Remove more Galore bits (#2417)
Remove sparsity/prototype/blocksparse (#2205)

Deprecations

Clean up prototype folder (#2232)
Make float8 training's force_recompute_fp8_weight_in_bwd flag do nothing (#2356)

New Features

Enabling MOE Quantization using linear decomposition (#2043)
[PT2E][X86] Migrate fusion passes in Inductor to torchao (#2140)
2:4 activation sparsity packing kernels (#2012)
Add subclass based method for inference w/ MXFP8 (#2132)
Feat: Implementation of the DeepSeek blockwise quantization for fp8 tensors (#1763)
Arm_inductor_quantizer for Pt2e quantization (#2139)
Add mx_fp4 path (#2201)
Add support for KleidiAI int4 kernels on aarch64 Linux (#2169)
Add support for fbgemm int4 mm kernel (#2255)
Enable fp16+int4 mixed precision path for int4 xpu path with int zero point (#2240)
Enable range learning for QAT (#2033)
Patch the _is_conv_node function (#2257)
Add support for fbgemm fp8 kernels (#2276)
Add Float8ActInt4WeightQATQuantizer (#2289)
[float8] add _auto_filter_for_recipe to float8 (#2410)
NVfp4 (#2408)
[float8] Prevent quantize_affine_float8/dequantize_affine_float8 decomposed on inductor (#2379)
[CPU] Enable DA8W4 on CPU (#2128)
Add exportable coreml codebook quantization op (#2443)
Add support for Int4GroupwisePreshuffleTensor for fbgemm (#2421)

Improvement

Add serialization support for AOPerModuleConfig (#2186)
Set eps in end-to-end QAT flow (#2180)
Enable {conv3d, conv_transpose3d} + bn fusion in pt2e (#2212)
Update GemLite to support vLLM V1 (#2199)
[sparse] Add fp8 sparse gemm with rowwise scaling for activation sparsity (#2242)
Patch the _is_conv_node function (#2223)
Relax int4wo device mismatch error (#2254)
Rename AOPerModuleConfig to ModuleFqnToConfig (#2243)
[reland2][ROCm] preshuffled weight mm (#2207)
GPTQ updates (#2235)
Fix QAT range learning, ensure scales get gradients (#2280)
Fix slicing and get_plain() in GemLite (#2288)
Add slicing support for fbgemm fp8 and int4 (#2308)
Add support for bmm and to for fbgemm Tensor (#2337)
Add dynamic quantization support to gemlite layout (#2327)
Test PARQ with torchao activation quantization (#2370)
Update index.rst (#2395)
Add inplace quantizer examples (#2345)
Build mxfp4 kernel for sm120a (#2285)
Enable to_mxfp8 cast for DTensor (#2420)
Enable tensor parallelism for MXLinear (#2434)
Graduate debug handle in torchao (#2452)
Switch alignment to 8 for cutlass 4 upgrade (#2491)
Mxfp8 training: add TP sharding strategy for dim1 kernel (#2436)

Bug Fixes

[optim] Fix low-bit optim when used with FSDP2+CPUOffload (#2195)
Fix Per Row scaling for inference (#2253)
Fix benchmark_low_bit_adam.py reference (#2287)
[optim] Fix bug when default dtype is BF16 (#2286)
[sparse] marlin fixes (#2305)
Fix ROCM test failures (#2362)
[float8] Add fnuz fp8 dtypes to Float8Layout ([https://github.com/pyto...

Contributors

malfet, rohansjoshi, and 16 other contributors


09 May 22:08

andrewor14

v0.11.0

f34b473


Highlights

We are excited to announce the 0.11.0 release of torchao! This release adds support for mixture-of-experts (MoE) quantization, PyTorch 2 Export Quantization (PT2E), and a microbenchmarking framework for inference APIs!

MoE Quantization

We’ve added a prototype feature for quantizing MoE modules with a number of TorchAO quantization techniques. This approach leverages the existing TorchAO features for quantizing linear ops and allows them to be used to quantize MoE modules.

import torch
from torchao.quantization.prototype.moe_quant.utils import cond_ffn_filter, MoEQuantConfig
from torchao.quantization.quant_api import quantize_, Int8WeightOnlyConfig

quantize_(
    model,
    MoEQuantConfig(Int8WeightOnlyConfig()),
    filter_fn=cond_ffn_filter,
)
# is_single_token_inference is a bool defined by the caller (not shown)
model = torch.compile(
    model,
    mode="reduce-overhead",
    fullgraph=is_single_token_inference,
)

The above API is all that is needed to quantize an MoE module if the module is written to be both quantizable and compilable. In practice, it's rare for a user model to satisfy these conditions due to the variety of MoE implementations, so the normal MoE module must first be swapped with a MoEFeedForwardAOQuantizable module to prepare the model for quantization. An example can be found in llama4_quant.py, which demonstrates this technique for the Hugging Face llama-4-Scout-17B-16E-Instruct model.

We implemented MoE quantization with two methods. The first method (designated `base` in the benchmarks below) extends the existing quantized tensor subclass to quantize the 3D MoE expert tensors and perform the necessary indexing and slicing ops. The second method (`fake`) uses a new tensor subclass to simulate a 3D quantized parameter by storing a sequence of 2D slices of the quantized parameter. The first approach is faster with marginally worse memory characteristics. In both cases, MoE quantization done this way isn’t expected to be maximally performant compared to implementing fused MoE kernels for each technique, but it can yield both moderate speedups and significant memory savings.

The following benchmarks are for mixtral-moe run on a single H100 GPU:

Technique	tok/s (batchsize 1)	memory (GB) (batchsize 1)	tok/s (batchsize 8)	tok/s * batch (batchsize 8)	memory (GB) (batchsize 8)
None	78.35	93.76	18.2	145.64	94.12
int8wo-base	98.4	48.87	4.94	39.56	49.2
int4wo-base	79.38	36.15	10.29	82.29	36.12
fp8wo-base	59.41	52.07	2.98	23.81	52.05
fp8dq-base	45.92	53.97	3.78	30.23	53.94
int8wo-fake	6.14	49.13	5.01	40.09	49.23
int4wo-fake	14.25	30.21	11.84	94.75	30.19
fp8wo-fake	3.2	50.31	2.88	23.08	50.29
fp8dq-fake	9.78	50.92	4.08	32.61	50.89
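To read the table, it can help to turn the raw numbers into relative speedup, memory savings, and aggregate throughput. The sketch below does this for a few rows copied from the table; the `summarize` helper and its output keys are illustrative names, and the aggregate throughput is simply tok/s multiplied by the batch size.

```python
# (tok/s at batchsize 1, memory in GB at batchsize 1, tok/s at batchsize 8)
results = {
    "None": (78.35, 93.76, 18.2),
    "int8wo-base": (98.4, 48.87, 4.94),
    "int4wo-fake": (14.25, 30.21, 11.84),
}


def summarize(name, batch=8):
    """Compare one technique against the unquantized baseline row."""
    base_tok, base_mem, _ = results["None"]
    tok, mem, tok_batched = results[name]
    return {
        "speedup_bs1": tok / base_tok,             # >1 means faster than baseline
        "memory_saved": 1 - mem / base_mem,        # fraction of memory saved
        "aggregate_tok_s": tok_batched * batch,    # the "tok/s * batch" column
    }


for technique in ("int8wo-base", "int4wo-fake"):
    stats = summarize(technique)
    print(technique, {k: round(v, 2) for k, v in stats.items()})
```

The aggregate values computed here differ slightly from the table's "tok/s * batch" column because the table's tok/s entries are rounded.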

PT2 Export Quantization

We migrated PyTorch 2 Export (PT2E) quantization from PyTorch to torchao as part of the planned migration. We’ll follow up by adding deprecation warnings to the PyTorch torch.ao.quantization APIs and updating the docs. We also simplified the import path for some of the util functions. Here is a non-exhaustive list of APIs you can use:

# top level APIs
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, prepare_qat_pt2e, convert_pt2e
from torchao.quantization.pt2e.quantizer import X86InductorQuantizer

# export utils
from torchao.quantization.pt2e import (
   move_exported_model_to_eval,
   move_exported_model_to_train,
   allow_exported_model_train_eval
)

# graph utils
from torchao.quantization.pt2e import (
   find_sequential_partitions,
   get_equivalent_types,
   update_equivalent_types_dict,
   bfs_trace_with_node_process,
)

# pt2e numeric debugger
from torchao.quantization.pt2e import (
   generate_numeric_debug_handle,
   CUSTOM_KEY,
   NUMERIC_DEBUG_HANDLE_KEY,
   prepare_for_propagation_comparison,
   extract_results_from_loggers,
   compare_results,
)

Microbenchmarking Framework for Inference APIs

We’ve introduced a streamlined microbenchmark framework to help developers track and evaluate the performance of their post-training quantization and sparsity APIs for different matrix sizes and model types. The framework also supports GPU and memory profiling, providing deeper insight into performance characteristics.

To run the benchmarks, use the following command:

python -m benchmarks.microbenchmarks.benchmark_runner --config benchmarks/microbenchmarks/test/benchmark_config.yml

Sample Benchmark Results (on 1xH100):

Name	Quantization	Shape	Baseline Inference Time (ms)	Inference Time (ms)	Speedup
small_bf16_linear	float8dq-tensor	16384, 16384, 16384	13.34	7.72	1.73x
small_bf16_linear	float8dq-tensor	16384, 16384, 32768	26.04	14.62	1.78x
small_bf16_linear	float8dq-tensor	16384, 16384, 65536	53.59	29.05	1.84x
small_bf16_linear	float8dq-tensor	16384, 32768, 32768	68.94	28.07	2.46x
small_bf16_linear	float8dq-tensor	16384, 32768, 65536	108.63	58.7	1.85x
small_bf16_linear	float8dq-tensor	16384, 65536, 65536	215.66	118.42	1.82x
small_bf16_linear	float8dq-tensor	32768, 32768, 32768	108.16	57.09	1.89x
small_bf16_linear	float8dq-tensor	32768, 32768, 65536	214.74	110.08	1.95x
small_bf16_linear	float8dq-tensor	32768, 65536, 65536	432.44	223.46	1.94x
small_bf16_linear	float8dq-tensor	65536, 65536, 65536	870.37	447.97	1.94x

BC Breaking

Remove prototype low bit optim code completely (#2159)

New Features

Add quantized attn_scores @ v test for intended use in quantized attention (#2008)
Add fallback kernel and interface (#2010)
Add fallback kernel and interface for rhs only quantized matmul (#2011)
Add KleidiAI gemm kernels (#2000)
Use quantized gemm only on aarch64 (#2023)
Adds utility to replace Q/DQ ops with torchao quantized linear ops (#1967)
Adds Q/DQ layout support for embedding quantization with IntxWeightOnlyConfig (#1972)
Move Int8DynamicActivationIntxWeightConfig out of experimental (#1968)
Initial ParetoQ commit (#1876)
INT4 XPU enabling (#1577)
Vectorized row sum (#2034)
Add gemm for fp32_a_int8_b matmul kernel (#2039)
Add gemm kernel to interface (#2040)
Add tests for attention matmul for gemm kernels (#2041)
Gemm int8 a int8 b kernels (#2049)
Add tests cases for q @ k attention variant (#2051)
Add gemm int8 a x int8 b to interface (#2055)
[Quant][PT2E][X86] Enable annotation of aten.mul.tensor with X86InductorQuantizer (#2075)
Add AOPerModuleConfig to torchao.quantization (#2134)
Enabling MoE Quantization using linear decomposition (#2043)

Improvement

Match QAT prepare and convert numerics exactly (#1964)
[Prototype] Update torchao.prototype.parq and add 4-bit Llama 3.2 1B benchmark ([https://gith...

Contributors

syed-ahmed, SalmanMohammadi, and 3 other contributors


07 Apr 19:57

jerryzh168

v0.10.0

8b264ce


Highlights

We are excited to announce the 0.10.0 release of torchao! This release adds support for end-to-end training for mxfp8 on NVIDIA B200, PARQ (for quantization-aware training), a module swap quantization API for research, and some updates for low-bit kernels!

Low Bit Optimizers moved to Official Support (#1864)

Low bit optimizers (added in 0.4) have moved out of prototype and now have official support in torchao.

[Prototype] End to End Training Support for mxfp8 on NVIDIA B200 (#1786, #1841, #1951, #1932, #1980)

We have an early version of the end to end training workflow for the mxfp8 dtypes with torch.compile on NVIDIA B200, with the cuBLAS mxfp8 gemm seeing an observed speedup of over 2x over bfloat16 gemm, and casts from bfloat16 to mxfp8 achieving up to 5.5 TB/s. Please see our README.md for MX for more information. We plan to improve performance further in future releases.

[Prototype] Piecewise-Affine Regularized Quantization (#1738)

PARQ is a new theoretical framework for inducing quantization through regularization. It supports standard QAT, as well as new gradual quantization methods, in an easy to use optimizer-only interface. No modifications to a model’s forward or backward pass are needed for quantization.

import torch
from torchao.prototype.parq.optim import QuantOptimizer, ProxHardQuant
from torchao.prototype.parq.quant import UnifQuantizer

# Separate quantizable from non-quantizable parameter groups
# (weights and others are lists of model parameters, not shown)
param_groups = [
    {"params": weights, "quant_bits": 2},  # add extra quant_bits key for QAT
    {"params": others},
]

# Initialize any torch.optim.Optimizer
base_optimizer = torch.optim.SGD(param_groups, lr=0.1, momentum=0.9, weight_decay=1e-4)

# Apply a simple wrapper to quantize in optimizer.step()
optimizer = QuantOptimizer(
    base_optimizer, quantizer=UnifQuantizer(), prox_map=ProxHardQuant()
)
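The idea of quantizing inside `optimizer.step()` can be pictured as a regular gradient update followed by a projection of the weights onto a quantization grid. The pure-Python sketch below shows that two-stage step with a hard (nearest-grid-point) projection. It is a generic projected-gradient illustration with hypothetical function names; PARQ's `UnifQuantizer` and `ProxHardQuant` may build their grids and proximal maps differently.

```python
def uniform_grid(values, bits):
    """Evenly spaced symmetric grid with 2**bits levels spanning [-amax, amax]."""
    amax = max(abs(v) for v in values) or 1.0  # avoid a degenerate all-zero grid
    levels = 2 ** bits
    step = 2 * amax / (levels - 1)
    return [-amax + i * step for i in range(levels)]


def hard_quant_project(values, bits):
    """Snap every value to its nearest grid point (a hard projection)."""
    grid = uniform_grid(values, bits)
    return [min(grid, key=lambda g: abs(g - v)) for v in values]


def quantized_sgd_step(weights, grads, lr, bits):
    """Plain SGD update, then projection onto the quantization grid."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return hard_quant_project(updated, bits)


weights = [0.9, -0.2, 0.4, -1.0]
grads = [0.5, -0.1, 0.0, 0.2]
print(quantized_sgd_step(weights, grads, lr=0.1, bits=2))
```

Because the projection happens after the update, the model's forward and backward passes are left untouched, which matches the optimizer-only interface described above.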

[Prototype] Module Swap Quantization API (#1886)

We added a prototype API for post-training quantization. Users can swap their linear or embedding layers into their QuantizedLinear and QuantizedEmbedding counterparts, and set the quantizers that specify how they want the input activations or weights to be quantized:

quantized_linear = QuantizedLinear(...)
quantized_linear.weight_quantization = IntQuantizer(
    num_bits=4,
    group_size=32,
    dynamic=True,
    quantization_mode="symmetric",
)
quantized_linear.input_quantization = CodeBookQuantizer(
    num_bits=8,
    features=10,
)

Note: The API is highly subject to change and will be integrated with quantize_ in the future. For more detail, please see the README.

[Prototype] Low Bit Kernels Updates (#1826, #1935, #1998, #1652)

Low-bit CPU and MPS kernels are now pip installable from source. To install torchao with low-bit CPU kernels, you can use the following command on an Arm-based Mac:

USE_CPP=1 pip install git+https://github.com/pytorch/ao.git

You can then quantize your model to run on Arm-based Macs with high-performance CPU kernels in torchao. SharedEmbeddingQuantizer, EmbeddingQuantizer, and Int8DynamicActivationIntxWeightConfig all support 1-8 bit quantization.

import torch
from torchao.experimental.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    SharedEmbeddingQuantizer,
    EmbeddingQuantizer,
)
from torchao.quantization.granularity import PerGroup, PerRow
from torchao.quantization.quant_api import quantize_

# Quantize embedding/unembedding to 8-bits with SharedEmbeddingQuantizer.
# SharedEmbeddingQuantizer is for quantizing models like Llama1B/3B
# where the embedding/unembedding layers share weights.
# If the embedding/unembedding layers do not share weights, use
# EmbeddingQuantizer instead.
SharedEmbeddingQuantizer(
    weight_dtype=torch.int8,
    granularity=PerRow(),
    has_weight_zeros=True,
).quantize(model)

# Quantize linear layers to 4-bits
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        granularity=PerGroup(128),
        has_weight_zeros=False,
    ),
)

BC Breaking

Delete delayed scaling from torchao.float8 (#1753)

The following usage of `Float8LinearConfig` is no longer supported as of torchao v0.10.0:

config = Float8LinearConfig(
    cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
    cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
    cast_config_grad_output=CastConfig(scaling_type=ScalingType.DELAYED),
)

If you would like to use float8 training with delayed scaling, please use an earlier release of torchao. Please see #1680 for more context about this deprecation.

Enforce AOBaseConfig type in `quantize_`'s `config` argument (#1861)

This was done following a deprecation window to simplify the arguments of quantize_, please see #1690 for more context.

# torchao v0.9.0
def quantize_(
    model: torch.nn.Module,
    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
    set_inductor_config: Optional[bool] = None,
    device: Optional[torch.types.Device] = None,
):

# torchao v0.10.0
def quantize_(
    model: torch.nn.Module,
    config: AOBaseConfig,
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
    set_inductor_config: Optional[bool] = None,  
    device: Optional[torch.types.Device] = None,
):

Remove the `set_inductor_config` argument of `quantize_`. (#1865)

This was done following a deprecation window to decouple quantize_ from torchinductor, please see #1715 for more context.

# torchao v0.9.0
def quantize_(
    ...,
    set_inductor_config: Optional[bool] = None,
    ...,
):  
    # if set_inductor_config != None, throw a deprecation warning
    # if set_inductor_config == None, set it to True to stay consistent with old behavior

# torchao v0.10.0
def quantize_(
    ...,
):
    # set_inductor_config is removed from quantize_ and moved to relevant individual workflows

Deprecations

We removed some of our prototype features that are not used, including DORA (#1815), split_k kernel (#1816), profiler (#1862) and bitnet (#1866).

New Features

QAT

Added PARQ (#1738)

Low Bit Optimizers

Promote Low Bit Optim out of prototype (#1864)

Module swap quantization API

Add module swap quantization API from Quanty (#1886)

Benchmarking

Micro-benchmark inference (#1759)
Add sparsity to benchmarking (#1917)
Add float8 training benchmarking scripts (#1802)

Improvement

Kernels

1-8 bit CPU and MPS kernels are now pip installable from source (#1826)
Added 1-8 bit shared embedding ops to further compress models like Llama1B/3B where the embedding/unembedding weights are shared (#1935)
CPU kernels added runtime microkernel selection based on CPU features and matrix size (#1998)
KleidiAI microkernel library was integrated with CPU kernels to improve GEMM performance on Arm CPUs (#1652)
Add build flag to set parallel_backend (#1870)
Add quant api + python test for shared embedding (#1937)
Add dynamic shape support for lowbit kernels (#1942)
Add LUT-based bitpacking for 1-4 bits (#1987)
Add lut support to linear kernel ([https://github.com/pytorc...

Contributors

facebook-github-bot, lisjin, and 6 other contributors


28 Feb 14:23

HDCharles

v0.9.0

14cfbc7


Highlights

We are excited to announce the 0.9.0 release of torchao! This release moves a number of sparsity techniques out of prototype, significantly overhauls the quantize_ API, adds a new CUTLASS kernel for 4-bit dynamic quantization, and more!

Block Sparsity promoted out of prototype

We’ve promoted block sparsity out of torchao.prototype and made several performance improvements.
You can accelerate your models with block sparsity as follows:

from torchao.sparsity import sparsify_, block_sparse_weight
sparsify_(model, block_sparse_weight(blocksize=64))

Blocksparse Benchmarks

Technique	Decode (tok/s)	Model Size (GB)
baseline	134.40	15.01
2:4 sparse	163.13	10.08
bsr-0.8-32	210.91	6.01
bsr-0.8-64	222.43	6.00
bsr-0.9-32	255.19	4.88
bsr-0.9-64	262.94	4.88
2:4 sparse + int4wo (marlin)	255.21	3.89

Block Sparsity technique names (bsr) indicate sparsity fraction and blocksize.

These numbers were generated on H100 using torchao/_models/llama/generate.py on the Meta-Llama-3.1-8B model. You can reproduce these numbers using this script.
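As a quick aid for reading the technique names, the sketch below parses a `bsr-<sparsity>-<blocksize>` label and estimates how many blocks of a weight matrix stay nonzero. The helper names and the 4096 x 4096 example shape are hypothetical and chosen only for illustration.

```python
def parse_bsr_name(name):
    """Split a label like 'bsr-0.9-64' into (sparsity fraction, blocksize)."""
    prefix, sparsity, blocksize = name.split("-")
    if prefix != "bsr":
        raise ValueError(f"not a block sparsity label: {name!r}")
    return float(sparsity), int(blocksize)


def nonzero_blocks(rows, cols, sparsity, blocksize):
    """Return (blocks kept, total blocks) for a rows x cols weight matrix."""
    total = (rows // blocksize) * (cols // blocksize)
    return round(total * (1 - sparsity)), total


sparsity, blocksize = parse_bsr_name("bsr-0.9-64")
kept, total = nonzero_blocks(4096, 4096, sparsity, blocksize)
print(f"{kept} of {total} blocks of size {blocksize}x{blocksize} are nonzero")
```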

BC Breaking

TorchAO M1 Binaries currently not working

We've identified that the binaries are broken on M1 and have been since v0.8.0, though they were working in v0.7.0. We're working on a fix for this; details and discussion can be found here.

quantize_ configuration callables -> configs (#1595, #1694, #1696, #1697)

We are migrating the way quantize_ workflows are configured from callables (tensor subclass inserters) to direct configuration (config objects). Motivation: align with the rest of the ecosystem, enable inspection of configs after instantiation, remove a common source of confusion.

What is changing:

Specifically, here is how the signature of quantize_'s second argument will change:

#
# torchao v0.8.0 and before
#
def quantize_(
    model: torch.nn.Module,
    apply_tensor_subclass: Callable[[torch.nn.Module], torch.nn.Module],
    ...,
): ...

#
# torchao v0.9.0
#
def quantize_(
    model: torch.nn.Module,
    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],
    ...,
): ...

#
# torchao v0.10.0 or later (exact version TBD)
#
def quantize_(
    model: torch.nn.Module,
    config: AOBaseConfig,
    ...,
): ...

the name of the second argument to quantize_ changed from apply_tensor_subclass to config. Since the vast majority of callsites today are passing in configuration with a positional argument, this change should not affect most people.
the type of the second argument to quantize_ will change from Callable[[torch.nn.Module], torch.nn.Module] to AOBaseConfig, following a deprecation process detailed below.
for individual workflows, the user facing API name changed from snake case (int8_weight_only) to camel case (Int8WeightOnlyConfig). All argument names for each config are kept as-is. We will keep the old snake case names (int8_weight_only) around and alias them to the new names (int8_weight_only = Int8WeightOnlyConfig), to avoid breaking callsites. We plan to keep the old names forever. Here are all the workflow config name changes:

old name (will keep working)	new name (recommended)
`int4_weight_only`	`Int4WeightOnlyConfig`
`float8_dynamic_activation_float8_weight`	`Float8DynamicActivationFloat8WeightConfig`
`float8_static_activation_float8_weight`	`Float8StaticActivationFloat8WeightConfig`
`float8_weight_only`	`Float8WeightOnlyConfig`
`fpx_weight_only`	`FPXWeightOnlyConfig`
`gemlite_uintx_weight_only`	`GemliteUIntXWeightOnlyConfig`
`int4_dynamic_activation_int4_weight`	`Int4DynamicActivationInt4WeightConfig`
`int8_dynamic_activation_int4_weight`	`Int8DynamicActivationInt4WeightConfig`
`int8_dynamic_activation_int8_semi_sparse_weight`	n/a (deprecated)
`int8_dynamic_activation_int8_weight`	`Int8DynamicActivationInt8WeightConfig`
`int8_weight_only`	`Int8WeightOnlyConfig`
`uintx_weight_only`	`UIntXWeightOnlyConfig`

Configuration for prototype workflows using quantize_ will be migrated at a later time.

How these changes can affect you:

If you are a user of existing quantize_ API workflows and are passing in config by a positional argument (quantize_(model, int8_weight_only(group_size=128))), you are not affected. This positional syntax will keep working going forward. You are encouraged to migrate your callsite to the new config name (quantize_(model, Int8WeightOnlyConfig(group_size=128))), though the old names will continue to work indefinitely.
If you are a user of existing quantize_ API workflows and are passing in config by a keyword argument (quantize_(model, apply_tensor_subclass=int8_weight_only(group_size=128))), your callsite will break. You will need to change your callsite to quantize_(model, config=int8_weight_only(group_size=128)). We don't expect many people to be in this bucket.
If you are a developer writing new workflows for the quantize_ API, you will need to use the new configuration system. Please see #1690 for details.
If you are a user of sparsify_, you are not affected for now and a similar change will happen in a future version of torchao.

This migration will be a two step process:

in torchao v0.9.0, we will enable the new syntax while starting the deprecation process for the old syntax.
in torchao v0.10.0 or later, we will remove the old syntax.

Please see #1690 for more details.

Block Sparsity import path changed after moving out of prototype (#1734)

Before:

from torchao.prototype.sparsity.superblock.blocksparse import block_sparse_weight

After:

from torchao.sparsity import block_sparse_weight

Deprecations

Deprecation of the `set_inductor_config` argument of `quantize_` (#1716)

We are migrating the set_inductor_config argument of quantize_ to individual workflows. Motivation:

this functionality was intended for inference, and we don't want to expose it to future training workflows that we plan to add to quantize_.
higher level, this flag couples torchao workflows with torch.compile, which is not ideal. We would rather keep these systems decoupled at the quantize_ API level, with individual workflows opting in as needed.

Impact on users

for torchao v0.9.0: if you are passing in set_inductor_config to quantize_, your callsite will keep working with a deprecation warning. We recommend that you migrate this option to your individual workflow.
for a future version of torchao: the set_inductor_config argument will be removed from quantize_.

API changes

# torchao v0.8.x
def quantize_(
    ...,
    set_inductor_config: bool = True,
    ...,
): ...

# torchao v0.9.0
def quantize_(
    ...,
    set_inductor_config: Optional[bool] = None,
    ...,
):
    # if set_inductor_config is not None, emit a deprecation warning
    # if set_inductor_config is None, set it to True to stay consistent with old behavior

# torchao v TBD (a future release)
def quantize_(
    ...,
):
    # set_inductor_config is removed from quantize_ and moved to relevant individual workflows

Please see #1715 for more details.
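
The v0.9.0 behavior described in the comments above can be sketched in plain Python. This is a minimal illustration, not the torchao source: the helper name `resolve_set_inductor_config`, the warning class, and the message text are all assumptions.

```python
import warnings
from typing import Optional


def resolve_set_inductor_config(set_inductor_config: Optional[bool] = None) -> bool:
    """Hypothetical helper mirroring the v0.9.0 comments above."""
    if set_inductor_config is not None:
        # The caller passed the argument explicitly: warn, but keep honoring it.
        warnings.warn(
            "set_inductor_config is deprecated; configure it in the individual workflow",
            DeprecationWarning,
            stacklevel=2,
        )
        return set_inductor_config
    # Not passed: fall back to True to match the pre-0.9.0 default.
    return True


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_value = resolve_set_inductor_config()        # no warning
    explicit_value = resolve_set_inductor_config(False)  # warns once
```

Using `None` as the default is what lets the function tell "argument omitted" apart from "argument explicitly set to True".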

Deprecation warning for float8 training delayed and static scaling (#1681, #1680)

We plan to deprecate delayed and static scaling from the torchao.float8 training codebase due to the lack of real-world use cases for delayed/static scaling (dynamic scaling is required for higher accuracy) and the complexity tax of supporting these features.

for torchao v0.9.0: add deprecation warning for delayed and static scaling
for torchao v0.10.0: deprecate delayed and static scaling

New Features

Supermask for improving accuracy for sparse models (#1729)

Supermask (https://pytorch.org/blog/speeding-up-vits/) is a technique for improving the accuracy of block sparsified models by learning a block-sparse mask during a training phase.

from torchao.sparsity import SupermaskLinear, block_sparse_weight, sparsify_

sparsify_(model, lambda x: SupermaskLinear.from_linear(x, block_size=64, sparsity_level=0.9))
# training here

# collapse supermask into a normal linear layer (with many weights set to 0) and then convert to block sparse format for inference speedup
sparsify_(model, lambda x: SupermaskLinear.to_linear(x, sparsity_level=0.9))
sparsify_(model, block_sparse_weight(blocksize=64))

Dynamic quantization W4A4 CUTLASS-based kernel (#1515)

This kernel, which adds support for 4-bit dynamic activation + 4-bit weight quantization, can be used as follows:

from torchao.quantization import quantize_, int4_dynamic_activation_int4_weight
quantize_(model, int4_dynamic_activation_int4_weight())

Improvements

Earl...

Contributors

balancap, jaewoosong, and 3 other contributors

15 Jan 18:25

jainapurva

v0.8.0

192eed5

v0.8.0

Highlights

We are excited to announce the 0.8.0 release of torchao! In this release we’ve shipped the first CUTLASS kernel in torchAO, which adds support for a W4A8 linear operator. In addition, we’ve added TTFT benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding.

W4A8 based on CUTLASS

A new W4A8 linear operator is implemented. It corresponds to int8_dynamic_activation_int4_weight quantization, where two 4-bit weights are packed into a single 8-bit integer value. CUTLASS is also now a submodule of the torchao repo, so that more of its functionality can be used to implement new kernels.
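
The packing of two 4-bit weights into one byte can be sketched as follows. This is an illustration only: the nibble order (first value in the low nibble) and the helper names are assumptions, and the layout actually used by the CUTLASS kernel may differ.

```python
from typing import Tuple


def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (range -8..7) into one byte (0..255)."""
    for v in (lo, hi):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in a signed 4-bit integer")
    # Keep the two's-complement low 4 bits of each value.
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(byte: int) -> Tuple[int, int]:
    """Recover the two signed 4-bit values from a packed byte."""
    def sign_extend(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble

    return sign_extend(byte & 0xF), sign_extend((byte >> 4) & 0xF)


packed = pack_int4_pair(-3, 7)
unpacked = unpack_int4_pair(packed)
```

Storing weights this way halves the memory needed relative to one int8 value per weight, which is where the smaller model size in the benchmark table comes from.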

Benchmarks on A100

`-q parameter`	Average tokens/sec	Average Bandwidth in GB/s	Peak Memory Usage in GB	Model Size in GB
	95.24	1258.55	13.90	13.21
`-q int8wo`	155.31	1028.37	8.97	6.62
`-q int4wo-32`	186.70	774.98	5.31	4.15
`-q int4wo-hqq`	186.47	774.01	5.04	4.15
`-q int8dq`	49.64	328.72	9.44	6.62
`-q w4a8-cutlass` (tuned)	119.31	394.86	4.52	3.31
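
As a sanity check on the table, the bandwidth column appears to track tokens/sec multiplied by model size (each decoded token reads the weights roughly once). That relationship is inferred from the numbers, not stated in the release notes, so treat the sketch below as an assumption-laden cross-check.

```python
# (tokens/sec, reported bandwidth in GB/s, model size in GB), copied from the table above
rows = {
    "int8wo": (155.31, 1028.37, 6.62),
    "int4wo-32": (186.70, 774.98, 4.15),
    "int8dq": (49.64, 328.72, 6.62),
    "w4a8-cutlass": (119.31, 394.86, 3.31),
}


def estimated_bandwidth(tokens_per_sec: float, model_size_gb: float) -> float:
    """Assumed model: every generated token streams the full set of weights once."""
    return tokens_per_sec * model_size_gb


# Relative gap between the estimate and the reported bandwidth, per row.
relative_errors = {
    name: abs(estimated_bandwidth(tps, size) - bw) / bw
    for name, (tps, bw, size) in rows.items()
}
```

For these rows the estimate lands within a fraction of a percent of the reported value.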

Prefill performance benchmarks

We’ve added TTFT benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding. During prefill, we are compute bound and find that dynamic quantization offers greater speedups than weight-only quantization, which is faster for decoding. We’ve also added an option for int8 dynamic quantization that selectively uses dynamic quantization during prefill and weight-only quantization during decoding (#1436).

BC Breaking

Delete the float8-all-gather-only functionality from float8 training (#1451)

The use_fp8_all_gather_only was an experimental flag, off by default, which was not marketed and not used by anyone as far as we know. We are removing it to simplify the code.

Before

config = Float8LinearConfig(
    ...,
    # the option below is being removed
    use_fp8_all_gather_only=True,
)
convert_to_float8_training(model, config=config, ...)

After

The use_fp8_all_gather_only option is no longer supported.

New Features

Add TTFT benchmarks + update sparsity benchmarks (#1140)
Gemlite integration in torchao (#1034)
W4A8 based on CUTLASS (#880)

Improvements

quantize_

Expose zero_point_domain as arguments (#1401)
Add convert path for quantize_ QAT API (#1540)
Int8 dynamic prefill weight only decode (#1436)

autoquant

Make int8 dynamic quant in autoquant serializable (#1484)
Additional fixes for autoquant serialization (#1486)
Add exhaustive config option to intmm kernel (#1392)

float8 training

[float8] Allow specifying arbitrary dtype for each tensor, enabling recipes with e4m3 in both the forward and the backward (#1378)

experimental

Remove temp build files from torchao (#1551)

other

Torchao setup.py with cmake (#1490)

Bug Fixes

Fix bfloat16/float16/float32 options (#1369)
Fix a bug in LinearActivationQuantizedTensor (#1400)
Fix error message in float8 FSDP utils (#1423)
Fixes observer attachment to model based on config for wanda sparsifier (#1265)
[resubmit] Gemlite fix (#1435)
🐛 Fix: Memory leak in image processing endpoint (#1513)

Performance

[float8] Re-enable slow-accum in the bwd of axis-wise scaling schemes (#1377)

Documentation

Update api_ref_quantization.rst (#1408)
Update index.rst (#1409)
Update QAT READMEs using new APIs (#1541)

Developers

Pytorch/ao/torchao/experimental/ops/mps/test (#1442)
Verify that submodules are checked out (#1536)

New Contributors

@sanchitintel made their first contribution in #1375
@philipbutler made their first contribution in #1337
@airMeng made their first contribution in #1401
@DerekLiu35 made their first contribution in #1299
@agrawal-aka made their first contribution in #1265
@gmagogsfm made their first contribution in #1443
@dongxiaolong made their first contribution in #1513

Full Changelog: v0.7.0...v0.8.0-rc2

Contributors

gmagogsfm, philipbutler, and 5 other contributors

06 Dec 22:13

vkuzo

v0.7.0-rc3

e39126a

v0.7.0

Highlights

We are excited to announce the 0.7.0 release of torchao! This release moves QAT out of prototype with improved LoRA support and more flexible APIs, and adds support for new experimental kernels such as Marlin QQQ (for CUDA), int8_dynamic_activation_intx_weight (for ARM CPU), and more!

QAT moved out of prototype, LoRA integration, new flexible APIs (#1020, #1085, #1152, #1037)

QAT has been moved out of prototype to torchao/quantization/qat to provide better API stability guarantees moving forward. In addition to the existing *QATQuantizer classes, we now also support the more flexible FakeQuantizedLinear and FakeQuantizedEmbedding modules for users to configure the exact quantization settings they wish to use during QAT.

import torch
from torchao.quantization.qat.api import FakeQuantizeConfig
from torchao.quantization.qat.embedding import FakeQuantizedEmbedding
from torchao.quantization.qat.linear import FakeQuantizedLinear

# Specify quantization schemes to use during QAT
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=8)

# Replace nn.Linear and nn.Embedding with these in your model
fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config)
fq_embedding = FakeQuantizedEmbedding(16, 32, weight_config=weight_config)
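
To make the two configs above concrete, here is a minimal pure-Python sketch of what "fake quantization" does numerically: values are quantized and immediately dequantized, so they stay in floating point but carry the rounding error. The helper names are hypothetical, and torchao's exact scale, zero-point, and rounding conventions may differ.

```python
from typing import List


def fake_quantize_symmetric(values: List[float], bits: int) -> List[float]:
    """Quantize to signed `bits`-bit integers and immediately dequantize."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [min(max(round(v / scale), qmin), qmax) * scale for v in values]


def fake_quantize_asymmetric(values: List[float], bits: int) -> List[float]:
    """Same idea with a zero point, so the range need not be centered on 0."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Include 0 in the range so that zero stays exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = qmin - round(lo / scale)
    out = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, qmin), qmax)
        out.append((q - zero_point) * scale)
    return out


def fake_quantize_weight_groups(row: List[float], group_size: int, bits: int = 4) -> List[float]:
    """Apply symmetric fake quantization independently to each group of a weight row."""
    out: List[float] = []
    for start in range(0, len(row), group_size):
        out.extend(fake_quantize_symmetric(row[start:start + group_size], bits))
    return out


weights = [0.5, -1.0, 0.25, 0.0, 3.0, -2.0, 1.0, 0.5]
fq_weights = fake_quantize_weight_groups(weights, group_size=4)  # int4, per group
activations = [0.1, 0.9, 2.3, 1.7]
fq_activations = fake_quantize_asymmetric(activations, bits=8)   # int8, one "token"
```

Each group gets its own scale, so the small-magnitude first group is not forced onto the coarse grid implied by the larger second group.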

We also leveraged the new flexible APIs to build a new QAT + LoRA fine-tuning flow in torchtune. Try it out today!

tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3/8B_qat_lora

Marlin QQQ for CUDA (#1113)

Marlin QQQ is an optimized GPU kernel that supports W4A8 mixed precision GEMM. For more details about Marlin QQQ, please refer to the paper.

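from torchao.dtypes import MarlinQQQLayout
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight
from torchao.quantization.quant_primitives import MappingType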
quantize_(
    model,
    int8_dynamic_activation_int4_weight(
        group_size=128,
        mapping_type=MappingType.SYMMETRIC,
        act_mapping_type=MappingType.SYMMETRIC,
        layout=MarlinQQQLayout(),
    ),
)

Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#marlin-qqq.

This is a prototype feature - feel free to try it out!

int8_dynamic_activation_intx_weight Quantization for ARM CPU (#995, #1027, #1254, #1353)

We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon).

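import os
import torch
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight
from torchao.quantization import quantize_

# `precision` is the dtype the model runs in; it is assumed to be defined by the caller
assert precision == torch.float32, "int8_dynamic_activation_intx_weight requires fp32 precision"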

# Build kernels in temp location, and load them in torch
# This requires an ARM CPU
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")
# Quantize model
nbit = 4
assert 1 <= nbit <= 8, "nbit must be 1 to 8"
group_size = 128
has_weight_zeros = False
quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        group_size=group_size,
        nbit=nbit,
        has_weight_zeros=has_weight_zeros,
    ),
)

Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#int8_dynamic_activation_intx_weight-quantization

We are still trying to figure out how to ship the ARM CPU kernels, so the exact API is subject to change.

BC Breaking

Rename AQT#2 LayoutType -> Layout (#1049)

Before:

from torchao.dtypes import (
    BlockSparseLayoutType,
    Int4CPULayoutType,
    MarlinQQQLayoutType,
    MarlinSparseLayoutType,
    SemiSparseLayoutType,
    TensorCoreTiledLayoutType,
    UintxLayoutType,
    Float8LayoutType,
    LayoutType,
    PlainLayoutType,
)

After:

from torchao.dtypes import (
    BlockSparseLayout,
    Int4CPULayout,
    MarlinQQQLayout,
    MarlinSparseLayout,
    SemiSparseLayout,
    TensorCoreTiledLayout,
    UintxLayout,
    Float8Layout,
    Layout,
    PlainLayout,
)

QAT imports after move out of prototype (#1091)

Before:

from torchao.quantization.prototype.qat import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.prototype.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.prototype.qat.fake_quantizer import (
    FakeQuantizer,
)

After:

from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)
from torchao.quantization.qat.linear import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.qat.fake_quantizer import (
    FakeQuantizer,
)

New Features

Add BF16 stochastic rounding option for optimizers (#1124)
Add quantize_() API support for NF4 (#1216)
Support W4A8 Marlin kernel (#1113)

Improvements

quantize_

Add default filtering to remove mis-aligned weights (#1194)
Add tensor parallelism support for int4_weight_only quantization (#1120)
Add support for asymmetric act quant for int8 dynamic quant (#1131)
Add support for groupwise quantization for int8 weight only quantization (#1121)
Add AQT tensor parallel for float8_dynamic_quant (#1078)
Int8wo Embedding Quant (#1167)
Making sure int4 weight only supports cpu as well (#1203)
BF16 support for Quant-LLM kernel (#1147)
Add hardware check to fp8 quant (#1314)
Add support for quantize_() with Float8Linear module (#1344)

autoquant

Added support for Per Tensor Scaling for Float8 Dynamic Autoquant (#1175)
Add floating point options for autoquant and add accuracy measurement (#1355)

benchmarks

Adding batchsize support for torchao llama benchmarks (#1182)
Add capability of benchmarking arbitrary binary (#1107)

experimental

Add embedding ops aten (#1129)
Add embedding ops executorch (#1137)
Add quantized embedding kernels to torchao (#1018)
Allow deprecated declarations when using Parallel ExecuTorch (#1031)
Introduce lowbit quantized linear MPS kernels (#954)
Enable 6-bit kernel (#1027)
Kleidi 4b blockwise gemv prototype (#997)
Experimental 6-bit quantization for Llama in torchchat (#1094)
Introduce 7-bit quantization for Llama in torchchat. (#1139)
Executorch Subclass API (#966) (#995)
8-bit packing support (#1248)
Experimental Enable 8-bit (#1254)
Experimental Benchmarking (#1353)

optimizer

[low-bit optim] Upcast everything to FP32 for internal calculations (#1068)
[Low-bit optim] Support for dcp.save() and dcp.load() (#1217)
Enable CPU Offload for Intel GPU (#1324)

SAM2

SAM2.1 copy (#1172)
SAM2 AMG server side request batching (#1197)
More SAM2-fast server improvements (#1285)
SAM2 Fast AMG: memory profiling and more compile (#1296)
SAM2 AMG cli and other QoL improvements (#1336)
SAM2 AMG cli.py on modal (#1349)
Reduce SAM2 AMG cli startup by using deploy (#1350)
Reduce startup time for SAM2 AMG by using torch.export (#1358)
More batching and improved furious accuracy/performance (#1253)
SAM2.1 and example README (#1048)
SAM2 AMG example mIoU, perf numbers and more SAM2 model annotations (#1196)

other

Add SpinQuant to generate.py (#1069)
SpinQuant (#983)
SmoothQuant using tensor subclassing (#1030)
Expose FakeQuantizeConfigs in QAT quantizers (#1214)
Add module-swap UX for INT8 mixed-precision training (https://github.com/pytorch/...

Contributors

digantdesai, tibidoh, and 20 other contributors

21 Oct 21:45

drisspg

v0.6.1

99c8d52

v0.6.1

Highlights

We are excited to announce the 0.6.1 release of torchao! This release adds support for Auto-Round, Float8 Axiswise scaled training, a BitNet training recipe, an implementation of AWQ and much more!

Auto-Round Support (#581)

Auto-Round is a new weight-only quantization algorithm. It has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bits and 3-bits). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.

from torchao.prototype.autoround.core import (
    apply_auto_round,
    prepare_model_for_applying_auto_round_,
)
from torchao.quantization import quantize_

prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(model_device))

multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

quantize_(model, apply_auto_round(), is_target_module)

Added float8 training axiswise scaling support with per-gemm-argument configuration (#940)

We added experimental support for rowwise scaled float8 gemm to torchao.float8, with per-gemm-input configurability to enable exploration of various recipes. Here is how a user can configure all-axiswise scaling:

import torchao.float8
from torchao.float8.config import Float8LinearRecipeName

# all-axiswise scaling
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.ALL_AXISWISE)
m = torchao.float8.convert_to_float8_training(m, config=config)

# or, a custom recipe by @lw where grad_weight is left in bfloat16
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.LW_AXISWISE_WITH_GW_HP)
m = torchao.float8.convert_to_float8_training(m, config=config)

Early performance benchmarks show that all-axiswise scaling achieves a 1.13x speedup vs bf16 on torchtitan / LLaMa 3 8B / 8 H100 GPUs (compared to 1.17x from all-tensorwise scaling in the same setup), with loss curves that match bf16 and all-tensorwise scaling. Further performance and accuracy benchmarks will follow in future releases.
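
The difference between tensorwise and axiswise (rowwise) scaling can be sketched by computing the scales alone. This is an illustration under stated assumptions: it uses the e4m3fn format, whose largest finite value is 448, and an absmax-based scale; the helper names are hypothetical and torchao's scale computation has more details (clamping, dtype handling) than shown here.

```python
from typing import List

E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format
EPS = 1e-12       # guards against division by zero for all-zero inputs


def tensorwise_scale(mat: List[List[float]]) -> float:
    """One scale for the whole tensor, set by its largest absolute value."""
    amax = max(abs(v) for row in mat for v in row)
    return E4M3_MAX / max(amax, EPS)


def axiswise_scales(mat: List[List[float]]) -> List[float]:
    """One scale per row, set by that row's largest absolute value."""
    return [E4M3_MAX / max(max(abs(v) for v in row), EPS) for row in mat]


activations = [
    [0.01, -0.02, 0.015],  # small-magnitude row
    [30.0, -12.0, 4.0],    # row containing the tensor-wide maximum
]
t_scale = tensorwise_scale(activations)
a_scales = axiswise_scales(activations)
```

With a single tensorwise scale, the small-magnitude row is mapped into a tiny part of the float8 range; with axiswise scales, each row is stretched to use the full range, at the cost of computing and storing more scales.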

Introduced BitNet b1.58 training recipe (#930)

Adds a recipe for BitNet b1.58 (https://arxiv.org/abs/2402.17764) ternary weight clamping.

from torchao.prototype.quantized_training import bitnet_training
from torchao import quantize_

model = ...
quantize_(model, bitnet_training())

Notably, our implementation utilizes INT8 Tensor Cores to make up for this loss in speed. In fact, our implementation is faster than BF16 training in most cases.
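
For readers unfamiliar with ternary weights, the absmean quantization described in the BitNet b1.58 paper can be sketched in a few lines. This operates on a flat list for clarity; the helper name is hypothetical, and the granularity and other details of the torchao recipe are not shown here and may differ.

```python
from typing import List, Tuple


def ternarize(weights: List[float], eps: float = 1e-5) -> Tuple[List[int], float]:
    """Absmean quantization to {-1, 0, +1}, following the BitNet b1.58 paper."""
    gamma = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    # Scale by gamma, round to the nearest integer, then clip to [-1, 1].
    q = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return q, gamma


weights = [0.9, -0.05, 0.4, -1.3, 0.02, 0.6]
ternary, gamma = ternarize(weights)
dequantized = [t * gamma for t in ternary]  # values the forward pass effectively uses
```

Small weights collapse to 0 and everything else to plus or minus one scale unit, which is what allows the matmul to run on low-precision hardware.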

[Prototype] Implemented Activation Aware Weight Quantization AWQ (#743)

Perplexity and performance measured on A100 GPU:

Model	Quantization	Tokens/sec	Throughput (GB/sec)	Peak Mem (GB)	Model Size (GB)
Llama-2-7b-chat-hf	bfloat16	107.38	1418.93	13.88	13.21
	awq-hqq-int4	196.6	761.2	5.05	3.87
	awq-uint4	43.59	194.93	7.31	4.47
	int4wo-hqq	209.19	804.32	4.89	3.84
	int4wo-64	201.14	751.42	4.87	3.74

Usage:

import torch
from torchao.quantization import quantize_
from torchao.prototype.awq import insert_awq_observer_, awq_uintx, AWQObservedLinear

quant_dtype = torch.uint4
group_size = 64
calibration_limit = 10
calibration_seq_length = 1024
model = model.to(device)
insert_awq_observer_(model, calibration_limit, calibration_seq_length, quant_dtype=quant_dtype, group_size=group_size)
with torch.no_grad():
    for batch in calibration_data:
        model(batch.to(device))
is_observed_linear = lambda m, fqn: isinstance(m, AWQObservedLinear)
quantize_(model, awq_uintx(quant_dtype=quant_dtype, group_size=group_size), is_observed_linear)

New Features

[Prototype] Added Float8 support for AQT tensor parallel (#1003)
Added composable QAT quantizer (#938)
Introduced torchchat quantizer (#897)
Added INT8 mixed-precision training (#748)
Implemented sparse marlin AQT layout (#621)
Added a PerTensor static quant api (#787)
Introduced uintx quant to generate and eval (#811)
Added Float8 Weight Only and FP8 weight + dynamic activation (#740)
Implemented Auto-Round support (#581)
Added 2, 3, 4, 5 bit custom ops (#828)
Introduced symmetric quantization with no clipping error in the tensor subclass based API (#845)
Added int4 weight-only embedding QAT (#947)
Added support for 1-bit and 6-bit quantization for Llama in torchchat (#910, #1007)
Added a linear_observer class for doing static activation calibration (#807)
Exposed hqq through uintx_weight_only API (#786)
Added RowWise scaling option for Float8 dynamic activation quantization (#819)
Added Float8 weight only to autoquant api (#866)

Improvements

Enhanced Auto-Round functionality (#870)
Improved FSDP support for low-bit optimizers (#538)
Added support for using AffineQuantizedTensor with weights_only=True for torch.load (#630)
Optimized 3-bit packing (#1029)
Added more evaluation metrics to llama/eval.sh (#934)
Improved eager numerics for dynamic scales in float8 (#904)

Bug fixes

Fixed inference_mode issues (#885)
Fixed failing FP6 benchmark (#931)
Resolved various issues with float8 support (#918, #923)
Fixed load state dict when device is different for low-bit optim (#1021)

Performance

Added SM75 (Turing) support for FP6 kernel (#942)
Implemented int8 dynamic quant + bsr support (#821)
Added workaround to recover the perf for quantized vit in torch.compile (#926)

INT8 Mixed-Precision Training

On NVIDIA GPUs, INT8 Tensor Cores are approximately 2x faster than their BF16/FP16 counterparts. In mixed-precision training, we can down-cast activations and weights dynamically to INT8 to leverage faster matmuls. However, since INT8 has a very limited range [-128,127], we perform row-wise quantization, similar to how INT8 post-training quantization (PTQ) is done. The weight is still kept in its original precision.

from torchao.prototype.quantized_training import int8_mixed_precision_training, Int8MixedPrecisionTrainingConfig
from torchao.quantization import quantize_

model = ...

# apply INT8 matmul to all 3 matmuls
quantize_(model, int8_mixed_precision_training())

# customize which matmul is left in original precision.
config = Int8MixedPrecisionTrainingConfig(
    output=True,
    grad_input=True,
    grad_weight=False,
)
quantize_(model, int8_mixed_precision_training(config))
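
The row-wise INT8 idea described above can be sketched without any GPU code. This is a conceptual illustration, not the torchao kernel: the helper names are hypothetical, the scale is a simple absmax scale per row, and the integer matmul is done with Python integers.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rowwise_int8(mat: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Quantize each row to INT8 with its own scale (symmetric, absmax-based)."""
    q_rows, scales = [], []
    for row in mat:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_matmul(x: Matrix, w: Matrix) -> Matrix:
    """Approximate x @ w.T with integer dot products, then rescale to float.

    x has shape (M, K) and w has shape (N, K), as in a linear layer.
    """
    xq, x_scales = quantize_rowwise_int8(x)
    wq, w_scales = quantize_rowwise_int8(w)
    return [
        [sum(a * b for a, b in zip(xi, wj)) * sx * sw for wj, sw in zip(wq, w_scales)]
        for xi, sx in zip(xq, x_scales)
    ]


x = [[1.0, -2.0, 0.5], [100.0, 50.0, -25.0]]  # rows with very different magnitudes
w = [[0.2, 0.1, -0.3], [1.5, -0.5, 0.25]]
approx = int8_matmul(x, w)
exact = [[sum(a * b for a, b in zip(xi, wj)) for wj in w] for xi in x]
```

Because every row carries its own scale, the small-magnitude first row of `x` keeps its resolution even though the second row is about 50x larger.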

End-to-end speed benchmark using benchmarks/quantized_training/pretrain_llama2.py:

Model & GPU	bs x seq_len	Config	Tok/s	Peak mem (GB)
Llama2-7B, A100	8 x 2048	BF16 (baseline)	~4400	59.69
Llama2-7B, A100	8 x 2048	INT8 mixed-precision	~6100 (+39%)	58.28
Llama2-1B, 4090	16 x 2048	BF16 (baseline)	~17,900	18.23
Llama2-1B, 4090	16 x 2048	INT8 mixed-precision	~30,700 (+72%)	18.34

Docs

Updated README with more current float8 speedup information (#816)
Added tutorial for trainable tensor subclass (#908)
Improved documentation for float8 unification and inference (#895, #896)

Devs

Added compile tests to test suite (#906)
Improved CI setup and build processes (#887)
Added M1 wheel support (#822)
Added more benchmarking and profiling tools (#1017)
Renamed fpx to floatx (#877)
Removed torchao_nightly package (#661)
Added more lint fixes (#827)
Added better subclass testing support (#839)
Added CI to catch syntax errors (#861)
Added tutorial on composing quantized subclass w/ Dtensor based TP (#785)

Security

No significant security updates in this release.

Untopiced

Added basic SAM2 AutomaticMaskGeneration example server (#1039)

New Contributors

@iseeyuan made their first contribution in #805
@YihengBrianWu made their first contribution in #860
@kshitij12345 made their first contribution in #863
@ZainRizvi made their first contribution in #887
@alexsamardzic made their first contribution in #899
@vaishnavi17 made their first contribution in #911
@tobiasvanderwerff made their first contribution in #931
@kwen2501 made their first contribution in #937
@y-sq made their first contribution in #912
@jimexist made their first contribution in #969
@danielpatrickhug made their first contribution in #914
@ramreddymounica made their first contribution in #1007
@yushangdi made their first contribution in h...

Contributors

jimexist, ZainRizvi, and 12 other contributors

08 Sep 17:18

andrewor14

v0.5.0

ae8384b

v0.5.0

Highlights

We are excited to announce the 0.5 release of torchao! This release adds support for memory efficient inference, float8 training and inference, int8 quantized training, HQQ, automatic mixed-precision quantization through bayesian optimization, sparse marlin, and integrations with HuggingFace, SGLang, and diffusers.

Memory Efficient Inference Support #738

We've added support for Llama 3.1 to the llama benchmarks in TorchAO and added new features and improvements as a proof of concept for memory efficient inference. These additions allow us to do 130k context length inference with Llama 3.1-8B with only 18.91 GB memory if we combine kv cache quantization, int4 weight only quantization and a linear causal mask.

General savings depend on technique and context length as can be seen in the following graph:

Float8 Training #551

torchao.float8 implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433.

With torch.compile on, current results show throughput speedups of up to 1.5x on 128 H100 GPU LLaMa 3 70B pretraining jobs (details)

from torchao.float8 import convert_to_float8_training
convert_to_float8_training(m, module_filter_fn=...)

And for an end-to-end minimal training recipe of pretraining with float8, you can check out torchtitan.

Float8 Inference #740 #819

We have introduced two new quantization APIs for Float8 inference:

Float8 Weight-Only Quantization: A new quant_api float8_weight_only() has been added to apply float8 weight-only symmetric per-channel quantization to linear layers.
Float8 Dynamic Activation and Weight Quantization: A new quant_api float8_dynamic_activation_float8_weight() has been introduced to apply float8 dynamic symmetric quantization to both activations and weights of linear layers. PerTensor scaling is used by default. We have also added an option to do PerRow scaling of both activations and weights. Computing scales at a finer granularity can potentially reduce the overall quantization error and increase performance by reducing dynamic quantization overhead.

Example usage:

import torch
from torchao.quantization import quantize_, float8_weight_only, float8_dynamic_activation_float8_weight, PerRow

# Create a model
model = YourModel()

# Apply float8 weight-only quantization
quantize_(model, float8_weight_only())

# Apply float8 dynamic activation and weight quantization
quantize_(model, float8_dynamic_activation_float8_weight())

# Apply PerRow scaling to weight and activations
quantize_(linear_module, float8_dynamic_activation_float8_weight(granularity=PerRow()))

Notes:

These new APIs are designed to work with PyTorch 2.5 and later versions.
float8_dynamic_activation_float8_weight requires CUDA devices with compute capability 8.9 or higher for hardware acceleration.

Int8 quantized training #644 #748

@gau-nernst introduced 2 experimental works on training using INT8.

INT8 quantized training (#644): weight is quantized to INT8 during the whole duration of training to save memory. Compute remains in high precision. To train the model effectively with only quantized weights, we use stochastic rounding for weight update. Right now, memory saving is not too competitive compared to compiled BF16 baseline.
INT8 mixed-precision training (#748): weight is kept in the original high precision, but weight and activation are dynamically quantized to INT8 during training to utilize INT8 tensor cores. We observe up to 70% speedup for Llama2 pre-training on 4090, and 20% speedup for Llama3 pre-training on 8x A100 with FSDP2.

from torchao.quantization import quantize_
from torchao.prototype.quantized_training import int8_weight_only_quantized_training, int8_mixed_precision_training

model = YourModel()

# apply INT8 quantized training
quantize_(model, int8_weight_only_quantized_training())

# apply INT8 mixed-precision training
quantize_(model, int8_mixed_precision_training())

For more information and benchmark results, see README and the respective PR (#644 and #748)
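
The role of stochastic rounding in INT8 quantized training can be illustrated with a small sketch. It is a generic illustration of the technique, not the torchao implementation; the helper name is hypothetical.

```python
import math
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round up with probability equal to the fractional part, otherwise round down."""
    floor = math.floor(x)
    frac = x - floor
    return floor + (1 if rng.random() < frac else 0)


rng = random.Random(0)
samples = [stochastic_round(0.3, rng) for _ in range(20_000)]
mean_stochastic = sum(samples) / len(samples)  # close to 0.3 on average
nearest = round(0.3)                           # round-to-nearest always gives 0
```

With round-to-nearest, a weight update smaller than half a quantization step is always lost. With stochastic rounding, the update survives in expectation, which is why it helps when the weights themselves are stored in INT8.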

HQQ Integration in torchao #605 #786

hqq has been added to existing torchao APIs. It improves model accuracy and leverages the existing efficient kernels in torchao. We enabled hqq for the int4_weight_only API:

quantize_(model, int4_weight_only(group_size, use_hqq=True))

We also added this to the uintx api for accuracy experiments (current uintx kernels are slow):

quantize_(model, uintx_weight_only(torch.uint2, group_size, use_hqq=True))

Automatic Mixed-Precision Quantization through Bayesian Optimization #592, #694

We provided a Bayesian Optimization (BO) tool leveraging Ax to auto search mixed-precision weight-only quantization configuration, i.e., bit width and group size of intN_weight_only(bit_width, group_size) for each layer. It also includes a sensitivity analysis tool to calculate layer-wise average Hessian trace and average fisher information matrix trace, which is an optional step to customize and improve BO search.

To optimize for model accuracy under a model size constraint (GB):

python BO_acc_modelsize.py --checkpoint=/tmp/Meta-Llama-3-8B --model_size_constraint=6.0

To optimize for inference throughput under a model perplexity constraint:

python BO_acc_throughput.py --checkpoint=/tmp/Meta-Llama-3-8B --ppl_constraint=7.5

For more detailed usage, please refer to this README. The mixed-precision quantization found by this tool reduces model size by 20.1% with a 2.8% perplexity reduction, and improves inference throughput by 15.1% with a 3.2% perplexity reduction on the Llama3-8B model compared to int8 uniform quantization.

Sparse Marlin #621, #733

@Diogo-V added support for sparse-marlin, a W4AFP16 2:4 sparse kernel, to TorchAO.
On Meta Llama3, we observe a 25% tok/s increase (180 -> 226) compared to our existing int4-wo implementation.

from torchao.quantization.quant_api import quantize_, int4_weight_only
from torchao.dtypes import MarlinSparseLayoutType
quantize_(model, int4_weight_only(layout_type=MarlinSparseLayoutType()))

Model	Technique	Tokens/Second	Memory Bandwidth (GB/s)	Peak Memory (GB)	Model Size (GB)
Llama-3-8B	Base (bfloat16)	95.64	1435.54	16.43	15.01
	int8dq	8.61	64.75	9.24	7.52
	int8wo	153.03	1150.80	10.42	7.52
	int4wo-64	180.80	763.33	6.88	4.22
	int4wo-64-sparse-marlin	226.02	689.20	5.32	3.05
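
For context, 2:4 sparsity means that in every group of four consecutive weights at most two are nonzero. The sketch below shows one common way to produce such a pattern, by keeping the two largest-magnitude values per group. The magnitude-based selection and the helper name are assumptions for illustration; the release notes do not describe how the sparsity pattern is chosen.

```python
from typing import List


def prune_2_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


pruned = prune_2_4([0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5])
```

Half of the weights are then zero in a regular pattern, so only the nonzero values and their positions need to be stored, which is consistent with the smaller model size reported for the sparse-marlin row above.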

HuggingFace Integration

torchao is integrated into huggingface: https://huggingface.co/docs/transformers/main/en/quantization/torchao. You can now use int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight through TorchAoConfig in huggingface. This is currently available in the huggingface main branch only.

SGLang Integration

torchao is also integrated into sglang (sgl-project/sglang#1341) for the llama3 model. You can try it out with:

python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128

Supported configurations are ["int4wo-<group_size>", "int8wo", "int8dq", "fp8wo" (only available in torchao 0.5+)]

diffusers Integration

diffusers-torchao provides end-to-end inference and experimental training recipes to use torchao with diffusers in this repo. We demonstrate 53.88% speedup on Flux.1-Dev and 27.33% speedup on CogVideoX-5b when comparing compiled quantized models against their standard bf16 counterparts.

BC Breaking

Add layout option to woq int4 api #670

# torchao 0.4.0
from torchao.quantization import quantize_, int4_weight_only
quantize_(my_model, int4_weight_only(inner_k_tiles=8))

# torchao 0.5.0
from torchao.quantization import quantize_, int4_weight_only
quant...

Contributors

raziel, crcrpar, and 9 other contributors

07 Aug 16:48

jcaip

v0.4.0

245ab4e

v0.4.0

Highlights

We are excited to announce the 0.4 release of torchao! This release adds support for KV cache quantization, quantization aware training (QAT), low bit optimizer support, composing quantization and sparsity, and more!

KV cache quantization (#532)

We've added support for KV cache quantization, showing a peak memory reduction from 19.7 -> 19.2 GB on Llama3-8B at an 8192 context length. We plan to investigate Llama3.1 next.
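
A back-of-the-envelope estimate shows why the saving at this context length is about half a gigabyte. The sketch assumes the publicly documented Llama3-8B shape (32 layers, 8 KV heads, head dimension 128), batch size 1, a bf16 baseline, and int8 storage for the quantized cache with negligible overhead for scales; none of these details are stated in the release notes.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length,
                   bytes_per_value, batch_size=1):
    """Size of the KV cache; the factor of 2 covers one key and one value tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_length * bytes_per_value * batch_size)


# Assumed Llama3-8B shape at an 8192 context length.
llama3_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128, context_length=8192)

bf16_gib = kv_cache_bytes(**llama3_8b, bytes_per_value=2) / 2**30
int8_gib = kv_cache_bytes(**llama3_8b, bytes_per_value=1) / 2**30
saved_gib = bf16_gib - int8_gib
```

Under these assumptions the bf16 cache is 1 GiB and the int8 cache is 0.5 GiB, which is roughly in line with the 19.7 -> 19.2 GB peak memory change reported above. The saving grows linearly with context length.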

Quantization-Aware Training (QAT) (#383, #555)

We now support two QAT schemes for linear layers: Int8 per token dynamic activations + int4 per group weights, and int4 per group weights (using the efficient tinygemm int4 kernel after training). Users can access this feature by transforming their models before and after training using the appropriate quantizer, for example:

from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Quantizer for int8 dynamic per token activations +
# int4 grouped per channel weights, only for linear layers
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics during
# training without performing any dtype casting
model = qat_quantizer.prepare(model)

# Convert fake quantize to actual quantize operations
model = qat_quantizer.convert(model)

Initial evaluation results indicate that QAT in torchao can recover up to 96% of quantized accuracy degradation on hellaswag and up to 68% of quantized perplexity degradation on wikitext for Llama3 compared to post-training quantization (PTQ). For more details, please refer to the README and this blog post.

Composing quantization and sparsity (#457, #473)

We've added support for composing int8 dynamic quantization with 2:4 sparsity, using the quantize_ API. We also added SAM benchmarks that show a 7% speedup over standalone sparsity or standalone int8 dynamic quantization.

from torchao.quantization import quantize_, int8_dynamic_activation_int8_semi_sparse_weight
quantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())
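For readers unfamiliar with the 2:4 pattern: in every contiguous group of four weights, at most two may be nonzero. The sketch below illustrates the pattern on a plain Python list. Choosing the survivors by magnitude is an assumption made for illustration, and `prune_2_4` is a hypothetical helper, not part of torchao.

```python
def prune_2_4(row):
    """Zero out the two smallest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


print(prune_2_4([0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.4]))
# prints [0.0, -0.9, 0.5, 0.0, 1.0, 0.0, 0.0, 0.4]
```

Because exactly half of each group is zero, the weights can be stored in a compressed form, which is what makes composing the pattern with int8 quantization attractive.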

Community Contributions

low-bit optimizer support (#478, #463, #482, #484, #538)

@gau-nernst added implementations for 4-bit, 8-bit, and FP8 Adam with FSDP2/FSDP support. Our API is a drop-in replacement for torch.optim.Adam and can be used as follows:

from torchao.prototype.low_bit_optim import Adam8bit, Adam4bit, AdamFp8
from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8


model = ...
optim = Adam8bit(model.parameters())  # use Adam4bit or AdamFp8 for the 4-bit or FP8 variants

For more information about low bit optimizer support please refer to our README.
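As a rough guide to why low-bit optimizer state matters, the sketch below estimates Adam state size at different bit widths. It assumes two state values per parameter (the first and second moment estimates), an fp32 baseline, and a hypothetical 8B-parameter model, and it ignores the per-block scaling metadata that a real low-bit implementation stores.

```python
def adam_state_bytes(num_params, bits_per_state_value):
    """Approximate optimizer state size: two state values per parameter."""
    return num_params * 2 * bits_per_state_value // 8


num_params = 8_000_000_000  # hypothetical model size, for illustration only

for name, bits in [("fp32 Adam", 32), ("8-bit or FP8 Adam", 8), ("4-bit Adam", 4)]:
    gigabytes = adam_state_bytes(num_params, bits) / 1e9
    print(f"{name}: {gigabytes:.0f} GB")
# prints:
# fp32 Adam: 64 GB
# 8-bit or FP8 Adam: 16 GB
# 4-bit Adam: 8 GB
```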

Improvements to 4-bit quantization (#517, #552, #544, #479)

@bdhirsh @jeromeku @yanbing-j @manuelcandales @larryliu0820 added torch.compile support for NF4 Tensor, custom CUDA int4 tinygemm unpacking ops, and several bugfixes to torchao.

BC breaking

quantize has been renamed to quantize_ #467

# for torchao 0.4
from torchao.quantization import quantize_, int8_weight_only
quantize_(model, int8_weight_only())

# for torchao 0.3
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())

apply_sparse_semi_structured has been deprecated in favor of sparsify_ which matches the quantize_ API #473

# for torchao 0.4
from torchao.sparsity import sparsify_, semi_sparse_weight
sparsify_(model, semi_sparse_weight())

# for torchao 0.3
from torchao.sparsity import apply_sparse_semi_structured
apply_sparse_semi_structured(model)
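Code that has to run against both 0.3 and 0.4 can look up whichever entry point is available instead of hard-coding one import. The helper below is a hypothetical sketch, not something torchao provides; it relies only on the standard library and on the old and new names listed above.

```python
import importlib


def resolve_first(module_name, candidate_names):
    """Return the first attribute found in the module, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    for name in candidate_names:
        attr = getattr(module, name, None)
        if attr is not None:
            return attr
    return None


def resolve_torchao_entry_points():
    """Prefer the 0.4 names and fall back to the 0.3 names."""
    quantize_fn = resolve_first("torchao.quantization", ["quantize_", "quantize"])
    sparsify_fn = resolve_first(
        "torchao.sparsity", ["sparsify_", "apply_sparse_semi_structured"]
    )
    return quantize_fn, sparsify_fn
```

Note that the two sparsity functions take different arguments (sparsify_ expects a config such as semi_sparse_weight(), while apply_sparse_semi_structured takes only the model), so callers still need to branch on which one was resolved.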

Deprecations

New Features

Added kv_cache quantization #532
Migrated float8_experimental to torchao.float8, enabling float8 training support #551 #529
Added FP5 E2M2 #399
Added 4-bit, 8-bit, and FP8 Adam support #478 #463 #482
Added FSDP2 support for low-bit optimizers #484
[prototype] mixed-precision quantization and eval framework #531
Added int4 weight-only QAT support #555, #383
Added custom CUDA tinygemm unpacking ops #415

Improvements

Composing quantization and sparsity now uses the unified AQT Layout #498
Added default inductor config settings #423
Better dtype and device handling for Int8DynActInt4WeightQuantizer and Int4WeightOnlyQuantizer #475 #479
Enable model.to for int4/int8 weight only quantized models #486 #522
Added more logging to TensorCoreTiledAQTLayout #520
Added general fake_quantize_affine op with mask support #492 #500
QAT now uses the shared fake_quantize_affine primitive #527
Improve FSDP support for low-bit optimizers #538
Custom op and inductor decomp registration now uses a decorator #434
Updated torch version to no longer require unwrap_tensor_subclass #595

Bug fixes

Fixed import for TORCH_VERSION_AFTER_* #433
Fixed crash when PYTORCH_VERSION is not defined #455
Added torch.compile support for NF4Tensor #544
Added fbcode check to fix torchtune in Genie #480
Fixed int4pack_mm error #517
Fixed cuda device check #536
Weight shuffling now runs on CPU for int4 quantization due to a MPS memory issue #552
Scale and input now are the same dtype for int8 weight only quantization #534
Fixed FP6-LLM API #595

Performance

Added segment-anything-fast benchmarks for composed quantization + sparsity #457
Updated low-bit Adam benchmark #481

Docs

Updated README.md #583 #438 #445 #460
Updated installation instructions #447 #459
Added more docs for int4_weight_only API #469
Added developer guide notebook #588
Added optimized model serialization/deserialization doc #524 #525
Added new float8 feature tracker #557
Added static quantization tutorial for calibration-based techniques #487

Devs

Fix numpy version in CI #537
trymerge now uploads merge records to s3 #448
Updated python version to 3.9 #488
torchao no longer depends on torch #449
benchmark_model now accepts args and kwargs and supports cpu and mps backends #586 #406
Add git version suffix to package name #547
Added validations to torchao #453 #454
Parallel test support with pytest-xdist #518
Quantizer now uses logging instead of print #472

Not user facing

Refactored _replace_linear_8da4w #451
Remove unused code from AQT implementation #476 #440 #441 #471
Improved error message for lm_eval script #444
Updated HF_TOKEN env variable #427
Fixed typo in Quant-LLM #450
Added a test for map_location="cpu" #497
Removed sparse test collection warning #489
Refactored layout imple...

Contributors

jeromeku, larryliu0820, and 9 other contributors


Releases: pytorch/ao
