Commit

Merge branch 'main' into 085_no_input_autoquant
cpuhrsch authored Apr 29, 2024
2 parents 144b03d + e3ed90f commit e261405
Showing 25 changed files with 1,155 additions and 190 deletions.
11 changes: 2 additions & 9 deletions .github/workflows/doc_build.yml
@@ -43,16 +43,9 @@ jobs:
          python -m pip install -e .
          cd docs
          python -m pip install -r requirements.txt
      - name: Get the torchtune version
        run: |
          # Get the github.ref_name and save into the
          # REF_NAME variable. This will be passed in
          # conf.py to display the version in the
          # site dropdown
          GITHUB_REF=${{ github.ref }}
          TORCHAO_VERSION_DOCS="${GITHUB_REF}"
          echo "$TORCHAO_VERSION_DOCS"
      - name: Build docs
        env:
          TORCHAO_VERSION_DOCS: ${{ github.ref }}
        run: |
          cd docs
          make html
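
The workflow now passes `TORCHAO_VERSION_DOCS` to the build step via `env` instead of computing it in a separate shell step. The diff doesn't include `conf.py`, so here is a minimal sketch, assuming it simply strips the `refs/...` prefix to produce the name shown in the version dropdown:

```python
# docs/source/conf.py (illustrative sketch -- the real file is not part of this diff)
import os

# The workflow sets TORCHAO_VERSION_DOCS to the full git ref,
# e.g. "refs/tags/v0.1.0" or "refs/heads/main".
ref = os.environ.get("TORCHAO_VERSION_DOCS", "main")

# Keep only the final path component (tag or branch name) for display.
version = ref.rsplit("/", 1)[-1]
release = version
```
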
40 changes: 40 additions & 0 deletions .github/workflows/nightly_smoke_test.yml
@@ -0,0 +1,40 @@
name: PyTorch CUDA Nightly Smoke Test

on:
  schedule:
    # 6 am PST every day
    - cron: "0 14 * * *"
  workflow_dispatch:

concurrency:
  group: regression_test-${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
  cancel-in-progress: true

env:
  HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        include:
          - name: CUDA Nightly
            runs-on: linux.g5.12xlarge.nvidia.gpu
            torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cu121'
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.1"

    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
    with:
      runner: ${{ matrix.runs-on }}
      gpu-arch-type: ${{ matrix.gpu-arch-type }}
      gpu-arch-version: ${{ matrix.gpu-arch-version }}
      script: |
        python -m pip install --upgrade pip
        pip install ${{ matrix.torch-spec }}
        pip install -r requirements.txt
        pip install -r dev-requirements.txt
        python setup.py install
        pytest test --verbose -s
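
The job installs a nightly PyTorch wheel and then runs the full `pytest test` suite. As a hedged illustration of the kind of CUDA check such a smoke suite relies on (a hypothetical test, not a file in this commit):

```python
# Hypothetical CUDA smoke test; torch._int_mm is the int8 matmul
# primitive that several torchao code paths build on.
import pytest
import torch

@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
def test_int8_matmul_smoke():
    a = torch.randint(-128, 127, (64, 64), dtype=torch.int8, device="cuda")
    b = torch.randint(-128, 127, (64, 64), dtype=torch.int8, device="cuda")
    out = torch._int_mm(a, b)
    assert out.shape == (64, 64) and out.dtype == torch.int32
```
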
61 changes: 31 additions & 30 deletions .github/workflows/regression_test.yml
@@ -22,44 +22,45 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
        include:
          - name: CUDA 2.2.2
            runs-on: 4-core-ubuntu-gpu-t4
            runs-on: linux.g5.12xlarge.nvidia.gpu
            torch-spec: 'torch==2.2.2'
          - name: CUDA 2.3 RC
            runs-on: 4-core-ubuntu-gpu-t4
            torch-spec: 'torch==2.3.0 --index-url https://download.pytorch.org/whl/test/cu121'
          - name: CUDA Nightly
            runs-on: 4-core-ubuntu-gpu-t4
            torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cu121'
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.1"
          - name: CUDA 2.3
            runs-on: linux.g5.12xlarge.nvidia.gpu
            torch-spec: 'torch==2.3.0'
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.1"
          - name: CUDA 2.4.0.dev20240421
            runs-on: linux.g5.12xlarge.nvidia.gpu
            torch-spec: '--pre torch==2.4.0.dev20240421+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121'
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.1"
          - name: CPU 2.2.2
            runs-on: 32-core-ubuntu
            runs-on: linux.4xlarge
            torch-spec: 'torch==2.2.2 --index-url https://download.pytorch.org/whl/cpu'
          - name: CPU 2.3 RC
            runs-on: 32-core-ubuntu
            torch-spec: 'torch==2.3.0 --index-url https://download.pytorch.org/whl/test/cpu'
            gpu-arch-type: "cpu"
            gpu-arch-version: ""
          - name: CPU 2.3
            runs-on: linux.4xlarge
            torch-spec: 'torch==2.3.0 --index-url https://download.pytorch.org/whl/cpu'
            gpu-arch-type: "cpu"
            gpu-arch-version: ""
          - name: Nightly CPU
            runs-on: 32-core-ubuntu
            runs-on: linux.4xlarge
            torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cpu'

    runs-on: ${{ matrix.runs-on }}
    steps:
      - uses: actions/checkout@v2
            gpu-arch-type: "cpu"
            gpu-arch-version: ""

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
    with:
      runner: ${{ matrix.runs-on }}
      gpu-arch-type: ${{ matrix.gpu-arch-type }}
      gpu-arch-version: ${{ matrix.gpu-arch-version }}
      script: |
        python -m pip install --upgrade pip
        pip install ${{ matrix.torch-spec }}
        pip install -r requirements.txt
        pip install -r dev-requirements.txt
      - name: Install package
        run: |
          pip install .
      - name: Run tests
        run: |
        python setup.py install
        pytest test --verbose -s
134 changes: 100 additions & 34 deletions README.md
@@ -1,65 +1,131 @@
# torchao: PyTorch Architecture Optimization

**Note: This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open a GitHub issue**
[![](https://dcbadge.vercel.app/api/server/cudamode?style=flat)](https://discord.gg/cudamode)

This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)

## Introduction
torchao is a PyTorch-native library for optimizing your models using lower-precision dtypes, techniques like quantization and sparsity, and performant kernels.
`torchao` is a PyTorch library for quantization and sparsity.

## Get Started
To try out our APIs, you can check out API examples in [quantization](./torchao/quantization) (including `autoquant`), [sparsity](./torchao/sparsity), and [dtypes](./torchao/dtypes).

## Installation
**Note: this library makes liberal use of several new features in PyTorch; it's recommended to use it with the current nightly or latest stable version of PyTorch.**
### Installation
`torchao` makes liberal use of several new features in PyTorch, so it's recommended to use it with the current nightly or the latest stable version of PyTorch.

1. From PyPI:
Stable Release
```Shell
pip install torchao
```

2. From Source:
Nightly Release
```Shell
pip install torchao-nightly
```

From source

```Shell
git clone https://github.com/pytorch-labs/ao
git clone https://github.com/pytorch/ao
cd ao
pip install -e .
pip install .
```

### Quantization

```python
import torch
import torchao

# inductor settings which improve torch.compile performance for quantized modules
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

# Plug in your model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# perform autoquantization
torchao.autoquant(model, input)

# compile the model to recover performance
model = torch.compile(model, mode='max-autotune')
model(input)
```
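
To gauge what autoquant bought you, here is a quick timing sketch that continues the snippet above, using PyTorch's built-in benchmark utility; run it before and after quantization to compare:

```python
# Time the compiled, autoquantized model; continues the snippet above.
from torch.utils.benchmark import Timer

timer = Timer(stmt="model(input)", globals={"model": model, "input": input})
print(timer.timeit(100))  # wall-time statistics over 100 runs
```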

## Key Features
The library provides
1. Support for lower precision [dtypes](./torchao/dtypes) such as nf4, uint4 that are torch.compile friendly
2. [Quantization algorithms](./torchao/quantization) such as dynamic quant, smoothquant, GPTQ that run on CPU/GPU and Mobile.
* Int8 dynamic activation quantization
* Int8 and int4 weight-only quantization
* Int8 dynamic activation quantization with int4 weight quantization
* [GPTQ](https://arxiv.org/abs/2210.17323) and [Smoothquant](https://arxiv.org/abs/2211.10438)
* High level `autoquant` API and kernel auto tuner targeting SOTA performance across varying model shapes on consumer/enterprise GPUs.
3. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
4. Integration with other PyTorch native libraries like [torchtune](https://github.com/pytorch/torchtune) and [ExecuTorch](https://github.com/pytorch/executorch)
### Sparsity

```python
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.ao.pruning import WeightNormSparsifier

# bfloat16 CUDA model
model = torch.nn.Sequential(torch.nn.Linear(64, 64)).cuda().to(torch.bfloat16)

# Accuracy: Finding a sparse subnetwork
sparse_config = []
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        sparse_config.append({"tensor_fqn": f"{name}.weight"})

sparsifier = WeightNormSparsifier(sparsity_level=1.0,
                                  sparse_block_shape=(1, 4),
                                  zeros_per_block=2)

# attach FakeSparsity
sparsifier.prepare(model, sparse_config)
sparsifier.step()
sparsifier.squash_mask()
# now we have a dense model with sparse weights

# Performance: Accelerated sparse inference
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
```
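
Since `to_sparse_semi_structured` expects a 2:4 pattern, it can be worth sanity-checking the masked weights between `squash_mask()` and the conversion loop. A small sketch, continuing the snippet above:

```python
# Optional sanity check: with sparse_block_shape=(1, 4) and
# zeros_per_block=2, every block of four weights should hold two zeros.
# Run this after squash_mask() but before to_sparse_semi_structured.
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        blocks = mod.weight.detach().reshape(-1, 4)
        assert bool(((blocks == 0).sum(dim=1) >= 2).all()), f"{name} is not 2:4 sparse"
```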

To learn more and try out our APIs, you can check out the examples in
* [quantization](./torchao/quantization)
* [sparsity](./torchao/sparsity)
* [dtypes](./torchao/dtypes)


## Supported Features
1. [Quantization algorithms](./torchao/quantization)
- [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
- [Int4 weight-only](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu) quantization
- [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low latency inference
- High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
3. Support for lower precision [dtypes](./torchao/dtypes) such as
- [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code (a short round-trip sketch follows this list)
- [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
4. [Bleeding Edge Kernels](./torchao/prototype/) for experimental kernels without backwards compatibility guarantees
- [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
- [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads
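
The nf4 dtype above can also be exercised directly. A minimal round-trip sketch; the `to_nf4` helper and `get_original_weight` method are assumed from `torchao/dtypes/nf4tensor.py`, so treat the exact names as assumptions:

```python
# Hypothetical nf4 round-trip -- API names assumed, not guaranteed stable.
import torch
from torchao.dtypes import to_nf4

weight = torch.randn(512, 512, dtype=torch.bfloat16)
nf4_weight = to_nf4(weight, block_size=64, scaler_block_size=256)

# Dequantize and inspect the worst-case error introduced by 4-bit storage.
max_err = (nf4_weight.get_original_weight() - weight).abs().max()
print(max_err)
```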

## Our Goals
torchao embodies PyTorch's [design philosophy](https://pytorch.org/docs/stable/community/design.html), especially "usability over everything else". Our vision for this repository is the following:

* Composability: Native solutions for optimization techniques that compose with both `torch.compile` and `FSDP`
* For example, new dtype support for QLoRA
* Interoperability: Work with the rest of the PyTorch ecosystem such as torchtune, gpt-fast and ExecuTorch
* Transparent Benchmarks: Regularly run performance benchmarking of our APIs across a suite of Torchbench models and across hardware backends
* Composability with `torch.compile`: We rely heavily on `torch.compile` to write pure PyTorch code and codegen efficient kernels. There are, however, limits to what a compiler can do, so we don't shy away from writing custom CUDA/Triton kernels
* Composability with `FSDP`: The new support for FSDP per-parameter sharding means engineers and researchers alike can experiment with different quantization and distributed strategies concurrently.
* Performance: We measure our performance on every commit using an A10G. We also regularly run performance benchmarks on the [torchbench](https://github.com/pytorch/benchmark) suite
* Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU-based servers (w/ torch.compile) and mobile backends (w/ ExecuTorch).
* Infrastructure Support: Release packaging solution for kernels and a CI/CD setup that runs these kernels on different backends.
* Packaging kernels should be easy: We support custom [CUDA and Triton extensions](./torchao/csrc/) so you can focus on writing your kernels and we'll ensure that they work on most operating systems and devices (a generic example follows this list)
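
As a generic illustration of the custom-extension workflow the last bullet refers to (standard `torch.utils.cpp_extension` usage, not torchao-specific code):

```python
# Generic PyTorch inline C++ extension example (not torchao-specific).
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
torch::Tensor add_one(torch::Tensor x) { return x + 1; }
"""

# Compiles the C++ source at import time and exposes `add_one` in Python.
ext = load_inline(name="add_one_ext", cpp_sources=cpp_source, functions=["add_one"])
print(ext.add_one(torch.zeros(3)))  # tensor([1., 1., 1.])
```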

## Interoperability with PyTorch Libraries
## Integrations

torchao has been integrated with other repositories to ease usage
torchao has been integrated with other libraries including

* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) is integrated with 8 and 4 bit weight-only quantization techniques with and without GPTQ.
* [ExecuTorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) is integrated with GPTQ for both 8da4w (int8 dynamic activation, with int4 weight) and int4 weight only quantization.
* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) leverages our 8 and 4 bit weight-only quantization techniques with optional support for GPTQ
* [ExecuTorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) leverages our GPTQ implementation for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization (a usage sketch follows this list).
* [HQQ](https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py) leverages our int4mm kernel for low latency inference
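
For the 8da4w scheme mentioned above, usage is roughly as follows; the `Int8DynActInt4WeightQuantizer` class name and its arguments are assumptions based on `torchao.quantization`, not a documented contract:

```python
# Hypothetical 8da4w sketch -- class name and arguments are assumptions.
import torch
from torchao.quantization import Int8DynActInt4WeightQuantizer

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))
quantizer = Int8DynActInt4WeightQuantizer(groupsize=256)
model = quantizer.quantize(model)  # int8 dynamic activations, int4 weights
```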

## Success stories
Our kernels have been used to achieve SOTA inference performance on

* Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)

## License
