Commit

Benchmarking
Summary: Add benchmarks for experimental torchao kernels.

Differential Revision: D66512859
metascroy authored and facebook-github-bot committed Nov 26, 2024
1 parent 5eb6339 commit de00766
Showing 2 changed files with 64 additions and 1 deletion.
56 changes: 55 additions & 1 deletion torchao/_models/llama/generate.py
@@ -217,7 +217,7 @@ def main(
float8_weight_only,
float8_dynamic_activation_float8_weight,
)
-from torchao.prototype.quantization.autoquant_v2 import autoquant_v2
+# from torchao.prototype.quantization.autoquant_v2 import autoquant_v2
from torchao.utils import unwrap_tensor_subclass

from torchao.quantization.granularity import PerTensor, PerRow
@@ -297,6 +297,60 @@ def main(
    dtype = _NBITS_TO_DTYPE[nbits]
    group_size = int(_quant_args[2])
    quantize_(model, uintx_weight_only(dtype, group_size, use_hqq=use_hqq))
elif "int8_dynamic_activation_intx_weight" in quantization:
    from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight
    assert precision == torch.float32, "int8_dynamic_activation_intx_weight requires fp32 precision"

    # Build the kernels in a temporary location, and load them into torch
    import glob
    import os
    import subprocess
    import tempfile

    def cmake_build_torchao_ops(temp_build_dir):
        from distutils.sysconfig import get_python_lib

        print("Building torchao ops for ATen target")
        cmake_prefix_path = get_python_lib()
        dir_path = os.path.dirname(os.path.realpath(__file__))
        # Pass each flag and its value as separate argv entries so cmake
        # parses them correctly; check=True fails fast on build errors.
        subprocess.run(
            [
                "cmake",
                "-DCMAKE_PREFIX_PATH=" + cmake_prefix_path,
                "-DCMAKE_INSTALL_PREFIX=" + temp_build_dir.name,
                "-S", dir_path + "/../../experimental",
                "-B", temp_build_dir.name,
            ],
            check=True,
        )
        subprocess.run(
            [
                "cmake",
                "--build", temp_build_dir.name,
                "-j", "16",
                "--target", "install",
                "--config", "Release",
            ],
            check=True,
        )

    temp_build_dir = tempfile.TemporaryDirectory()
    cmake_build_torchao_ops(temp_build_dir)
    libs = glob.glob(f"{temp_build_dir.name}/lib/libtorchao_ops_aten.*")
    libs = [lib for lib in libs if lib.endswith("so") or lib.endswith("dylib")]
    assert len(libs) == 1
    torch.ops.load_library(libs[0])

    # Quantize model
    _quant_args = quantization.split("-")
    nbit = int(_quant_args[1])
    assert 1 <= nbit <= 8, "nbit must be 1 to 8"
    group_size = int(_quant_args[2])
    # Compare the string explicitly: bool("false") would evaluate to True
    has_weight_zeros = _quant_args[3].lower() == "true"
    quantize_(
        model,
        int8_dynamic_activation_intx_weight(
            group_size=group_size,
            nbit=nbit,
            has_weight_zeros=has_weight_zeros,
        ),
    )
elif "float8wo" in quantization:
    quantize_(model, float8_weight_only())
elif "float8dq" in quantization:
9 changes: 9 additions & 0 deletions torchao/quantization/README.md
@@ -333,7 +333,16 @@ We're trying to develop kernels for low bit quantization for intx quantization f

You can try out these APIs with the `quantize_` API as above alongside the constructor `uintx_weight_only`. An example can be found in `torchao/_models/llama/generate.py`, and a minimal sketch follows below.
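
A minimal sketch of that flow; the toy model, dtype, and group size here are illustrative, and we assume `uintx_weight_only` is importable from `torchao.quantization`:

```python
import torch
from torchao.quantization import quantize_, uintx_weight_only

# Toy model for illustration; any torch.nn.Module with Linear layers works.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))

# Quantize weights to uint4 with groupwise scales (values are illustrative).
quantize_(model, uintx_weight_only(torch.uint4, group_size=64))
```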

### int8_dynamic_activation_intx_weight Quantization
We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon). The benchmarks below were run on an M1 Mac Pro with 8 performance cores, 2 efficiency cores, and 32 GB of RAM. In all cases, torch.compile was used.

| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ------------- | -------------------------------------------------| --------------| ------------------------| ---------------- | ----------------|
| Llama-3.1-8B | Base (bfloat16) | 1.24 | 18.62 | NA | 15.01 |
| | int8_dynamic_activation_intx_weight-4-256-false | 16.03 | 65.81 | NA | 4.11 |
| | int8_dynamic_activation_intx_weight-3-256-false | 18.94 | 59.97 | NA | 3.17 |

You can try out these APIs with the `quantize_` API as above alongside the constructor `int8_dynamic_activation_intx_weight`. An example can be found in `torchao/_models/llama/generate.py`, and a sketch follows below.
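
The suffixes in the technique strings above correspond to the constructor's `nbit`, `group_size`, and `has_weight_zeros` arguments (e.g. `-4-256-false`). A minimal sketch mirroring the call in `generate.py`; the toy model is illustrative, and the experimental shared library must be built and loaded first, as `generate.py` does:

```python
import torch
from torchao.quantization import quantize_
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight

# Toy fp32 model for illustration; the kernels require fp32 precision, an
# ARM CPU, and the library built from torchao/experimental loaded via
# torch.ops.load_library (see generate.py).
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.float32)

quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        group_size=256,          # groupwise weight granularity (the "256" suffix)
        nbit=4,                  # weight bit width (the "4" suffix)
        has_weight_zeros=False,  # the "false" suffix
    ),
)
```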

### Automatic Inductor Configuration
The `quantize_` and `autoquant` APIs now automatically use our recommended inductor configuration settings. You can mimic them in your own experiments by calling `torchao.quantization.utils.recommended_inductor_config_setter`. Alternatively, to disable the recommended settings, pass the keyword argument `set_inductor_config=False` to `quantize_` or `autoquant`. You can also overwrite these configuration settings after they are assigned, as long as you do so before passing any inputs to the torch.compiled model. This means that previous flows which manually set a variety of inductor configurations are now outdated, though continuing to set those same configurations manually is unlikely to cause any issues.
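
A short sketch of both paths; the toy model and the choice of `int8_weight_only` as the quantization method are illustrative assumptions:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only
from torchao.quantization.utils import recommended_inductor_config_setter

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))

# Default path: quantize_ applies the recommended inductor settings itself.
quantize_(model, int8_weight_only())

# Opt-out path: skip the automatic settings, then apply (or override) them
# manually before the compiled model sees its first input:
#   quantize_(model, int8_weight_only(), set_inductor_config=False)
#   recommended_inductor_config_setter()

model = torch.compile(model)
```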
Expand Down
