Add Auto-Round support #581
Conversation
Signed-off-by: yiliu30 <[email protected]>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/581
Note: links to docs will display an error until the docs builds have completed.
✅ No failures as of commit 96f745d with merge base 05224a9. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @jerryzh168 @msaroufim, I'm reaching out to request a preliminary review for this PR. Although some refactoring is still in progress, I'd like your feedback to make sure we're on the right track before moving forward. This draft PR includes:
Some TODOs:
Regarding 3) GPU memory consumption: in the current flow, I use hooks to capture the inputs and outputs of each block during the calibration stage. This differs from the original flow; it is mainly to align with the static quantization flow.
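The hook-based capture described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `Block` is a toy stand-in for a decoder block (the real flow targets e.g. `TransformerBlock`), and the hook/dict names are made up.

```python
import torch
import torch.nn as nn

# Toy stand-in for a decoder block.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)

model = nn.Sequential(Block(), Block())

# Capture each block's calibration inputs/outputs via forward hooks.
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        captured.setdefault(name, []).append((inputs, output))
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_children()]

with torch.no_grad():
    model(torch.randn(2, 8))  # one calibration forward pass

for h in handles:
    h.remove()

print(sorted(captured.keys()))  # each block captured once
```

After calibration the hooks are removed, so the captured activations can be fed to the block-wise tuning step without keeping the hooks alive.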
Hi @jerryzh168, for 3), I noticed that GPTQ has a similar complication: #577.
The main difference is that GPTQ handles a single Linear layer. Inspired by HDCharles's proposal, I tried to extend it. I have prepared a full demo here; could you please take a look? Thanks a lot!
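The core idea being extended here — one wrapper carrying several calibration inputs so a single call processes all of them — can be illustrated with a deliberately simplified toy. The real `MultiTensor` in torchao is a `torch.Tensor` subclass that intercepts ops; this sketch only demonstrates the concept, and all names are hypothetical.

```python
import torch

class ToyMultiTensor:
    """Toy stand-in for the MultiTensor idea: hold several calibration
    inputs and run the same module/function over each of them."""
    def __init__(self, tensors):
        self.tensors = list(tensors)

    def apply(self, fn):
        # One logical call fans out over every held input.
        return ToyMultiTensor(fn(t) for t in self.tensors)

linear = torch.nn.Linear(4, 4)
batch = ToyMultiTensor([torch.randn(1, 4) for _ in range(3)])
out = batch.apply(linear)
print(len(out.tensors))  # 3 — one output per calibration sample
```

Applied to a whole decoder block instead of a single Linear, the same pattern lets one traced forward cover every calibration sample.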
@yiliu30 sorry for the late reply, I think using that approach works.
One small nit for the "general_decoder": we can use … instead of looking at …
Also, after this is done, I think we can improve our current utils for operator implementation (Lines 11 to 70 in db345bd).
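One reading of the nit above (the specifics were lost in extraction) is to select target modules with a type-based filter in the `(module, fqn)` shape that `quantize_` accepts, rather than matching on module-name strings. A minimal sketch, with `DecoderBlock` as a hypothetical stand-in for the real block type:

```python
import torch.nn as nn

# Hypothetical stand-in for a decoder block type (e.g. TransformerBlock).
class DecoderBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(8, 8)

model = nn.Sequential(DecoderBlock(), nn.Linear(8, 2))

# Type-based filter instead of inspecting module-name strings:
is_target_module = lambda mod, fqn: isinstance(mod, DecoderBlock)

targets = [fqn for fqn, mod in model.named_modules()
           if is_target_module(mod, fqn)]
print(targets)  # only the block itself matches, not its children
```

A type check keeps the filter robust to renames and nesting, which string matching on fully-qualified names is not.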
requested some changes
I was curious about the compute dtype supported by the AO kernel. If it only supports FP16, I recommend forcing the dtype to FP16 before passing the model to AutoRound. However, if BF16 is also supported, it would be preferable to set the scale_type in AutoRound to align with the original model. Additionally, the accuracy data differs slightly from the results of our recipe, which may not be due solely to hyperparameter changes; we should investigate further.
It depends on the kernel; int4 weight-only quantization that uses the tinygemm kernel only supports bfloat16, I think.
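Given that constraint, the practical consequence is to align the model dtype before quantizing. A minimal sketch of that step (the quantization call itself is omitted; only the dtype alignment is shown, and the same pattern would apply with `torch.float16` for a kernel that required half precision):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)

# tinygemm-backed int4 weight-only quantization expects bfloat16,
# so cast the model before handing it to the quantization flow.
model = model.to(torch.bfloat16)

x = torch.randn(2, 16, dtype=torch.bfloat16)
y = model(x)
print(y.dtype)  # torch.bfloat16
```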
```python
quantize_(model, apply_auto_round(), is_target_module)
```

## End-to-End Results
so what about performance results?
Code changes look good to me. One comment: include performance data (tokens/s, memory, etc.) in the README as well, similar to https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks
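A minimal sketch of how such numbers could be collected — this is not the benchmark harness used in the repo; the model and loop are hypothetical stand-ins (a real run would time `model.generate()` on the quantized model on GPU):

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in for one decode step per "token".
model = nn.Linear(64, 64)
x = torch.randn(1, 64)

num_tokens = 100
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
for _ in range(num_tokens):
    x = model(x)
elapsed = time.perf_counter() - start

tokens_per_sec = num_tokens / elapsed
peak_mem_gb = (torch.cuda.max_memory_allocated() / 1e9
               if torch.cuda.is_available() else float("nan"))
print(f"{tokens_per_sec:.1f} tok/s, peak mem {peak_mem_gb:.2f} GB")
```

Reporting both tokens/s and peak memory makes the README entry directly comparable to the existing benchmark table linked above.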
The benchmark depends on #769.
```python
)
else:
    is_target_module = lambda mod, fqn: isinstance(mod, TransformerBlock)
quantize_model_with_autoround_(
```
Nit: should we just use the same flow everywhere to reduce confusion? The flow in https://github.com/pytorch/ao/pull/581/files#diff-af129d63635a3b5b0a95f1a3831f852fbd7bedfd66b38d41bf4975fb49aad246 would be the recommended one, I think.
Thanks @yiliu30 for addressing all the comments!
@jerryzh168 Thanks for your patient guidance and detailed examples. This joint effort will allow more users to benefit from AO and Auto-Round!
Resolve #533

Description
- Implemented `Auto-Round` with the `quantize_` API using hooks + `MultiTensor`.
- Converted `qweight` to `AffineQuantizedTensor` to leverage the `tinygemm` and `Uintx` kernels.
- Evaluated `Llama2/3/3.1` on 5 popular `lm-eval` tasks (more tests are on the way).
- Added `Auto-Round` to the generation benchmarking for `Llama2/3` (`Llama 3.1` not yet tested, as it landed a few days ago).

Usage
For E2E examples, please refer to README.md.

cc @thuang6 @ftian1 @wenhuach21