Add Auto-Round support #581

Merged: 77 commits into pytorch:main on Sep 4, 2024

Conversation

@yiliu30 (Contributor) commented Jul 31, 2024

Resolve #533

Description

  • Integrated Auto-Round with the quantize_ API using hooks + MultiTensor.
  • Exported the optimized qweight to AffineQuantizedTensor to leverage the tinygemm and Uintx kernels.
  • Evaluated accuracy for Llama2/3/3.1 on 5 popular lm-eval tasks (more tests are on the way).
  • Added Auto-Round to the generation benchmarking for Llama2/3 (Llama 3.1 is not yet tested, as it landed only a few days ago).
  • Small fix for the Llama model (Fixed the llama model #769).

Usage

from torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_
from torchao.prototype.autoround.core import apply_auto_round

# Step 1: prepare the model so Auto-Round can optimize the target decoder blocks.
prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

# Step 2: run calibration; wrapping the batches in a MultiTensor lets a single
# forward pass carry the whole calibration set.
input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(model_device))

multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

# Step 3: apply Auto-Round and export the optimized weights as AffineQuantizedTensor.
quantize_(model, apply_auto_round(), is_target_module)

For end-to-end examples, please refer to the README.md.
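
For reference, the names the snippet above assumes to exist (`model`, `is_target_module`, `dataloader`, `device`, `model_device`) could be set up roughly as follows; this is an illustrative sketch, not the prototype's exact script:

```python
# Illustrative setup for the usage snippet above (assumed names, not the exact API).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # the model used in the E2E example
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model_device = next(model.parameters()).device

# Auto-Round optimizes one decoder block at a time, so the target predicate
# matches decoder layers rather than individual Linear modules.
decoder_layer_cls = type(model.model.decoder.layers[0])
def is_target_module(module, fqn):
    return isinstance(module, decoder_layer_cls)

# A toy calibration dataloader: a handful of tokenized prompts.
prompts = ["The capital of France is", "Quantization reduces model size by"]
dataloader = [tokenizer(p, return_tensors="pt") for p in prompts]
```

`MultiTensor` and `quantize_` are imported from torchao itself; see the prototype's README for the exact import paths.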

cc @thuang6 @ftian1 @wenhuach21


pytorch-bot bot commented Jul 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/581

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 96f745d with merge base 05224a9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 31, 2024
@yiliu30 (Contributor, Author) commented Jul 31, 2024

Hi @jerryzh168 @msaroufim, I'm reaching out to request a preliminary review of this PR. Although some refactoring is still in progress, I'd like to get your feedback to make sure we're on the right track before moving forward.

This draft PR includes:

  1. An end-to-end example that quantizes facebook/opt-125m with Auto-Round-optimized qweight, scales, and zeros, and performs inference with torchao's AffineQuantizedTensor.
  2. Cleaned up the dependencies of auto-round in the patch-for-ao-2 branch.

Some TODOs:

  3. Reduce the GPU memory consumption.
  4. Support other bits and data types (currently, the weight bits are hardcoded to 4, and activations are not quantized).
  5. Further refactoring of auto-round.
  6. Rearrange the code structure.

Regarding 3) GPU memory consumption: in the current flow, I use hooks to capture the inputs and outputs of each block during the calibration stage. This differs from the original auto-round implementation, which captures only the input of the first decoder block and defers block inference to the quantize stage (similar to AutoAWQ's implementation). The implementation in this PR introduces some limitations: a) GPU memory consumption is quite large when the calibration dataset is large; b) we cannot use the output of a previously quantized block as the input to the following block.

This approach is mainly intended to align with the static quantization flow and the quantize_ API. Would you be open to refactoring the flow a bit to resolve these limitations, or do you have other suggestions? I think AutoAWQ might also need similar adjustments.
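
For concreteness, a minimal sketch of the hook-based capture described above (illustrative only; `model`, `decoder_blocks`, `dataloader`, and `device` are placeholder names, not the PR's actual code):

```python
import torch

captured = {}  # block index -> list of (inputs, output) pairs gathered during calibration

def make_capture_hook(idx):
    def hook(module, inputs, output):
        # Keep every block's calibration inputs/outputs around; with a large
        # calibration set this is exactly what drives GPU memory up (limitation a).
        captured.setdefault(idx, []).append((inputs, output))
    return hook

handles = [
    block.register_forward_hook(make_capture_hook(i))
    for i, block in enumerate(decoder_blocks)  # e.g. the model's decoder layers
]

with torch.no_grad():
    for batch in dataloader:
        model(batch["input_ids"].to(device))

for handle in handles:
    handle.remove()
```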

@yiliu30 (Contributor, Author) commented Aug 1, 2024

Hi @jerryzh168, regarding 3), I noticed that GPTQ has a similar complication (#577):

Instead, we want to run the model for each input, but ONLY up to the first linear, then pause, do the algorithm to update the weight, get the outputs for the updated weight and then, unpause and continue on until we hit the next linear….etc.

The main difference is that GPTQ handles a single Linear layer, whereas auto-round works on a decoder block (it may also work on a Linear layer when quantizing the lm-head).

Inspired by HDCharles's proposal, I tried to extend it to auto-round. Building on MultiTensor, the remaining issue is enabling the dispatcher to identify the decoder block, such as OPTDecoderLayer.
I resolved this by defining a custom operation called general_decoder and swapping all decoder blocks with it. Then we run inference with the calibration dataset; when the dispatcher encounters general_decoder, it jumps to Auto-Round's optimization process with all accumulated inputs and returns the outputs of the optimized (or original) model.

I have prepared a full demo here. Could you please take a look? Thanks a lot!
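
A rough sketch of the dispatch side (illustrative only; `torch.ops.transformers_ops.general_decoder`, `optimize_decoder`, and the grouping helper stand in for the demo's actual op and helpers, and the op-identity check follows the suggestion later in this thread):

```python
import torch

class MultiTensor(torch.Tensor):
    """Simplified stand-in for the MultiTensor used in the demo."""

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # When the swapped-in decoder-block op is hit, hand the accumulated
        # calibration inputs to Auto-Round's per-block optimization instead of
        # running the plain decoder forward.
        if func is torch.ops.transformers_ops.general_decoder:
            grouped_args, spec = flatten_and_group(args, kwargs)  # hypothetical helper
            return optimize_decoder(func, grouped_args, spec)     # hypothetical helper
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)
```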

@jerryzh168 (Contributor):
@yiliu30 sorry for the late reply. I think using MultiInput from @HDCharles's GPTQ issue makes sense for your use case, since the Auto-Round flow is similar to the GPTQ flow but does not fit the static quant flow (with observers) very well.

@jerryzh168 (Contributor):
One small nit for `general_decoder`: we can use

if func is torch.ops.transformers_ops.general_decoder:
    outputs = optimize_decoder(func, grouped_args, spec)

instead of looking at func.__name__.

Also, after this is done, I think we can improve our current utils for operator implementation:

def _implements(cls, aten_ops_or_torch_fns):
    """Use this decorator to implement a function for an aten ops in __torch_dispatch__
    (if user passed in a list of ops)
    or torch function in __torch_function__ (if user passed in a single object)

    class MyTensor(torch.Tensor):
        ...
        implements = classmethod(_implements)

    implements = MyTensor.implements

    @implements(torch.nn.functional.linear):
    def _(func, types, args, kwargs):
        ...
    """
    if not hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE"):
        cls._ATEN_OP_OR_TORCH_FN_TABLE = {}
    if not isinstance(aten_ops_or_torch_fns, (list, tuple)):
        aten_ops_or_torch_fns = [aten_ops_or_torch_fns]
    def decorator(func):
        for op in aten_ops_or_torch_fns:
            @functools.wraps(op)
            def wrapper(*args, **kwargs):
                return func(*args, **kwargs)
            cls._ATEN_OP_OR_TORCH_FN_TABLE[op] = wrapper
        return func
    return decorator

def _dispatch__torch_function__(cls, func, types, args=(), kwargs=None):
    """Use this util function for a common `__torch_function__` implementation
    that dispatches to ops/functions registered with `_implements`

    class MyTensor(torch.Tensor):
        ...
        __torch_function__ = classmethod(_dispatch__torch_function__)
    """
    kwargs = {} if kwargs is None else kwargs
    if hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE") and \
            func in cls._ATEN_OP_OR_TORCH_FN_TABLE:
        return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, *args, **kwargs)
    with torch._C.DisableTorchFunctionSubclass():
        return func(*args, **kwargs)

def _dispatch__torch_dispatch__(cls, func, types, args, kwargs):
    """Use this util function for a common `__torch_dispatch__` implementation
    that dispatches to ops/functions registered with `_implements`

    class MyTensor(torch.Tensor):
        ...
        __torch_dispatch__ = classmethod(_dispatch__torch_dispatch__)
    """
    if hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE") and \
            func in cls._ATEN_OP_OR_TORCH_FN_TABLE:
        return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, *args, **kwargs)
    raise NotImplementedError(f"{cls.__name__} dispatch: attempting to run unimplemented operator/function: {func}")
and incorporate this use case so you can reduce boilerplate code.
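
As an illustration of how that might reduce boilerplate for this PR (a sketch only; it assumes the MultiTensor subclass adopts the utils above, and the `general_decoder` op plus `optimize_decoder` are the demo's, used here as assumptions):

```python
import torch
# Assumes _implements and _dispatch__torch_function__ from above are importable.

class MultiTensor(torch.Tensor):
    # Wire the shared utils in once...
    implements = classmethod(_implements)
    __torch_function__ = classmethod(_dispatch__torch_function__)

# ...then register the decoder-block handler instead of hand-writing the dispatch branch.
# The handler signature mirrors how the dispatch util above calls into the table.
@MultiTensor.implements(torch.ops.transformers_ops.general_decoder)
def _general_decoder_handler(func, types, *args, **kwargs):
    # Run Auto-Round's per-block optimization over the accumulated calibration inputs.
    return optimize_decoder(func, args, kwargs)  # hypothetical helper
```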

@jerryzh168 (Contributor) left a review: requested some changes

@wenhuach21 commented:
I was curious about the compute dtype supported by the AO kernel. If it only supports FP16, I recommend forcing the dtype to FP16 before passing it to AutoRound. However, if BF16 is also supported, it would be preferable to set the scale_type in AutoRound to align with the original model.

Additionally, the accuracy data slightly differs from the results of our recipe, which may not be solely due to changes in hyperparameters. We should investigate this further.

@jerryzh168 (Contributor):

> I was curious about the compute dtype supported by the AO kernel. If it only supports FP16, I recommend forcing the dtype to FP16 before passing it to AutoRound. However, if BF16 is also supported, it would be preferable to set the scale_type in AutoRound to align with the original model.
>
> Additionally, the accuracy data slightly differs from the results of our recipe, which may not be solely due to changes in hyperparameters. We should investigate this further.

It depends on the kernel; the int4 weight-only path that uses the tinygemm kernel only supports bfloat16, I think.
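
For example, aligning the model dtype before handing it to Auto-Round could look like the sketch below (`use_tinygemm_int4` is an illustrative flag, not a torchao API):

```python
import torch

# The tinygemm-backed int4 weight-only kernel expects bfloat16, so cast the model
# up front; otherwise keep the original dtype and set AutoRound's scale_type to match.
use_tinygemm_int4 = True  # illustrative flag
if use_tinygemm_int4:
    model = model.to(torch.bfloat16)
```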

Review comment on the README's "End-to-End Results" section (right after the quantize_ usage snippet):

so what about performance results?

@jerryzh168 (Contributor) left a review:

code changes look good to me; one comment is just to include performance data (tokens/s, memory, etc.) in the README as well, similar to https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks

@yiliu30 (Contributor, Author) commented Aug 28, 2024

The benchmark depends on #769

@yiliu30 yiliu30 mentioned this pull request Sep 3, 2024
Review comment:

    )
    else:
        is_target_module = lambda mod, fqn: isinstance(mod, TransformerBlock)
        quantize_model_with_autoround_(

nit: should we just use the same flow everywhere to reduce confusion? The flow in https://github.com/pytorch/ao/pull/581/files#diff-af129d63635a3b5b0a95f1a3831f852fbd7bedfd66b38d41bf4975fb49aad246 would be the recommended one, I think.

@jerryzh168 (Contributor):
Thanks @yiliu30 for addressing all the comments!

@jerryzh168 jerryzh168 merged commit f5703b0 into pytorch:main Sep 4, 2024
17 checks passed
@yiliu30 (Contributor, Author) commented Sep 4, 2024

@jerryzh168 Thanks for your patient guidance and detailed examples. This joint effort will allow more users to benefit from AO and auto-round!

jerryzh168 pushed a commit to jerryzh168/ao that referenced this pull request Sep 4, 2024
* initial flow for autoround
* update flow
* use int4 kernel
* remove debug code
* update the forward
* clean code
* e2e example
* refine code
* add requirements for test
* update test
* update the readme
* add readme
* update the filenames
* update the np version
* add demo
* format
* add more docs
* format
* add doc
* use `AffineQuantizedTensor`
* impl ar using multensors
* clean code
* use hook + multensors
* separate mul_tensors into a new file
* fix typos
* rename mul_tensor to multi_tensor
* enable amp
* eval model
* add gen examples
* add warmup to benchmark
* add benchmark
* clean code
* format code
* use tiny kernel
* add more note
* format
* correct typos
* remove hard code
* use intx
* enable offload for multitensor
* update the default config
* refine note
* update the version check
* format
* update
* add ut
* format
* add scripts
* format code
* format
* update
* fix typo
* refine bench code
* Enable `use_optimized_layer_output` and AO' llama (pytorch#12)
* Refine the Doc (pytorch#14)
* add more docstring
* add paper link
* correct some note
* add cmd
* udpdate the scripts
* revert some change
* Add a lightweight configuration for quick benchmarking (pytorch#15)
* update quant method name
* Wrap model's buffers and params to `MultiTensor` & update the results (pytorch#16)