Add HQQ support #605

mobicham · 2024-08-06T11:35:49Z

Adds support for HQQ quantization without using the hqq lib.

Note:

The dequantized output produced by AffineQuantizedTensor is a bit worse than that produced by the hqq-lib. You can check that by setting raw_output=True. The problem has to do with the midpoint used in the dequantization logic by AffineQuantizedTensor which produces zero-point values with very low magnitude.

pytorch-bot · 2024-08-06T11:35:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/605

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0b511f5 with merge base ffa88a4 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-08-06T11:35:56Z

Hi @mobicham!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot · 2024-08-06T12:06:00Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

…to quantization api

torchao/dtypes/affine_quantized_tensor.py

torchao/prototype/hqq/example.py

torchao/quantization/hqq.py

torchao/prototype/hqq/core.py

mobicham · 2024-08-08T07:30:21Z

Llama3.1 8B Instruct lm-eval, the individual scores for each benchmark are a bit different (probably because of the mid-point thing), but the average score is the same

HDCharles

Can you modify https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_api.py#L384

to have a new argument use_hqq default to False that applies the hqq quantization
so that it aligns with the existing API?

jerryzh168 · 2024-08-08T22:11:35Z

The problem has to do with the midpoint used in the dequantization logic by AffineQuantizedTensor which produces zero-point values with very low magnitude.

this is talking about accuracy? I can see you'd need this conversion for int4 tinygemm, but it's not required for other cases, if this is significant, maybe you can expose raw_output as a flag to the top level quant_api for hqq

mobicham · 2024-08-09T08:35:46Z

@jerryzh168 we can keep it this way for the moment. The other solution would be to modify the alternate minimization and update the quant parameters based ont he mid-point format instead of (Wq - zero)*scale. For the moment, the average lm-eval is about the same as hqq-lib, so I guess it's mostly fine. If there are some performance issues I can get back to it and try to change the steps to improve performance with the mid-point approach which is needed for using the tinygemm kernel.

torchao/prototype/hqq/example.py

torchao/quantization/quant_api.py

torchao/quantization/quant_primitives.py

mobicham · 2024-08-14T08:25:38Z

@mobicham thanks, for tests right now we support both unittest.TestCase and pytest (e.g.

ao/test/dtypes/test_uintx.py

Line 39 in 174e630

@pytest.mark.parametrize("bit_width", bit_widths)

), what you have in the file makes sense I think, since you have hardcoded errors for each dtype I feel it's fine to just separate separate them into different tests

I separated the tests in separate classes and added a check regarding the device, otherwise the tests would fail if no gpu is available.

mobicham · 2024-08-15T11:32:57Z

@jerryzh168 anything missing for the merge?

msaroufim · 2024-08-15T14:33:18Z

IIRC there was some test failure, just retriggered CI if all g ill merge

mobicham · 2024-08-15T14:50:09Z

@msaroufim the tests fail because of triton. I was about to delete torchao/prototype/hqq/core.py since we moved the hqq core stuff to quantization, but just saw this pr uses it : #679

  test/hqq/test_hqq_affine.py:3: in <module>
      from torchao.prototype.hqq.core import HQQQuantizer
  /opt/conda/envs/venv/lib/python3.9/site-packages/torchao/prototype/hqq/__init__.py:1: in <module>
      from .mixed_mm import triton_mixed_mm, pack_2xint4
  /opt/conda/envs/venv/lib/python3.9/site-packages/torchao/prototype/hqq/mixed_mm.py:2: in <module>
      import triton.language as tl
  E   ModuleNotFoundError: No module named 'triton'

msaroufim · 2024-08-15T14:57:37Z

I do see some legit looking errors for GPU https://github.com/pytorch/ao/actions/runs/10403091060/job/28816346879?pr=605#step:12:2570

And for CPU not sure I follow the connection to the version guard fix PR, basically we need to ensure HQQ Quantizer doesnt get imported by accident on CPU machines that way we dont crash on the triton import. And that HQQ quantizer tests are skipped on CPU instnaces

Also I gave you access to trigger CI yourself that way you'll get signal per commit (its just an annoying thing for first time contributors)

Some of the changes you might see on local vs CI runs are due to us running multiple pytorch versions in CI so feel free to add skips for older pytorch versions in case your code isn't working for that

EDIT: Discussed offline to skip def test_dynamic_quant_per_channel_numerics_cuda(self):

mobicham · 2024-08-15T15:02:19Z

It looks like it's the tensorcore test that is failing not the rest:

  test/hqq/test_hqq_affine.py::TestHQQ4bit::test_hqq_tensorcore_4bit FAILED

It was working fine just 2 days ago I think, let me clone and recheck on an instance

mobicham · 2024-08-15T15:11:46Z

And for CPU not sure I follow the connection to the version guard fix PR, basically we need to ensure HQQ Quantizer doesnt get imported by accident on CPU machines that way we dont crash on the triton import

It's not supposed to be imported, I wanted to delete torchao/prototype/hqq/core.py because we moved all the stuff there to torchao/quantization/quant_primitives .py. Will take a look at it

jerryzh168 · 2024-08-30T00:09:47Z

torchao/quantization/quant_api.py

@@ -389,7 +389,7 @@ def int4_weight_only(group_size=128, layout_type=TensorCoreTiledLayoutType(inner
         size is more fine grained, choices are [256, 128, 64, 32]
        `layout_type`: layout type for quantized tensor, default is `TensorCoreTiledLayoutType(inner_k_tiles=8)`
    """
-    def apply_int4_weight_only_quant(weight):
+    def apply_int4_weight_only_quant(weight, use_hqq=False):


I just found that this flag is not used, so we don't really expose hqq to users right now, are you planning to create a new function for hqq? cc @mobicham

My understand was that @HDCharles suggested putting it there and later turning it on by default.
It is exposed though via to_affine_quantized https://github.com/pytorch/ao/pull/605/files#diff-a9708dc28f15bb9cf665417e6c66601f9e8e2f1f672d1858603b74fa879a3357R62
Let me know if there's another way of exposing it

commented in #786 (comment)

Summary: att, this is a follow up for pytorch#605 to make hqq available in quantize_ API `quantize_(model, int4_weight_only(group_size, use_hqq=True)` Test Plan: python generate.py --compile --quantization int4wo-hqq-64 --precision bfloat16 Average tokens/sec: 195.24 Average Bandwidth: 729.40 GB/s Peak Memory Usage: 5.09 GB Model Size: 3.74 GB python eval.py --compile --quantization int4wo-hqq-64 --precision bfloat16 wikitext: {'word_perplexity,none': 12.823631773497512, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.611400903914048, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6883154699192412, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'} Reviewers: Subscribers: Tasks: Tags:

Expose hqq through `int4_weight_only` API Summary: att, this is a follow up for #605 to make hqq available in quantize_ API `quantize_(model, int4_weight_only(group_size, use_hqq=True)` Test Plan: python generate.py --compile --quantization int4wo-hqq-64 --precision bfloat16 Average tokens/sec: 195.24 Average Bandwidth: 729.40 GB/s Peak Memory Usage: 5.09 GB Model Size: 3.74 GB python eval.py --compile --quantization int4wo-hqq-64 --precision bfloat16 wikitext: {'word_perplexity,none': 12.823631773497512, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.611400903914048, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6883154699192412, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'} Reviewers: Subscribers: Tasks: Tags:

…torch nightly (pytorch#605)

Add HQQ support

f99a90c

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 6, 2024

use use_hqq flag in AffineQuantizedTensor.from_float + move hqq core …

ab9ea3d

…to quantization api

jerryzh168 reviewed Aug 6, 2024

View reviewed changes

torchao/dtypes/affine_quantized_tensor.py Show resolved Hide resolved

jerryzh168 reviewed Aug 6, 2024

View reviewed changes

torchao/prototype/hqq/example.py Outdated Show resolved Hide resolved

supriyar requested a review from HDCharles August 6, 2024 23:19

jerryzh168 reviewed Aug 7, 2024

View reviewed changes

torchao/quantization/hqq.py Outdated Show resolved Hide resolved

jerryzh168 reviewed Aug 7, 2024

View reviewed changes

torchao/quantization/hqq.py Outdated Show resolved Hide resolved

jerryzh168 reviewed Aug 7, 2024

View reviewed changes

torchao/prototype/hqq/core.py Outdated Show resolved Hide resolved

mobicham and others added 4 commits August 7, 2024 10:20

Merge branch 'pytorch:main' into main

153fed1

move hqq quantization to quant_primitives

e022654

update example with mutliple nbits

f7a9e50

clean-up imports in affine_quantized_tensor

082dc58

HDCharles approved these changes Aug 8, 2024

View reviewed changes

add hqq to quant_api apply_int4_weight_only_quant

77d498a

mobicham marked this pull request as ready for review August 9, 2024 08:25

mobicham and others added 3 commits August 9, 2024 11:12

add random seed

9a83eda

add unittest

c65e796

Merge branch 'pytorch:main' into main

b5abf39

jerryzh168 reviewed Aug 9, 2024

View reviewed changes

torchao/prototype/hqq/example.py Outdated Show resolved Hide resolved

jerryzh168 reviewed Aug 9, 2024

View reviewed changes

torchao/prototype/hqq/example.py Outdated Show resolved Hide resolved

jerryzh168 reviewed Aug 9, 2024

View reviewed changes

torchao/quantization/quant_api.py Outdated Show resolved Hide resolved

jerryzh168 reviewed Aug 11, 2024

View reviewed changes

torchao/quantization/quant_primitives.py Outdated Show resolved Hide resolved

jerryzh168 reviewed Aug 11, 2024

View reviewed changes

torchao/quantization/quant_primitives.py Show resolved Hide resolved

mobicham and others added 2 commits August 14, 2024 09:19

Merge branch 'pytorch:main' into main

d15accf

separate xnbit tests + check device

93ca471

Merge branch 'main' into main

2146a5a

mobicham added 4 commits August 15, 2024 15:36

add torch version for tensorcore dtype

1e5eec8

fix torch 2.4 tensorcore dtype

890a7be

fix core.py import

9700d52

skip assertion error in test_dynamic_quant_per_channel_numerics_cuda

0b511f5

msaroufim approved these changes Aug 15, 2024

View reviewed changes

msaroufim merged commit 18e38f1 into pytorch:main Aug 15, 2024
16 checks passed

msaroufim mentioned this pull request Aug 15, 2024

unskip test_dynamic_quant_per_channel_numerics_cuda(self): #687

Open

jerryzh168 reviewed Aug 30, 2024

View reviewed changes

jerryzh168 mentioned this pull request Aug 31, 2024

Expose hqq through uintx_weight_only API #786

Merged

yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024

Automatically detect the cuda availability and install corresponding …

f5548ba

…torch nightly (pytorch#605)

Add HQQ support #605

Add HQQ support #605

Uh oh!

Conversation

mobicham commented Aug 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Note:

Uh oh!

pytorch-bot bot commented Aug 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/605

✅ No Failures

Uh oh!

facebook-github-bot commented Aug 6, 2024

Action Required

Process

Uh oh!

facebook-github-bot commented Aug 6, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mobicham commented Aug 8, 2024

Uh oh!

HDCharles left a comment

Choose a reason for hiding this comment

Uh oh!

jerryzh168 commented Aug 8, 2024

Uh oh!

mobicham commented Aug 9, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mobicham commented Aug 14, 2024

Uh oh!

mobicham commented Aug 15, 2024

Uh oh!

msaroufim commented Aug 15, 2024

Uh oh!

mobicham commented Aug 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

msaroufim commented Aug 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mobicham commented Aug 15, 2024

Uh oh!

mobicham commented Aug 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

jerryzh168 Aug 30, 2024

Choose a reason for hiding this comment

Uh oh!

mobicham Aug 30, 2024

Choose a reason for hiding this comment

Uh oh!

jerryzh168 Sep 2, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mobicham commented Aug 6, 2024 •

edited

Loading

pytorch-bot bot commented Aug 6, 2024 •

edited

Loading

mobicham commented Aug 15, 2024 •

edited

Loading

msaroufim commented Aug 15, 2024 •

edited

Loading

mobicham commented Aug 15, 2024 •

edited

Loading