
Allow Int4WeightOnlyQuantizer to set different dtype for scales_and_zeros #479

Merged: 2 commits merged into main on Jul 5, 2024

Conversation

larryliu0820 (Contributor) commented:

As titled. Currently `Int4WeightOnlyQuantizer` is hardcoded to return `scales_and_zeros` with dtype `torch.bfloat16`. This adds a `dtype` argument to the flow so that a different dtype can be used.

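A minimal sketch of how the new argument might be used, assuming torchao's standard Quantizer interface; the keyword name (`precision`) and import path below are assumptions for illustration, not taken verbatim from this PR:

```python
import torch
import torch.nn as nn

# Assumed import path; the class may also be re-exported from torchao.quantization.quant_api.
from torchao.quantization.GPTQ import Int4WeightOnlyQuantizer

# A toy fp16 model; the int4 packed kernel targets CUDA in eager/compile flows.
model = nn.Sequential(nn.Linear(1024, 1024)).to(device="cuda", dtype=torch.float16)

# Previously scales_and_zeros were always materialized as torch.bfloat16; with this change
# the quantizer can be asked to keep them in another dtype (kwarg name assumed here).
quantizer = Int4WeightOnlyQuantizer(groupsize=128, precision=torch.float16)
quantized_model = quantizer.quantize(model)
```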
pytorch-bot bot commented Jul 5, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/479

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f3c320a with merge base a35a1cd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Jul 5, 2024
larryliu0820 requested a review from msaroufim on Jul 5, 2024 at 18:54
The review thread below is on this diff hunk in the PR:

```python
    ) -> None:
        super().__init__()
        self.padding = not _check_linear_int4_k(in_features, groupsize, inner_k_tiles)
        if self.padding:
            from model import find_multiple
```
larryliu0820 (Contributor, Author) commented on this hunk:

I don't think there's a module called `model`.

msaroufim (Member) replied:

Thanks, I think this is a relic of when GPTQ was more deeply coupled with gpt-fast.
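For context, the helper that import was meant to pull in is gpt-fast's small padding utility; a sketch of an equivalent is below (torchao carries its own copy, so the in-tree import path may differ):

```python
def find_multiple(n: int, k: int) -> int:
    """Return the smallest multiple of k that is >= n."""
    if n % k == 0:
        return n
    return n + k - (n % k)

# Example: pad in_features of a 4000-wide linear layer up to the next multiple of 1024,
# the padding convention used by the gpt-fast int4 packed-weight path.
padded_in_features = find_multiple(4000, 1024)  # -> 4096
```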

msaroufim (Member) commented:

This seems fine to merge, although I do worry that most of our GPTQ tests are disabled right now in test/quantization/test_quant_api.py.

msaroufim self-requested a review on Jul 5, 2024 at 20:56
msaroufim (Member) left a review comment:

Mostly looks fine, but FYI we don't really have anyone maintaining the GPTQ example, so if there's a use case for it please let me know.

larryliu0820 (Contributor, Author) replied:

> Mostly looks fine, but FYI we don't really have anyone maintaining the GPTQ example, so if there's a use case for it please let me know.

I'm migrating torchchat to use these APIs, in preparation for shared kernels across ExecuTorch (ET) and PyTorch eager/compile.

larryliu0820 merged commit 9f85488 into main on Jul 5, 2024
13 checks passed
msaroufim deleted the quant_dtype branch on Jul 5, 2024 at 21:55
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request on Jul 31, 2024:
Allow Int4WeightOnlyQuantizer to set different dtype for scales_and_zeros (pytorch#479)

* Allow Int4WeightOnlyQuantizer to set different dtype for scales_and_zeros

* Add comment
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request on Dec 9, 2024:
* Update quantize.py to use torchao Quantizers

Summary:

Remove duplicate code for Int4WeightOnlyQuantizer and
Int8DynActInt4WeightQuantizer and use torchao API.

Test Plan:

```
python torchchat.py generate llama2 --quantize '{"linear:int4": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
python torchchat.py generate llama2 --quantize '{"linear:a8w4dq": {"groupsize": 256}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --prompt "Once upon a time," --max-new-tokens 256
```
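For readers unfamiliar with the torchao side, here is a hedged sketch of what "use torchao API" could look like inside torchchat's quantize.py; the import paths, constructor arguments, and dispatch helper below are assumptions for illustration, not an excerpt from this commit (the `"linear:int4"` and `"linear:a8w4dq"` keys come from the test plan above):

```python
import torch

# Assumed import locations for the two quantizers the commit message mentions.
from torchao.quantization.GPTQ import Int4WeightOnlyQuantizer
from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer

def quantize_model(model: torch.nn.Module, scheme: str, groupsize: int,
                   precision: torch.dtype) -> torch.nn.Module:
    """Dispatch to a torchao quantizer instead of a duplicated local implementation."""
    if scheme == "linear:int4":
        quantizer = Int4WeightOnlyQuantizer(groupsize=groupsize, precision=precision)
    elif scheme == "linear:a8w4dq":
        quantizer = Int8DynActInt4WeightQuantizer(groupsize=groupsize, precision=precision)
    else:
        raise ValueError(f"unknown quantization scheme: {scheme}")
    return quantizer.quantize(model)
```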

* Fix import
* Install torchao from gh
* Explain import
* Fix dependencies
* Test ao PR pytorch#479
* Update torchao hash
* Update torchao pin
* Fix scheduler bf16/fp16 mix error
* Incorporate torchao changes
* update hash
* Fix GPU CI job
* More fix
* Fix executorch CI job
* Use quant api for int4 weight only quantization
* Fix
* Fix again
* Fix 3
* Fix 4
* Try something
* debug
* Only migrate 8a4w

---------

Co-authored-by: Jack Zhang <[email protected]>