Introduce 6-bit quantization for Llama in torchchat #1007

Merged
merged 1 commit into pytorch:main on Oct 3, 2024

Conversation

ramreddymounica (Contributor)

Summary:
This PR introduces the ability to use 6-bit quantization in torchao. Torchchat is PyTorch's solution for local LLM inference (https://github.com/pytorch/torchchat). With torchchat, users can quantize a large language model like Llama and run it locally on their machine. Quantization converts the model weights from float32 to a lower-precision representation (e.g., 4-bit integers) that takes less space while largely preserving model quality.

We currently offer 2-, 3-, 4-, and 5-bit model quantization in torchchat. This task adds a new 6-bit quantization scheme.
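As background on what an n-bit scheme means here: a 6-bit code has 64 representable levels (0-63), and each float32 weight is mapped onto one of those levels via a scale and zero point chosen per group of weights. The sketch below illustrates that mapping; the function names and signatures are hypothetical and are not torchao's actual quantization API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative only: map a float32 weight to an unsigned 6-bit code (0..63)
// using a per-group scale and zero point, and map it back. Hypothetical names,
// not torchao's actual API.
inline uint8_t quantize_to_uint6(float w, float scale, int zero_point) {
  int q = static_cast<int>(std::lround(w / scale)) + zero_point;
  return static_cast<uint8_t>(std::clamp(q, 0, 63));  // 6 bits -> 2^6 - 1 = 63
}

inline float dequantize_from_uint6(uint8_t q, float scale, int zero_point) {
  return scale * (static_cast<int>(q) - zero_point);
}
```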

Main changes:
- Added uint6.h, which contains the internal helper functions to pack/unpack 8, 64, and 128 uint6 values (see the packing sketch after this list).
- Modified bitpack.h to add case statements for 6-bit quantization to the general functions that perform vectorized packing/unpacking of 32, 64, and 128 values on ARM NEON vectors.
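As a rough illustration of what such pack/unpack helpers do: eight 6-bit values occupy exactly 48 bits, so they fit in 6 bytes instead of 8. The sketch below uses a plain little-endian bit concatenation; the actual layout and function signatures in uint6.h (and its NEON-vectorized 64- and 128-value variants) may differ.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative layout only (not necessarily the one used in uint6.h):
// value i occupies bits [6*i, 6*i + 6) of a 48-bit field, so 8 uint6 values
// fit exactly into 6 bytes. Assumes a little-endian host, which is the usual
// configuration for the ARM targets these kernels run on.
inline void pack_8_uint6(uint8_t* packed /*6 bytes*/, const uint8_t* unpacked /*8 values*/) {
  uint64_t bits = 0;
  for (int i = 0; i < 8; ++i) {
    bits |= static_cast<uint64_t>(unpacked[i] & 0x3F) << (6 * i);
  }
  std::memcpy(packed, &bits, 6);  // only the low 48 bits are meaningful
}

// Inverse of pack_8_uint6: recover the 8 original 6-bit values.
inline void unpack_8_uint6(uint8_t* unpacked /*8 values*/, const uint8_t* packed /*6 bytes*/) {
  uint64_t bits = 0;
  std::memcpy(&bits, packed, 6);
  for (int i = 0; i < 8; ++i) {
    unpacked[i] = (bits >> (6 * i)) & 0x3F;
  }
}
```

The 64- and 128-value variants would presumably apply the same idea across NEON registers so that many values are packed or unpacked per instruction.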
CONTEXT
Refer to previous diffs introducing 2-5 bit quantization. 2-bit: D62133659

Reviewed By: metascroy

Differential Revision: D63792020

pytorch-bot bot commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1007

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Oct 3, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D63792020

@facebook-github-bot merged commit 9ce7ebb into pytorch:main on Oct 3, 2024
3 of 5 checks passed
melvinebenezer pushed a commit to melvinebenezer/ao that referenced this pull request Oct 7, 2024
Differential Revision: D63792020

Pull Request resolved: pytorch#1007
Labels: CLA Signed, fb-exported
Projects: None yet
3 participants