Introduce 6-bit quantization for Llama in torchchat #1007

ramreddymounica · 2024-10-03T21:40:44Z

Summary:
This change adds 6-bit quantization to torchao. Torchchat is PyTorch's solution for local LLM inference (https://github.com/pytorch/torchchat). With torchchat, users can quantize a large language model such as Llama and run it locally on their machine. Quantization converts the model weights from float32 to a representation that takes less space (e.g., 4-bit integers) without compromising model quality too much.

We currently offer 2-, 3-, 4-, and 5-bit model quantization in torchchat. This task adds a new 6-bit quantization scheme.
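To illustrate what mapping float32 weights to 6-bit integers means, here is a minimal sketch of generic affine (min/max) quantization into the range [0, 63]. It is not the scheme torchao or torchchat actually uses; the names `Quantized6`, `quantize_6bit`, and `dequantize_6bit` are hypothetical and exist only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for this illustration: one scale/offset per tensor.
struct Quantized6 {
  std::vector<uint8_t> values;  // each entry is in [0, 63], i.e. fits in 6 bits
  float scale;                  // step size between adjacent quantized levels
  float min;                    // float value represented by quantized level 0
};

// Map each float to the nearest of 64 evenly spaced levels between the
// smallest and largest weight (assumption: plain min/max affine quantization).
inline Quantized6 quantize_6bit(const std::vector<float>& w) {
  assert(!w.empty());
  const float lo = *std::min_element(w.begin(), w.end());
  const float hi = *std::max_element(w.begin(), w.end());
  // Guard against a constant tensor, where hi == lo would give scale 0.
  const float scale = (hi > lo) ? (hi - lo) / 63.0f : 1.0f;

  Quantized6 q{{}, scale, lo};
  q.values.reserve(w.size());
  for (float x : w) {
    long level = std::lround((x - lo) / scale);
    level = std::max(0L, std::min(63L, level));  // clamp to the 6-bit range
    q.values.push_back(static_cast<uint8_t>(level));
  }
  return q;
}

// Recover an approximation of the original float for element i.
inline float dequantize_6bit(const Quantized6& q, std::size_t i) {
  return q.min + q.scale * static_cast<float>(q.values[i]);
}
```

With this sketch, the reconstruction error of any element is at most about half of `scale`, which is the sense in which more bits (6 instead of 4 or 5) trade storage for accuracy.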

Main changes:
Added uint6.h, which contains the internal helper functions to pack and unpack 8 bytes, 64 bytes, and 128 bytes of uint6s.
Modified bitpack.h to add case statements for 6-bit quantization in the general functions that perform vectorized packing/unpacking on ARM NEON vectors (32, 64, and 128 values).

Context:
Refer to the previous diffs introducing 2-5 bit quantization. 2-bit: D62133659
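For readers unfamiliar with bit packing, the idea behind the helpers described under the main changes can be sketched in scalar C++: four 6-bit values occupy 24 bits, so they fit exactly into three bytes. The bit layout below is an assumption chosen for clarity and is not necessarily the layout used in uint6.h, which operates on NEON vectors; `pack_4_uint6` and `unpack_4_uint6` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar helper: pack four 6-bit values (each stored in the low
// 6 bits of its own byte) into three bytes, least significant bits first.
inline std::array<uint8_t, 3> pack_4_uint6(const std::array<uint8_t, 4>& v) {
  for (uint8_t x : v) {
    assert(x < 64);  // every input must already fit in 6 bits
  }
  return {
      static_cast<uint8_t>(v[0] | (v[1] << 6)),         // v0, low 2 bits of v1
      static_cast<uint8_t>((v[1] >> 2) | (v[2] << 4)),  // high 4 bits of v1, low 4 bits of v2
      static_cast<uint8_t>((v[2] >> 4) | (v[3] << 2)),  // high 2 bits of v2, all of v3
  };
}

// Inverse of pack_4_uint6: recover the four 6-bit values from three bytes.
inline std::array<uint8_t, 4> unpack_4_uint6(const std::array<uint8_t, 3>& p) {
  return {
      static_cast<uint8_t>(p[0] & 0x3F),
      static_cast<uint8_t>(((p[0] >> 6) | (p[1] << 2)) & 0x3F),
      static_cast<uint8_t>(((p[1] >> 4) | (p[2] << 4)) & 0x3F),
      static_cast<uint8_t>(p[2] >> 2),
  };
}
```

Packing this way stores the values in 75% of the space of one value per byte. A vectorized version would apply the same shifts and ORs to whole registers at once instead of to single bytes.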

Reviewed By: metascroy

Differential Revision: D63792020

pytorch-bot · 2024-10-03T21:40:48Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1007

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-10-03T21:41:05Z

This pull request was exported from Phabricator. Differential Revision: D63792020

Differential Revision: D63792020 Pull Request resolved: pytorch#1007

facebook-github-bot added the CLA Signed label Oct 3, 2024

facebook-github-bot added the fb-exported label Oct 3, 2024

metascroy approved these changes Oct 3, 2024

facebook-github-bot merged commit 9ce7ebb into pytorch:main Oct 3, 2024
3 of 5 checks passed

melvinebenezer pushed a commit to melvinebenezer/ao that referenced this pull request Oct 7, 2024

Introduce 6-bit quantization for Llama in torchchat

052415b

Differential Revision: D63792020 Pull Request resolved: pytorch#1007
