
Supporting llama int4 inference using AutoGPTQ in HPU (#166) #1125

Closed
HolyFalafel wants to merge 3 commits into huggingface:main from HabanaAI:dev/danny/uint4_readme_us

Conversation

@HolyFalafel (Contributor) commented Jul 4, 2024:

Added support for AutoGPTQ when loading a quantized model and running inference on HPU. This will be available in v1.17. An illustrative usage sketch follows the commit list below.

* Supporting llama int4 quantization using AutoGPTQ

* cleanups in int4

* Blocking running hqt with int4

* Rename int4 param to gptq

* Added call to preprocessing in gptq

* Added call to preprocessing in gptq fix

* Added call to preprocessing in gptq fix2

* Removed call to preprocessing (found a better solution on AutoGPTQ)

* Fixed deprecated message for exllama
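For orientation, here is a minimal sketch of what UINT4 GPTQ inference on HPU looks like from the user side. The checkpoint name is hypothetical, and the plain `transformers` loading path is an assumption; the PR itself wires AutoGPTQ support through optimum-habana's text-generation example, so treat this as illustrative rather than the PR's exact API.

```python
# Illustrative sketch only: loading a pre-quantized GPTQ Llama checkpoint
# and generating on HPU. Assumes the HabanaAI AutoGPTQ fork is installed
# (see the install note further down) and that transformers' GPTQ
# integration picks up the checkpoint's quantization_config.
import torch
import habana_frameworks.torch.core  # noqa: F401  (registers the "hpu" device)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # hypothetical pre-quantized UINT4 checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantized weights are loaded as-is; no on-the-fly quantization happens here,
# matching the PR's scope of UINT4 inference for pre-quantized models only.
model = AutoModelForCausalLM.from_pretrained(model_id).to("hpu")
model.eval()

inputs = tokenizer("Deep learning is", return_tensors="pt").to("hpu")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Judging from the commit messages above, the run-time switch ended up as a `gptq` parameter (renamed from `int4`), and running HQT (Habana's quantization toolkit) together with int4 appears to be blocked.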
@HolyFalafel requested a review from a user on Jul 4, 2024 at 14:29
@HolyFalafel requested a review from regisss as a code owner on Jul 4, 2024 at 14:29
@libinta added the `synapse 1.17_dependency` label (PR not backward compatible; can be merged only when Synapse 1.17 is available) on Jul 9, 2024
mounikamandava added a commit to emascarenhas/optimum-habana that referenced this pull request on Aug 2, 2024
@libinta removed the `synapse 1.17_dependency` label on Aug 5, 2024
@emascarenhas (Contributor) commented:
Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

@yafshar (Contributor) commented Sep 10, 2024:

@HolyFalafel, please sync this PR with main and ping me to wrap it up. Thanks



> Llama2-7b in UINT4 is enabled using [AutoGPTQ Fork](https://github.com/HabanaAI/AutoGPTQ), which provides quantization capabilities in PyTorch.
> Currently, the support is for UINT4 inference of pre-quantized models only.
@yafshar (Contributor) commented on the diff, Sep 10, 2024:
@HolyFalafel please add the AutoGPTQ installation here:

```bash
BUILD_CUDA_EXT=0 pip install auto-gptq --no-build-isolation
```
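For context on the flags: `BUILD_CUDA_EXT=0` skips compiling auto-gptq's CUDA kernels at install time, which are not usable on HPU, and `--no-build-isolation` makes pip build against the already-installed torch rather than a fresh isolated build environment.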

@HolyFalafel (Contributor, Author) commented:
#1364 replaces this PR

@HolyFalafel closed this on Oct 6, 2024
