Hqq serialization #33141

SunMarc merged 29 commits into huggingface:main from mobiusml:hqq_serialization
Conversation
SunMarc left a comment:
Nice! Let's fix the issue regarding the torchao backend and we can merge this. I left a few comments.
|
2/3: Multi-gpu loading
3/3: state_dict on the same safetensor chunk (see the sketch after this comment)

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct' # OK
model_id = 'meta-llama/Meta-Llama-3-70B' # OK
model_id = "facebook/opt-125m" # OK
model_id = "meta-llama/Llama-2-13b-chat-hf" # OK
model_id = "microsoft/Phi-3-mini-128k-instruct" # OK
model_id = "google/gemma-2-9b-it" # OK
model_id = "google/gemma-2-2b" # OK

so I think for the moment we can leave it until someone reports an issue; I can't reproduce the problem anyway. Next steps:
|
|
Regarding this: #33141 (comment) |
|
Just out of curiosity, what's missing before this can be merged?
Waiting for @mobicham to check the latest review and give me the heads-up to merge! This should be done soon! Also, it looks like there are some conflicts to fix.
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
|
Thanks for iterating @mobicham! Merging! |
* HQQ model serialization attempt
* fix hqq dispatch and unexpected keys
* style
* remove check_old_param
* revert to check HQQLinear in quantizer_hqq.py
* revert to check HQQLinear in quantizer_hqq.py
* update HqqConfig default params
* make ci happy
* make ci happy
* revert to HQQLinear check in quantizer_hqq.py
* check hqq_min version 0.2.0
* set axis=1 as default in quantization_config.py
* validate_env with hqq>=0.2.0 version message
* deprecated hqq kwargs message
* make ci happy
* remove run_expected_keys_check hack + bump to 0.2.1 min hqq version
* fix unexpected_keys hqq update
* add pre_quantized check
* add update_expected_keys to base quantizer
* ci base.py fix?
* ci base.py fix?
* fix "quantization typo" src/transformers/utils/quantization_config.py (Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>)
* fix post merge

Co-authored-by: Marc Sun <marc@huggingface.co>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
|
@mobicham minor documentation issue, but the transformers documentation page for quantization has a giant features matrix which still says serialization of HQQ models is not supported: https://huggingface.co/docs/transformers/main/quantization/overview
|
Would you like to open a PR to fix this @rohit-gupta ? |
|
@rohit-gupta thanks for flagging ! |
|
Now model.save_pretrained(save_path) gives this:
|
@blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?
I think so. I didn't have this problem when hqq support was first released in transformers.
|
Transformers version 4.48.0.dev0 still has this problem... |
|
Can anyone from the HF team track down this problem, please? What changed? Nothing much changed on the hqq lib side.
|
@SunMarc ? |
|
Can you share your script @blap ? I'll have a look asap ! |
Error: |
|
So... |
|
@blap why don't you use the latest release? It worked fine the last time I tried (last week).
Which version do you use? Version 4.45.2 gives me this:
|
@blap |
I just got the same error in this version too. |
# pip install transformers==4.47.0
# pip install hqq --upgrade
##################################################################
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_model = "quant_model"

quant_config = HqqConfig(nbits=4, group_size=64, axis=1)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    cache_dir=".",
    device_map="cuda:0",
    quantization_config=quant_config,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.save_pretrained(quant_model)
tokenizer.save_pretrained(quant_model)
|
I found the problem: |
Hmm interesting, thanks for flagging! Fixed here. Would recommend using 64 or 128 though; some of the fast kernels like Marlin in vLLM and TinyGemm in torchao don't support other group sizes.
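To make that concrete, a config following the recommendation might look like this (a sketch; nbits and axis are carried over from the script above):

```python
from transformers import HqqConfig

# group_size=64 (or 128) keeps the quantized layout compatible with fast
# kernels such as Marlin (vLLM) and TinyGemm (torchao).
quant_config = HqqConfig(nbits=4, group_size=64, axis=1)
```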
Follow-up to #32379
The goal of this PR is to add full support for saving/loading HQQ-quantized models directly in transformers. So far, serialization was done on the hqq-lib side via the .pt format, which is not safe and doesn't work with very large models (>100B params) since the model is not sharded.

What was done in this PR:

* Added an update_expected_keys() call in the quantizer. This allows loading quantized models whose modules were initialized with torch.nn.Linear instead of HQQLinear.

Full gist to try it out: https://gist.github.com/mobicham/701dd564c52590203ee09631425ad797
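To round this off, here is a minimal save/load round trip in the spirit of the linked gist (a sketch only; the model id, save directory, and generation call are illustrative and not the exact gist contents):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
save_dir = "llama3-8b-hqq-4bit"

# Quantize on the fly with HQQ, then serialize directly through transformers.
quant_config = HqqConfig(nbits=4, group_size=64, axis=1)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda:0",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained(save_dir)      # sharded safetensors, no .pt file involved
tokenizer.save_pretrained(save_dir)

# Reload the pre-quantized checkpoint; the quantizer's update_expected_keys()
# maps the saved HQQ state dict onto modules that were created as torch.nn.Linear.
reloaded = AutoModelForCausalLM.from_pretrained(
    save_dir,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(reloaded.generate(**inputs, max_new_tokens=16)[0]))
```

On reload, from_pretrained picks up the quantization config stored in the checkpoint, so no HqqConfig needs to be passed again.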