Merged
44 commits
7f4aa05
Resolve vptq conflict
elvircrn Dec 27, 2024
ac4b142
Rename spqr package to spqr_quant
elvircrn Nov 25, 2024
8c3f5f1
Get rid of aqlm mention
elvircrn Nov 27, 2024
ff61b8e
Start working on tests
elvircrn Nov 27, 2024
23c3a24
Resolve ruff code checks
elvircrn Nov 27, 2024
0cb5ba7
Ruff format
elvircrn Nov 27, 2024
c980e66
Isort
elvircrn Nov 27, 2024
163983b
Test updates
elvircrn Nov 29, 2024
f51d3d1
Add gpu tag
elvircrn Dec 2, 2024
24ca92f
Rename to modules_to_not_convert
elvircrn Dec 2, 2024
913fbcb
Config update
elvircrn Dec 2, 2024
d433165
Docs and config update
elvircrn Dec 2, 2024
5582beb
Docs and config update
elvircrn Dec 2, 2024
3d64f88
Update to update_torch_dtype
elvircrn Dec 2, 2024
81237de
spqr config parameter validation
elvircrn Dec 2, 2024
1dacd50
Ruff update
elvircrn Dec 2, 2024
c1a4304
Apply ruff fixes
elvircrn Dec 2, 2024
53c53c0
Test fixes
elvircrn Dec 2, 2024
c21c412
Ruff update
elvircrn Dec 2, 2024
dc89200
Mark tests as @slow again; Ruff; Docstring update
elvircrn Dec 2, 2024
64929f7
Ruff
elvircrn Dec 2, 2024
fada970
Remove absolute path
elvircrn Dec 9, 2024
4694339
Resolve typo
elvircrn Dec 9, 2024
92ea493
Remove redundandt log
elvircrn Dec 9, 2024
1494453
Check accelerate/spqr availability
elvircrn Dec 9, 2024
9e8f470
Ruff fix
elvircrn Dec 9, 2024
525dcdf
Check if the config contains proper shapes
elvircrn Dec 9, 2024
1a54d86
Ruff test
elvircrn Dec 9, 2024
68afc89
Documentation update
elvircrn Dec 13, 2024
0eff944
overview update
elvircrn Dec 13, 2024
82e7f4e
Ruff checks
elvircrn Dec 27, 2024
274d368
Ruff code quality
elvircrn Dec 27, 2024
a630d6d
Make style
elvircrn Dec 27, 2024
96b2613
Update docs/source/en/quantization/spqr.md
elvircrn Jan 9, 2025
17d1c72
Update spqr.md
elvircrn Jan 9, 2025
55b50c7
Enable gptqmodel (#35012)
jiqing-feng Jan 15, 2025
c4273e2
Fix : Nemotron Processor in GGUF conversion (#35708)
MekkCyber Jan 15, 2025
5aac5e3
Merge branch 'main' into spqr-quantizer
elvircrn Feb 7, 2025
95d2e74
Update docs/source/en/quantization/spqr.md
elvircrn Feb 13, 2025
3fdc0c3
Add missing TOC to doc
elvircrn Feb 13, 2025
14f21c1
Merge branch 'huggingface:main' into spqr-quantizer
elvircrn Feb 13, 2025
cdefeaf
Merge branch 'main' into spqr-quantizer
elvircrn Feb 13, 2025
8da4a66
Merge branch 'main' into spqr-quantizer
MekkCyber Feb 13, 2025
afff70e
Merge branch 'main' into spqr-quantizer
MekkCyber Feb 13, 2025
3 changes: 3 additions & 0 deletions docker/transformers-quantization-latest-gpu/Dockerfile
@@ -53,6 +53,9 @@ RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
# Add vptq for quantization testing
RUN python3 -m pip install --no-cache-dir vptq

# Add spqr for quantization testing
RUN python3 -m pip install --no-cache-dir spqr_quant[gpu]

# Add hqq for quantization testing
RUN python3 -m pip install --no-cache-dir hqq

2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -166,6 +166,8 @@
- local: quantization/aqlm
title: AQLM
- local: quantization/vptq
title: VPTQ
- local: quantization/spqr
title: SpQR
- local: quantization/quanto
title: Quanto
4 changes: 4 additions & 0 deletions docs/source/en/main_classes/quantization.md
@@ -81,6 +81,10 @@ Learn how to quantize models in the [Quantization](../quantization) guide.

[[autodoc]] BitNetConfig

## SpQRConfig

[[autodoc]] SpQRConfig

## FineGrainedFP8Config

[[autodoc]] FineGrainedFP8Config
1 change: 1 addition & 0 deletions docs/source/en/quantization/overview.md
@@ -61,6 +61,7 @@ Use the table below to help you decide which quantization method to use.
| [FBGEMM_FP8](./fbgemm_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
| [torchao](./torchao.md) | 🟢 | | 🟢 | 🔴 | 🟡 <sub>5</sub> | 🔴 | | 4/8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
| [VPTQ](./vptq.md) | 🔴 | 🔴 | 🟢 | 🟡 | 🔴 | 🔴 | 🟢 | 1/8 | 🔴 | 🟢 | 🟢 | https://github.com/microsoft/VPTQ |
| [SpQR](./spqr.md) | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 3 | 🔴 | 🟢 | 🟢 | https://github.com/Vahe1994/SpQR/ |
| [FINEGRAINED_FP8](./finegrained_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | |
<Tip>

35 changes: 35 additions & 0 deletions docs/source/en/quantization/spqr.md
@@ -0,0 +1,35 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# SpQR

The [SpQR](https://github.com/Vahe1994/SpQR) quantization algorithm uses a 16x16 tiled bi-level group 3-bit quantization structure with sparse outliers, as detailed in [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078).

To SpQR-quantize a model, refer to the [Vahe1994/SpQR](https://github.com/Vahe1994/SpQR) repository.

Load a model that has already been quantized with SpQR using [`~PreTrainedModel.from_pretrained`].

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

quantized_model = AutoModelForCausalLM.from_pretrained(
"elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf",
torch_dtype=torch.half,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
```
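
The loaded model can then be used for inference as usual. Below is a minimal, illustrative sketch (not part of this PR's diff) that uses the standard `generate` API with the model and tokenizer loaded above; the prompt and generation settings are arbitrary:

```python
# Run a short generation with the SpQR-quantized model loaded above.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```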
Collaborator
We can't quantize a model on the fly with quantization_config=... using spqr as the quantization type?

Contributor Author
We can't; this is not supported yet. This PR only adds inference support.

2 changes: 2 additions & 0 deletions src/transformers/__init__.py
@@ -1029,6 +1029,7 @@
"HiggsConfig",
"HqqConfig",
"QuantoConfig",
"SpQRConfig",
"TorchAoConfig",
"VptqConfig",
],
@@ -6202,6 +6203,7 @@
HiggsConfig,
HqqConfig,
QuantoConfig,
SpQRConfig,
TorchAoConfig,
VptqConfig,
)
2 changes: 2 additions & 0 deletions src/transformers/integrations/__init__.py
@@ -106,6 +106,7 @@
],
"peft": ["PeftAdapterMixin"],
"quanto": ["replace_with_quanto_layers"],
"spqr": ["replace_with_spqr_linear"],
"vptq": ["replace_with_vptq_linear"],
}

@@ -210,6 +211,7 @@
)
from .peft import PeftAdapterMixin
from .quanto import replace_with_quanto_layers
from .spqr import replace_with_spqr_linear
from .vptq import replace_with_vptq_linear

try:
122 changes: 122 additions & 0 deletions src/transformers/integrations/spqr.py
@@ -0,0 +1,122 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"SpQR (Sparse-Quantized Representation) integration file"

from ..utils import is_accelerate_available, is_spqr_available, is_torch_available


if is_torch_available():
import torch.nn as nn


def replace_with_spqr_linear(
model,
quantization_config=None,
modules_to_not_convert=None,
current_key_name=None,
has_been_replaced=False,
):
"""
Public method that recursively replaces the Linear layers of the given model with SpQR quantized layers.
`accelerate` is needed to use this method. Returns the converted model and a boolean that indicates if the
conversion has been successful or not.

Args:
model (`torch.nn.Module`):
The model to convert, can be any `torch.nn.Module` instance.
quantization_config (`SpQRConfig`):
The quantization config object that contains the quantization parameters.
modules_to_not_convert (`list[str]`, *optional*):
A list of nn.Linear weights to not convert. If a parameter path is in the list (e.g. `lm_head.weight`), the corresponding module will not be
converted.
current_key_name (`list`, *optional*):
A list that contains the current key name. This is used for recursion and should not be passed by the user.
has_been_replaced (`bool`, *optional*):
A boolean that indicates if the conversion has been successful or not. This is used for recursion and
should not be passed by the user.
"""
if modules_to_not_convert is None:
modules_to_not_convert = []

if is_accelerate_available():
from accelerate import init_empty_weights
if is_spqr_available():
from spqr_quant import QuantizedLinear
Comment on lines +52 to +55
Contributor
Yes, that's exactly what I meant, thank you for fixing it!


for name, module in model.named_children():
if current_key_name is None:
current_key_name = []
current_key_name.append(name)

if isinstance(module, nn.Linear):
# Check if the current key is not in the `modules_to_not_convert`
if ".".join(current_key_name) + ".weight" not in modules_to_not_convert:
with init_empty_weights():
tensor_name = ".".join(current_key_name)

shapes = quantization_config.shapes
shapes_keys = shapes.keys()

shapes_valid = (
f"{tensor_name}.dense_weights.shape" in shapes_keys
and f"{tensor_name}.row_offsets.shape" in shapes_keys
and f"{tensor_name}.col_vals.shape" in shapes_keys
and f"{tensor_name}.in_perm.shape" in shapes_keys
)

if not shapes_valid:
raise ValueError(
f"The SpQR quantization config does not contain the shape "
f"configuration for {tensor_name}. This indicates that the "
f"configuration is either invalid or corrupted."
)

dense_weights_shape = shapes[f"{tensor_name}.dense_weights.shape"]
row_offsets_shape = shapes[f"{tensor_name}.row_offsets.shape"]
col_vals_shape = shapes[f"{tensor_name}.col_vals.shape"]
in_perm_shape = shapes[f"{tensor_name}.in_perm.shape"]

in_features = module.in_features
out_features = module.out_features

model._modules[name] = QuantizedLinear.create_placehodler(
rows=out_features,
cols=in_features,
bits=quantization_config.bits,
beta1=quantization_config.beta1,
beta2=quantization_config.beta2,
dense_weights_shape=dense_weights_shape,
row_offsets_shape=row_offsets_shape,
col_vals_shape=col_vals_shape,
in_perm_shape=in_perm_shape,
)
has_been_replaced = True

Comment on lines +93 to +105
Contributor
Just FMI, are the beta and shapes parameters used solely for loading, or do they also play a role in quantization? In other words, if a model is quantized using specific beta or shapes values, can it only be loaded with those same parameters?

Contributor Author
SpQR quantized weights are the product of tile-based bi-level quantization (beta corresponds to the dimensions of this tile, where beta1 and beta2 are the tile width and height, respectively). You always specify beta before quantization starts.

Each SpQR weight also comes with a sparse tensor representing the outlier weights. Since this tensor is unstructured, its size has to be tracked so it can be reconstructed during loading.

Contributor Author
The following is a visualization of the compression format from the original publication (https://arxiv.org/pdf/2306.03078):

[figure: visualization of the SpQR compressed-weight format from the paper]

Contributor
Thanks for the detailed explanation

Collaborator
Let's add it to the doc!
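
For illustration only, here is a minimal sketch (not from the PR) of the per-tensor entries that `replace_with_spqr_linear` looks up in `quantization_config.shapes` above. The layer name and values are hypothetical placeholders; the real values come from the quantized checkpoint:

```python
# Hypothetical shapes dictionary for a single quantized layer.
# Each converted nn.Linear needs four entries keyed by its module path:
# dense_weights, row_offsets, col_vals (sparse outliers), and in_perm.
shapes = {
    "model.layers.0.self_attn.q_proj.dense_weights.shape": 1_146_880,  # placeholder value
    "model.layers.0.self_attn.q_proj.row_offsets.shape": 4097,         # placeholder value
    "model.layers.0.self_attn.q_proj.col_vals.shape": 131_072,         # placeholder value
    "model.layers.0.self_attn.q_proj.in_perm.shape": 4096,             # placeholder value
}
```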

# Store the module class in case we need to transpose the weight later
model._modules[name].source_cls = type(module)
# Force requires grad to False to avoid unexpected errors
model._modules[name].requires_grad_(False)
else:
pass
if len(list(module.children())) > 0:
_, has_been_replaced = replace_with_spqr_linear(
module,
quantization_config=quantization_config,
modules_to_not_convert=modules_to_not_convert,
current_key_name=current_key_name,
has_been_replaced=has_been_replaced,
)
# Remove the last key for recursion
current_key_name.pop(-1)
return model, has_been_replaced
4 changes: 4 additions & 0 deletions src/transformers/quantizers/auto.py
@@ -31,6 +31,7 @@
QuantizationConfigMixin,
QuantizationMethod,
QuantoConfig,
SpQRConfig,
TorchAoConfig,
VptqConfig,
)
@@ -47,6 +48,7 @@
from .quantizer_higgs import HiggsHfQuantizer
from .quantizer_hqq import HqqHfQuantizer
from .quantizer_quanto import QuantoHfQuantizer
from .quantizer_spqr import SpQRHfQuantizer
from .quantizer_torchao import TorchAoHfQuantizer
from .quantizer_vptq import VptqHfQuantizer

@@ -66,6 +68,7 @@
"torchao": TorchAoHfQuantizer,
"bitnet": BitNetHfQuantizer,
"vptq": VptqHfQuantizer,
"spqr": SpQRHfQuantizer,
"fp8": FineGrainedFP8HfQuantizer,
}

@@ -84,6 +87,7 @@
"torchao": TorchAoConfig,
"bitnet": BitNetConfig,
"vptq": VptqConfig,
"spqr": SpQRConfig,
"fp8": FineGrainedFP8Config,
}

83 changes: 83 additions & 0 deletions src/transformers/quantizers/quantizer_spqr.py
@@ -0,0 +1,83 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING, Optional

from .base import HfQuantizer


if TYPE_CHECKING:
from ..modeling_utils import PreTrainedModel

from ..integrations import replace_with_spqr_linear
from ..utils import is_accelerate_available, is_spqr_available, is_torch_available, logging
from ..utils.quantization_config import QuantizationConfigMixin


if is_torch_available():
import torch

logger = logging.get_logger(__name__)


class SpQRHfQuantizer(HfQuantizer):
"""
Quantizer of the SpQR method. Enables the loading of prequantized models.
"""

def __init__(self, quantization_config: QuantizationConfigMixin, **kwargs):
super().__init__(quantization_config, **kwargs)
self.quantization_config = quantization_config

def validate_environment(self, *args, **kwargs):
if not torch.cuda.is_available():
raise RuntimeError("GPU is required to run SpQR quantized model.")

if not is_accelerate_available():
raise ImportError("Using `spqr` quantization requires Accelerate: `pip install accelerate`")

if not is_spqr_available():
raise ImportError("Using `spqr` quantization requires SpQR: `pip install spqr_quant[gpu]`")

def update_torch_dtype(self, torch_dtype: "torch.dtype") -> "torch.dtype":
if torch_dtype is None:
torch_dtype = torch.float16
logger.info("Assuming SpQR inference on GPU and loading the model in `torch.float16`.")
elif torch_dtype != torch.float16:
raise ValueError(
"You cannot use any type other than torch.float16 for SpQR. Please either leave it None or set it to"
"torch.float16 explicitly."
)
return torch_dtype

def _process_model_before_weight_loading(
self,
model: "PreTrainedModel",
**kwargs,
):
replace_with_spqr_linear(
model,
quantization_config=self.quantization_config,
modules_to_not_convert=self.quantization_config.modules_to_not_convert,
)
model.config.quantization_config = self.quantization_config

def _process_model_after_weight_loading(self, model: "PreTrainedModel", **kwargs):
return model

@property
def is_trainable(self, model: Optional["PreTrainedModel"] = None):
return False

def is_serializable(self, safe_serialization=None):
return True
8 changes: 8 additions & 0 deletions src/transformers/testing_utils.py
@@ -121,6 +121,7 @@
is_seqio_available,
is_soundfile_available,
is_spacy_available,
is_spqr_available,
is_sudachi_available,
is_sudachi_projection_available,
is_tensorflow_probability_available,
@@ -1191,6 +1192,13 @@ def require_vptq(test_case):
return unittest.skipUnless(is_vptq_available(), "test requires vptq")(test_case)


def require_spqr(test_case):
"""
Decorator marking a test that requires spqr
"""
return unittest.skipUnless(is_spqr_available(), "test requires spqr")(test_case)


def require_eetq(test_case):
"""
Decorator marking a test that requires eetq
1 change: 1 addition & 0 deletions src/transformers/utils/__init__.py
@@ -193,6 +193,7 @@
is_soundfile_available,
is_spacy_available,
is_speech_available,
is_spqr_available,
is_sudachi_available,
is_sudachi_projection_available,
is_tensorflow_probability_available,
5 changes: 5 additions & 0 deletions src/transformers/utils/import_utils.py
@@ -201,6 +201,7 @@ def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[
_blobfile_available = _is_package_available("blobfile")
_liger_kernel_available = _is_package_available("liger_kernel")
_triton_available = _is_package_available("triton")
_spqr_available = _is_package_available("spqr_quant")

_torch_version = "N/A"
_torch_available = False
@@ -1213,6 +1214,10 @@ def is_speech_available():
return _torchaudio_available


def is_spqr_available():
return _spqr_available


def is_phonemizer_available():
return _phonemizer_available
