
Fp8 Quantization Support #62

Merged
Satrat merged 26 commits into main from sa/fp8
Jun 20, 2024

Conversation


@Satrat Satrat commented May 22, 2024

Adds a new fp8 quantization format. For now, fp8 is assumed to be torch.float8_e4m3fn; in the future we could expand to support torch.float8_e5m2 as well by extending QuantizationArgs.
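
For reference, the two fp8 formats trade mantissa bits for exponent range; a quick way to see the resulting ranges (illustrative only, requires PyTorch 2.1+ where torch.finfo supports float8 dtypes):

import torch

# e4m3fn: more precision, smaller range; e5m2: less precision, larger range
print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0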

The main change here is adding some additional checks to deal with float vs int quantization, as the range and rounding are calculated differently. Since the logic for fp8 compression is the same as int8 aside from a difference in the cast, I merged them into a single compressor. However, the int8 compressor can still be referenced by its original "int-quantized" name, so this won't break anything on the sparseml or vllm side.
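
As a rough sketch of the difference (illustrative only, not the compressor code in this PR), the int path derives its range from num_bits and rounds before casting, while the float path derives its range from the fp8 dtype and rounds as part of the cast:

import torch

def int8_quantize(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # integer path: range comes from num_bits, values are rounded then cast
    q_min, q_max = -128, 127
    return torch.clamp(torch.round(x / scale), q_min, q_max).to(torch.int8)

def fp8_quantize(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # float path: range comes from the fp8 dtype itself, rounding happens in the cast
    finfo = torch.finfo(torch.float8_e4m3fn)
    return torch.clamp(x / scale, finfo.min, finfo.max).to(torch.float8_e4m3fn)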

Testing

Added additional unit tests to test compression/decompression and scale/zp calculations.
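
A minimal sketch of the kind of round-trip check covered (names and tolerances here are illustrative, not the actual test code in this PR):

import torch

def test_fp8_round_trip():
    weight = torch.randn(64, 64)
    finfo = torch.finfo(torch.float8_e4m3fn)
    # symmetric, per-tensor scale derived from the fp8 representable range
    scale = weight.abs().max() / finfo.max
    compressed = torch.clamp(weight / scale, finfo.min, finfo.max).to(torch.float8_e4m3fn)
    decompressed = compressed.to(torch.float32) * scale
    # fp8 is coarse, so only expect agreement within a loose tolerance
    assert torch.allclose(weight, decompressed, atol=0.5)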

Example/Evaluation

Requires sparseml FP8 PR to run: neuralmagic/sparseml#2306

import torch
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot


# define a sparseml recipe for GPTQ floating point W8A8 quantization
recipe = """
test_stage:
    quant_modifiers:
        GPTQModifier:
            sequential_update: false
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: "float"
                        symmetric: true
                        strategy: "tensor"
                    input_activations:
                        num_bits: 8
                        type: "float"
                        symmetric: true
                        strategy: "tensor"
                    targets: ["Linear"]
"""

# setting device_map to auto to spread the model evenly across all available GPUs
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float32, device_map="auto"
)

# uses SparseML's built-in preprocessing for ultra chat
dataset = "ultrachat-200k"

# save location of quantized model out
output_dir = "/network/sadkins/llama1.1b_fp8_gptq"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]"}
max_seq_length = 512
pad_to_max_length = False
num_calibration_samples = 512

# apply recipe to the model and save quantized output compressed to fp8
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir=output_dir,
    splits=splits,
    max_seq_length=max_seq_length,
    pad_to_max_length=pad_to_max_length,
    num_calibration_samples=num_calibration_samples,
    save_compressed=True
)

Evaluated with sparseml.evaluate /network/sadkins/llama1.1b_fp8_gptq -d wikitext -i lm-evaluation-harness. Perplexity looks good at 14.53; it was 14.43 for the dense input model.

@Satrat Satrat marked this pull request as ready for review May 22, 2024 19:22
Member

@mgoin mgoin left a comment

nice work!

bfineran previously approved these changes May 28, 2024
@Satrat Satrat requested a review from bfineran May 29, 2024 21:51
@Satrat Satrat requested a review from mgoin May 30, 2024 14:28
dbogunowicz previously approved these changes Jun 12, 2024
bfineran previously approved these changes Jun 17, 2024
@Satrat Satrat requested a review from dbogunowicz June 17, 2024 18:19
dbogunowicz previously approved these changes Jun 18, 2024
@Satrat Satrat dismissed stale reviews from dbogunowicz and bfineran via 7101f33 June 19, 2024 17:45
@Satrat Satrat requested a review from dbogunowicz June 20, 2024 13:58
@Satrat Satrat merged commit 75436f6 into main Jun 20, 2024
@Satrat Satrat deleted the sa/fp8 branch June 20, 2024 14:21
Etelis added a commit to Etelis/compressed-tensors that referenced this pull request Sep 11, 2025
* small fixes

* initial commit

* bug fixes

* cleanup

* clarity comments

* clean up compression classes

* fixing zero point issues

* comment for hack

* update quant check

* cleanup fp8 dtypes

* cleanup

* clean up observer

* dtype fix

* docstrings

* fixes after rebase

* test fixes

* style

* get rid of broken segment

* fix broken code
