Add support for QAT + LoRA by andrewor14 · Pull Request #2976 · unslothai/unsloth

andrewor14 · 2025-07-15T22:27:40Z

Summary: Quantization-aware training (QAT) helps mitigate quantization degradation by simulating quantization numerics in high precision during training (fake quantization). This PR combines QAT with LoRA by applying torchao's QAT support to the peft model.

See the following for more details:

torchao QAT: https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md
torchtune QAT + LoRA: https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700

Current QAT schemes supported are:

fp8-fp8, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel
fp8-int4, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel

Test Plan:

from unsloth import FastLanguageModel

lora_rank = 32

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = False,
    max_lora_rank = lora_rank,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [ 
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],  
    lora_alpha = lora_rank*2,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    qat_scheme = "fp8-fp8",
)

lora.Linear(
  (base_layer): FakeQuantizedLinear(
    in_features=2560, out_features=4096, bias=False
    (activation_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None))
    (weight_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None))
  )
  ... 
)

**Summary:** Quantization-aware training (QAT) helps mitigate quantization degradation by simulating quantization numerics in high precision during training (fake quantization). This PR combines QAT with LoRA by applying torchao's QAT support to the peft model. See the following for more details: - torchao QAT: https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md - torchtune QAT + LoRA: https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700 Current QAT schemes supported are: ``` fp8-fp8, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-int4, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** ``` from unsloth import FastLanguageModel lora_rank = 32 model, tokenizer = FastLanguageModel.from_pretrained( model_name = "unsloth/Qwen3-4B-Base", max_seq_length = 2048, load_in_4bit = False, fast_inference = False, max_lora_rank = lora_rank, ) model = FastLanguageModel.get_peft_model( model, r = lora_rank, target_modules = [ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", ], lora_alpha = lora_rank*2, use_gradient_checkpointing = "unsloth", random_state = 3407, qat_scheme = "fp8-fp8", ) lora.Linear( (base_layer): FakeQuantizedLinear( in_features=2560, out_features=4096, bias=False (activation_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None)) (weight_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None)) ) ... ) ```

jerryzh168 · 2025-08-13T21:15:38Z

+    if qat_scheme == "fp8-int4":
+        group_size = 128
+        base_config = Float8DynamicActivationInt4WeightConfig(group_size=group_size)
+        filter_fn = lambda m, _: isinstance(m, torch.nn.Linear) and m.in_features >= group_size


we could skip this in the config handler itself

we also plan to remove the group_size arg for now since there is no benefit of making it larger according to Josh

danielhanchen · 2025-08-13T21:20:45Z

+
+    For more details: https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700
+    """
+    try:


No need for try except since pyproject.toml will add it

**Summary:** Following unslothai#2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-fp8, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-int4, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/b0364ac3cb8aa114e46b39d848fa5c8b (ongoing)

**Summary:** Following unslothai#2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch: - Batch size = 16 (no grad accum) - Learning rate = 4e-5 - Quantization scheme = fp8-int4 Wikitext perplexity: - QAT improved perplexity by 19.2% compared to regular fine-tuning - QAT's int4 quantized model even outperformed the bf16 baseline - Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline ``` ==> unsloth_model_full_baseline_output/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.8446|± | N/A| ==> unsloth_model_full_baseline_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |11.4595|± | N/A| ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |9.2336|± | N/A| ``` Fibonacci test: - Both bf16 baseline and int4 quantized models correctly identified 13 as the next number - QAT quantized model was more succinct in its response - No substantial differences here ``` ### Instruction: Continue the fibonnaci sequence. ### Input: 1, 1, 2, 3, 5, 8 ==> unsloth_model_full_baseline_output/eval_float.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_baseline_output/eval_quantized.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== ### Response: 13<|end_of_text|> ```

**Summary:** Following #2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch: - Batch size = 16 (no grad accum) - Learning rate = 4e-5 - Quantization scheme = fp8-int4 Wikitext perplexity: - QAT improved perplexity by 19.2% compared to regular fine-tuning - QAT's int4 quantized model even outperformed the bf16 baseline - Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline ``` ==> unsloth_model_full_baseline_output/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.8446|± | N/A| ==> unsloth_model_full_baseline_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |11.4595|± | N/A| ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |9.2336|± | N/A| ``` Fibonacci test: - Both bf16 baseline and int4 quantized models correctly identified 13 as the next number - QAT quantized model was more succinct in its response - No substantial differences here ``` ### Instruction: Continue the fibonnaci sequence. ### Input: 1, 1, 2, 3, 5, 8 ==> unsloth_model_full_baseline_output/eval_float.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_baseline_output/eval_quantized.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== ### Response: 13<|end_of_text|> ```

**Summary:** Quantization-aware training (QAT) helps mitigate quantization degradation by simulating quantization numerics in high precision during training (fake quantization). This PR combines QAT with LoRA by applying torchao's QAT support to the peft model. See the following for more details: - torchao QAT: https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md - torchtune QAT + LoRA: https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700 Current QAT schemes supported are: ``` fp8-fp8, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-int4, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** ``` from unsloth import FastLanguageModel lora_rank = 32 model, tokenizer = FastLanguageModel.from_pretrained( model_name = "unsloth/Qwen3-4B-Base", max_seq_length = 2048, load_in_4bit = False, fast_inference = False, max_lora_rank = lora_rank, ) model = FastLanguageModel.get_peft_model( model, r = lora_rank, target_modules = [ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", ], lora_alpha = lora_rank*2, use_gradient_checkpointing = "unsloth", random_state = 3407, qat_scheme = "fp8-fp8", ) lora.Linear( (base_layer): FakeQuantizedLinear( in_features=2560, out_features=4096, bias=False (activation_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None)) (weight_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None)) ) ... ) ```

**Summary:** Following unslothai#2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch: - Batch size = 16 (no grad accum) - Learning rate = 4e-5 - Quantization scheme = fp8-int4 Wikitext perplexity: - QAT improved perplexity by 19.2% compared to regular fine-tuning - QAT's int4 quantized model even outperformed the bf16 baseline - Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline ``` ==> unsloth_model_full_baseline_output/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.8446|± | N/A| ==> unsloth_model_full_baseline_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |11.4595|± | N/A| ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |9.2336|± | N/A| ``` Fibonacci test: - Both bf16 baseline and int4 quantized models correctly identified 13 as the next number - QAT quantized model was more succinct in its response - No substantial differences here ``` ### Instruction: Continue the fibonnaci sequence. ### Input: 1, 1, 2, 3, 5, 8 ==> unsloth_model_full_baseline_output/eval_float.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_baseline_output/eval_quantized.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== ### Response: 13<|end_of_text|> ```

andrewor14 force-pushed the qat_lora branch 2 times, most recently from cdff9c6 to 879c1a9 Compare July 15, 2025 22:36

andrewor14 marked this pull request as draft July 16, 2025 15:04

andrewor14 force-pushed the qat_lora branch from 879c1a9 to 1a05768 Compare July 16, 2025 15:12

shimmyshimmer mentioned this pull request Jul 22, 2025

[Feature] QAT support #2993

Open

andrewor14 force-pushed the qat_lora branch from 1a05768 to 4f895ba Compare August 13, 2025 17:51

andrewor14 changed the title ~~DRAFT: Add support for QAT + LoRA~~ Add support for QAT + LoRA Aug 13, 2025

andrewor14 marked this pull request as ready for review August 13, 2025 17:52

andrewor14 force-pushed the qat_lora branch from 4f895ba to ce4eeb7 Compare August 13, 2025 17:56

andrewor14 force-pushed the qat_lora branch from ce4eeb7 to 3295a76 Compare August 13, 2025 20:17

jerryzh168 reviewed Aug 13, 2025

View reviewed changes

danielhanchen reviewed Aug 13, 2025

View reviewed changes

danielhanchen merged commit b23fe78 into unslothai:main Aug 18, 2025

andrewor14 mentioned this pull request Aug 29, 2025

Add support for QAT full fine-tuning #3238

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support for QAT + LoRA#2976

Add support for QAT + LoRA#2976
danielhanchen merged 1 commit into
unslothai:mainfrom
andrewor14:qat_lora

andrewor14 commented Jul 15, 2025 •

edited

Loading

Uh oh!

jerryzh168 Aug 13, 2025

Uh oh!

jerryzh168 Aug 13, 2025

Uh oh!

danielhanchen Aug 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

andrewor14 commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jerryzh168 Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

jerryzh168 Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

danielhanchen Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andrewor14 commented Jul 15, 2025 •

edited

Loading