Add support for QAT + LoRA#2976
Merged
Merged
Conversation
cdff9c6 to
879c1a9
Compare
**Summary:** Quantization-aware training (QAT) helps mitigate quantization degradation by simulating quantization numerics in high precision during training (fake quantization). This PR combines QAT with LoRA by applying torchao's QAT support to the peft model. See the following for more details: - torchao QAT: https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md - torchtune QAT + LoRA: https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700 Current QAT schemes supported are: ``` fp8-fp8, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-int4, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** ``` from unsloth import FastLanguageModel lora_rank = 32 model, tokenizer = FastLanguageModel.from_pretrained( model_name = "unsloth/Qwen3-4B-Base", max_seq_length = 2048, load_in_4bit = False, fast_inference = False, max_lora_rank = lora_rank, ) model = FastLanguageModel.get_peft_model( model, r = lora_rank, target_modules = [ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", ], lora_alpha = lora_rank*2, use_gradient_checkpointing = "unsloth", random_state = 3407, qat_scheme = "fp8-fp8", ) lora.Linear( (base_layer): FakeQuantizedLinear( in_features=2560, out_features=4096, bias=False (activation_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None)) (weight_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None)) ) ... ) ```
jerryzh168
reviewed
Aug 13, 2025
| if qat_scheme == "fp8-int4": | ||
| group_size = 128 | ||
| base_config = Float8DynamicActivationInt4WeightConfig(group_size=group_size) | ||
| filter_fn = lambda m, _: isinstance(m, torch.nn.Linear) and m.in_features >= group_size |
Contributor
There was a problem hiding this comment.
we could skip this in the config handler itself
Contributor
There was a problem hiding this comment.
we also plan to remove the group_size arg for now since there is no benefit of making it larger according to Josh
|
|
||
| For more details: https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700 | ||
| """ | ||
| try: |
Member
There was a problem hiding this comment.
No need for try except since pyproject.toml will add it
andrewor14
added a commit
to andrewor14/unsloth
that referenced
this pull request
Aug 29, 2025
**Summary:** Following unslothai#2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-fp8, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-int4, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/b0364ac3cb8aa114e46b39d848fa5c8b (ongoing)
andrewor14
added a commit
to andrewor14/unsloth
that referenced
this pull request
Sep 5, 2025
**Summary:** Following unslothai#2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch: - Batch size = 16 (no grad accum) - Learning rate = 4e-5 - Quantization scheme = fp8-int4 Wikitext perplexity: - QAT improved perplexity by 19.2% compared to regular fine-tuning - QAT's int4 quantized model even outperformed the bf16 baseline - Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline ``` ==> unsloth_model_full_baseline_output/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.8446|± | N/A| ==> unsloth_model_full_baseline_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |11.4595|± | N/A| ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |9.2336|± | N/A| ``` Fibonacci test: - Both bf16 baseline and int4 quantized models correctly identified 13 as the next number - QAT quantized model was more succinct in its response - No substantial differences here ``` ### Instruction: Continue the fibonnaci sequence. ### Input: 1, 1, 2, 3, 5, 8 ==> unsloth_model_full_baseline_output/eval_float.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_baseline_output/eval_quantized.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== ### Response: 13<|end_of_text|> ```
andrewor14
added a commit
to andrewor14/unsloth
that referenced
this pull request
Sep 8, 2025
**Summary:** Following unslothai#2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch: - Batch size = 16 (no grad accum) - Learning rate = 4e-5 - Quantization scheme = fp8-int4 Wikitext perplexity: - QAT improved perplexity by 19.2% compared to regular fine-tuning - QAT's int4 quantized model even outperformed the bf16 baseline - Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline ``` ==> unsloth_model_full_baseline_output/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.8446|± | N/A| ==> unsloth_model_full_baseline_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |11.4595|± | N/A| ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |9.2336|± | N/A| ``` Fibonacci test: - Both bf16 baseline and int4 quantized models correctly identified 13 as the next number - QAT quantized model was more succinct in its response - No substantial differences here ``` ### Instruction: Continue the fibonnaci sequence. ### Input: 1, 1, 2, 3, 5, 8 ==> unsloth_model_full_baseline_output/eval_float.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_baseline_output/eval_quantized.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== ### Response: 13<|end_of_text|> ```
andrewor14
added a commit
to andrewor14/unsloth
that referenced
this pull request
Sep 8, 2025
**Summary:** Following unslothai#2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch: - Batch size = 16 (no grad accum) - Learning rate = 4e-5 - Quantization scheme = fp8-int4 Wikitext perplexity: - QAT improved perplexity by 19.2% compared to regular fine-tuning - QAT's int4 quantized model even outperformed the bf16 baseline - Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline ``` ==> unsloth_model_full_baseline_output/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.8446|± | N/A| ==> unsloth_model_full_baseline_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |11.4595|± | N/A| ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |9.2336|± | N/A| ``` Fibonacci test: - Both bf16 baseline and int4 quantized models correctly identified 13 as the next number - QAT quantized model was more succinct in its response - No substantial differences here ``` ### Instruction: Continue the fibonnaci sequence. ### Input: 1, 1, 2, 3, 5, 8 ==> unsloth_model_full_baseline_output/eval_float.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_baseline_output/eval_quantized.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== ### Response: 13<|end_of_text|> ```
danielhanchen
pushed a commit
that referenced
this pull request
Sep 8, 2025
**Summary:** Following #2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch: - Batch size = 16 (no grad accum) - Learning rate = 4e-5 - Quantization scheme = fp8-int4 Wikitext perplexity: - QAT improved perplexity by 19.2% compared to regular fine-tuning - QAT's int4 quantized model even outperformed the bf16 baseline - Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline ``` ==> unsloth_model_full_baseline_output/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.8446|± | N/A| ==> unsloth_model_full_baseline_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |11.4595|± | N/A| ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |9.2336|± | N/A| ``` Fibonacci test: - Both bf16 baseline and int4 quantized models correctly identified 13 as the next number - QAT quantized model was more succinct in its response - No substantial differences here ``` ### Instruction: Continue the fibonnaci sequence. ### Input: 1, 1, 2, 3, 5, 8 ==> unsloth_model_full_baseline_output/eval_float.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_baseline_output/eval_quantized.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== ### Response: 13<|end_of_text|> ```
abiswas-realadvice
pushed a commit
to abiswas-realadvice/unsloth
that referenced
this pull request
May 14, 2026
**Summary:** Quantization-aware training (QAT) helps mitigate quantization degradation by simulating quantization numerics in high precision during training (fake quantization). This PR combines QAT with LoRA by applying torchao's QAT support to the peft model. See the following for more details: - torchao QAT: https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md - torchtune QAT + LoRA: https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700 Current QAT schemes supported are: ``` fp8-fp8, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-int4, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** ``` from unsloth import FastLanguageModel lora_rank = 32 model, tokenizer = FastLanguageModel.from_pretrained( model_name = "unsloth/Qwen3-4B-Base", max_seq_length = 2048, load_in_4bit = False, fast_inference = False, max_lora_rank = lora_rank, ) model = FastLanguageModel.get_peft_model( model, r = lora_rank, target_modules = [ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", ], lora_alpha = lora_rank*2, use_gradient_checkpointing = "unsloth", random_state = 3407, qat_scheme = "fp8-fp8", ) lora.Linear( (base_layer): FakeQuantizedLinear( in_features=2560, out_features=4096, bias=False (activation_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None)) (weight_fake_quantizer): FakeQuantizer(Float8FakeQuantizeConfig(dtype=torch.float8_e4m3fn, granularity=PerRow(), hp_value_lb=None, hp_value_ub=None)) ) ... ) ```
abiswas-realadvice
pushed a commit
to abiswas-realadvice/unsloth
that referenced
this pull request
May 14, 2026
**Summary:** Following unslothai#2976, which adds support for QAT + LoRA, this PR adds support for QAT during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md) for more details. Current QAT schemes supported are: ``` fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel ``` **Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch: - Batch size = 16 (no grad accum) - Learning rate = 4e-5 - Quantization scheme = fp8-int4 Wikitext perplexity: - QAT improved perplexity by 19.2% compared to regular fine-tuning - QAT's int4 quantized model even outperformed the bf16 baseline - Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline ``` ==> unsloth_model_full_baseline_output/eval_float.log <== | | |none | 0|word_perplexity|↓ |9.8446|± | N/A| ==> unsloth_model_full_baseline_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |11.4595|± | N/A| ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== | | |none | 0|word_perplexity|↓ |9.2336|± | N/A| ``` Fibonacci test: - Both bf16 baseline and int4 quantized models correctly identified 13 as the next number - QAT quantized model was more succinct in its response - No substantial differences here ``` ### Instruction: Continue the fibonnaci sequence. ### Input: 1, 1, 2, 3, 5, 8 ==> unsloth_model_full_baseline_output/eval_float.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_baseline_output/eval_quantized.log <== ### Response: The next number in the Fibonacci sequence is 13.<|end_of_text|> ==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <== ### Response: 13<|end_of_text|> ```
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary: Quantization-aware training (QAT) helps mitigate quantization degradation by simulating quantization numerics in high precision during training (fake quantization). This PR combines QAT with LoRA by applying torchao's QAT support to the peft model.
See the following for more details:
Current QAT schemes supported are:
Test Plan: