Initial fused `GPTQ` implementation by jeromeku · Pull Request #141 · unslothai/unsloth

jeromeku · 2024-01-29T06:31:50Z

GPTQ Peft Fine-tuning

GPTQ fast_lora

Adds fast_lora implementation for peft fine-tuning of GPTQ quantized models.

Following the methodology of the existing bitsandbytes fast_lora custom autograd, fuses the triton quant / dequant matmul kernels from auto_gptq with LoRA adapters into a custom torch.autograd.Function (see unsloth/gptq/fast_lora.py).
Default Huggingface GPTQ peft fine-tuning uses the auto_gptq cuda QuantLinear layer, which in turn falls back to a torch-only implementation, since the custom cuda kernel employed by auto_gptq does not implement a backward pass.
The current implementation runs slower than the default Huggingface implementation.
Additional tuning / optimizations in the works.
See this issue for further profiling details.
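The pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code in unsloth/gptq/fast_lora.py: the `dequantize` helper, the simplified per-column scale / zero layout, the class name `LoRAQuantLinear`, and the argument order are all assumptions made for the example (the real kernels work on packed int32 weights with group-wise parameters).

```python
import torch

def dequantize(q, scale, zero):
    # Hypothetical stand-in for a dequant kernel: q holds integer codes,
    # scale / zero are per-output-column parameters (simplified layout).
    return (q.to(scale.dtype) - zero) * scale

class LoRAQuantLinear(torch.autograd.Function):
    """Frozen quantized base weight plus trainable LoRA adapters A and B."""

    @staticmethod
    def forward(ctx, X, q, scale, zero, A, B, s):
        W = dequantize(q, scale, zero)        # (in_features, out_features)
        out = X @ W + s * (X @ A) @ B         # base path + LoRA path
        # Save the quantized codes, not W, so the dense weight is not kept alive.
        ctx.save_for_backward(X, q, scale, zero, A, B)
        ctx.s = s
        return out

    @staticmethod
    def backward(ctx, dY):
        X, q, scale, zero, A, B = ctx.saved_tensors
        s = ctx.s
        W = dequantize(q, scale, zero)        # recomputed for the backward pass
        dX = dY @ W.t() + s * (dY @ B.t()) @ A.t()
        dA = s * X.t() @ (dY @ B.t())
        dB = s * (X @ A).t() @ dY
        # No gradients for the quantized weight, its scale / zero, or the scalar s.
        return dX, None, None, None, dA, dB, None
```

Only X, A and B receive gradients here; the base weight stays frozen, which is why the backward pass needs only a transposed matmul against the dequantized weight.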

Profiling

Also includes a profiling / benchmarking script for comparing unsloth models with huggingface models.
See benchmarks/Profiling.MD for documentation.

danielhanchen · 2024-01-29T06:51:55Z

@jeromeku Oh my this is a LARGE PR!!!! I'll take a read through it today :)

danielhanchen · 2024-01-29T14:40:31Z

Ohh now I understand why you added matmul triton kernels that are merged with the dequantization, rather than a separate dequantize kernel followed by a matmul, ie:

out = dequantize_and_matmul(X, W)

vs

W = dequantize(W)
out = torch.matmul(X, W)

I took a look through GPTQ's repo, and yes I cannot find any standalone dequantization kernel, whether written in Triton or not.

To attain maximal performance, technically that means including only the GPTQ dequantize kernel, ie without the matrix multiplies inside the Triton kernel, since those can screw with the compiler.
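To make the two-step variant concrete, here is a pure-Python sketch of a standalone dequantize followed by an ordinary matmul. It assumes the common 4-bit layout in which eight codes are packed into one 32-bit word and a weight is recovered as scale * (code - zero); group-wise scales, the exact zero-point convention used by auto_gptq, and all function names are simplifications for illustration.

```python
def pack_int4(codes):
    """Pack eight 4-bit codes (0..15) into one 32-bit word, lowest nibble first."""
    assert len(codes) == 8
    word = 0
    for i, c in enumerate(codes):
        word |= (c & 0xF) << (4 * i)
    return word

def unpack_int4(word):
    """Inverse of pack_int4: recover the eight 4-bit codes."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def dequantize(packed_column, scale, zero):
    """Turn the packed words of one output column into float weights."""
    codes = [c for word in packed_column for c in unpack_int4(word)]
    return [scale * (c - zero) for c in codes]

def matmul_row(x, columns):
    """x: one input row; columns: list of dequantized weight columns."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in columns]

# Two output columns with eight input features each (toy values).
packed = [
    [pack_int4([8, 9, 7, 8, 10, 6, 8, 8])],
    [pack_int4([0, 15, 8, 8, 8, 8, 8, 8])],
]
scales, zeros = [0.5, 0.25], [8, 8]

# Step 1: dequantize on its own. Step 2: a plain matmul on the result.
W_cols = [dequantize(p, s, z) for p, s, z in zip(packed, scales, zeros)]
x = [1.0] * 8
out = matmul_row(x, W_cols)
print(out)
```

Keeping step 1 separate means step 2 can be handed to whatever matmul the framework already optimizes, instead of reimplementing it inside the dequant kernel.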

I'll see what I can do if I have some more bandwidth - sadly I don't have too much knowledge about GPTQ so I'll have to dive into their papers a bit on how their dequantization even works :)

Great work so far @jeromeku and thanks so much for trying to add GPTQ!

jeromeku · 2024-01-29T16:22:16Z

@danielhanchen
I have a pretty good handle on the situation -- will try to strip out the dequant part (in addition to some other optimizations).

danielhanchen · 2024-01-29T16:54:17Z

@jeromeku Ok cool!! :)

jeromeku · 2024-01-30T06:55:28Z

@danielhanchen

Stripped out the dequant portion of the fused dequant matmul and did some quick benchmarking of the default QuantLinear forward used by huggingface gptq vs. a torch.compile'd dequant + torch.mm.

Promising early results (forward only):

    seqlen  reference_gptq_quantlinear  torch.compile(dequant+mm)
0     32.0                    2.581504                   0.406528
1     64.0                    2.563072                   0.407552
2    128.0                    2.591728                   0.430080
3    256.0                    2.689024                   0.502784
4    512.0                    2.971648                   0.780288
5   1024.0                    3.467648                   1.236992
6   2048.0                    4.403200                   2.150400
7   4096.0                    6.563840                   4.184480
8   8192.0                   10.655744                   8.019968
9  16384.0                   19.193855                  15.764481

These are median times (ms) for various sequence lengths.
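For a quick read of the table, the per-row speedup can be computed directly from the reported medians. The snippet below only restates the numbers above; the variable names are illustrative.

```python
# seqlen -> (reference_gptq_quantlinear ms, torch.compile(dequant+mm) ms),
# copied from the forward-only table above.
timings = {
    32: (2.581504, 0.406528),
    64: (2.563072, 0.407552),
    128: (2.591728, 0.430080),
    256: (2.689024, 0.502784),
    512: (2.971648, 0.780288),
    1024: (3.467648, 1.236992),
    2048: (4.403200, 2.150400),
    4096: (6.563840, 4.184480),
    8192: (10.655744, 8.019968),
    16384: (19.193855, 15.764481),
}

# Speedup of the compiled path over the reference, per sequence length.
speedups = {n: ref / compiled for n, (ref, compiled) in timings.items()}

for n, s in speedups.items():
    print(f"seqlen={n:>6}  speedup={s:.2f}x")
```

The compiled path is faster at every sequence length measured, with the ratio falling from roughly 6x at the shortest sequences to roughly 1.2x at 16384.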

However, running both forward and backward degrades the performance of the compiled version vs ref, which is confusing since the backwards graph is just a transposed matmul. Needs further investigation.

Interestingly, the triton kernel that gets codegen'ed for the dequant forward part is at least as efficient as, if not more efficient than, the hand-written dequant portion of the previous triton kernel.

danielhanchen · 2024-01-30T16:43:04Z

@jeromeku Cool great work again! Ye it definitely looks like torch.compile is destroying the hand written GPTQ kernel inside HF's codebase loll! Ye the backwards is transpose - but I'm assuming it's cause the strides are reversed, causing a performance hit - just my speculation.

jeromeku · 2024-02-03T00:01:43Z

@danielhanchen

Good news -- refactored the fast_lora implementation with a new triton kernel that does dequant separately from matmul (previous impl was an adapted version of the fused dequant matmul kernel from auto_gptq).

Performance now is on par with fast_lora bnb for llama and mistral models.

Will run some additional tests / benchmarks and PR should be ready for review.

Trainer results after 20 steps on guanaco for llama-{gptq,bnb} 4-bit:

hf-gptq

{
  "train_runtime": 113.4277,
  "train_samples_per_second": 1.411,
  "train_steps_per_second": 0.176,
  "train_loss": 1.3709101617336272,
  "epoch": 0.02
}

unsloth-gptq-triton

{
  "train_runtime": 69.5648,
  "train_samples_per_second": 2.3,
  "train_steps_per_second": 0.288,
  "train_loss": 1.3829106092453003,
  "epoch": 0.02
}

unsloth-bnb

{
  "train_runtime": 63.8765,
  "train_samples_per_second": 2.505,
  "train_steps_per_second": 0.313,
  "train_loss": 1.3803951740264893,
  "epoch": 0.02
}
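The three runs above can be compared by dividing the baseline runtime by each run's runtime. This is plain arithmetic on the reported train_runtime values; the dictionary keys mirror the labels used above.

```python
# train_runtime in seconds for 20 steps, copied from the trainer results above.
runtimes = {
    "hf-gptq": 113.4277,
    "unsloth-gptq-triton": 69.5648,
    "unsloth-bnb": 63.8765,
}

baseline = runtimes["hf-gptq"]
# Values above 1.0 mean the run finished faster than the hf-gptq baseline.
relative = {name: baseline / t for name, t in runtimes.items()}

for name, r in relative.items():
    print(f"{name:<22} {runtimes[name]:>9.4f} s  {r:.2f}x vs hf-gptq")

# How much slower the GPTQ triton path is than the bnb path on this run.
gap = runtimes["unsloth-gptq-triton"] / runtimes["unsloth-bnb"] - 1.0
print(f"unsloth-gptq-triton is {gap:.1%} slower than unsloth-bnb")
```

On these numbers the triton GPTQ path is about 1.6x faster than hf-gptq and within roughly 9% of unsloth-bnb, with train_loss nearly identical across the three runs.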

danielhanchen · 2024-02-03T02:09:36Z

@jeromeku Extremely extremely fabulous work!!! Now that is a fantastic performance boost from HF's GPTQ!! It looks like splitting the dequantization step and matmul did the trick!! Again super duper appreciate you adding GPTQ support into Unsloth - highly appreciate it :)

jeromeku · 2024-02-05T23:18:46Z

@danielhanchen

Cleaned up the dequant kernel.

Re-running the above benchmark (20 train steps on TheBloke/Llama-2-7b-GPTQ) gives the following:

{
  "train_runtime": 67.3811,
  "train_samples_per_second": 2.375,
  "train_steps_per_second": 0.297,
  "train_loss": 1.3829236447811126,
  "epoch": 0.02
}

To reproduce, run

python benchmark.py --model_name=llama --model_type=unsloth-gptq-triton --dtype=float16 --dataset_id=guanaco --output_dir=./bench_results

Set --model_type to hf-gptq-default or unsloth-bnb to benchmark the other two configurations.

See PROFILING.MD for more details on running the benchmark script.
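A small helper for generating the three invocations, assuming the same flags shown in the command above apply unchanged to every model type (that is an assumption; check PROFILING.MD for per-model options). The helper name `benchmark_command` is hypothetical and the snippet only prints the commands, it does not run them.

```python
import shlex

# Model types named in this thread.
MODEL_TYPES = ["hf-gptq-default", "unsloth-gptq-triton", "unsloth-bnb"]

def benchmark_command(model_type, output_dir="./bench_results"):
    """Build the benchmark.py argument list for one model type."""
    return [
        "python",
        "benchmark.py",
        "--model_name=llama",
        f"--model_type={model_type}",
        "--dtype=float16",
        "--dataset_id=guanaco",
        f"--output_dir={output_dir}",
    ]

for model_type in MODEL_TYPES:
    # shlex.join quotes arguments safely for pasting into a shell.
    print(shlex.join(benchmark_command(model_type)))
```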

danielhanchen · 2024-02-06T03:37:25Z

@jeromeku Super duper great work again! I will take a look later today! Thanks so much for your contribution again!

danielhanchen · 2024-02-07T16:41:16Z

@jeromeku Hey sorry on the delay! Extreme apologies again didn't have time to take a look :( I will do so asap in the next few days! Sorry again, and super great work again! :)


danielhanchen added 30 commits November 30, 2023 03:50

Initial commit

44419b9

First upload of Unsloth code

d45e40a

Update README.md

548c518

requirements

18cc444

Merge branch 'main' of https://github.com/unslothai/unsloth

feb950a

Update pyproject.toml

0b14f48

Update pyproject.toml

47b0964

Update pyproject.toml

1a505d3

Update pyproject.toml

bac8c4f

dependencies

54d6cbd

requirements

94843dd

requirements

fff760c

Update pyproject.toml

67f711d

Update pyproject.toml

dbfb5e7

Update pyproject.toml

8e3f94f

Update pyproject.toml

68af199

requirements

ff84ba2

Torch 2.1

7166dce

Update __init__.py

47569be

Update __init__.py

f9b26ed

Update __init__.py

169608b

Update __init__.py

6552b88

Update __init__.py

7c400ad

Update pyproject.toml

21e8355

Update pyproject.toml

6c075d1

Update pyproject.toml

0349faa

Update pyproject.toml

e1d0265

Update fast_lora.py

581235b

Update __init__.py

03766c2

Update __init__.py

3197ea0

jeromeku added 3 commits January 29, 2024 04:25

minor edit to doc

13c3998

add more documentation

1098520

rename tests to benchmarks

bf32790

jeromeku mentioned this pull request Jan 29, 2024

[Feature request] Support GPTQ quantization #39

Open

jeromeku added 3 commits February 5, 2024 21:39

add new dequant kernel

1ad362a

remove unneeded dequant params

ba941dd

add more documentation

2839d39

jeromeku mentioned this pull request Feb 5, 2024

Tritonv2: faster triton dequant kernel and refactored QuantLinear AutoGPTQ/AutoGPTQ#530

Closed

jeromeku marked this pull request as ready for review March 7, 2024 02:28

achew010 mentioned this pull request Apr 5, 2024

[BUG] GPTQ Kernels dont work with PEFT AutoGPTQ/AutoGPTQ#633

Open

rolandtannous pushed a commit to rolandtannous/unsloth that referenced this pull request Mar 11, 2026

Merge pull request unslothai#141 from unslothai/feature/uxui-heuristics

b4a8e48

style: improve layout consistency and responsiveness across components

danielhanchen closed this Mar 12, 2026

danielhanchen force-pushed the main branch from 997f1a7 to 0099fff Compare March 12, 2026 05:34

rolandtannous pushed a commit that referenced this pull request Mar 12, 2026

Merge pull request #141 from unslothai/feature/uxui-heuristics

492e3c3

style: improve layout consistency and responsiveness across components

Stanley00 pushed a commit to stanley-fork/unsloth that referenced this pull request Mar 12, 2026

Merge pull request unslothai#141 from unslothai/feature/uxui-heuristics

43fac18

style: improve layout consistency and responsiveness across components

danielhanchen mentioned this pull request May 23, 2026

ci: unblock Studio Windows + Linux + Mac smoke (supersedes #5733, #5734, #5738) #5741

Merged

jeromeku wants to merge 144 commits into unslothai:main from jeromeku:gptq-draft


4 participants
