
SpinQuant #983

Merged: 18 commits merged into pytorch:main on Oct 10, 2024
Conversation

@tobiasvanderwerff (Contributor) commented Oct 1, 2024

Corresponding issue: #579

This PR adds SpinQuant integration to pytorch/ao. See the paper for details: https://arxiv.org/abs/2405.16406.

Results on LLaMA are shown below, measured by WikiText word perplexity (lower is better).

| Model | Quantization | Baseline | R4 | R2+R4 | R1+R2+R4 | R1+R2+R4 (pt) | R2+R4 (pt) |
|---|---|---|---|---|---|---|---|
| Llama-2-7B | None (bfloat16) | 12.23 | 12.24 | 12.24 | 12.24 | | |
| | int8dq | 12.35 | 12.35 | 12.35 | | | |
| | int4wo-32 | 12.68 | 12.58 | 12.60 | 13.65 | 13.49 | 12.64 |
| | int4wo-64 | 12.87 | 12.82 | 12.80 | | | |
| | int4wo-64-marlin | 12.87 | 12.82 | 13.65 | | | |
| | uintx-4-32 | 12.81 | 12.53 | 13.63 | | | |
| | uintx-4-64 | 12.89 | 12.80 | | | | |
| | uintx-2-8 | 211 | | | | | |
| Llama-3-8B | None (bfloat16) | 7.44 | 7.44 | 7.44 | | | |
| | int4wo-32 | 8.11 | 8.06 | 8.54 | | | |
| | uintx-4-64 | 8.11 | 8.31 | 9.00 | | | |

For R1 and R2, random Hadamard matrices are used, unless (pt) is present, in which case I use the pretrained weights provided by the SpinQuant authors.
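For reference, a random Hadamard rotation can be built by taking a power-of-two-sized Hadamard matrix and flipping the signs of its rows at random. The snippet below is a minimal sketch of that idea (the function name and construction are illustrative, not the exact code in this PR):

```python
import torch

def random_hadamard_matrix(size: int, seed: int = 0) -> torch.Tensor:
    """Sylvester-construction Hadamard matrix with random row sign flips.

    `size` must be a power of two. The result is orthogonal, so inserting it
    (and its transpose) around a pair of weights leaves the full-precision
    model output unchanged.
    """
    gen = torch.Generator().manual_seed(seed)
    H = torch.ones(1, 1, dtype=torch.float64)
    while H.shape[0] < size:
        H = torch.cat(
            [torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0
        )
    signs = (torch.randint(0, 2, (size,), generator=gen) * 2 - 1).to(torch.float64)
    return (signs.unsqueeze(1) * H) / size**0.5
```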

TODO

  • implement R2
  • implement R4
  • implement layernorm weight fusion into linear layers (footnote 3 in the paper; see the sketch after this list)
  • implement R1
  • implement R3
  • Cayley optimization for R1 and R2 (not sure how feasible this is for inference -- the authors report ~1 hour on 8x A100 GPUs to run Cayley optimization for R1 and R2, using 800 samples of the WikiText2 calibration dataset)
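Regarding the layernorm fusion item above: for pre-norm models like LLaMA, the per-channel norm scale can be folded into the linear layers that consume the norm output, which makes the residual stream invariant to an orthogonal rotation. A rough sketch of what that fusion could look like, assuming the norm module holds a per-channel `weight` (this is illustrative, not the exact torchao code):

```python
import torch

@torch.no_grad()
def fuse_norm_into_linears(norm: torch.nn.Module, linears: list[torch.nn.Linear]) -> None:
    """Fold a per-channel norm scale into the linear layers that consume it.

    After this, `linear(norm(x))` is unchanged, but the norm itself becomes
    scale-free, so a rotation can be applied to the residual stream safely.
    """
    for linear in linears:
        # weight has shape (out_features, in_features); the norm scale multiplies
        # the input channels, i.e. the columns of the weight matrix.
        linear.weight.mul_(norm.weight.to(linear.weight.dtype))
    norm.weight.fill_(1.0)
```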

pytorch-bot (bot) commented Oct 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/983

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fb3882f with merge base 107e378:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 1, 2024
@tobiasvanderwerff tobiasvanderwerff marked this pull request as draft October 2, 2024 09:14
@HDCharles (Contributor)

Hey, this is looking nice so far. Long term we probably want to make these tensor subclasses so that serialization is easier; that way, rather than having to do load model -> convert model -> load checkpoint, you can just do load model -> load checkpoint.

Not absolutely critical, but long term it looks like there may be multiple use cases/APIs for SpinQuant, one explicitly for the Cayley QAT and one not, and unifying them based on serialization will make composability much nicer.

@tobiasvanderwerff (Contributor, Author)

Good to know @HDCharles, I'll keep the tensor subclasses in mind. I was wondering, will the choice to integrate this into torchao depend on the performance delta it produces? Currently, there is some Wikitext perf improvement but it's perhaps not that significant.

@tobiasvanderwerff (Contributor, Author) commented Oct 3, 2024

Update: I'm currently somewhat stuck on this PR. The R2 and R4 matrices are both implemented and show small perplexity improvements for int4wo-64 quantization (not much though, see the table above). I've tried to follow the reference SpinQuant implementation as closely as possible, but these are the best results I can achieve so far (and not quite as good as the results in the paper). What still remains is the R3 rotation and R1 with Cayley optimization.

The R3 rotation is a bit tricky to implement because it requires a modification of Attention.forward() in the middle of the function, after the apply_rotary_emb calls:

```python
def forward(self, x: Tensor, freqs_cis: Tensor, mask: Optional[Tensor], input_pos: Optional[Tensor] = None) -> Tensor:
    bsz, seqlen, _ = x.shape
    kv_size = self.n_local_heads * self.head_dim
    q, k, v = self.wqkv(x).split([self.dim, kv_size, kv_size], dim=-1)
    q = q.view(bsz, seqlen, self.n_head, self.head_dim)
    k = k.view(bsz, seqlen, self.n_local_heads, self.head_dim)
    v = v.view(bsz, seqlen, self.n_local_heads, self.head_dim)
    q = apply_rotary_emb(q, freqs_cis)
    k = apply_rotary_emb(k, freqs_cis)
    ...
```

In the SpinQuant repo they use a monkeypatch solution, but the code becomes a bit ugly in that case. At the same time, the paper shows that R3 has a minimal effect on performance (Table 3), so I'm not sure how much it's worth implementing.
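For illustration, applying R3 at that point would look roughly like the snippet below; `hadamard_transform` stands in for whatever rotation helper ends up being used, so this is a hypothetical sketch rather than the PR's code:

```python
    # (hypothetical continuation of Attention.forward; hadamard_transform is an
    # assumed helper acting on the last (head_dim) axis)
    q = apply_rotary_emb(q, freqs_cis)
    k = apply_rotary_emb(k, freqs_cis)
    # R3: rotate the head dimension of q and k with the same orthogonal (Hadamard)
    # matrix. Since q @ k^T is invariant to a shared rotation, attention scores are
    # unchanged in exact arithmetic, while the KV cache quantizes better.
    q = hadamard_transform(q)
    k = hadamard_transform(k)
```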

Lastly, I have not added the R1 matrices, which would require adding a Cayley optimization procedure. Currently, the SpinQuant changes are immediately applicable at inference time, but running Cayley optimization would take some time to complete (the authors report ~1 hour on 8x A100 GPUs for R1 and R2, using 800 samples of the WikiText2 calibration dataset). It could also be possible to train these matrices once for a model like Llama-7B and include them as add-on weights.
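For context on what the Cayley optimization involves, a single Cayley-SGD-style update that keeps a rotation matrix orthogonal looks roughly like this (a sketch of the general technique, not the SpinQuant authors' exact optimizer):

```python
import torch

def cayley_sgd_step(R: torch.Tensor, grad: torch.Tensor, lr: float = 0.05) -> torch.Tensor:
    """One Cayley-transform retraction step that keeps R orthogonal."""
    A = grad @ R.T - R @ grad.T  # skew-symmetric update direction from the gradient
    I = torch.eye(R.shape[0], dtype=R.dtype, device=R.device)
    # (I + lr/2 * A)^-1 (I - lr/2 * A) is orthogonal whenever A is skew-symmetric,
    # so the updated R stays on the orthogonal manifold (up to numerical error).
    return torch.linalg.solve(I + (lr / 2) * A, (I - (lr / 2) * A) @ R)
```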

I would very much appreciate some feedback on how to proceed with this.

@tobiasvanderwerff (Contributor, Author) commented Oct 3, 2024

I have unblocked myself somewhat regarding the R1 rotation matrices: the authors provide downloads for the optimized R1/R2 weights. I could try these out to see what kind of performance difference to expect before implementing the Cayley optimization here. My only concern is that their Llama implementation might not be 100% identical to the one in torchao, which could mean the R1 weights don't transfer as well, but it seems worth trying anyway.
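A quick sanity check on the downloaded matrices could be something like the following (the file name and key are hypothetical; the point is just to confirm the matrices are orthogonal before fusing them into the weights):

```python
import torch

R1 = torch.load("R1.bin")["R1"].to(torch.float64)  # hypothetical file name and key
I = torch.eye(R1.shape[0], dtype=R1.dtype)
assert torch.allclose(R1 @ R1.T, I, atol=1e-6), "R1 is not orthogonal"
```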

@HDCharles (Contributor) commented Oct 3, 2024

I think we can merge it and continue working on it regardless; accuracy improvements are definitely a good metric to see how useful it is, though. Even in their paper, for 4-16-16 the improvement of SpinQuant is pretty small, even with Cayley optimization. It's mostly 4-4-16 where it starts to outperform other methods by a significant margin. We're working on getting some kernels for that in the next 1-2 weeks, so it may be more useful for that use case. For now I'd do accuracy benchmarks on groupsize=32 rather than 64/128, since that's the minimum group size.

Yeah, the monkeypatch is pretty messy; it feels like we can do this in a better way with tensor subclasses or something else.

@tobiasvanderwerff tobiasvanderwerff marked this pull request as ready for review October 4, 2024 18:15
@HDCharles HDCharles self-requested a review October 4, 2024 20:20
@HDCharles (Contributor) left a comment

I think this looks good; it seems like a lot of value will be added once activation quantization is used.

It would be good to add groupsize-32 numbers, uintx 2-bit numbers, and Llama-3 numbers to the PR description if you have them.

@tobiasvanderwerff (Contributor, Author)

I'll do a final reformat and add some more results in the next few days @HDCharles

@andrewor14 (Contributor)

Hi @tobiasvanderwerff, do you mind reformatting hadamard_utils.py so we don't end up with a 10k-line file? I feel like you could even move the matrices into a separate file like _hadamard_matrices.py, so it's easier to review the other parts of hadamard_utils.py.

@tobiasvanderwerff tobiasvanderwerff force-pushed the spinquant branch 2 times, most recently from 0bf6a76 to b6ae688 Compare October 9, 2024 08:24
@tobiasvanderwerff tobiasvanderwerff changed the title [wip] SpinQuant SpinQuant Oct 9, 2024
@HDCharles HDCharles merged commit 590f8fb into pytorch:main Oct 10, 2024
17 checks passed
@tobiasvanderwerff tobiasvanderwerff deleted the spinquant branch October 10, 2024 19:14
@wat3rBro

> Hi @tobiasvanderwerff, do you mind reformatting hadamard_utils.py so we don't end up with a 10k-line file? I feel like you could even move the matrices into a separate file like _hadamard_matrices.py, so it's easier to review the other parts of hadamard_utils.py.

Hi @tobiasvanderwerff @andrewor14, could you use this implementation instead: https://fburl.com/code/d3nuagm4? It's faster and much easier to read.

@HDCharles (Contributor)

It's faster? Do you have a link to benchmarks?

@wat3rBro

> It's faster? Do you have a link to benchmarks?

The benchmark is in the summary of D61891002.

@yiliu30 (Contributor) commented Oct 11, 2024

Hi @tobiasvanderwerff, great work! I was wondering if you tested the end-to-end generation performance (tokens/s)?

@tobiasvanderwerff (Contributor, Author)

I have not tested tokens/s generation @yiliu30, but I can test this if you want.

@yiliu30 (Contributor) commented Oct 14, 2024

> I have not tested tokens/s generation @yiliu30, but I can test this if you want.

Thank you, @tobiasvanderwerff! I'm primarily interested in studying the computational overhead introduced by R4, and I was wondering if the hadamard_transform might break torch.compile.

@tobiasvanderwerff (Contributor, Author)

Thanks for bringing this up @yiliu30 -- I tested this and it looks like the custom Hadamard transform kernel indeed breaks torch.compile. I'll investigate this and get back to you.
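For reference, one general way to make a third-party CUDA kernel like this compose with torch.compile is to register it as a custom op with a fake (meta) implementation, so the compiler can trace through it without graph breaks. The sketch below shows that general approach on recent PyTorch (2.4+); it is not necessarily the fix used in this PR:

```python
import torch
from fast_hadamard_transform import hadamard_transform  # third-party CUDA kernel

@torch.library.custom_op("spinquant::hadamard", mutates_args=())
def hadamard(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Wrap the opaque kernel so torch.compile treats it as a single traceable op.
    return hadamard_transform(x, scale)

@hadamard.register_fake
def _(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Shape/dtype propagation for tracing; no real computation happens here.
    return torch.empty_like(x)
```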

jainapurva pushed a commit that referenced this pull request Oct 15, 2024
* SpinQuant using R2 matrices

* Move Hadamard functions and matrices to separate file

* Add R4 rotation

* Reformat

* Do not wrap Linear layers but use nn.Sequential

Wrapping the Linear layers might mess with the quantization of the linear layers, so it's probably better to keep the linear layers the same and insert new layers alongside them

* Add test

* Fix test and do small reformat of Hadamard code

* Fuse Layernorm params into linear layers

This is done for pre-norm LLMs like LLaMa to make them scale-invariant (see footnote 3 in the paper). However, in the current implementation it seems to hurt performance when quantization is used.

* Add R1 rotation

* Add option to load pretrained R1/R2 matrices

* Move Spinquant from `torchao/quantization` to `torchao/prototype/spinquant`

* Move Hadamard matrices to a separate file

* Move test

* Minor changes

* Reformat

* Only enable R4 as default setting

Random R1 and R2 matrices are showing worse results than just using R4, so the latter seems to be a better default option (at least for now).

* Add __init__.py to spinquant folder

* Do not fail if fast_hadamard_transform is not present
@tobiasvanderwerff (Contributor, Author)

@yiliu30 FYI I fixed the issue with torch.compile -- you can see the benchmark results here.

yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
* Move export_aoti into export + minor tidyness

* Lint

* Remove mismatched arg