
Add SpinQuant to generate.py #1069

Merged: 7 commits merged into pytorch:main from spinquant-mods on Oct 22, 2024

Conversation

tobiasvanderwerff
Contributor

  • Add SpinQuant to torchao/_models/llama/generate.py
  • Only import SpinQuant when necessary in eval.py and generate.py (No need to import the large Hadamard matrices required for SpinQuant otherwise)


pytorch-bot bot commented Oct 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1069

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1543e4f with merge base e7b33bc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 14, 2024
@jerryzh168
Contributor

thanks, any results we can show?

@tobiasvanderwerff
Contributor Author

@jerryzh168 I'm first fixing a torch.compile issue related to the Hadamard transform used in SpinQuant; after that, I'll post some benchmark results here. If you want, we can keep this PR open and I'll push the changes here.
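For context, SpinQuant's R4 rotation applies a Hadamard transform to activations at runtime. A minimal reference implementation of the orthonormal fast Walsh-Hadamard transform might look like the sketch below (this is an illustration, not torchao's actual kernel, which additionally has to play well with torch.compile):

```python
import torch

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform over the last dimension.

    Naive O(n log n) butterfly; the last dimension must be a power of two.
    """
    n = x.shape[-1]
    assert n & (n - 1) == 0, "last dimension must be a power of two"
    y = x.clone()
    h = 1
    while h < n:
        # Group the last dimension into pairs of blocks of size h and
        # apply the butterfly: (a, b) -> (a + b, a - b).
        y = y.view(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]
        b = y[..., 0, :] - y[..., 1, :]
        y = torch.stack((a, b), dim=-2).reshape(*x.shape)
        h *= 2
    # Scale by 1/sqrt(n) so the transform is orthonormal (norm-preserving),
    # which is what makes it usable as a rotation before quantization.
    return y / n ** 0.5
```

Because the Sylvester-ordered Hadamard matrix is symmetric and orthonormal here, the transform is its own inverse, and it preserves vector norms, which is the property that makes it a valid rotation for quantization.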

@tobiasvanderwerff
Contributor Author

tobiasvanderwerff commented Oct 15, 2024

SpinQuant now also works with torch.compile. Benchmark results (llama-2-7b, tested on an A100):

Baseline + torch.compile

Average tokens/sec: 114.08
Average Bandwidth: 1507.58 GB/s
Peak Memory Usage: 13.88 GB
Model Size: 13.21 GB

Spinquant (R4) + torch.compile

Average tokens/sec: 109.59
Average Bandwidth: 1448.61 GB/s
Peak Memory Usage: 13.72 GB
Model Size: 13.22 GB

Spinquant (R1+R2+R4) + torch.compile

NB: R1 and R2 are fused into the linear weights before inference takes place, so it is expected that they do not lead to additional overhead at inference time.

Average tokens/sec: 109.64
Average Bandwidth: 1449.28 GB/s
Peak Memory Usage: 14.90 GB
Model Size: 13.22 GB
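The zero-overhead claim for R1/R2 can be sanity-checked in isolation: fusing an orthogonal rotation into a linear layer's weight and counter-rotating the input leaves the output unchanged, so nothing extra runs at inference time. A minimal sketch (hypothetical illustration, not torchao's actual SpinQuant code):

```python
import torch

torch.manual_seed(0)
d = 16
linear = torch.nn.Linear(d, d, bias=False)

# Stand-in for a learned SpinQuant rotation: any orthogonal matrix works here.
R, _ = torch.linalg.qr(torch.randn(d, d))

# Fuse the rotation into the weight ahead of time: W' = W @ R.
fused = torch.nn.Linear(d, d, bias=False)
with torch.no_grad():
    fused.weight.copy_(linear.weight @ R)

# Feeding the counter-rotated input through the fused layer reproduces the
# original output exactly, so the rotation adds no inference-time compute.
x = torch.randn(4, d)
y_ref = linear(x)
y_fused = fused(x @ R)
assert torch.allclose(y_ref, y_fused, atol=1e-5)
```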

@tobiasvanderwerff tobiasvanderwerff mentioned this pull request Oct 15, 2024
@tobiasvanderwerff
Contributor Author

Results without torch.compile:

Baseline

Average tokens/sec: 27.33
Average Bandwidth: 361.21 GB/s
Peak Memory Usage: 13.62 GB
Model Size: 13.21 GB

Spinquant (R4)

Average tokens/sec: 23.01
Average Bandwidth: 304.10 GB/s
Peak Memory Usage: 14.24 GB
Model Size: 13.22 GB
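As a rough sanity check on these numbers: for memory-bound decoding, bandwidth is approximately model size times decode throughput, since each generated token streams the full weights from memory (assuming the benchmark script derives it that way):

```python
# Baseline numbers from the run above (no torch.compile).
model_size_gb = 13.21
tokens_per_sec = 27.33

# Each decoded token reads the whole model once from memory.
approx_bandwidth = model_size_gb * tokens_per_sec
print(f"{approx_bandwidth:.1f} GB/s")  # ~361.0, close to the reported 361.21 GB/s
```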

@yiliu30
Contributor

yiliu30 commented Oct 15, 2024

> SpinQuant now also works with torch.compile. Benchmark results (tested on an A100):
>
> Baseline + torch.compile
>
> Average tokens/sec: 114.31
> Average Bandwidth: 1510.58 GB/s
> Peak Memory Usage: 13.88 GB
> Model Size: 13.21 GB
>
> Spinquant (R4) + torch.compile
>
> Average tokens/sec: 109.00
> Average Bandwidth: 1440.76 GB/s
> Peak Memory Usage: 13.98 GB
> Model Size: 13.22 GB

Thanks @tobiasvanderwerff, may I know which model you tested on, llama-2-7b?

@tobiasvanderwerff
Contributor Author

Yep, llama-2-7b, I'll add that to the benchmark.

@HDCharles
Contributor

Can you add benchmark numbers for R1+R2 as well? I think R4 is only for activation quantization.

@HDCharles
Contributor

would be good to add this info into a readme file inside the spinquant dir

@jerryzh168
Contributor

ready to merge?

@tobiasvanderwerff
Contributor Author

Yep, this is ready @jerryzh168

@HDCharles HDCharles merged commit 3044ee5 into pytorch:main Oct 22, 2024
17 checks passed
@tobiasvanderwerff tobiasvanderwerff deleted the spinquant-mods branch October 22, 2024 19:40