Autoquant v2 initial version #1240

Merged: 15 commits into pytorch:main on Nov 21, 2024
Conversation

@jerryzh168 (Contributor) commented Nov 8, 2024

Summary:
We refactored v1 to benchmark subgraphs of (prev_op -> linear -> post_op) in order to get a more accurate timing estimate. One issue is that we now need to account for the batch size of the subgraph, so we would need the batch size dimension to use a symbolic shape, which does not seem to be well supported in torch.compile right now.
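
A minimal sketch of the symbolic batch dimension idea, assuming torch._dynamo.mark_dynamic and a made-up stand-in subgraph (this is not code from this PR):

```python
# Hedged sketch: ask torch.compile to treat the batch dimension of a
# benchmarked subgraph symbolically, so the compiled artifact is not
# specialized to one batch size. `subgraph` and `example_input` are
# hypothetical placeholders, not names from this PR.
import torch
import torch._dynamo

device = "cuda" if torch.cuda.is_available() else "cpu"

# stand-in for an extracted (prev_op -> linear -> post_op) subgraph
subgraph = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
).to(device)

example_input = torch.randn(8, 1024, device=device)

# dim 0 is the batch dimension; mark it dynamic instead of letting the
# compiler specialize on batch size 8
torch._dynamo.mark_dynamic(example_input, 0)

compiled = torch.compile(subgraph)
out = compiled(example_input)
# ideally a different batch size reuses the same compiled code rather than
# triggering a recompile; the PR notes this path is not well supported yet
out2 = compiled(torch.randn(16, 1024, device=device))
```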

Current Status:

  • tested with llama2 and sam
  • llama2 has the same result as autoquant v1, for both the default qtensor subclass list and the one that contains int4
  • sam gets some speedup over v1 because it picked an int8dyn layer, while autoquant v1 picks float for everything

More improvements:

  • the current batch size adjustment code is hardcoded to work for the llama model; we need to think of a way to generalize it
  • currently we use the GraphModule as the cache key and compare graph equality to avoid duplicated benchmarking effort; we have a naive graph equality check function, which should work reasonably well, but we could improve this with a subgraph matcher or by canonicalizing the graph (if such a tool exists); see the sketch after this list
  • add accuracy sanity checks
  • apply to more models
  • the fqn from named_modules does not match the extracted fqn (in the dynamo tracing stack) in some torchbench models, but we can fix this when it appears in the models we care about
  • I also heard from Animesh that modules are inlined by default now, and that we should rely on node.meta for tracking where nodes come from when extracting subgraphs; we can revisit this as well
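
For the graph-equality caching bullet above, here is a hedged sketch of what a naive canonical key over a torch.fx graph could look like; this is not the PR's implementation, and a real version would also need to encode how node inputs are wired:

```python
# Naive canonical key for an FX graph: the sequence of (op, target) pairs,
# with module targets replaced by their class name so differently named
# modules of the same type compare equal. Structurally identical subgraphs
# then map to the same benchmarking-cache entry.
import torch
import torch.fx


def naive_graph_key(gm: torch.fx.GraphModule) -> tuple:
    key = []
    for node in gm.graph.nodes:
        if node.op == "call_module":
            target = type(gm.get_submodule(node.target)).__name__
        else:
            target = getattr(node.target, "__name__", str(node.target))
        key.append((node.op, target))
    return tuple(key)


# usage: two traces of the same architecture hit the same cache entry
class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))


benchmark_cache = {}
gm1 = torch.fx.symbolic_trace(Block())
gm2 = torch.fx.symbolic_trace(Block())
benchmark_cache[naive_graph_key(gm1)] = "measured timing for gm1"
assert naive_graph_key(gm2) in benchmark_cache  # gm2 reuses gm1's result
```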

Test Plan:
Testing with torchao/_models/llama/generate.py

python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --quantization autoquant_v2-int4


| llama2 | autoquant_v1 | autoquant_v2 |
| --- | --- | --- |
| default qtensor list | Average tokens/sec: 172.59, Average Bandwidth: 1165.98 GB/s, Peak Memory Usage: 8.65 GB, Model Size: 6.76 GB (https://www.internalfb.com/phabricator/paste/view/P1680313475) | Average tokens/sec: 173.82, Average Bandwidth: 1174.28 GB/s, Peak Memory Usage: 9.61 GB, Model Size: 6.76 GB (https://www.internalfb.com/phabricator/paste/view/P1680309154) |
| int4 qtensor list | Average tokens/sec: 208.69, Average Bandwidth: 807.64 GB/s, Peak Memory Usage: 4.53 GB, Model Size: 3.87 GB (https://www.internalfb.com/phabricator/paste/view/P1680316118) | Average tokens/sec: 209.08, Average Bandwidth: 809.15 GB/s, Peak Memory Usage: 5.09 GB, Model Size: 3.87 GB (https://www.internalfb.com/phabricator/paste/view/P1680296091) |

| sam - image_encoder | base | autoquant_v1 (all float) | autoquant_v2 (one layer picked int8dyn) |
| --- | --- | --- | --- |
| default qtensor list | 23.18455616 | 23.09307945 | 24.19601284 |

Raw benchmark rows (sam):

  • base: cuda,vit_h,32,13678,16,23.18455616196341,43.132160607870524,0.5811261131824416,max-autotune,torch.bfloat16,None,False,True,True,32,154,4928,None,None
  • autoquant_v1: cuda,vit_h,32,13681,16,23.093079445154853,43.30301648920233,0.5811880744669748,max-autotune,torch.bfloat16,autoquant,False,True,True,32,154,4928,None,None
  • autoquant_v2: cuda,vit_h,32,56597,69,24.196012837520133,41.32912338554085,0.5827170017253231,max-autotune,torch.bfloat16,autoquant_v2,False,True,True,32,154,4928,None,None

Reviewers:

Subscribers:

Tasks:

Tags:

pytorch-bot (bot) commented Nov 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1240

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 0019456 with merge base d224653:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the `CLA Signed` label on Nov 8, 2024
Diff context for the inline review comment below:

    torch.nn.Linear(*new_shape, dtype=weight_val.dtype),
    ).cuda()

    else:
A reviewer (Contributor) commented:
this file has some complexity for extracting (prev_op -> linear1 -> maybe_linear_2 -> next_ops) because the models we originally studied had back-to-back linears. If you only care about transformer models, you can simplify this code quite a bit by removing the special logic for extracting the second linear. Happy to point to the right places in the code if needed.

@jerryzh168 (author) replied:
Let me clean this up a bit later, since I'm still not sure whether we will need to reimplement this functionality with some other approach; I'll figure that out as we expand testing to more models.
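
To illustrate the back-to-back linear pattern discussed in this thread, here is a hedged sketch (not the file's actual extraction logic) of detecting that pattern in a torch.fx graph:

```python
# Find back-to-back linears: a call_module node targeting nn.Linear whose
# output feeds directly into another nn.Linear call_module node.
import torch
import torch.fx


def find_back_to_back_linears(gm: torch.fx.GraphModule):
    def is_linear(node):
        return node.op == "call_module" and isinstance(
            gm.get_submodule(node.target), torch.nn.Linear
        )

    pairs = []
    for node in gm.graph.nodes:
        if is_linear(node):
            pairs.extend((node, user) for user in node.users if is_linear(user))
    return pairs


# usage on a toy module with two consecutive linears
class TwoLinears(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(8, 8)
        self.l2 = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.l2(self.l1(x))


gm = torch.fx.symbolic_trace(TwoLinears())
print(find_back_to_back_linears(gm))  # one (l1 node, l2 node) pair
```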

Diff context for the inline review comment below:

    return True
    return False

    def debug_single_linear(
A reviewer (Contributor) commented:
depending on what you're using this file for, this function might also be deletable

@jerryzh168 (author) replied:
Yeah, I will refine this more when it's closer to landing; right now I'm just experimenting to see whether this approach improves on the original one for the models we care about.

@jerryzh168 marked this pull request as draft on November 12, 2024 at 00:35
@jerryzh168 marked this pull request as ready for review on November 16, 2024 at 02:59
@jerryzh168 added the `topic: not user facing` label on Nov 16, 2024
@drisspg (Contributor) commented Nov 20, 2024

One overall nit is that since this seems like a prototype that will eventually back the main autoquant API, we should probably put this in the prototype folder until we're ready to move.

@jerryzh168 (author) replied:

> One overall nit is that since this seems like a prototype that will eventually back the main autoquant API, we should probably put this in the prototype folder until we're ready to move.

oh OK makes sense, I can move it

Summary:
We refactored v1 to benchmark subgraphs of (prev_op -> linear -> post_op) in order to get a more accurate timing estimate.
One issue is that we now need to account for the batch size of the subgraph, so we would need the batch size dimension to use a
symbolic shape, which does not seem to be well supported in torch.compile right now.

More improvements:
* the current batch size adjustment code is hardcoded to work for the llama model; we need to think of a way to generalize it
* use a canonicalized subgraph as the cache key to reduce the number of times we need to benchmark
* add accuracy sanity checks

Test Plan:
Testing with torchao/_models/llama/generate.py

```
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --quantization autoquant_v2-int4
```

Reviewers:

Subscribers:

Tasks:

Tags:
@jerryzh168 (author) commented:

Thanks @drisspg @vkuzo for the review; I have addressed the comments, please take a look again.

@jerryzh168 merged commit 7446433 into pytorch:main on Nov 21, 2024
18 checks passed
@jerryzh168 deleted the autoquant-v2 branch on November 21, 2024 at 22:39