quantize_activation_per_token_absmax: use general quant primitives #193

Conversation
Diff in generated quantized code: https://www.internalfb.com/phabricator/paste/view/P1226948181
Changes to generated quantized code for vit: https://www.internalfb.com/phabricator/paste/view/P1227030036
Hm, those changes seem to indicate that something substantive is still different about this refactor. There must be a slight difference somewhere.
Yeah, I don't think the generated code is exactly the same. It's calling mostly the same ops, but with different args; for example, https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_primitives.py#L273 has different dim and keepdim args. Is there a way to test the speed of the generated code?
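For reference, a minimal sketch (shapes and names assumed, not taken from the linked code) of how different dim/keepdim arguments can still compute the same per-token absmax reduction:

```python
import torch

# hypothetical activation of shape (batch, tokens, hidden)
x = torch.randn(2, 8, 16)

# reduce over the last dim, keeping it for broadcasting against x
amax_keep = x.abs().amax(dim=-1, keepdim=True)  # shape (2, 8, 1)

# reduce without keepdim, then restore the trailing dim by hand
amax_flat = x.abs().amax(dim=-1, keepdim=False).unsqueeze(-1)

# both argument styles yield identical scales
assert torch.equal(amax_keep, amax_flat)
```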
Looks like the benchmark code will print the time; let me check.
Before the change: elapsed_time: 1.4610368347167968 milliseconds. Looks like not much difference.
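For context, a rough sketch of how such a per-iteration elapsed time is typically measured on GPU (assumptions: a CUDA device and a callable `fn`; this is not the actual benchmark script):

```python
import torch

def elapsed_time_ms(fn, *args, warmup=5, iters=20):
    # warm up so compilation and caching are excluded from the measurement
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration
```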
I'm less worried about speed and more about correctness. I don't think a single benchmark datapoint will be conclusive here either. I saw an additional call to …
Sure. For correctness, are you referring to numerics? We have regression tests here: https://github.com/pytorch/ao/blob/main/test/integration/test_integration.py#L586 and they pass.
I just did a quick test with a 10x10 tensor: the quantized integer values match exactly, but there are some differences in the scales. I think it's because it's doing the clamping before dividing by q_max, instead of after: https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_primitives.py#L335
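To illustrate the ordering difference (a small sketch with assumed values, not the library code): clamping before vs. after the division by q_max only changes the scales for near-zero rows, which matches the observation that the integer values agree but some scales differ.

```python
import torch

amax = torch.tensor([0.0, 1e-6, 2.0])  # per-token absmax values
q_max = 2 ** (8 - 1) - 1               # 127 for int8

scale_clamp_first = amax.clamp(min=1e-5) / q_max   # clamp, then divide
scale_clamp_last = (amax / q_max).clamp(min=1e-5)  # divide, then clamp

print(scale_clamp_first)  # tensor([7.8740e-08, 7.8740e-08, 1.5748e-02])
print(scale_clamp_last)   # tensor([1.0000e-05, 1.0000e-05, 1.5748e-02])
```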
        scales.float()
    )  # want float scales to avoid overflows for fp16 (bf16 has wide enough range)
    q_max = 2 ** (n_bits - 1) - 1
    scales = scales.clamp(min=1e-5).div(q_max)
@vkuzo @HDCharles do you know if this clamping before the scale is calculated is required for quantize_activation_per_token_absmax to work for our use cases?
we need an epsilon to handle the case where max(abs(x)) is zero
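A quick sketch of what goes wrong without it (assumed values, not the library code):

```python
import torch

x = torch.zeros(4)
scale = x.abs().amax()         # 0.0 for an all-zero input
print((x / scale).round())     # 0/0 -> tensor([nan, nan, nan, nan])

scale = scale.clamp(min=1e-5)  # epsilon keeps the scale nonzero
print((x / scale).round())     # tensor([0., 0., 0., 0.])
```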
I see. Can we do this after we divide by q_max? That's where we typically do the clamp.
if eps is not None:
    scale = torch.clamp(scale, min=eps)

You have this logic in choose_qparams_affine; I would imagine it's there for the same purpose. I would recommend:
- always specify epsilon (not sure why this is optional)
- ensure test cases include testing for max(abs(x)) == 0
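A sketch of the suggested test (the exact return signature of the op is an assumption here, not confirmed from the source):

```python
import torch
from torchao.quantization.quant_primitives import (
    quantize_activation_per_token_absmax,
)

def test_all_zero_input():
    # max(abs(x)) == 0 for every token; nothing here may become NaN/Inf
    x = torch.zeros(2, 16)
    quantized, scales = quantize_activation_per_token_absmax(x)
    assert torch.isfinite(scales).all()
    assert not torch.isnan(quantized.float()).any()
```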
@vkuzo Is eps best chosen to be (a multiple of / the) machine epsilon of the input tensor's dtype? Or is it a parameter that needs to be searched depending on the input data distribution?
OK, I'll add a test. I'll think about whether we want to make it required; I believe we don't do clamping in some of the ops right now.
> will think about if we want to make it required

IMO the case of max(abs(x)) == 0 should be handled for every op.

> Is eps best chosen to be (a multiple of / the) machine epsilon of the input tensor's dtype?

I think choosing eps based on the dtype makes sense; it should not be data dependent. My educated guess for how to choose this number would be "the smallest value such that the resulting scale is finite"; I haven't looked into whether expressing that in terms of machine epsilon makes sense.
On second thought, I guess it makes sense to have an eps; otherwise we may divide by zero during quantization. I'll change it in a separate PR.
For eps, I think we can use torch.finfo(input_dtype).eps if it's not provided by the user.
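A sketch of that proposed default (`resolve_eps` is a hypothetical helper, not the merged code):

```python
import torch

def resolve_eps(eps, input_dtype):
    # fall back to the machine epsilon of the input dtype when unspecified
    if eps is None:
        eps = torch.finfo(input_dtype).eps  # e.g. ~1.19e-07 for float32
    return eps

scale = torch.tensor([0.0, 0.5])
scale = torch.clamp(scale, min=resolve_eps(None, torch.float32))
```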
Can we verify that the torch.compile'd code is the same between the old/new primitives in a unit test? It seems pretty important to make sure this is doing what we want after torch.compile on an ongoing basis.
E.g., we use https://github.com/pytorch/ao/blob/main/test/integration/test_integration.py#L1110 to extract the triton code in this test; it'd be good if we could compare the generated triton code.
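One way such a check could look (a hedged sketch; whether the linked test uses exactly this helper is an assumption, and `old_fn`/`new_fn` stand in for the old and refactored primitives):

```python
import torch
from torch._inductor.utils import run_and_get_code

def assert_same_generated_code(old_fn, new_fn, x):
    # run_and_get_code returns the result plus the generated source strings
    _, old_code = run_and_get_code(torch.compile(old_fn), x)
    _, new_code = run_and_get_code(torch.compile(new_fn), x)
    assert old_code == new_code, "generated triton code diverged"
```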
There are some differences for this specific op, see #193 (comment) (for the vit model), but we understand they come from the order in which we do the clamping: #193 (comment). I feel we probably want to establish high-level benchmarks like eval accuracy and performance numbers for these in the future.
Oh, do you mean to verify that mix_mm is called in the compiled code? That's a good idea, I can check that.
I don't think that's an issue; just reverse the order in the original op and compare the generated code for that, then. The triton-generation test is mostly about verifying perf between two things written in very different ways; functional changes aren't super important to that concern (though they're still relevant if significantly different, and this one shouldn't have a large impact).
So my point is that the generated code may change, and the change should be allowed as long as the generated code has the desired op (like the int4mm op), similar performance, and similar accuracy.
Summary:
att
Test Plan:
OSS CI
Reviewers:
Subscribers:
Tasks:
Tags: