Allow quantized linear registration in a different file #783

Merged: 5 commits into pytorch:main on Sep 3, 2024

Conversation

@jerryzh168 (Contributor) commented Aug 30, 2024

Summary:

Previously there was an ordering we needed to maintain for the quantized linear dispatch table in AffineQuantizedTensor. The reason is that there is a fallback entry that dequantizes the input (https://github.com/pytorch/ao/blob/ba2d3b1333b90ccd0186216649a1c58c6a17ce56/torchao/dtypes/affine_quantized_tensor.py#L1195):

(_linear_quantized_act_fallback_check, _linear_quantized_act_fallback_impl),

so dispatches with two quantized inputs (static or dynamic quantization) had to come before this entry and before the weight only quantization dispatches. However, the fallback is not really used or needed in practice, since people typically just want to call into a very specific kernel.
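
For context, a minimal sketch of what that fallback conceptually does, assuming both activation and weight expose a dequantize() method; _fallback_impl is a placeholder name for illustration, not the exact torchao source:

import torch.nn.functional as F

def _fallback_impl(input_tensor, weight_tensor, bias):
    # Dequantize both tensors back to floating point and run the plain
    # linear op. Numerically fine but slow: it bypasses every quantized
    # kernel, which is why kernel-specific dispatches must be checked first.
    return F.linear(input_tensor.dequantize(), weight_tensor.dequantize(), bias)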

From offline discussions with @drisspg and @HDCharles, it might be useful to have a "quantized_linear_impl" field on LayoutType. This allows people to specify and check which quantized_linear_impl they want, making sure they can call into the specific kernel. When this field is set, we also won't run the fallback path for quantized linear (dequantize all activation and weight tensors and run the floating point linear op).
I think this can be added on a specific layout type when people want it, and we don't have to enforce it in the base LayoutType; otherwise we'd have to specify it for every LayoutType, which seems a bit cumbersome. The contract here is that when people want to skip the fallback path, they can do:

class MyLayoutType(LayoutType):
    quantized_linear_impl: str = "some_impl"
    ...

and have the dispatch_condition function check for this:

def dispatch_condition(...):
    return weight_tensor.layout_type.quantized_linear_impl == "some_impl" and ...

If there are issues with the implementation, the fallback will not be called in this case, so the problem surfaces instead of being masked by the dequantize path.
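
Putting this together, here is a minimal sketch of registering a new dispatch from a separate file. It assumes the (input_tensor, weight_tensor, bias) signature used by the existing dispatch conditions and import paths as of this PR; register_aqt_quantized_linear_dispatch is the helper this PR makes public, while MyLayoutType, _my_dispatch_condition, and _my_impl are placeholders:

from dataclasses import dataclass

from torchao.dtypes.affine_quantized_tensor import register_aqt_quantized_linear_dispatch
from torchao.dtypes.utils import LayoutType

@dataclass(frozen=True)
class MyLayoutType(LayoutType):
    quantized_linear_impl: str = "some_impl"

def _my_dispatch_condition(input_tensor, weight_tensor, bias):
    # Only claim the linear op when the weight was quantized with our layout
    # type and it explicitly asks for this impl; otherwise let other table
    # entries match (or let dispatch fail, rather than silently falling back).
    return (
        isinstance(weight_tensor.layout_type, MyLayoutType)
        and weight_tensor.layout_type.quantized_linear_impl == "some_impl"
    )

def _my_impl(input_tensor, weight_tensor, bias):
    # Call into the specific quantized kernel here.
    raise NotImplementedError("placeholder for the actual kernel call")

# Global registration; it can live in any file, which is the point of this PR.
register_aqt_quantized_linear_dispatch(_my_dispatch_condition, _my_impl)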

Test Plan:
python test/dtypes/test_affine_quantized.py -k test_register_new_dispatch

Reviewers:

Subscribers:

Tasks:

Tags:

pytorch-bot bot commented Aug 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/783

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b7c1512 with merge base e2dad4a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Aug 30, 2024
@jerryzh168 jerryzh168 requested a review from vayuda August 30, 2024 20:24
@jerryzh168 jerryzh168 requested a review from jcaip September 2, 2024 17:24
"""
_AQT_QLINEAR_DISPATCH_TABLE[dispatch_condition] = impl

def _deregister_aqt_quantized_linear_dispatch(dispatch_condition):
Collaborator commented: When should this be used?

@jerryzh168 (Contributor, Author) commented Sep 2, 2024: This is used to remove a registration. Most of the time you probably don't need it; I'm using it here just to make sure the test passes, since the registration is global.
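
A sketch of that test-cleanup pattern, assuming the public register/deregister pair from this PR and the placeholder _my_dispatch_condition/_my_impl from the description above:

from torchao.dtypes.affine_quantized_tensor import (
    register_aqt_quantized_linear_dispatch,
    deregister_aqt_quantized_linear_dispatch,
)

def test_register_new_dispatch():
    register_aqt_quantized_linear_dispatch(_my_dispatch_condition, _my_impl)
    try:
        ...  # exercise a linear op that should hit _my_impl
    finally:
        # The dispatch table is global state, so always deregister to avoid
        # leaking the entry into other tests.
        deregister_aqt_quantized_linear_dispatch(_my_dispatch_condition)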

@jerryzh168 force-pushed the improve-quantized-linear branch from 915e8d3 to b7c1512 on September 2, 2024 17:55
@jerryzh168 jerryzh168 merged commit e15e509 into pytorch:main Sep 3, 2024
17 checks passed
@jerryzh168 jerryzh168 deleted the improve-quantized-linear branch September 3, 2024 03:06
jerryzh168 added a commit to jerryzh168/ao that referenced this pull request Sep 4, 2024
* Allow quantized linear registration in a different file

* fix error

* de-register dispatch

* make register/deregister fn public

* rebase and fix error