Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subclass API (#966) #995

Merged
merged 1 commit into from
Oct 30, 2024
Merged

Conversation

metascroy
Copy link
Contributor

Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 2, 2024
Copy link

pytorch-bot bot commented Oct 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/995

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ae21905 with merge base 958a197 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487

@@ -300,7 +300,7 @@ def _quantize_affine_no_dtype_cast(
elif zero_point_domain is None:
# This case handles quantization for float8 we expect no zero point and no zero point domain
assert zero_point is None, "zero_point should be None when zero_point_domain is None"
quant = torch.clamp(input * scale.reciprocal(), quant_min, quant_max)
quant = torch.clamp(torch.round(input * (1.0 / scale)), quant_min, quant_max)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerryzh168 to confirm if this is OK. It was needed to match behavior of other quantizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, it might be fine as long as all the tests passes I think

Copy link
Contributor Author

@metascroy metascroy Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests do not pass because it is slightly different quantization logic. It looks more sensible to me to round before truncating, but I can also drop this change.

We can do a perplexity study when moving from the other quantizer to this one in torchchat. But I have narrowed down this as being the only numerical difference between the two.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we don't want to break tests I think, but if this is better for torchchat we can create a new quant primitive op or add a new option here I feel

if preserve_zero:
zero_point = quant_min - torch.round(min_val_neg / scale)
zero_point = torch.clamp(zero_point, quant_min, quant_max)
if zero_point_domain is None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerryzh168 confirm if this is OK. It was needed to get scale-only quantization in affine_quantized_tensor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, should zero_point be None here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could make it None, but that changes the return type of this method from Tuple[Tensor, Tensor] to Tuple[Tensor, Optional[Tensor]]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think making it None probably makes more sense here

exported = torch.export.export(model, (activations,))

print("Compiling quantized model")
compiled = torch.compile(unwrapped_model)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerryzh168 do you see unification for compile and export coming soon? The fact that one requires an unwrapped tensor subclass and the other requires a wrapped one makes using this API inconvenient in torchchat.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's blocked by pytorch/pytorch#129682 and I heard @tugsbayasgalan is working on this

@metascroy
Copy link
Contributor Author

@kimishpatel @jerryzh168 moving review over to GH. I hope I've addressed most of your concerns.

@jerryzh168, the fact that compile and export cannot handle the same model (export requires an unwrapped tensor subclass, compile requires a wrapped one, and eager can handle both) makes using this API inconvenient in torchchat. Do you know if there is planned unification there?

input_tensor = input_tensor.reshape(-1, m, k)

res = [
_impl_2d(input_tensor[i, :, :], weight_tensor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you doing it like this? You can just fuse first N dim. LIke line 379 should be
input_tensor = input_tensor.reshape(-1, k) no?

)

# Quantize activations
activation_scales, activation_zeros = choose_qparams_affine(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dynamic quantization should be reusing affine quantized tensor, example:

def int8_dynamic_activation_int8_weight(layout_type=PlainLayoutType()):

why is this calling these functions here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function doesn't look equivalent? It looks like the quantization is symmetric.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can choose a different mapping type,
I mean we can use

weight = to_linear_activation_quantized(weight, input_quant_func)
to do activation quantization, maybe "Dynamic Activation Quantization + Weight Quantization" in #391 is clearer in explaining this, please let me know if it makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't follow what you're suggesting. I have weights_dequantized on line 260, and I need activations_dequantized so I can call torch.matmul(activations_dequantized, weights_dequantized.transpose(1, 0)) on line 296.

I'm not sure what I'm suppose to replace the code that generates activations_dequantized with.



# This format is intended for use with int8 dynamic quantization
class IntxWeightLayoutType(LayoutType):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry still find this name not descriptive, what are the kernels this layout is targeting? are these executorch native kernels? if so maybe IntxExecutorchLayout or similar might be more helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This layout targets the linear_8bit_act_xbit_weight kernels in torchao/experimental. They can run on all PyTorch platforms (eager, AOTI, compile, and ExecuTorch), so adding ExecuTorch doesn't make sense. Open to naming suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I feel it makes sense to just include the kernel name here then, like Linear8BitActXBitWeightLayoutType (also we are renaming LayoutType to Layout to make things clearer) you'll probably see this after rebase

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerryzh168 I honestly find layout type description not very useful. There can be N different layouts for N different kernels. Right now entire machinery in subclass API is built around what operator or operator implementation to dispatch to using the layout information. I see two issues here

  1. Each operator maybe backed by different kernel implementation that want different layout. It does not seem feasible to enumerate all possible ways in which weights can be packed, and make such information visible to quant API
  2. Quant API shouldnt really be in the business of understanding packed layout information. This should really be left to subclasses of AQT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kimishpatel I don't quite follow why "it's not feasible to enumerate all possible ways in which weights can be packed" but I think we are not asking people to use AffineQuantizedTensor at this stage as I discussed in the post, so feel free to contribute in a way you feel makes more sense, we can always merge/refactor later if needed. although a high level API + brief description of implementation might be helpful for us to understand what you have in mind.

also for "Quant API shouldnt really be in the business of understanding packed layout information. This should really be left to subclasses of AQT" I think it might be better/easier to just copy paste AQT and create a new tensor subclass in that case instead of subclassing AQT, we are not clear whether we should have inheriting AQT as an extension point yet

n, k_ = weight_tensor.shape
assert k_ == k

weights_dequantized = dequantize_per_channel_group(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we tend to use quantize_affine/dequantize_affine I think, also this should probably be:

weights_dequantized = weight_tensor.dequantize()?

assert len(weight_tensor.block_size) == 2
assert weight_tensor.block_size[0] == 1
group_size = weight_tensor.block_size[1]
assert group_size == weight_tensor.layout_tensor.layout_type.group_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can probably be weight_tensor.layout_type.group_size (although we are renaming layout_type to layout now

@kimishpatel
Copy link
Contributor

I plan to review some time tomorrow

metascroy added a commit to metascroy/ao that referenced this pull request Oct 11, 2024
Summary:


Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 11, 2024
Summary:


Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487


def apply(weight):
assert weight.shape[-1] % group_size == 0
assert weight.device == torch.device("cpu"), "Only CPU is supported"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add CPU device assert here

Copy link
Contributor

@jerryzh168 jerryzh168 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some requested changes, please see comments

metascroy added a commit to metascroy/ao that referenced this pull request Oct 18, 2024
Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 20, 2024
Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 21, 2024
Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 21, 2024
Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readme.md --> README.md?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we should remove the quantizer in the name, _linear_8bit_act_xbit_weight_layout.py might be more appropriate

weights_dequantized = weight_tensor.dequantize()

# Quantize activations
activation_scales, activation_zeros = choose_qparams_affine(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably use to_affine_quantized_intx and dequantize_affine here to quantize activation?

Comment on lines 14 to 15
dequantize_per_channel_group,
quantize_per_channel_group,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these ops are a bit deprecated so we tend not to use these if possible

MappingType,
ZeroPointDomain,
)
from torchao.utils import TorchAOBaseTensor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looks like not used

Copy link
Contributor

@jerryzh168 jerryzh168 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left a few more nit comments

"""
INT = auto()
FLOAT = auto()
ZERO = auto()
Copy link
Contributor

@jerryzh168 jerryzh168 Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I feel maybe NONE or NO_ZERO_POINT would be better?

@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 29, 2024
Summary:
Pull Request resolved: pytorch#995

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
)


def int8_dyn_act_intx_weight(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are spelling out all the words in our API so far, so this should be int8_dynamic_activation_intx_weight I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename the tests as well

Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Reviewed By: jerryzh168

Differential Revision: D62464487
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D62464487

@facebook-github-bot facebook-github-bot merged commit 581d8e0 into pytorch:main Oct 30, 2024
18 of 19 checks passed
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
* Add warning comments referring to unimplemented functionality

* JSON formatted response using OpenAI API types for server completion requests

* Add models endpoint (pytorch#1000)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants