Add AffineQuantizedObserver #650
torchao/quantization/observer.py (outdated):

```python
        return tuple(block_size)
    raise ValueError(f"Unsupported GranularityType: {granularity_type}")


class AffineQuantizedObserver(torch.nn.Module):
```
Is the precedent to add observers here, or within the quantization file they are used for? I also see that you had to add a new function to quant_primitives.
Depends on how widely the observer will be used, I think; you can start with the quantization file and upstream later when needed.
Yeah, this one requires some changes to quant primitives as well, but that's not always needed, I think.
Looks good for AWQ workflow
```python
    r = _PartialWrapper(partial(cls_or_self, *args, **kwargs))
    return r


def get_block_size(input_shape: Tuple[int, ...], granularity_type: GranularityType) -> Tuple[int, ...]:
```
Should we merge this with _get_per_token_block_size from torchao/quantization/utils.py?
I feel in general PerTensor and PerAxis will be useful for the dynamic / weight-only flows as well. We can do that in a future PR.
Yeah, we can do this in a future PR. I feel we can merge everything into this function and move it to torchao/quantization/utils.py.
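For reference, a minimal sketch of what such a merged helper could look like. The PerTensor/PerAxis classes here are simplified stand-ins for the PR's granularity types, and PerToken is a hypothetical name for the case _get_per_token_block_size currently handles:

```python
from dataclasses import dataclass
from typing import Tuple


class GranularityType:
    pass


@dataclass(frozen=True)
class PerTensor(GranularityType):
    pass


@dataclass(frozen=True)
class PerAxis(GranularityType):
    axis: int


@dataclass(frozen=True)
class PerToken(GranularityType):
    pass


def get_block_size(input_shape: Tuple[int, ...], granularity_type: GranularityType) -> Tuple[int, ...]:
    """Map a granularity spec onto the block_size used by affine quantization."""
    if isinstance(granularity_type, PerTensor):
        # one block covering the whole tensor -> a single scale/zero_point
        return input_shape
    if isinstance(granularity_type, PerAxis):
        # each slice along `axis` gets its own qparams
        block_size = list(input_shape)
        block_size[granularity_type.axis] = 1
        return tuple(block_size)
    if isinstance(granularity_type, PerToken):
        # same result as _get_per_token_block_size: qparams per token,
        # i.e. block size 1 on every dim except the last
        return (1,) * (len(input_shape) - 1) + (input_shape[-1],)
    raise ValueError(f"Unsupported GranularityType: {granularity_type}")
```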
torchao/quantization/observer.py (outdated):

```python
    return update_stats_min_max, calculate_qparams_min_max


_update_stats_min_max, _calculate_qparams_min_max = get_min_max_funcs()
AffineQuantizedMinMaxObserver = AffineQuantizedObserver.with_args(_update_stats_min_max, _calculate_qparams_min_max)
```
What's the reason behind returning these functions as Callables instead of overriding them in the base class? I feel it's cleaner to do the latter (this is also done in the torch.ao flows), so we don't have to pass functions around.
OK, sounds good. I don't have a very strong preference on this point; the initial motivation was for people to reuse the same AffineQuantizedObserver everywhere, but it sounds fine for them to inherit as well.
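For illustration, a minimal sketch of the subclassing approach being agreed on here (class and method names are assumptions for this sketch, not the exact PR API):

```python
import torch


class AffineQuantizedObserverBase(torch.nn.Module):
    """Subclasses override the two hooks below instead of passing Callables."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.update_stats(x.detach())
        return x

    def update_stats(self, x: torch.Tensor) -> None:
        raise NotImplementedError

    def calculate_qparams(self):
        raise NotImplementedError


class MinMaxObserver(AffineQuantizedObserverBase):
    """Tracks running min/max and derives affine qparams from them."""

    def __init__(self, quant_min: int = -128, quant_max: int = 127):
        super().__init__()
        self.quant_min = quant_min
        self.quant_max = quant_max
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def update_stats(self, x: torch.Tensor) -> None:
        self.min_val = torch.minimum(self.min_val, x.min())
        self.max_val = torch.maximum(self.max_val, x.max())

    def calculate_qparams(self):
        # affine mapping: quantized_val = fp_val / scale + zero_point
        scale = (self.max_val - self.min_val) / (self.quant_max - self.quant_min)
        scale = torch.clamp(scale, min=torch.finfo(torch.float32).eps)
        zero_point = self.quant_min - torch.round(self.min_val / scale)
        return scale, zero_point
```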
Out of curiosity (and ignorance): why do we need to couple "Observer" to a particular dtype? Doesn't it just track statistics of a particular Tensor over time? Or, put differently, is there an example of / plans for a Bfloat16Observer or FP8Observer?
Yeah, it's not coupled to a dtype (bfloat16, fp8, int8), but it is related to the type of quantization, I think; in this case "affine quantization with some block_size argument" (quantized_val = fp_val / scale + zero_point), and the Tensor stats we are tracking (e.g. min_val/max_val) will be affected by those arguments.

Other types of quantization could be "non-uniform" quantization, where people may want to figure out the best lookup table for a dtype; that can have a very different argument list and behavior, so we can create new observers for those. For bfloat16 we probably won't need a new observer, and for fp8 I think we also don't need one, because it's also doing affine quantization. We also plan to integrate fpx (fp2-fp8) into AffineQuantizedTensor, and it should be easy to reuse this observer.

In theory I could make the observer unrelated to the type of quantization as well, but I feel it's easier for people to just specify all configurations in one place, so a good heuristic might be: type of quantization + type of observation, e.g. AffineQuantized{MinMax}Observer: tracking min_val/max_val.

Affine quantization arguments can indeed be decoupled from the observer, but I think we can do this later when there are more observers.
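To make that mapping concrete, here is a minimal per-tensor sketch of the affine formula described above (function names are illustrative, not torchao's actual quant_primitives API):

```python
import torch


def quantize_affine_sketch(
    x: torch.Tensor,
    scale: torch.Tensor,
    zero_point: torch.Tensor,
    quant_min: int = -128,
    quant_max: int = 127,
) -> torch.Tensor:
    # quantized_val = fp_val / scale + zero_point, then round and clamp
    q = torch.round(x / scale + zero_point)
    return torch.clamp(q, quant_min, quant_max).to(torch.int8)


def dequantize_affine_sketch(
    q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor
) -> torch.Tensor:
    # inverse mapping back to floating point
    return (q.to(torch.float32) - zero_point) * scale
```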
Summary: In our static_quant flow tutorial we were still using observers from torch.ao, which we plan to deprecate. This PR adds a more general observer for AffineQuantizedTensor and shows that it can replace the old observers (the min/max observer). Future work could improve perf and add new types of observation, e.g. tracking stats other than just min/max, a moving-average observer, or a histogram observer.

Test Plan:
python test/quantization/test_observer.py
python tutorials/calibration_flow/static_quant.py
@cpuhrsch please let me know if you want me to separate the affine quantized tensor arg list and the min/max observer now. It will be something like this:
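(The snippet that originally followed this comment isn't preserved on this page. Purely as a hypothetical illustration of such a split, with all names made up rather than taken from the author's actual proposal:)

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import torch


@dataclass
class AffineQuantArgs:
    """The AffineQuantizedTensor argument list, pulled out of the observer."""
    target_dtype: torch.dtype
    block_size: Tuple[int, ...]
    quant_min: Optional[int] = None
    quant_max: Optional[int] = None


class MinMaxObserver(torch.nn.Module):
    """Observer that only tracks stats; the quantization config is injected."""

    def __init__(self, quant_args: AffineQuantArgs):
        super().__init__()
        self.quant_args = quant_args
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        detached = x.detach()
        self.min_val = torch.minimum(self.min_val, detached.min())
        self.max_val = torch.maximum(self.max_val, detached.max())
        return x
```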
Will land for now; we can follow up to split the logic when there are more observers, I think.