
Add AffineQuantizedObserver #650

Merged
1 commit merged into pytorch:main from add-observer on Aug 14, 2024

Conversation

jerryzh168
Contributor

Summary:
In our static_quant flow tutorial we were still using observers from torch.ao, which we plan to deprecate. This PR adds a more general observer for AffineQuantizedTensor and shows that it can replace the old observers (min/max observer). There could be further work to improve perf and add new types of observation, e.g. tracking stats other than min/max, a moving average observer, or a histogram observer.

Test Plan:
python test/quantization/test_observer.py
python tutorials/calibration_flow/static_quant.py

Reviewers:

Subscribers:

Tasks:

Tags:
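For context, here is a self-contained toy sketch of the min/max observation plus affine qparam calculation idea that the new observer is built around. The class and method names below are illustrative only and are not the torchao API.

import torch

class ToyMinMaxObserver(torch.nn.Module):
    # Tracks running min/max of the tensors it sees and derives affine
    # quantization parameters from them. Illustrative sketch, not torchao code.
    def __init__(self, quant_min: int = -128, quant_max: int = 127):
        super().__init__()
        self.quant_min = quant_min
        self.quant_max = quant_max
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Update running statistics during calibration and pass the input through.
        self.min_val = torch.minimum(self.min_val, x.detach().amin())
        self.max_val = torch.maximum(self.max_val, x.detach().amax())
        return x

    def calculate_qparams(self):
        # Affine quantization: quantized_val = round(fp_val / scale) + zero_point
        scale = (self.max_val - self.min_val) / (self.quant_max - self.quant_min)
        scale = torch.clamp(scale, min=1e-12)
        zero_point = self.quant_min - torch.round(self.min_val / scale)
        return scale, zero_point.to(torch.int64)

obs = ToyMinMaxObserver()
for _ in range(4):  # "calibration" over a few random batches
    obs(torch.randn(8, 16))
scale, zero_point = obs.calculate_qparams()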


pytorch-bot bot commented Aug 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/650

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f3fc52b with merge base 88a263a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Aug 9, 2024
@jerryzh168 force-pushed the add-observer branch 2 times, most recently from 58b208a to 3c6c4df on August 10, 2024 00:08
return tuple(block_size)
raise ValueError(f"Unsupported GranularityType: {granularity_type}")

class AffineQuantizedObserver(torch.nn.Module):
Collaborator

Is the precedent to add observers here, or within the quantization file they are used for? I also see that you had to add a new function to quant_primitives.

Contributor Author

It depends on how widely the observer will be used, I think; you can start with the quantization file and upstream later when needed.

Yeah, this one requires some changes to quant primitives as well, but that's not always needed, I think.

Collaborator

@vayuda left a comment

Looks good for AWQ workflow

r = _PartialWrapper(partial(cls_or_self, *args, **kwargs))
return r

def get_block_size(input_shape: Tuple[int, ...], granularity_type: GranularityType) -> Tuple[int, ...]:
Contributor

should we merge this with _get_per_token_block_size from torchao/quantization/utils.py?

Contributor

I feel in general PerTensor and PerAxis will be useful for the dynamic / weight-only flows as well. We can do that in a future PR.

Contributor Author

Yeah, we can do this in a future PR. I feel we can merge everything into this function and move it to torchao/quantization/utils.py.
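For illustration, here is a rough sketch of how such a merged function could cover per-tensor, per-axis, and per-token cases. The granularity marker classes below are made up for the example; the actual GranularityType classes in torchao may differ.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class PerTensor:  # illustrative granularity markers, not the torchao classes
    pass

@dataclass
class PerAxis:
    axis: int

@dataclass
class PerToken:
    pass

def get_block_size(input_shape: Tuple[int, ...], granularity) -> Tuple[int, ...]:
    # Map a granularity marker to the block_size tuple used by affine quantization.
    if isinstance(granularity, PerTensor):
        # One block covering the whole tensor -> a single scale/zero_point.
        return input_shape
    if isinstance(granularity, PerAxis):
        # Keep the chosen axis at size 1 so each slice along it gets its own qparams.
        block_size = list(input_shape)
        block_size[granularity.axis] = 1
        return tuple(block_size)
    if isinstance(granularity, PerToken):
        # One scale/zero_point per token: all leading dims are 1, last dim is full.
        return (1,) * (len(input_shape) - 1) + (input_shape[-1],)
    raise ValueError(f"Unsupported granularity: {granularity}")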

return update_stats_min_max, calculate_qparams_min_max

_update_stats_min_max, _calculate_qparams_min_max = get_min_max_funcs()
AffineQuantizedMinMaxObserver = AffineQuantizedObserver.with_args(_update_stats_min_max, _calculate_qparams_min_max)
Contributor

What's the reason behind returning these functions as Callables instead of overriding them in the base class? I feel it's cleaner to do the latter (also done in the torch.ao flows) so we don't have to pass functions around

Contributor Author

OK, sounds good. I don't have a very strong preference on this point; the initial motivation was for people to reuse the same AffineQuantizedObserver everywhere, but it sounds fine for them to inherit as well.
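For illustration, a minimal sketch of the two options being weighed here; the names are placeholders rather than the actual torchao classes.

import torch

# Option 1: behavior passed in as callables and bound via a factory, e.g.
# AffineQuantizedMinMaxObserver = AffineQuantizedObserver.with_args(
#     _update_stats_min_max, _calculate_qparams_min_max)

# Option 2: behavior supplied by overriding methods in a subclass,
# as the torch.ao observers do.
class ObserverBaseSketch(torch.nn.Module):
    def forward(self, x):
        # update running stats, then pass the input through
        raise NotImplementedError

    def calculate_qparams(self):
        # derive scale/zero_point from the tracked stats
        raise NotImplementedError

class MinMaxObserverSketch(ObserverBaseSketch):
    def __init__(self):
        super().__init__()
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def forward(self, x):
        self.min_val = torch.minimum(self.min_val, x.detach().amin())
        self.max_val = torch.maximum(self.max_val, x.detach().amax())
        return x

    def calculate_qparams(self):
        # same affine scale/zero_point math as any min/max observer
        scale = torch.clamp((self.max_val - self.min_val) / 255.0, min=1e-12)
        zero_point = -128 - torch.round(self.min_val / scale)
        return scale, zero_point.to(torch.int64)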

@cpuhrsch
Contributor

Out of curiosity (and ignorance), but why do we need to couple "Observer" to a particular dtype? Doesn't it just track statistics of a particular Tensor over time?

Or put differently, is there an example of / plans for a Bfloat16Observer or FP8Observer?

@jerryzh168
Contributor Author

jerryzh168 commented Aug 13, 2024

Out of curiosity (and ignorance), but why do we need to couple "Observer" to a particular dtype? Doesn't it just track statistics of a particular Tensor over time?

Or put differently, is there an example of / plans for a Bfloat16Observer or FP8Observer?

Yeah, it's not coupled with dtype (bfloat16, fp8, int8), but it is related to the type of quantization, I think, in this case "affine quantization with some block_size argument" (quantized_val = fp_val / scale + zero_point). The tensor stats we are tracking (e.g. min_val/max_val) will be affected by block_size: per tensor quant will track a single scalar min_val/max_val, per axis (axis_dim=0) quant will track min_val/max_val tensors with size == weight.size(0), etc.

Other types of quantization could be "non uniform" quantization, where people may want to figure out the best look-up table for a dtype; that can have a very different argument list and behavior, so we can create new observers for those. For bfloat16 we probably won't need a new observer, and for fp8 I think we also don't need one, because it is also doing affine quantization. We also plan to integrate fpx (fp2-fp8) into AffineQuantizedTensor, and it should be easy to reuse this observer there.

In theory I could make the observer unrelated to the type of quantization as well, but I feel it's easier for people to specify all configurations in one place, so a good heuristic might be: type of quantization + type of observation, e.g.

AffineQuantized{MinMax}Observer: tracks min_val/max_val
AffineQuantized{Histogram}Observer: tracks a histogram and uses that to calculate min_val/max_val
AffineQuantized{MovingAverage}Observer: tracks a moving average of min_val/max_val (can be merged into the MinMax observer)

Affine quantization arguments can indeed be decoupled from the observer, but I think we can do that later when there are more observers.
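To make the block_size point concrete, here is a toy illustration of how block_size changes the shape of the tracked stats. This is not torchao code, and for simplicity it only handles blocks that are either size 1 or span the full dimension.

import torch

def block_min_max(x, block_size):
    # Reduce over every dimension whose block spans the full dim; keep the
    # dimensions that are split into size-1 blocks (per-axis style).
    reduce_dims = [d for d, b in enumerate(block_size) if b == x.shape[d]]
    return x.amin(dim=reduce_dims), x.amax(dim=reduce_dims)

w = torch.randn(32, 64)

# per tensor quant: the block covers the whole tensor -> scalar min_val/max_val
mn, mx = block_min_max(w, (32, 64))
print(mn.shape, mx.shape)  # torch.Size([]) torch.Size([])

# per axis quant (axis_dim=0): one block per row -> min_val/max_val of shape (32,)
mn, mx = block_min_max(w, (1, 64))
print(mn.shape, mx.shape)  # torch.Size([32]) torch.Size([32])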

@jerryzh168
Contributor Author

jerryzh168 commented Aug 13, 2024

@cpuhrsch please let me know if you want me to separate the affine quantized tensor arg list and the min/max observer now. It would be something like this:

# has the min_val/max_val logic
class MinMaxObserver(...):
    ...
    def forward(...):
        self.min_val = ...
        self.max_val = ...
        ...

# has the list of args for AffineQuantizedTensor
class AffineQuantizedObserver(...):
    def __init__(...):
        ...

class AffineQuantizedMinMaxObserver(AffineQuantizedObserver, MinMaxObserver):
    def calculate_qparams(...):
        # use self.min_val, self.max_val and the args in AffineQuantizedObserver
        # to calculate qparams for AffineQuantizedTensor
        ...

@jerryzh168
Contributor Author

Will land for now; we can follow up to split the logic when there are more observers, I think.

@jerryzh168 merged commit 6199f89 into pytorch:main on Aug 14, 2024
14 checks passed
@jerryzh168 deleted the add-observer branch on August 14, 2024 19:48