
Add AffineQuantizedObserver #650

Merged
1 commit merged into pytorch:main from add-observer on Aug 14, 2024

Conversation

jerryzh168
Contributor

Summary:
In our static_quant flow tutorial we were still using observers from torch.ao, which we plan to deprecate. This PR adds a more general observer for AffineQuantizedTensor and shows that it can replace the old observers (min/max observer). There could be further work to improve perf and add new types of observation, e.g. tracking stats other than min/max, a moving average observer, or a histogram observer.

Test Plan:
python test/quantization/test_observer.py
python tutorials/calibration_flow/static_quant.py

Reviewers:

Subscribers:

Tasks:

Tags:
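For context, here is a self-contained toy sketch of the min/max observation plus affine qparam calculation idea that the new observer is built around. The class and method names below are illustrative only and are not the torchao API.

import torch

class ToyMinMaxObserver(torch.nn.Module):
    # Tracks running min/max of the tensors it sees and derives affine
    # quantization parameters from them. Illustrative sketch, not torchao code.
    def __init__(self, quant_min: int = -128, quant_max: int = 127):
        super().__init__()
        self.quant_min = quant_min
        self.quant_max = quant_max
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Update running statistics during calibration and pass the input through.
        self.min_val = torch.minimum(self.min_val, x.detach().amin())
        self.max_val = torch.maximum(self.max_val, x.detach().amax())
        return x

    def calculate_qparams(self):
        # Affine quantization: quantized_val = round(fp_val / scale) + zero_point
        scale = (self.max_val - self.min_val) / (self.quant_max - self.quant_min)
        scale = torch.clamp(scale, min=1e-12)
        zero_point = self.quant_min - torch.round(self.min_val / scale)
        return scale, zero_point.to(torch.int64)

obs = ToyMinMaxObserver()
for _ in range(4):  # "calibration" over a few random batches
    obs(torch.randn(8, 16))
scale, zero_point = obs.calculate_qparams()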


pytorch-bot bot commented Aug 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/650

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f3fc52b with merge base 88a263a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Aug 9, 2024
@jerryzh168 force-pushed the add-observer branch 2 times, most recently from 58b208a to 3c6c4df on August 10, 2024 00:08
return tuple(block_size)
raise ValueError(f"Unsupported GranularityType: {granularity_type}")

class AffineQuantizedObserver(torch.nn.Module):
Collaborator

Is the precedent to add observers here, or within the quantization file they are used for? I also see that you had to add a new function to quant_primitives.

Contributor Author

It depends on how widely the observer will be used, I think; you can start with the quantization file and upstream later when needed.

Yeah, this one requires some changes to quant primitives as well, but that's not always needed, I think.

Collaborator

@vayuda left a comment

Looks good for AWQ workflow

r = _PartialWrapper(partial(cls_or_self, *args, **kwargs))
return r

def get_block_size(input_shape: Tuple[int, ...], granularity_type: GranularityType) -> Tuple[int, ...]:
Contributor

should we merge this with _get_per_token_block_size from torchao/quantization/utils.py?

Contributor

I feel in general PerTensor and PerAxis will be useful for the dynamic / weight-only flows as well. We can do that in a future PR.

Contributor Author

Yeah, we can do this in a future PR. I feel we can merge everything into this function and move it to torchao/quantization/utils.py.
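For illustration, here is a rough sketch of how such a merged function could cover per-tensor, per-axis, and per-token cases. The granularity marker classes below are made up for the example; the actual GranularityType classes in torchao may differ.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class PerTensor:  # illustrative granularity markers, not the torchao classes
    pass

@dataclass
class PerAxis:
    axis: int

@dataclass
class PerToken:
    pass

def get_block_size(input_shape: Tuple[int, ...], granularity) -> Tuple[int, ...]:
    # Map a granularity marker to the block_size tuple used by affine quantization.
    if isinstance(granularity, PerTensor):
        # One block covering the whole tensor -> a single scale/zero_point.
        return input_shape
    if isinstance(granularity, PerAxis):
        # Keep the chosen axis at size 1 so each slice along it gets its own qparams.
        block_size = list(input_shape)
        block_size[granularity.axis] = 1
        return tuple(block_size)
    if isinstance(granularity, PerToken):
        # One scale/zero_point per token: all leading dims are 1, last dim is full.
        return (1,) * (len(input_shape) - 1) + (input_shape[-1],)
    raise ValueError(f"Unsupported granularity: {granularity}")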

return update_stats_min_max, calculate_qparams_min_max

_update_stats_min_max, _calculate_qparams_min_max = get_min_max_funcs()
AffineQuantizedMinMaxObserver = AffineQuantizedObserver.with_args(_update_stats_min_max, _calculate_qparams_min_max)
Contributor

What's the reason behind returning these functions as Callables instead of overriding them in the base class? I feel it's cleaner to do the latter (also done in the torch.ao flows) so we don't have to pass functions around

Contributor Author

OK, sounds good. I don't have a very strong preference on this point; the initial motivation was for people to reuse the same AffineQuantizedObserver everywhere, but it sounds fine for them to inherit as well.
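For illustration, a minimal sketch of the two options being weighed here; the names are placeholders rather than the actual torchao classes.

import torch

# Option 1: behavior passed in as callables and bound via a factory, e.g.
# AffineQuantizedMinMaxObserver = AffineQuantizedObserver.with_args(
#     _update_stats_min_max, _calculate_qparams_min_max)

# Option 2: behavior supplied by overriding methods in a subclass,
# as the torch.ao observers do.
class ObserverBaseSketch(torch.nn.Module):
    def forward(self, x):
        # update running stats, then pass the input through
        raise NotImplementedError

    def calculate_qparams(self):
        # derive scale/zero_point from the tracked stats
        raise NotImplementedError

class MinMaxObserverSketch(ObserverBaseSketch):
    def __init__(self):
        super().__init__()
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def forward(self, x):
        self.min_val = torch.minimum(self.min_val, x.detach().amin())
        self.max_val = torch.maximum(self.max_val, x.detach().amax())
        return x

    def calculate_qparams(self):
        # same affine scale/zero_point math as any min/max observer
        scale = torch.clamp((self.max_val - self.min_val) / 255.0, min=1e-12)
        zero_point = -128 - torch.round(self.min_val / scale)
        return scale, zero_point.to(torch.int64)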

@cpuhrsch
Contributor

Out of curiosity (and ignorance), but why do we need to couple "Observer" to a particular dtype? Doesn't it just track statistics of a particular Tensor over time?

Or put differently, is there an example of / plans for a Bfloat16Observer or FP8Observer?

@jerryzh168
Contributor Author

jerryzh168 commented Aug 13, 2024

Out of curiosity (and ignorance), but why do we need to couple "Observer" to a particular dtype? Doesn't it just track statistics of a particular Tensor over time?

Or put differently, is there an example of / plans for a Bfloat16Observer or FP8Observer?

Yeah, it's not coupled with dtype (bfloat16, fp8, int8), but it is related to the type of quantization, I think, in this case "affine quantization with some block_size argument" (quantized_val = fp_val / scale + zero_point). The tensor stats we are tracking (e.g. min_val/max_val) will be affected by block_size: per tensor quant will track a single scalar min_val/max_val, per axis (axis_dim=0) quant will track min_val/max_val tensors with size == weight.size(0), etc.

Other types of quantization could be "non uniform" quantization, where people may want to figure out the best look-up table for a dtype; that can have a very different argument list and behavior, so we can create new observers for those. For bfloat16 we probably won't need a new observer, and for fp8 I think we also don't need one, because it is also doing affine quantization. We also plan to integrate fpx (fp2-fp8) into AffineQuantizedTensor, and it should be easy to reuse this observer there.

In theory I could make the observer unrelated to the type of quantization as well, but I feel it's easier for people to specify all configurations in one place, so a good heuristic might be: type of quantization + type of observation, e.g.

AffineQuantized{MinMax}Observer: tracks min_val/max_val
AffineQuantized{Histogram}Observer: tracks a histogram and uses that to calculate min_val/max_val
AffineQuantized{MovingAverage}Observer: tracks a moving average of min_val/max_val (can be merged into the MinMax observer)

Affine quantization arguments can indeed be decoupled from the observer, but I think we can do that later when there are more observers.
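To make the block_size point concrete, here is a toy illustration of how block_size changes the shape of the tracked stats. This is not torchao code, and for simplicity it only handles blocks that are either size 1 or span the full dimension.

import torch

def block_min_max(x, block_size):
    # Reduce over every dimension whose block spans the full dim; keep the
    # dimensions that are split into size-1 blocks (per-axis style).
    reduce_dims = [d for d, b in enumerate(block_size) if b == x.shape[d]]
    return x.amin(dim=reduce_dims), x.amax(dim=reduce_dims)

w = torch.randn(32, 64)

# per tensor quant: the block covers the whole tensor -> scalar min_val/max_val
mn, mx = block_min_max(w, (32, 64))
print(mn.shape, mx.shape)  # torch.Size([]) torch.Size([])

# per axis quant (axis_dim=0): one block per row -> min_val/max_val of shape (32,)
mn, mx = block_min_max(w, (1, 64))
print(mn.shape, mx.shape)  # torch.Size([32]) torch.Size([32])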

@jerryzh168
Contributor Author

jerryzh168 commented Aug 13, 2024

@cpuhrsch please let me know if you want me to separate the affine quantized tensor arg list and the min/max observer now. It would be something like this:

# has the min_val/max_val logic
class MinMaxObserver(...):
    ...
    def forward(...):
        self.min_val = ...
        self.max_val = ...
        ...

# has the list of args for AffineQuantizedTensor
class AffineQuantizedObserver(...):
    def __init__(...):
        ...

class AffineQuantizedMinMaxObserver(AffineQuantizedObserver, MinMaxObserver):
    def calculate_qparams(...):
        # use self.min_val, self.max_val and the args in AffineQuantizedObserver
        # to calculate qparams for AffineQuantizedTensor
        ...

@jerryzh168
Contributor Author

Will land for now; we can follow up to split the logic when there are more observers, I think.

@jerryzh168 merged commit 6199f89 into pytorch:main on Aug 14, 2024
14 checks passed
@jerryzh168 deleted the add-observer branch on August 14, 2024 19:48