Add StretchedUnifTorchaoQuantizer #2576
Conversation
    quant_min=quant_min,
    quant_max=quant_max,
)
data, scale, zero_point = _layout.post_process(
IIUC, the zero_point isn't some fixed fp value, but can vary slightly based on the ranges.

So what the CPU kernel can support is an fp32 scale times an int8 LUT value. So if we want the grid {-3.5, -1.5, 1.5, 3.5}, we instead use the LUT = {-7, -3, 3, 7} with s_table = 0.5.

In addition to having this LUT for the whole tensor, we can have an fp32 scale s at a per-group granularity. Dequantization is then:

`w_dequantized = s * s_table * LUT[idx]`

It's not clear to me that the affine scheme here is representable in that way. IIUC, you have:

`w_dequantized = s * (qval - z)`

So `qval - z` could define the LUT, but it looks like we'd have a different LUT for every group of group_size values because z changes every group_size values? Is that right?
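For concreteness, here is a minimal sketch of the per-group LUT dequantization described above; the LUT values, group size, and shapes are illustrative and not taken from the kernel.

```python
import torch

# Shared across the whole tensor: an int8 LUT plus one fp32 table scale.
lut = torch.tensor([-7, -3, 3, 7], dtype=torch.int8)
s_table = 0.5  # 0.5 * {-7, -3, 3, 7} = {-3.5, -1.5, 1.5, 3.5}
group_size = 32

idx = torch.randint(0, 4, (2, 64))          # 2-bit indices into the LUT
s = torch.rand(2, 64 // group_size) + 0.1   # per-group fp32 scales

# w_dequantized = s * s_table * LUT[idx]
w_dequantized = (
    s.repeat_interleave(group_size, dim=1) * s_table * lut[idx].to(torch.float32)
)
```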
Thanks for the comments @metascroy!

Just to clarify from our chat earlier, `zero_point=-0.5` is the same across all groups. (I flipped the sign since it's standard to add `zero_point` during quantization.)

`w_quantized = torch.round(x / s + zero_point)`
`w_dequantized = s * (w_quantized - zero_point)`

For the 2-bit case, we set `s` so that `x / s` is restricted to the range [-1.5, 1.5]. Since `zero_point=-0.5`, `w_quantized` lies in the grid {-2, -1, 0, 1}.

It seems like we don't need to use LUT format since this is well supported by an affine scheme. Maybe it would be worth supporting for latency comparisons though (and to avoid a float `zero_point`).
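As a minimal sketch of the round trip described here (the tensor values and the per-row scale choice are illustrative):

```python
import torch

b = 2
quant_max = 2 ** (b - 1) - 0.5   # 1.5
zero_point = -0.5

x = torch.randn(4, 32)
# Choose s so that x / s falls in [-1.5, 1.5] per row (illustrative absmax scale).
s = x.abs().amax(dim=-1, keepdim=True) / quant_max

w_quantized = torch.round(x / s + zero_point)   # values on the grid {-2, -1, 0, 1}
w_dequantized = s * (w_quantized - zero_point)  # values in s * {-1.5, -0.5, 0.5, 1.5}
```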
It is well supported by an affine scheme where zero_point is a float, but we do not have CPU kernel support for this.

But if `zero_point` is always -0.5, then `w_quantized - zero_point` is just some value in [-1.5, 1.5], and this could define an LUT, so I think we can hook into the kernel in that way. We just need the LUT to be integer, so we can define the LUT as [-3, -1, 1, 3] and then divide the scales in half.
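As a sanity check on that mapping, here is a small sketch (shapes and scales are illustrative) showing that the affine dequantization with a float zero_point matches the integer LUT [-3, -1, 1, 3] once the scales are halved:

```python
import torch

zero_point = -0.5
s = torch.rand(4, 1) + 0.1                                     # per-group fp32 scales
w_quantized = torch.randint(-2, 2, (4, 32)).to(torch.float32)  # grid {-2, -1, 0, 1}

# Affine scheme with a float zero_point.
affine = s * (w_quantized - zero_point)

# Integer LUT indexed by (qval + 2), with the scales divided in half.
lut = torch.tensor([-3.0, -1.0, 1.0, 3.0])
lut_based = (s / 2) * lut[(w_quantized + 2).long()]

assert torch.allclose(affine, lut_based)
```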
compare_parq_convert(model, m_ref, optimizer, config)


class TestStretchedUnifTorchaoQuantizer(common_utils.TestCase):
New test case that ensures equivalence between PARQ's original `UnifQuantizer` implementation and the new `StretchedUnifTorchaoQuantizer`.
q_abs = input_float.abs()
max_val = torch.minimum(
    b * q_abs.mean(dim=reduction_dims, keepdim=True),
    torch.amax(q_abs, dim=reduction_dims, keepdim=True),
).clamp_(min=eps)

scale = max_val / quant_max
scale = scale.to(dtype=scale_dtype, device=input_float.device)
zero_point = torch.full_like(scale, -0.5, dtype=zero_point_dtype)
Here's the logic for initializing the scale based on multiples of per-group absolute value means. I also manually set the zero point to be the same across groups.
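A standalone sketch of that initialization, assuming per-group reduction over the last dimension and using the bit width `b` as the multiple of the absolute-value mean (the setup is illustrative; variable names mirror the diff above):

```python
import torch

b = 2
quant_max = 2 ** (b - 1) - 0.5       # 1.5 for the stretched 2-bit grid
eps = torch.finfo(torch.float32).eps
reduction_dims = (-1,)               # per-group reduction (illustrative)

input_float = torch.randn(8, 32)     # one group per row, for simplicity

q_abs = input_float.abs()
max_val = torch.minimum(
    b * q_abs.mean(dim=reduction_dims, keepdim=True),
    torch.amax(q_abs, dim=reduction_dims, keepdim=True),
).clamp_(min=eps)

scale = max_val / quant_max
zero_point = torch.full_like(scale, -0.5)   # same zero point across all groups
```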
Looks good to me. We can translate the affine scheme to LUT when we prepare the data for the kernels.
* Add StretchedUnifTorchaoQuantizer
* Fix tinygemm test case
* Test equivalence to PARQ UnifQuantizer; custom choose_qparams, quantize, dequantize
* Remove dequantize_stretched_affine
This PR adds a new stretched uniform quantizer for PARQ, which empirically performs well for 2- and 3-bit QAT. Main differences:

* `quant_min=-2**(b - 1) + 0.5` and `quant_max=2**(b - 1) - 0.5` values
* `min_val`, `max_val` are computed by taking a multiple of the mean over absolute values (instead of absmax)

As in #2091, I also compare the resulting PARQ quantized weights with those quantized with torchao's module swap + `quantize_` API. To support this, I created a new tensor subclass `StretchedAffineQuantizedTensor` and config `StretchedIntxWeightOnlyConfig` to handle floating point `quant_min`, `quant_max`, and `zero_point` values.
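For reference, a small sketch (not the PR's code) of the stretched bounds and the resulting dequantization grid for `b = 2` and `b = 3`, assuming `zero_point = -0.5`:

```python
import torch

zero_point = -0.5
for b in (2, 3):
    quant_min = -(2 ** (b - 1)) + 0.5
    quant_max = 2 ** (b - 1) - 0.5
    # Integer codes run from quant_min + zero_point to quant_max + zero_point,
    # i.e. 2**b evenly spaced values.
    codes = torch.arange(quant_min + zero_point, quant_max + zero_point + 1)
    grid = (codes - zero_point).tolist()
    print(f"b={b}: quant_min={quant_min}, quant_max={quant_max}, grid={grid}")

# b=2: grid [-1.5, -0.5, 0.5, 1.5]
# b=3: grid [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
```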