Added first bits of Uint2Tensor and BitnetTensor #282
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/282
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ae4ead1 with merge base 664f073. This comment was automatically generated by Dr. CI and updates every 15 minutes.
test/dtypes/test_uint2.py
Outdated
    _apply_weight_only_uint2_quant(m)
    y_wo = m(x)
    # sqnr = compute_error(y_ref, y_wo)
    # opt = torch.compile(m, fullgraph=True, mode="max-autotune")
What's the error you were getting?
AssertionError: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.sym_storage_off.set.default(_to_functional_tensor(FakeTensor(..., size=(16, 4), dtype=torch.uint8)))
It generally failed when enabling @torch.compile on the pack and unpack functions.
@bdhirsh any ideas? I've seen this scary error before and never understood what it meant beyond AOTAutograd is not doing its thing
Opened pytorch/pytorch#127374 to track this.
torchao/dtypes/uint2.py
Outdated
        return output

else:
    # @torch.compile
did these all fail torch.compile checks?
Btw, is the idea here to trigger the else condition on CPU only?
Yes that is right. I'll check torch.compile once again.
torchao/dtypes/uint2.py
Outdated
    return (input.view(torch.uint8).to(torch.float32) - zero_point) * scale


class UInt2Tensor(torch.Tensor):
maybe we want to generalize UInt4Tensor to work with uint1 to uint7 directly, but for now maybe we can put this in prototype to unblock kernel development
Good point, will generalize it.
TODO List
Force-pushed from a4ceeb1 to fcd7c08 (Compare)
Co-authored-by: James Melvin Ebenezer <[email protected]>
I've written a testing network (a 1-layer MLP) to test full functionality. The primary failure right now is that transpose doesn't work due to padding issues, and the matmul operation is not found despite being defined in the bitnet linear fn.
Added several operations for UInt2Tensor, still needs work.
test/dtypes/test_uint2.py
Outdated
class TestUInt2(QuantizationTestCase):
    def test_gpu_quant(self):
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        for x_shape in [[2, 4], [5, 5, 5, 4], [1, 4, 4]]:
favor pytest parametrize; that way we'll know which case failed from the CI logs, if any
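For example, a minimal sketch of that suggestion (shapes taken from the quoted test; the rest of the body is illustrative):

import pytest
import torch
import torch.nn as nn

@pytest.mark.parametrize("x_shape", [[2, 4], [5, 5, 5, 4], [1, 4, 4]])
def test_gpu_quant(x_shape):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(*x_shape).to(device)
    m = nn.Sequential(nn.Linear(4, 16)).to(device)
    y_ref = m(x)
    # each shape now appears as its own test case, so CI logs show exactly
    # which one failed
    assert y_ref.shape == (*x_shape[:-1], 16)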
test/dtypes/test_uint2.py
Outdated
        for x_shape in [[2, 4], [5, 5, 5, 4], [1, 4, 4]]:
            x = torch.randn(*x_shape).to(device)
            m = nn.Sequential(nn.Linear(4, 16)).to(device)
            y_ref = m(x)
need some compile test
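For instance, a hedged sketch of such a test (assuming _apply_weight_only_uint2_quant from the quoted test file; the tolerances are illustrative):

import torch
import torch.nn as nn

def test_uint2_quant_compile():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(2, 4, device=device)
    m = nn.Sequential(nn.Linear(4, 16)).to(device)
    _apply_weight_only_uint2_quant(m)  # helper from the quoted test file
    y_eager = m(x)
    # the compiled model should match the eager quantized model
    y_compiled = torch.compile(m, fullgraph=True)(x)
    torch.testing.assert_close(y_compiled, y_eager, rtol=1e-4, atol=1e-4)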
torchao/prototype/dtypes/uint2.py
Outdated
    triton_pack_uint2[grid](uint8_data, output, n_elements, BLOCK_SIZE=1024)
    return output

else:
are the kernels compilable on CPU? Just wondering if these should ever get called today, or if they're mostly here so we can eventually revert the Triton kernels?
Yes, that's right. Will eventually replace the Triton kernels with torch.compile.
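For reference, a minimal pure-PyTorch fallback for the 2-bit pack/unpack that torch.compile could eventually fuse (a sketch; the actual bit layout in uint2.py may differ):

import torch

def pack_uint2(uint8_data: torch.Tensor) -> torch.Tensor:
    # pack four 2-bit values into each output byte, high bits first
    d = uint8_data.contiguous().view(-1, 4)
    return ((d[:, 0] << 6) | (d[:, 1] << 4) | (d[:, 2] << 2) | d[:, 3]).to(torch.uint8)

def unpack_uint2(packed: torch.Tensor) -> torch.Tensor:
    # invert the packing: each byte yields four 2-bit values
    p = packed.contiguous().view(-1, 1)
    return torch.cat([(p >> 6) & 0x3, (p >> 4) & 0x3, (p >> 2) & 0x3, p & 0x3], dim=-1).view(-1)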
    return torch.equal(self.elem, other.elem)

@classmethod
def __torch_dispatch__(cls, func, types, args, kwargs=None):
Not feedback to you guys but I kind of hate how this looks - @cpuhrsch @jerryzh168 is there a more readable design pattern we could use? Something like an abstract class perhaps? Or a Dispatcher class?
Yes, I like the implements pattern from NF4 and AQT.
ao/torchao/dtypes/nf4tensor.py
Lines 46 to 54 in 12f44ab
def implements(aten_ops):
    """Use this decorator to implement a function for an aten op in __torch_dispatch__"""
    def decorator(func):
        for op in aten_ops:
            NF4_OPS_TABLE[op] = func
        return func
    return decorator
Then you can do stuff like
@implements([torch.ops.aten.to.dtype])
def to_dtype(func, *args, **kwargs):
    if not args[0][0].is_contiguous():
        assert args[0][0].t().is_contiguous()
        return torch.ops.aten.to.dtype(args[0][0].t(), args[0][1]).t()
    return args[0][0].get_original_weight().to(args[0][1])
People can then also use this wrapper out of tree, like in our tutorial (Line 24 in 12f44ab):
@torchao.dtypes.nf4tensor.implements([torch.ops.aten.gelu.default])
__torch_dispatch__ then becomes fairly simple:
ao/torchao/dtypes/nf4tensor.py
Lines 757 to 782 in 12f44ab
@classmethod
def __torch_dispatch__(cls, func, types, args, kwargs=None):
    """TODO we are not supporting torch dispatch at the moment
    instead we have created a Autograd.Function to handle the linear
    """
    # All ops in the NF4_OPS_TABLE expect NF4 Tensors as inputs
    # And don't support mixed tensor subclasses. This will trigger the handler for
    # the next type in the dispatch list
    def allowed_subclasses(type):
        return (
            issubclass(cls, type)
            or issubclass(torch._subclasses.fake_tensor.FakeTensor, type)
            or issubclass(
                torch._subclasses.functional_tensor.FunctionalTensor, type
            )
        )

    if not all(allowed_subclasses(t) for t in types):
        return NotImplemented("Up to the next one to handle")
    if func in NF4_OPS_TABLE:
        return NF4_OPS_TABLE[func](func, args, kwargs)
    raise NotImplementedError(
        f"NF4Tensor dispatch: attempting to run {func}, this is not supported"
    )
Please note that raising an exception at the end is very important. Otherwise you'll just return None, which is valid Python, even though you might have meant to say "this isn't implemented". Also, it's probably preferred (note: trust but verify this advice!) to use return NotImplemented instead of raising an exception within __torch_dispatch__, because it'll get caught and handled by the PyTorch dispatcher.
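Applied to this PR, that advice might look like the following sketch (UINT2_OPS_TABLE is a hypothetical name mirroring NF4_OPS_TABLE):

@classmethod
def __torch_dispatch__(cls, func, types, args, kwargs=None):
    if func in UINT2_OPS_TABLE:  # hypothetical table, filled by @implements
        return UINT2_OPS_TABLE[func](func, args, kwargs or {})
    # return the NotImplemented singleton (don't call it), so the PyTorch
    # dispatcher can fall through to the next handler instead of getting None
    return NotImplemented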
@cpuhrsch incorporated your feedback in 1fdeb91. @msaroufim, have a look if it's OK.
torchao/prototype/dtypes/uintgen.py
Outdated
import torch

"""
Contains generic functions to pack and unpack uint8 tensors into uint2, uint3, uint4, uint5, uint6, and uint7 tensors.
do you mean uintx into uint8?
Yes! My bad.
fixed
    return packed_data


def unpack_uint6(packed_data: torch.Tensor) -> torch.Tensor:
btw, how does this compare to @vayuda's work?
There is a bit of rework. After discussion on cuda-mode between @andreaskoepf and @vayuda, some of the odd-bit functionality will be merged into @vayuda's algorithmic bitpacking.py. However, we will use this implementation to test correctness and speed of both approaches.
def pack_uint6(uint8_data: torch.Tensor) -> torch.Tensor:
    """Pack the 6 lowest bits of 4 input bytes into 3 bytes
very cool docstrings overall!
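To make that docstring concrete, here is one possible 6-bit layout (a sketch; uintgen.py may arrange the bits differently):

import torch

def pack_uint6(uint8_data: torch.Tensor) -> torch.Tensor:
    # four 6-bit values (24 bits) -> three bytes
    u = uint8_data.contiguous().view(-1, 4)
    a, b, c, d = u[:, 0], u[:, 1], u[:, 2], u[:, 3]
    return torch.stack([
        (a << 2) | (b >> 4),          # a[5:0] b[5:4]
        ((b & 0xF) << 4) | (c >> 2),  # b[3:0] c[5:2]
        ((c & 0x3) << 6) | d,         # c[1:0] d[5:0]
    ], dim=-1).view(-1)

def unpack_uint6(packed_data: torch.Tensor) -> torch.Tensor:
    # recover the four 6-bit values from each group of three bytes
    p = packed_data.contiguous().view(-1, 3)
    b0, b1, b2 = p[:, 0], p[:, 1], p[:, 2]
    return torch.stack([
        b0 >> 2,
        ((b0 & 0x3) << 4) | (b1 >> 4),
        ((b1 & 0xF) << 2) | (b2 >> 6),
        b2 & 0x3F,
    ], dim=-1).view(-1)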
torchao/prototype/dtypes/uintgen.py
Outdated
    assert torch.all(k == check)


if __name__ == "__main__":
can we move these to test/?
fixed
torchao/prototype/dtypes/bitnet.py
Outdated
    # Quantize the input tensor to int2
    quant = x.sign() + 1

    if target_dtype == torch.uint2:
I don't think you need torch.uint2; it doesn't do anything. You can remove the 2.3 skip test as well in the test file.
fixed
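For context, a minimal sketch of the quantization step in the quoted snippet (BitNet-style ternary weights: sign() yields {-1, 0, 1}, and +1 shifts that to {0, 1, 2}, which fits in 2 bits):

import torch

def quantize_ternary(x: torch.Tensor) -> torch.Tensor:
    # {-1, 0, 1} -> {0, 1, 2}, representable as a 2-bit unsigned value
    return (x.sign() + 1).to(torch.uint8)

def dequantize_ternary(q: torch.Tensor) -> torch.Tensor:
    # shift back to {-1, 0, 1}
    return q.to(torch.float32) - 1.0

x = torch.randn(4, 4)
assert torch.equal(dequantize_ternary(quantize_ternary(x)), x.sign())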
torchao/prototype/dtypes/bitnet.py
Outdated
        return BitnetTensor(tensor)
    raise NotImplementedError(f"to {dtype} not supported")

if __name__ == "__main__":
I'd rather have all of this functionality be in the test file. As in, make sure you can instantiate a BitNet tensor, copy it, transpose it, multiply it, and convert to/from it, from the main test.
added test cases for BitnetTensor, UIntTensor
torchao/prototype/dtypes/uint2.py
Outdated
@implements([torch.ops.aten.detach.default])
def detach(func, args, kwargs):
    (tensor,) = args
    return tensor.elem.detach()
sidecar comment:
- most of your view op implementations above wrap the output back into a Uint2Tensor ("propagating" the subclass-ness through the model when views are encountered)
- detach is just another view op
- so you probably want your impl for detach to wrap the output in your subclass?
Yes, that is correct. Fixed it.
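A minimal sketch of that fix (assuming the subclass keeps its packed payload in .elem, as the quoted snippet suggests):

@implements([torch.ops.aten.detach.default])
def detach(func, args, kwargs):
    (tensor,) = args
    # wrap the detached payload back into the subclass so detach, like the
    # other view ops, keeps propagating UInt2Tensor through the model
    return UInt2Tensor(tensor.elem.detach())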
very cool :)
nice, a segfault lol - although this is not from your code. @jerryzh168, have you seen this before? https://github.com/pytorch/ao/actions/runs/9557079384/job/26344629090?pr=282#step:12:2377
* Added first bits of Uint2Tensor and BitnetTensor
* add conversion to standard signed and unsigned dtypes
* added triton kernel for pack and unpack
* fix: test cases and device allocation for triton kernels
* fix: moved uint2 to prototype folder
* Add packing and unpacking functions for uint{2,3,4,5,6,7}.
* housekeeping: renamed uint_small to uintgen and simple comments
* Update uint2.py: Added several operations for UInt2Tensor, still needs work.
* added pytest, compile tests and some cleanup
* fix: implements pattern for uint2 and BitnetTensor
* fix: torch.uint2 available after torch 2.3
* fix: test cases for BitnetTensor, UInt2Tensor and bitpacking gen
* fix: removed torch.uint2
* fix: wrap detach in UIntTensor, torch.compile test
* fix: CI errors on compile tests
* fix: skip tests less than torch 2.4
* Added pytest fixture
* remove tensor core flag

Co-authored-by: James Melvin Ebenezer <[email protected]>
Co-authored-by: Z <[email protected]>
Co-authored-by: Pawan Jayakumar <[email protected]>
Co-authored-by: Mark Saroufim <[email protected]>
Just saw this comment... I hit the same error when porting uintx to PyTorch (#635), but Charlie says that the autoquant subclass is weird, so I disabled it.
Created a UInt2Tensor class (similar to the UInt4Tensor class). Added a BitnetTensor class and a first unit test which quantizes the weights of an nn.Linear() layer and executes the matmul. Currently generates an error if the commented @torch.compile lines above the unpack_uint2() and pack_uint2() functions are uncommented.