Conversation

@masahi (Member) commented Sep 26, 2022

This PR adds TE compute and schedule definitions for int8 conv2d and dense using vrmpy tensorization, and Relay alter-layout / legalize passes to enable them in e2e settings. Since vrmpy is very similar to the x86 VNNI and Arm sdot/udot instructions, much of the code is shared with the existing x86 / Arm backend implementations.
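
For context, vrmpy reduces four byte-wide products into each 32-bit accumulator lane, much like VNNI's vpdpbusd. A scalar numpy sketch of the per-lane semantics follows; the helper name and the u8 x i8 operand mix are illustrative assumptions, not the actual intrinsic signature:

```python
import numpy as np

def vrmpy_lane(acc, x4, w4):
    # One 32-bit lane: acc += dot product of four u8 activations with
    # four i8 weights, widened to int32 before multiplying.
    return acc + int(np.dot(x4.astype(np.int32), w4.astype(np.int32)))

x = np.array([1, 2, 3, 4], dtype=np.uint8)
w = np.array([-1, 2, -3, 4], dtype=np.int8)
print(vrmpy_lane(0, x, w))  # 1*(-1) + 2*2 + 3*(-3) + 4*4 = 10
```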

This lets us run int8 resnet50 in 146 msec on SD888. All convolutions and the final dense op are tensorized. The current bottleneck is requantize-related operations. The test script and model files to run int8 resnet50 are attached below.

test_qresnet50.zip

@kparzysz-quic @tkonolige @nverke @ibsidorenko

@masahi masahi force-pushed the hex-conv2d-dense-vrmpy branch from 5cc80ab to e972136 Compare September 27, 2022 03:14
@masahi masahi force-pushed the hex-conv2d-dense-vrmpy branch from e972136 to 17bde45 Compare September 27, 2022 03:38
@masahi masahi marked this pull request as ready for review September 27, 2022 05:35
Unlike the nn.dense case (see dense_alter_op.py), we do not convert (uint8, int8) to
(uint8, uint8). That would introduce another convolution by a constant (128 or 1) filter
to compensate for the dtype legalization. In the nn.dense case, such a compensation factor is
just a sum over the K axis.
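
A minimal numpy sketch of the dense-case compensation described above, using hypothetical small shapes (this is an illustration of the arithmetic, not the actual legalization pass):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 4, 8, 16
X = rng.integers(0, 256, size=(M, K), dtype=np.uint8).astype(np.int32)
W = rng.integers(-128, 128, size=(N, K), dtype=np.int8).astype(np.int32)

# Legalize i8 weights to u8 by adding 128 ...
W_u8 = W + 128
# ... and compensate with 128 * (row sums of X over the K axis).
Y = X @ W.T
Y_legalized = X @ W_u8.T - 128 * X.sum(axis=1, keepdims=True)
assert np.array_equal(Y, Y_legalized)
```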
@masahi (Member, Author):

cc @ibsidorenko @tkonolige @nverke on this. We can convert a u8 * s8 convolution to u8 * u8 as below:

W'_u8 = W_s8 + 128
X_u8 * W_s8 = X_u8 * (W'_u8 - 128)
            = X_u8 * W'_u8 - X_u8 * 128

Here, X_u8 * 128 is a convolution of X_u8 by a constant filter. We can factor out 128 to end up with a filter where all elements are 1. So what we need is a windowed sum, or "sum pooling" op; without it, I think we need to do a full-blown convolution. This is why I don't use legalization for conv2d. Let me know if you have a better idea.
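
The identity above can be checked numerically. A 1-D numpy sketch, where the compensation term is exactly a windowed sum of X scaled by 128 (the `correlate` helper is a hypothetical stand-in for the conv op):

```python
import numpy as np

def correlate(x, w):
    # "Valid" correlation (no filter flip), as in NN convolutions.
    return np.array([np.dot(x[i:i + len(w)], w)
                     for i in range(len(x) - len(w) + 1)])

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=32, dtype=np.uint8).astype(np.int32)
W = rng.integers(-128, 128, size=5, dtype=np.int8).astype(np.int32)

W_u8 = W + 128
ones = np.ones_like(W)
# Compensation is 128 * (windowed sum of X), i.e. a "sum pooling" op.
lhs = correlate(X, W)
rhs = correlate(X, W_u8) - 128 * correlate(X, ones)
assert np.array_equal(lhs, rhs)
```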

outer, inner = s[x].split(fused, factor=128 // np.dtype(x.dtype).itemsize)
s[x].vectorize(inner)
s[x].parallel(outer)
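
The split factor `128 // np.dtype(x.dtype).itemsize` is the number of lanes that fit in a 128-byte HVX vector register, so `inner` vectorizes to exactly one full vector regardless of dtype. A quick check of the lane counts:

```python
import numpy as np

VLEN_BYTES = 128  # HVX vector register width on Hexagon, in bytes
for dtype in ("uint8", "int8", "int32", "float32"):
    lanes = VLEN_BYTES // np.dtype(dtype).itemsize
    print(f"{dtype}: {lanes} lanes per vector")
```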
@masahi (Member, Author):

cc @kparzysz-quic @nverke, we are enabling multithreading of elemwise ops here. Multithreading on e2e models has been stable since #12807.

@masahi masahi force-pushed the hex-conv2d-dense-vrmpy branch from 14e83d5 to 4ad3e63 Compare September 27, 2022 06:05
@ibsidorenko (Contributor):

LGTM!

@tmoreau89 tmoreau89 merged commit f3d3ece into apache:main Oct 3, 2022
@tmoreau89 (Contributor):

Thanks @masahi, @ibsidorenko, @kparzysz-quic. The PR has been merged.

xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
…che#12911)

* [Hexagon] Support vrmpy tensorization for conv2d and dense schedules

* update

* clean up

* migrate tests to test_launcher.py

* remove vrmpy test files

* use generic int8 conv2d schedule

* clean up

* doc update

* pylint fix

* parametrize dtype in test

* doc update

* add missing parallelization for dense

* more pylint

* fixed for fp32 dense