[microTVM] Modernize Arm Cortex-M convolution schedules #13242
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
This pull request is ready for review! Would love reviews from @mkatanbaf (who's doing some microTVM + MetaSchedule work), @areusch, and @ekalda. Would also love a look from someone who's more familiar with TVMScript and can critique my use of it :). That said, there are a few known issues in this PR I still need to fix, and a few more items I'll address in a following PR.
areusch left a comment
did a first pass here, thanks @guberti !
```python
"""Addition is commutative, so we could add the bias before, during, or after performing our
multiply-accumulate operations. It "costs" one cycle either way - if done at the beginning we
can't use a SMULXY trick to set sum_i to zero for "free", and if done at the end it doesn't
combine with anything. However, doing it at the beginning frees up a register/prevents needing
```
what about overflow?
The order of bias addition does not change the overflow behavior. This comment is just stating that the bias can be added either before the multiply-accumulate loop (as the accumulator's initial value) or after it. I've changed the wording a bit to make this clearer.
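To illustrate the point (a hypothetical sketch, not the PR's generated code): the two orderings below are equivalent - the bias can seed the accumulator, or be added after the loop.

```python
def mac_bias_first(data, kernel, bias):
    """Initialize the accumulator with the bias, then multiply-accumulate."""
    acc = bias
    for d, k in zip(data, kernel):
        acc += d * k
    return acc


def mac_bias_last(data, kernel, bias):
    """Multiply-accumulate starting from zero, then add the bias at the end."""
    acc = 0
    for d, k in zip(data, kernel):
        acc += d * k
    return acc + bias
```

Both return the same sum for any inputs, since integer addition is commutative and associative.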
src/relay/qnn/op/requantize.cc
```cpp
// Check and assign types for scale and zero points.
AssignType(types[1], DataType::Float(32), axis_shape, reporter);  // input_scale
AssignType(types[2], DataType::Int(32), axis_shape, reporter);    // input_zero_pt
// AssignType(types[1], DataType::Float(32), axis_shape, reporter);  // input_scale
```
uncomment?
Fixed - this PR should not change requantize.cc.
However, it is a bit of a tricky issue. In qnn_alter_op.py, I want to manually choose the int32 requantize scale to improve performance. However, Relay's requantize op only allows the output scale to be a float32.
I get around this by storing the scale data as a float32 array with the correct bytes, and reading it back as an int32 array. I've added a comment to qnn_alter_op.py to better explain what happens here. This is pretty gross.
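For clarity, the byte-reinterpretation trick can be sketched like this (hypothetical helper names - the PR does the equivalent with float32/int32 arrays in qnn_alter_op.py):

```python
import struct


def int32_bits_as_float32(scale: int) -> float:
    """Pack an int32 requantize scale and reinterpret the same four bytes as a float32."""
    return struct.unpack("<f", struct.pack("<i", scale))[0]


def float32_bits_as_int32(value: float) -> int:
    """Recover the original int32 from the float32 carrying its bytes."""
    return struct.unpack("<i", struct.pack("<f", value))[0]
```

The round trip is lossless as long as no arithmetic is performed on the float - only its bytes are stored and read back.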
Longer term, I'd love to add a new Relay op IntegerRequantize that takes int32 scale and shift arguments, which will let us solve this problem in a nice way. Would love your thoughts on the right way to address this!
tests/python/relay/strategy/arm_cpu/test_quantized_convolution.py
mkatanbaf left a comment
Great work @guberti I added a few comments, mostly asking for clarifications.
```python
def _apply_simd_optimizations(instruction_tuples) -> Iterator[Tuple]:
    """When possible, fuses single MACs into SIMD MAC instructions.

    The compiler cannot do this automatically, as calling __builtin_arm_smlaxy forces the SMLAxy
```
I'm not sure if I understand this correctly, but does this mean that we will unroll the loop and get a long list of instructions instead? Would this significantly increase the code size?
Yes, the inner reduction loops will always be unrolled (this occurs in `_get_draft_macs`). We will often unroll even more than this, either with another unrolled copy of the inner loops for odd-numbered channels (this happens e.g. for 3x3 depthwise convolutions) or by computing multiple sums at the same time (i.e. when `num_sums > 1`).

Compared with the naive approach, this does increase code size. However, the increase is very small - for example, unrolling a 3x3 depthwise convolution might take ~10 extra instructions, or 0.01 KB more flash. This is well worth it, as unrolling dramatically reduces overhead and increases speed by ~2x. The previous tensordot implementation also unrolled these loops for the same reason.
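For context on the fusion itself: SMLAD performs two signed 16x16-bit multiplies and accumulates both products into a 32-bit register in a single instruction. A small Python model of its semantics (an illustration for this discussion, not code from the PR):

```python
def smlad(x: int, y: int, acc: int) -> int:
    """Model of Arm's SMLAD instruction: multiply the low halfwords and the high
    halfwords of two 32-bit registers as signed 16-bit values, and add both
    products to the accumulator."""
    def halfword(word: int, hi: bool) -> int:
        h = (word >> 16 if hi else word) & 0xFFFF
        return h - 0x10000 if h >= 0x8000 else h  # sign-extend to a Python int
    return (acc
            + halfword(x, False) * halfword(y, False)
            + halfword(x, True) * halfword(y, True))
```

Fusing two adjacent single MACs into one such instruction is what halves the instruction count in the unrolled loop body.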
```python
# Arm GCC does not have `__builtin_arm_smlabt`, even though `__builtin_arm_smlatt`,
# `__builtin_arm_smlatb`, `__builtin_arm_smlad` and so on all exist. Perhaps this is a
# choice, since we can just use `smlabt` with the argument order swapped instead? Note that
# `__builtin_arm_smlabt` exists on most compilers (e.g. Clang) - this is just a GCC thing.
if instruction == "smlabt":
    yield f"sum_{index} = __builtin_arm_smlatb({op2}, {op1}, sum_{index});"
else:
    yield f"sum_{index} = __builtin_arm_{instruction}({op1}, {op2}, sum_{index});"
```
I believe this is because you're using the builtins directly rather than the ACLE interface (https://arm-software.github.io/acle/main/acle.html#accumulating-multiplications). I'm unsure how much guarantee you get with builtins, so I would move to the ACLE interface anyway.
Also see: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm_acle.h#L661-L675 😸
```python
(
    f"""
    #include <stdint.h>
    #include <arm_nnsupportfunctions.h>
```
Yay! I think this solves the same problem as #13363 😸 !
```python
# under the License.
"""microTVM cares a lot about the convolution + bias + requantize + fused ReLU use case. There have
been some accuracy issues in the past, so this test steps through a model (MobileNetV1) layer by
layer and ensures there is 1-1 correspondence at each step. This test would run way faster if we ran
```
This is very cool, great idea!
areusch left a comment
thanks @guberti, did a more fine-grained pass now.
```python
scale = T.match_buffer(scale_handle, scale_shape)
output = T.match_buffer(output_handle, output_shape, dtype="int16")

# This hack prevents TVM from seeing these variables as "unused". I should be using T.reads
```
can you file a bug for this?
I'm not sure if this is user error on my part, or an issue with TVM. I'll look around a bit and file an issue if it seems to be a bug.
Hi - apologies for bringing up an old PR thread. I just ran into a similar problem; was an issue filed in the end? If so, could you possibly point me to it?
@lhutton1 A bug still needs to be filed here - I meant to write up a small reproducible example, but never got around to it.
Thanks, I'll take a look into it :)
areusch left a comment
Thanks @guberti, this is basically ready - I've highlighted a couple last areas (in particular the doctest). Feel free to merge once you've addressed them!
```python
including regular conv2d, depthwise conv2d, and grouped conv2d provided the data and kernel layouts
are the optimal ones. When groups=1, the optimal data layout is NHWC and kernel layout is OHWI. When
this is a depthwise convolution, the optimal data layout is NCHW and kernel layout is OIHW."""
"""Generates optimized code to compute a tensor dot product on ARMv7E-M.
```
does this apply to v8-M also?
Sometimes - this uses the DSP instructions, which are required in v7E-M but optional in v8-M. This code also does not use MVE, which is optional in v8-M but would be really useful for deep learning. I've clarified this in the docstring.
I've addressed the comments from @areusch, so per his instructions I'm merging this. Thanks for the feedback!
For a long time, I've been unhappy with TVM's TE-based convolution schedules for Arm Cortex-M. They were a lot slower than the state-of-the-art, and had a lot of strange inefficiencies caused by limitations of TE.
This pull request rewrites regular and depthwise convolution schedules on Arm Cortex-M, using MetaSchedule and TIR to make them much faster. It took some work and ended up being a big PR (as many of these changes depend on the others), but I'm really happy with the result.
High level changes
- A new `qnn` operator strategy for Arm Cortex-M. With this change, we are able to skip the QNN lowering pass, letting us use Cortex-M specific implementations of `qnn_conv2d`, `add`, and `requantize` that perform much better.
- New `alter_op_layout` functions for `add` and `requantize`. These reduce the amount of memory loaded during each requantization by over 5x with some snazzy tricks (pre-multiplying the kernel values with the input zero point, skipping the "shift" step in our floating point multiplication approximation, and fusing the bias with the pre-multiplied zero point).
- A new test that runs the `vww` model using TFLite and ensures our implementation (with all the optimizations above) produces the same outputs. This is done layer by layer, so if there is ever an accuracy issue, we will know exactly which layer is causing the problem.

TFLite-ground-truth Corstone300 Test
For a while, microTVM has had Corstone300 tests which compare our schedules for regular `nn` ops to implementations elsewhere in TVM, to make sure the schedules are written correctly. Despite this, we've had some accuracy issues (see #13364) when running models end-to-end, and we don't really have tools to debug these.

The way I see it, the existing tests have two key limitations. They:

- compare against other TVM implementations rather than a ground-truth reference, and
- test only the bare operator (like `nn.conv2d`), while leaving out the bias and re-quantize operations (which are normally fused).

To fix this, I've added `test_quantized_convolution.py` in this PR. This test runs the convolution layers of the `vww` model from TinyML perf using TensorFlow's TFLite Interpreter, while saving all the intermediate layer outputs. Then, one by one, each layer is loaded with TVM and Corstone300, and the full operator (with fused convolution, bias, ReLU, and requantization) is run and compared to TFLite's result.
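The per-layer comparison at the heart of the test boils down to something like this sketch (the function name and list-of-arrays representation are my own; the real test drives TFLite and Corstone300 to produce the two sequences of outputs):

```python
import numpy as np


def assert_layers_match(tflite_outputs, tvm_outputs):
    """Compare intermediate outputs layer by layer, so a failure names the
    first layer whose output diverges from the TFLite ground truth."""
    for index, (expected, actual) in enumerate(zip(tflite_outputs, tvm_outputs)):
        if not np.array_equal(expected, actual):
            raise AssertionError(f"First mismatch at layer {index}")
```

Checking for exact equality (rather than a tolerance) is what makes this a 1-1 correspondence test for quantized integer outputs.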
Quantized operators and fusion
TFLite Micro, CMSIS-NN, and (AFAIK) all other microcontroller AI platforms write code for "fused operators" - e.g. a convolution combined with a bias addition, ReLU activation, and requantization. This is good for a few reasons - it prevents us from having to store "intermediate results", it lets us combine steps from different operators, and it makes parts of the code easier to write.
This wasn't possible with TVM until recently, thanks to #12398 which enabled it for Hexagon. I've done the same thing here for Arm. I've also added strategy functions for 2D quantized convolutions on Arm, though (a) only some cases are supported and (b) the `qnn.Legalize` pass must be disabled for these to be used.

TVMScript convolution schedules
For a while, TE has had a known limitation that makes it impossible to fuse certain operators when they follow `reduce` operations. This meant microTVM had to write each convolution's output to an intermediate buffer before the following operators could run, rather than fusing everything into one loop.

I previously looked into this limitation, and with the help of Eric L. and others realized it would be really annoying to fix. Instead, our schedule has been replaced with a `T.prim_func`, which lets us do this fusion (and gives us much more fine-grained control in general). I hit a few bugs doing this (e.g. #13330), and the limited docs for TVMScript meant I had to make some guesses about the right way to do things. It's totally possible this code is gross - I'll describe these issues more in a comment below. However, the generated code looks much nicer.
New optimized C intrinsic for convolutions
A few weeks ago, I wrote a faster version of microTVM's tensordot kernel. That got folded into this PR, as that schedule was not usable on its own. I've added a unit test `test_topi_conv2d_tensordot_opts` that goes into more detail about what the schedule does and why it is fast, but here's just a taste.
Our previous microTVM-specific schedule for regular `conv2d` was not very good, and was slower than just autotuning a generic implementation (for this reason, OctoML used a generic autotuned schedule to submit microTVM results to MLPerf Tiny). However, there are major limitations to how far an autotuning + C code generation approach can go, as GCC only uses the fast intrinsic functions in super narrow cases.

For example, the inner loop microTVM previously generated for a 1x1 4-channel convolution compiles (with Arm GCC 12.2 and flags `-mcpu=cortex-m4 -O3`) into instructions taking 29 cycles per output. That's not good, and the previous microTVM schedule was even worse. The new implementation in `tensordot.py` instead compiles into just 15 cycles per output (though there is still work to be done to get this even lower).

This is a very simple case, but we also have good support and tests for complex cases. We can work on data where the start pointers aren't word aligned, work on data where one or more of the data, kernel, or output has a width not divisible by the SIMD width, and have multiple sums running concurrently to reduce the number of memory loads (e.g. for 3x3 depthwise convolutions). The unit test checks all these capabilities, and the `tensordot.py` file itself has comments explaining why doing it this way is faster.

Faster re-quantization algorithm!
The way microTVM handled convolutions before was terrible. Looking at the actual requantization code from our MLPerf Tiny submission (which I've modified slightly for readability), a bunch of things about it aren't ideal:

- The scale values are `int64`s, and they are all padded with unnecessary zeros. This means we need to load eight words from memory for each re-quantization operation.
- It uses `int64` ops, which are slow because Arm Cortex-M is a 32-bit platform.
- The `int8` bounds checking is done with a wacky set of ternary operators. I checked - these do not get compiled down nicely.

I've fixed all these things using QNN `alter_op_layout` functions, and I've implemented a few more complex optimizations:

- Replacing `bias` with `bias + sum(kernel) * input_zero_point` (i.e. pre-multiplying the kernel values by the input zero point). This prevents us from having to subtract out the bias every time we do a multiplication by a kernel value (note that the input zero point is `-128` basically every time, because Cortex-M does not have a `uint x int` instruction). The result is stored in an `int32` value.
- Taking only the top 32 bits from our `int32 x int32` multiplication. This lets us use zero `int64` memory loads or instructions, without sacrificing accuracy.

Together, these optimizations reduce our requantization code to a short sequence of 32-bit operations.
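To make the two optimizations concrete, here is a minimal Python model (the function names, the `2**32` fixed-point convention, and the `real = scale * (q - zero_point)` quantization convention are my assumptions for illustration - this is not the PR's generated C):

```python
def fold_zero_point(bias: int, kernel: list, input_zero_point: int = -128) -> int:
    """Fold the input zero point into the bias ahead of time, using the identity
    sum((d - zp) * k) == sum(d * k) - zp * sum(k). After folding, the inner loop
    can multiply-accumulate raw int8 data directly."""
    return bias - input_zero_point * sum(kernel)


def requantize(acc: int, scale: int, output_zero_point: int) -> int:
    """Scale an int32 accumulator by a fixed-point multiplier, where `scale`
    encodes a real-valued ratio as round(ratio * 2**32). Keeping only the top
    32 bits of the 64-bit product (what Arm's SMMUL instruction returns)
    divides by 2**32 again, so no int64 loads or arithmetic are needed.
    Finally, offset by the output zero point and clamp to int8 range."""
    scaled = (acc * scale) >> 32  # top word of the widened product
    return max(-128, min(127, scaled + output_zero_point))
```

For example, with `scale = 2**30` (a ratio of 0.25), an accumulator of 400 requantizes to 100 before the zero-point offset and clamp.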
All in all, requantization now takes ~8x fewer cycles per output than it did before.