
Conversation

@guberti
Member

@guberti guberti commented Oct 31, 2022

For a long time, I've been unhappy with TVM's TE-based convolution schedules for Arm Cortex-M. They were much slower than the state of the art, and had strange inefficiencies caused by limitations of TE.

This pull request rewrites regular and depthwise convolution schedules on Arm Cortex-M, using MetaSchedule and TIR to make them much faster. It took some work and ended up being a big PR (as many of these changes depend on the others), but I'm really happy with the result.

High level changes

  • Adds a qnn operator strategy to TVM for Arm Cortex-M. With this change, we are able to skip the QNN lowering pass, letting us use Cortex-M specific implementations of qnn_conv2d, add, and requantize that perform much better.
  • Performs operator fusion on the (very common) convolution + bias + requantize block for Cortex-M. This lets us write one schedule that performs all three of these steps, which improves performance by letting us compute multiple operator outputs concurrently (which reduces the number of times the data and weights are loaded from memory).
  • Adds a TIR/TVMScript schedule for the convolution + bias + requantize block described above. This fixes a longstanding limitation of TE which required us to write convolution outputs to an intermediate buffer and took a lot of time.
  • Uses a new, highly optimized C/assembly extern function to actually perform the convolution. I spent a lot of time inspecting the assembly it compiles into and verifying that it is fast.
  • Overhauls the way requantization is done on Cortex-M by adding new alter_op_layout functions for add and requantize. This reduces the amount of memory loaded during each requantization by over 5x with some snazzy tricks (pre-multiplying the kernel values with the input zero point, skipping the "shift" step in our floating point multiplication approximation, fusing the bias with the pre-multiplied zero point).
  • Adds a new, end-to-end Corstone300 test that runs the MLPerf Tiny vww model using TFLite and ensures our implementation (with all the optimizations above) produces the same outputs. This is done by layer, so if there is ever an accuracy issue, we will know exactly which layer is causing the problem.

TFLite-ground-truth Corstone300 Test

For a while, microTVM has had Corstone300 tests which compare our schedules for regular nn ops to implementations elsewhere in TVM, to make sure the schedules are written correctly. Despite this, we've had some accuracy issues (see #13364) when running models end-to-end, and we don't really have tools to debug these.

The way I see it, the existing tests have two key limitations. They:

  1. Compare our results only to other TVM implementations.
  2. Run only the base operators (e.g. just nn.conv2d), leaving out the bias and re-quantize operations (which are normally fused).

To fix this, I've added test_quantized_convolution.py in this PR. This test runs the convolution layers of the vww model from MLPerf Tiny using TensorFlow's TFLite Interpreter, while saving all the intermediate layer outputs.

Then, one by one each layer is loaded with TVM and Corstone300, and the full operator (with fused convolution, bias, ReLU, and requantization) is run and compared to TFLite's result.

Quantized operators and fusion

TFLite Micro, CMSIS-NN, and (AFAIK) all other microcontroller AI platforms write code for "fused operators" - e.g. a convolution combined with a bias addition, ReLU activation, and requantization. This is good for a few reasons - it prevents us from having to store "intermediate results", it lets us combine steps from different operators, and it makes parts of the code easier to write.
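To make the benefit concrete, here's a minimal sketch (plain C, with illustrative names and a simple shift-based requantize standing in for the real fixed-point math - this is not TVM's generated code) of what a fused kernel computes. The convolution result never touches an intermediate buffer before the bias, ReLU, and requantize steps:

```c
#include <assert.h>
#include <stdint.h>

// Illustrative fused 1x1 convolution + bias + ReLU + requantize producing a
// single output value over `channels` input channels. The requantize here is
// a plain right shift for simplicity; real schedules use fixed-point scales.
static int8_t fused_conv_bias_relu_requantize(const int8_t *data,
                                              const int16_t *kernel,
                                              int32_t bias, int32_t channels,
                                              int32_t shift) {
  int32_t sum = bias;  // accumulate in a register; no intermediate buffer
  for (int32_t c = 0; c < channels; ++c) {
    sum += (int32_t)data[c] * (int32_t)kernel[c];
  }
  if (sum < 0) sum = 0;            // fused ReLU
  int32_t requant = sum >> shift;  // stand-in for the real requantize step
  if (requant > 127) requant = 127;  // saturate to int8 range
  if (requant < -128) requant = -128;
  return (int8_t)requant;
}
```

The point of the sketch is structural: because all four steps happen while the accumulator is still live in a register, nothing is ever stored to and re-loaded from memory between operators.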

This wasn't possible with TVM until recently, thanks to #12398 which enabled it for Hexagon. I've done the same thing here for Arm. I've also added strategy functions for 2D quantized convolutions on Arm, though (a) only some cases are supported and (b) the qnn.Legalize pass must be disabled for these to be used.

TVMScript convolution schedules

For a while, TE has had a known limitation that makes it impossible to fuse certain operators when they follow reduce operations. This meant microTVM would generate code like the following:

for (int32_t k_outer = 0; k_outer < 2; ++k_outer) {
  for (int32_t i = 0; i < 48; ++i) {
    for (int32_t j = 0; j < 48; ++j) {
      int32_t cse_var_4 = (j * 8);
      int32_t cse_var_3 = (k_outer * 4);
      // Writes data to a buffer in memory
      convolution_helper_function(
        (&(((int32_t*)depthwise_conv2d)[(((i * 384) + cse_var_4) + cse_var_3)])), 
        (&(((int8_t*)padded_data)[(((i * 400) + cse_var_4) + cse_var_3)])), 
        (&(T_reshape[(k_outer * 36)])));
    }
  }
}
for (int32_t ax1_1 = 0; ax1_1 < 48; ++ax1_1) {
  for (int32_t ax2_1 = 0; ax2_1 < 48; ++ax2_1) {
    for (int32_t ax3_1 = 0; ax3_1 < 8; ++ax3_1) {
      int32_t cse_var_5 = (((ax1_1 * 384) + (ax2_1 * 8)) + ax3_1);
      // Then, has to read the data back before doing more operations. Would be way faster
      // to just fuse these loops, but TE won't let us.
      int32_t __1 = ((int32_t)(((((((int64_t)((int32_t*)depthwise_conv2d)[cse_var_5]) + ... 
      // The rest is omitted for brevity
    }
  }
}

I previously looked into this limitation, and with the help of Eric L. and others realized it would be really annoying to fix. Instead, our schedule has been replaced with a T.prim_func, which lets us do this fusion (and have much more fine-grained control in general).

I hit a few bugs doing this (e.g. #13330), and the limited docs for TVMScript meant I had to make some guesses about the right way to do things. It's totally possible this code is gross - I'll describe these issues more in a comment below. However, the generated code looks much nicer.

New optimized C intrinsic for convolutions

A few weeks ago, I wrote a faster version of microTVM's tensordot kernel. That got folded into this PR, as that schedule was not usable on its own. I've added a unit test test_topi_conv2d_tensordot_opts that goes into more detail about what the schedule does and why it is fast, but here's just a taste.

Our previous microTVM-specific schedule for regular conv2d was not very good, and was slower than just autotuning a generic implementation (for this reason, OctoML used a generic autotuned schedule to submit microTVM results to MLPerf Tiny). However, there are major limitations for how far an autotuning + C code generation approach can go, as GCC only uses the fast intrinsic functions in super narrow cases.

For example, here is how microTVM would previously generate the inner loop of a 1x1 4-channel convolution:

output[oco_1] = 0;
for (int ic_1 = 0; ic_1 < 4; ++ic_1) {
    output[oco_1] = (output[oco_1] + (((int)tensor[ic_1]) * ((int)((short*)kernel)[((oco_1 * 128) + ic_1)])));
}

Arm GCC 12.2 (with flags -mcpu=cortex-m4 -O3) compiles this into instructions taking 29 cycles per output generated. That's not good, and the previous microTVM schedule was even worse.

The new implementation in tensordot.py instead gets compiled into just 15 cycles (though there is still work to be done to get this even lower):

int tensor__y00_x00__y00_x01 = tensor[0];
int tensor__y00_x02__y00_x03 = tensor[1];
int tensor__y00_x04__y00_x05 = tensor[2];
int tensor__y00_x06__y00_x07 = tensor[3];

int kernel__y00_x00__y00_x01 = kernel[0];
int kernel__y00_x02__y00_x03 = kernel[1];
int kernel__y00_x04__y00_x05 = kernel[2];
int kernel__y00_x06__y00_x07 = kernel[3];

int sum_0 = __builtin_arm_smuad(tensor__y00_x00__y00_x01, kernel__y00_x00__y00_x01);
sum_0 = __builtin_arm_smlad(tensor__y00_x02__y00_x03, kernel__y00_x02__y00_x03, sum_0);
sum_0 = __builtin_arm_smlad(tensor__y00_x04__y00_x05, kernel__y00_x04__y00_x05, sum_0);
sum_0 = __builtin_arm_smlad(tensor__y00_x06__y00_x07, kernel__y00_x06__y00_x07, sum_0);

This is a very simple case, but complex cases are well supported and tested: we can handle data whose start pointers are not word aligned, data/kernel/output widths that are not divisible by the SIMD width, and multiple sums running concurrently to reduce the number of memory loads (e.g. for 3x3 depthwise convolutions). The unit test exercises all of these capabilities, and the tensordot.py file itself has comments explaining why doing it this way is faster.
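For readers unfamiliar with the DSP intrinsics above: smuad and smlad each perform two 16x16 multiplies (one per packed halfword lane of their 32-bit operands) plus accumulation in a single cycle, which is where the speedup comes from. A portable C model of their behavior (my paraphrase of the ACLE semantics, not Arm's implementation):

```c
#include <assert.h>
#include <stdint.h>

// Model of SMUAD: multiply the bottom halfwords together, multiply the top
// halfwords together, and sum the two 32-bit products.
static int32_t smuad_model(int32_t x, int32_t y) {
  int16_t x_lo = (int16_t)(x & 0xFFFF), x_hi = (int16_t)(x >> 16);
  int16_t y_lo = (int16_t)(y & 0xFFFF), y_hi = (int16_t)(y >> 16);
  return (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}

// Model of SMLAD: SMUAD plus an accumulator - two MACs per cycle.
static int32_t smlad_model(int32_t x, int32_t y, int32_t acc) {
  return acc + smuad_model(x, y);
}
```

So each word-sized load brings in two int16 values, and each instruction retires two multiply-accumulates - versus one load and one MAC per element in the naive loop.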

Faster re-quantization algorithm!

The way microTVM handled requantization before was terrible. Here is an actual implementation from our MLPerf Tiny submission, which I've modified slightly for readability.

static const int32_t fused_nn_conv2d_constant_1[8] = {
    +0x00000f80, -0x00000180, +0x00007e80, +0x00002880, +0x00010680, -0x00000980, +0x00001380, +0x0000a900
};

static const int32_t fused_nn_conv2d_subtract_constant_2[8] = {
    +0x0000306e, +0x00003092, +0x00008470, +0x00004a13, +0x0000c411, +0x00012da6, +0x00003b70, +0x00015bd8
};

static const int64_t fused_nn_conv2d_subtract_add_cast_constant_3[8] = {
    +0x000000004648e699LL, +0x0000000063c512d1LL, +0x00000000611b0293LL, +0x000000007524d8c7LL, +0x000000007758617fLL, +0x00000000590a119bLL, +0x00000000500f9336LL, +0x0000000040ee5089LL
};

static const int64_t fused_nn_conv2d_subtract_add_cast_multiply_constant_4[8] = {
    +0x0000002000000000LL, +0x0000004000000000LL, +0x0000004000000000LL, +0x0000004000000000LL, +0x0000008000000000LL, +0x0000010000000000LL, +0x0000004000000000LL, +0x0000008000000000LL
};
static const int64_t fused_nn_conv2d_subtract_add_cast_multiply_add_constant_5[8] = {
    +0x0000000000000026LL, +0x0000000000000027LL, +0x0000000000000027LL, +0x0000000000000027LL, +0x0000000000000028LL, +0x0000000000000029LL, +0x0000000000000027LL, +0x0000000000000028LL
};

void requantize(void* compute, int32_t conv[8], int32_t i0_i1_outer_fused, int32_t i2_outer, int32_t i3_outer) {
  // Reorganized by @guberti for readability
  int64_t _0 = ((int64_t)conv[i3_outer]) + ((int64_t)(fused_nn_conv2d_subtract_constant_2)[i3_outer]);
  int64_t _1 = _0 - ((int64_t)(fused_nn_conv2d_constant_1)[i3_outer]);
  int64_t _2 = _1 * fused_nn_conv2d_subtract_add_cast_constant_3[i3_outer];
  int64_t _3 = _2 + fused_nn_conv2d_subtract_add_cast_multiply_constant_4[i3_outer];
  int64_t _4 = _3 >> fused_nn_conv2d_subtract_add_cast_multiply_add_constant_5[i3_outer];
  int32_t __1 = ((int32_t) (_4)) - 128;

  // Code below is untouched
  int32_t __2 = (__1) < (127) ? (__1) : (127);
  int8_t __3 = (int8_t)((__2) > (-128) ? (__2) : (-128));
  int8_t __4 = (int8_t)127;
  int8_t __5 = (__3) < (__4) ? (__3) : (__4);
  int8_t __6 = (int8_t)-128;
  ((int8_t*)compute)[(((i0_i1_outer_fused * 384) + (i2_outer * 8)) + i3_outer)] = ((__5) > (__6) ? (__5) : (__6));
} 

There are a bunch of things about this that aren't ideal:

  • We have FIVE re-quantization constants (not counting the zero point, which is correctly inlined).
  • Three of the five constants are int64 values, and they are all padded with unnecessary zeros. This means we load eight words from memory for each re-quantization operation.
  • The first five math operations are int64 ops, which are slow because Arm Cortex-M is a 32-bit platform.
  • int8 bounds checking is done with a wacky set of ternary operators. I checked - these do not get compiled down nicely.

I've fixed all these things using QNN alter_op_layout functions, and I've implemented a few more complex optimizations:

  • When applicable, we now replace the bias with bias + sum(kernel) * input_zero_point (i.e. we fold the input zero point into the bias). This prevents us from having to subtract the input zero point from every data value before multiplying by a kernel value (note that the input zero point is -128 basically every time, because Cortex-M does not have a uint x int instruction). The result is stored in an int32 value.
  • We force bitshifts to be >=33 (which in practice they always are), which allows us to keep only the top 32 bits of our int32 x int32 multiplication. This lets us avoid int64 memory loads and instructions entirely, without sacrificing accuracy.
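The bias-folding trick can be sanity-checked numerically: subtracting the input zero point from each data value inside the loop gives the same answer as folding zero_point * sum(kernel) into the bias once, ahead of time. A small sketch (illustrative values only; the actual transformation lives in qnn_alter_op.py, and I assume here a convention where the convolution subtracts the zero point from the data):

```c
#include <assert.h>
#include <stdint.h>

// Naive form: subtract the input zero point per multiply-accumulate.
static int32_t conv_naive(const int8_t *data, const int8_t *kernel, int32_t n,
                          int32_t zero_point, int32_t bias) {
  int32_t sum = bias;
  for (int32_t i = 0; i < n; ++i) {
    sum += ((int32_t)data[i] - zero_point) * (int32_t)kernel[i];
  }
  return sum;
}

// Folded form: compute bias' = bias - zero_point * sum(kernel) once (at
// compile time in the real schedule), leaving a plain MAC inner loop.
static int32_t conv_folded(const int8_t *data, const int8_t *kernel, int32_t n,
                           int32_t zero_point, int32_t bias) {
  int32_t kernel_sum = 0;
  for (int32_t i = 0; i < n; ++i) kernel_sum += kernel[i];
  int32_t folded_bias = bias - zero_point * kernel_sum;

  int32_t sum = folded_bias;
  for (int32_t i = 0; i < n; ++i) {
    sum += (int32_t)data[i] * (int32_t)kernel[i];
  }
  return sum;
}
```

Since the fold happens once per output channel rather than once per multiply, the inner loop becomes exactly the SMLAD-friendly shape shown earlier.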

Together, this means our requantization code now looks like this:

static const int32_t REQUANTIZE_SCALE[8] = {
    +0x067bed1c, +0x05c578b9, +0x03d08ea5, +0x01ed066f, 
    +0x0176b86f, +0x027f3977, +0x054e3783, +0x06d0a442, 
};

static const int32_t BIAS[16] = {
    +0x00006b58, -0x000023aa, +0x00005cf3, +0x00004cc6, 
    +0x0000605e, +0x00006eed, +0x0000512e, -0x00002526, 
};
// Some lines omitted for brevity

// Bias is added before convolution, as doing it this way is faster
int requant_0 = (sum_0 * (long long) REQUANTIZE_SCALE[j]) >> 32;
requant_0 = (requant_0 + 1) >> 1;
requant_0 = __builtin_arm_ssat(requant_0 - 128, 8);
((short*) output)[0] = (short) requant_0;

All in all, requantization now takes ~8x fewer cycles per output than it did before.
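As a sanity check on the shift trick: the two-step sequence in the snippet above (take the top 32 bits of the 64-bit product, then a rounding shift right by one) matches a single rounding right shift of the full product by 33. A sketch of that equivalence (the scale values are made up; this assumes arithmetic right shift on signed integers, which Arm GCC provides):

```c
#include <assert.h>
#include <stdint.h>

// Mirrors the generated code: top 32 bits of the product, then round-shift
// by one. On Cortex-M the compiler can lower the take-top-32-bits pattern
// to a single smmul, so no int64 loads or instructions are needed.
static int32_t requantize_two_step(int32_t sum, int32_t scale) {
  int32_t requant = (int32_t)(((int64_t)sum * scale) >> 32);
  return (requant + 1) >> 1;
}

// Reference: one rounding arithmetic shift of the full 64-bit product by 33.
static int32_t requantize_reference(int32_t sum, int32_t scale) {
  int64_t product = (int64_t)sum * scale;
  return (int32_t)((product + ((int64_t)1 << 32)) >> 33);
}
```

Both forms compute round-to-nearest of (sum * scale) / 2^33; the two-step version just never needs the low word of the product.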

@tvm-bot
Collaborator

tvm-bot commented Oct 31, 2022

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@guberti guberti force-pushed the arm-qnn-convolution branch 2 times, most recently from 8f0b1a4 to 4fd94e2 Compare November 7, 2022 12:34
@guberti guberti force-pushed the arm-qnn-convolution branch 3 times, most recently from f206531 to 40b5554 Compare November 11, 2022 12:24
@guberti guberti changed the title [microTVM] [WIP] Support and test QNN convolution and fusion on Arm Cortex-M [microTVM] [WIP] Modernize Arm Cortex-M convolution schedules Nov 13, 2022
@guberti guberti force-pushed the arm-qnn-convolution branch 3 times, most recently from 39cb5a4 to 7b465c2 Compare November 17, 2022 14:44
@guberti guberti marked this pull request as ready for review November 17, 2022 14:45
@guberti
Member Author

guberti commented Nov 17, 2022

This pull request is ready for review! Would love reviews from @mkatanbaf (who's doing some microTVM + MetaSchedule work), @areusch, and @ekalda. Would also love a look from someone who's more familiar with TVMScript, and can critique my use of it :).

That said, there are a few known issues in this PR I still need to fix:

  • There is a hack where I read dummy data from a TVMScript buffer to prevent TVM from seeing the buffers as "unused". I think I'm supposed to use T.reads/T.writes in this situation, but I could not make those functions work.
  • It's kinda gross for me to use alter_op to change the requantize ops to be integers. It would be much better if I could do this in the TIR schedules, but this does not work as I cannot alter the requantize constants. I had to disable a type check to make this hack work, so I would like to find a different solution before merging.
  • The output zero point for requantization in tensordot.py is a fixed value of -128. I need to fix this to be dynamic. Done!

In a following PR, I'll also address:

  • The out_layout attribute is not supported for my conv2d or depthwise_conv2d schedules. Adding this will let me get some timing results!

@guberti guberti force-pushed the arm-qnn-convolution branch from a6dfafc to febb861 Compare November 18, 2022 16:27
Contributor

@areusch areusch left a comment


did a first pass here, thanks @guberti !

"""Addition is commutative, so we could add the bias before, during, or after performing our
multiply-accumulate operations. It "costs" one cycle either way - if done at the beginning we
can't use a SMULXY trick to set sum_i to zero for "free", and if done at the end it doesn't
combine with anything. However, doing it at the beginning frees up a register/prevents needing
Contributor

what about overflow?

Member Author

The order of bias addition does not change the overflow behavior. This comment is just stating we could do the additions as:

$$A_1 B_1 + A_2 B_2 + \cdots A_n B_n + \text{bias}$$

OR as:

$$\text{bias} + A_1 B_1 + A_2 B_2 + \cdots A_n B_n$$

I've changed the wording a bit to make this clearer.

// Check and assign types for scale and zero points.
AssignType(types[1], DataType::Float(32), axis_shape, reporter); // input_scale
AssignType(types[2], DataType::Int(32), axis_shape, reporter); // input_zero_pt
// AssignType(types[1], DataType::Float(32), axis_shape, reporter); // input_scale
Contributor

uncomment?

Member Author

Fixed - this PR should not change requantize.cc.

However, it is a bit of a tricky issue. In qnn_alter_op.py, I want to manually choose the int32 requantize scale to improve performance. However, Relay's requantize op only allows the output scale to be a float32.

I get around this by storing the scale data as a float32 array with the correct bytes, and reading it back as an int32 array. I've added a comment to qnn_alter_op.py to better explain what happens here. This is pretty gross.

Longer term, I'd love to add a new Relay op IntegerRequantize that takes int32 scale and shift arguments, which will let us solve this problem in a nice way. Would love your thoughts on the right way to address this!

@guberti guberti force-pushed the arm-qnn-convolution branch from febb861 to fae2a12 Compare November 20, 2022 20:32
Contributor

@mkatanbaf mkatanbaf left a comment


Great work @guberti I added a few comments, mostly asking for clarifications.

def _apply_simd_optimizations(instruction_tuples) -> Iterator[Tuple]:
"""When possible, fuses single MACs into SIMD MAC instructions.
The compiler cannot do this automatically, as calling __builtin_arm_smlaxy forces the SMLAxy
Contributor

I'm not sure if I understand this correctly, but does this mean that we will unroll the loop and get a long list of instructions instead? would this significantly increase the code size?

Member Author

@guberti guberti Nov 23, 2022


Yes, the inner reduction loops will always be unrolled (this occurs in _get_draft_macs). We will often unroll even more than this, either as another unrolled copy of the inner loops for odd-numbered channels (this happens e.g. for 3x3 depthwise convolutions) or by computing multiple sums at the same time (i.e. when num_sums > 1).

Compared with the naive approach, this does increase code size. However, the increase is very small - for example, unrolling a 3x3 depthwise convolution might take ~10 extra instructions, or 0.01 KB more flash size. This is well worth it, as unrolling dramatically reduces overhead and increases speed by ~2x. The previous tensordot implementation also unrolled these loops for the same reason.

Comment on lines 174 to 181
# Arm GCC does not have `__builtin_arm_smlabt`, even though `__builtin_arm_smlatt`,
# `__builtin_arm_smlatb`, `__builtin_arm_smlad` and so on all exist. Perhaps this is a
# choice, since we can just use `smlabt` with the argument order swapped instead? Note that
# `__builtin_arm_smlabt` exists on most compilers (e.g. Clang) - this is just a GCC thing.
if instruction == "smlabt":
yield f"sum_{index} = __builtin_arm_smlatb({op2}, {op1}, sum_{index});"
else:
yield f"sum_{index} = __builtin_arm_{instruction}({op1}, {op2}, sum_{index});"
Member

I believe this is because you're using the builtins directly rather than using the ACLE interface (https://arm-software.github.io/acle/main/acle.html#accumulating-multiplications) - unsure how much guarantee you get with built-ins, I would move to the ACLE interface anyway.

Also see: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm_acle.h#L661-L675 😸

(
f"""
#include <stdint.h>
#include <arm_nnsupportfunctions.h>
Member

Yay! I think this solves the same problem as #13363 😸 !

@guberti guberti changed the title [microTVM] [WIP] Modernize Arm Cortex-M convolution schedules [microTVM] Modernize Arm Cortex-M convolution schedules Nov 23, 2022
@guberti
Member Author

guberti commented Nov 23, 2022

@Mousius I'm a fan of switching to use ACLE! I originally used the __builtin functions simply because CMSIS-NN used them, but ACLE seems more stable. I've updated this PR to use ACLE.

This PR only affects tensordot.py, though. We still need #13363 to switch the rest of the micro schedules to ACLE.

# under the License.
"""microTVM cares a lot about the convolution + bias + requantize + fused ReLU use case. There have
been some accuracy issues in the past, so this test steps through a model (MobileNetV1) layer by
layer and ensures there is 1-1 correspondence at each step. This test would run way faster if we ran
Contributor

This is very cool, great idea!

Contributor

@areusch areusch left a comment


thanks @guberti, did a more fine-grained pass now.

scale = T.match_buffer(scale_handle, scale_shape)
output = T.match_buffer(output_handle, output_shape, dtype="int16")

# This hack prevents TVM from seeing these variables as "unused". I should be using T.reads
Contributor

can you file a bug for this?

Member Author

I'm not sure if this is user error on my part, or an issue with TVM. I'll look around a bit and file an issue if it seems to be a bug.

Contributor

Hi, apologies for bringing up an old PR thread, I just ran into a similar problem, was an issue filed in the end? If so, could you possibly point me to it?

Member Author

@lhutton1 A bug still needs to be filed here - I meant to write up a small reproducible example, but never got around to it.

Contributor

Thanks, I'll take a look into it :)

@guberti
Member Author

guberti commented Dec 1, 2022

Thanks for the detailed review @areusch - your comments should be addressed by 9bd3598.

Contributor

@areusch areusch left a comment


thanks @guberti, this is basically ready, i've highlighted a couple last areas (in particular the doctest). feel free to merge once you've addressed!

including regular conv2d, depthwise conv2d, and grouped conv2d provided the data and kernel layouts
are the optimal ones. When groups=1, the optimal data layout is NHWC and kernel layout is OHWI. When
this is a depthwise convolution, the optimal data layout is NCHW and kernel layout is OIHW."""
"""Generates optimized code to compute a tensor dot product on ARMv7E-M.
Contributor

does this apply to v8-M also?

Member Author

@guberti guberti Dec 5, 2022


Sometimes - this uses the DSP instructions, which are required in v7E-M but optional in v8-M. This code also does not use MVE, which is optional in v8-M but would be really useful for deep learning. I've clarified this in the docstring.

Get QNN strategy running

QNN strategy with operator fusion
Assembly tensordot from other PR

Tensordot offset support

Hand tested tensordot code
Formatting fixes

Don't use automatic AOT building when skipping pass

Assorted tech for scheduling with TIR

Hacky int16 support
Bugged schedule implementation

Passing test!

Works for all 1x1 conv2ds!

External QNN operator altering

Debugging work

Pad with correct constant

Broadly functional conv2d

Reorganize quantize convolution test
Working depthwise convolution for strides=1

Working depthwise convolution!
Support Python 3.7

Clean up code to prepare for review
Second round of code review

Fix tensordot opts test
@guberti guberti force-pushed the arm-qnn-convolution branch from dcd9c17 to 431e4e4 Compare December 6, 2022 00:27
@guberti
Member Author

guberti commented Dec 6, 2022

I've addressed the comments from @areusch, so per his instructions I'm merging this. Thanks for the feedback!
