
Conversation

@guberti
Member

@guberti guberti commented Oct 31, 2022

For a long time, I've been unhappy with TVM's TE-based convolution schedules for Arm Cortex-M. They were much slower than the state of the art, and had strange inefficiencies caused by limitations of TE.

This pull request rewrites regular and depthwise convolution schedules on Arm Cortex-M, using MetaSchedule and TIR to make them much faster. It took some work and ended up being a big PR (as many of these changes depend on the others), but I'm really happy with the result.

High level changes

  • Adds a qnn operator strategy to TVM for Arm Cortex-M. With this change, we are able to skip the QNN lowering pass, letting us use Cortex-M specific implementations of qnn_conv2d, add, and requantize that perform much better.
  • Performs operator fusion on the (very common) convolution + bias + requantize block for Cortex-M. This lets us write one schedule that performs all three of these steps, which improves performance by letting us compute multiple operator outputs concurrently (which reduces the number of times the data and weights are loaded from memory).
  • Adds a TIR/TVMScript schedule for the convolution + bias + requantize block described above. This fixes a longstanding limitation of TE which required us to write convolution outputs to an intermediate buffer and took a lot of time.
  • Uses a new, highly optimized C/assembly extern function to actually perform the convolution. I spent a lot of time inspecting the assembly it compiles into and verifying that it is fast.
  • Overhauls the way requantization is done on Cortex-M by adding new alter_op_layout functions for add and requantize. This reduces the amount of memory loaded during each requantization by over 5x with some snazzy tricks (pre-multiplying the kernel values with the input zero point, skipping the "shift" step in our floating point multiplication approximation, fusing the bias with the pre-multiplied zero point).
  • Adds a new, end-to-end Corstone300 test that runs the MLPerf Tiny vww model using TFLite and ensures our implementation (with all the optimizations above) produces the same outputs. This is done by layer, so if there is ever an accuracy issue, we will know exactly which layer is causing the problem.

TFLite-ground-truth Corstone300 Test

For a while, microTVM has had Corstone300 tests which compare our schedules for regular nn ops to implementations elsewhere in TVM, to make sure the schedules are written correctly. Despite this, we've had some accuracy issues (see #13364) when running models end-to-end, and we don't really have tools to debug these.

The way I see it, the existing tests have two key limitations. They:

  1. Compare our results only to other TVM implementations.
  2. Run only the base operators (e.g. just nn.conv2d), leaving out the bias and re-quantize operations (which are normally fused).

To fix this, I've added test_quantized_convolution.py in this PR. This test runs the convolution layers of the vww model from MLPerf Tiny using TensorFlow's TFLite Interpreter, while saving all the intermediate layer outputs.

Then, one by one each layer is loaded with TVM and Corstone300, and the full operator (with fused convolution, bias, ReLU, and requantization) is run and compared to TFLite's result.

Quantized operators and fusion

TFLite Micro, CMSIS-NN, and (AFAIK) all other microcontroller AI platforms write code for "fused operators" - e.g. a convolution combined with a bias addition, ReLU activation, and requantization. This is good for a few reasons - it prevents us from having to store "intermediate results", it lets us combine steps from different operators, and it makes parts of the code easier to write.
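To make the benefit concrete, here's a minimal sketch (plain C, with illustrative names and a simple shift-based requantize standing in for the real fixed-point math - this is not TVM's generated code) of what a fused kernel computes. The convolution result never touches an intermediate buffer before the bias, ReLU, and requantize steps:

```c
#include <assert.h>
#include <stdint.h>

// Illustrative fused 1x1 convolution + bias + ReLU + requantize producing a
// single output value over `channels` input channels. The requantize here is
// a plain right shift for simplicity; real schedules use fixed-point scales.
static int8_t fused_conv_bias_relu_requantize(const int8_t *data,
                                              const int16_t *kernel,
                                              int32_t bias, int32_t channels,
                                              int32_t shift) {
  int32_t sum = bias;  // accumulate in a register; no intermediate buffer
  for (int32_t c = 0; c < channels; ++c) {
    sum += (int32_t)data[c] * (int32_t)kernel[c];
  }
  if (sum < 0) sum = 0;            // fused ReLU
  int32_t requant = sum >> shift;  // stand-in for the real requantize step
  if (requant > 127) requant = 127;  // saturate to int8 range
  if (requant < -128) requant = -128;
  return (int8_t)requant;
}
```

The point of the sketch is structural: because all four steps happen while the accumulator is still live in a register, nothing is ever stored to and re-loaded from memory between operators.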

This wasn't possible with TVM until recently, thanks to #12398 which enabled it for Hexagon. I've done the same thing here for Arm. I've also added strategy functions for 2D quantized convolutions on Arm, though (a) only some cases are supported and (b) the qnn.Legalize pass must be disabled for these to be used.

TVMScript convolution schedules

For a while, TE has had a known limitation that makes it impossible to fuse certain operators when they follow reduce operations. This meant microTVM would generate code like the following:

for (int32_t k_outer = 0; k_outer < 2; ++k_outer) {
  for (int32_t i = 0; i < 48; ++i) {
    for (int32_t j = 0; j < 48; ++j) {
      int32_t cse_var_4 = (j * 8);
      int32_t cse_var_3 = (k_outer * 4);
      // Writes data to a buffer in memory
      convolution_helper_function(
        (&(((int32_t*)depthwise_conv2d)[(((i * 384) + cse_var_4) + cse_var_3)])), 
        (&(((int8_t*)padded_data)[(((i * 400) + cse_var_4) + cse_var_3)])), 
        (&(T_reshape[(k_outer * 36)])));
    }
  }
}
for (int32_t ax1_1 = 0; ax1_1 < 48; ++ax1_1) {
  for (int32_t ax2_1 = 0; ax2_1 < 48; ++ax2_1) {
    for (int32_t ax3_1 = 0; ax3_1 < 8; ++ax3_1) {
      int32_t cse_var_5 = (((ax1_1 * 384) + (ax2_1 * 8)) + ax3_1);
      // Then, has to read the data back before doing more operations. Would be way faster
      // to just fuse these loops, but TE won't let us.
      int32_t __1 = ((int32_t)(((((((int64_t)((int32_t*)depthwise_conv2d)[cse_var_5]) + ... 
      // The rest is omitted for brevity
    }
  }
}

I previously looked into this limitation, and with the help of Eric L. and others realized it would be really annoying to fix. Instead, our schedule has been replaced with a T.prim_func, which lets us do this fusion (and have much more fine-grained control in general).

I hit a few bugs doing this (e.g. #13330), and the limited docs for TVMScript meant I had to make some guesses about the right way to do things. It's totally possible this code is gross - I'll describe these issues more in a comment below. However, the generated code looks much nicer.

New optimized C intrinsic for convolutions

A few weeks ago, I wrote a faster version of microTVM's tensordot kernel. That got folded into this PR, as that schedule was not usable on its own. I've added a unit test test_topi_conv2d_tensordot_opts that goes into more detail about what the schedule does and why it is fast, but here's just a taste.

Our previous microTVM-specific schedule for regular conv2d was not very good, and was slower than just autotuning a generic implementation (for this reason, OctoML used a generic autotuned schedule to submit microTVM results to MLPerf Tiny). However, there are major limitations for how far an autotuning + C code generation approach can go, as GCC only uses the fast intrinsic functions in super narrow cases.

For example, here is how microTVM would previously generate the inner loop of a 1x1 4-channel convolution:

output[oco_1] = 0;
for (int ic_1 = 0; ic_1 < 4; ++ic_1) {
    output[oco_1] = (output[oco_1] + (((int)tensor[ic_1]) * ((int)((short*)kernel)[((oco_1 * 128) + ic_1)])));
}

Arm GCC 12.2 (with flags -mcpu=cortex-m4 -O3) compiles this into instructions taking 29 cycles per output generated. That's not good, and the previous microTVM schedule was even worse.

The new implementation in tensordot.py instead gets compiled into just 15 cycles (though there is still work to be done to get this even lower):

int tensor__y00_x00__y00_x01 = tensor[0];
int tensor__y00_x02__y00_x03 = tensor[1];
int tensor__y00_x04__y00_x05 = tensor[2];
int tensor__y00_x06__y00_x07 = tensor[3];

int kernel__y00_x00__y00_x01 = kernel[0];
int kernel__y00_x02__y00_x03 = kernel[1];
int kernel__y00_x04__y00_x05 = kernel[2];
int kernel__y00_x06__y00_x07 = kernel[3];

int sum_0 = __builtin_arm_smuad(tensor__y00_x00__y00_x01, kernel__y00_x00__y00_x01);
sum_0 = __builtin_arm_smlad(tensor__y00_x02__y00_x03, kernel__y00_x02__y00_x03, sum_0);
sum_0 = __builtin_arm_smlad(tensor__y00_x04__y00_x05, kernel__y00_x04__y00_x05, sum_0);
sum_0 = __builtin_arm_smlad(tensor__y00_x06__y00_x07, kernel__y00_x06__y00_x07, sum_0);

This is a very simple case, but complex cases are well supported and tested: we can handle data whose start pointers are not word aligned, data/kernel/output widths that are not divisible by the SIMD width, and multiple sums running concurrently to reduce the number of memory loads (e.g. for 3x3 depthwise convolutions). The unit test exercises all of these capabilities, and the tensordot.py file itself has comments explaining why doing it this way is faster.
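For readers unfamiliar with the DSP intrinsics above: smuad and smlad each perform two 16x16 multiplies (one per packed halfword lane of their 32-bit operands) plus accumulation in a single cycle, which is where the speedup comes from. A portable C model of their behavior (my paraphrase of the ACLE semantics, not Arm's implementation):

```c
#include <assert.h>
#include <stdint.h>

// Model of SMUAD: multiply the bottom halfwords together, multiply the top
// halfwords together, and sum the two 32-bit products.
static int32_t smuad_model(int32_t x, int32_t y) {
  int16_t x_lo = (int16_t)(x & 0xFFFF), x_hi = (int16_t)(x >> 16);
  int16_t y_lo = (int16_t)(y & 0xFFFF), y_hi = (int16_t)(y >> 16);
  return (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}

// Model of SMLAD: SMUAD plus an accumulator - two MACs per cycle.
static int32_t smlad_model(int32_t x, int32_t y, int32_t acc) {
  return acc + smuad_model(x, y);
}
```

So each word-sized load brings in two int16 values, and each instruction retires two multiply-accumulates - versus one load and one MAC per element in the naive loop.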

Faster re-quantization algorithm!

The way microTVM handled requantization before was terrible. Here is an actual implementation from our MLPerf Tiny submission, which I've modified slightly for readability.

static const int32_t fused_nn_conv2d_constant_1[8] = {
    +0x00000f80, -0x00000180, +0x00007e80, +0x00002880, +0x00010680, -0x00000980, +0x00001380, +0x0000a900
};

static const int32_t fused_nn_conv2d_subtract_constant_2[8] = {
    +0x0000306e, +0x00003092, +0x00008470, +0x00004a13, +0x0000c411, +0x00012da6, +0x00003b70, +0x00015bd8
};

static const int64_t fused_nn_conv2d_subtract_add_cast_constant_3[8] = {
    +0x000000004648e699LL, +0x0000000063c512d1LL, +0x00000000611b0293LL, +0x000000007524d8c7LL, +0x000000007758617fLL, +0x00000000590a119bLL, +0x00000000500f9336LL, +0x0000000040ee5089LL
};

static const int64_t fused_nn_conv2d_subtract_add_cast_multiply_constant_4[8] = {
    +0x0000002000000000LL, +0x0000004000000000LL, +0x0000004000000000LL, +0x0000004000000000LL, +0x0000008000000000LL, +0x0000010000000000LL, +0x0000004000000000LL, +0x0000008000000000LL
};
static const int64_t fused_nn_conv2d_subtract_add_cast_multiply_add_constant_5[8] = {
    +0x0000000000000026LL, +0x0000000000000027LL, +0x0000000000000027LL, +0x0000000000000027LL, +0x0000000000000028LL, +0x0000000000000029LL, +0x0000000000000027LL, +0x0000000000000028LL
};

void requantize(void* compute, int32_t conv[8], int32_t i0_i1_outer_fused, int32_t i2_outer, int32_t i3_outer) {
  // Reorganized by @guberti for readability
  int64_t _0 = ((int64_t)conv[i3_outer]) + ((int64_t)(fused_nn_conv2d_subtract_constant_2)[i3_outer]);
  int64_t _1 = _0 - ((int64_t)(fused_nn_conv2d_constant_1)[i3_outer]);
  int64_t _2 = _1 * fused_nn_conv2d_subtract_add_cast_constant_3[i3_outer];
  int64_t _3 = _2 + fused_nn_conv2d_subtract_add_cast_multiply_constant_4[i3_outer];
  int64_t _4 = _3 >> fused_nn_conv2d_subtract_add_cast_multiply_add_constant_5[i3_outer];
  int32_t __1 = ((int32_t) (_4)) - 128;

  // Code below is untouched
  int32_t __2 = (__1) < (127) ? (__1) : (127);
  int8_t __3 = (int8_t)((__2) > (-128) ? (__2) : (-128));
  int8_t __4 = (int8_t)127;
  int8_t __5 = (__3) < (__4) ? (__3) : (__4);
  int8_t __6 = (int8_t)-128;
  ((int8_t*)compute)[(((i0_i1_outer_fused * 384) + (i2_outer * 8)) + i3_outer)] = ((__5) > (__6) ? (__5) : (__6));
} 

There are a bunch of things about this that aren't ideal:

  • We have FIVE re-quantization constants (not counting the zero point, which is correctly inlined).
  • Three of the five constants are int64 values, and they are all padded with unnecessary zeros. This means we load eight words from memory for each re-quantization operation.
  • The first five math operations are int64 ops, which are slow because Arm Cortex-M is a 32-bit platform.
  • int8 bounds checking is done with a wacky set of ternary operators. I checked - these do not get compiled down nicely.

I've fixed all these things using QNN alter_op_layout functions, and I've implemented a few more complex optimizations:

  • When applicable, we now replace the bias with bias + sum(kernel) * input_zero_point (i.e. we fold the input zero point into the bias). This prevents us from having to subtract the input zero point from every data value before multiplying by a kernel value (note that the input zero point is -128 basically every time, because Cortex-M does not have a uint x int instruction). The result is stored in an int32 value.
  • We force bitshifts to be >=33 (which in practice they always are), which allows us to keep only the top 32 bits of our int32 x int32 multiplication. This lets us avoid int64 memory loads and instructions entirely, without sacrificing accuracy.
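The bias-folding trick can be sanity-checked numerically: subtracting the input zero point from each data value inside the loop gives the same answer as folding zero_point * sum(kernel) into the bias once, ahead of time. A small sketch (illustrative values only; the actual transformation lives in qnn_alter_op.py, and I assume here a convention where the convolution subtracts the zero point from the data):

```c
#include <assert.h>
#include <stdint.h>

// Naive form: subtract the input zero point per multiply-accumulate.
static int32_t conv_naive(const int8_t *data, const int8_t *kernel, int32_t n,
                          int32_t zero_point, int32_t bias) {
  int32_t sum = bias;
  for (int32_t i = 0; i < n; ++i) {
    sum += ((int32_t)data[i] - zero_point) * (int32_t)kernel[i];
  }
  return sum;
}

// Folded form: compute bias' = bias - zero_point * sum(kernel) once (at
// compile time in the real schedule), leaving a plain MAC inner loop.
static int32_t conv_folded(const int8_t *data, const int8_t *kernel, int32_t n,
                           int32_t zero_point, int32_t bias) {
  int32_t kernel_sum = 0;
  for (int32_t i = 0; i < n; ++i) kernel_sum += kernel[i];
  int32_t folded_bias = bias - zero_point * kernel_sum;

  int32_t sum = folded_bias;
  for (int32_t i = 0; i < n; ++i) {
    sum += (int32_t)data[i] * (int32_t)kernel[i];
  }
  return sum;
}
```

Since the fold happens once per output channel rather than once per multiply, the inner loop becomes exactly the SMLAD-friendly shape shown earlier.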

Together, this means our requantization code now looks like this:

static const int32_t REQUANTIZE_SCALE[8] = {
    +0x067bed1c, +0x05c578b9, +0x03d08ea5, +0x01ed066f, 
    +0x0176b86f, +0x027f3977, +0x054e3783, +0x06d0a442, 
};

static const int32_t BIAS[16] = {
    +0x00006b58, -0x000023aa, +0x00005cf3, +0x00004cc6, 
    +0x0000605e, +0x00006eed, +0x0000512e, -0x00002526, 
};
// Some lines omitted for brevity

// Bias is added before convolution, as doing it this way is faster
int requant_0 = (sum_0 * (long long) REQUANTIZE_SCALE[j]) >> 32;
requant_0 = (requant_0 + 1) >> 1;
requant_0 = __builtin_arm_ssat(requant_0 - 128, 8);
((short*) output)[0] = (short) requant_0;

All in all, requantization now takes ~8x fewer cycles per output than it did before.
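As a sanity check on the shift trick: the two-step sequence in the snippet above (take the top 32 bits of the 64-bit product, then a rounding shift right by one) matches a single rounding right shift of the full product by 33. A sketch of that equivalence (the scale values are made up; this assumes arithmetic right shift on signed integers, which Arm GCC provides):

```c
#include <assert.h>
#include <stdint.h>

// Mirrors the generated code: top 32 bits of the product, then round-shift
// by one. On Cortex-M the compiler can lower the take-top-32-bits pattern
// to a single smmul, so no int64 loads or instructions are needed.
static int32_t requantize_two_step(int32_t sum, int32_t scale) {
  int32_t requant = (int32_t)(((int64_t)sum * scale) >> 32);
  return (requant + 1) >> 1;
}

// Reference: one rounding arithmetic shift of the full 64-bit product by 33.
static int32_t requantize_reference(int32_t sum, int32_t scale) {
  int64_t product = (int64_t)sum * scale;
  return (int32_t)((product + ((int64_t)1 << 32)) >> 33);
}
```

Both forms compute round-to-nearest of (sum * scale) / 2^33; the two-step version just never needs the low word of the product.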

@tvm-bot
Collaborator

tvm-bot commented Oct 31, 2022

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@guberti guberti force-pushed the arm-qnn-convolution branch 2 times, most recently from 8f0b1a4 to 4fd94e2 Compare November 7, 2022 12:34
@guberti guberti force-pushed the arm-qnn-convolution branch 3 times, most recently from f206531 to 40b5554 Compare November 11, 2022 12:24
@guberti guberti changed the title [microTVM] [WIP] Support and test QNN convolution and fusion on Arm Cortex-M [microTVM] [WIP] Modernize Arm Cortex-M convolution schedules Nov 13, 2022
@guberti guberti force-pushed the arm-qnn-convolution branch 3 times, most recently from 39cb5a4 to 7b465c2 Compare November 17, 2022 14:44
@guberti guberti marked this pull request as ready for review November 17, 2022 14:45
@guberti
Member Author

guberti commented Nov 17, 2022

This pull request is ready for review! Would love reviews from @mkatanbaf (who's doing some microTVM + MetaSchedule work), @areusch, and @ekalda. Would also love a look from someone who's more familiar with TVMScript, and can critique my use of it :).

That said, there are a few known issues in this PR I still need to fix:

  • There is a hack where I read dummy data from a TVMScript buffer to prevent TVM from seeing the buffers as "unused". I think I'm supposed to use T.reads/T.writes in this situation, but I could not make those functions work.
  • It's kinda gross for me to use alter_op to change the requantize ops to be integers. It would be much better if I could do this in the TIR schedules, but this does not work as I cannot alter the requantize constants. I had to disable a type check to make this hack work, so I would like to find a different solution before merging.
  • The output zero point for requantization in tensordot.py is a fixed value of -128. I need to fix this to be dynamic. Done!

In a following PR, I'll also address:

  • The out_layout attribute is not supported for my conv2d or depthwise_conv2d schedules. Adding this will let me get some timing results!

@guberti guberti force-pushed the arm-qnn-convolution branch from a6dfafc to febb861 Compare November 18, 2022 16:27
Contributor

@areusch areusch left a comment


did a first pass here, thanks @guberti !

"""Addition is commutative, so we could add the bias before, during, or after performing our
multiply-accumulate operations. It "costs" one cycle either way - if done at the beginning we
can't use a SMULXY trick to set sum_i to zero for "free", and if done at the end it doesn't
combine with anything. However, doing it at the beginning frees up a register/prevents needing
Contributor

what about overflow?

Member Author

The order of bias addition does not change the overflow behavior. This comment is just stating we could do the additions as:

$$A_1 B_1 + A_2 B_2 + \cdots A_n B_n + \text{bias}$$

OR as:

$$\text{bias} + A_1 B_1 + A_2 B_2 + \cdots A_n B_n$$

I've changed the wording a bit to make this clearer.

// Check and assign types for scale and zero points.
AssignType(types[1], DataType::Float(32), axis_shape, reporter); // input_scale
AssignType(types[2], DataType::Int(32), axis_shape, reporter); // input_zero_pt
// AssignType(types[1], DataType::Float(32), axis_shape, reporter); // input_scale
Contributor

uncomment?

Member Author

Fixed - this PR should not change requantize.cc.

However, it is a bit of a tricky issue. In qnn_alter_op.py, I want to manually choose the int32 requantize scale to improve performance. However, Relay's requantize op only allows the output scale to be a float32.

I get around this by storing the scale data as a float32 array with the correct bytes, and reading it back as an int32 array. I've added a comment to qnn_alter_op.py to better explain what happens here. This is pretty gross.

Longer term, I'd love to add a new Relay op IntegerRequantize that takes int32 scale and shift arguments, which will let us solve this problem in a nice way. Would love your thoughts on the right way to address this!

@guberti guberti force-pushed the arm-qnn-convolution branch from febb861 to fae2a12 Compare November 20, 2022 20:32
Contributor

@mkatanbaf mkatanbaf left a comment


Great work @guberti I added a few comments, mostly asking for clarifications.

def _apply_simd_optimizations(instruction_tuples) -> Iterator[Tuple]:
"""When possible, fuses single MACs into SIMD MAC instructions.
The compiler cannot do this automatically, as calling __builtin_arm_smlaxy forces the SMLAxy
Contributor

I'm not sure if I understand this correctly, but does this mean that we will unroll the loop and get a long list of instructions instead? would this significantly increase the code size?

Member Author

@guberti guberti Nov 23, 2022


Yes, the inner reduction loops will always be unrolled (this occurs in _get_draft_macs). We will often unroll even more than this, either as another unrolled copy of the inner loops for odd-numbered channels (this happens e.g. for 3x3 depthwise convolutions) or by computing multiple sums at the same time (i.e. when num_sums > 1).

Compared with the naive approach, this does increase code size. However, the increase is very small - for example, unrolling a 3x3 depthwise convolution might take ~10 extra instructions, or 0.01 KB more flash size. This is well worth it, as unrolling dramatically reduces overhead and increases speed by ~2x. The previous tensordot implementation also unrolled these loops for the same reason.

Comment on lines 174 to 181
# Arm GCC does not have `__builtin_arm_smlabt`, even though `__builtin_arm_smlatt`,
# `__builtin_arm_smlatb`, `__builtin_arm_smlad` and so on all exist. Perhaps this is a
# choice, since we can just use `smlabt` with the argument order swapped instead? Note that
# `__builtin_arm_smlabt` exists on most compilers (e.g. Clang) - this is just a GCC thing.
if instruction == "smlabt":
yield f"sum_{index} = __builtin_arm_smlatb({op2}, {op1}, sum_{index});"
else:
yield f"sum_{index} = __builtin_arm_{instruction}({op1}, {op2}, sum_{index});"
Member

I believe this is because you're using the builtins directly rather than using the ACLE interface (https://arm-software.github.io/acle/main/acle.html#accumulating-multiplications) - unsure how much guarantee you get with built-ins, I would move to the ACLE interface anyway.

Also see: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm_acle.h#L661-L675 😸

(
f"""
#include <stdint.h>
#include <arm_nnsupportfunctions.h>
Member

Yay! I think this solves the same problem as #13363 😸 !

@guberti guberti changed the title [microTVM] [WIP] Modernize Arm Cortex-M convolution schedules [microTVM] Modernize Arm Cortex-M convolution schedules Nov 23, 2022
@guberti
Member Author

guberti commented Nov 23, 2022

@Mousius I'm a fan of switching to use ACLE! I originally used the __builtin functions simply because CMSIS-NN used them, but ACLE seems more stable. I've updated this PR to use ACLE.

This PR only affects tensordot.py, though. We still need #13363 to switch the rest of the micro schedules to ACLE.

# under the License.
"""microTVM cares a lot about the convolution + bias + requantize + fused ReLU use case. There have
been some accuracy issues in the past, so this test steps through a model (MobileNetV1) layer by
layer and ensures there is 1-1 correspondence at each step. This test would run way faster if we ran
Contributor

This is very cool, great idea!

Contributor

@areusch areusch left a comment


thanks @guberti, did a more fine-grained pass now.

scale = T.match_buffer(scale_handle, scale_shape)
output = T.match_buffer(output_handle, output_shape, dtype="int16")

# This hack prevents TVM from seeing these variables as "unused". I should be using T.reads
Contributor

can you file a bug for this?

Member Author

I'm not sure if this is user error on my part, or an issue with TVM. I'll look around a bit and file an issue if it seems to be a bug.

Contributor

Hi, apologies for bringing up an old PR thread, I just ran into a similar problem, was an issue filed in the end? If so, could you possibly point me to it?

Member Author

@lhutton1 A bug still needs to be filed here - I meant to write up a small reproducible example, but never got around to it.

Contributor

Thanks, I'll take a look into it :)

@guberti
Member Author

guberti commented Dec 1, 2022

Thanks for the detailed review @areusch - your comments should be addressed by 9bd3598.

Contributor

@areusch areusch left a comment


thanks @guberti, this is basically ready, i've highlighted a couple last areas (in particular the doctest). feel free to merge once you've addressed!

including regular conv2d, depthwise conv2d, and grouped conv2d provided the data and kernel layouts
are the optimal ones. When groups=1, the optimal data layout is NHWC and kernel layout is OHWI. When
this is a depthwise convolution, the optimal data layout is NCHW and kernel layout is OIHW."""
"""Generates optimized code to compute a tensor dot product on ARMv7E-M.
Contributor

does this apply to v8-M also?

Member Author

@guberti guberti Dec 5, 2022


Sometimes - this uses the DSP instructions, which are required in v7E-M but optional in v8-M. This code also does not use MVE, which is optional in v8-M but would be really useful for deep learning. I've clarified this in the docstring.

Get QNN strategy running

QNN strategy with operator fusion
Assembly tensordot from other PR

Tensordot offset support

Hand tested tensordot code
Formatting fixes

Don't use automatic AOT building when skipping pass

Assorted tech for scheduling with TIR

Hacky int16 support
Bugged schedule implementation

Passing test!

Works for all 1x1 conv2ds!

External QNN operator altering

Debugging work

Pad with correct constant

Broadly functional conv2d

Reorganize quantize convolution test
Working depthwise convolution for strides=1

Working depthwise convolution!
Support Python 3.7

Clean up code to prepare for review
Second round of code review

Fix tensordot opts test
@guberti guberti force-pushed the arm-qnn-convolution branch from dcd9c17 to 431e4e4 Compare December 6, 2022 00:27
@guberti
Member Author

guberti commented Dec 6, 2022

I've addressed the comments from @areusch, so per his instructions I'm merging this. Thanks for the feedback!
