From 6f39abb2a7f982c5a74ac8a5e9daef4529dad7fb Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Thu, 22 Jul 2021 16:28:13 +0100
Subject: [PATCH 1/9] Markdown for CMSIS-NN integration

Change-Id: I3b0954f3fdb4d54b3e38a84de0ab649c1e79bca8
---
 rfcs/0013_Arm_CMSIS-NN_Integration.md | 113 ++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 rfcs/0013_Arm_CMSIS-NN_Integration.md

diff --git a/rfcs/0013_Arm_CMSIS-NN_Integration.md b/rfcs/0013_Arm_CMSIS-NN_Integration.md
new file mode 100644
index 00000000..8421a776
--- /dev/null
+++ b/rfcs/0013_Arm_CMSIS-NN_Integration.md
@@ -0,0 +1,113 @@
- Feature Name: Arm(R) CMSIS-NN Integration for Cortex-M
- Start Date: July 2021
- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)

# Summary

This RFC introduces a plan for integrating the CMSIS-NN library into TVM. The library consists of efficient kernels targeted at Arm's Cortex-M architecture.

Please refer to the following pages for more details on CMSIS-NN.
https://arm-software.github.io/CMSIS_5/NN/html/index.html
https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN

The first PR in the series of PRs to fulfill this integration would be a graph partitioner for int8 softmax. A detailed plan can be found below in this RFC.


# Motivation

The CMSIS-NN library consists of hand-tuned kernels that are suitable for Cortex-M and are compliant with the quantization scheme used in TensorFlow Lite. They have been optimized for better performance and the small memory footprint required on these embedded devices, so it makes sense for TVM to reuse them when generating code for Cortex-M. They have already been integrated with the TensorFlow Lite Micro project.


# Guide-level explanation

TVM's external code generation infrastructure allows for automatic partitioning and code generation using an external compiler. Partitioned subgraphs containing operator(s) targeted for Cortex-M can then be translated into the CMSIS-NN C APIs, which eventually become part of the MLF. For this integration, we are heavily dependent on TVM's infrastructure for external code generation.

If a user runs tvmc, they will get an MLF format archive which calls out to the CMSIS-NN operators.

```
tvmc --target=c,cmsisnn --output-format=mlf --executor=aot
```


# Reference-level explanation

We will enable this integration by considering TFLite networks, but it is equally applicable to all other networks that can be translated into Relay IR. A TFLite test that contains just a quantized (int8) softmax is first converted by the TFLite frontend into the following sequence of Relay operations: *dequantize -> softmax -> quantize*. Please refer to the code snippet below.

```python
def @main(%a: Tensor[(1, 16, 16, 3), int8]) -> Tensor[(1, 16, 16, 3), int8] {
  %0 = qnn.dequantize(%a, 0.02f /* ty=float32 */, 64 /* ty=int32 */) /* ty=Tensor[(1, 16, 16, 3), float32] */;
  %1 = nn.softmax(%0) /* ty=Tensor[(1, 16, 16, 3), float32] */;
  qnn.quantize(%1, 0.02f /* ty=float32 */, 64 /* ty=int32 */, out_dtype="int8") /* ty=Tensor[(1, 16, 16, 3), int8] */
}
```
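For unit testing it can be handy to build an equivalent module directly in Relay, without going through the TFLite frontend. A minimal sketch of how that could be done; the helper name, and the reuse of a single scale/zero-point pair for input and output, are illustrative rather than part of this RFC:

```python
import tvm
from tvm import relay


def make_qnn_softmax_module():
    # int8 input matching the snippet above
    a = relay.var("a", shape=(1, 16, 16, 3), dtype="int8")
    scale = relay.const(0.02, "float32")
    zero_point = relay.const(64, "int32")
    x = relay.qnn.op.dequantize(a, scale, zero_point)
    x = relay.nn.softmax(x)
    x = relay.qnn.op.quantize(x, scale, zero_point, out_dtype="int8")
    return tvm.IRModule.from_expr(relay.Function([a], x))
```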
The following code block shows the result of graph partitioning for the cmsisnn target.

```python
def @main(%a: Tensor[(1, 16, 16, 3), int8]) -> Tensor[(1, 16, 16, 3), int8] {
  @tvmgen_default_cmsisnn_0(%a) /* ty=Tensor[(1, 16, 16, 3), int8] */
}

def @tvmgen_default_cmsisnn_0(%cmsisnn_0_i0: Tensor[(1, 16, 16, 3), int8], Inline=1, Compiler="cmsisnn", global_symbol="tvmgen_default_cmsisnn_0", Primitive=1) -> Tensor[(1, 16, 16, 3), int8] {
  %2 = fn (%FunctionVar_0_0: Tensor[(1, 16, 16, 3), int8], PartitionedFromPattern="qnn.dequantize_nn.softmax_qnn.quantize_", Composite="cmsisnn.qnn_softmax") -> Tensor[(1, 16, 16, 3), int8] {
    %0 = qnn.dequantize(%FunctionVar_0_0, 0.02f /* ty=float32 */, 64 /* ty=int32 */) /* ty=Tensor[(1, 16, 16, 3), float32] */;
    %1 = nn.softmax(%0) /* ty=Tensor[(1, 16, 16, 3), float32] */;
    qnn.quantize(%1, 0.02f /* ty=float32 */, 64 /* ty=int32 */, out_dtype="int8") /* ty=Tensor[(1, 16, 16, 3), int8] */
  };
  %2(%cmsisnn_0_i0) /* ty=Tensor[(1, 16, 16, 3), int8] */
}
```

Target hooks for `relay_to_tir`, implemented as part of https://github.com/apache/tvm-rfcs/pull/10, are used to obtain the following *tir* for the graph with softmax. These hooks provide us with the flexibility to reuse memory planning and much of TVM's code generation capabilities.

```python
primfn(placeholder_1: handle, out_write_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {placeholder: Buffer(placeholder_1: Pointer(int8), int8, [1, 300, 300, 3], []),
             out_write: Buffer(out_write_1: Pointer(int8), int8, [1, 300, 300, 3], [])}
  buffer_map = {placeholder_1: placeholder_1, out_write_1: out_write_1} {
  ...
  allocate(placeholder.d.global, uint8, [1,300,300,3]) {
    @tir.call_extern("cmsisnn_softmax_s8", ..., dtype=handle)
  }
}
```

Finally, the code generator identifies the extern_call and generates code for softmax using the CMSIS-NN int8 softmax API.

For more complex operations, the CMSIS-NN structures will need to be used. For this purpose, `tir_to_runtime` will be used to extend the existing C codegen and produce C code with the appropriate headers and calling patterns. Please refer to the [Additional Target Hooks RFC](https://github.com/apache/tvm-rfcs/pull/10). An abridged look at these structures follows.
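To give a flavour of what those structures look like, below are two of them, abridged from `arm_nn_types.h` in the CMSIS-NN source; see that header for the authoritative definitions:

```c
// Abridged from CMSIS-NN's arm_nn_types.h, for illustration only.
#include <stdint.h>

typedef struct
{
    void *buf;     // scratch buffer required by some kernels
    int32_t size;  // size of the scratch buffer in bytes
} cmsis_nn_context;

typedef struct
{
    int32_t n;     // batch
    int32_t h;     // height
    int32_t w;     // width
    int32_t c;     // channels
} cmsis_nn_dims;
```

Operators such as convolution take pointers to structures like these alongside the tensor data, which is why a plain extern call is no longer sufficient and `tir_to_runtime` is needed to emit the set-up code.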
# Testing

As we introduce the operators, we will keep adding individual unit tests. Once the operator support is partially complete, we will start adding network tests. We are planning to use the [Arm® Corstone™-300 Fixed Virtual Platform](https://developer.arm.com/ip-products/subsystem/corstone/corstone-300) to run these tests in the CI. Reference: [Arm Ethos-U Integration](https://github.com/apache/tvm-rfcs/pull/11/files)

# Drawbacks

CMSIS-NN APIs provide hand-coded kernels. Therefore, code generation skips the auto-tuning capabilities of TVM. In the future, we wish to make use of the full power of TVM's auto-scheduling.

# Upstreaming Plan

Before adding other operators from CMSIS-NN, the integration will be enabled only for softmax.

P1: Graph partitioner for CMSIS-NN target
P2: Code generation using existing BYOC
P3: tvmc support to generate code for CMSIS-NN
P4: Move this implementation using `tir_to_runtime` from target hooks
P5: Use of CMSIS-NN data structures while supporting depthwise convolution
P6: Support for Convolution
P7: Support for Fully Connected
P8: Support for Max Pooling
P9: Support for Avg Pooling
P10: Support for MatMul


# Prior art

CMSIS-NN integration into TVM builds on top of ACL's integration into TVM. The existing BYOC infrastructure allows graph partitioning to detach operators, or chains of operators, into a separate subgraph that can then be compiled for Cortex-M.

Reference: [Arm Compute Lib](https://tvm.apache.org/docs/deploy/arm_compute_lib.html)

Code generation for CMSIS-NN will use the newly introduced target hooks.

Reference: [Additional Target Hooks](https://github.com/apache/tvm-rfcs/pull/10/files)

From ca14511ed2a02b69f7627dfa58b3fcadc2179c41 Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Wed, 4 Aug 2021 12:20:21 +0100
Subject: [PATCH 2/9] Title changed to use of CMSIS-NN with TVM

Change-Id: I6142c001175cdf41c58b5bb555a39e07c834254f
---
 rfcs/0013_Arm_CMSIS-NN_Integration.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/rfcs/0013_Arm_CMSIS-NN_Integration.md b/rfcs/0013_Arm_CMSIS-NN_Integration.md
index 8421a776..d2bbb3d1 100644
--- a/rfcs/0013_Arm_CMSIS-NN_Integration.md
+++ b/rfcs/0013_Arm_CMSIS-NN_Integration.md
@@ -1,7 +1,7 @@
-- Feature Name: Arm(R) CMSIS-NN Integration for Cortex-M
+- Feature Name: [RFC] Use CMSIS-NN with TVM
 - Start Date: July 2021
-- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
-- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- RFC PR: https://github.com/apache/tvm-rfcs/pull/15
+- GitHub Issue: https://github.com/apache/tvm/issues/8646

 # Summary

From 1948c0a2c6168b8992943c8d0810395e2b725498 Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Fri, 6 Aug 2021 11:41:44 +0100
Subject: [PATCH 3/9] Added acronyms and fixed few spellings

Change-Id: Id63d1866cd783f5e59b568f36c9177ee8715bc4d
---
 rfcs/0013_Arm_CMSIS-NN_Integration.md | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/rfcs/0013_Arm_CMSIS-NN_Integration.md b/rfcs/0013_Arm_CMSIS-NN_Integration.md
index d2bbb3d1..616fadfa 100644
--- a/rfcs/0013_Arm_CMSIS-NN_Integration.md
+++ b/rfcs/0013_Arm_CMSIS-NN_Integration.md

# Acronyms
CMSIS: Common Microcontroller Software Interface Standard
ACL: The Compute Library for the Arm® Architecture
MLF: Model Library Format

# Summary

This RFC introduces a plan for integrating the CMSIS-NN library into TVM. The library consists of efficient kernels targeted at Arm's Cortex-M architecture.

Please refer to the following pages for more details on CMSIS-NN.
https://arm-software.github.io/CMSIS_5/NN/html/index.html
https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN

The first PR in the series of PRs to fulfill this integration would be a graph partitioner for int8 softmax. A detailed plan can be found below in this RFC.

# Motivation

The CMSIS-NN library consists of hand-tuned kernels that are suitable for Cortex-M and are compliant with the quantization scheme used in TensorFlow Lite. They have been optimized for better performance and the small memory footprint required on these embedded devices, so it makes sense for TVM to reuse them when generating code for Cortex-M. They have already been integrated with the TensorFlow Lite Micro project.
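To make "compliant with the quantization scheme" concrete: TFLite int8 models use an affine mapping between real values and stored integers, and both the Relay QNN operators and the CMSIS-NN kernels have to agree on its parameters. As a reminder (this is the standard TFLite scheme, not new material in this RFC):

```latex
r \approx S \cdot (q - Z), \qquad q \in [-128, 127]
```

where r is the real value, q the stored int8 value, S the scale and Z the zero point. In the Relay snippets in this RFC, S = 0.02 and Z = 64.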
# Guide-level explanation

TVM's external code generation infrastructure allows for automatic partitioning and code generation using an external compiler. Partitioned subgraphs containing operator(s) targeted for Cortex-M can then be translated into the CMSIS-NN C APIs, which eventually become part of the MLF. For this integration, we are heavily dependent on TVM's infrastructure for external code generation.

If a user runs tvmc, they will get an MLF format archive which calls out to the CMSIS-NN operators.

```
tvmc --target=cmsisnn,c --output-format=mlf --executor=aot
```

Reference: [ACL](https://tvm.apache.org/docs/deploy/arm_compute_lib.html)

From a10108e527937078df9c7b155a3ebd97a28392b4 Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Mon, 9 Aug 2021 11:02:39 +0100
Subject: [PATCH 4/9] Changed name of the markdown to match PR number

Change-Id: Id54bae4bd2ca4bd9c3ab734e8cae966ebbe332b2
---
 ...m_CMSIS-NN_Integration.md => 0015_Arm_CMSIS-NN_Integration.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename rfcs/{0013_Arm_CMSIS-NN_Integration.md => 0015_Arm_CMSIS-NN_Integration.md} (100%)

diff --git a/rfcs/0013_Arm_CMSIS-NN_Integration.md b/rfcs/0015_Arm_CMSIS-NN_Integration.md
similarity index 100%
rename from rfcs/0013_Arm_CMSIS-NN_Integration.md
rename to rfcs/0015_Arm_CMSIS-NN_Integration.md

From 03ce0e5c2c2e0314591343baa6ec7b548dda94d7 Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Tue, 10 Aug 2021 13:46:11 +0100
Subject: [PATCH 5/9] Cody's comments about python APIs and config.cmake

Change-Id: I56a5f9bf319576d342a5bdc3771402262584e8c4
---
 rfcs/0015_Arm_CMSIS-NN_Integration.md | 32 ++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/rfcs/0015_Arm_CMSIS-NN_Integration.md b/rfcs/0015_Arm_CMSIS-NN_Integration.md
index 616fadfa..5c4fd006 100644
--- a/rfcs/0015_Arm_CMSIS-NN_Integration.md
+++ b/rfcs/0015_Arm_CMSIS-NN_Integration.md

Here is the API to obtain the partitioned function aimed at CMSIS-NN.

```python
    # API to call CMSIS-NN partitioning
    from tvm.relay.op.contrib import cmsisnn

    # Here, module is the Relay module to be partitioned
    cmsisnn_module = cmsisnn.partition_for_cmsisnn(module)
```

The following code block shows the resultant IRModule.

```python
def @main(%a: Tensor[(1, 16, 16, 3), int8]) -> Tensor[(1, 16, 16, 3), int8] {
  @tvmgen_default_cmsisnn_0(%a) /* ty=Tensor[(1, 16, 16, 3), int8] */
}

def @tvmgen_default_cmsisnn_0(%cmsisnn_0_i0: Tensor[(1, 16, 16, 3), int8], Inline=1, Compiler="cmsisnn", global_symbol="tvmgen_default_cmsisnn_0", Primitive=1) -> Tensor[(1, 16, 16, 3), int8] {
  %2 = fn (%FunctionVar_0_0: Tensor[(1, 16, 16, 3), int8], PartitionedFromPattern="qnn.dequantize_nn.softmax_qnn.quantize_", Composite="cmsisnn.qnn_softmax") -> Tensor[(1, 16, 16, 3), int8] {
    %0 = qnn.dequantize(%FunctionVar_0_0, 0.02f /* ty=float32 */, 64 /* ty=int32 */) /* ty=Tensor[(1, 16, 16, 3), float32] */;
    %1 = nn.softmax(%0) /* ty=Tensor[(1, 16, 16, 3), float32] */;
    qnn.quantize(%1, 0.02f /* ty=float32 */, 64 /* ty=int32 */, out_dtype="int8") /* ty=Tensor[(1, 16, 16, 3), int8] */
  };
  %2(%cmsisnn_0_i0) /* ty=Tensor[(1, 16, 16, 3), int8] */
}
```
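As an aside, such a `partition_for_cmsisnn` helper is usually just a composition of the stock BYOC passes. A minimal sketch, assuming a pattern table registered under the name "cmsisnn"; this is illustrative, not necessarily the final implementation:

```python
import tvm
from tvm.relay import transform
from tvm.relay.build_module import bind_params_by_name
from tvm.relay.op.contrib.register import get_pattern_table


def partition_for_cmsisnn(mod, params=None):
    """Sketch: pull supported subgraphs out into functions annotated
    with Compiler="cmsisnn" so the external codegen can claim them."""
    if params:
        mod["main"] = bind_params_by_name(mod["main"], params)
    seq = tvm.transform.Sequential(
        [
            transform.InferType(),
            transform.MergeComposite(get_pattern_table("cmsisnn")),
            transform.AnnotateTarget("cmsisnn"),
            transform.PartitionGraph(),
        ]
    )
    return seq(mod)
```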
The partitioned function above is presented to the CMSIS-NN external code generator for *tir* generation using TVM's build() API.

```python
    # Invoke the AOT compiler to get the MLF containing CMSIS-NN APIs
    with tvm.target.Target("c -runtime=c --link-params -mcpu=cortex-m55 --executor=aot --unpacked-api=1"):
        factory = tvm.relay.build(cmsisnn_module)
```

The intermediate *tir* looks like this:

```python
primfn(placeholder_1: handle, out_write_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {placeholder: Buffer(placeholder_1: Pointer(int8), int8, [1, 300, 300, 3], []),
             out_write: Buffer(out_write_1: Pointer(int8), int8, [1, 300, 300, 3], [])}
  buffer_map = {placeholder_1: placeholder_1, out_write_1: out_write_1} {
  ...
  allocate(placeholder.d.global, uint8, [1,300,300,3]) {
    @tir.call_extern("cmsisnn_softmax_s8", ..., dtype=handle)
  }
}
```

In the future, target hooks for `relay_to_tir`, implemented as part of [Additional Target Hooks](https://github.com/apache/tvm-rfcs/pull/10), will be used to obtain the above *tir* for the graph with softmax. These hooks provide us with the flexibility to reuse memory planning and much of TVM's code generation capabilities.

Finally, the code generator identifies the *tir* extern_call(s) and generates C code for softmax using the CMSIS-NN int8 softmax API.

Note: There are no changes required in config.cmake, as the CMSIS-NN APIs corresponding to the operators are hard coded. The testing infrastructure links them to the CMSIS-NN library. Execution of the networks works similarly to what has been described in [Arm Ethos-U Integration](https://github.com/apache/tvm-rfcs/pull/11).

For more complex operations, the CMSIS-NN structures will need to be used. For this purpose, `tir_to_runtime` will be used to extend the existing C codegen and produce C code with the appropriate headers and calling patterns. Please refer to [Additional Target Hooks](https://github.com/apache/tvm-rfcs/pull/10).


# Testing

As we introduce the operators, we will keep adding individual unit tests. Once the operator support is partially complete, we will start adding network tests. We are planning to use the [Arm® Corstone™-300 Fixed Virtual Platform](https://developer.arm.com/ip-products/subsystem/corstone/corstone-300) to run these tests in the CI. Reference: [Arm Ethos-U Integration](https://github.com/apache/tvm-rfcs/pull/11/files). In the absence of that platform, tests that check the correctness of the generated *tir* can be added.
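As an illustration, a first partitioning unit test could look roughly like this, reusing the hypothetical `make_qnn_softmax_module` helper sketched earlier; the assertion style is only a suggestion:

```python
from tvm.relay.op.contrib import cmsisnn


def test_softmax_is_offloaded_to_cmsisnn():
    mod = make_qnn_softmax_module()
    partitioned = cmsisnn.partition_for_cmsisnn(mod)
    # After partitioning, the softmax chain should live in a global
    # function destined for the "cmsisnn" external compiler.
    assert any("cmsisnn" in gv.name_hint for gv in partitioned.get_global_vars())
```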
# Drawbacks

CMSIS-NN APIs provide hand-coded kernels. Therefore, code generation skips the auto-tuning capabilities of TVM. In the future, we wish to make use of the full power of TVM's auto-scheduling.


# Upstreaming Plan

Before adding other operators from CMSIS-NN, the integration will be enabled only for softmax.

P1: Graph partitioner for CMSIS-NN target
P2: Code generation using existing BYOC
P3: tvmc support to generate code for CMSIS-NN
P4: Move this implementation using `tir_to_runtime` from target hooks
P5: Use of CMSIS-NN data structures while supporting depthwise convolution
P6: Support for Convolution
P7: Support for Fully Connected
P8: Support for Max Pooling
P9: Support for Avg Pooling
P10: Support for MatMul

From a53d45103b0fefa4a48d9a5772a682d9382f4a5f Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Tue, 10 Aug 2021 15:19:58 +0100
Subject: [PATCH 6/9] Andrew's comments: more details about CMSIS-NN ops and
 fixed mistakes with some terminologies

Change-Id: I002be9cc67b72444ea27fe0a31769549fb6fd452
---
 rfcs/0015_Arm_CMSIS-NN_Integration.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/rfcs/0015_Arm_CMSIS-NN_Integration.md b/rfcs/0015_Arm_CMSIS-NN_Integration.md
index 5c4fd006..5281befe 100644
--- a/rfcs/0015_Arm_CMSIS-NN_Integration.md
+++ b/rfcs/0015_Arm_CMSIS-NN_Integration.md

# Acronyms
CMSIS: Common Microcontroller Software Interface Standard
ACL: The Compute Library for the Arm® Architecture
MLF: Model Library Format
FVP: Arm® Corstone™-300 Fixed Virtual Platform

# Summary

This RFC introduces a plan for integrating the CMSIS-NN library into TVM. The library consists of efficient kernels targeted at Arm's Cortex-M architecture.

Please refer to the following pages for more details on CMSIS-NN.
* https://arm-software.github.io/CMSIS_5/NN/html/index.html
* https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN

# Guide-level explanation

TVM's BYOC infrastructure allows for partitioning and code generation using an external compiler. Partitioned subgraphs containing operator(s) targeted for Cortex-M can then be translated into the CMSIS-NN C APIs, which eventually become part of the MLF.

# Reference-level explanation

We will enable this integration by considering TFLite networks, but it is equally applicable to all other networks that can be translated into Relay IR. A TFLite test that contains just a quantized (int8) softmax is first converted by the TFLite frontend into the following sequence of Relay operations: *dequantize -> softmax -> quantize*. Please refer to the Relay code snippet below, obtained from the TFLite frontend.

Note: There are no changes required in config.cmake, as the CMSIS-NN APIs corresponding to the operators are hard coded.
The testing infrastructure links them to the CMSIS-NN library. Execution of the networks works similarly to what has been described in [Arm Ethos-U Integration](https://github.com/apache/tvm-rfcs/pull/11).

Once the entire infrastructure for CMSIS-NN mapping is in place using the softmax API, we will gradually add more complex operations, such as depthwise convolution and pooling, to both the graph partitioning and the code generation infrastructure.

# Testing

As we introduce the operators, we will keep adding individual unit tests. Once the operator support is partially complete, we will start adding network tests. We are planning to use the [Arm® Corstone™-300 Fixed Virtual Platform](https://developer.arm.com/ip-products/subsystem/corstone/corstone-300) to run these tests in the CI. Reference: [Arm Ethos-U Integration](https://github.com/apache/tvm-rfcs/pull/11/files). A unit test will provide two kinds of checks: one around the correctness of the partitioned function, and the other around the validity of the output from Corstone-300 against native TVM output.

In case the above infrastructure is not available, we can test the correctness of the *tir* in the unit tests.

# Drawbacks

CMSIS-NN APIs provide hand-coded kernels. Therefore, code generation skips the auto-tuning capabilities of TVM. In the future, we wish to make use of the full power of TVM's auto-scheduling.

# Upstreaming Plan

Before adding other operators from CMSIS-NN, the integration will be enabled only for softmax.

P1: Graph partitioner for CMSIS-NN target for Softmax
P2: Code generation using existing BYOC
P3: tvmc support to generate code for CMSIS-NN
P4: Move this implementation using `tir_to_runtime` from target hooks
P5: Use of CMSIS-NN data structures while supporting depthwise convolution
P6: Support for Convolution
P7: Support for Fully Connected
P8: Support for Max Pooling
P9: Support for Avg Pooling
P10: Support for MatMul

From 6dcdcb1b1d296fb27ea28208c8093f2525ab995d Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Mon, 16 Aug 2021 16:43:02 +0100
Subject: [PATCH 7/9] Andrew's comments II: restructuring testing, guide level
 explanations

---
 rfcs/0015_Arm_CMSIS-NN_Integration.md | 116 +++++++++++++++++++++-----
 1 file changed, 96 insertions(+), 20 deletions(-)

diff --git a/rfcs/0015_Arm_CMSIS-NN_Integration.md b/rfcs/0015_Arm_CMSIS-NN_Integration.md
index 5281befe..f44c3b05 100644
--- a/rfcs/0015_Arm_CMSIS-NN_Integration.md
+++ b/rfcs/0015_Arm_CMSIS-NN_Integration.md

# Acronyms
CMSIS: Common Microcontroller Software Interface Standard
ACL: The Compute Library for the Arm® Architecture
MLF: Model Library Format
Cortex-M: Arm® Cortex®-M processor

# Summary

This RFC introduces a plan for integrating the CMSIS-NN library into TVM. The library consists of efficient kernels targeted at the Cortex-M architecture.
Please refer to the following pages for more details on CMSIS-NN.
* [CMSIS-NN user manual](https://arm-software.github.io/CMSIS_5/NN/html/index.html)
* [CMSIS-NN source on GitHub](https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN)

The first PR in the series of PRs to fulfill this integration would be a graph partitioner for int8 softmax. A detailed plan can be found below in this RFC.


# Motivation

The CMSIS-NN library consists of hand-tuned kernels that are suitable for Cortex-M and are compliant with the quantization scheme used in TensorFlow Lite. They have been optimized for better performance and the small memory footprint required on these embedded devices, so it makes sense for TVM to reuse them when generating code for Cortex-M. They have already been integrated with the TensorFlow Lite Micro project. In this work, we plan to map TFLite operators to the existing CMSIS-NN APIs without performing any intermediate Relay-level translations.


# Guide-level explanation

We will enable this integration by considering TFLite networks, but it is equally applicable to all other networks that can be translated into Relay IR.

TVM's BYOC infrastructure allows for partitioning and code generation using an external compiler. Partitioned subgraphs containing operator(s) targeted for Cortex-M can then be translated into the CMSIS-NN C APIs, which eventually become part of the MLF.

If a user runs tvmc, they will get an MLF format archive which calls out to the CMSIS-NN operators. The source for CMSIS-NN is not included in the MLF. Also, the support should remain up to date as the library evolves, since we expect minimal changes to the CMSIS-NN API interface. Source code from GitHub will be used for linking against the MLF by the test setup that allows execution on Cortex-M.

```
tvmc --target=cmsisnn,c --output-format=mlf --executor=aot
```

In the absence of tvmc support, the following Python APIs can be used to generate the C code; eventually, tvmc will support CMSIS-NN as shown above.

```python
    import tvm
    from tvm.relay.op.contrib import cmsisnn

    # API to call CMSIS-NN partitioning; here, module is the Relay module
    cmsisnn_module = cmsisnn.partition_for_cmsisnn(module)

    # Invoke the AOT compiler to get the MLF containing CMSIS-NN APIs
    with tvm.target.Target("c -runtime=c --link-params -mcpu=cortex-m55 --executor=aot --unpacked-api=1"):
        factory = tvm.relay.build(cmsisnn_module)
```
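To actually obtain the MLF archive mentioned above from the `factory` object, the Model Library Format exporter can be used. A small sketch; the output file name is arbitrary:

```python
from tvm import micro

# Writes a Model Library Format archive; the generated C sources inside
# it call out to the CMSIS-NN APIs.
micro.export_model_library_format(factory, "cmsisnn_module.tar")
```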
+This section details how TFLite softmax int8 is converted into the C code. TFLite frontend first translates softmax int8 into the following sequence of relay operations: *dequantize -> softmax -> quantize*. Please refer to the relay code snippet below obtained from TFLite frontend. ```python def @main(%a: Tensor[(1, 16, 16, 3), int8]) -> Tensor[(1, 16, 16, 3), int8] { @@ -57,7 +72,30 @@ Here is the API to obtain the partitioned function aimed at CMSIS-NN. cmsisnn_module = cmsisnn.partition_for_cmsisnn(module) ``` -Following code block shows the resultant IRModule. +The API for partitioning will work through the pattern matching tables for CMSIS-NN which will look like the below snippet. It will include support for operators: Convolution, Depthwise convolution, Fully Connected, Pooling and MatMul. + +```python + +@register_pattern_table("cmsisnn") +def pattern_table(): + """Get the cmsisnn compiler pattern table.""" + + def softmax_pattern(): + pattern = is_op("qnn.dequantize")(wildcard(), is_constant(), is_constant()) + pattern = is_op("nn.softmax")(pattern) + pattern = is_op("qnn.quantize")(pattern, is_constant(), is_constant()) + return pattern + + def check_quantized_softmax(extract): + ... + + return [ + ("cmsisnn.qnn_softmax", softmax_pattern(), check_quantized_softmax), + ] + +``` + +Following code block shows the resultant IRModule post partitioning. ```python def @main(%a: Tensor[(1, 16, 16, 3), int8]) -> Tensor[(1, 16, 16, 3), int8] { @@ -74,7 +112,7 @@ def @tvmgen_default_cmsisnn_0(%cmsisnn_0_i0: Tensor[(1, 16, 16, 3), int8], Inlin } ``` -Above partitioned function is presented to the CMSIS-NN external code generator for *tir* generation using the TVM's build() API. +Above partitioned function is presented to the CMSIS-NN external code generator for TIR generation using the TVM's build() API. ```python # Invoke AOT compiler to get the MLF containing CMSIS-NN APIs @@ -83,7 +121,7 @@ Above partitioned function is presented to the CMSIS-NN external code generator ``` -Intermediate *tir* looks like this: +Resultant TIR looks like this: ```python primfn(placeholder_1: handle, out_write_1: handle) -> () @@ -97,34 +135,72 @@ primfn(placeholder_1: handle, out_write_1: handle) -> () } } ``` -In future, target hooks for `relay_to_tir` implemented as part of [Additional Target Hooks] (https://github.com/apache/tvm-rfcs/pull/10) will be used to obtain the above tir for graph with softmax. These hooks provide us with the flexibility to reuse memory planning and much of the TVM's code generation capabilities. +In future, target hooks for `relay_to_tir` implemented as part of [Additional Target Hooks] (https://github.com/apache/tvm-rfcs/pull/10) will be used to obtain the above TIR and it will be returned to the compilation pipeline. These hooks provide us with the flexibility to reuse memory planning and much of the TVM's code generation capabilities. -At last, code generator identifies the *tir* extern_call(s) and generates *c* code for softmax with the CMSIS-NN API for softmax int8. +At last, code generator identifies the TIR extern_call(s) and generates C code for softmax with the CMSIS-NN API for softmax int8. Both TIR and C are generated when function registered through `tvm._ffi.register_func("relay.ext.cmsisnn")` is invoked. -Note: There are no changes required in config.cmake as the CMSIS-NN APIs corresponding to the operators are hard coded. The testing infrastructure links them to the CMSIS-NN library. 
Note: CMSIS-NN APIs for each operator are hard-coded into the generated C file. The C code generator can be excluded from the build by setting USE_CMSISNN to OFF in config.cmake. In order to link the generated C file to the CMSIS-NN library, the Ethos-U test runner infrastructure is used, as described in [Arm Ethos-U Integration](https://github.com/apache/tvm-rfcs/pull/11).

Once the entire infrastructure for CMSIS-NN mapping is in place using the softmax API, we will gradually add more complex operations, such as depthwise convolution and pooling, to both the graph partitioning and the code generation infrastructure.


# Testing

Unit tests will be added alongside operator support. Once the operator support matures, we will add network tests.

The unit tests will be of two kinds:

* Matching of the operator patterns used by the graph partitioner.
  * This will be done both for each operator individually and for combinations of operators.
* Correctness of the CMSIS-NN operators against native TVM output.
  * The actual output can be generated using the [Corstone-300 reference system](https://github.com/apache/tvm-rfcs/pull/11).
  * In case the reference system is unavailable, checks will be added for the correctness of the TIR.


# Drawbacks

CMSIS-NN APIs provide hand-coded kernels. Therefore, code generation skips the auto-tuning capabilities of TVM. In the future, we wish to make use of the full power of TVM's auto-scheduling.


# Upstreaming Plan

Before adding other operators from CMSIS-NN, the integration will be enabled only for softmax.
P1: Graph partitioner for CMSIS-NN target for softmax
P2: Softmax code generation using existing BYOC
P3: tvmc support to generate code for CMSIS-NN
P4: Use of CMSIS-NN data structures while supporting depthwise convolution
P5: Use target hooks for CMSIS-NN code generation: [Additional Target Hooks](https://github.com/apache/tvm-rfcs/pull/10/files)
P6: Support for Convolution
P7: Support for Fully Connected
P8: Support for Max Pooling
P9: Support for Avg Pooling
P10: Support for MatMul


# Prior art

CMSIS-NN integration into TVM builds on top of ACL's integration into TVM. The existing BYOC infrastructure allows graph partitioning to detach operators, or chains of operators, into a separate subgraph that can then be compiled for Cortex-M.

Reference: [ACL](https://tvm.apache.org/docs/deploy/arm_compute_lib.html)

Eventually, code generation for CMSIS-NN will use the newly introduced target hooks.

Reference: [Additional Target Hooks](https://github.com/apache/tvm-rfcs/pull/10/files)

From 6a3517b84bb658b92f0244d52156122d60ddd2e9 Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Mon, 16 Aug 2021 16:47:24 +0100
Subject: [PATCH 8/9] Upstreaming plan misses line separator

---
 rfcs/0015_Arm_CMSIS-NN_Integration.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/rfcs/0015_Arm_CMSIS-NN_Integration.md b/rfcs/0015_Arm_CMSIS-NN_Integration.md
index f44c3b05..5f1be953 100644
--- a/rfcs/0015_Arm_CMSIS-NN_Integration.md
+++ b/rfcs/0015_Arm_CMSIS-NN_Integration.md
@@ -196,16 +196,16 @@ CMSIS-NN APIs provide hand coded kernels. Therefore, code generation skips the a

Before adding other operators from CMSIS-NN, the integration will be enabled only for softmax.

P1: Graph partitioner for CMSIS-NN target for softmax
P2: Softmax code generation using existing BYOC
P3: tvmc support to generate code for CMSIS-NN
P4: Use of CMSIS-NN data structures while supporting depthwise convolution
P5: Use target hooks for CMSIS-NN code generation: [Additional Target Hooks](https://github.com/apache/tvm-rfcs/pull/10/files)
P6: Support for Convolution
P7: Support for Fully Connected
P8: Support for Max Pooling
P9: Support for Avg Pooling
P10: Support for MatMul

# Prior art

From 203cf3235c27cadea4d71cfb0d164fefe4676f02 Mon Sep 17 00:00:00 2001
From: Ashutosh Parkhi
Date: Mon, 16 Aug 2021 16:49:01 +0100
Subject: [PATCH 9/9] Upstreaming plan misses line separator

---
 rfcs/0015_Arm_CMSIS-NN_Integration.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/rfcs/0015_Arm_CMSIS-NN_Integration.md b/rfcs/0015_Arm_CMSIS-NN_Integration.md
index 5f1be953..71fa4944 100644
--- a/rfcs/0015_Arm_CMSIS-NN_Integration.md
+++ b/rfcs/0015_Arm_CMSIS-NN_Integration.md
@@ -196,16 +196,16 @@ CMSIS-NN APIs provide hand coded kernels. Therefore, code generation skips the a

Before adding other operators from CMSIS-NN, the integration will be enabled only for softmax.
* P1: Graph partitioner for CMSIS-NN target for softmax
* P2: Softmax code generation using existing BYOC
* P3: tvmc support to generate code for CMSIS-NN
* P4: Use of CMSIS-NN data structures while supporting depthwise convolution
* P5: Use target hooks for CMSIS-NN code generation: [Additional Target Hooks](https://github.com/apache/tvm-rfcs/pull/10/files)
* P6: Support for Convolution
* P7: Support for Fully Connected
* P8: Support for Max Pooling
* P9: Support for Avg Pooling
* P10: Support for MatMul


# Prior art