From 611568e2926a35a9dc4f414f53454d8325502e7c Mon Sep 17 00:00:00 2001
From: zhuwenxi
Date: Thu, 16 Dec 2021 20:24:47 +0800
Subject: [PATCH 1/8] RFC to integrate LIBXSMM with TVM.

---
 rfcs/0046-Intel-LIBXSMM-integration.md | 97 ++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 rfcs/0046-Intel-LIBXSMM-integration.md

diff --git a/rfcs/0046-Intel-LIBXSMM-integration.md b/rfcs/0046-Intel-LIBXSMM-integration.md
new file mode 100644
index 00000000..34e8bef9
--- /dev/null
+++ b/rfcs/0046-Intel-LIBXSMM-integration.md
@@ -0,0 +1,97 @@
+# Summary
+This RFC introduces the plan of integrating LIBXSMM into TVM. LIBXSMM leverages a JIT code generator to produce highly efficient kernels targeting x86 architectures.
+
+For details of LIBXSMM, please refer to:
+* [LIBXSMM User Manual](https://libxsmm.readthedocs.io/en/latest/)
+* [LIBXSMM github repo](https://github.com/hfp/libxsmm)
+
+# Motivation
+TVM has shown satisfactory performance on MLP models on CPU. However, there are still some defects in the assembly code generated by LLVM which block AutoTVM/AutoScheduler from achieving optimal performance on GEMM.
+
+LIBXSMM is an open-source library developed by Intel Labs for accelerating small matrix multiplication. It leverages a JIT code generator to produce highly efficient GEMM kernels for x86 CPUs, which can come very close to the hardware roofline. According to our evaluation, on “small” GEMM (cube_root(m * n * k) <= 256), LIBXSMM shows superior performance over the well-known BLAS library Intel MKL.
+
+Moreover, given that LIBXSMM can generate very efficient GEMM kernel implementations, it is also an ideal substitute for the inner kernel of normal-sized GEMM. According to our experiments, the AutoTVM templates we wrote with LIBXSMM for register-block generation achieve much higher performance than MKL and the existing TOPI implementation.
+
+# Guide-level explanation
+This proposal aims to integrate LIBXSMM into TVM to accelerate small GEMM and to serve as the inner kernel for accelerating normal-sized GEMM.
+
+We will integrate LIBXSMM with TVM in the following 3 components:
+1. Add an extern call “tvm.contrib.libxsmm.gemm” in the “src/runtime/contrib” directory, and a corresponding Python interface in the "python/tvm/contrib/" directory, so users can call them just as they call CBLAS;
+2. Use BYOC to accelerate small GEMM (cube_root(m * n * k) <= 256) and its epilogue fusion variations (bias/relu/sigmoid/bias_relu/bias_sigmoid);
+3. Add the AutoTVM template we wrote with LIBXSMM as the inner kernel into TOPI, as a GEMM implementation candidate.
+
+# Reference-level explanation
+1. Users can call libxsmm as CBLAS through extern call API.
+```
+  def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs):
+    n = lhs.shape[1] if transa else lhs.shape[0]
+    m = rhs.shape[0] if transb else rhs.shape[1]
+    return te.extern(
+      (n, m),
+      [lhs, rhs],
+      lambda ins, outs: tvm.tir.call_packed(
+        "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc),
+      name="C",
+      **kwargs,
+    )
+```
+2. BYOC allows for graph partitioning and using LIBXSMM for code generation.
+ * API to obtain the partitioned function:
+```
+ from tvm.relay.op.contrib import libxsmm
+
+ # API to call LIBXSMM partitioning
+ libxsmm_module = libxsmm.partition_for_cmsisnn(module)
+```
+ * Pattern matching table:
+```
+@register_pattern_table("libxsmm")
+def pattern_table():
+    dense_pattern = ("libxsmm.dense", make_pattern(with_bias=False, with_activation=None))
+    denese_bias_pattern = ("libxsmm.dense_bias", make_pattern(with_bias=True, with_activation=None))
+    denese_relu_pattern = ("libxsmm.dense_relu", make_pattern(with_bias=False, with_activation="relu"))
+    denese_sigmoid_pattern = ("libxsmm.dense_sigmoid", make_pattern(with_bias=False, with_activation="sigmoid"))
+    denese_bias_relu = ("libxsmm.dense_bias_relu", make_pattern(with_bias=True, with_activation="relu"))
+    denese_bias_sigmoid = ("libxsmm.dense_bias_sigmoid", make_pattern(with_bias=True, with_activation="sigmoid"))
+    libxsmm_pattern = [dense_pattern, denese_bias_pattern, denese_relu_pattern, denese_sigmoid_pattern, denese_bias_relu, denese_bias_sigmoid]
+    return libxsmm_pattern
+```
+ * Build with TVM
+```
+with tvm.transform.PassContext(opt_level=3):
+  lib = relay.build(libxsmm_module, target="cpu", params=params)
+```
+3. Integrate into TOPI a GEMM AutoTVM template with LIBXSMM as the inner kernel.
+ * Use Tensorize/TensorIR to substitute register block of GEMM with LIBXSMM
+```
+def intrin_func(ins, outs):
+    def _body():
+        ib = tvm.tir.ir_builder.create()
+        ib.emit(
+            tvm.tir.call_extern(
+                "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 0.0, outs[0].access_ptr("w"), N
+            )
+        )
+        return ib.get()
+    def _update():
+        ib = tvm.tir.ir_builder.create()
+        ib.emit(
+            tvm.tir.call_extern(
+                "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 1.0, outs[0].access_ptr("w"), N
+            )
+        )
+        return ib.get()
+```
+
+# Testing
+We will add unit tests for the corresponding extern call, BYOC and TOPI related code:
+* Make sure the results LIBXSMM produces are consistent with their TVM counterparts;
+* Confirm the match patterns work as expected.
+
+# Drawbacks
+* Though LIBXSMM works well with AutoTVM, it does not help AutoScheduler;
+* The memory footprint will increase as JIT code is generated; an LRU kernel cache might be required to mitigate this.
+
+# Future possibilities
+* LIBXSMM has DNN support, so it might be interesting to also integrate DNN primitives such as Conv into TVM;
+* LIBXSMM has quantized (int8) kernels; we can also integrate them into TVM, as long as they surpass the existing oneDNN implementations.

From ffbd9213422bd0067ee5248a124bb576feba11ef Mon Sep 17 00:00:00 2001
From: zhuwenxi
Date: Fri, 17 Dec 2021 10:36:46 +0800
Subject: [PATCH 2/8] Fix indent.

---
 rfcs/0046-Intel-LIBXSMM-integration.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/rfcs/0046-Intel-LIBXSMM-integration.md b/rfcs/0046-Intel-LIBXSMM-integration.md
index 34e8bef9..2ac9d57b 100644
--- a/rfcs/0046-Intel-LIBXSMM-integration.md
+++ b/rfcs/0046-Intel-LIBXSMM-integration.md
@@ -23,17 +23,17 @@ We will integrate LIBXSMM with TVM in following 3 components:
 # Reference-level explanation
 1. Users can call libxsmm as CBLAS through extern call API.
``` - def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs): - n = lhs.shape[1] if transa else lhs.shape[0] - m = rhs.shape[0] if transb else rhs.shape[1] - return te.extern( - (n, m), - [lhs, rhs], - lambda ins, outs: tvm.tir.call_packed( - "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc), - name="C", - **kwargs, - ) + def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs): + n = lhs.shape[1] if transa else lhs.shape[0] + m = rhs.shape[0] if transb else rhs.shape[1] + return te.extern( + (n, m), + [lhs, rhs], + lambda ins, outs: tvm.tir.call_packed( + "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc), + name="C", + **kwargs, + ) ``` 2. BYOC allows for graph partitioning and using LIBXSMM for code generation. * API to obtain the partitioned function: From f76f46a73920209496cdf1b9277f0f6c61657956 Mon Sep 17 00:00:00 2001 From: zhuwenxi Date: Fri, 17 Dec 2021 10:39:46 +0800 Subject: [PATCH 3/8] Convert tab to space. --- rfcs/0046-Intel-LIBXSMM-integration.md | 38 +++++++++++++------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/rfcs/0046-Intel-LIBXSMM-integration.md b/rfcs/0046-Intel-LIBXSMM-integration.md index 2ac9d57b..71201799 100644 --- a/rfcs/0046-Intel-LIBXSMM-integration.md +++ b/rfcs/0046-Intel-LIBXSMM-integration.md @@ -23,27 +23,27 @@ We will integrate LIBXSMM with TVM in following 3 components: # Reference-level explanation 1. Users can call libxsmm as CBLAS through extern call API. ``` - def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs): - n = lhs.shape[1] if transa else lhs.shape[0] - m = rhs.shape[0] if transb else rhs.shape[1] - return te.extern( - (n, m), - [lhs, rhs], - lambda ins, outs: tvm.tir.call_packed( - "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc), - name="C", - **kwargs, - ) + def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs): + n = lhs.shape[1] if transa else lhs.shape[0] + m = rhs.shape[0] if transb else rhs.shape[1] + return te.extern( + (n, m), + [lhs, rhs], + lambda ins, outs: tvm.tir.call_packed( + "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc), + name="C", + **kwargs, + ) ``` 2. BYOC allows for graph partitioning and using LIBXSMM for code generation. - * API to obtain the partitioned function: + * API to obtain the partitioned function: ``` - from tvm.relay.op.contrib import libxsmm + from tvm.relay.op.contrib import libxsmm - # API to call LIBXSMM partitioning + # API to call LIBXSMM partitioning libxsmm_module = libxsmm.partition_for_cmsisnn(module) ``` - * Pattern matching table: + * Pattern matching table: ``` @register_pattern_table("libxsmm") def pattern_table(): @@ -56,16 +56,16 @@ def pattern_table(): libxsmm_pattern = [dense_pattern, denese_bias_pattern, denese_relu_pattern, denese_sigmoid_pattern, denese_bias_relu, denese_bias_sigmoid] return libxsmm_pattern ``` - * Build with TVM + * Build with TVM ``` with tvm.transform.PassContext(opt_level=3): - lib = relay.build(libxsmm_module, target="cpu", params=params) + lib = relay.build(libxsmm_module, target="cpu", params=params) ``` 3. Integrate into TOPI, an GEMM autotvm template with LIBXSMM as inner kernel. 
- * Use Tensorize/TensorIR to substitute register block of GEMM with LIBXSMM + * Use Tensorize/TensorIR to substitute register block of GEMM with LIBXSMM ``` def intrin_func(ins, outs): - def _body(): + def _body(): ib = tvm.tir.ir_builder.create() ib.emit( tvm.tir.call_extern( From c34e97fff808e059aaa00ff657e6087255b9e17b Mon Sep 17 00:00:00 2001 From: zhuwenxi Date: Fri, 17 Dec 2021 10:53:30 +0800 Subject: [PATCH 4/8] Fix typo: partition_for_cmsisnn -> partition_for_libxsmm --- rfcs/0046-Intel-LIBXSMM-integration.md | 56 +++++++++++++------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/rfcs/0046-Intel-LIBXSMM-integration.md b/rfcs/0046-Intel-LIBXSMM-integration.md index 71201799..f4c41687 100644 --- a/rfcs/0046-Intel-LIBXSMM-integration.md +++ b/rfcs/0046-Intel-LIBXSMM-integration.md @@ -41,46 +41,46 @@ We will integrate LIBXSMM with TVM in following 3 components: from tvm.relay.op.contrib import libxsmm # API to call LIBXSMM partitioning - libxsmm_module = libxsmm.partition_for_cmsisnn(module) + libxsmm_module = libxsmm.partition_for_libxsmm(module) ``` * Pattern matching table: ``` -@register_pattern_table("libxsmm") -def pattern_table(): - dense_pattern = ("libxsmm.dense", make_pattern(with_bias=False, with_activation=None)) - denese_bias_pattern = ("libxsmm.dense_bias", make_pattern(with_bias=True, with_activation=None)) - denese_relu_pattern = ("libxsmm.dense_relu", make_pattern(with_bias=False, with_activation="relu")) - denese_sigmoid_pattern = ("libxsmm.dense_sigmoid", make_pattern(with_bias=False, with_activation="sigmoid")) - denese_bias_relu = ("libxsmm.dense_bias_relu", make_pattern(with_bias=True, with_activation="relu")) - denese_bias_sigmoid = ("libxsmm.dense_bias_sigmoid", make_pattern(with_bias=True, with_activation="sigmoid")) - libxsmm_pattern = [dense_pattern, denese_bias_pattern, denese_relu_pattern, denese_sigmoid_pattern, denese_bias_relu, denese_bias_sigmoid] - return libxsmm_pattern + @register_pattern_table("libxsmm") + def pattern_table(): + dense_pattern = ("libxsmm.dense", make_pattern(with_bias=False, with_activation=None)) + denese_bias_pattern = ("libxsmm.dense_bias", make_pattern(with_bias=True, with_activation=None)) + denese_relu_pattern = ("libxsmm.dense_relu", make_pattern(with_bias=False, with_activation="relu")) + denese_sigmoid_pattern = ("libxsmm.dense_sigmoid", make_pattern(with_bias=False, with_activation="sigmoid")) + denese_bias_relu = ("libxsmm.dense_bias_relu", make_pattern(with_bias=True, with_activation="relu")) + denese_bias_sigmoid = ("libxsmm.dense_bias_sigmoid", make_pattern(with_bias=True, with_activation="sigmoid")) + libxsmm_pattern = [dense_pattern, denese_bias_pattern, denese_relu_pattern, denese_sigmoid_pattern, denese_bias_relu, denese_bias_sigmoid] + return libxsmm_pattern ``` * Build with TVM ``` -with tvm.transform.PassContext(opt_level=3): - lib = relay.build(libxsmm_module, target="cpu", params=params) + with tvm.transform.PassContext(opt_level=3): + lib = relay.build(libxsmm_module, target="cpu", params=params) ``` 3. Integrate into TOPI, an GEMM autotvm template with LIBXSMM as inner kernel. 
* Use Tensorize/TensorIR to substitute register block of GEMM with LIBXSMM ``` -def intrin_func(ins, outs): - def _body(): - ib = tvm.tir.ir_builder.create() - ib.emit( - tvm.tir.call_extern( - "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 0.0, outs[0].access_ptr("w"), N - ) + def intrin_func(ins, outs): + def _body(): + ib = tvm.tir.ir_builder.create() + ib.emit( + tvm.tir.call_extern( + "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 0.0, outs[0].access_ptr("w"), N ) - return ib.get() - def _update(): - ib = tvm.tir.ir_builder.create() - ib.emit( - tvm.tir.call_extern( - "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 1.0, outs[0].access_ptr("w"), N - ) + ) + return ib.get() + def _update(): + ib = tvm.tir.ir_builder.create() + ib.emit( + tvm.tir.call_extern( + "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 1.0, outs[0].access_ptr("w"), N ) - return ib.get() + ) + return ib.get() ``` # Testing From 5f8eeaa3590b64c24986c1a3b4938af2bfa84146 Mon Sep 17 00:00:00 2001 From: zhuwenxi Date: Fri, 17 Dec 2021 11:03:09 +0800 Subject: [PATCH 5/8] Add python annotation. --- rfcs/0046-Intel-LIBXSMM-integration.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/rfcs/0046-Intel-LIBXSMM-integration.md b/rfcs/0046-Intel-LIBXSMM-integration.md index f4c41687..d05f7ee2 100644 --- a/rfcs/0046-Intel-LIBXSMM-integration.md +++ b/rfcs/0046-Intel-LIBXSMM-integration.md @@ -22,7 +22,7 @@ We will integrate LIBXSMM with TVM in following 3 components: # Reference-level explanation 1. Users can call libxsmm as CBLAS through extern call API. -``` +```python def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs): n = lhs.shape[1] if transa else lhs.shape[0] m = rhs.shape[0] if transb else rhs.shape[1] @@ -37,14 +37,14 @@ We will integrate LIBXSMM with TVM in following 3 components: ``` 2. BYOC allows for graph partitioning and using LIBXSMM for code generation. * API to obtain the partitioned function: -``` +```python from tvm.relay.op.contrib import libxsmm # API to call LIBXSMM partitioning libxsmm_module = libxsmm.partition_for_libxsmm(module) ``` * Pattern matching table: -``` +```python @register_pattern_table("libxsmm") def pattern_table(): dense_pattern = ("libxsmm.dense", make_pattern(with_bias=False, with_activation=None)) @@ -57,13 +57,13 @@ We will integrate LIBXSMM with TVM in following 3 components: return libxsmm_pattern ``` * Build with TVM -``` +```python with tvm.transform.PassContext(opt_level=3): lib = relay.build(libxsmm_module, target="cpu", params=params) ``` 3. Integrate into TOPI, an GEMM autotvm template with LIBXSMM as inner kernel. * Use Tensorize/TensorIR to substitute register block of GEMM with LIBXSMM -``` +```python def intrin_func(ins, outs): def _body(): ib = tvm.tir.ir_builder.create() @@ -73,6 +73,7 @@ We will integrate LIBXSMM with TVM in following 3 components: ) ) return ib.get() + def _update(): ib = tvm.tir.ir_builder.create() ib.emit( From 4d7fb910d35ce4c3dd2bbee2f14151d348d0d658 Mon Sep 17 00:00:00 2001 From: zhuwenxi Date: Fri, 17 Dec 2021 11:32:19 +0800 Subject: [PATCH 6/8] Add support for target system and Relay op strategy. 
---
 rfcs/0046-Intel-LIBXSMM-integration.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rfcs/0046-Intel-LIBXSMM-integration.md b/rfcs/0046-Intel-LIBXSMM-integration.md
index d05f7ee2..06e627f8 100644
--- a/rfcs/0046-Intel-LIBXSMM-integration.md
+++ b/rfcs/0046-Intel-LIBXSMM-integration.md
@@ -16,9 +16,10 @@ This proposal aims to integrate LIBXSMM into TVM to accelerate small GEMM and serve as inner-kernel to accelerate normal size GEMM.
 
 We will integrate LIBXSMM with TVM in the following 3 components:
 1. Add an extern call “tvm.contrib.libxsmm.gemm” in the “src/runtime/contrib” directory, and a corresponding Python interface in the "python/tvm/contrib/" directory, so users can call them just as they call CBLAS;
 2. Use BYOC to accelerate small GEMM (cube_root(m * n * k) <= 256) and its epilogue fusion variations (bias/relu/sigmoid/bias_relu/bias_sigmoid);
 3. Add the AutoTVM template we wrote with LIBXSMM as the inner kernel into TOPI, as a GEMM implementation candidate.
+4. Add target system and Relay op strategy support. When users specify `llvm -libs=libxsmm`, the Relay op strategy automatically lowers the corresponding GEMM ops to LIBXSMM.
 
 # Reference-level explanation
 1. Users can call libxsmm as CBLAS through extern call API.

From 4a1d99226e6ef2cdf73013e3ca3cc90b94203c26 Mon Sep 17 00:00:00 2001
From: zhuwenxi
Date: Fri, 17 Dec 2021 11:38:44 +0800
Subject: [PATCH 7/8] Add upstream plan.

---
 rfcs/0046-Intel-LIBXSMM-integration.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/rfcs/0046-Intel-LIBXSMM-integration.md b/rfcs/0046-Intel-LIBXSMM-integration.md
index 06e627f8..052ff75d 100644
--- a/rfcs/0046-Intel-LIBXSMM-integration.md
+++ b/rfcs/0046-Intel-LIBXSMM-integration.md
@@ -97,3 +97,12 @@ We will add unittest for coresponding extern call, BYOC and TOPI related code:
 # Future possibilities
 * LIBXSMM has DNN support, so it might be interesting to also integrate DNN primitives such as Conv into TVM;
 * LIBXSMM has quantized (int8) kernels; we can also integrate them into TVM, as long as they surpass the existing oneDNN implementations.
+
+# Upstream plan
+This proposal would be split into the following PR series:
+1. Add LIBXSMM as an extern call;
+2. Add LIBXSMM to BYOC for accelerating small GEMM;
+3. Add LIBXSMM-enabled normal size GEMM to TOPI;
+4. Add LIBXSMM-enabled normal size GEMM to Relay op strategy.
+
+Test cases will be provided with these PRs.

From ed59c81d4e7aaf12cfeea98b16cf5cd7780c1d31 Mon Sep 17 00:00:00 2001
From: zhuwenxi
Date: Thu, 23 Dec 2021 19:45:23 +0800
Subject: [PATCH 8/8] Reschedule the integration plan.

---
 rfcs/0046-Intel-LIBXSMM-integration.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/rfcs/0046-Intel-LIBXSMM-integration.md b/rfcs/0046-Intel-LIBXSMM-integration.md
index 052ff75d..c3384e1d 100644
--- a/rfcs/0046-Intel-LIBXSMM-integration.md
+++ b/rfcs/0046-Intel-LIBXSMM-integration.md
@@ -100,9 +100,10 @@ We will add unittest for coresponding extern call, BYOC and TOPI related code:
 
 # Upstream plan
 This proposal would be split into the following PR series:
-1. Add LIBXSMM as an extern call;
-2. Add LIBXSMM to BYOC for accelerating small GEMM;
-3. Add LIBXSMM-enabled normal size GEMM to TOPI;
-4. Add LIBXSMM-enabled normal size GEMM to Relay op strategy.
+1. Add LIBXSMM to TVM CI;
+2. BYOC support for accelerating small GEMM;
+3. Documentation about LIBXSMM support, including supported ops/patterns/dtypes/versions and limitations;
+4. TOPI op support for normal size GEMM;
+5. Relay op strategy for normal size GEMM.
 
 Test cases will be provided with these PRs.
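For illustration, a minimal sketch of how an end user might exercise the integration once the pieces above land. This assumes the `llvm -libs=libxsmm` target flag and the `partition_for_libxsmm` BYOC entry point are exposed as proposed in this RFC; the exact names, signatures, and workload construction here are illustrative and may differ in the final PRs.

```python
# Illustrative only: follows the API surface proposed in this RFC, which may change.
import tvm
from tvm import relay

# A small dense workload in the "small GEMM" regime (cube_root(m * n * k) <= 256).
data = relay.var("data", shape=(1, 256), dtype="float32")
weight = relay.var("weight", shape=(128, 256), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], relay.nn.dense(data, weight)))

# Option 1: rely on the Relay op strategy by adding libxsmm to the target libs.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm -libs=libxsmm")

# Option 2 (sketch): explicitly partition dense/dense+epilogue patterns for the
# proposed LIBXSMM BYOC backend before building.
# from tvm.relay.op.contrib import libxsmm
# mod = libxsmm.partition_for_libxsmm(mod)
# with tvm.transform.PassContext(opt_level=3):
#     lib = relay.build(mod, target="llvm")
```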