cpu: risc-v: convolution: Vectorize post-ops with RVV intrinsics #3852
vpirogov merged 5 commits into uxlfoundation:main from
Conversation
b4e4d5f to 6827e48
Hi team, this PR is now ready for review. I've implemented RVV (RISC-V Vector extension) optimizations for the post-operations in the GEMM-based convolution primitive. The primary focus was vectorizing the bias-addition and ReLU activation loops for both the NSPC (nhwc) and NCSP (nchw) layouts. The latest commit optimizes the ReLU implementation for NCSP and reverts the NSPC implementation to the non-vectorized version due to its poor performance.

Performance data: rvv_gemmconv_benchmark.log

NSPC (nhwc) Layout

NCSP (nchw) Layout

Note: the speedup figures shown cover only the accelerated parts (i.e., ReLU and bias add), not the entire operation. Acceleration of the GEMM-based convolution operator itself is addressed in #3785.
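To make the vectorization concrete, here is a minimal sketch of what an RVV ReLU post-op loop can look like. This is illustrative, not the PR's actual code: the helper name is hypothetical, and the RVV path is guarded by the standard __riscv_v_intrinsic macro with a scalar fallback (the loop the OpenMP pragma used to auto-vectorize) so it compiles on any target.

```cpp
#include <cstddef>
#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

// Hypothetical helper: applies ReLU in place over a contiguous fp32 buffer.
// On RVV targets the loop is strip-mined with vsetvl, so the tail needs no
// separate scalar epilogue; elsewhere a plain scalar loop is used.
void relu_inplace_f32(float *dst, size_t len) {
#if defined(__riscv_v_intrinsic)
    for (size_t i = 0; i < len;) {
        size_t vl = __riscv_vsetvl_e32m8(len - i);           // elements this pass
        vfloat32m8_t v = __riscv_vle32_v_f32m8(dst + i, vl); // load
        v = __riscv_vfmax_vf_f32m8(v, 0.0f, vl);             // max(x, 0)
        __riscv_vse32_v_f32m8(dst + i, v, vl);               // store
        i += vl;
    }
#else
    for (size_t i = 0; i < len; ++i)
        dst[i] = dst[i] > 0.0f ? dst[i] : 0.0f;
#endif
}
```

The LMUL=8 register grouping here is an arbitrary illustrative choice; the best grouping depends on the surrounding register pressure.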
Nicely done! I wanted to share some thoughts regarding the integration of the ReLU intrinsics, with an eye toward future extensibility to additional post-operations. I noticed that a version of

The current supported features include:

Given this, I would like to suggest that maybe you could use our

Thank you for considering this suggestion!
Thank you for the heads-up. This PR not only accelerates the post-ops but also provides a framework for RVV GEMM-based convolution. I will continue to optimize im2col acceleration in the future, so this change remains important. Once your PR is approved, both of us can apply these operators and submit new PRs later on.
8e33bc2 to 7e74687
7e74687 to c946899
Please address the CI failures.
Thank you very much for your reminder. Due to the current public holiday in China, our verification servers are temporarily offline, so I am working on resolving this issue locally, which may take longer than usual. I sincerely apologize for any inconvenience this may cause.
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
9c8a6cb to 25468f0
Hi everyone, this issue is now fixed. My previous change to the post_ops_ok check for this op caused the graph compiler to fuse a pattern containing an element-wise binary operation, a mode our kernel implementation doesn't yet support. In the failing test, the symptom appeared as if the final Add operation were being skipped. I have now corrected the necessary checks, and all tests pass. Thanks a lot for all your help!
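The shape of such a guard can be sketched as follows. This is a hypothetical stand-in, not oneDNN's actual post_ops_t API: the enum and function are invented for illustration, but they capture the fix described above, namely rejecting any binary post-op the kernel cannot fuse yet instead of silently accepting it.

```cpp
#include <vector>

// Hypothetical stand-ins for the attribute's post-op entries.
enum class post_op_kind { eltwise_relu, sum, binary_add };

// Accept only the post-ops the RVV kernel actually implements; anything
// else (e.g. an element-wise binary Add) must fall back to another impl
// rather than being fused and then skipped at execution time.
bool post_ops_ok(const std::vector<post_op_kind> &ops) {
    for (auto k : ops) {
        switch (k) {
            case post_op_kind::eltwise_relu:
            case post_op_kind::sum:
                break;        // supported, keep checking the chain
            default:
                return false; // unsupported fusion, reject the pattern
        }
    }
    return true;
}
```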
Description
I'm opening this Draft pull request to get early feedback on the necessity and direction of this change. The full ctest validation is currently in progress and will take some time.
On the RISC-V platform, the GEMM-based convolution is currently the most performant implementation. Its execution time can be broken down into three main parts: im2col, the GEMM computation, and the post-processing stage.
The optimization for the GEMM stage using RISC-V Vector (RVV) extensions has already been contributed in PR #3785, which effectively improved the performance of convolution and deconvolution primitives.
However, the im2col and post-processing stages remain performance bottlenecks, as they are not yet effectively vectorized. This PR focuses on the post-processing part, which currently relies on OpenMP auto-vectorization (PRAGMA_OMP_SIMD()). This feature is not yet supported on our target RVV platform, making manual vectorization a necessary step for performance improvement.

This PR addresses the post-processing bottleneck by replacing the OpenMP pragmas with explicit manual vectorization using RVV intrinsics for the bias-addition and ReLU activation loops. Furthermore, the change provides a framework that will facilitate future RVV-based optimization of the im2col process.
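As a sketch of the replacement described above, here is what a bias-add loop for the NSPC (nhwc) layout can look like with RVV intrinsics. The function name and signature are hypothetical, not the PR's code; in NSPC the channels are innermost and contiguous, so every spatial point adds the same per-channel bias vector with plain unit-stride loads, and a scalar fallback stands in for the old pragma-based loop on non-RVV builds.

```cpp
#include <cstddef>
#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

// Hypothetical sketch: dst holds sp spatial points of oc contiguous
// channels each (nhwc); bias holds oc values added to every point.
void add_bias_nspc_f32(float *dst, const float *bias, size_t sp, size_t oc) {
    for (size_t s = 0; s < sp; ++s) {
        float *row = dst + s * oc;
#if defined(__riscv_v_intrinsic)
        for (size_t c = 0; c < oc;) {
            size_t vl = __riscv_vsetvl_e32m8(oc - c);            // strip-mine
            vfloat32m8_t vd = __riscv_vle32_v_f32m8(row + c, vl);
            vfloat32m8_t vb = __riscv_vle32_v_f32m8(bias + c, vl);
            __riscv_vse32_v_f32m8(row + c,
                    __riscv_vfadd_vv_f32m8(vd, vb, vl), vl);
            c += vl;
        }
#else
        for (size_t c = 0; c < oc; ++c)
            row[c] += bias[c]; // scalar fallback (was PRAGMA_OMP_SIMD)
#endif
    }
}
```

For the NCSP (nchw) layout the same bias value is broadcast across a contiguous spatial run instead, which is why the two layouts need separate loop structures, as the PR notes.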
Checklist
General
Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?

Performance improvements