cpu: riscv: gemm: add f32 gemm SIMD optimization with RISC-V V Extension#3785
mgouicem merged 4 commits into uxlfoundation:main
Conversation
Force-pushed from d5f2994 to d185c8d
Hello everyone, thank you for your patience. I've updated the PR with a new commit to address the performance instability issues observed in the initial submission. Here's a summary of the changes and their impact.

Key Updates in the New Commit:

Performance Impact: This optimization has been tested, showing performance gains across all benchmarks. It yields significant speedups for Matmul (averaging 4.00x) and Deconvolution (averaging 5.16x) primitives, while the overall impact on Convolution is generally positive.

Detailed Performance Data: Matmul Performance: Convolution Performance: Deconvolution Performance:

This GEMM optimization directly enhances the performance of the default convolution and deconvolution implementations on RVV, since they invoke GEMM directly rather than Matmul. Consequently, this submission does not conflict with any existing (or currently under review) RVV matrix multiplication optimizations. We kindly request that you consider this PR for approval.

Ctest Log:
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
The previous RVV F32 GEMM implementation showed polarized performance,
with significant speedups in some cases but severe regressions in others.
This commit addresses the instability through several key changes:
* Adjust the F32 GEMM kernel's inner block size to 8 to align with
the 256-bit vector length (VLEN), improving vector unit utilization.
* Revert the TransA (A^T * B) case to the reference C++ implementation,
as the existing RVV-specific optimization for this path was
inefficient and caused regressions.
* Improve register utilization by manually unrolling the loop for the
n=4 case. This avoids spilling `vfloat32m1_t` variables to the stack,
working around the limitation that RVV does not support arrays of
vector types.
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
Force-pushed from 8af27dc to ba512bb
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
Force-pushed from ba512bb to 950b35d
Hi @dzarukin, thank you so much for your time and for the very detailed review. Your feedback has been incredibly helpful and I've learned a great deal from it. I have now addressed all the points you raised, with the one exception of the empty `struct gemm_traits_t {};` generic template. For that particular point, I've left my explanation of the design rationale in the original thread for your consideration.

I am now rebuilding and testing these changes locally. Once the local tests pass, I will formally re-request your review through GitHub. Thanks again for all your help and guidance!
Hi @dzarukin, I have now implemented all the changes from your feedback, and I'm happy to report that all local tests are passing. During testing, I noticed that the excellent PR #3784 was merged. In light of its contributions, the matmul performance of this PR will now align with the improvements introduced there: GEMM-based matmul will no longer be used by default on RVV devices.

However, this PR continues to provide value by enhancing the performance of `extended_sgemm` on the RVV platform. This accelerates primitives like convolution and deconvolution, as well as any other operation that calls `extended_sgemm`, for instance `ref_rnn`.

Furthermore, the GEMM kernel introduced here is designed with long-term optimization potential. The current implementation is a balanced trade-off for a 32 KB L1 cache, and in the future this framework will allow us to easily tune key parameters like m and n to deliver even better performance across hardware with different cache sizes.

Thank you once again for your invaluable time and guidance!
Co-authored-by: Dmitry Zarukin <dmitry.zarukin@intel.com>
Great work! I noticed that a significant portion of cases (43.2% in conv.csv) show speedups below 1, indicating performance regressions. Could you help explain why this is happening? Specifically, for the case:
in mm.csv, the speedup is below 0.25 even with ordinary arguments.
Hi, thanks for bringing this up. It's important to note that the previous performance data was collected on a low-performing 8-core Spacemit(R) X60 development board. Its limited memory bandwidth prevented it from accurately showcasing the optimization's full effect. I have now gained access to an SG2044 server and have re-run the matmul and conv tests. The results are shown in the following file:

During my development, it's worth noting that #3784 was successfully merged. However, my optimization baseline was the internal

The convolution primitive also achieved significant acceleration. The issue you raised regarding the significant performance drop with

I believe that the application scenarios for oneDNN are not limited to 8-core development boards; those devices are better suited for embedded neural network libraries. Additionally, it's important to note that the current
Agreed. QEMU simulation results should only be used to verify functional correctness, not for performance evaluation.
It is confusing which line of your code carries the VLEN=256-specific optimization. In my understanding,
While the RISC-V V extension supports a dynamic vector length, the finite capacity of the vector register file constrains the loop unrolling factor. Consequently, the unrolling must be tuned for the specific vector length, cache size, and register capacity of the target processor.
Hi @mgouicem, this PR requires 2 approvals. Could you please take a look when you have a moment? Thank you very much!
Hi, I am facing a similar situation. May I kindly ask: is memory bandwidth truly the key bottleneck causing the poor performance on the Spacemit X60 board? To clarify, does this imply that programs using RVV intrinsics require more memory bandwidth than scalar implementations to leverage their strengths effectively? Interesting! Do you have any research papers that demonstrate this?
Description
This PR introduces a SIMD-optimized f32 GEMM kernel for the RISC-V 64-bit architecture, leveraging the RISC-V Vector (V) Extension.
The existing generic GEMM implementation in oneDNN (`ref_gemm`) relies on `PRAGMA_OMP_SIMD` for auto-vectorization. However, current RISC-V toolchains do not effectively vectorize this code, leading to suboptimal performance on RV64 platforms. This commit addresses this performance gap by providing a manually optimized SIMD implementation, `rvv_gemm_f32`, which is dispatched from `extended_sgemm`.

This optimization is integrated directly into the low-level `extended_sgemm` function. This provides two major benefits:
- it accelerates all `sgemm` calls across the library;
- it accelerates `gemm_convolution_fwd_t`, one of the most efficient forward convolution implementations in oneDNN.

By targeting the foundational GEMM routine, this PR delivers performance gains for models like Convolutional Neural Networks (CNNs).
The current implementation only focuses on the `f32` data type.

Checklist
General
- Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?

Performance improvements
All performance data was measured on a Spacemit X60 development board (VLEN=256). The performance baseline is the default `ref_gemm` implementation in the oneDNN main branch. Below are some example performance benchmarks for different problem sizes.

Matmul Primitives Performance
Baseline (`ref_gemm`) vs. optimized (`rvv_gemm`):
- `--matmul --stag=acb --wtag=cab 2x457x3888:2x3888x2888`
- `--matmul --bia-dt=f32 1024x1024:1024x1024`
- `--matmul --stag=acb --wtag=abc 2x551x1276:2x1276x58`
This demonstrates the advantage of optimizing `extended_sgemm`. Baseline (`gemm_conv`) vs. optimized (`gemm_conv` with `rvv_gemm`):
- `--conv mb1ic3id116ih132iw132oc32od114oh130ow130kd3kh3kw3pd0ph0pw0n"3d_unet:conv_1"`

New features
Bug fixes
RFC PR