cpu: riscv: gemm: add f32 gemm SIMD optimization with RISC-V V Extension #3785

Merged
mgouicem merged 4 commits into uxlfoundation:main from xiazhuozhao:rvv-gemm
Sep 26, 2025

Conversation

@xiazhuozhao
Contributor

@xiazhuozhao xiazhuozhao commented Aug 19, 2025

Description

This PR introduces a SIMD-optimized f32 GEMM kernel for the RISC-V 64-bit architecture, leveraging the RISC-V Vector (V) Extension.

The existing generic GEMM implementation in oneDNN (ref_gemm) relies on PRAGMA_OMP_SIMD for auto-vectorization. However, current RISC-V toolchains do not vectorize this code effectively, leading to suboptimal performance on RV64 platforms. This PR closes that performance gap with a manually optimized SIMD implementation, rvv_gemm_f32, which is dispatched from extended_sgemm.
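The dispatch idea can be sketched as follows. The function names mirror those mentioned in this PR, but the bodies are illustrative stand-ins, not oneDNN's actual internals:

```cpp
namespace sketch {

// Reference kernel: a plain triple loop that relies on the compiler
// (via PRAGMA_OMP_SIMD in oneDNN) to auto-vectorize.
void ref_gemm_f32(int m, int n, int k, const float *A, const float *B,
        float *C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

// Stand-in for the hand-vectorized kernel. On a real RV64 build this body
// would use RVV intrinsics; here it forwards so the sketch runs anywhere.
void rvv_gemm_f32(int m, int n, int k, const float *A, const float *B,
        float *C) {
    ref_gemm_f32(m, n, k, A, B, C);
}

// Dispatch point: take the manually vectorized path when the V extension
// is available, otherwise fall back to the reference code.
void extended_sgemm(int m, int n, int k, const float *A, const float *B,
        float *C) {
#if defined(__riscv_v)
    rvv_gemm_f32(m, n, k, A, B, C);
#else
    ref_gemm_f32(m, n, k, A, B, C);
#endif
}

} // namespace sketch
```

Because the dispatch sits inside extended_sgemm itself, every caller of that routine picks up the fast path without changes.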

This optimization is integrated directly into the low-level extended_sgemm function. This provides two major benefits:

  1. It accelerates direct sgemm calls across the library.
  2. It boosts the performance of high-level primitives that rely on this function, such as gemm_convolution_fwd_t, one of the most efficient forward convolution implementations in oneDNN.

By targeting the foundational GEMM routine, this PR delivers performance gains for models like Convolutional Neural Networks (CNNs).

The current implementation only focuses on the f32 data type.

Checklist

General

  • [√] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [√] Have you formatted the code using clang-format?

Performance improvements

  • [√] Have you submitted performance data that demonstrates performance improvements?

All performance data was measured on a Spacemit X60 development board (VLEN=256). The performance baseline is the default ref_gemm implementation in the oneDNN main branch. Below are some example performance benchmarks for different problem sizes.

Matmul Primitives Performance

| Benchmark Case | Baseline (ref_gemm) | This PR (rvv_gemm) | Speedup |
| --- | --- | --- | --- |
| `--matmul --stag=acb --wtag=cab 2x457x3888:2x3888x2888` | 11.44 s | 3.28 s | 3.49x |
| `--matmul --bia-dt=f32 1024x1024:1024x1024` | 3.34 s | 2.53 s | 1.32x |
| `--matmul --stag=acb --wtag=abc 2x551x1276:2x1276x58` | 0.09 s | 0.04 s | ~2.25x |

Convolution Primitives Performance (via GEMM)

This demonstrates the advantage of optimizing extended_sgemm directly.

| Benchmark Case | Baseline (gemm_conv) | This PR (gemm_conv with rvv_gemm) | Speedup |
| --- | --- | --- | --- |
| `--conv mb1ic3id116ih132iw132oc32od114oh130ow130kd3kh3kw3pd0ph0pw0n"3d_unet:conv_1"` | 2.09 s | 1.55 s | 1.35x |

New features

  • [ ] Have you published an RFC for the new feature?
  • [ ] Was the RFC approved?
  • [ ] Have you added relevant tests?

Bug fixes

  • [ ] Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • [ ] Have you added relevant regression tests?

RFC PR

  • [ ] Does RFC document follow the template?
  • [ ] Have you added a link to the rendered document?

@xiazhuozhao xiazhuozhao requested a review from a team as a code owner August 19, 2025 08:00
@xiazhuozhao xiazhuozhao force-pushed the rvv-gemm branch 4 times, most recently from d5f2994 to d185c8d on August 28, 2025 09:27
@xiazhuozhao
Contributor Author

xiazhuozhao commented Aug 28, 2025

Hello everyone,

Thank you for your patience. I've updated the PR with a new commit to address the performance instability issues observed in the initial submission. Here's a summary of the changes and their impact:

Key Updates in the New Commit:

  1. Inner Block Size Adjustment: The F32 GEMM kernel's inner block size has been set to 8 to better align with the 256-bit vector length (VLEN), improving vector unit utilization. For future work, optimizations for various VLENs can be implemented using templates, which aligns with the variable-length design of the RISC-V V extension.
  2. Fallback for TransA Case: The TransA (A^T * B) case now falls back to the ref_gemm_f32 implementation, since the previous RVV-specific optimization for this path was inefficient and caused regressions.
  3. Improved Register Utilization: Manual loop unrolling for the n=4 case has been implemented. This avoids spilling vfloat32m1_t variables to the stack.
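The register-pressure point above can be illustrated with a portable scalar sketch (not the PR's actual kernel). A `float[8]` stands in for one `vfloat32m1_t` at VLEN=256, and the four accumulators are named variables rather than an array, mirroring the constraint that RVV intrinsics do not support arrays of vector types:

```cpp
struct vf32x8 { float v[8]; };  // stands in for one vfloat32m1_t at VLEN=256

// Models a vfmacc-style fused multiply-accumulate: acc += a_col * b,
// with the scalar b broadcast across all lanes.
static inline vf32x8 vfmacc(vf32x8 acc, const float a_col[8], float b) {
    for (int l = 0; l < 8; ++l) acc.v[l] += a_col[l] * b;
    return acc;
}

// Computes an 8x4 tile of C = A * B, with A 8xk and B kx4, row-major.
// The accumulators are named c0..c3 instead of an array acc[4]: with
// real vector types, indexing an array would force spills to the stack.
void kernel_8x4(int k, const float *A, int lda, const float *B, int ldb,
        float *C, int ldc) {
    vf32x8 c0 = {}, c1 = {}, c2 = {}, c3 = {};
    float a_col[8];
    for (int p = 0; p < k; ++p) {
        for (int i = 0; i < 8; ++i) a_col[i] = A[i * lda + p];  // load A column
        c0 = vfmacc(c0, a_col, B[p * ldb + 0]);
        c1 = vfmacc(c1, a_col, B[p * ldb + 1]);
        c2 = vfmacc(c2, a_col, B[p * ldb + 2]);
        c3 = vfmacc(c3, a_col, B[p * ldb + 3]);
    }
    for (int i = 0; i < 8; ++i) {  // write the finished 8x4 tile back
        C[i * ldc + 0] = c0.v[i];
        C[i * ldc + 1] = c1.v[i];
        C[i * ldc + 2] = c2.v[i];
        C[i * ldc + 3] = c3.v[i];
    }
}
```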

Performance Impact:

Testing shows performance gains overall: significant speedups for Matmul (averaging 4.00x) and Deconvolution (averaging 5.16x) primitives, while the impact on convolution is generally positive.

Detailed Performance Data:

Matmul Performance:
mm.csv

Convolution Performance:
conv.csv

Deconvolution Performance:
deconv.csv

This GEMM optimization effectively enhances the performance of the default convolution and deconvolution implementations on RVV, as they invoke GEMM directly rather than going through Matmul. Consequently, this submission does not conflict with any existing (or currently under review) RVV matrix multiplication optimizations. We kindly request that you consider this PR for approval.

Ctest Log:
make_test.log

@xiazhuozhao xiazhuozhao changed the title cpu: rv64: gemm: add f32 gemm SIMD optimization with RISC-V V Extension cpu: riscv: gemm: add f32 gemm SIMD optimization with RISC-V V Extension Aug 28, 2025
@xiazhuozhao
Contributor Author

Hi, @dzarukin .
Could you please take a look at PR #3785 when you have a moment? Your feedback would be greatly appreciated.
Thanks!

xiazhuozhao and others added 2 commits September 4, 2025 00:59
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
The previous RVV F32 GEMM implementation showed polarized performance,
with significant speedups in some cases but severe regressions in others.
This commit addresses the instability through several key changes:

* Adjust the F32 GEMM kernel's inner block size to 8 to align with
    the 256-bit vector length (VLEN), improving vector unit utilization.

* Revert the TransA (A^T * B) case to the reference C++ implementation,
    as the existing RVV-specific optimization for this path was
    inefficient and caused regressions.

* Improve register utilization by manually unrolling the loop for the
    n=4 case. This avoids spilling `vfloat32m1_t` variables to the stack,
    working around the limitation that RVV does not support arrays of
    vector types.

Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
@xiazhuozhao xiazhuozhao force-pushed the rvv-gemm branch 2 times, most recently from 8af27dc to ba512bb on September 3, 2025 17:04
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
@xiazhuozhao
Contributor Author

xiazhuozhao commented Sep 3, 2025

Hi @dzarukin ,

Thank you so much for your time and for the very detailed review. Your feedback has been incredibly helpful and I've learned a great deal from it.

I have now addressed all the points you raised, with the one exception being the empty struct gemm_traits_t {}; generic template. For that particular point, I've left my explanation of the design rationale in the original thread for your consideration.

I am now in the process of running the rebuild and testing these changes locally.

Once the local tests pass, I will formally re-request your review through GitHub.

Thanks again for all your help and guidance!

@xiazhuozhao
Contributor Author

xiazhuozhao commented Sep 4, 2025

Hi @dzarukin,

I have now implemented all the changes from your feedback, and I'm happy to report that all local tests are passing successfully.

During testing, I noticed that the excellent PR #3784 was merged. In light of its contributions, the matmul performance of this PR now aligns with the improvements introduced there; GEMM-based matmul will no longer be used by default on RVV devices.

However, this PR continues to provide value by enhancing the performance of extended_sgemm on the RVV platform. This accelerates primitives like convolution and deconvolution, as well as any other operation that calls extended_sgemm, for instance ref_rnn.

Furthermore, the gemm kernel introduced here is designed with long-term optimization potential. The current implementation is a balanced trade-off for a 32KB L1 cache; in the future, this framework will allow us to easily tune key parameters like m and n to deliver even better performance across hardware with varying cache sizes.
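As a back-of-envelope illustration of that cache trade-off (the blocking numbers below are hypothetical, not the kernel's actual parameters): the A panel, B panel, and C tile of one blocked-GEMM step should jointly fit in L1 so the k-loop streams from cache.

```cpp
// Working-set size of one blocked-GEMM step in bytes: A panel (m_blk x k_blk),
// B panel (k_blk x n_blk), and C tile (m_blk x n_blk), all f32 (4 bytes).
// The parameters are illustrative, not the values used in this PR.
constexpr long working_set_bytes(int m_blk, int n_blk, int k_blk) {
    return 4L * (m_blk * k_blk + k_blk * n_blk + m_blk * n_blk);
}
```

For example, a 48x48 tile with k_blk = 56 occupies about 30 KB and fits a 32 KB L1, while doubling k_blk to 112 (~51 KB) only fits a 64 KB L1. A kernel tuned for 32 KB therefore leaves headroom unused on larger caches.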

Thank you once again for your invaluable time and guidance!

@xiazhuozhao xiazhuozhao requested a review from dzarukin September 4, 2025 14:35
Co-authored-by: Dmitry Zarukin <dmitry.zarukin@intel.com>
@zhangjian29
Contributor

Great work! I noticed that a significant portion of cases (43.2% in conv.csv) show speedups below 1, i.e. performance regressions. Could you help explain why this is happening?

Specifically, for the case

`--mode=P --matmul --stag=ab --wtag=ab --dtag=ab 2048x13:13x512_n"DLRM_train:FWD,0 * 1"`

in mm.csv, the speedup is below 0.25 despite fairly ordinary arguments.

@xiazhuozhao
Contributor Author

xiazhuozhao commented Sep 12, 2025

> Great work! I noticed that a significant portion of cases (43.2% in conv.csv) show speedups below 1, i.e. performance regressions. Could you help explain why this is happening?
>
> Specifically, for the case
>
> `--mode=P --matmul --stag=ab --wtag=ab --dtag=ab 2048x13:13x512_n"DLRM_train:FWD,0 * 1"`
>
> in mm.csv, the speedup is below 0.25 despite fairly ordinary arguments.

Hi, thanks for bringing this up. It's important to note that the previous performance data was collected on a low-performing 8-core Spacemit(R) X60 development board. Its limited memory bandwidth prevented it from accurately showcasing the optimization's full effect. I have now gained access to an SG2044 server and have re-run the matmul and conv tests. The results are shown in the following file:

conv_2.csv
mm_2.csv

It's worth noting that #3784 was merged during my development; however, my optimization baseline was the internal ref_gemm. I therefore ran three sets of matmul tests, with the "3784" branch representing the changes introduced by #3784. As the results show, rvv_gemm performs well against both the ref_gemm and #3784 baselines. Note that the performance data in the #3784 PR was measured in QEMU and does not accurately reflect real RISC-V V extension performance; in my experience, the vector instructions may be translated into AVX instructions, making the results closer to those of an x86 processor. However, since my changes do not lower the priority of #3784, the default matmul performance will remain consistent with #3784.

The convolution primitive also achieved significant acceleration. The issue you raised regarding the significant performance drop with --mode=P --matmul --stag=ab --wtag=ab --dtag=ab 2048x13:13x512_n"DLRM_train:FWD,0 * 1" has also been resolved.

I believe that the application scenarios for oneDNN are not limited to 8-core development boards; those devices are better suited for embedded neural network libraries.

Additionally, it's important to note that the current rvv_gemm is a compromise optimized for 32KB L1 cache, whereas the SG2044's is 64KB, meaning half of the theoretical performance has yet to be unlocked. In future development, I will adapt for different cache sizes. I am confident that the GEMM-based convolution approach is an excellent solution. This method is mature, efficiently utilizes the pipeline, is widely used by neural network libraries like Torch and TensorFlow, and serves as a general fallback plan for oneDNN. When evaluating different algorithmic approaches, it's crucial to consider not only current performance but also its potential for future optimization.

@zhangjian29
Contributor

> in my experience, the vector instructions may be translated into AVX instructions, making the results closer to those of an x86 processor.

Agreed. QEMU simulation results should only be used to verify functional correctness, not for performance evaluation.

> Additionally, it's important to note that the current rvv_gemm is a compromise optimized for VLEN=256, whereas the SG2044's VLEN is 512, meaning half of the theoretical performance has yet to be unlocked.

I am confused: which line of your code is specific to VLEN=256? In my understanding, vl is determined at run time by vsetvl from the requested avl and the hardware VLEN, so the same code should suit machines with different VLENs. How does performance differ?

@xiazhuozhao
Contributor Author

xiazhuozhao commented Sep 15, 2025

> I am confused: which line of your code is specific to VLEN=256? In my understanding, vl is determined at run time by vsetvl from the requested avl and the hardware VLEN, so the same code should suit machines with different VLENs. How does performance differ?

While the RISC-V V extension supports a dynamic vector length, the finite capacity of the vector register file constrains the loop unrolling factor. Consequently, the unrolling factor must be tuned for the specific vector length, cache size, and register capacity of the target processor.
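Concretely (these are standard RVV architectural facts; the tile shape below is just an example): RVV provides 32 vector registers, and at LMUL=1 each holds VLEN/32 f32 elements, so the accumulator tile plus temporaries must fit within 32 registers to avoid spills.

```cpp
// f32 elements held by one vector register at LMUL=1.
constexpr int f32_per_reg(int vlen_bits) { return vlen_bits / 32; }

// Vector registers needed for an m x n f32 accumulator tile, assuming
// m is a multiple of the per-register element count.
constexpr int accumulator_regs(int m, int n, int vlen_bits) {
    return (m / f32_per_reg(vlen_bits)) * n;
}
```

At VLEN=256 an 8x4 tile needs only 4 accumulator registers plus a few temporaries for A and B, well within the 32-register file; a larger unroll factor, or a smaller VLEN, changes that arithmetic, which is why the factor is retuned per target.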

@xiazhuozhao
Contributor Author

Hi @mgouicem, this PR requires two approvals. Could you please take a look when you have a moment? Thank you very much!

@zhangjian29
Copy link
Contributor

> It's important to note that the previous performance data was collected on a low-performing 8-core Spacemit(R) X60 development board. Its limited memory bandwidth prevented it from accurately showcasing the optimization's full effect.

Hi, I am facing a similar situation. May I ask: is memory bandwidth truly the key bottleneck causing the poor performance on the Spacemit X60 board?

To clarify, does this imply that programs using RVV intrinsics need more memory bandwidth than scalar implementations to show their strengths? Interesting! Do you have any research papers that demonstrate this?

@mgouicem mgouicem merged commit 40724eb into uxlfoundation:main Sep 26, 2025
12 checks passed