cpu: riscv: gemm: add f32 gemm SIMD optimization with RISC-V V Extension #3785

Merged
mgouicem merged 4 commits into uxlfoundation:main from xiazhuozhao:rvv-gemm
Sep 26, 2025

Conversation

@xiazhuozhao
Contributor

@xiazhuozhao xiazhuozhao commented Aug 19, 2025

Description

This PR introduces a SIMD-optimized f32 GEMM kernel for the RISC-V 64-bit architecture, leveraging the RISC-V Vector (V) Extension.

The existing generic GEMM implementation in oneDNN (ref_gemm) relies on PRAGMA_OMP_SIMD for auto-vectorization. However, current RISC-V toolchains do not vectorize this code effectively, leading to suboptimal performance on RV64 platforms. This PR closes that performance gap with a manually optimized SIMD implementation, rvv_gemm_f32, which is dispatched from extended_sgemm.
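The dispatch idea can be sketched as follows. The function names mirror those mentioned in this PR, but the bodies are illustrative stand-ins, not oneDNN's actual internals:

```cpp
namespace sketch {

// Reference kernel: a plain triple loop that relies on the compiler
// (via PRAGMA_OMP_SIMD in oneDNN) to auto-vectorize.
void ref_gemm_f32(int m, int n, int k, const float *A, const float *B,
        float *C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

// Stand-in for the hand-vectorized kernel. On a real RV64 build this body
// would use RVV intrinsics; here it forwards so the sketch runs anywhere.
void rvv_gemm_f32(int m, int n, int k, const float *A, const float *B,
        float *C) {
    ref_gemm_f32(m, n, k, A, B, C);
}

// Dispatch point: take the manually vectorized path when the V extension
// is available, otherwise fall back to the reference code.
void extended_sgemm(int m, int n, int k, const float *A, const float *B,
        float *C) {
#if defined(__riscv_v)
    rvv_gemm_f32(m, n, k, A, B, C);
#else
    ref_gemm_f32(m, n, k, A, B, C);
#endif
}

} // namespace sketch
```

Because the dispatch sits inside extended_sgemm itself, every caller of that routine picks up the fast path without changes.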

This optimization is integrated directly into the low-level extended_sgemm function. This provides two major benefits:

  1. It accelerates direct sgemm calls across the library.
  2. It boosts the performance of high-level primitives that rely on this function, such as gemm_convolution_fwd_t, one of the most efficient forward convolution implementations in oneDNN.

By targeting the foundational GEMM routine, this PR delivers performance gains for models like Convolutional Neural Networks (CNNs).

The current implementation only focuses on the f32 data type.

Checklist

General

  • [√] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [√] Have you formatted the code using clang-format?

Performance improvements

  • [√] Have you submitted performance data that demonstrates performance improvements?

All performance data was measured on a Spacemit X60 development board (VLEN=256). The performance baseline is the default ref_gemm implementation in the oneDNN main branch. Below are some example performance benchmarks for different problem sizes.

Matmul Primitives Performance

| Benchmark Case | Baseline (ref_gemm) | This PR (rvv_gemm) | Speedup |
| --- | --- | --- | --- |
| `--matmul --stag=acb --wtag=cab 2x457x3888:2x3888x2888` | 11.44 s | 3.28 s | 3.49x |
| `--matmul --bia-dt=f32 1024x1024:1024x1024` | 3.34 s | 2.53 s | 1.32x |
| `--matmul --stag=acb --wtag=abc 2x551x1276:2x1276x58` | 0.09 s | 0.04 s | ~2.25x |

Convolution Primitives Performance (via GEMM)

This demonstrates the advantage of optimizing extended_sgemm directly.

| Benchmark Case | Baseline (gemm_conv) | This PR (gemm_conv with rvv_gemm) | Speedup |
| --- | --- | --- | --- |
| `--conv mb1ic3id116ih132iw132oc32od114oh130ow130kd3kh3kw3pd0ph0pw0n"3d_unet:conv_1"` | 2.09 s | 1.55 s | 1.35x |

New features

  • [ ] Have you published an RFC for the new feature?
  • [ ] Was the RFC approved?
  • [ ] Have you added relevant tests?

Bug fixes

  • [ ] Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • [ ] Have you added relevant regression tests?

RFC PR

  • [ ] Does RFC document follow the template?
  • [ ] Have you added a link to the rendered document?

@xiazhuozhao xiazhuozhao requested a review from a team as a code owner August 19, 2025 08:00
@xiazhuozhao xiazhuozhao force-pushed the rvv-gemm branch 4 times, most recently from d5f2994 to d185c8d on August 28, 2025 09:27
@xiazhuozhao
Contributor Author

xiazhuozhao commented Aug 28, 2025

Hello everyone,

Thank you for your patience. I've updated the PR with a new commit to address the performance instability issues observed in the initial submission. Here's a summary of the changes and their impact:

Key Updates in the New Commit:

  1. Inner Block Size Adjustment: The F32 GEMM kernel's inner block size has been set to 8 to better align with the 256-bit vector length (VLEN), improving vector unit utilization. For future work, optimizations for various VLENs can be implemented using templates, which aligns with the variable-length design of the RISC-V V extension.
  2. Fallback for TransA Case: The TransA (A^T * B) case now falls back to the ref_gemm_f32 implementation, since the previous RVV-specific optimization for this path was inefficient and caused regressions.
  3. Improved Register Utilization: Manual loop unrolling for the n=4 case has been implemented. This avoids spilling vfloat32m1_t variables to the stack.
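The register-pressure point above can be illustrated with a portable scalar sketch (not the PR's actual kernel). A `float[8]` stands in for one `vfloat32m1_t` at VLEN=256, and the four accumulators are named variables rather than an array, mirroring the constraint that RVV intrinsics do not support arrays of vector types:

```cpp
struct vf32x8 { float v[8]; };  // stands in for one vfloat32m1_t at VLEN=256

// Models a vfmacc-style fused multiply-accumulate: acc += a_col * b,
// with the scalar b broadcast across all lanes.
static inline vf32x8 vfmacc(vf32x8 acc, const float a_col[8], float b) {
    for (int l = 0; l < 8; ++l) acc.v[l] += a_col[l] * b;
    return acc;
}

// Computes an 8x4 tile of C = A * B, with A 8xk and B kx4, row-major.
// The accumulators are named c0..c3 instead of an array acc[4]: with
// real vector types, indexing an array would force spills to the stack.
void kernel_8x4(int k, const float *A, int lda, const float *B, int ldb,
        float *C, int ldc) {
    vf32x8 c0 = {}, c1 = {}, c2 = {}, c3 = {};
    float a_col[8];
    for (int p = 0; p < k; ++p) {
        for (int i = 0; i < 8; ++i) a_col[i] = A[i * lda + p];  // load A column
        c0 = vfmacc(c0, a_col, B[p * ldb + 0]);
        c1 = vfmacc(c1, a_col, B[p * ldb + 1]);
        c2 = vfmacc(c2, a_col, B[p * ldb + 2]);
        c3 = vfmacc(c3, a_col, B[p * ldb + 3]);
    }
    for (int i = 0; i < 8; ++i) {  // write the finished 8x4 tile back
        C[i * ldc + 0] = c0.v[i];
        C[i * ldc + 1] = c1.v[i];
        C[i * ldc + 2] = c2.v[i];
        C[i * ldc + 3] = c3.v[i];
    }
}
```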

Performance Impact:

Testing shows performance gains overall: significant speedups for Matmul (averaging 4.00x) and Deconvolution (averaging 5.16x) primitives, while the impact on convolution is generally positive.

Detailed Performance Data:

Matmul Performance:
mm.csv

Convolution Performance:
conv.csv

Deconvolution Performance:
deconv.csv

This GEMM optimization effectively enhances the performance of the default convolution and deconvolution implementations on RVV, as they invoke GEMM directly rather than going through Matmul. Consequently, this submission does not conflict with any existing (or currently under review) RVV matrix multiplication optimizations. We kindly request that you consider this PR for approval.

Ctest Log:
make_test.log

@xiazhuozhao xiazhuozhao changed the title cpu: rv64: gemm: add f32 gemm SIMD optimization with RISC-V V Extension cpu: riscv: gemm: add f32 gemm SIMD optimization with RISC-V V Extension Aug 28, 2025
@xiazhuozhao
Contributor Author

Hi, @dzarukin .
Could you please take a look at PR #3785 when you have a moment? Your feedback would be greatly appreciated.
Thanks!

xiazhuozhao and others added 2 commits September 4, 2025 00:59
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
The previous RVV F32 GEMM implementation showed polarized performance,
with significant speedups in some cases but severe regressions in others.
This commit addresses the instability through several key changes:

* Adjust the F32 GEMM kernel's inner block size to 8 to align with
    the 256-bit vector length (VLEN), improving vector unit utilization.

* Revert the TransA (A^T * B) case to the reference C++ implementation,
    as the existing RVV-specific optimization for this path was
    inefficient and caused regressions.

* Improve register utilization by manually unrolling the loop for the
    n=4 case. This avoids spilling `vfloat32m1_t` variables to the stack,
    working around the limitation that RVV does not support arrays of
    vector types.

Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
@xiazhuozhao xiazhuozhao force-pushed the rvv-gemm branch 2 times, most recently from 8af27dc to ba512bb on September 3, 2025 17:04
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
@xiazhuozhao
Contributor Author

xiazhuozhao commented Sep 3, 2025

Hi @dzarukin ,

Thank you so much for your time and for the very detailed review. Your feedback has been incredibly helpful and I've learned a great deal from it.

I have now addressed all the points you raised, with the one exception being the empty struct gemm_traits_t {}; generic template. For that particular point, I've left my explanation of the design rationale in the original thread for your consideration.

I am now in the process of running the rebuild and testing these changes locally.

Once the local tests pass, I will formally re-request your review through GitHub.

Thanks again for all your help and guidance!

@xiazhuozhao
Contributor Author

xiazhuozhao commented Sep 4, 2025

Hi @dzarukin,

I have now implemented all the changes from your feedback, and I'm happy to report that all local tests are passing successfully.

During testing, I noticed that the excellent PR #3784 was merged. In light of its contributions, the matmul performance of this PR now aligns with the improvements introduced there; GEMM-based matmul will no longer be used by default on RVV devices.

However, this PR continues to provide value by enhancing the performance of extended_sgemm on the RVV platform. This accelerates primitives like convolution and deconvolution, as well as any other operation that calls extended_sgemm, for instance ref_rnn.

Furthermore, the gemm kernel introduced here is designed with long-term optimization potential. The current implementation is a balanced trade-off for a 32KB L1 cache; in the future, this framework will allow us to easily tune key parameters like m and n to deliver even better performance across hardware with varying cache sizes.
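As a back-of-envelope illustration of that cache trade-off (the blocking numbers below are hypothetical, not the kernel's actual parameters): the A panel, B panel, and C tile of one blocked-GEMM step should jointly fit in L1 so the k-loop streams from cache.

```cpp
// Working-set size of one blocked-GEMM step in bytes: A panel (m_blk x k_blk),
// B panel (k_blk x n_blk), and C tile (m_blk x n_blk), all f32 (4 bytes).
// The parameters are illustrative, not the values used in this PR.
constexpr long working_set_bytes(int m_blk, int n_blk, int k_blk) {
    return 4L * (m_blk * k_blk + k_blk * n_blk + m_blk * n_blk);
}
```

For example, a 48x48 tile with k_blk = 56 occupies about 30 KB and fits a 32 KB L1, while doubling k_blk to 112 (~51 KB) only fits a 64 KB L1. A kernel tuned for 32 KB therefore leaves headroom unused on larger caches.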

Thank you once again for your invaluable time and guidance!

@xiazhuozhao xiazhuozhao requested a review from dzarukin September 4, 2025 14:35
Co-authored-by: Dmitry Zarukin <dmitry.zarukin@intel.com>
@zhangjian29
Contributor

Great work! I noticed that a significant portion of cases (43.2% in conv.csv) show speedups below 1, i.e. performance regressions. Could you help explain why this is happening?

Specifically, for the case

`--mode=P --matmul --stag=ab --wtag=ab --dtag=ab 2048x13:13x512_n"DLRM_train:FWD,0 * 1"`

in mm.csv, the speedup is below 0.25 despite fairly ordinary arguments.

@xiazhuozhao
Contributor Author

xiazhuozhao commented Sep 12, 2025

> Great work! I noticed that a significant portion of cases (43.2% in conv.csv) show speedups below 1, i.e. performance regressions. Could you help explain why this is happening?
>
> Specifically, for the case
>
> `--mode=P --matmul --stag=ab --wtag=ab --dtag=ab 2048x13:13x512_n"DLRM_train:FWD,0 * 1"`
>
> in mm.csv, the speedup is below 0.25 despite fairly ordinary arguments.

Hi, thanks for bringing this up. It's important to note that the previous performance data was collected on a low-performing 8-core Spacemit(R) X60 development board. Its limited memory bandwidth prevented it from accurately showcasing the optimization's full effect. I have now gained access to an SG2044 server and have re-run the matmul and conv tests. The results are shown in the following file:

conv_2.csv
mm_2.csv

It's worth noting that #3784 was merged during my development; however, my optimization baseline was the internal ref_gemm. I therefore ran three sets of matmul tests, with the "3784" branch representing the changes introduced by #3784. As the results show, rvv_gemm performs well against both the ref_gemm and #3784 baselines. Note that the performance data in the #3784 PR was measured in QEMU and does not accurately reflect real RISC-V V extension performance; in my experience, the vector instructions may be translated into AVX instructions, making the results closer to those of an x86 processor. However, since my changes do not lower the priority of #3784, the default matmul performance will remain consistent with #3784.

The convolution primitive also achieved significant acceleration. The issue you raised regarding the significant performance drop with --mode=P --matmul --stag=ab --wtag=ab --dtag=ab 2048x13:13x512_n"DLRM_train:FWD,0 * 1" has also been resolved.

I believe that the application scenarios for oneDNN are not limited to 8-core development boards; those devices are better suited for embedded neural network libraries.

Additionally, it's important to note that the current rvv_gemm is a compromise optimized for 32KB L1 cache, whereas the SG2044's is 64KB, meaning half of the theoretical performance has yet to be unlocked. In future development, I will adapt for different cache sizes. I am confident that the GEMM-based convolution approach is an excellent solution. This method is mature, efficiently utilizes the pipeline, is widely used by neural network libraries like Torch and TensorFlow, and serves as a general fallback plan for oneDNN. When evaluating different algorithmic approaches, it's crucial to consider not only current performance but also its potential for future optimization.

@zhangjian29
Contributor

> in my experience, the vector instructions may be translated into AVX instructions, making the results closer to those of an x86 processor.

Agreed. QEMU simulation results should only be used to verify functional correctness, not for performance evaluation.

> Additionally, it's important to note that the current rvv_gemm is a compromise optimized for VLEN=256, whereas the SG2044's VLEN is 512, meaning half of the theoretical performance has yet to be unlocked.

I am confused: which line of your code is specific to VLEN=256? In my understanding, vl is determined at run time by vsetvl from the requested avl and the hardware VLEN, so the same code should suit machines with different VLENs. How does performance differ?

@xiazhuozhao
Contributor Author

xiazhuozhao commented Sep 15, 2025

> I am confused: which line of your code is specific to VLEN=256? In my understanding, vl is determined at run time by vsetvl from the requested avl and the hardware VLEN, so the same code should suit machines with different VLENs. How does performance differ?

While the RISC-V V extension supports a dynamic vector length, the finite capacity of the vector register file constrains the loop unrolling factor. Consequently, the unrolling factor must be tuned for the specific vector length, cache size, and register capacity of the target processor.
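Concretely (these are standard RVV architectural facts; the tile shape below is just an example): RVV provides 32 vector registers, and at LMUL=1 each holds VLEN/32 f32 elements, so the accumulator tile plus temporaries must fit within 32 registers to avoid spills.

```cpp
// f32 elements held by one vector register at LMUL=1.
constexpr int f32_per_reg(int vlen_bits) { return vlen_bits / 32; }

// Vector registers needed for an m x n f32 accumulator tile, assuming
// m is a multiple of the per-register element count.
constexpr int accumulator_regs(int m, int n, int vlen_bits) {
    return (m / f32_per_reg(vlen_bits)) * n;
}
```

At VLEN=256 an 8x4 tile needs only 4 accumulator registers plus a few temporaries for A and B, well within the 32-register file; a larger unroll factor, or a smaller VLEN, changes that arithmetic, which is why the factor is retuned per target.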

@xiazhuozhao
Contributor Author

Hi @mgouicem, this PR requires two approvals. Could you please take a look when you have a moment? Thank you very much!

@zhangjian29
Copy link
Contributor

> It's important to note that the previous performance data was collected on a low-performing 8-core Spacemit(R) X60 development board. Its limited memory bandwidth prevented it from accurately showcasing the optimization's full effect.

Hi, I am facing a similar situation. May I ask: is memory bandwidth truly the key bottleneck causing the poor performance on the Spacemit X60 board?

To clarify, does this imply that programs using RVV intrinsics need more memory bandwidth than scalar implementations to show their strengths? Interesting! Do you have any research papers that demonstrate this?

@mgouicem mgouicem merged commit 40724eb into uxlfoundation:main Sep 26, 2025
12 checks passed