
cpu: rv64: matmul: improving rvv matmul performance by integrating with gemm kernel#4363

Merged
dzarukin merged 2 commits into uxlfoundation:main from zhangjian29:add-rvv-matmul-gemm-kernel
Dec 2, 2025

Conversation

@zhangjian29
Contributor

Description

This PR improves the performance of the rvv_matmul implementation by integrating it with the rvv_gemm_f32 kernel proposed in #3785. We propose a high-performance, lightweight approach to the rvv_matmul implementation.

Key Features: Higher Performance and Less Code

  • Removed the original row-major and column-major kernels.
  • Added an init_gemm_conf method to initialize the rvv_gemm_f32 kernel.
  • Call the rvv_gemm_f32 kernel during execution for the GEMM computation.
  • Row-major matrices are reinterpreted as column-major matrices to obtain higher performance in the GEMM computation.

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance Improvements

  • Have you submitted performance data that demonstrates performance improvements?

All performance experiments are run on an SG2044 platform with fixed resources (taskset -c 27 cmd) and the same compilation flags (gcc 14.2, -O3). Tests include:

  • Test Cases: benchdnn with different input shapes
  • Test Dtypes: f32
  • Test Args: --mode=P

Results

Overall (by total runtime), this PR improves performance by 13.1x compared with the current implementation.

Detailed results are as follows:

Table 1: Runtime Comparison of the Main Branch and This PR

| Batch Shape | Main Branch (ms) | This PR (ms) | Speedup |
| --- | --- | --- | --- |
| shapes_converted_ip_inf_lb_wd | 4570.61 | 348.229 | 13.13 |
| shapes_converted_ip_inf_lb_gmnt | 447.421 | 65.6864 | 6.81 |
| shapes_converted_ip_inf_lb_googlenet | 3572.91 | 348.128 | 10.26 |
| shapes_converted_ip_inf_lb_resnet | 767.848 | 177.444 | 4.33 |
| shapes_transformer | 367.592 | 211.356 | 1.74 |
| total | 9726.381 | 742.715 | 13.10 |

Comparison with PR #4350

We appreciate the efforts in PR #4350 to improve rvv_matmul. Our PR takes a different approach, a lightweight integration with the mature rvv_gemm_f32 kernel, which demonstrates over a 5.1x performance improvement compared to PR #4350. By inheriting ongoing optimizations to the GEMM kernel, this approach promises sustained performance gains. We kindly request consideration for this PR.

Table 2: Runtime Comparison of PR #4350 and This PR

| Batch Shape | PR #4350 (ms) | This PR (ms) | Speedup |
| --- | --- | --- | --- |
| shapes_converted_ip_inf_lb_wd | 970.328 | 348.229 | 2.79 |
| shapes_converted_ip_inf_lb_gmnt | 413.218 | 65.6864 | 6.29 |
| shapes_converted_ip_inf_lb_googlenet | 1429.51 | 348.128 | 4.11 |
| shapes_converted_ip_inf_lb_resnet | 662.37 | 177.444 | 3.73 |
| shapes_transformer | 308.03 | 211.356 | 1.46 |
| total | 3783.46 | 742.715 | 5.10 |

@zhangjian29
Contributor Author

Hi @vpirogov @dzarukin @mgouicem,

Looking forward to your feedback on this PR. Thanks a lot.

@dzarukin dzarukin merged commit b73fc31 into uxlfoundation:main Dec 2, 2025
11 checks passed
@zhangjian29 zhangjian29 deleted the add-rvv-matmul-gemm-kernel branch December 3, 2025 01:10
@krishnasai-mcw
Contributor

Hi @zhangjian29 ,

I also think that integrating the GEMM kernel directly into the matmul primitive was the right approach; this way we can focus on improving and optimizing the GEMM kernel rather than maintaining multiple paths.

I tested the latest code on the Banana Pi BPI-F3 board, and I’m seeing failures specifically in cases where the weights are in column-major layout. Could you please take a look at the logs below?
shapes_2D.log
shapes_3D.log
shapes_4D.log
test_matmul_all.log

Thanks!

@zhangjian29
Contributor Author

Hi @krishnasai-mcw ,

Thank you for reporting this issue. I am seeing these failures on my platform too.

It is indeed a bug: the current rvv_matmul.hpp accepts a column-major weights layout, but rvv_matmul.cpp treats it as row-major. I am going to fix it in a follow-up PR. Thank you so much.

@zhangjian29
Contributor Author

By the way, why did the CI tests pass anyway? Maybe we should add stronger tests for issues like this.

