cpu: rv64: matmul: improving rvv matmul performance by integrating with gemm kernel by zhangjian29 · Pull Request #4363 · uxlfoundation/oneDNN

zhangjian29 · 2025-11-21T07:46:48Z

Description

This PR improves the performance of the rvv_matmul implementation by integrating it with the rvv_gemm_f32 kernel introduced in #3785. The result is a high-performance, lightweight rvv_matmul implementation.

Key Features: Higher Performance and Less Code

Removed the original row-major and column-major kernels.
Added an init_gemm_conf method to initialize the rvv_gemm_f32 kernel.
Called the rvv_gemm_f32 kernel at execution time for the GEMM computation.
Reinterpreted row-major matrices as column-major matrices to obtain higher performance in the GEMM computation.
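
The row-major to column-major reinterpretation can be illustrated with a minimal sketch. It relies on the identity that a row-major matrix of shape r x c occupies the same memory as a column-major matrix of shape c x r, so dst = src * wei in row-major equals dst^T = wei^T * src^T in column-major, with no data copied. The names gemm_col_major and matmul_row_major are hypothetical and the naive loop only stands in for a real kernel; this is not the rvv_gemm_f32 signature or the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major GEMM kernel (NOT the oneDNN
// rvv_gemm_f32 signature): C(m x n) = A(m x k) * B(k x n), all column-major,
// with leading dimensions lda, ldb, ldc.
void gemm_col_major(std::size_t m, std::size_t n, std::size_t k,
        const float *a, std::size_t lda, const float *b, std::size_t ldb,
        float *c, std::size_t ldc) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = acc;
        }
    }
}

// Row-major matmul: dst(M x N) = src(M x K) * wei(K x N), all dense.
// Viewed as column-major, the same buffers hold wei^T (N x K), src^T (K x M)
// and dst^T (N x M), so swapping the operands and the M/N dimensions yields
// dst^T = wei^T * src^T without transposing or copying anything.
void matmul_row_major(std::size_t M, std::size_t N, std::size_t K,
        const float *src, const float *wei, float *dst) {
    gemm_col_major(N, M, K, wei, /*lda=*/N, src, /*ldb=*/K, dst, /*ldc=*/N);
}
```

Calling matmul_row_major(2, 2, 3, ...) on a 2 x 3 src and a 3 x 2 wei stored row-major fills dst in row-major order, even though the kernel itself only understands column-major data.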

Checklist

General

Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
Have you formatted the code using clang-format?

Performance Improvements

Have you submitted performance data that demonstrates performance improvements?

All performance experiments were run on an SG2044 platform, pinned to a fixed core with taskset -c 27 cmd, and built with the same compiler and flags (gcc 14.2, -O3). Tests include:

Test Cases: benchdnn with different input shapes
Test Dtypes: f32
Test Args: --mode=P

Results

Overall, total runtime improved by 13.1x compared with the current implementation on the main branch.

Detailed results are as follows:

Table 1: Runtime Comparison of the Main Branch and This PR

Batch Shape	Main Branch (ms)	This PR (ms)	Speedup (x)
shapes_converted_ip_inf_lb_wd	4570.61	348.229	13.13
shapes_converted_ip_inf_lb_gmnt	447.421	65.6864	6.81
shapes_converted_ip_inf_lb_googlenet	3572.91	348.128	10.26
shapes_converted_ip_inf_lb_resnet	767.848	177.444	4.33
shapes_transformer	367.592	211.356	1.74
total	9726.381	742.715	13.10

Comparison with PR #4350

We appreciate the efforts in PR #4350 to improve rvv_matmul. This PR takes a different approach: a lightweight integration with the mature rvv_gemm_f32 kernel, which is about 5.1x faster in total runtime than PR #4350 (Table 2). Because rvv_matmul inherits any further optimizations made to the GEMM kernel, this approach should keep benefiting as that kernel improves. We kindly request consideration for this PR.

Table 2: Runtime Comparison of PR #4350 and This PR

Batch Shape	PR #4350 (ms)	This PR (ms)	Speedup (x)
shapes_converted_ip_inf_lb_wd	970.328	348.229	2.79
shapes_converted_ip_inf_lb_gmnt	413.218	65.6864	6.29
shapes_converted_ip_inf_lb_googlenet	1429.51	348.128	4.11
shapes_converted_ip_inf_lb_resnet	662.37	177.444	3.73
shapes_transformer	308.03	211.356	1.46
total	3783.46	742.715	5.10

Test logs:
test_mm_main.log
test_mm_this_pr.log
test_mm_pr_4350.log

zhangjian29 · 2025-11-21T08:34:13Z

Hi @vpirogov @dzarukin @mgouicem,

Looking forward to your feedback on this PR. Thanks a lot.

krishnasai-mcw · 2025-12-10T04:24:18Z

Hi @zhangjian29 ,

I also think that integrating the GEMM kernel directly into the matmul primitive was the right approach. This way we can focus on improving and optimizing the GEMM kernel rather than maintaining multiple paths.

I tested the latest code on the Banana Pi BPI-F3 board, and I’m seeing failures specifically in cases where the weights are in column-major layout. Could you please take a look at the logs below?
shapes_2D.log
shapes_3D.log
shapes_4D.log
test_matmul_all.log

Thanks!

zhangjian29 · 2025-12-10T07:12:21Z

Hi @krishnasai-mcw ,

Thank you for reporting this issue. I am seeing these failures on my platform too.

This is definitely a bug: the current rvv_matmul.hpp accepts the column-major layout, but rvv_matmul.cpp treats it as row-major. I am going to fix it in another PR. Thank you so much.
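
To make the failure mode concrete, here is a minimal sketch of how a weights layout could be classified from its strides, together with a strided reference matmul that gives the same result for both layouts. All names (wei_desc_t, classify_weights, matmul_strided_wei) are hypothetical and are not taken from rvv_matmul.hpp or rvv_matmul.cpp; the actual fix may look quite different.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a 2D weights tensor with logical shape K x N.
struct wei_desc_t {
    std::size_t K, N;
    std::size_t stride_k; // elements between wei(k, n) and wei(k + 1, n)
    std::size_t stride_n; // elements between wei(k, n) and wei(k, n + 1)
};

enum class wei_layout_t { row_major, col_major, unsupported };

// Classify the layout from strides alone (assumption: dense or padded
// leading dimension, no other permutations).
wei_layout_t classify_weights(const wei_desc_t &d) {
    if (d.stride_n == 1 && d.stride_k >= d.N) return wei_layout_t::row_major;
    if (d.stride_k == 1 && d.stride_n >= d.K) return wei_layout_t::col_major;
    return wei_layout_t::unsupported;
}

// Reference: dst(M x N) = src(M x K) * wei(K x N). src and dst are dense
// row-major; weights are read through their strides, so row-major and
// column-major weights holding the same logical values give the same dst.
void matmul_strided_wei(std::size_t M, const float *src, const float *wei,
        const wei_desc_t &d, float *dst) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < d.N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < d.K; ++k)
                acc += src[m * d.K + k] * wei[k * d.stride_k + n * d.stride_n];
            dst[m * d.N + n] = acc;
        }
    }
}
```

A check that feeds the same logical weights in both layouts and compares the outputs against such a reference is one way a test could catch a kernel that accepts column-major weights but indexes them as row-major.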

zhangjian29 · 2025-12-10T07:13:50Z

By the way, why did the CI tests pass anyway? Maybe we should have stronger tests for issues like this.

cpu: rv64: matmul: add rvv matmul gemm kernel

86871a7

zhangjian29 requested a review from a team as a code owner November 21, 2025 07:46

github-actions bot added the platform:cpu-rv64 RISC-V label Nov 21, 2025

zhangjian29 mentioned this pull request Nov 21, 2025

cpu: riscv: matmul: refactor and optimize RVV kernels for better performance #4350

Closed

3 tasks

cpu: rv64: matmul: fix weight broadcast logic

a769148

vpirogov approved these changes Nov 25, 2025

dzarukin approved these changes Dec 2, 2025

dzarukin merged commit b73fc31 into uxlfoundation:main Dec 2, 2025
11 checks passed

zhangjian29 deleted the add-rvv-matmul-gemm-kernel branch December 3, 2025 01:10

zhangjian29 mentioned this pull request Dec 10, 2025

cpu: rv64: matmul: fix column-major weights transpose bug #4445

Merged

dzarukin merged 2 commits into uxlfoundation:main from zhangjian29:add-rvv-matmul-gemm-kernel
