cpu: rv64: matmul: improving rvv matmul performance by integrating with gemm kernel #4363
Conversation
Hi @zhangjian29, I also think that integrating the GEMM kernel directly into the matmul primitive was the right approach. This way we can focus on improving and optimizing the GEMM kernel rather than maintaining multiple paths. I tested the latest code on the Banana Pi BPI-F3 board, and I'm seeing failures specifically in cases where the weights are in column-major layout. Could you please take a look at the logs below? Thanks!
Hi @krishnasai-mcw, thank you for reporting this issue. I am seeing these failures on my platform too. It's a mistake for sure that current
By the way, why did the CI tests pass anyway? Maybe we should add stronger tests for similar issues.
Description
This PR improves the performance of the `rvv_matmul` implementation by integrating it with the `rvv_gemm_f32` kernel proposed in #3785. We propose a high-performance, lightweight approach to the `rvv_matmul` implementation.
Key Features: Higher Performance and Less Code
- `row-major` and `column-major` kernel.
- `init_gemm_conf` method to initialize the `rvv_gemm_f32` kernel.
- `rvv_gemm_f32` kernel in execution for GEMM computation.
- `row-major` matrices are reinterpreted as `column-major` matrices to obtain higher performance in GEMM computation.
Checklist
General
- Do tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
Performance Improvements
All performance experiments were run on an SG2044 platform, pinned to a fixed core with the `taskset -c 27` command and built with the same compilation flags (gcc 14.2, `-O3`). Tests include:
- `benchdnn` with different input shapes
- `f32`
- `--mode=P`
Results
On average, this PR improves performance by 13.1x compared with the current implementation.
Detailed results are as follows:
Table 1: Runtime Comparison of the Implementations on the Main Branch and in This PR
Comparison with PR #4350
We appreciate the efforts in PR #4350 to improve `rvv_matmul`. Our PR takes a different approach, a lightweight integration with the mature `rvv_gemm_f32` kernel, which demonstrates an over 5.1x performance improvement compared to PR #4350. Because it inherits ongoing optimizations to the GEMM kernel, this approach should keep gaining performance as that kernel improves. We kindly request consideration for this PR.
Table 2: Runtime Comparison of the Implementations in PR #4350 and This PR
test_mm_main.log
test_mm_this_pr.log
test_mm_pr_4350.log
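For readers unfamiliar with the row-major reinterpretation listed under Key Features, here is a minimal sketch of the general idea. It is not the code in this PR. `gemm_col_major` is a hypothetical, naive stand-in for a column-major GEMM kernel (the real `rvv_gemm_f32` signature is not shown here), and `matmul_row_major` is likewise an illustrative name. The sketch relies on the fact that a row-major `M x N` buffer is bit-identical to a column-major `N x M` buffer holding the transpose, so `C = A * B` in row-major layout can be computed as `C^T = B^T * A^T` in column-major layout without copying any data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major GEMM kernel (assumption: the real
// rvv_gemm_f32 interface may differ). Computes C(MxN) = A(MxK) * B(KxN),
// where element (i, j) of a column-major matrix X lives at X[i + j * ldx].
void gemm_col_major(std::size_t M, std::size_t N, std::size_t K,
        const float *A, std::size_t lda, const float *B, std::size_t ldb,
        float *C, std::size_t ldc) {
    for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i + k * lda] * B[k + j * ldb];
            C[i + j * ldc] = acc;
        }
    }
}

// Row-major matmul C(MxN) = A(MxK) * B(KxN) built on the column-major kernel.
// The row-major buffers of A, B and C are read as column-major A^T (KxM),
// B^T (NxK) and C^T (NxM), so the call below evaluates C^T = B^T * A^T.
std::vector<float> matmul_row_major(std::size_t M, std::size_t N,
        std::size_t K, const std::vector<float> &A,
        const std::vector<float> &B) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N);
    // Swap the operands and the M/N dimensions; the leading dimensions are
    // the row lengths of the original row-major matrices.
    gemm_col_major(N, M, K, B.data(), N, A.data(), K, C.data(), N);
    return C;
}
```

Calling `matmul_row_major(2, 2, 3, A, B)` with row-major `A = {1, 2, 3, 4, 5, 6}` and `B = {7, 8, 9, 10, 11, 12}` yields the row-major product `{58, 64, 139, 154}`. The appeal of this design is that only the argument order changes, so a single optimized column-major kernel can serve both layouts.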