
cpu: rv64: gemm: Implemented variable loop unrolling for GEMM#4258

Merged
vpirogov merged 1 commit into uxlfoundation:main from xiazhuozhao:V-GEMM
Dec 2, 2025

Conversation

@xiazhuozhao
Contributor

@xiazhuozhao xiazhuozhao commented Oct 31, 2025

Description

This patch adds a variable-loop-unrolling GEMM implementation for the RISC-V platform, which significantly improves GEMM performance.

The change introduces logic to select an appropriate loop-unrolling kernel based on the L1 cache size. This substantially improves matmul performance on devices with a 64KB L1 cache, while performance on 32KB devices remains consistent with the #3785 implementation.

For context, the current RISC-V matmul implementation (#3784) is non-GEMM-based. While an efficient GEMM kernel was also implemented in #3785, the resulting GEMM-based matmul was not prioritized over #3784.

On a 64KB L1 cache device, this new implementation (#4258) achieves a significant average speedup of 23.69x over the current matmul (#3784) and 15.43x over the GEMM-based matmul from #3785.

Performance data (values in the table are average GFLOPS; test command: ./benchdnn --mode=p --matmul --batch=inputs/matmul/perf_matmul_training)
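For readers unfamiliar with the knob being tuned here, the n-direction unrolling can be sketched in scalar C++. This is a minimal illustration of the blocking structure only; the actual kernel uses RVV vector intrinsics, and the function name `gemm_n_unrolled` is hypothetical.

```cpp
#include <array>
#include <vector>

// Illustrative scalar sketch of unrolling the N loop of C += A * B by a
// compile-time factor UN. Not the real RVV kernel; it only shows the
// blocking structure whose unroll factor is being tuned.
template <int UN>
void gemm_n_unrolled(int M, int N, int K,
                     const std::vector<float> &A,  // M x K, row-major
                     const std::vector<float> &B,  // K x N, row-major
                     std::vector<float> &C) {      // M x N, row-major
    for (int i = 0; i < M; ++i) {
        int j = 0;
        for (; j + UN <= N; j += UN) {
            std::array<float, UN> acc {}; // UN accumulators kept in registers
            for (int k = 0; k < K; ++k)
                for (int u = 0; u < UN; ++u)
                    acc[u] += A[i * K + k] * B[k * N + j + u];
            for (int u = 0; u < UN; ++u) C[i * N + j + u] += acc[u];
        }
        for (; j < N; ++j) // remainder columns not covered by the unroll
            for (int k = 0; k < K; ++k)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
    }
}
```

A larger UN reuses each loaded A element across more columns of B per pass, at the cost of more live accumulators, which is why the profitable factor depends on cache and register resources.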

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

@xiazhuozhao xiazhuozhao force-pushed the V-GEMM branch 4 times, most recently from d5f6940 to 7637e27 Compare October 31, 2025 10:28
@xiazhuozhao xiazhuozhao marked this pull request as ready for review October 31, 2025 17:21
@xiazhuozhao xiazhuozhao requested a review from a team as a code owner October 31, 2025 17:21
Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
@vpirogov vpirogov merged commit d6107dd into uxlfoundation:main Dec 2, 2025
11 checks passed
@zhangjian29
Contributor

On a 64KB L1 cache device, this new implementation (#4258) achieves a significant average speedup of 23.69x over the current matmul (#3784) and 15.43x over the GEMM-based matmul from #3785.

Hi @xiazhuozhao ,

I tested your n-unrolling logic on a 2044 platform with a 64KB L1 cache on each core. It doesn't look like an n-unroll factor of 8 performs best on it.

| Batch Shape | N-Unroll-2 | N-Unroll-4 | N-Unroll-8 | N-Unroll-16 |
|---|---|---|---|---|
| shapes_converted_ip_inf_lb_wd | 207.592 | 204.823 | 214.517 | 257.341 |
| shapes_converted_ip_inf_lb_gmnt | 27.9092 | 28.639 | 29.4266 | 44.6874 |
| shapes_converted_ip_inf_lb_googlenet | 257.508 | 255.205 | 301.095 | 473.827 |
| shapes_converted_ip_inf_lb_resnet | 114.712 | 110.454 | 128.085 | 193.939 |
| shapes_transformer | 149.28 | 149.719 | 139.334 | 771.982 |

What do you think is going wrong in my tests?

