
cpu: rv64: gemm: add bf16 gemm SIMD optimization with RISC-V V Extension#3845

Closed
ryanle1017 wants to merge 1 commit into uxlfoundation:main from ryanle1017:feature/rvv-bf16-gemm-github

Conversation

@ryanle1017

Description

This PR introduces a SIMD-optimized bfloat16 GEMM kernel for the RISC-V 64-bit architecture, leveraging the RISC-V Vector (V) Extension. This work extends the foundational f32 GEMM implementation from PR #3785, enabling high-performance mixed-precision computations.
The primary motivation is to accelerate inference on emerging RISC-V platforms. As bfloat16 becomes a critical data type for modern deep learning models, offering significant memory bandwidth savings with a dynamic range comparable to f32, this optimized kernel fills a crucial performance gap.
This implementation focuses on the bf16:bf16:f32 data type combination (bfloat16 inputs, float32 accumulation and output), which is a common and numerically robust approach for mixed-precision GEMM.

Key Changes

  • RVV-Optimized BF16 Kernel: Added a new GEMM kernel (rvv_gemm_bf16bf16f32) specifically designed for bfloat16 inputs and float32 outputs/accumulation.
  • GEMM Dispatch Integration: The main GEMM dispatch logic in src/cpu/gemm/gemm.cpp is updated to route bf16bf16f32 requests to the new RVV kernel when running on a compatible RISC-V platform.
  • Platform Feature Detection: Extended platform::mayiuse_bf16() to correctly detect and enable bfloat16 support when RVV intrinsics are available for RISC-V.

Checklist

General

  • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [x] Have you formatted the code using clang-format?

Performance improvements

  • [x] Have you submitted performance data that demonstrates performance improvements?

All performance data was measured on a
--TODO
The performance baseline is the default ref_gemm implementation in the oneDNN main branch. Below are some example performance benchmarks for different problem sizes.

Matmul Primitive Performance (--dt=bf16:bf16:f32)

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

Bug fixes

  • Have you included information on how to reproduce the issue (either in a GitHub issue or in this PR)?
  • Have you added relevant regression tests?

RFC PR

  • Does RFC document follow the template?
  • Have you added a link to the rendered document?

Co-authored-by: Fei Zhang <zhangfei@iscas.ac.cn>
@ryanle1017 ryanle1017 force-pushed the feature/rvv-bf16-gemm-github branch from ac70101 to 7bd3fe8 Compare August 29, 2025 14:38
@ryanle1017 ryanle1017 closed this Aug 29, 2025
@ryanle1017 ryanle1017 deleted the feature/rvv-bf16-gemm-github branch August 29, 2025 14:53
@ryanle1017 ryanle1017 restored the feature/rvv-bf16-gemm-github branch August 29, 2025 15:07
