cpu: rv64: add support for rvv convolution feature #3915

Closed
zhangjian29 wants to merge 5 commits into uxlfoundation:main from
zhangjian29:add-rvv-convolution-integration

Conversation

@zhangjian29
Contributor

Description

This PR introduces optimized convolution kernels for RISC-V architectures using RVV (RISC-V Vector) intrinsics. The rvv_convolution implementation vectorizes the convolution sources and kernels directly by broadcasting along the channel dimension, achieving significant performance improvements over both the scalar and the compiler auto-vectorized implementations.

This initial version provides:

  1. SIMD acceleration of convolution using RVV intrinsics, broadcasting along the channel dimension.
  2. Support for multiple data types, including f32, f16, s32, s8, and u8.
  3. Initial support for nchw and nhwc memory layouts for sources, and the oihw layout for weights.
  4. Integration with the newest rvv_postops to support the ReLU post-op and scale multiplication.

We've noticed that draft PR #3852 (rvv_gemm_conv) is in progress, which uses a different method to vectorize and implement convolution for the RVV architecture. We would like to highlight the benefits of our rvv_convolution implementation here.

  • Minimal Memory Overhead: The rvv_gemm_conv implementation requires a large buffer when it uses the im2col method, which unrolls a convolution input into a large matrix to leverage matrix-multiplication acceleration. This rvv_convolution implementation instead vectorizes the convolution input along its input-channel dimension, without any matrix transformation or buffering. Its minimal memory overhead therefore makes it the more broadly applicable approach.

Key Features

  • Data Types: The supported src_dtype and wei_dtype combinations match the cases in the reference design, except bf16:

    • f32 & f32
    • f16 & f16
    • f32 & f16
    • s8 & s8
    • u8 & s8

    The dst dtype can be one of f32, f16, s8, u8, or s32; the conversion happens at the end of each output scalar computation.

  • Memory Layouts: Source and destination tensors must be nchw or nhwc, and weight tensors must be oihw. Note that computation is performed in nhwc layout, so nchw src and dst tensors must be reordered first. The reorder routines are provided in rvv_convolution_utils.hpp.

  • Vectorization: The kernel vectorizes along the input-channel dimension rather than the h/w dimensions for better performance.

  • Compute Kernel: After vectorization, data goes into rvv_dot_ic_fwd_*(src_dtype)_*(wei_dtype), which computes a dot product to produce each output scalar; the result then goes into finalize_conv_acc, which applies the bias, scales, and post-ops.

  • Post-Ops: The newest rvv_postops is integrated to support the ReLU post-op and scale multiplication.

  • Parallelization: By adopting the parallel methods provided by the oneDNN API, including balance211, parallel_nd, and parallel, the implementation combines RVV-vectorized inner loops with multi-core CPU parallelism.

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

All experiments are performed on the LRW platform. We compare three methods: the first baseline, a scalar implementation; the second baseline, compiler auto-vectorization; and our RVV intrinsic implementation.

The test uses benchdnn with the arguments of one of the layers in VGG11:

./benchdnn --stag=abcd --wtag=abcd --dtag=abcd --attr-post-ops=ReLU mb64ic256ih7oc160oh7kh3ph1n"vgg_11:conv5_1"

  • In the training process with extra arguments --dir=FWD_D --dt=f32:f32:f32, experimental results are as follows:

    | Methods            | Perf. Time (s) | Speedups | Total Time (s) |
    |--------------------|----------------|----------|----------------|
    | scalar             | 0.65           | 1.48 ×   | 1.79           |
    | auto vectorization | 0.44           | 1 ×      | 6.78%          |
    | RVV intrinsic      | 2.32           | -        | 3.62           |
  • In the inference process with extra arguments --dir=FWD_I --dt=u8:s8:s32, experimental results are as follows:

    | Methods            | Perf. Time (s) | Speedups | Total Time (s) |
    |--------------------|----------------|----------|----------------|
    | scalar             | 16.41          | 56.59 ×  | 17.07          |
    | auto vectorization | 25.47          | 87.83 ×  | 25.91          |
    | RVV intrinsic      | 0.29           | -        | 0.68           |

Note that the total time comprises creation, filling, execution, reference computation, and comparison, where the last two are for correctness verification using scalar computation. For a fair comparison, we subtract the reference-computation and comparison time to obtain the performance time.

New features

  • [N/A] Have you published an RFC for the new feature?
  • [N/A] Was the RFC approved?
  • [N/A] Have you added relevant tests?

Bug fixes

  • [N/A] Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • [N/A] Have you added relevant regression tests?

RFC PR

  • [N/A] Does RFC document follow the template?
  • [N/A] Have you added a link to the rendered document?

@zhangjian29
Contributor Author

Hey guys, I pushed an updated version with key changes:

Key Changes

  1. Removed templatization of reordering and packing memory methods.
  2. Revised #ifdef.

Evaluation

I rebuilt and retested using RISC-V GNU toolchain version 14.2, verifying the functionality under the QEMU RISCV64 emulator. I used shapes_vgg_11 to test my implementation with the command ONEDNN_VERBOSE=1 ./benchdnn --conv --batch=test_conv_vgg11, which passed all cases successfully.

File test_conv_vgg11 includes tests:

--reset

--alg=direct
--stag=abcd
--wtag=abcd
--dtag=abcd

# Training forward
--dir=FWD_D
--dt=f32
--attr-post-ops=relu
--batch=shapes_vgg_11

# Inference
--dir=FWD_I
--dt=u8:s8:s32
--attr-post-ops=relu
--batch=shapes_vgg_11

Calls to the implemented rvv_convolution can be traced by searching for RISCV64GCV in:

Thanks!

@zhangjian29 force-pushed the add-rvv-convolution-integration branch from 2563c11 to a66105d on October 3, 2025, 23:54
@zhangjian29 force-pushed the add-rvv-convolution-integration branch from a66105d to e009e2a on October 9, 2025, 07:54
@zhangjian29 deleted the add-rvv-convolution-integration branch on February 24, 2026, 07:06
