cpu: rv64: add support for rvv convolution feature #3915

Closed
zhangjian29 wants to merge 5 commits into uxlfoundation:main from
zhangjian29:add-rvv-convolution-integration

Conversation

@zhangjian29
Contributor

Description

This PR introduces optimized convolution kernels for RISC-V architectures using RVV (RISC-V Vector) intrinsics. The rvv_convolution implementation vectorizes the convolution sources and kernels directly by broadcasting along the channel dimension, achieving significant performance improvements over both the scalar and the compiler auto-vectorized implementations.

This initial version provides:

  1. SIMD acceleration of convolution using RVV intrinsics, broadcasting along the channel dimension.
  2. Support for multiple data types, including f32, f16, s32, s8, and u8.
  3. Initial support for nchw and nhwc memory layouts for sources, and the oihw layout for weights.
  4. Integration with the newest rvv_postops to support the ReLU post-op and scale multiplication.

We've noticed that draft PR #3852 (rvv_gemm_conv) is in progress, which uses a different method to vectorize and implement convolution for the RVV architecture. We would like to highlight the benefits of our rvv_convolution implementation here.

  • Minimal Memory Overhead: The rvv_gemm_conv implementation requires a large buffer when it uses the im2col method, which unrolls a convolution input into a large matrix to leverage matrix-multiplication acceleration. This rvv_convolution implementation instead vectorizes the convolution input along its input-channel dimension, without any matrix transformation or buffering. Its minimal memory overhead therefore makes it the more broadly applicable approach.

Key Features

  • Data Types: The supported src_dtype and wei_dtype combinations match the cases in the reference design, except bf16:

    • f32 & f32
    • f16 & f16
    • f32 & f16
    • s8 & s8
    • u8 & s8

    The dst dtype can be one of f32, f16, s8, u8, or s32; the conversion happens at the end of each output scalar computation.

  • Memory Layouts: Source and destination tensors must be nchw or nhwc, and weight tensors must be oihw. Note that computation is performed in nhwc layout, so nchw src and dst tensors must be reordered first. The reorder routines are provided in rvv_convolution_utils.hpp.

  • Vectorization: The kernel vectorizes along the input-channel dimension rather than the h/w dimensions for better performance.

  • Compute Kernel: After vectorization, data goes into rvv_dot_ic_fwd_*(src_dtype)_*(wei_dtype), which computes a dot product to produce each output scalar; the result then goes into finalize_conv_acc, which applies the bias, scales, and post-ops.

  • Post-Ops: The newest rvv_postops is integrated to support the ReLU post-op and scale multiplication.

  • Parallelization: By adopting the parallel methods provided by the oneDNN API, including balance211, parallel_nd, and parallel, the implementation combines RVV-vectorized inner loops with multi-core CPU parallelism.

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

All experiments are performed on the LRW platform. We compare three methods: the first baseline, a scalar implementation; the second baseline, compiler auto-vectorization; and our RVV intrinsic implementation.

The test uses benchdnn with the arguments of one of the layers in VGG11:

./benchdnn --stag=abcd --wtag=abcd --dtag=abcd --attr-post-ops=ReLU mb64ic256ih7oc160oh7kh3ph1n"vgg_11:conv5_1"

  • In the training process with extra arguments --dir=FWD_D --dt=f32:f32:f32, experimental results are as follows:

    | Methods            | Perf. Time (s) | Speedups | Total Time (s) |
    |--------------------|----------------|----------|----------------|
    | scalar             | 0.65           | 1.48 ×   | 1.79           |
    | auto vectorization | 0.44           | 1 ×      | 6.78%          |
    | RVV intrinsic      | 2.32           | -        | 3.62           |
  • In the inference process with extra arguments --dir=FWD_I --dt=u8:s8:s32, experimental results are as follows:

    | Methods            | Perf. Time (s) | Speedups | Total Time (s) |
    |--------------------|----------------|----------|----------------|
    | scalar             | 16.41          | 56.59 ×  | 17.07          |
    | auto vectorization | 25.47          | 87.83 ×  | 25.91          |
    | RVV intrinsic      | 0.29           | -        | 0.68           |

Note that the total time comprises creation, filling, execution, reference computation, and comparison, where the last two are for correctness verification using scalar computation. For a fair comparison, we subtract the reference-computation and comparison time to obtain the performance time.

New features

  • [N/A] Have you published an RFC for the new feature?
  • [N/A] Was the RFC approved?
  • [N/A] Have you added relevant tests?

Bug fixes

  • [N/A] Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • [N/A] Have you added relevant regression tests?

RFC PR

  • [N/A] Does RFC document follow the template?
  • [N/A] Have you added a link to the rendered document?

@zhangjian29
Contributor Author

Hey guys, I pushed an updated version with key changes:

Key Changes

  1. Removed templatization of reordering and packing memory methods.
  2. Revised #ifdef.

Evaluation

I rebuilt and retested using RISC-V GNU toolchain version 14.2, verifying the functionality under the QEMU RISCV64 emulator. I used shapes_vgg_11 to test my implementation with the command ONEDNN_VERBOSE=1 ./benchdnn --conv --batch=test_conv_vgg11, which passed all cases successfully.

File test_conv_vgg11 includes tests:

--reset

--alg=direct
--stag=abcd
--wtag=abcd
--dtag=abcd

# Training forward
--dir=FWD_D
--dt=f32
--attr-post-ops=relu
--batch=shapes_vgg_11

# Inference
--dir=FWD_I
--dt=u8:s8:s32
--attr-post-ops=relu
--batch=shapes_vgg_11

Calls to the implemented rvv_convolution can be traced by searching for RISCV64GCV in:

Thanks!

@zhangjian29 force-pushed the add-rvv-convolution-integration branch from 2563c11 to a66105d on October 3, 2025, 23:54
@zhangjian29 force-pushed the add-rvv-convolution-integration branch from a66105d to e009e2a on October 9, 2025, 07:54
@zhangjian29 deleted the add-rvv-convolution-integration branch on February 24, 2026, 07:06
