
cpu: risc-v: convolution: Vectorize post-ops with RVV intrinsics#3852

Merged
vpirogov merged 5 commits into uxlfoundation:main from xiazhuozhao:rvv_gemm_conv on Oct 17, 2025

Conversation

@xiazhuozhao
Contributor

@xiazhuozhao xiazhuozhao commented Sep 1, 2025

Description

I'm opening this Draft pull request to get early feedback on the necessity and direction of this change. The full ctest validation is currently in progress and will take some time.

On the RISC-V platform, the GEMM-based convolution is currently the most performant implementation. Its execution time can be broken down into three main parts: im2col, the GEMM computation, and the post-processing stage.

The optimization for the GEMM stage using RISC-V Vector (RVV) extensions has already been contributed in PR #3785, which effectively improved the performance of convolution and deconvolution primitives.

However, the im2col and post-processing stages remain performance bottlenecks, as they are not yet effectively vectorized. This PR focuses on the post-processing stage, which currently relies on OpenMP auto-vectorization (PRAGMA_OMP_SIMD()). That feature is not yet supported on our target RVV platform, making manual vectorization necessary for further performance improvement.

This PR addresses the post-processing bottleneck by replacing the OpenMP pragmas with explicit, manual vectorization using RVV intrinsics for bias addition and ReLU activation loops. Furthermore, this change provides a framework that will facilitate future RVV-based optimizations of the im2col process.
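As a rough illustration of the approach described above (not the PR's actual code), a strip-mined RVV loop that fuses bias addition and ReLU over a contiguous span of f32 outputs sharing one bias value might look like the sketch below. It uses the standard v1.0 RVV C intrinsics; the function name and the LMUL=8 choice are illustrative assumptions, and it requires an RVV-enabled toolchain and target to compile and run.

```c
#include <riscv_vector.h>
#include <stddef.h>

/* Illustrative sketch: fused bias-add + ReLU over `len` contiguous f32
 * outputs that all share one bias value. The loop is strip-mined with
 * vsetvl, so the tail is handled automatically by a shorter last pass. */
static void bias_relu_rvv(float *dst, float bias, size_t len) {
    for (size_t i = 0; i < len;) {
        size_t vl = __riscv_vsetvl_e32m8(len - i);   /* elements this pass */
        vfloat32m8_t v = __riscv_vle32_v_f32m8(dst + i, vl);
        v = __riscv_vfadd_vf_f32m8(v, bias, vl);     /* x + bias */
        v = __riscv_vfmax_vf_f32m8(v, 0.0f, vl);     /* ReLU: max(x, 0) */
        __riscv_vse32_v_f32m8(v, dst + i, vl);
        i += vl;
    }
}
```

The vector-scalar forms (`vfadd_vf`, `vfmax_vf`) broadcast the bias and the ReLU threshold from a scalar, so no separate bias vector needs to be loaded inside the loop.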

Checklist

General

  • Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

@xiazhuozhao
Contributor Author

Hi team,

This PR is now ready for review.

I've implemented RVV (RISC-V Vector Extension) optimizations for the post-operations in the GEMM-based convolution primitive. The primary focus was on vectorizing the bias addition and ReLU activation functions for both NSPC (nhwc) and NCSP (nchw) layouts.

The latest commit optimizes the ReLU implementation for NCSP and reverts the NSPC implementation back to a non-vectorized version due to its poor performance.

Performance data: rvv_gemmconv_benchmark.log

NSPC (nhwc) Layout

| Problem | Operation | Original Time (ns) | RVV Time (ns) | Speedup |
|---|---|---|---|---|
| ic16ih32iw32_oc32_kh3kw3 | NSPC Bias Add | 332,196 | 246,224 | 1.35x |
| ic64ih128iw128_oc128_kh3kw3 | NSPC Bias Add | 12,233,900 | 8,926,369 | 1.37x |
| ic128ih224iw224_oc256_kh3kw3 | NSPC Bias Add | 89,632,009 | 63,978,210 | 1.40x |
| ic3ih512iw512_oc64_kh3kw3 | NSPC Bias Add | 98,410,925 | 80,318,324 | 1.23x |
| ic512ih16iw16_oc512_kh3kw3 | NSPC Bias Add | 642,679 | 351,849 | 1.83x |

NCSP (nchw) Layout

| Problem | Operation | Original Time (ns) | RVV Time (ns) | Speedup |
|---|---|---|---|---|
| ic16ih32iw32_oc32_kh3kw3 | NCSP Bias Add | 322,047 | 117,207 | 2.75x |
| ic16ih32iw32_oc32_kh3kw3 | NCSP ReLU (FWD_D) | 177,712 | 159,211 | 1.12x |
| ic16ih32iw32_oc32_kh3kw3 | NCSP ReLU (FWD_B) | 177,452 | 161,348 | 1.10x |
| ic64ih128iw128_oc128_kh3kw3 | NCSP Bias Add | 88,204,332 | 12,359,496 | 7.14x |
| ic64ih128iw128_oc128_kh3kw3 | NCSP ReLU (FWD_D) | 23,530,782 | 14,668,514 | 1.60x |
| ic64ih128iw128_oc128_kh3kw3 | NCSP ReLU (FWD_B) | 23,139,237 | 14,514,854 | 1.59x |
| ic128ih224iw224_oc256_kh3kw3 | NCSP Bias Add | 432,921,852 | 70,840,021 | 6.11x |
| ic128ih224iw224_oc256_kh3kw3 | NCSP ReLU (FWD_D) | 103,647,810 | 85,090,423 | 1.22x |
| ic128ih224iw224_oc256_kh3kw3 | NCSP ReLU (FWD_B) | 105,404,787 | 83,487,036 | 1.26x |
| ic3ih512iw512_oc64_kh3kw3 | NCSP Bias Add | 127,033,289 | 61,843,702 | 2.05x |
| ic3ih512iw512_oc64_kh3kw3 | NCSP ReLU (FWD_D) | 95,887,737 | 69,365,254 | 1.38x |
| ic3ih512iw512_oc64_kh3kw3 | NCSP ReLU (FWD_B) | 96,720,147 | 75,895,192 | 1.27x |
| ic512ih16iw16_oc512_kh3kw3 | NCSP Bias Add | 1,024,659 | 554,340 | 1.85x |
| ic512ih16iw16_oc512_kh3kw3 | NCSP ReLU (FWD_D) | 795,820 | 686,400 | 1.16x |
| ic512ih16iw16_oc512_kh3kw3 | NCSP ReLU (FWD_B) | 780,216 | 694,031 | 1.12x |

Note: The speedup data shown are only for the accelerated parts (i.e., ReLU and Bias Add), not the entire operation. The acceleration for the GEMM-based convolution operator itself is addressed in #3785.
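To illustrate why the two layouts behave so differently (this is a scalar reference sketch, not the PR's code), the loops below show each bias-add access pattern. In NCSP, each channel occupies a contiguous span of spatial points sharing a single bias scalar, which maps naturally onto vector-scalar RVV operations; in NSPC, channels are interleaved at every spatial point, so a bias vector must be kept live or reloaded across the inner loop. The function names are hypothetical.

```c
/* NCSP (nchw): channel-major; dst[c*sp + s] for c in [0,oc), s in [0,sp).
 * The inner loop runs over a contiguous span with one shared bias value. */
void bias_ncsp(float *dst, const float *bias, int oc, int sp) {
    for (int c = 0; c < oc; ++c)
        for (int s = 0; s < sp; ++s)
            dst[c * sp + s] += bias[c];
}

/* NSPC (nhwc): channel-minor; dst[s*oc + c]. Contiguous memory now walks
 * across channels, so the bias index changes every element. */
void bias_nspc(float *dst, const float *bias, int oc, int sp) {
    for (int s = 0; s < sp; ++s)
        for (int c = 0; c < oc; ++c)
            dst[s * oc + c] += bias[c];
}
```

For example, with oc = 2, sp = 3, and bias = {1, 2} applied to zeroed buffers, bias_ncsp produces {1, 1, 1, 2, 2, 2} while bias_nspc produces {1, 2, 1, 2, 1, 2}.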

@xiazhuozhao xiazhuozhao changed the title cpu: risc-v: convolution: Vectorize post-ops with RVV intrinsics [WIP] cpu: risc-v: convolution: Vectorize post-ops with RVV intrinsics Sep 10, 2025
@zhangjian29
Contributor

Nicely done! I wanted to share some thoughts regarding the integration of ReLU intrinsics for future extensibility with additional post-operations.

I noticed that a version of rvv_postops.hpp has already been merged, and I have opened two PRs, #3898 (rvv_eltwise) and #3899 (rvv_binary), which are currently pending review. These will be integrated into the existing rvv_postops framework in the near future.

The current supported features include:

  • Eltwise ops​​: ReLU, square, abs, sqrt, linear, clip, hardsigmoid, hardswish.
  • Binary ops​​: add, div, max, min, mul, sub, ge, gt, le, lt, eq, ne
  • Data types​​: f32, f16, s32, s8, u8

Given this, I would suggest using our rvv_eltwise and rvv_binary for post-ops via rvv_postops in your next version, for a more unified and extensible post-op implementation.

Thank you for considering this suggestion!

@xiazhuozhao
Contributor Author

> Given this, I would suggest using our rvv_eltwise and rvv_binary for post-ops via rvv_postops in your next version, for a more unified and extensible post-op implementation.

Thank you for the heads-up.

This PR not only provides post-op acceleration but also establishes a framework for RVV GEMM-based convolution. I will continue to work on im2col acceleration in the future, so this change remains important. Once your PRs are approved, we can both apply those operators in follow-up PRs.

@xiazhuozhao xiazhuozhao force-pushed the rvv_gemm_conv branch 2 times, most recently from 8e33bc2 to 7e74687 on September 15, 2025
@vpirogov vpirogov removed the request for review from dzarukin October 2, 2025 23:10
@vpirogov
Contributor

vpirogov commented Oct 7, 2025

Please address CI fails.

@xiazhuozhao xiazhuozhao marked this pull request as draft October 7, 2025 17:25
@xiazhuozhao
Contributor Author

> Please address CI fails.

Thank you very much for your reminder. Due to the current public holiday in China, our verification servers are temporarily offline. Therefore, I am working on resolving this issue locally, which may take longer than usual. I sincerely apologize for any inconvenience this may cause.

@xiazhuozhao xiazhuozhao marked this pull request as ready for review October 15, 2025 12:39
@xiazhuozhao
Contributor Author

Hi everyone,

This issue is now fixed.

The previous change to the post_ops_ok check for this op caused the graph compiler to fuse a pattern containing an element-wise binary operation, a mode our kernel implementation does not yet support. In the failing test, the symptom appeared as if the final Add operation were being skipped.

I have now corrected these necessary checks, and all tests are passing. Thanks a lot for all your help!

@xiazhuozhao xiazhuozhao requested a review from dzarukin October 17, 2025 16:57
@vpirogov vpirogov merged commit dca0b6c into uxlfoundation:main Oct 17, 2025
28 checks passed