
cpu: risc-v: convolution: Vectorize post-ops with RVV intrinsics#3852

Merged
vpirogov merged 5 commits into uxlfoundation:main from xiazhuozhao:rvv_gemm_conv on Oct 17, 2025

Conversation

@xiazhuozhao
Contributor

@xiazhuozhao xiazhuozhao commented Sep 1, 2025

Description

I'm opening this Draft pull request to get early feedback on the necessity and direction of this change. The full ctest validation is currently in progress and will take some time.

On the RISC-V platform, the GEMM-based convolution is currently the most performant implementation. Its execution time can be broken down into three main parts: im2col, the GEMM computation, and the post-processing stage.

The optimization for the GEMM stage using RISC-V Vector (RVV) extensions has already been contributed in PR #3785, which effectively improved the performance of convolution and deconvolution primitives.

However, the im2col and post-processing stages remain performance bottlenecks, as they are not yet effectively vectorized. This PR focuses on the post-processing stage, which currently relies on OpenMP auto-vectorization (PRAGMA_OMP_SIMD()). That feature is not yet supported on our target RVV platform, making manual vectorization necessary for further performance improvement.

This PR addresses the post-processing bottleneck by replacing the OpenMP pragmas with explicit, manual vectorization using RVV intrinsics for bias addition and ReLU activation loops. Furthermore, this change provides a framework that will facilitate future RVV-based optimizations of the im2col process.
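As a rough illustration of the approach described above (not the PR's actual code), a strip-mined RVV loop that fuses bias addition and ReLU over a contiguous span of f32 outputs sharing one bias value might look like the sketch below. It uses the standard v1.0 RVV C intrinsics; the function name and the LMUL=8 choice are illustrative assumptions, and it requires an RVV-enabled toolchain and target to compile and run.

```c
#include <riscv_vector.h>
#include <stddef.h>

/* Illustrative sketch: fused bias-add + ReLU over `len` contiguous f32
 * outputs that all share one bias value. The loop is strip-mined with
 * vsetvl, so the tail is handled automatically by a shorter last pass. */
static void bias_relu_rvv(float *dst, float bias, size_t len) {
    for (size_t i = 0; i < len;) {
        size_t vl = __riscv_vsetvl_e32m8(len - i);   /* elements this pass */
        vfloat32m8_t v = __riscv_vle32_v_f32m8(dst + i, vl);
        v = __riscv_vfadd_vf_f32m8(v, bias, vl);     /* x + bias */
        v = __riscv_vfmax_vf_f32m8(v, 0.0f, vl);     /* ReLU: max(x, 0) */
        __riscv_vse32_v_f32m8(v, dst + i, vl);
        i += vl;
    }
}
```

The vector-scalar forms (`vfadd_vf`, `vfmax_vf`) broadcast the bias and the ReLU threshold from a scalar, so no separate bias vector needs to be loaded inside the loop.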

Checklist

General

  • Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

@xiazhuozhao
Contributor Author

Hi team,

This PR is now ready for review.

I've implemented RVV (RISC-V Vector Extension) optimizations for the post-operations in the GEMM-based convolution primitive. The primary focus was on vectorizing the bias addition and ReLU activation functions for both NSPC (nhwc) and NCSP (nchw) layouts.

The latest commit optimizes the ReLU implementation for NCSP and reverts the NSPC implementation back to a non-vectorized version due to its poor performance.

Performance data: rvv_gemmconv_benchmark.log

NSPC (nhwc) Layout

| Problem | Operation | Original Time (ns) | RVV Time (ns) | Speedup |
|---|---|---|---|---|
| ic16ih32iw32_oc32_kh3kw3 | NSPC Bias Add | 332,196 | 246,224 | 1.35x |
| ic64ih128iw128_oc128_kh3kw3 | NSPC Bias Add | 12,233,900 | 8,926,369 | 1.37x |
| ic128ih224iw224_oc256_kh3kw3 | NSPC Bias Add | 89,632,009 | 63,978,210 | 1.40x |
| ic3ih512iw512_oc64_kh3kw3 | NSPC Bias Add | 98,410,925 | 80,318,324 | 1.23x |
| ic512ih16iw16_oc512_kh3kw3 | NSPC Bias Add | 642,679 | 351,849 | 1.83x |

NCSP (nchw) Layout

| Problem | Operation | Original Time (ns) | RVV Time (ns) | Speedup |
|---|---|---|---|---|
| ic16ih32iw32_oc32_kh3kw3 | NCSP Bias Add | 322,047 | 117,207 | 2.75x |
| ic16ih32iw32_oc32_kh3kw3 | NCSP ReLU (FWD_D) | 177,712 | 159,211 | 1.12x |
| ic16ih32iw32_oc32_kh3kw3 | NCSP ReLU (FWD_B) | 177,452 | 161,348 | 1.10x |
| ic64ih128iw128_oc128_kh3kw3 | NCSP Bias Add | 88,204,332 | 12,359,496 | 7.14x |
| ic64ih128iw128_oc128_kh3kw3 | NCSP ReLU (FWD_D) | 23,530,782 | 14,668,514 | 1.60x |
| ic64ih128iw128_oc128_kh3kw3 | NCSP ReLU (FWD_B) | 23,139,237 | 14,514,854 | 1.59x |
| ic128ih224iw224_oc256_kh3kw3 | NCSP Bias Add | 432,921,852 | 70,840,021 | 6.11x |
| ic128ih224iw224_oc256_kh3kw3 | NCSP ReLU (FWD_D) | 103,647,810 | 85,090,423 | 1.22x |
| ic128ih224iw224_oc256_kh3kw3 | NCSP ReLU (FWD_B) | 105,404,787 | 83,487,036 | 1.26x |
| ic3ih512iw512_oc64_kh3kw3 | NCSP Bias Add | 127,033,289 | 61,843,702 | 2.05x |
| ic3ih512iw512_oc64_kh3kw3 | NCSP ReLU (FWD_D) | 95,887,737 | 69,365,254 | 1.38x |
| ic3ih512iw512_oc64_kh3kw3 | NCSP ReLU (FWD_B) | 96,720,147 | 75,895,192 | 1.27x |
| ic512ih16iw16_oc512_kh3kw3 | NCSP Bias Add | 1,024,659 | 554,340 | 1.85x |
| ic512ih16iw16_oc512_kh3kw3 | NCSP ReLU (FWD_D) | 795,820 | 686,400 | 1.16x |
| ic512ih16iw16_oc512_kh3kw3 | NCSP ReLU (FWD_B) | 780,216 | 694,031 | 1.12x |

Note: The speedup data shown are only for the accelerated parts (i.e., ReLU and Bias Add), not the entire operation. The acceleration for the GEMM-based convolution operator itself is addressed in #3785.
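To illustrate why the two layouts behave so differently (this is a scalar reference sketch, not the PR's code), the loops below show each bias-add access pattern. In NCSP, each channel occupies a contiguous span of spatial points sharing a single bias scalar, which maps naturally onto vector-scalar RVV operations; in NSPC, channels are interleaved at every spatial point, so a bias vector must be kept live or reloaded across the inner loop. The function names are hypothetical.

```c
/* NCSP (nchw): channel-major; dst[c*sp + s] for c in [0,oc), s in [0,sp).
 * The inner loop runs over a contiguous span with one shared bias value. */
void bias_ncsp(float *dst, const float *bias, int oc, int sp) {
    for (int c = 0; c < oc; ++c)
        for (int s = 0; s < sp; ++s)
            dst[c * sp + s] += bias[c];
}

/* NSPC (nhwc): channel-minor; dst[s*oc + c]. Contiguous memory now walks
 * across channels, so the bias index changes every element. */
void bias_nspc(float *dst, const float *bias, int oc, int sp) {
    for (int s = 0; s < sp; ++s)
        for (int c = 0; c < oc; ++c)
            dst[s * oc + c] += bias[c];
}
```

For example, with oc = 2, sp = 3, and bias = {1, 2} applied to zeroed buffers, bias_ncsp produces {1, 1, 1, 2, 2, 2} while bias_nspc produces {1, 2, 1, 2, 1, 2}.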

@xiazhuozhao xiazhuozhao changed the title cpu: risc-v: convolution: Vectorize post-ops with RVV intrinsics [WIP] cpu: risc-v: convolution: Vectorize post-ops with RVV intrinsics Sep 10, 2025
@zhangjian29
Contributor

Nicely done! I wanted to share some thoughts regarding the integration of ReLU intrinsics for future extensibility with additional post-operations.

I noticed that a version of rvv_postops.hpp has already been merged, and I have opened two PRs, #3898 (rvv_eltwise) and #3899 (rvv_binary), which are currently pending review. These will be integrated into the existing rvv_postops framework in the near future.

The current supported features include:

  • Eltwise ops​​: ReLU, square, abs, sqrt, linear, clip, hardsigmoid, hardswish.
  • Binary ops​​: add, div, max, min, mul, sub, ge, gt, le, lt, eq, ne
  • Data types​​: f32, f16, s32, s8, u8

Given this, I would suggest using our rvv_eltwise and rvv_binary for post-ops via rvv_postops in your next version, for a more unified and extensible post-op implementation.

Thank you for considering this suggestion!

@xiazhuozhao
Contributor Author

> Given this, I would suggest using our rvv_eltwise and rvv_binary for post-ops via rvv_postops in your next version, for a more unified and extensible post-op implementation.

Thank you for the heads-up.

This PR not only provides post-op acceleration but also establishes a framework for RVV GEMM-based convolution. I will continue to work on im2col acceleration in the future, so this change remains important. Once your PRs are approved, we can both apply those operators in follow-up PRs.

@xiazhuozhao xiazhuozhao force-pushed the rvv_gemm_conv branch 2 times, most recently from 8e33bc2 to 7e74687 on September 15, 2025
@vpirogov vpirogov removed the request for review from dzarukin October 2, 2025 23:10
@vpirogov
Contributor

vpirogov commented Oct 7, 2025

Please address CI fails.

@xiazhuozhao xiazhuozhao marked this pull request as draft October 7, 2025 17:25
@xiazhuozhao
Contributor Author

> Please address CI fails.

Thank you very much for your reminder. Due to the current public holiday in China, our verification servers are temporarily offline. Therefore, I am working on resolving this issue locally, which may take longer than usual. I sincerely apologize for any inconvenience this may cause.

@xiazhuozhao xiazhuozhao marked this pull request as ready for review October 15, 2025 12:39
@xiazhuozhao
Contributor Author

Hi everyone,

This issue is now fixed.

The previous change to the post_ops_ok check for this op caused the graph compiler to fuse a pattern containing an element-wise binary operation, a mode our kernel implementation does not yet support. In the failing test, the symptom appeared as if the final Add operation were being skipped.

I have now corrected these necessary checks, and all tests are passing. Thanks a lot for all your help!

@xiazhuozhao xiazhuozhao requested a review from dzarukin October 17, 2025 16:57
@vpirogov vpirogov merged commit dca0b6c into uxlfoundation:main Oct 17, 2025
28 checks passed