cpu: rv64: add support for rvv convolution feature (#3915)
Closed · zhangjian29 wants to merge 5 commits into uxlfoundation:main
Conversation
zhangjian29 (Contributor, Author) commented:
Hey guys, I pushed an updated version with key changes.

Evaluation: I rebuilt and retested using the RISC-V GNU toolchain version 14.2, verifying the functionality under the QEMU RISCV64 emulator. Thanks!
Description
This PR introduces optimized convolution kernels for RISC-V architectures using RVV (RISC-V Vector) intrinsics. The `rvv_convolution` implementation directly vectorizes convolution sources and kernels by broadcasting along the channel dimension, achieving significant performance improvements compared to scalar and auto-vectorized implementations. This initial version provides:

- support for the `f32`, `f16`, `s32`, `s8` and `u8` data types;
- `nchw` or `nhwc` memory layouts for sources, and the `oihw` layout for weights;
- `rvv_postops` to support the `ReLU` post operation and scale multiplication.

We've noticed that a draft PR #3852 for `rvv_gemm_conv` has been in progress, which uses a different method to vectorize convolution computation for the RVV architecture. We would like to highlight the benefits of our `rvv_convolution` implementation here. The `rvv_gemm_conv` implementation requires a large amount of memory to buffer data when using the `im2col` method, which transforms a convolution image into a large matrix to leverage matrix-multiplication acceleration. In contrast, this `rvv_convolution` implementation vectorizes a convolution image along its input-channel dimension without transforming or buffering matrices. This makes `rvv_convolution` more generic thanks to its minimal memory overhead, which makes it worth implementing.

Key Features
Data Types: Supported `src_dtype` and `wei_dtype` combinations match the cases in the reference design minus the `bf16` dtype: `f32`&`f32`, `f16`&`f16`, `f32`&`f16`, `s8`&`s8`, `u8`&`s8`. The dst dtype can be one of `f32`, `f16`, `s8`, `u8`, `s32`; the conversion happens at the end of each output-scalar computation.

Memory Layouts: Source and destination tensors must be `nchw` or `nhwc`, and weight tensors must use the `oihw` layout. Note that computations are performed in the `nhwc` layout, which means that `nchw` src and dst tensors must be reordered. The reorder methods are provided in `rvv_convolution_utils.hpp`.

Vectorization: The implementation vectorizes along the input-channel dimension instead of the h/w dimensions for better performance.
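To make the layout argument concrete, here is a hedged scalar sketch of the per-output-point computation that the RVV kernel vectorizes. This is not the PR's actual kernel; the function name `conv_point_nhwc` and the unit-stride/no-padding simplification are illustrative assumptions. The point is that in `nhwc` layout the innermost `ic` loop walks contiguous memory, which maps directly onto unit-stride vector loads plus a reduction.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar equivalent of one output point of the convolution
// (stride 1, no padding, for illustration only). In nhwc layout the
// innermost ic loop reads contiguous memory, which is what the RVV
// kernel turns into unit-stride vector loads and a dot-product reduction.
float conv_point_nhwc(const std::vector<float> &src,  // [N, IH, IW, IC], nhwc
                      const std::vector<float> &wei,  // [OC, IC, KH, KW], oihw
                      std::size_t IH, std::size_t IW, std::size_t IC,
                      std::size_t KH, std::size_t KW,
                      std::size_t n, std::size_t oc,
                      std::size_t oh, std::size_t ow) {
    float acc = 0.f;
    for (std::size_t kh = 0; kh < KH; ++kh)
        for (std::size_t kw = 0; kw < KW; ++kw) {
            std::size_t ih = oh + kh, iw = ow + kw;
            // Pointer to a contiguous run of IC source values.
            const float *s = &src[((n * IH + ih) * IW + iw) * IC];
            for (std::size_t ic = 0; ic < IC; ++ic)
                acc += s[ic] * wei[((oc * IC + ic) * KH + kh) * KW + kw];
        }
    return acc;
}
```

In `nchw`, the same `ic` loop would have stride `IH * IW`, which is why the implementation reorders `nchw` tensors to `nhwc` before computing.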
Compute Kernel: After vectorization, data goes into `rvv_dot_ic_fwd_*(src_dtype)_*(wei_dtype)`, which computes a dot product to obtain the output scalar result; that result then goes into `finalize_conv_acc` to apply bias, scales and post ops.

Post-Ops: The newest `rvv_postops` is integrated to support the `ReLU` post operation and scale multiplication.

Parallelization: By adopting the parallel methods provided by the oneDNN API, including `balance211`, `parallel_nd` and `parallel`, the implementation combines RVV-vectorized inner loops with multi-core CPU parallelism.

Checklist
General

- Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?

Performance improvements
All experiments are performed on the LRW platform. We compare a first baseline of the scalar implementation, a second baseline of compiler auto-vectorization, and our RVV-intrinsic implementation.

The test uses `benchdnn` with the arguments of one of the layers in VGG11:

```
./benchdnn --stag=abcd --wtag=abcd --dtag=abcd --attr-post-ops=ReLU \
    mb64ic256ih7oc160oh7kh3ph1n"vgg_11:conv5_1"
```

With `--dir=FWD_D --dt=f32:f32:f32`, experimental results are as follows:

With `--dir=FWD_I --dt=u8:s8:s32`, experimental results are as follows:

Note that the total time comprises creation, filling, execution, reference computation, and comparison, where the last two are for correctness verification using scalar computation. For fair comparison, we subtract the reference-computation and comparison time to obtain the performance time.
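For context on the multi-core scaling behind these numbers: `balance211` in oneDNN splits `n` jobs as evenly as possible across a team of workers. The sketch below is a minimal reimplementation of that idea, not oneDNN's actual source; the name `balance211_sketch` and the exact signature are assumptions for illustration.

```cpp
#include <cstddef>

// Minimal sketch of balance211-style work partitioning (an approximation
// of the oneDNN helper, not its actual code): jobs are split as evenly as
// possible, with the first n % team workers taking one extra job each, so
// per-worker counts differ by at most one.
void balance211_sketch(std::size_t n, std::size_t team, std::size_t tid,
                       std::size_t &start, std::size_t &end) {
    std::size_t chunk = n / team; // base jobs per worker
    std::size_t rem = n % team;   // workers that get one extra job
    start = tid * chunk + (tid < rem ? tid : rem);
    end = start + chunk + (tid < rem ? 1 : 0);
}
```

For example, 10 jobs over 4 workers yield ranges of sizes 3, 3, 2, 2, which keeps threads evenly loaded when the outer convolution loops are parallelized.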
New features
Bug fixes
RFC PR