Add RVV (RISC-V Vector Extension) optimized convolution and pooling kernels for the NCHWc blocked format in MLAS by velonica0 · Pull Request #28411 · microsoft/onnxruntime

velonica0 · 2026-05-08T10:06:46Z

New kernel files:

riscv64/sconv_depthwise_kernel_rvv.cpp — RVV-optimized 3x3 stride-1 depthwise convolution (NCHW format), replacing the MLAS_FLOAT32X4 generic vectorized version
riscv64/sconv_nchwc_kernel_rvv.cpp — 7 NCHWc kernels using vfloat32m4_t (LMUL=4, BlockSize=16):
- Direct NCHW conv (MlasConvNchwFloatKernelRvv)
- Direct NCHWc conv (MlasConvNchwcFloatKernelRvv)
- Depthwise NCHWc conv (MlasConvDepthwiseFloatKernelRvv)
- Pointwise NCHWc conv (MlasConvPointwiseFloatKernelRvv)
- Max/AvgExcludePad/AvgIncludePad pooling
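
For readers unfamiliar with the blocked format, here is a minimal sketch of how an element is addressed in an NCHWc tensor with BlockSize=16. It assumes the usual layout [N, C/BlockSize, H, W, BlockSize] with the channel block innermost. `NchwcShape`, `NchwcOffset`, and `kNchwcBlock` are hypothetical names used only for illustration and are not part of MLAS or of this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration only: these names are not MLAS APIs.
constexpr std::size_t kNchwcBlock = 16;  // BlockSize stated in the PR

// Logical tensor shape; c is assumed to be a multiple of kNchwcBlock.
struct NchwcShape {
    std::size_t n, c, h, w;
};

// Flat float offset of logical element (n, c, h, w) in a buffer laid out as
// [N][C / kNchwcBlock][H][W][kNchwcBlock].
constexpr std::size_t NchwcOffset(const NchwcShape& s, std::size_t n,
                                  std::size_t c, std::size_t h, std::size_t w) {
    return ((((n * (s.c / kNchwcBlock) + c / kNchwcBlock) * s.h + h) * s.w + w) *
            kNchwcBlock) +
           c % kNchwcBlock;
}
```

Under this layout, the 16 channels of one block at a given (h, w) position are contiguous in memory, which is what would let a kernel load them with a single unit-stride vector load.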

Following #28261, this PR optimizes more MLAS kernels using the RISC-V Vector (RVV) extension.

Please note:

On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, a vector register group can hold (256/32) * 4 = 32 floats, but we only request 16, so we use half of the available vector width.
The reason is that BlockSize=16 is baked into the NCHWc data layout across the whole framework (matching ARM64 NEON). Changing it to 32 would require a different NCHWc format and is not a localized change.
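
The lane arithmetic above can be checked with a small compile-time sketch. `LanesPerGroup` and `ActiveLanes` are hypothetical helper names introduced here for illustration. They are not RVV intrinsics or MLAS functions, and the VLEN=128 case is included only as an extra data point, not as a configuration tested in this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration; not part of MLAS or the RVV intrinsics.
constexpr std::size_t kBlockSize = 16;  // NCHWc block size stated in the PR

// Elements of width sew_bits that fit in one register group of lmul registers,
// each vlen_bits wide.
constexpr std::size_t LanesPerGroup(std::size_t vlen_bits, std::size_t sew_bits,
                                    std::size_t lmul) {
    return (vlen_bits / sew_bits) * lmul;
}

// Float lanes actually used when a kernel requests kBlockSize elements
// from an e32 register group.
constexpr std::size_t ActiveLanes(std::size_t vlen_bits, std::size_t lmul) {
    return kBlockSize < LanesPerGroup(vlen_bits, 32, lmul)
               ? kBlockSize
               : LanesPerGroup(vlen_bits, 32, lmul);
}

// VLEN=256, e32, LMUL=4: 32 lanes available, 16 used (half the group).
static_assert(LanesPerGroup(256, 32, 4) == 32, "capacity at VLEN=256");
static_assert(ActiveLanes(256, 4) == 16, "lanes used at VLEN=256");
// VLEN=128, e32, LMUL=4: 16 lanes available, so the block fills the group.
static_assert(LanesPerGroup(128, 32, 4) == 16, "capacity at VLEN=128");
```

By the same arithmetic, a 16-float block would exactly fill an LMUL=4 group on a VLEN=128 core, so the half-width utilization is specific to wider implementations such as VLEN=256.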

All tests pass with zero numerical error.

velonica0 · 2026-05-08T10:11:32Z

Hi @hariharans29
Could you please take a look at this PR when you have a moment? I’d really appreciate your help.

convolution and pooling kernels

99f3914
