
Conversation

@Rohanjames1997
Contributor

Description

This PR implements optimized Arm NEON kernels for NCHWc (NCHW with channel blocking) convolution and pooling operations in MLAS, significantly improving performance on Arm64 platforms.
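For context, in the NCHWc layout the channel dimension is split into blocks (16 on this ARM64 path, per the block size registered in platform.cpp) and the block becomes the innermost dimension. Below is a minimal indexing sketch; the function name and signature are hypothetical and not taken from this PR:

```cpp
#include <cstddef>

// Hypothetical helper (not part of the PR): flat offset of element (n, c, h, w)
// in an NCHWc tensor laid out as [N][C / BlockSize][H][W][BlockSize].
constexpr size_t BlockSize = 16;

size_t NchwcOffset(size_t n, size_t c, size_t h, size_t w,
                   size_t Channels, size_t Height, size_t Width) {
    const size_t BlockedChannels = (Channels + BlockSize - 1) / BlockSize;
    const size_t cb = c / BlockSize;   // which channel block
    const size_t ci = c % BlockSize;   // position within the block
    return (((n * BlockedChannels + cb) * Height + h) * Width + w) * BlockSize + ci;
}
```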

Motivation and Context

Fixes #24790

The new NCHWc kernels improve performance by 5-6x, depending on the thread count, model, and other configuration.
For example, here is the performance gain observed during mobilenet inference; note the "Number of inferences per second" line (93 inf/s -> 498 inf/s).

System configuration
Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   64
  On-line CPU(s) list:    0-63
Vendor ID:                ARM
  Model name:             Neoverse-V2
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   64
    Socket(s):            1
    Stepping:             r0p1
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp 
                          sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
Caches (sum of all):      
  L1d:                    4 MiB (64 instances)
  L1i:                    4 MiB (64 instances)
  L2:                     128 MiB (64 instances)
  L3:                     36 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-63
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
Perf with current upstream kernels
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx

Setting intra_op_num_threads to 32
Session creation time cost: 0.0238608 s
First inference time cost: 11 ms
Total inference time cost: 10.7458 s
Total inference requests: 1000
Average inference time cost: 10.7458 ms
Total inference run time: 10.7465 s
Number of inferences per second: 93.0534 
Avg CPU usage: 50 %
Peak working set size: 70410240 bytes
Avg CPU usage:50
Peak working set size:70410240
Runs:1000
Min Latency: 0.0106707 s
Max Latency: 0.0113617 s
P50 Latency: 0.0107453 s
P90 Latency: 0.0107695 s
P95 Latency: 0.0107785 s
P99 Latency: 0.0107965 s
P999 Latency: 0.0113617 s
Perf with NCHWc kernels
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx

Setting intra_op_num_threads to 32
Session creation time cost: 0.0358121 s
First inference time cost: 2 ms
Total inference time cost: 2.00561 s
Total inference requests: 1000
Average inference time cost: 2.00561 ms
Total inference run time: 2.00607 s
Number of inferences per second: 498.488 
Avg CPU usage: 50 %
Peak working set size: 92467200 bytes
Avg CPU usage:50
Peak working set size:92467200
Runs:1000
Min Latency: 0.00198387 s
Max Latency: 0.00204784 s
P50 Latency: 0.00200537 s
P90 Latency: 0.0020155 s
P95 Latency: 0.00201822 s
P99 Latency: 0.0020251 s
P999 Latency: 0.00204784 s

Happy to run further performance tests as required.

@Rohanjames1997
Contributor Author

@microsoft-github-policy-service agree

@Rohanjames1997
Contributor Author

@skottmckay @snnn @yufenglee, I'd appreciate it if the CI could be triggered. Thanks!

hariharans29 pushed a commit that referenced this pull request Sep 2, 2025
### Description
The
[vfmaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vfmaq_f32)
intrinsic compiles to the
[FMLA](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en)
instruction, which is more performant than the separate `fmul`+`fadd`
instructions that
[vmlaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlaq_f32)
compiles to on recent GCC versions: https://godbolt.org/z/aYc9as5Wh
Note that this is not a breaking change: `vmlaq_f32` already compiles to FMLA
instructions on the latest Clang compilers (which are already the default for
macOS ORT builds).


### Motivation and Context
With this change, the NEON version of `MlasMultiplyAddFloat32x4`
achieves parity with the x86 version that uses `_mm_fmadd_ps`.
It also achieves up to ~15% speedups compared to the current `vmlaq_f32`
implementation when tested on top of #25580
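For illustration, a minimal sketch of a NEON multiply-add wrapper in the spirit of `MlasMultiplyAddFloat32x4` (the name and signature here are illustrative, not the exact MLAS code):

```cpp
#include <arm_neon.h>

// Computes Accumulator + Vector1 * Vector2 element-wise for four floats.
float32x4_t MultiplyAddFloat32x4(float32x4_t Vector1,
                                 float32x4_t Vector2,
                                 float32x4_t Accumulator) {
    // vfmaq_f32(a, b, c) returns a + b * c as a single fused FMLA instruction.
    return vfmaq_f32(Accumulator, Vector1, Vector2);
    // Previous form: vmlaq_f32(Accumulator, Vector1, Vector2), which recent GCC
    // versions may lower to separate fmul + fadd instructions.
}
```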
@hariharans29 hariharans29 reopened this Sep 4, 2025
@hariharans29 hariharans29 requested a review from Copilot September 5, 2025 00:10
Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements optimized Arm NEON kernels for NCHWc (NCHW with channel blocking) convolution and pooling operations in MLAS, targeting significant performance improvements on Arm64 platforms. The implementation demonstrates a 5-6x performance improvement in real workloads.

  • Implements NEON-optimized convolution kernels supporting NCHWc and NCHW formats, including pointwise and depthwise variants
  • Adds NEON-optimized pooling kernels for maximum and average pooling (both include/exclude padding modes)
  • Enables ARM64 support for NCHWc kernels by updating conditional compilation and platform initialization

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sconv_kernel_neon.cpp | New NEON convolution kernels with template-based implementation for different formats |
| spool_kernel_neon.cpp | New NEON pooling kernels for maximum and average pooling operations |
| snchwc.cpp | Updates conditional compilation to include ARM64 alongside AMD64 and LARCH64 |
| sconv.h | New header defining convolution kernel flags and calling conventions |
| platform.cpp | Registers new ARM64 kernels and sets NCHWc block size to 16 |
| mlasi.h | Adds function declarations and platform structure fields for ARM64 kernels |
| onnxruntime_mlas.cmake | Includes new source files in ARM64 build configuration |


@hariharans29
Member

I think there are some build failures because some fields in the MLAS platform struct are only made available on Linux but are referenced without proper ifdefs elsewhere. I get similar build breaks on Windows when I pull in this change and try it.

As an aside, is there any special intrinsic being used that works only on Linux? Can this be enabled on all platforms supporting NEON?

@hariharans29
Member

hariharans29 commented Sep 17, 2025

> Yup, neoverse-v2.
>
> I can benchmark on neoverse-v1 as well and post the numbers, but I don't have access to other SKUs like neoverse-n1

Thanks - please share them when you can!

Any recommendations on how to tune other related perf parameters (thread count, etc.) to maximize perf? Are there any input-shape-related nuances with this implementation, i.e., are there filter-size or batch-size dependencies that make these kernels faster or slower than the baseline?

The reason I ask: I have a Windows ARM64 machine (Qualcomm Snapdragon X Elite) and a Conv-heavy model given to me by a customer (unfortunately not shareable right now - I would need their permission), and I see a ~20% increase in forward-pass latency with this change included; I am trying to work out how to mitigate that. They reported a similar perf degradation in their environment as well (I have asked them what their SKU is), but I was able to repro the same on my Qualcomm device as a sanity check.

@Rohanjames1997
Contributor Author

Rohanjames1997 commented Sep 17, 2025

On a Neoverse-V1 instance, the peak performance is 450 inf/sec, i.e., about 10% less than on Neoverse-V2.
All my benchmarking has been done on Ubuntu 22.04 & 24.04.

To maximize perf, I have noticed that these NCHWc kernels scale well (almost linearly) with threads, up to 32 threads.

For reference, the mobilenet model I used was from ONNX/models, and this model does not accept batched inputs (tracking issue: onnx/models#680), but that is apparently normal for ONNX models in that repository.

That is interesting; I have not had a chance to benchmark on Windows. Would you be able to point me to a similar model in https://github.com/onnx/models? We can benchmark it using onnxruntime_perf_test -x <threads>.

@hariharans29
Member

> On a Neoverse-V1 instance, the peak performance is 450 inf/sec, i.e., about 10% less than on Neoverse-V2. All my benchmarking has been done on Ubuntu 22.04 & 24.04.
>
> To maximize perf, I have noticed that these NCHWc kernels scale well (almost linearly) with threads, up to 32 threads.
>
> For reference, the mobilenet model I used was from ONNX/models, and this model does not accept batched inputs (tracking issue: onnx/models#680), but that is apparently normal for ONNX models in that repository.
>
> That is interesting; I have not had a chance to benchmark on Windows. Would you be able to point me to a similar model in https://github.com/onnx/models? We can benchmark it using onnxruntime_perf_test -x <threads>.

Sure, let me get back to you on this one - I will either get a model from the zoo or a shareable version of the customer's model. Thanks for the support.

One last question - I am guessing the kernels are called on multiple threads to cover different filters and batched inputs - is that correct?

@Rohanjames1997
Contributor Author

Yes, that's right.

This is effectively an implementation of grouped convolution, where channels and filters are split into groups. What (I'm fairly sure) happens is that each thread works on a different channel group and filter group, which is why this algorithm is so conducive to threading.
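As a loose sketch of that partitioning idea (hypothetical names, not the PR's actual scheduling code, and shown with a serial stand-in for the thread pool):

```cpp
#include <cstddef>
#include <functional>

// Stand-in for a thread-pool dispatch; in ORT this would go through the
// intra-op thread pool. Kept serial here so the sketch is self-contained.
static void ParallelFor(size_t Count, const std::function<void(size_t)>& Task) {
    for (size_t i = 0; i < Count; ++i) {
        Task(i);
    }
}

// Hypothetical illustration: each work item covers one block of 16 output
// channels for one image in the batch, which is why the NCHWc path scales
// well as more threads become available.
void RunNchwcConv(size_t BatchSize, size_t OutputChannels,
                  const std::function<void(size_t /*image*/, size_t /*block*/)>& Kernel) {
    constexpr size_t BlockSize = 16;
    const size_t Blocks = (OutputChannels + BlockSize - 1) / BlockSize;

    ParallelFor(BatchSize * Blocks, [&](size_t Index) {
        const size_t n = Index / Blocks;       // which image in the batch
        const size_t block = Index % Blocks;   // which group of 16 filters
        Kernel(n, block);
    });
}
```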

@hariharans29
Member

hariharans29 commented Sep 17, 2025

> Yes, that's right.
>
> This is effectively an implementation of grouped convolution, where channels and filters are split into groups. What (I'm fairly sure) happens is that each thread works on a different channel group and filter group, which is why this algorithm is so conducive to threading.

Did you happen to see what the perf looked like when you didn't use -x 32, i.e., using the default thread count in the threadpool as-is on Neoverse-V2?

@Rohanjames1997
Contributor Author

Hmm, so the default value of -x is 0, which uses all cores in the system. I was benchmarking on a Neoverse-V2 system with 64 vCPUs, so I tested multiple configs: 2^0 to 2^6 threads.

I don't have the exact numbers now, but the NCHWc kernels outperformed the baseline when using >=4 threads. The baseline was only faster in the single-threaded case, and it scaled very poorly after 4 threads.

@hariharans29
Member

hariharans29 commented Sep 22, 2025

Hi @Rohanjames1997 -

Here is a shareable version of the model: https://drive.google.com/file/d/1dpzSVMvSFVbLodAE6aQYm_qAOgkmAtkZ/view?usp=drive_link. Let me know if you are able to access this.

I have tried to time the Conv node latencies with and without the NCHWc layout transformation by making this change in sequential_executor.cc (attached: sequential_executor_timing_updated.txt). I do find the Conv nodes are costlier than before the change. Would you be able to try it and confirm?

@Rohanjames1997
Contributor Author

Rohanjames1997 commented Sep 23, 2025

Hi, I don't have access to this model without requesting access. Can you make it a publicly available object?

Also, do you mind explaining the change you made in sequential_executor.cc? Is there a diff on one of your branches that I can look at to understand it?

Also, what machine & OS did you benchmark on?

Ty!

@hariharans29
Member

> Hi, I don't have access to this model without requesting access. Can you make it a publicly available object?
>
> Also, do you mind explaining the change you made in sequential_executor.cc? Is there a diff on one of your branches that I can look at to understand it?
>
> Also, what machine & OS did you benchmark on?
>
> Ty!

Thanks! I just made the link public.

The change in the sequential executor just times the call of an OpKernel's Compute() method - the MLAS routines you added will be called from an OpKernel instance, specifically from the Conv kernel's Compute() method (the one that begins with `const auto* X = context->Input<Tensor>(0);`).

As for the diff, it can be diffed with current main branch's version of the file. Actually, you don't really need that change if you already have a kernel profiling methodology.
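For reference, a minimal sketch of what such timing instrumentation around a kernel's Compute() call might look like (illustrative names and structure, not the actual ORT code in the attachment):

```cpp
#include <chrono>
#include <iostream>

// Wraps a kernel's Compute() call with wall-clock timing and prints the
// elapsed time per node; similar in spirit to the attached change.
template <typename KernelT, typename ContextT>
auto TimedCompute(const KernelT& kernel, ContextT* context, const char* node_name) {
    const auto start = std::chrono::steady_clock::now();
    auto status = kernel.Compute(context);
    const auto end = std::chrono::steady_clock::now();
    const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << node_name << " Compute() took " << us << " us\n";
    return status;
}
```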

I benchmarked on a Copilot+ PC Surface Laptop (Qualcomm Snapdragon X Elite) running Windows. The customer is running this on an m8g.xlarge (Graviton4).

@hariharans29
Member

hariharans29 commented Sep 25, 2025

Did you happen to check the performance of any model other than mobilenet? I see that mobilenet only invokes two of these code paths - pointwise and depthwise convolutions - whereas the sample model I pasted above uses those plus the "regular" NCHWc path, and that is the path that is costlier. Possibly that kernel has some perf deficiencies?

@Rohanjames1997
Contributor Author

Hi!

No, I couldn't benchmark it yet. I'll spend some time on it today.

Thanks for the note about the NCHWc path.

Do you happen to know whether it is still slower even when more threads are allocated?

@Rohanjames1997
Contributor Author

Here are the benchmark results using the shared model on a Graviton4 instance with Ubuntu 22.04.
Inference is faster at higher thread counts and slower at lower thread counts.

All numbers are inferences/sec

| threads | 1 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- |
| vanilla | 3.35551 | 9.88463 | 14.071 | 17.6474 | 18.954 | 19.1331 |
| nchwc | 1.89976 | 7.17679 | 13.0548 | 21.4388 | 29.3814 | 32.0701 |
| speedup | 0.566 | 0.726 | 0.928 | 1.215 | 1.550 | 1.676 |

@hariharans29 I can look more into optimizing the perf at lower thread counts, but it might be tricky.
Thinking out loud - do you know if we could have a heuristic that checks the thread pool size and routes inference to the more optimized path - vanilla or NCHWc - based on that?
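A very rough sketch of the kind of gate I have in mind (purely hypothetical; the names and the cut-off are invented for illustration, and the right threshold would likely be platform- and model-dependent):

```cpp
#include <cstddef>

// Hypothetical dispatch heuristic, not existing ORT/MLAS code: fall back to
// the regular im2col + GEMM path when few threads are available, since the
// NCHWc kernels only pull ahead at higher thread counts in the table above.
enum class ConvPath { Regular, Nchwc };

ConvPath ChooseConvPath(size_t thread_count) {
    // Cut-off loosely read off the benchmark table (NCHWc wins from ~16 threads).
    constexpr size_t kNchwcThreadThreshold = 16;
    return thread_count >= kNchwcThreadThreshold ? ConvPath::Nchwc
                                                 : ConvPath::Regular;
}
```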

@hariharans29
Member

Thanks @Rohanjames1997. I am still finding my way around the MLAS library as well, but I'd say being able to identify when to use the regular path (im2col + MlasGemm) vs. when to use NCHWc might be equally hard - both code-wise and in identifying the right cut-off points on different platforms. That would probably need some "online benchmarking", i.e., use the first run to exercise both paths and pick the fastest algorithm for subsequent runs. Even this would be a change in first-Run() behavior and needs more internal discussion.

Just a quick clarifying question - which Graviton-powered instance are you using that allows up to 64 threads? The customer is on an m8g.xlarge, which is a 4-core instance I believe. So given the table above, I think the lower thread count explains the regression?

@Rohanjames1997
Contributor Author

Rohanjames1997 commented Sep 25, 2025

That's true. I will also think more about the online benchmarking method and about heuristics.
I know PyTorch has heuristics for choosing the right backend for matmuls based on the sizes of the input matrices, so I'm drawing inspiration from there.

Yes, the lower thread count explains the regression. I've been running on a c8g.16xlarge, mainly to build ORT faster and to benchmark across a variety of thread counts. The number of vCPUs is (usually) 4n for a .nxlarge instance.

@hariharans29
Member

Thanks.

Was the benchmarking setup you used for the new model the same as the mobilenet one? The new model supports batching - were you able to batch more images?

The reason I ask is that there was a nice contribution that improves the thread utilization for batched inputs and group convolutions (regular path) here. Here is my "internal" copy of the same code. I had seen this a while ago, but it only struck me again last week. With increased thread counts, this code may close the gap between the "regular" path and the NCHWc variant.

@Rohanjames1997
Contributor Author

Rohanjames1997 commented Sep 25, 2025

Yes, it was the same setup.

How do I batch using onnxruntime_perf_test?
I looked at the help section, and I think the -f or -F flags may help, but I would need either the dimension_name or the dimension_denotation. By default, when specifying -I, "Free dimensions are treated as 1 unless overridden using -f."

That's an interesting PR, and you're right, it could close the gap!
One thing I don't know yet is which CPU platform that PR targets, or whether it is platform-agnostic. When I was developing, I noticed that on x86 MlasConv was not invoked by default; the NCHWc variant was always invoked.

One thing worth mentioning: even while running onnxruntime_perf_test with the regular conv (MlasConv), the specified number of cores is utilized at 100%, i.e., there is no "underutilization".

@hariharans29
Member

hariharans29 commented Sep 25, 2025

> Which platform does that PR target?

That PR should improve the "regular" Conv perf on all platforms; it is not platform-specific. It re-works the thread utilization for the "regular" Conv (im2col + MlasGemm). The MlasGemm implementation is platform-specific, of course.

> Why is NCHWc Conv invoked by default on x86?

It depends on the graph optimization level the user sets - by default it is ORT_ENABLE_ALL, which includes layout transformation optimizations. Keep in mind, though, that if there is a Conv that doesn't qualify for layout transformation, it will drop down to using the "regular" Conv.

See here

Prior to your PR, ARM64 was not NCHWc compliant, and hence it was always using the regular Conv path.
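For completeness, a minimal sketch of setting the optimization level explicitly through the C++ API (the model path is illustrative, and on Windows the path argument is a wide string):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "nchwc-demo");
    Ort::SessionOptions options;
    // ORT_ENABLE_ALL (the default) includes the layout transformations that
    // rewrite eligible Conv nodes into their NCHWc form when the platform
    // reports an NCHWc block size greater than 1.
    options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    Ort::Session session(env, "mobilenet.onnx", options);  // illustrative path
    return 0;
}
```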

> How do I batch using perf_test?

I think your test data, located in the same folder as the model, should contain batched data.

Here is some sample code to generate and serialize a batched numpy array using the onnx python package. Please make appropriate changes where needed:

```python
import numpy as np
from onnx.numpy_helper import from_array

# Batch size 2; adjust the remaining dims to match the model's input shape.
shape = (2, 384, 384, 3)
arr = np.zeros(shape, dtype=np.float32)

# Convert to an ONNX TensorProto and serialize it to a protobuf file.
onnx_tensor = from_array(arr)
serialized_tensor = onnx_tensor.SerializeToString()

with open("input_1.pb", "wb") as f:
    f.write(serialized_tensor)
```

You can drop the generated test data into the test-data sub-folder of the folder that hosts the model.

@Rohanjames1997
Contributor Author

Thanks for the explanation!

Yes, I was aware that MlasNchwcGetBlockSize controlled whether NCHWc was invoked or not.

On x86, since MlasNchwcGetBlockSize was always defined, can you give me an example of a "Conv that doesn't qualify for layout transformations"?

I see, thanks. I always generated testing data automatically. I'll give this a shot next week.

@hariharans29
Member

> On x86, since MlasNchwcGetBlockSize was always defined, can you give me an example of a "Conv that doesn't qualify for layout transformations"?

I am talking about cases like these, where we bail out while transforming a regular Conv into an NCHWc-compliant one. There are certain criteria to be met before we can use the NCHWc version - although I am not sure how often these criteria are not met in practice.

> I see, thanks. I always generated testing data automatically. I'll give this a shot next week.

No pressure - thanks for your contribution and your time. Since there is at least one known case where the regular Conv performs better than the NCHWc one, I am thinking it may be prudent to temporarily revert this change so that the main branch can go back to the last known stable state. What do you think?

@Rohanjames1997
Contributor Author

I see, thanks!

Regarding reverting the PR - it is totally up to you and the other maintainers.

NCHWc seems to underperform at lower thread counts and outperform at higher thread counts. Also, the peak achievable throughput is always higher using NCHWc kernels - this was seen on Mobilenet and the shared model.

I think adding a thread-count based heuristic would be the better approach, but if you do decide to revert the change, I respect that as well.

@hariharans29
Member

I think there are arguments to be made for both - reverting the change for now and keeping the change as is.

  1. The model transformations happen statically, i.e., the conversion from a regular Conv to an NCHWc Conv happens as the model loads, and we will need to think about the best way to use the threadpool information to decide whether to trigger the optimizer. This needs some design discussion.

  2. Not all client ARM devices may be able to increase their thread count to see the improvements.

  3. It would be interesting to see if the x86 NCHWc kernel hits the same issue, i.e., performs poorly at lower thread counts. If it doesn't, would its compute model be applicable to ARM64 as well?

Regarding reverting, I will discuss internally and get back. :)

hariharans29 added a commit that referenced this pull request Sep 25, 2025
@hariharans29
Member

Hi @Rohanjames1997 -

As a temporary measure, could you consider making this an optional feature (turned OFF by default initially)? Maybe following the model here: #25238? That way, users who would like to leverage this feature can do so by building from source.

CC: @edgchen1

@Rohanjames1997
Contributor Author

Sure thing, that makes sense.

I will probably only get to work on this during the week of Oct 6th, so until then I don't mind if it's reverted or kept as-is.

@hariharans29
Member

I just went ahead and added it: #26171.

I will test that it works as expected and merge it.

hariharans29 added a commit that referenced this pull request Sep 29, 2025
### Description
Add a build option for new kernels introduced in
#25580

### Motivation and Context
This enables building ORT with NCHWc ARM kernels.
At the time of writing, it is turned OFF by default because its performance
relative to "regular" NCHW kernels is not good at smaller thread counts, but
its speed-up is non-negligible at higher thread counts on supported ARM
platforms. Once the gap is closed for smaller thread counts, it can be turned
on by default.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
fs-eire pushed a commit that referenced this pull request Oct 24, 2025
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025