NEON kernels for NCHWc Convolution and Pooling #25580
Conversation
@microsoft-github-policy-service agree
@skottmckay @snnn @yufenglee, appreciate it if the CI can be triggered. Thanks!
### Description
The [vfmaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vfmaq_f32) intrinsic compiles to the [FMLA](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en) instruction, which is more performant than the separate `fmul`+`fadd` instructions that [vmlaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlaq_f32) compiles to on the latest GCC versions: https://godbolt.org/z/aYc9as5Wh

Note that this is not a breaking change, as `vmlaq_f32` already compiles to FMLA instructions on the latest clang compilers (which are already the default for macOS ORT builds).

### Motivation and Context
With this change, the NEON version of `MlasMultiplyAddFloat32x4` achieves parity with the x86 version that uses `_mm_fmadd_ps`. It also achieves up to ~15% speedups compared to the current `vmlaq_f32` implementation when tested on top of #25580.
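For reference, a minimal sketch of the intrinsic swap being described, assuming a helper shaped like `MlasMultiplyAddFloat32x4` (the function names and signatures below are illustrative, not the exact MLAS source):

```cpp
#include <arm_neon.h>

// Before: vmlaq_f32 may lower to separate fmul + fadd instructions on some GCC versions.
inline float32x4_t MultiplyAddBefore(float32x4_t v1, float32x4_t v2, float32x4_t addend) {
    return vmlaq_f32(addend, v1, v2);  // addend + (v1 * v2), possibly unfused
}

// After: vfmaq_f32 maps to a single fused FMLA instruction.
inline float32x4_t MultiplyAddAfter(float32x4_t v1, float32x4_t v2, float32x4_t addend) {
    return vfmaq_f32(addend, v1, v2);  // fused addend + (v1 * v2)
}
```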
Pull Request Overview
This PR implements optimized Arm NEON kernels for NCHWc (channels-last with channel blocking) convolution and pooling operations in MLAS, targeting significant performance improvements on Arm64 platforms. The implementation demonstrates a 5-6x performance improvement in real workloads.
- Implements NEON-optimized convolution kernels supporting NCHWc and NCHW formats, including pointwise and depthwise variants
- Adds NEON-optimized pooling kernels for maximum and average pooling (both include/exclude padding modes)
- Enables ARM64 support for NCHWc kernels by updating conditional compilation and platform initialization
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| sconv_kernel_neon.cpp | New NEON convolution kernels with template-based implementation for different formats |
| spool_kernel_neon.cpp | New NEON pooling kernels for maximum and average pooling operations |
| snchwc.cpp | Updates conditional compilation to include ARM64 alongside AMD64 and LARCH64 |
| sconv.h | New header defining convolution kernel flags and calling conventions |
| platform.cpp | Registers new ARM64 kernels and sets NCHWc block size to 16 |
| mlasi.h | Adds function declarations and platform structure fields for ARM64 kernels |
| onnxruntime_mlas.cmake | Includes new source files in ARM64 build configuration |
I think there are some build failures because some fields in the MLAS struct are only made available on Linux, but they are referenced without proper ifdefs elsewhere. I get similar build breaks on Windows when I pull this change and try it. As an aside, is there any special intrinsic being used that works only on Linux? Can this be enabled on all platforms supporting NEON?
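For illustration, the kind of mismatch being described might look like the following (the struct, field, and macro usage here are hypothetical, not the actual MLAS declarations):

```cpp
// Hypothetical example: a platform-struct field declared under one preprocessor
// condition but referenced elsewhere without the same guard breaks the build on
// platforms (e.g. Windows ARM64) where the field does not exist.
struct ExamplePlatform {
#if defined(MLAS_TARGET_ARM64) && defined(__linux__)
    void* NchwcConvKernel;  // declared only for Linux ARM64 builds...
#endif
};

void InitExamplePlatform(ExamplePlatform& platform) {
#if defined(MLAS_TARGET_ARM64) && defined(__linux__)  // ...so every reference needs the matching guard
    platform.NchwcConvKernel = nullptr;
#endif
}
```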
Thanks - please share them when you can! Any recommendations on how to tune other related perf parameters (like thread count, etc.) to maximize perf? The reason I ask is: I have a Windows ARM64 machine (Qualcomm Snapdragon X Elite) and a Conv-heavy model given to me by a customer (unfortunately not shareable right now - I would need their permission), and I see a ~20% increase in forward-pass latency with this change included; I am trying to work out how to mitigate that. They reported a similar perf degradation in their environment as well (I have asked them what their SKU is), but I am able to repro the same with my Qualcomm device, which I tried as a sanity check.
On a Neoverse-V1 instance, the peak performance is 450 inf/sec; so, 10% less than on Neoverse-V2. To maximize perf, I have noticed that these NCHWc kernels scale well (almost linearly) with threads, up to 32 threads. For reference, the mobilenet model I used was from onnx/models, and this model does not accept batched inputs (tracking issue: onnx/models#680), but that is apparently normal for ONNX models in that repository. That is interesting; I have not had a chance to benchmark on Windows. Would you be able to find/point me to a similar model in https://github.com/onnx/models? We can benchmark it using `perf_test`.
Sure, let me get back to you on this one - I will either get a model from the zoo or a shareable version of the customer's model. Thanks for the support. One last question - I am guessing the kernels are called on multiple threads to cover different filters and batched inputs - is that correct?
Yes, that's right. This is effectively an implementation of grouped convolution, where channels and filters are split into groups. So (I'm pretty sure) what happens is that each thread works on a different channel-group and filter-group, and that is why this algorithm is conducive to threading.
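A rough sketch of that partitioning idea (this is not the actual MLAS threading code, just an illustration of why the blocks can be processed independently):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each output-channel block is independent, so threads can each own a disjoint
// set of blocks with no synchronization on the output tensor.
void ConvOverChannelBlocks(size_t num_blocks, size_t num_threads,
                           void (*compute_block)(size_t block_index)) {
    std::vector<std::thread> workers;
    for (size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([=] {
            // Static partition: thread t handles blocks t, t + num_threads, t + 2*num_threads, ...
            // Each block writes a disjoint slab of the output.
            for (size_t b = t; b < num_blocks; b += num_threads) {
                compute_block(b);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}
```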
Did you happen to see what the perf looked like when you didn't use ...?
Hmm, so the default value of ... - I don't have the exact numbers now, but the NCHWc kernels outperformed the baseline when using >=4 threads. The baseline was only faster in the single-threaded case, and it scaled very poorly after 4 threads.
Hi @Rohanjames1997 - Here is a shareable version of the model: https://drive.google.com/file/d/1dpzSVMvSFVbLodAE6aQYm_qAOgkmAtkZ/view?usp=drive_link. Let me know if you are able to access this. I have tried to time the Conv node latencies with and without the NCHWc layout transformation by making this change in the sequential_executor.cc file (attached).
Hi, I don't have access to this model without requesting it. Can you make it publicly available? Also, do you mind explaining to me the change you made in sequential_executor.cc? And what machine & OS did you benchmark on? Ty!
Thanks! I just made the link public. The change in the sequential executor just times the call of an OpKernel's `Compute()`. As for the diff, it can be diffed against the current main branch's version of the file. Actually, you don't really need that change if you already have a kernel profiling methodology. I benchmarked on a Copilot+ PC Surface Laptop (Qualcomm Snapdragon X Elite, Windows). The customer is running this on an m8g.xlarge (Graviton4).
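For anyone following along, an illustrative version of that kind of timing instrumentation might look like the following (not the actual diff; the helper name and structure are made up):

```cpp
#include <chrono>
#include <iostream>
#include <string>

// Wrap a kernel invocation with a steady_clock timer and print the per-node latency.
template <typename ComputeFn>
void TimedCompute(const std::string& node_name, ComputeFn&& compute) {
    const auto start = std::chrono::steady_clock::now();
    compute();  // e.g. the kernel's Compute(context) call inside the executor loop
    const auto end = std::chrono::steady_clock::now();
    const auto micros =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << node_name << " took " << micros << " us\n";
}
```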
Did you happen to check the performance of any model other than mobilenet? I see that mobilenet only invokes these 2 code paths - pointwise and depthwise convolutions - whereas the sample model I pasted above uses those and the "regular" NCHWc path. I see that that is the costlier path. Possibly that kernel has some perf deficiencies?
Hi! No, I couldn't benchmark it yet. I'll spend some time on it today. Thanks for the note about the NCHWc path. Do you happen to know if it is also deficient when allocating more threads?
Here are the benchmark results using the shared model on Graviton4 & Ubuntu 22. All numbers are inferences/sec.
@hariharans29 I can look more into optimizing the perf at lower thread counts. But it might be tricky. |
Thanks @Rohanjames1997. I am figuring my way around the MLAS library as well - but I'd say being able to identify when to use the regular path (im2col + MlasGemm) vs. when to use NCHWc might be equally hard, both code-wise and in identifying the right cut-off points on different platforms. That would probably need some "online benchmarking", i.e., use the first run to run both paths and pick the fastest algo for subsequent runs. Even this would be a change in first-Run() behavior and needs more internal discussion. Just a quick clarifying question - which Graviton-powered instance are you using that allows up to 64 threads? The customer is on an m8g.xlarge, which is a 4-core instance I believe. So from the table above, given the regression at lower thread counts, I think that explains the regression?
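To make the "online benchmarking" idea concrete, here is a rough sketch (purely illustrative, not an existing ORT mechanism, and the class and method names are invented): time both paths on the first Run() and stick with the winner afterwards.

```cpp
#include <chrono>
#include <functional>

// Time both Conv paths once on the first Run(), then reuse the faster one afterwards.
class ConvAlgoSelector {
 public:
  void Run(const std::function<void()>& regular_conv,
           const std::function<void()>& nchwc_conv) {
    if (!decided_) {
      const double t_regular = Time(regular_conv);
      const double t_nchwc = Time(nchwc_conv);
      use_nchwc_ = t_nchwc < t_regular;
      decided_ = true;
      return;  // the first run executed both paths while benchmarking them
    }
    if (use_nchwc_) {
      nchwc_conv();
    } else {
      regular_conv();
    }
  }

 private:
  static double Time(const std::function<void()>& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
  }

  bool decided_ = false;
  bool use_nchwc_ = false;
};
```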
That's true. I'll also think more about the online benchmarking method and about heuristics. Yes, the lower thread count explains the regression. I've been running on a c8g.16xlarge, mainly to build ORT faster and to benchmark on a variety of thread counts. The number of vCPUs is (usually) ...
Thanks. Was the benchmarking setup you used for the new model the same as the mobilenet one? The new model supports batching - were you able to batch more images? The reason I ask is: there was a nice contribution that improves the thread utilization for batched inputs and group convolutions (regular path) here. Here is my "internal" copy of the same code. I had seen this a while ago, but it only just struck me last week. With the increased thread counts, this code may close the gap between the "regular" and the NCHWc variant.
Yes, it was the same setup. How do I batch using `perf_test`? That's an interesting PR, and you're right, it could close the gap! One thing worth mentioning is that even while running ...
> Which platform does that PR target?

That PR should improve the "regular" Conv perf for all platforms - it is not platform-specific. It reworks the thread utilization for the "regular" Conv (Im2Col + MlasGemm). The MlasGemm implementation is platform-specific, of course.

> Why is NCHWc Conv invoked by default on x86?

It depends on the graph optimization level the user sets - by default it is ORT_ENABLE_ALL, which includes layout transformation optimizations. Keep in mind, though, that if there is a Conv that doesn't qualify for layout transformations, it will drop down to using the "regular" Conv - see here. Prior to your PR, ARM64 was not NCHWc compliant and hence it was always using the regular Conv path.

> How do I batch using perf_test?

I think your test data located in the same folder as the model should contain batched data. Here is some sample code to generate and serialize a batched numpy array using the onnx python package. Please make appropriate changes where needed:

```python
import numpy as np
from onnx import numpy_helper

shape = (2, 384, 384, 3)  # Batch size 2
data = np.random.rand(*shape).astype(np.float32)  # random batched input; replace with real data if needed

# NOTE: the TensorProto creation/serialization lines below are a reconstruction of the
# partial snippet; the tensor name should match the model's input name.
tensor = numpy_helper.from_array(data, name="input_1")

with open("input_1.pb", "wb") as f:
    f.write(tensor.SerializeToString())  # serialize the TensorProto to disk
```

You can drop the generated test data into the sub-folder containing the test data of the folder that hosts the model.
Thanks for the explanation! Yes, I was aware that ...

On x86, since MlasNchwcGetBlockSize was always defined, can you give me an example of a "Conv that doesn't qualify for layout transformations"?

I see, thanks. I always generated testing data automatically. I'll give this a shot next week.
> On x86, since MlasNchwcGetBlockSize was always defined, can you give me an example of a "Conv that doesn't qualify for layout transformations"?

I am talking about cases like these where we bail out while transforming the regular Conv into an NCHWc-compliant one. There are certain criteria to be met before we can use the NCHWc version - although I am not sure how often, in practice, these criteria are not met.

> I see, thanks. I always generated testing data automatically. I'll give this a shot next week.

No pressure - thanks for your contribution and your time. Since there is at least one known case where the regular Conv performs better than the NCHWc one, I am thinking it may be prudent to temporarily revert this change so that we can let the main branch go back to the last known stable state. What do you think?
I see, thanks! Regarding reverting the PR - it is totally up to you and the other maintainers. NCHWc seems to underperform at lower thread counts and outperform at higher thread counts. Also, the peak achievable throughput is always higher with the NCHWc kernels - this was seen on both Mobilenet and the shared model. I think adding a thread-count-based heuristic would be the better approach, but if you do decide to revert the change, I respect that as well.
I think there are arguments to be made for both - reverting the change for now and keeping the change as is.
As regards reverting, I will discuss internally and get back. :)
This reverts commit 2d2a3e5.
Hi @Rohanjames1997 - As a temporary measure, could you consider making this an optional feature (turned OFF by default initially)? Maybe following the model here - #25238? That way, users who would like to leverage this feature get to do so by building from source. CC: @edgchen1
Sure thing, that makes sense. I will probably only get to work on this in the week of Oct 6th, so until then, I don't mind if it's reverted/kept as is.
I just went ahead and added it: #26171. I will test that it works as expected and merge it. |
### Description
Add a build option for new kernels introduced in #25580

### Motivation and Context
This enables building ORT with NCHWc ARM kernels. At the time of writing, it is turned OFF by default because its performance relative to the "regular" NCHW kernels is not good at smaller thread counts, but its speed-up is non-negligible at higher thread counts on supporting ARM platforms. Once the gap is closed for smaller thread counts, it can be turned on by default.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Description
This PR implements optimized Arm NEON kernels for NCHWc (channels-last with channel blocking) convolution and pooling operations in MLAS, significantly improving performance on Arm64 platforms.
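For readers unfamiliar with the layout, here is an illustrative sketch of the blocked-channel indexing that NCHWc implies, using the block size of 16 set for ARM64 in this PR (the function and constant names are illustrative, not taken from the MLAS sources):

```cpp
#include <cstddef>

// NCHWc channel block size chosen for ARM64 in this PR.
constexpr size_t kBlockSize = 16;

// Flattened offset of element (n, c, h, w) in an NCHWc tensor:
// channels are grouped into blocks of kBlockSize, and the block-local channel
// index becomes the innermost (contiguous) dimension, which is what lets the
// kernels load/store full NEON vectors of adjacent channels.
size_t NchwcOffset(size_t n, size_t c, size_t h, size_t w,
                   size_t channels, size_t height, size_t width) {
    const size_t channel_blocks = (channels + kBlockSize - 1) / kBlockSize;
    const size_t block = c / kBlockSize;
    const size_t within_block = c % kBlockSize;
    return (((n * channel_blocks + block) * height + h) * width + w) * kBlockSize + within_block;
}
```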
Motivation and Context
Fixes #24790
The new NCHWc kernels improve performance by 5-6x, depending on the configuration of threads, model, etc.
For example, here is the performance gain observed during mobilenet inference - focus on the "Number of inferences per second" metric (93 inf/s -> 498 inf/s).
System configuration
Perf with current upstream kernels
Perf with NCHWc kernels
Happy to run further performance tests as required.