Add build option for ARM NCHWc kernels #26171

hariharans29 · 2025-09-26T04:09:38Z

Description

Add a build option for new kernels introduced in #25580

Motivation and Context

This enables building ORT with the NCHWc ARM kernels.
At the time of writing, the option is OFF by default because the NCHWc kernels underperform the "regular" NCHW kernels
at smaller thread counts, although the speed-up is non-negligible at higher thread counts on supporting
ARM platforms.
Once the gap is closed for smaller thread counts, the option can be turned on by default.

tools/ci_build/build_args.py

github-actions

You can commit the suggested changes from lintrunner.

tools/ci_build/build_args.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

.github/workflows/android.yml

.github/workflows/macos-ci-build-and-test-workflow.yml

tools/ci_build/build_args.py

onnxruntime/core/mlas/lib/sconv_kernel_neon.cpp

Added a comment regarding the performance of NCHWc ARM kernels and their default state.

damdoo01-arm · 2025-09-30T19:47:28Z

Hi all, I have noticed a unit test failure, likely associated with this PR, when using KleidiAI on a Mac M4. Interestingly, the failure only happens when the test is run as part of the full onnxruntime_test_all suite; run in isolation, it passes. This points to a potential variable that has not been reset.

Unit Test Name: NchwcOptimizerTests.ConvNoBiasAddFusion

Reproduce instructions:
On Mac.
./build.sh --config Release --cmake_generator Ninja --apple_sysroot macosx --osx_arch arm64 --apple_deploy_target 15 --cmake_extra_defines onnxruntime_USE_KLEIDIAI=ON onnxruntime_USE_ARM_NEON_NCHWC=ON

./onnxruntime_test_all - Shows test failure
./onnxruntime_test_all --gtest_filter NchwcOptimizerTests.ConvNoBiasAddFusion - Test Passes
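
The pass-in-isolation, fail-in-suite pattern described above is typical of state that leaks from one test into the next. As a purely illustrative sketch in Python (this is not ONNX Runtime code; `_cached_backend`, `select_backend`, and both test functions are hypothetical names invented for the example), a process-wide cache filled by an earlier test can change what a later test observes:

```python
# Hypothetical illustration of an order-dependent test failure caused by
# process-wide state that is never reset. None of these names exist in ORT.

_cached_backend = None  # module-level cache shared by every test in the process


def select_backend(prefer_accelerated: bool) -> str:
    """Pick a backend once and cache the choice for the process lifetime."""
    global _cached_backend
    if _cached_backend is None:
        _cached_backend = "accelerated" if prefer_accelerated else "reference"
    return _cached_backend


def test_accelerated_path() -> bool:
    # Populates the cache with "accelerated" as a side effect.
    return select_backend(prefer_accelerated=True) == "accelerated"


def test_reference_path() -> bool:
    # Expects the reference backend, but receives whatever was cached earlier.
    return select_backend(prefer_accelerated=False) == "reference"


def run(tests) -> list:
    """Run tests in order inside one process, without resetting the cache."""
    return [t() for t in tests]


def reset_state() -> None:
    """Simulate starting a fresh process by clearing the cache."""
    global _cached_backend
    _cached_backend = None
```

In this toy setup `test_reference_path` passes when run alone and fails when it runs after `test_accelerated_path`. Whether the failure reported here has this shape is not established in this thread.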

hariharans29 · 2025-09-30T20:44:41Z

Thanks @damdoo01-arm !

Hi @Rohanjames1997 - Could you please take a look when you get a chance ? Our partners from ARM recently encountered the above test failure that seems to originate from the NCHWc ARM64 support (#25580). Thanks!

Rohanjames1997 · 2025-09-30T21:28:48Z

Hi @damdoo01-arm , thanks for reporting!

I tried reproducing it, but I don't have the same setup. So a few questions:

1. Is this reproducible on a Linux build as well? (I don't have a Mac to test on.)
2. Is this reproducible with onnxruntime_USE_KLEIDIAI=OFF?
3. Is this reproducible when you run ./build/Linux/Release/onnxruntime_test_all --gtest_filter Nchwc* ? I ask because my GraphTransformationTests* fail, since I don't have the .onnx models they require, so I can't even reach the NchwcOptimizerTests.

Running with --gtest_filter Nchwc* passes all tests on my C8g EC2 instance with Ubuntu 22.04. I built ORT using ./build.sh --config=Release --build_shared_lib --parallel --skip_tests --cmake_extra_defines onnxruntime_USE_ARM_NEON_NCHWC=ON (no KleidiAI).

Also, any idea why the CI did not catch this? @hariharans29 🤔

aviralagrawal · 2025-09-30T22:17:17Z

@hariharans29, please ensure you update the Readme or other documentation so that it is clear to all how to enable this amazing feature. Thanks!

hariharans29 · 2025-10-01T00:40:08Z

4. onnxruntime_USE_ARM_NEON_NCHWC

Hi @Rohanjames1997 - If I were to take an educated guess, I think this will only repro on a machine that supports SME2 (Mac M4), not on just any build with KleidiAI enabled. This is the PR that introduced the KleidiAI SME2 Conv kernels for ARM64: https://github.com/microsoft/onnxruntime/pull/25187/files#diff-ae80f8c17f8c3c31a01bff6f1058df55c4287ce3f6741a4bb73df3a24253b7c0. Perhaps there is an edge case to be accounted for somewhere at the boundary of the two PRs. Unfortunately, that is all I can think of right now.
@damdoo01-arm - I think @Rohanjames1997's question "does this repro with USE_KLEIDIAI = OFF" makes sense in this context. Can you please confirm? That may help @Rohanjames1997 narrow it down further. I don't think Rohan has access to an SME2-based machine (M4?) right now, and it will take me some time to get one and debug. So any stack trace or clue as to where the test crashes would help.

any idea why the CI did not catch this?
@Rohanjames1997 - Unfortunately, I don't think we have SME2-enabled machines in CI yet. Sorry about that. Such a machine probably would have caught this.

hariharans29 · 2025-10-01T00:48:17Z

@hariharans29, please ensure you update the Readme or other documentation so that it is clear to all how to enable this amazing feature. Thanks!

We will document it and announce it in the next release. For now, enabling it is as simple as using the build flag in this PR to build the feature from main.
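
Until that documentation exists, the build invocations quoted earlier in this thread can be assembled as in the sketch below. It only reuses flags that appear in this conversation (`--cmake_extra_defines`, `onnxruntime_USE_ARM_NEON_NCHWC`, `onnxruntime_USE_KLEIDIAI`); the helper name `ort_build_command` is hypothetical, and the dedicated option this PR adds to tools/ci_build/build_args.py is deliberately not shown because its name is not quoted here.

```python
import shlex


def ort_build_command(use_nchwc: bool = True, use_kleidiai: bool = False) -> list:
    """Assemble a build.sh command line from flags quoted in this thread.

    Hypothetical helper for illustration; it does not run the build.
    """
    defines = [
        f"onnxruntime_USE_ARM_NEON_NCHWC={'ON' if use_nchwc else 'OFF'}",
        f"onnxruntime_USE_KLEIDIAI={'ON' if use_kleidiai else 'OFF'}",
    ]
    return [
        "./build.sh",
        "--config", "Release",
        "--build_shared_lib",
        "--parallel",
        "--skip_tests",
        "--cmake_extra_defines", *defines,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line instead of executing it.
    print(shlex.join(ort_build_command()))
```

Toggling `use_kleidiai` gives the two configurations being compared in this discussion without retyping the rest of the command.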

damdoo01-arm · 2025-10-01T09:26:22Z

Hi @Rohanjames1997,
Thanks for the response. I can confirm that the tests pass when KleidiAI is switched off. I can also confirm that running ./onnxruntime_test_all --gtest_filter NchwcOptimizerTests* as a suite DOES cause the test to fail, which likely means the spillover is contained within that test suite.
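
Since the spillover appears to be contained in one suite, one way to narrow it down is to pair each test in that suite with the failing test and run the pairs one at a time. The sketch below is an assumption-laden helper, not part of ORT: it parses the text printed by GoogleTest's `--gtest_list_tests` and emits colon-separated `--gtest_filter` values. The function names and the test name `HypotheticalTestA` are made up for illustration. As far as I know, GoogleTest runs the selected tests in registration order unless shuffling is requested, so only tests registered before the failing one can influence it.

```python
def parse_gtest_list(listing: str) -> list:
    """Turn `--gtest_list_tests` output into fully qualified test names."""
    names, suite = [], None
    for raw in listing.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing parameter comments
        if not line.strip():
            continue
        if not line.startswith(" "):
            suite = line.strip()  # suite lines look like "SuiteName."
        elif suite is not None:
            names.append(suite + line.strip())  # indented lines are test names
    return names


def pairwise_filters(tests, target: str) -> list:
    """One --gtest_filter value per candidate: the candidate plus the target."""
    return [f"{t}:{target}" for t in tests if t != target]


if __name__ == "__main__":
    # Sample listing; HypotheticalTestA is a placeholder, not a real ORT test.
    sample = "NchwcOptimizerTests.\n  HypotheticalTestA\n  ConvNoBiasAddFusion\n"
    target = "NchwcOptimizerTests.ConvNoBiasAddFusion"
    for flt in pairwise_filters(parse_gtest_list(sample), target):
        print(f"./onnxruntime_test_all --gtest_filter={flt}")
```

Any pair whose run reproduces the failure points at the test that leaves the problematic state behind.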

Rohanjames1997 · 2025-10-06T15:45:43Z

Thanks @damdoo01-arm ,

Is the test failing only on an SME2-capable machine, as @hariharans29 suggested? I couldn't reproduce this on a Neoverse-V1 or a V2 machine.

damdoo01-arm · 2025-10-14T15:47:54Z

Apologies for the delay, @Rohanjames1997. Since I have an M4, I can attempt to diagnose and fix it. I'll post here with any updates. Damien.

hariharans29 added 4 commits September 25, 2025 19:53

Add build option for ARM NCHWc kernels

e28e1b6

Fix

45f850c

Add message in cmake

ac6cafb

Enable NCHWc ARM kernels on Mac OS alone

513234a

hariharans29 mentioned this pull request Sep 26, 2025

NEON kernels for NCHWc Convolution and Pooling #25580

Merged

github-advanced-security bot found potential problems Sep 26, 2025

tools/ci_build/build_args.py (fixed)

tools/ci_build/build_args.py (fixed)

github-actions bot reviewed Sep 26, 2025

tools/ci_build/build_args.py (outdated, resolved)

tools/ci_build/build_args.py (outdated, resolved)

hariharans29 closed this Sep 26, 2025

hariharans29 reopened this Sep 26, 2025

hariharans29 and others added 2 commits September 26, 2025 17:41

Update tools/ci_build/build_args.py

68932da

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Update tools/ci_build/build_args.py

3029cb4

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

hariharans29 commented Sep 27, 2025

.github/workflows/android.yml (resolved)

hariharans29 commented Sep 27, 2025

.github/workflows/macos-ci-build-and-test-workflow.yml (outdated, resolved)

hariharans29 added 2 commits September 26, 2025 21:33

Reflect Neon

98f5d83

Resolve conflicts

d505f70

github-actions bot reviewed Sep 27, 2025

tools/ci_build/build_args.py (outdated, resolved)

Update tools/ci_build/build_args.py

401bf94

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

edgchen1 previously approved these changes Sep 29, 2025

tools/ci_build/build_args.py (resolved)

onnxruntime/core/mlas/lib/sconv_kernel_neon.cpp (resolved)

Add comment about NCHWc ARM kernels performance

7a59bd6

Added a comment regarding the performance of NCHWc ARM kernels and their default state.

hariharans29 dismissed edgchen1’s stale review via 7a59bd6 September 29, 2025 17:49

hariharans29 changed the title from "WIP: Add build option for ARM NCHWc kernels" to "Add build option for ARM NCHWc kernels" Sep 29, 2025

edgchen1 approved these changes Sep 29, 2025

hariharans29 merged commit 04386c9 into main Sep 29, 2025
92 checks passed

hariharans29 deleted the hari/mlas_fix_3 branch September 29, 2025 21:40

Rohanjames1997 mentioned this pull request Dec 10, 2025

[MLAS] Simplify & optimize Arm64 NCHWc Convolution kernels #26691

Merged

6 participants
