Skip to content

Conversation

@hariharans29
Copy link
Member

Description

Add a build option for new kernels introduced in #25580

Motivation and Context

This enables building ORT with NCHWc ARM kernels.
At the time of writing, it is turned OFF by default because its performance relative to "regular" NCHW kernels
is not good at smaller thread counts. But its speed-up is non-negligible with higher thread counts on supporting
ARM platforms.
Once the gap is closed for smaller thread counts, it can be turned on by default.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

hariharans29 and others added 2 commits September 26, 2025 17:41
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
edgchen1
edgchen1 previously approved these changes Sep 29, 2025
Added a comment regarding the performance of NCHWc ARM kernels and their default state.
@hariharans29 hariharans29 changed the title WIP: Add build option for ARM NCHWc kernels Add build option for ARM NCHWc kernels Sep 29, 2025
@hariharans29 hariharans29 merged commit 04386c9 into main Sep 29, 2025
92 checks passed
@hariharans29 hariharans29 deleted the hari/mlas_fix_3 branch September 29, 2025 21:40
@damdoo01-arm
Copy link
Contributor

Hi all, I have noticed a unit test failure likely associated with this PR using KleidiAI on the Mac M4. Interestingly the failure only happens when the test is run as part of the full onnxruntime_test_all suite. Yet if you run it in isolation it passes. This points to a potential variable that has not been reset.

Unit Test Name: NchwcOptimizerTests.ConvNoBiasAddFusion

Reproduce instructions:
On Mac.
./build.sh --config Release --cmake_generator Ninja --apple_sysroot macosx --osx_arch arm64 --apple_deploy_target 15 --cmake_extra_defines onnxruntime_USE_KLEIDIAI=ON onnxruntime_USE_ARM_NEON_NCHWC=ON

./onnxruntime_test_all - Shows test failure
./onnxruntime_test_all --gtest_filter NchwcOptimizerTests.ConvNoBiasAddFusion - Test Passes

@hariharans29
Copy link
Member Author

hariharans29 commented Sep 30, 2025

Hi all, I have noticed a unit test failure likely associated with this PR using KleidiAI on the Mac M4. Interestingly the failure only happens when the test is run as part of the full onnxruntime_test_all suite. Yet if you run it in isolation it passes. This points to a potential variable that has not been reset.

Unit Test Name: NchwcOptimizerTests.ConvNoBiasAddFusion

Reproduce instructions: On Mac. ./build.sh --config Release --cmake_generator Ninja --apple_sysroot macosx --osx_arch arm64 --apple_deploy_target 15 --cmake_extra_defines onnxruntime_USE_KLEIDIAI=ON onnxruntime_USE_ARM_NEON_NCHWC=ON

./onnxruntime_test_all - Shows test failure ./onnxruntime_test_all --gtest_filter NchwcOptimizerTests.ConvNoBiasAddFusion - Test Passes

Thanks @damdoo01-arm !

Hi @Rohanjames1997 - Could you please take a look when you get a chance ? Our partners from ARM recently encountered the above test failure that seems to originate from the NCHWc ARM64 support (#25580). Thanks!

@Rohanjames1997
Copy link
Contributor

Hi @damdoo01-arm , thanks for reporting!

I tried reproducing it, but I don't have the same setup. So a few questions:

  1. Is this reproducible on a linux build as well? (I don't have a Mac to test on)
  2. Is this reproducible with onnxruntime_USE_KLEIDIAI=OFF?
  3. Is this reproducible when you run ./build/Linux/Release/onnxruntime_test_all --gtest_filter Nchwc* ? Asking because my GraphTransformationTests* fail as I don't have the .onnx models required for it, so I can't even reach the NchwcOptimizerTests.
  4. Running with --gtest_filter Nchwc* passes all tests on my C8g EC2 instance with Ubuntu 22.04. I built ORT using ./build.sh --config=Release --build_shared_lib --parallel --skip_tests --cmake_extra_defines onnxruntime_USE_ARM_NEON_NCHWC=ON (no kledi)

Also, any idea why the CI did not catch this? @hariharans29🤔

@aviralagrawal
Copy link

aviralagrawal commented Sep 30, 2025

@hariharans29, please ensure you update the Readme or other documentation so that it is clear to all how to enable this amazing feature. Thanks!

@hariharans29
Copy link
Member Author

hariharans29 commented Oct 1, 2025

4. onnxruntime_USE_ARM_NEON_NCHWC

Hi @Rohanjames1997 - If I were to take an educated guess, I think this will only repro on a machine that has SME2 supported (Mac M4) not just on a build with KleidiAI is enabled. This is the PR that introduced KleidiAI SME2 Conv kernels for ARM64 - https://github.com/microsoft/onnxruntime/pull/25187/files#diff-ae80f8c17f8c3c31a01bff6f1058df55c4287ce3f6741a4bb73df3a24253b7c0. Perhaps, there is an edge case to be accounted for somewhere at the boundary of the 2 PRs. Unfortunately, that is all I can think of right now.
@damdoo01-arm - I think @Rohanjames1997 's question "does this repro with USE_KLEIDIAI = OFF" makes sense in this context. Can you please confirm ? That may help @Rohanjames1997 narrow it down further. I don't think Rohan has access to an SME2 based machine (M4?) right now and it will take me some time to get one and debug. So, any stack trace/clue as to where the test crashes would help.

any idea why the CI did not catch this?
@Rohanjames1997 - Unfortunately, I don't think we have SME2 enabled machines in CI yet. Sorry about that. That probably would have caught this.

@hariharans29
Copy link
Member Author

hariharans29 commented Oct 1, 2025

@hariharans29, please ensure you update the Readme or other documentation so that it is clear to all how to enable this amazing feature. Thanks!

We will document it and announce it in the next release, for now enabling it is as simple as using the build flag in this PR to build the feature from main

@damdoo01-arm
Copy link
Contributor

Hi @Rohanjames1997,
Thanks for the response. I can confirm the tests are indeed passing when KleidiAI is switched off. I can also confirm that ./onnxruntime_test_all --gtest_filter NchwcOptimizerTests* as a suite DOES cause the test to fail which likely means the spillover is contained within that test suite.

@Rohanjames1997
Copy link
Contributor

Rohanjames1997 commented Oct 6, 2025

Thanks @damdoo01-arm ,

Is the test failing only on a SME2-supported machine like @hariharans29 suggested? I couldn't reproduce this on a Neoverse-V1 or a V2 machine.

@damdoo01-arm
Copy link
Contributor

Apologies for the delay @Rohanjames1997, since I have an M4, I can attempt to diagnose and attempt to solve it, I'll post here with any updates, Damien.

fs-eire pushed a commit that referenced this pull request Oct 24, 2025
### Description
Add a build option for new kernels introduced in
#25580

### Motivation and Context
This enables building ORT with NCHWc ARM kernels.
At the time of writing, it is turned OFF by default because its
performance relative to "regular" NCHW kernels
is not good at smaller thread counts. But its speed-up is non-negligible
with higher thread counts on supporting
ARM platforms.
Once the gap is closed for smaller thread counts, it can be turned on by
default.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025
### Description
Add a build option for new kernels introduced in
microsoft#25580

### Motivation and Context
This enables building ORT with NCHWc ARM kernels.
At the time of writing, it is turned OFF by default because its
performance relative to "regular" NCHW kernels
is not good at smaller thread counts. But its speed-up is non-negligible
with higher thread counts on supporting
ARM platforms.
Once the gap is closed for smaller thread counts, it can be turned on by
default.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants