Fix Conv LHS packing padding/uninitialized ptrs#27214

Closed
JonathanC-ARM wants to merge 1 commit into microsoft:main from JonathanC-ARM:jonclo01_conv_accuracy_issue_fix

Conversation

@JonathanC-ARM
Contributor

@JonathanC-ARM commented Jan 30, 2026

Description

This PR fixes non-deterministic corruption on CPU EP observed when the CPU Conv path is dispatched through the MLAS KleidiAI implementation.

The fix is entirely within onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp and addresses two correctness issues in the KleidiAI Conv LHS packing/padding logic:

  • Fully initialize the LHS indirection table (lhs_ptrs) for partial tiles (m < m_step) so the LHS packing kernel never reads uninitialized entries.
  • Replace the padding buffer that was effectively fixed-size-after-first-call with a per-thread grow-only buffer sized to at least the current ci, preventing out-of-bounds reads for later convolutions with larger channel counts.

Motivation and Context

This change is required to fix incorrect, non-deterministic Conv outputs on CPU (often extreme magnitudes / -FLT_MAX-like values) after running certain Conv-heavy models in the same process. It fixes the root causes of uninitialized reads in the KleidiAI Conv LHS packing path and out-of-bounds reads from an undersized padding buffer.
Fixes: #26669

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Contributor

@xenova left a comment

Tested in the same environment from #26669, and I now get deterministic outputs. Thanks so much!

Contributor

@adrianlizarraga left a comment

I can also confirm that this patch resolves the non-deterministic output on M4. Thank you!

@adrianlizarraga
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Win_TRT_Minimal_CUDA_Test_CI, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@adrianlizarraga
Contributor

/azp run Windows ARM64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@edgchen1
Contributor

thanks for the fix. are there existing unit tests which would catch this or do we need more test coverage?

@JonathanC-ARM
Contributor Author

thanks for the fix. are there existing unit tests which would catch this or do we need more test coverage?

I was planning to investigate this and follow up with a subsequent PR to address it.

I suspect the answer is that no existing tests cover this exact case. We routinely run the unit and integration tests with this code path and haven't seen accuracy issues like this before.


// pad_ptr must be at least 'ci' floats for padding pixels.
// Using a thread_local grow-only buffer to avoid cross-thread interference and ensure sizing is correct.
thread_local std::vector<float> pad_ptr;
Member

thread_local std::vector<float> pad_ptr;

Would a stack local vector avoid cross-thread interference? And if indeed there is an issue, should this be synchronized?

Contributor Author

So each thread would have its own individual instance of this vector, and no synchronization mechanism is required because the storage is thread-local and not shared between threads.

@edgchen1
Contributor

thanks for the fix. are there existing unit tests which would catch this or do we need more test coverage?

I was planning to investigate this and follow up with a subsequent PR to address it.

I suspect the answer is that no existing tests cover this exact case. We routinely run the unit and integration tests with this code path and haven't seen accuracy issues like this before.

sounds good. I'm fine with a subsequent PR for it due to the urgency.

Contributor

Copilot AI left a comment

Pull request overview

This PR attempts to fix non-deterministic corruption in CPU Conv operations when using the MLAS KleidiAI implementation. It addresses two issues:

Changes:

  • Fully initialize the LHS indirection table (lhs_ptrs) to prevent uninitialized reads by the packing kernel
  • Replace the static fixed-size padding buffer with a thread-local grow-only buffer to handle varying channel counts

Comment on lines +471 to +478
thread_local std::vector<float> pad_ptr;
if (pad_ptr.size() < padsize) {
pad_ptr.resize(padsize, 0.f);
} else {
// Ensure any previously-used region remains zeroed (grow-only means it should already be zeros,
// but keep this explicit for safety).
std::fill(pad_ptr.begin(), pad_ptr.end(), 0.f);
}

Copilot AI Jan 30, 2026

The cached lhs_ptrs array contains pointers to pad_ptr (line 403, and via pixel_offset lambda at lines 420, 425, 430). When pad_ptr is resized (line 473), its underlying storage may be reallocated to a new address, invalidating all pointers stored in previously cached lhs_ptrs entries.

Example scenario:

  1. Call with ci=100, padsize=256: pad_ptr at address A, lhs_ptrs cached with pointers to A
  2. Call with ci=500, padsize=512: pad_ptr resized to address B
  3. Call with ci=100 again: cache hit returns lhs_ptrs with stale pointers to A (now invalid)

The old static vector never moved after first initialization (though it had the undersizing bug this PR fixes). The thread_local grow-only approach fixes the sizing issue but breaks the caching because the cache doesn't invalidate when pad_ptr moves.

Possible solutions:

  • Invalidate lhs_ptrs_cache when pad_ptr is resized
  • Store pad_ptr address in cache key to detect when it has moved
  • Use a stable allocation for pad_ptr that doesn't move when it grows
  • Don't cache lhs_ptrs, or recalculate pointers before use

Member

I think Copilot has a point about potential hanging on to pointers that may be invalidated due to the resize() logic

Member

@yuslepukhin Jan 30, 2026

Absolutely.

Contributor

@adrianlizarraga Jan 30, 2026

I don't think this comment is accurate. The lhs_ptrs array is created every time LhsPtrFill is called. LhsPtrFill always receives the updated pad_ptr after it has been resized. Please correct me if I'm mistaken @JonathanC-ARM .

Update: It seems this may indeed be a problem.

Member

I think adding pad_ptr to the cache key might be the way to go (Copilot's second bullet)

@adrianlizarraga self-requested a review January 30, 2026 18:04
hanbitmyths pushed a commit that referenced this pull request Feb 3, 2026
### Description

Refer to V1 of the fix here:
#27214

This PR includes all fixes from the V1 PR + logic to invalidate the lhs
cache pointers in case the pad buffer's underlying buffer has changed
due to a resize. The ARM team will look at potentially enhancing this
logic after the 1.24.0 release.

### Motivation and Context
Fix #26669
tianleiwu pushed a commit that referenced this pull request Feb 3, 2026
@xenova
Contributor

xenova commented Feb 4, 2026

can be closed since #27215 is merged, right? 👍

@hariharans29
Member

Closing this as "V2" of the fix has been merged

@JonathanC-ARM
Contributor Author

Apologies, forgot to close this on my end!

@hariharans29
Member

Apologies, forgot to close this on my end!

No problem :)

hariharans29 pushed a commit that referenced this pull request Feb 26, 2026
### Description
This PR introduces some minor code changes which do the following:
- Fix copilot header include suggestions from
#27439
- Add testcase which covers code path fixed via
#27215 and test case
discussed in #27214
- Change pointer invalidation to cover only updated pointer in pad
structure

### Testing
This patch was tested in two ways.
1) After creating tests that I expected to trigger the earlier failure
case, I reverted the convolve_kleidiai.cpp code to before the initial
pad_ptr fix in [Hari's change](#27215) was introduced. I added debug
logging and tested for failures to highlight the pointer moving and being
invalidated; an example failure is shown below.
2) I reintroduced the current code and then tested multiple times <br>
`for i in $(seq 1 2000); do echo "ITER=$i"; ./onnxruntime_mlas_test
--long --gtest_filter='*Conv2d*' || break; done`


### Explanation of Subsequent logs <br>
1) **Padding buffer relocation**
- `KLEIDIAI_CONV_LHS pad_buf MOVED ci=320 padsize=512 old=0x12e80d800
new=0x12e81ac00`
- Meaning: the internal zero padding buffer used for out-of-bounds
pixels was resized and the underlying storage address changed (`old` →
`new`). Any previously-built indirection table entries that pointed at
the old padding buffer are now stale.

2) **Reuse of cached indirection table after the move**
- `KLEIDIAI_CONV_LHS indirection_cache HIT ci=64 m=121 **pad=0x12e81ac00
old_pad=0x12e80d800 (after_pad_move)**`
- Meaning: for a later convolution (`ci=64`) the indirection-table cache
returned a HIT. The log prints the current pad buffer address
(`pad=...`) and the most recent prior padding-buffer address
(`old_pad=...`) captured during the move. The `(after_pad_move)` tag
indicates that this cache HIT occurred after a pad-buffer relocation
event, which is the dangerous case in the pre-fix implementation (cached
tables may still contain pointers to `old_pad`).

In failing runs, the output mismatch occurs immediately after this
sequence, showing a clear correlation: **pad buffer moved → cached
indirection table reused → incorrect results**.

* One note on the test: due to time constraints for the 2000 runs, I commented out most of the rest of the fixture in the changed file before running.
```
jonclo01$ ./onnxruntime_mlas_test --long --gtest_filter='*Conv2d*' clear
-------------------------------------------------------
----Total 3066 tests registered programmably!
-------------------------------------------------------
Note: Google Test filter = *Conv2d*
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from Conv2d_SingleThread
[ RUN      ] Conv2d_SingleThread.LongExecute
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 496 : KLEIDIAI_CONV_LHS pad_buf ci=64 padsize=256 addr=0x12e80d800
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 543 : KLEIDIAI_CONV_LHS indirection_cache MISS ci=64 m=121 pad=0x12e80d800
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 325 : kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme M=121 k_chunk_count=9 k_chunk_length=64
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 376 : kai_run_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme N=32 k_chunk_count=9 k_chunk_length=64 rhs_stride_row=128
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 653 : kai_run_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa M=121 N=32 k_chunk_count=9 k_chunk_length=64
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp : 349 : kai_run_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme Groups=1 N=121 K=576 nr=32 kr=1 sr=1 rhs_stride_row=484
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 490 : KLEIDIAI_CONV_LHS **pad_buf MOVED ci=320 padsize=512 old=0x12e80d800 new=0x12e81ac00**
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 543 : KLEIDIAI_CONV_LHS indirection_cache MISS ci=320 m=121 pad=0x12e81ac00
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 325 : kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme M=121 k_chunk_count=9 k_chunk_length=320
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 376 : kai_run_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme N=32 k_chunk_count=9 k_chunk_length=320 rhs_stride_row=128
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 653 : kai_run_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa M=121 N=32 k_chunk_count=9 k_chunk_length=320
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp : 349 : kai_run_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme Groups=1 N=121 K=2880 nr=32 kr=1 sr=1 rhs_stride_row=484
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 535 : KLEIDIAI_CONV_LHS indirection_cache HIT ci=64 m=121 **pad=0x12e81ac00 old_pad=0x12e80d800 (after_pad_move)**
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 325 : kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme M=121 k_chunk_count=9 k_chunk_length=64
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp : 653 : kai_run_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa M=121 N=32 k_chunk_count=9 k_chunk_length=64
[KLEIDIAI KERNEL]: /Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp : 349 : kai_run_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme Groups=1 N=121 K=576 nr=32 kr=1 sr=1 rhs_stride_row=484
/Users/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/test/mlas/unittest/test_conv2d.h:249: Failure
Expected equality of these values:
  memcmp(Output, OutputReference, OutputElements * sizeof(float))
    Which is: 90
  0
B1/G1/Cpg64/Fpg32/H11/W11/KH3/KW3/Pad1,1,1,1/Dilation1,1/Stride1,1
Stack trace:
  0x10247ba34: MlasConv2DTest<>::ExecuteLong()
  0x102651904: testing::internal::HandleExceptionsInMethodIfSupported<>()
  0x1026517a4: testing::Test::Run()
  0x102652b5c: testing::TestInfo::Run()
  0x102653c84: testing::TestSuite::Run()
... Google Test internal frames ...

[  FAILED  ] Conv2d_SingleThread.LongExecute, where GetParam() = LongExecute (10 ms)
```

---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>

Development

Successfully merging this pull request may close these issues.

[cpu] Loading certain models leads to global error state on M4 Max
