Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 17 additions & 2 deletions onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -395,6 +395,12 @@ static std::shared_ptr<const void*[]> LhsPtrFill(const size_t ci, const size_t i
auto lhs_ptrs = std::shared_ptr<const void*[]>(new const void*[lhs_ptrs_k * lhs_ptrs_m],
std::default_delete<const void*[]>());

// Initialize all padding entries. For partial tiles (m < m_step),
// the kai LHS packing kernel may still read pointer entries beyond the logically
// filled 'm' positions. Leaving these uninitialized can cause non-deterministic
// reads and corrupt packed LHS data.
auto lhs_ptrs_ = lhs_ptrs.get();
std::fill(lhs_ptrs_, lhs_ptrs_ + (lhs_ptrs_k * lhs_ptrs_m), reinterpret_cast<const void*>(&pad_ptr[0]));

auto ih_out_size = ComputeConvOutSize(ih, kh, padding, 1);
auto iw_out_size = ComputeConvOutSize(iw, kw, padding, 1);
Expand Down Expand Up @@ -430,7 +436,6 @@ static std::shared_ptr<const void*[]> LhsPtrFill(const size_t ci, const size_t i
};

size_t m_{0};
auto lhs_ptrs_ = lhs_ptrs.get();
for (size_t ih_ = 0; ih_ < ih_out_size; ih_ += sh) {
for (size_t iw_ = 0; iw_ < iw_out_size; iw_ += sw, ++m_) {
size_t k_{0};
Expand Down Expand Up @@ -460,7 +465,17 @@ static std::unique_ptr<std::byte[]> LhsPackImageDataSme(const size_t ci, const s
// figure out how many blocks needed to correctly fill padding
padsize = ((ci + padsize - 1) / padsize) * padsize;
}
static std::vector<float>pad_ptr(padsize, 0.f);

// pad_ptr must be at least 'ci' floats for padding pixels.
// Using a thread_local grow-only buffer to avoid cross-thread interference and ensure sizing is correct.
thread_local std::vector<float> pad_ptr;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thread_local std::vector pad_ptr;

Would a stack local vector avoid cross-thread interference? And if indeed there is an issue, should this be synchronized?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So each thread would have its own individual instance of this vector, and no synchronization mechanism is required because the storage is thread-local and not shared between threads.

if (pad_ptr.size() < padsize) {
pad_ptr.resize(padsize, 0.f);
} else {
// Ensure any previously-used region remains zeroed (grow-only means it should already be zeros,
// but keep this explicit for safety).
std::fill(pad_ptr.begin(), pad_ptr.end(), 0.f);
}
Comment on lines +471 to +478

Copilot AI Jan 30, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cached lhs_ptrs array contains pointers to pad_ptr (line 403, and via pixel_offset lambda at lines 420, 425, 430). When pad_ptr is resized (line 473), its underlying storage may be reallocated to a new address, invalidating all pointers stored in previously cached lhs_ptrs entries.

Example scenario:

  1. Call with ci=100, padsize=256: pad_ptr at address A, lhs_ptrs cached with pointers to A
  2. Call with ci=500, padsize=512: pad_ptr resized to address B
  3. Call with ci=100 again: cache hit returns lhs_ptrs with stale pointers to A (now invalid)

The old static vector never moved after first initialization (though it had the undersizing bug this PR fixes). The thread_local grow-only approach fixes the sizing issue but breaks the caching because the cache doesn't invalidate when pad_ptr moves.

Possible solutions:

  • Invalidate lhs_ptrs_cache when pad_ptr is resized
  • Store pad_ptr address in cache key to detect when it has moved
  • Use a stable allocation for pad_ptr that doesn't move when it grows
  • Don't cache lhs_ptrs, or recalculate pointers before use

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Copilot has a point about potential hanging on to pointers that may be invalidated due to the resize() logic

@yuslepukhin yuslepukhin Jan 30, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely.

@adrianlizarraga adrianlizarraga Jan 30, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is accurate. The lhs_ptrs array is created every time LhsPtrFill is called. LhsPtrFill always receives the updated pad_ptr after it has been resized. Please correct me if I'm mistaken @JonathanC-ARM .

Update: It seems maybe this is a problem

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding pad_ptr to the cache key might be the way to go (Coplilot's second bullet)


LhsCacheKey key = {
ci, ih, iw,
Expand Down
Loading