
fix: out of bounds access for resize operation #27419

Merged
tianleiwu merged 6 commits into microsoft:main from lukas-folle-snkeos:main
Feb 25, 2026

Conversation

@lukas-folle-snkeos
Contributor

@lukas-folle-snkeos lukas-folle-snkeos commented Feb 23, 2026

Description

This PR fixes:

  • An out-of-bounds write in CUDA Resize for LINEAR mode when running trilinear paths (3D/5D)
  • A race condition for the reduction kernel

Root cause

  1. The temporary dims-mapping buffer for LINEAR mode was sized using only H+W, while the trilinear coordinate mapping kernel writes D+H+W entries.
  2. A shared-memory race in the block-level reduction loop inside reduction_functions.cu: the condition allowed threads outside the active lower half to update shared memory in the same stride phase, creating overlapping read/write hazards.
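To illustrate the second issue, here is a minimal host-side sketch of a block-level tree reduction (function and variable names are hypothetical, not the actual ONNX Runtime code). The race arises when the guard lets threads with `tid >= stride` write `shared[tid]` in the same stride phase in which a lower-half thread reads that slot; restricting writers to `tid < stride` keeps reads (`shared[tid + stride]`) and writes (`shared[tid]`) disjoint within each phase.

```cpp
#include <cstddef>
#include <vector>

// Host-side simulation of one thread block reducing a shared-memory buffer.
// Assumes the buffer length is a power of two. The inner loop over `tid`
// stands in for the block's threads; on the GPU a __syncthreads() barrier
// separates the stride phases.
float tree_reduce(std::vector<float> shared) {
  for (std::size_t stride = shared.size() / 2; stride > 0; stride /= 2) {
    for (std::size_t tid = 0; tid < shared.size(); ++tid) {
      if (tid < stride) {  // only the active lower half may write
        shared[tid] += shared[tid + stride];
      }
    }
    // __syncthreads() would go here in the CUDA kernel
  }
  return shared[0];
}
```

With the correct `tid < stride` guard, each slot is either read or written in a given phase, never both.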

My colleague @korbinian-mechlem-snkeos noticed this warning from compute-sanitizer:

========= Invalid __global__ write of size 4 bytes
========= at void onnxruntime::cuda::_ResizeTrilinearCoordinateMapping<float, onnxruntime::cuda::TransformCoordinate_HALF_PIXEL>(long long, long long, long long, long long, long long, long long, float, float, float, float, float, float, float, float, float, unsigned long long, bool, const T2 &, onnxruntime::cuda::LinearMappingInfo *)+0x400
========= by thread (17,0,0) in block (2,0,0)
========= Address 0xb28fff7cc is out of bounds
========= and is 205 bytes after the nearest allocation at 0xb28fff400 of size 768 bytes
========= Saved host backtrace up to driver entry point at kernel launch time

AND

========= Warning: Race reported between Read access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xe80
========= and Write access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xea0 [337920 hazards]

Motivation and Context

Update LINEAR buffer size calculation to:

  • use H+W for bilinear (2D/4D)
  • use D+H+W for trilinear (3D/5D)

Prevents invalid global writes and intermittent CUDA memory errors in trilinear resize workloads.
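The corrected sizing can be sketched as follows (a hedged illustration with hypothetical names; the actual fix lives in resize_impl.cu). The LINEAR path allocates one mapping entry per output coordinate along each resized axis, so the bilinear path needs H + W entries while the trilinear path also maps depth and needs D + H + W; sizing the trilinear path with only H + W leaves the D entries unallocated, which is the out-of-bounds write reported above.

```cpp
#include <cstdint>

// Number of per-axis coordinate-mapping entries the LINEAR temp buffer
// must hold (each entry corresponds to a LinearMappingInfo slot).
// Names and signature are illustrative, not the upstream API.
int64_t LinearMappingBufferEntries(bool is_trilinear,
                                   int64_t output_depth,
                                   int64_t output_height,
                                   int64_t output_width) {
  return is_trilinear ? output_depth + output_height + output_width  // 3D/5D
                      : output_height + output_width;                // 2D/4D
}
```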

@johannes-rehm-snkeos

@lukas-folle-snkeos
Contributor Author

@microsoft-github-policy-service agree

@lukas-folle-snkeos lukas-folle-snkeos marked this pull request as ready for review February 23, 2026 10:10
@xadupre
Member

xadupre commented Feb 23, 2026

A unit test would be good.

@lukas-folle-snkeos
Contributor Author

@xadupre our internal testing showed that the onnxruntime memory context corruption did indeed disappear with those two fixes in place.

Contributor

Copilot AI left a comment


Pull request overview

Fixes two CUDA correctness issues identified via compute-sanitizer: an out-of-bounds write in the Resize LINEAR trilinear coordinate mapping (3D/5D paths) and a shared-memory race in the block-level reduction loop.

Changes:

  • Adjust CUDA Resize LINEAR temp buffer sizing to account for trilinear mapping requiring D+H+W entries.
  • Fix shared-memory reduction loop condition to avoid overlapping reads/writes within a reduction stride.
  • Add regression tests covering repeated column-reduction execution and a CUDA-only 5D trilinear Resize case.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| onnxruntime/test/providers/cuda/test_cases/reduction_functions_test.cc | Adds a repeated-run test to help catch intermittent reduction race issues. |
| onnxruntime/test/providers/cpu/tensor/resize_op_test.cc | Adds a CUDA-only trilinear Resize regression test exercising the 5D linear path. |
| onnxruntime/core/providers/cuda/tensor/resize_impl.cu | Fixes LINEAR mapping buffer size calculation for trilinear (3D/5D) to prevent OOB writes. |
| onnxruntime/core/providers/cuda/reduction/reduction_functions.cu | Fixes block-level reduction condition to prevent shared-memory race hazards. |


@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@tianleiwu tianleiwu merged commit 79e0676 into microsoft:main Feb 25, 2026
91 of 92 checks passed
tianleiwu pushed a commit that referenced this pull request Feb 26, 2026
tianleiwu added a commit that referenced this pull request Feb 27, 2026
This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|--------------|
| decd177 | #27090 | Fix GatherND division by zero when batch dimensions mismatch |
| 55f8234 | #27360 | Fix QMoE CPU Operator |
| df9146f | #27403 | [MLAS] Adding DynamicQGemm function pointers and ukernel interface |
| 0f93853 | #27318 | [js/web] Use embedded WASM module in Blob URL workers when wasmBinary is provided |
| b2a6e69 | #27364 | QMoE CPU Performance Update (Up to 4x on 4-bit) |
| f501e1d | #27413 | Fix refcount bug in map input conversion that caused shutdown segfault |
| b32b205 | #27421 | Fix error where bytes is not assigned for dynamic qgemm pack b size |
| 426b006 | #27397 | Fix DllImportResolver |
| 0982844 | #27412 | MatmulNBits prepacking scales fix |
| 9afb0d2 | #27430 | Fix validation for external data paths for models loaded from bytes |
| 71d2cd0 | #27401 | Enable Python 3.14 CI and Upgrade Dependencies |
| 79e0676 | #27419 | fix: out of bounds access for resize operation |
| 82eb99c | #27459 | Fix SkipLayerNorm fusion incorrectly applied when gamma/beta are not 1D |
| 355278a | #27444 | Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write |
| cf96123 | #27411 | [web] fix usage of wasmBinary together with a blob URL for .mjs |
| 1131a86 | #27399 | [web] remove the unhelpful "Unknown CPU vendor" warning. |
| ffbbc4f | #27316 | Build Windows ARM64X binaries as part of packaging pipeline |

---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: patryk-kaiser-ARM <patryk.kaiser@arm.com>
Co-authored-by: don <70039285+0-don@users.noreply.github.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Lukas Folle <126877803+lukas-folle-snkeos@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Chaya <cha182350@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Erik <erscor@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>