fix: out of bounds access for resize operation#27419
tianleiwu merged 6 commits into microsoft:main
Conversation
@microsoft-github-policy-service agree |
A unit test would be good. |
@xadupre our internal testing showed that the onnxruntime memory context corruption did indeed disappear with those two fixes in place. |
Pull request overview
Fixes two CUDA correctness issues identified via compute-sanitizer: an out-of-bounds write in the Resize LINEAR trilinear coordinate mapping (3D/5D paths) and a shared-memory race in the block-level reduction loop.
Changes:
- Adjust CUDA Resize LINEAR temp buffer sizing to account for trilinear mapping requiring D+H+W entries.
- Fix shared-memory reduction loop condition to avoid overlapping reads/writes within a reduction stride.
- Add regression tests covering repeated column-reduction execution and a CUDA-only 5D trilinear Resize case.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/test/providers/cuda/test_cases/reduction_functions_test.cc | Adds a repeated-run test to help catch intermittent reduction race issues. |
| onnxruntime/test/providers/cpu/tensor/resize_op_test.cc | Adds a CUDA-only trilinear Resize regression test exercising the 5D linear path. |
| onnxruntime/core/providers/cuda/tensor/resize_impl.cu | Fixes the LINEAR mapping buffer size calculation for trilinear (3D/5D) to prevent OOB writes. |
| onnxruntime/core/providers/cuda/reduction/reduction_functions.cu | Fixes the block-level reduction condition to prevent shared-memory race hazards. |
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline |
Azure Pipelines successfully started running 4 pipeline(s). |
### Description

This PR fixes:

* An out-of-bounds write in CUDA Resize for LINEAR mode when running trilinear paths (3D/5D)
* A race condition in the reduction kernel

### Root cause

1. The temporary dims-mapping buffer for LINEAR mode was sized using only H+W, while the trilinear coordinate mapping kernel writes D+H+W entries.
2. A shared-memory race in the block-level reduction loop inside reduction_functions.cu. The condition allowed threads outside the active lower half to update shared memory in the same stride phase, creating overlapping read/write hazards.

My colleague @korbinian-mechlem-snkeos noticed this warning from compute-sanitizer:

> ========= Invalid __global__ write of size 4 bytes
> ========= at void onnxruntime::cuda::_ResizeTrilinearCoordinateMapping<float, onnxruntime::cuda::TransformCoordinate_HALF_PIXEL>(long long, long long, long long, long long, long long, long long, float, float, float, float, float, float, float, float, float, unsigned long long, bool, const T2 &, onnxruntime::cuda::LinearMappingInfo *)+0x400
> ========= by thread (17,0,0) in block (2,0,0)
> ========= Address 0xb28fff7cc is out of bounds
> ========= and is 205 bytes after the nearest allocation at 0xb28fff400 of size 768 bytes
> ========= Saved host backtrace up to driver entry point at kernel launch time

and

> ========= Warning: Race reported between Read access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xe80
> ========= and Write access at void onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float, float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity, (bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xea0 [337920 hazards]

### Motivation and Context

Update the LINEAR buffer size calculation to:

* use H+W for bilinear (2D/4D)
* use D+H+W for trilinear (3D/5D)

This prevents invalid global writes and intermittent CUDA memory errors in trilinear resize workloads.

@johannes-rehm-snkeos
This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|--------------|
| decd177 | #27090 | Fix GatherND division by zero when batch dimensions mismatch |
| 55f8234 | #27360 | Fix QMoE CPU Operator |
| df9146f | #27403 | [MLAS] Adding DynamicQGemm function pointers and ukernel interface |
| 0f93853 | #27318 | [js/web] Use embedded WASM module in Blob URL workers when wasmBinary is provided |
| b2a6e69 | #27364 | QMoE CPU Performance Update (Up to 4x on 4-bit) |
| f501e1d | #27413 | Fix refcount bug in map input conversion that caused shutdown segfault |
| b32b205 | #27421 | Fix error where bytes is not assigned for dynamic qgemm pack b size |
| 426b006 | #27397 | Fix DllImportResolver |
| 0982844 | #27412 | MatmulNBits prepacking scales fix |
| 9afb0d2 | #27430 | Fix validation for external data paths for models loaded from bytes |
| 71d2cd0 | #27401 | Enable Python 3.14 CI and Upgrade Dependencies |
| 79e0676 | #27419 | fix: out of bounds access for resize operation |
| 82eb99c | #27459 | Fix SkipLayerNorm fusion incorrectly applied when gamma/beta are not 1D |
| 355278a | #27444 | Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write |
| cf96123 | #27411 | [web] fix usage of wasmBinary together with a blob URL for .mjs |
| 1131a86 | #27399 | [web] remove the unhelpful "Unknown CPU vendor" warning |
| ffbbc4f | #27316 | Build Windows ARM64X binaries as part of packaging pipeline |

---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: patryk-kaiser-ARM <patryk.kaiser@arm.com>
Co-authored-by: don <70039285+0-don@users.noreply.github.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Lukas Folle <126877803+lukas-folle-snkeos@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Chaya <cha182350@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Erik <erscor@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>