Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write #27444
Merged
Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write #27444
Conversation
Contributor
|
Thanks @chilo-ms . There's a linter error but otherwise looks good to me! |
adrianlizarraga
approved these changes
Feb 25, 2026
tianleiwu
approved these changes
Feb 26, 2026
tianleiwu
pushed a commit
that referenced
this pull request
Feb 26, 2026
…ead/Write (#27444) ### Description This pull request improves the robustness and correctness of the CPU implementation of the Gather operator in ONNX Runtime. The key changes focus on preventing integer overflow issues in parallel processing and output shape calculations, as well as enhancing test coverage to verify these safeguards. Enhancements to overflow handling and parallel processing: * Changed the lambda function in `GatherCopyData` to use `ptrdiff_t` instead of `int64_t` for the index, and explicitly cast batch and i variables, ensuring safer arithmetic for large tensor sizes. * Updated the parallel loop in `GatherCopyData` to iterate using `ptrdiff_t` indices, preventing potential overflow when processing large tensors. Testing improvements: * Added a new unit test `Gather_overflow_check` in `gather_op_test.cc` to verify that the Gather operator correctly handles very large output shapes without overflowing, specifically testing dimensions that exceed the 32-bit integer limit. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->
tianleiwu
added a commit
that referenced
this pull request
Feb 27, 2026
This cherry-picks the following commits for the release: | Commit ID | PR Number | Commit Title | |-----------|-----------|-------------| | decd177 | #27090 | Fix GatherND division by zero when batch dimensions mismatch | | 55f8234 | #27360 | Fix QMoE CPU Operator | | df9146f | #27403 | [MLAS] Adding DynamicQGemm function pointers and ukernel interface | | 0f93853 | #27318 | [js/web] Use embedded WASM module in Blob URL workers when wasmBinary is provided | | b2a6e69 | #27364 | QMoE CPU Performance Update (Up to 4x on 4-bit) | | f501e1d | #27413 | Fix refcount bug in map input conversion that caused shutdown segfault | | b32b205 | #27421 | Fix error where bytes is not assigned for dynamic qgemm pack b size | | 426b006 | #27397 | Fix DllImportResolver | | 0982844 | #27412 | MatmulNBits prepacking scales fix | | 9afb0d2 | #27430 | Fix validation for external data paths for models loaded from bytes | | 71d2cd0 | #27401 | Enable Python 3.14 CI and Upgrade Dependencies | | 79e0676 | #27419 | fix: out of bounds access for resize operation | | 82eb99c | #27459 | Fix SkipLayerNorm fusion incorrectly applied when gamma/beta are not 1D | | 355278a | #27444 | Fix GatherCopyData Integer Truncation Leading to Heap Out-of-Bounds Read/Write | | cf96123 | #27411 | [web] fix usage of wasmBinary together with a blob URL for .mjs | | 1131a86 | #27399 | [web] remove the unhelpful "Unknown CPU vendor" warning. | | ffbbc4f | #27316 | Build Windows ARM64X binaries as part of packaging pipeline | --------- Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com> Co-authored-by: patryk-kaiser-ARM <patryk.kaiser@arm.com> Co-authored-by: don <70039285+0-don@users.noreply.github.com> Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com> Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: Lukas Folle <126877803+lukas-folle-snkeos@users.noreply.github.com> Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Chaya <cha182350@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Erik <erscor@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
adrianlizarraga
added a commit
that referenced
this pull request
Feb 28, 2026
#27483) ### Description This PR reduces the size of the memory allocation for expected outputs from ~4GiB to ~2GiB in the Gather_overflow_check test. The updated test still verifies that the integer overflow fix from PR #27444 is valid. That is, that the CPU Gather operator correctly handles output tensors with element counts that exceed INT32_MAX. Changes: - Reduced test dimension from 65537 to 46341 (output shape from 65537×65537 to 46341×46341), which results in a total number of elements that is just over INT32_MAX, which is required to test the bug fix. - The peak memory usage is reduced to ~4GiB + overhead. - Increase Android emulator memory to 5GiB (from 4GiB) to be able to run the test. ### Motivation Android CI fails to run the unit test introduced in #27444 due to memory usage that exceeds the Android emulator's default memory of 4GiB. This PR lowers the peak memory usage of the unit test and increases the Android emulator's memory by 1GiB.
tianleiwu
pushed a commit
that referenced
this pull request
Feb 28, 2026
#27483) ### Description This PR reduces the size of the memory allocation for expected outputs from ~4GiB to ~2GiB in the Gather_overflow_check test. The updated test still verifies that the integer overflow fix from PR #27444 is valid. That is, that the CPU Gather operator correctly handles output tensors with element counts that exceed INT32_MAX. Changes: - Reduced test dimension from 65537 to 46341 (output shape from 65537×65537 to 46341×46341), which results in a total number of elements that is just over INT32_MAX, which is required to test the bug fix. - The peak memory usage is reduced to ~4GiB + overhead. - Increase Android emulator memory to 5GiB (from 4GiB) to be able to run the test. ### Motivation Android CI fails to run the unit test introduced in #27444 due to memory usage that exceeds the Android emulator's default memory of 4GiB. This PR lowers the peak memory usage of the unit test and increases the Android emulator's memory by 1GiB.
This was referenced Mar 9, 2026
chilo-ms
added a commit
that referenced
this pull request
Mar 12, 2026
…rite (#27544) ### Description <!-- Describe your changes. --> This pull request refactors several tensor operation kernels (`GatherND`, `ScatterND`, and `GatherGrad`) to improve type safety and consistency in parallelized code execution. The main change is replacing `int` loop indices with `ptrdiff_t` to avoid overflow. ### Parallelization and Type Safety Improvements * Updated lambda functions and parallel loop indices in `gather_nd.cc` (`GatherNDBase::PrepareForCompute`, `GatherND::GatherNumber`, and `GatherND::GatherString`) to use `ptrdiff_t` instead of `int64_t`, and replaced index arithmetic with explicit casts to maintain correctness. [[1]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL96-R100) [[2]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL121-R121) [[3]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL192-R216) * Refactored `scatter_nd.cc` (`ScatterNDDispatchTarget`) to use `ptrdiff_t` for loop indices and index arithmetic in all reduction cases, ensuring consistent type usage in parallel execution. * Modified `gather_grad.cc` (`GatherGrad::ComputeImpl`) to use `ptrdiff_t` for parallel loop indices, aligning with the changes in other tensor kernels. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Another same issue was fixed in #27444
This was referenced Mar 17, 2026
This was referenced Mar 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This pull request improves the robustness and correctness of the CPU implementation of the Gather operator in ONNX Runtime. The key changes focus on preventing integer overflow issues in parallel processing and output shape calculations, as well as enhancing test coverage to verify these safeguards.
Enhancements to overflow handling and parallel processing:
GatherCopyDatato useptrdiff_tinstead ofint64_tfor the index, and explicitly cast batch and i variables, ensuring safer arithmetic for large tensor sizes.GatherCopyDatato iterate usingptrdiff_tindices, preventing potential overflow when processing large tensors.Testing improvements:
Gather_overflow_checkingather_op_test.ccto verify that the Gather operator correctly handles very large output shapes without overflowing, specifically testing dimensions that exceed the 32-bit integer limit.Motivation and Context