
Fix int32 overflow in CUDA Cast and UnaryElementWise kernels for tensors with >2^31 elements#28386

Draft
Copilot wants to merge 3 commits into main from copilot/fix-cuda-cast-kernel-crash

Conversation

Contributor

Copilot AI commented May 6, 2026

Description

Switch per-thread element indices from CUDA_LONG (int32_t) to int64_t in CUDA Cast and UnaryElementWise kernels to prevent illegal memory access on tensors exceeding 2^31 elements.

  • cu_inc/unary_elementwise_impl.cuh: Change N parameter and loop index in _UnaryElementWise kernel from CUDA_LONG to int64_t. Fix blocksPerGrid intermediate multiplication to use size_t. This fixes the overflow for all unary elementwise ops (Cast, Abs, Neg, Sqrt, Log, Exp, Erf, etc.).
  • tensor/cast_op.cu: Same fix for CastKernelStd, CastKernelSat, and CudaCastPairwiseKernel. Remove static_cast<int>(num_of_elements) truncation in launch functions.
  • cast_op_test.cc: Add LargeTensorCastNoCrash regression test.

Before (crashes):

// N silently wraps negative for count >= 2^31, bound check passes for invalid offsets
CUDA_LONG N = static_cast<CUDA_LONG>(count);  // int32 truncation
CUDA_LONG id = NumElementsPerThread * NumThreadsPerBlock * blockIdx.x + threadIdx.x;  // int32 overflow

After:

int64_t N = static_cast<int64_t>(count);
int64_t id = static_cast<int64_t>(NumElementsPerThread) * NumThreadsPerBlock * blockIdx.x + threadIdx.x;

Motivation and Context

This is the same class of bug fixed in Gather by #28108 (in response to #28107). The Cast kernel uses CUDA_LONG = int32_t for its element index, which wraps negative once the element count crosses INT32_MAX. This affects any causal LM ONNX export where vocab_size × seq_length > 2^31 — practically every long-context HF model on the CUDA EP at seq_length ≥ 16K–32K.

Copilot AI and others added 2 commits May 6, 2026 20:35
…ors with >2^31 elements

Switch per-thread element index from CUDA_LONG (int32_t) to int64_t in:
- _UnaryElementWise kernel (cu_inc/unary_elementwise_impl.cuh)
- CastKernelStd kernel (tensor/cast_op.cu)
- CastKernelSat kernel (tensor/cast_op.cu)
- CudaCastPairwiseKernel (tensor/cast_op.cu)

Also fix the launch functions to pass element count as int64_t instead of
truncating via static_cast<int>, and fix blocksPerGrid calculation to
avoid int32 overflow in the intermediate multiplication.

Add regression test for large tensor cast.

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/0b1e04ca-17bd-4f26-aaec-728240d54577

Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/0b1e04ca-17bd-4f26-aaec-728240d54577

Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix CUDA Cast kernel illegal memory access on large tensors Fix int32 overflow in CUDA Cast and UnaryElementWise kernels for tensors with >2^31 elements May 6, 2026
Copilot AI requested a review from tianleiwu May 6, 2026 20:37
Contributor

@tianleiwu left a comment


Review Summary

Correct and well-scoped fix for a real int32 overflow bug in CUDA Cast and UnaryElementWise kernels. The changes consistently replace CUDA_LONG (int32_t) with int64_t across kernel parameters and index calculations, matching the same fix pattern applied to Gather in PR #28108.

Positives:

  • The static_cast<int64_t>(NumElementsPerThread) correctly anchors the multiplication chain in 64-bit arithmetic before multiplying with blockIdx.x, preventing intermediate overflow.
  • The unary_elementwise_impl.cuh header change propagates the fix to all unary elementwise ops (Abs, Neg, Sqrt, Log, Exp, Erf, etc.) in a single edit.
  • All three cast kernel variants (CastKernelStd, CastKernelSat, CudaCastPairwiseKernel) are consistently updated — no kernel was missed.
  • Removal of static_cast<int>(num_of_elements) truncation in the launch functions is the most important part, since that's where size_t → int32_t silently lost high bits.

Broader concern (out of scope): The CALCULATE_ELEMENTWISE_INDEX_OR_EXIT macro still uses CUDA_LONG and is used by dozens of other CUDA kernels (expand, tile, scatter_nd, resize, upsample, etc.) — they have the same int32 overflow vulnerability. Consider filing a follow-up issue to track the systemic fix.

TEST(CastOpTest, LargeTensorCastNoCrash) {
// Use a tensor large enough to be meaningful but not require excessive memory.
// 2^24 = 16M elements is enough to exercise the kernel grid calculation while
// staying within typical CI GPU memory limits.
Contributor


16M elements (2^24) is far below INT32_MAX (2^31). The old code with CUDA_LONG indices would also pass this test — this does not provide regression protection against someone accidentally reverting the index type back to CUDA_LONG.

The test name LargeTensorCastNoCrash and comment "Regression test for CUDA Cast kernel int32 overflow" overstate what it validates. It's a useful correctness smoke test, but not an overflow regression test.

Options to improve:

  • Rename to CastKernelCorrectness_ModerateSize to reflect what it actually tests.
  • Add a separate test gated on available GPU memory (e.g., skip if < 10 GB free) that allocates >2^31 elements.
  • Add a host-side unit test that verifies the grid launch calculation (blocksPerGrid, N) uses 64-bit arithmetic for counts > INT32_MAX.

bool is_odd = (num_of_elements & 0x01) != 0;

- int pair_count = static_cast<int>(num_of_elements / 2);
+ int64_t pair_count = static_cast<int64_t>(num_of_elements / 2);
Contributor


Minor: num_of_elements is already size_t, so pair_count could stay size_t instead of converting size_t → int64_t here and then back to size_t in the CeilDiv call below (static_cast<size_t>(pair_count)). The current code is correct but the double conversion is mildly surprising.



Development

Successfully merging this pull request may close these issues.

CUDA Cast kernel crashes with illegal memory access on tensors with >2^31 elements (int32 overflow) — same family as #28107
