Fix int32 overflow in CUDA Gather kernel for large tensors #28108
Merged
justinchuby merged 2 commits into main on Apr 17, 2026
Conversation
The _GatherKernel used CUDA_LONG (int32_t) for input_index, which overflows when the input tensor has more than INT32_MAX (~2.1B) elements. For example, Gemma4's embed_tokens_per_layer embedding table is [262144, 8960] = 2.35B elements. Token IDs >= 239674 cause the offset computation to exceed INT32_MAX, resulting in an illegal memory access on CUDA.

Fix: use int64_t for input_index and explicitly cast input_block_index to int64_t before multiplication to ensure the entire expression is evaluated in 64-bit arithmetic.

Fixes #28107

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Contributor
Pull request overview
This PR fixes an integer overflow in the CUDA EP Gather kernel that could compute a wrapped/negative input offset for very large input tensors (> INT32_MAX elements), leading to CUDA illegal memory accesses.
Changes:
- Switch the input_index computation in _GatherKernel from CUDA_LONG (int32_t) to int64_t.
- Ensure 64-bit arithmetic by explicitly casting input_block_index to int64_t before multiplication.
- Add an inline comment documenting the overflow scenario and a concrete large-embedding example.
tianleiwu approved these changes on Apr 17, 2026
Description
The _GatherKernel in gather_impl.cu uses CUDA_LONG (int32_t) for input_index. When the input tensor has more than INT32_MAX (~2.1 billion) elements, the offset computation overflows, causing an illegal memory access (CUDA error 700).

Concrete example: Gemma4's per-layer embedding table is [262144, 8960] = 2.35 billion elements. Any token ID ≥ 239674 triggers the overflow, since the largest offset into that row, 239674 × 8960 + 8959 = 2,147,487,999, exceeds INT32_MAX.
Fix

Change input_index from CUDA_LONG (int32_t) to int64_t, and explicitly cast input_block_index to int64_t before multiplication. The other operands (input_block_size, idx) are already int64_t, so the full expression evaluates in 64-bit arithmetic.

Reproduction
See the minimal repro script in issue #28107; on any CUDA GPU with ORT 1.24.x it fails with an illegal memory access.
Motivation and Context
This affects any model with an embedding table exceeding ~2B elements. It currently blocks Gemma4 multimodal inference on the CUDA EP, since special tokens like <|image|> (ID 255999) are above the overflow threshold.

Fixes #28107