
Support TensorScatter (24) - CUDA #27446

Merged
titaiwangms merged 3 commits into main from titaiwang/support_tensor_scatter_cuda
Feb 26, 2026

Conversation

titaiwangms (Contributor) commented Feb 25, 2026

This pull request adds a new CUDA kernel implementation for the TensorScatter operator in ONNX Runtime, targeting opset 24. The implementation includes kernel registration, device-side logic, and comprehensive input validation, supporting both "linear" and "circular" scatter modes. The operator is also documented and tested, including negative and out-of-bounds input scenarios.

New Operator Implementation:

  • Introduced the TensorScatter CUDA kernel (onnxruntime/core/providers/cuda/llm/tensorscatter.cc, tensorscatter.h, tensorscatter_impl.cu, tensorscatter_impl.h), supporting both "linear" and "circular" modes, with detailed input validation and device-side scatter logic.

Kernel Registration and Integration:

  • Registered the TensorScatter kernel for CUDA in opset 24 within the execution provider and kernel registry (onnxruntime/core/providers/cuda/cuda_execution_provider.cc).

Documentation:

  • Added the TensorScatter operator to the operator kernel documentation, specifying supported types and attributes (docs/OperatorKernels.md).

Testing and Validation:

  • Added unit tests for TensorScatter, including negative tests for invalid write_indices and out-of-bounds conditions in both "linear" and "circular" modes (onnxruntime/test/providers/cpu/llm/tensorscatter_op_test.cc).

Copilot AI left a comment

Pull request overview

This pull request adds CUDA execution provider support for the new TensorScatter operator (opset 24). The implementation enables efficient tensor scatter operations on CUDA with support for both linear and circular update modes. The operator is designed for cache update scenarios in LLM inference, allowing updates to be scattered into a target tensor along a specified axis.

Changes:

  • Registered TensorScatter operator (opset 24) in the CUDA execution provider
  • Implemented CUDA kernel with efficient element-wise parallelization and template-based circular/linear mode dispatch
  • Added host-side operator logic with input validation, shape checking, and tensor copying before scatter updates

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • onnxruntime/core/providers/cuda/cuda_execution_provider.cc: Added TensorScatter operator registration for opset 24 in the CUDA EP
  • onnxruntime/core/providers/cuda/llm/tensorscatter.h: Header defining the TensorScatter class interface with axis and circular mode attributes
  • onnxruntime/core/providers/cuda/llm/tensorscatter.cc: Host-side implementation with validation, memcpy, and kernel dispatch logic
  • onnxruntime/core/providers/cuda/llm/tensorscatter_impl.h: CUDA kernel interface declaration for the TensorScatterImpl function
  • onnxruntime/core/providers/cuda/llm/tensorscatter_impl.cu: CUDA kernel implementation with element-size-based dispatch and circular/linear mode templates


Comment threads on onnxruntime/core/providers/cuda/llm/tensorscatter_impl.cu (four outdated, one current)
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.



Comment thread onnxruntime/test/providers/cpu/llm/tensorscatter_op_test.cc
Comment thread onnxruntime/core/providers/cuda/llm/tensorscatter.cc
Comment thread onnxruntime/core/providers/cuda/llm/tensorscatter.cc
@titaiwangms titaiwangms added the ep:CUDA issues related to the CUDA execution provider label Feb 25, 2026
@titaiwangms titaiwangms merged commit 39bdea4 into main Feb 26, 2026
98 of 99 checks passed
@titaiwangms titaiwangms deleted the titaiwang/support_tensor_scatter_cuda branch February 26, 2026 17:25
CUDA_RETURN_IF_ERROR(cudaMemcpyAsync(host_write_indices.data(), write_indices,
                                     static_cast<size_t>(batch_size) * sizeof(int64_t),
                                     cudaMemcpyDeviceToHost, Stream(context)));
CUDA_RETURN_IF_ERROR(cudaStreamSynchronize(Stream(context)));
Member commented:

Such synchronization operations will cause issues during CUDA graph capture. Usually, in GPU ops, we don't bring data to the host to perform validation if the input is already in CUDA memory. It is probably better to use CUDA_KERNEL_ASSERT() within the kernel to asynchronously report invalid data.
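For illustration, in-kernel validation along the lines the reviewer suggests might look like the sketch below. The kernel name, parameter names, and grid layout are hypothetical, and the scatter body is elided; CUDA_KERNEL_ASSERT is the macro the reviewer names, wrapping a device-side assert.

```cuda
// Hypothetical sketch: validate write_indices on the device instead of
// copying them to the host and synchronizing the stream (which breaks
// CUDA graph capture and stalls the pipeline).
__global__ void TensorScatterKernelSketch(const int64_t* write_indices,
                                          int64_t max_sequence_length) {
  const int64_t w = write_indices[blockIdx.y];  // illustrative indexing
  // Fails asynchronously with a device-side assert if the index is
  // invalid, with no host round-trip before the launch.
  CUDA_KERNEL_ASSERT(w >= 0 && w < max_sequence_length);
  // ... scatter logic ...
}
```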

titaiwangms (author) replied:

I see. I will submit a follow-up. How did you find out about this?

Member replied:

Just happened to look at the PR post merge.

Member commented:

Also, regarding:

"The sync is the real cost, but it happens before the scatter kernel launch, not in the middle of a kernel pipeline"

While this is true, a stream sync here means it must block until all asynchronous work queued on this stream completes.


Labels

ep:CUDA issues related to the CUDA execution provider


4 participants