Conversation

@qjia7
Contributor

@qjia7 qjia7 commented Jan 14, 2026

This pull request introduces enhanced support for copying tensor data with fine-grained control over source/destination offsets and copy sizes, both in the ONNX Runtime C API and the WebGPU provider. The changes add new API methods and extend existing interfaces to allow partial or offset-based tensor copies, with corresponding updates to the CPU and WebGPU data transfer implementations and buffer management logic.

Contributor

Copilot AI left a comment


Pull request overview

This pull request extends the ONNX Runtime C API and internal data transfer infrastructure to support fine-grained tensor copying with source/destination offsets and custom copy sizes. The changes add a new CopyTensorsEx API function, extend the OrtDataTransferImpl interface to accept offset parameters, and update CPU and WebGPU data transfer implementations to handle offset-based copying.

Changes:

  • Added new CopyTensorsEx C API function with offset and size parameters for partial tensor copies
  • Extended OrtDataTransferImpl::CopyTensors to accept source_offsets, destination_offsets, and sizes arrays
  • Updated WebGPU BufferManager and DataTransfer to support offset-based memory operations
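The offset/size contract these bullets describe can be sketched with a plain CPU analogue. The function name and byte-vector stand-ins below are illustrative only, not the actual ORT API; only the offset arithmetic mirrors what the PR proposes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of the offset-based copy contract from the PR description, applied to
// plain host memory. A size of 0 means "copy the whole source", matching the
// "May be nullptr to copy entire tensors" wording in the proposed docs.
bool CopyWithOffsets(const std::vector<uint8_t>& src, std::vector<uint8_t>& dst,
                     size_t src_offset, size_t dst_offset, size_t size) {
  size_t bytes = size > 0 ? size : src.size();
  // Bounds checks: both the source read and the destination write must fit.
  if (src_offset + bytes > src.size() || dst_offset + bytes > dst.size()) {
    return false;
  }
  std::memcpy(dst.data() + dst_offset, src.data() + src_offset, bytes);
  return true;
}
```

For example, copying 2 bytes from source offset 1 into destination offset 3 touches only dst[3..4] and leaves the rest of the destination untouched.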

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

Summary per file:

  • include/onnxruntime/core/session/onnxruntime_c_api.h: Adds CopyTensorsEx API declaration with offset/size parameters
  • include/onnxruntime/core/session/onnxruntime_ep_c_api.h: Updates OrtDataTransferImpl::CopyTensors signature with offset parameters
  • onnxruntime/core/session/ort_apis.h: Declares CopyTensorsEx implementation function
  • onnxruntime/core/session/onnxruntime_c_api.cc: Implements CopyTensors and CopyTensorsEx using shared helper function
  • onnxruntime/core/framework/data_transfer.h: Adds offset/size fields to SrcDstPair and new CopyTensor overload
  • onnxruntime/core/framework/data_transfer.cc: Implements offset-aware CopyTensor for CPU and base interface
  • onnxruntime/core/framework/plugin_data_transfer.cc: Updates to call CopyTensors with offset parameters
  • onnxruntime/core/providers/webgpu/data_transfer.h: Declares offset-aware CopyTensor overload
  • onnxruntime/core/providers/webgpu/data_transfer.cc: Implements offset-based tensor copying for WebGPU
  • onnxruntime/core/providers/webgpu/buffer_manager.h: Updates Upload/MemCpy/Download signatures with offset parameters
  • onnxruntime/core/providers/webgpu/buffer_manager.cc: Implements offset support in buffer operations
  • onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc: Updates WebGpuDataTransferImpl to extract and pass offset parameters
  • onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.h: Updates example EP data transfer signature
  • onnxruntime/test/autoep/library/example_plugin_ep/ep_data_transfer.cc: Implements offset handling in example EP
  • onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.h: Updates example EP kernel registry data transfer signature
  • onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_data_transfer.cc: Implements offset handling in example EP kernel registry
  • onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/kernels/utils.h: Updates CopyTensor call with null offset parameters
  • onnxruntime/test/shared_lib/test_data_copy.cc: Adds comment about backward compatibility test
Comments suppressed due to low confidence (2)

onnxruntime/core/providers/webgpu/data_transfer.cc:46

  • Missing bounds validation for offset and size parameters. The function should validate that src_offset + bytes does not exceed src.SizeInBytes() and dst_offset + bytes does not exceed dst.SizeInBytes() before performing the copy operations. This is especially important for CPU to GPU and GPU to CPU transfers where buffer overflow could occur.
common::Status DataTransfer::CopyTensor(const Tensor& src, Tensor& dst, size_t src_offset, size_t dst_offset, size_t size) const {
  size_t bytes = size > 0 ? size : src.SizeInBytes();
  if (bytes > 0) {
    void const* src_data = src.DataRaw();
    void* dst_data = dst.MutableDataRaw();

    auto& src_device = src.Location().device;
    auto& dst_device = dst.Location().device;

    if (dst_device.Type() == OrtDevice::GPU) {
      if (src_device.Type() == OrtDevice::GPU) {
        // copy from GPU to GPU
        buffer_manager_.MemCpy(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      } else {
        // copy from CPU to GPU
        buffer_manager_.Upload(const_cast<void*>(src_data),
                               static_cast<WGPUBuffer>(dst_data), bytes, src_offset, dst_offset);
      }
    } else /* if (src_device.Type() == OrtDevice::GPU) */ {
      // copy from GPU to CPU
      buffer_manager_.Download(static_cast<WGPUBuffer>(const_cast<void*>(src_data)),
                               dst_data, bytes, src_offset, dst_offset);
    }
  }

  return Status::OK();
}
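The validation this review comment asks for could look like the following standalone sketch. The function name is illustrative, not an actual ORT helper; it only mirrors the `size > 0 ? size : src.SizeInBytes()` defaulting seen in the excerpt above:

```cpp
#include <cassert>
#include <cstddef>

// Returns true when a copy of `size` bytes at the given offsets stays inside
// both buffers. A size of 0 is resolved to the full source size first.
bool CopyBoundsOk(size_t src_size_in_bytes, size_t dst_size_in_bytes,
                  size_t src_offset, size_t dst_offset, size_t size) {
  size_t bytes = size > 0 ? size : src_size_in_bytes;
  // Compare via subtraction so the checks cannot overflow size_t.
  if (src_offset > src_size_in_bytes || dst_offset > dst_size_in_bytes) return false;
  return bytes <= src_size_in_bytes - src_offset &&
         bytes <= dst_size_in_bytes - dst_offset;
}
```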
include/onnxruntime/core/session/onnxruntime_ep_c_api.h:142

  • The documentation for CopyTensors should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether implementations are expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for EP implementers to understand their responsibilities.
  /** \brief Copy tensors from src_tensors to dst_tensors using the provided streams.
   *
   * The implementation can use the provided streams to perform asynchronous copies if supported.
   * If a stream is not available, the copy is performed synchronously.
   *
   * \param[in] this_ptr Pointer to the OrtDataTransferImpl instance.
   * \param[in] src_tensors Array of source OrtValue pointers to copy from.
   * \param[in] dst_tensors Array of destination OrtValue pointers to copy to.
   * \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
   * \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
   * \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
   * \param[in] streams Array of OrtSyncStream pointers for the copy operations, if the execution provider is stream
   *                    aware. nullptr if it is not.
   * \param[in] num_tensors Number of tensors to copy.
   *
   * \snippet{doc} snippets.dox OrtStatus Return Value
   *
   * \since Version 1.23.
   */
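An EP implementing this signature has to handle the "May be nullptr" cases for all three optional arrays. A minimal sketch of that defaulting logic over plain byte buffers follows; the struct and loop are illustrative stand-ins, not the actual plugin EP code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative stand-in for one tensor's backing storage.
struct FakeTensor {
  std::vector<uint8_t> data;
};

// Mirrors the documented contract: a null offsets array means all zeros, and a
// null sizes array means "copy each tensor in full".
void CopyTensorsWithDefaults(const std::vector<FakeTensor>& src,
                             std::vector<FakeTensor>& dst,
                             const size_t* source_offsets,
                             const size_t* destination_offsets,
                             const size_t* sizes) {
  for (size_t i = 0; i < src.size(); ++i) {
    size_t src_off = source_offsets ? source_offsets[i] : 0;
    size_t dst_off = destination_offsets ? destination_offsets[i] : 0;
    size_t bytes = sizes ? sizes[i] : src[i].data.size();
    std::memcpy(dst[i].data.data() + dst_off, src[i].data.data() + src_off, bytes);
  }
}
```

A real implementation would additionally bounds-check each triple before copying, per the validation concern raised above.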


Comment on lines +481 to 496
 void BufferManager::MemCpy(WGPUBuffer src, WGPUBuffer dst, size_t size, size_t src_offset, size_t dst_offset) const {
   ORT_ENFORCE(src != dst, "Source and destination buffers must be different.");
   EnforceBufferUnmapped(context_, src);
   EnforceBufferUnmapped(context_, dst);

   auto buffer_size = NormalizeBufferSize(size);
   auto src_size = static_cast<size_t>(wgpuBufferGetSize(src));
   auto dst_size = static_cast<size_t>(wgpuBufferGetSize(dst));
-  ORT_ENFORCE(buffer_size <= src_size && buffer_size <= dst_size,
+  ORT_ENFORCE(src_offset + buffer_size <= src_size && dst_offset + buffer_size <= dst_size,
               "Source and destination buffers must have enough space for the copy operation. src_size=",
-              src_size, ", dst_size=", dst_size, ", copy_size=", buffer_size, ".");
+              src_size, ", dst_size=", dst_size, ", src_offset=", src_offset, ", dst_offset=", dst_offset, ", copy_size=", buffer_size, ".");

   auto& command_encoder = context_.GetCommandEncoder();
   context_.EndComputePass();
-  command_encoder.CopyBufferToBuffer(src, 0, dst, 0, buffer_size);
+  command_encoder.CopyBufferToBuffer(src, src_offset, dst, dst_offset, buffer_size);
 }

Copilot AI Jan 14, 2026


Potential issue with buffer size normalization when using offsets. The NormalizeBufferSize function rounds up 'size' to be aligned to 16 bytes, but when dst_offset is applied, the total required buffer space is actually dst_offset + buffer_size. The current validation at line 489 checks dst_offset + buffer_size against dst_size, which is correct. However, if the destination buffer was created with a size that was normalized independently, there could be cases where the aligned buffer_size causes the operation to exceed the actual buffer bounds when combined with the offset.
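The interaction described here is easiest to see with concrete numbers. Assuming NormalizeBufferSize rounds up to a 16-byte multiple (the alignment this comment refers to; the helper below is a stand-in, not the actual implementation):

```cpp
#include <cassert>
#include <cstddef>

// Assumed behaviour of NormalizeBufferSize: round up to a 16-byte multiple.
size_t NormalizeBufferSize16(size_t size) {
  return (size + 15) / 16 * 16;
}

// The concern: validating `offset + normalized_size <= buffer_size` can fail
// even when `offset + size` itself would fit, because rounding `size` up can
// push past the end of a buffer that was sized for the unpadded copy.
bool OffsetCopyFits(size_t buffer_size, size_t offset, size_t size) {
  return offset + NormalizeBufferSize16(size) <= buffer_size;
}
```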

Comment on lines 140 to 143
if (!device_tensors.empty()) {
// Test original CopyTensors (backward compatible)
ASSERT_CXX_ORTSTATUS_OK(ort_env->CopyTensors(cpu_tensors, device_tensors, stream));


Copilot AI Jan 14, 2026


The new CopyTensorsEx API function is not covered by any tests. The PR adds this significant new functionality but there are no test cases that exercise copying tensors with offsets or custom sizes to verify the implementation works correctly.
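A test along the lines this comment asks for would pick distinct offsets and sizes and then assert that only the targeted byte range changed. A CPU-only analogue of that assertion pattern (the real test would go through CopyTensorsEx, which is not modelled here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fill dst with a sentinel, copy a sub-range, then verify that bytes outside
// the target range were left untouched -- the key property an offset-copy
// test should pin down.
std::vector<uint8_t> PartialCopy(const std::vector<uint8_t>& src,
                                 size_t src_offset, size_t dst_offset,
                                 size_t size, size_t dst_len) {
  std::vector<uint8_t> dst(dst_len, 0xAA);  // sentinel
  std::memcpy(dst.data() + dst_offset, src.data() + src_offset, size);
  return dst;
}
```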

Comment on lines +6788 to +6805
/** \brief Copy OrtValue instances containing Tensors between devices with offset and size control.
*
* Extended version of CopyTensors that supports copying with source/destination offsets and custom sizes.
* All offsets and sizes are in bytes.
*
* \param[in] env The OrtEnv instance to use.
* \param[in] src_tensors Array of OrtValue instances containing the source tensors to copy.
* \param[in] dst_tensors Array of OrtValue instances to copy the source tensors to.
* \param[in] source_offsets Optional array of source offsets in bytes. May be nullptr for all zeros.
* \param[in] destination_offsets Optional array of destination offsets in bytes. May be nullptr for all zeros.
* \param[in] sizes Optional array of sizes in bytes to copy. May be nullptr to copy entire tensors.
* \param[in] stream Optional OrtSyncStream that can be used to perform the copy asynchronously. May be nullptr.
* \param[in] num_tensors The number of tensors to copy.
*
* \snippet{doc} snippets.dox OrtStatus Return Value
*
* \since Version 1.24
*/

Copilot AI Jan 14, 2026


The documentation for CopyTensorsEx should clarify the expected behavior when offsets and sizes would cause out-of-bounds access. It should specify whether the implementation is expected to validate bounds and return an error, or if the caller is responsible for ensuring valid parameters. This is important for API consumers to understand their responsibilities and avoid undefined behavior.


@fs-eire
Contributor

fs-eire commented Jan 14, 2026

@tianleiwu @skottmckay

Contributor

@fs-eire fs-eire left a comment


Adding CopyTensorsEx should probably be fine, but modifying the signature of the existing OrtDataTransferImpl::CopyTensors will probably cause backward compatibility issues.

@skottmckay
Contributor

This pull request introduces enhanced support for copying tensor data with fine-grained control over source/destination offsets and copy sizes, both in the ONNX Runtime C API and the WebGPU provider. The changes add new API methods and extend existing interfaces to allow partial or offset-based tensor copies, with corresponding updates to the CPU and WebGPU data transfer implementations and buffer management logic.

Can you please explain 'why' this is required? Changing OrtDataTransferImpl::CopyTensors will impact every non-CPU EP.

A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?

@tianleiwu
Contributor

A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?

I agree. From an API point of view, we only need an API to copy a specified number of bytes from a source location to a target location.

A user can add a helper function for sub-tensor copy; it does not need to be exposed as an API. The helper function can be shared by EPs if it is used internally.

@fs-eire
Contributor

fs-eire commented Jan 15, 2026

A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?

For WebGPU, a buffer is just a handle. Unlike CUDA, whose memory model allows adding an offset to a pointer, there is no way to represent that in WebGPU.

@skottmckay
Contributor

A user can create an OrtValue using existing data, so could they not take that approach to create an OrtValue for the subset of data to copy and use the existing interfaces as-is?

For webgpu, a buffer is just a handle. Unlike CUDA which uses memory model that allows to add offset to pointers, there is no way to represent that in webgpu.

It's up to the EP (and its data transfer implementation) to interpret what the void* for data in the tensor represents. e.g. in the DML EP it is a struct with metadata and not just a raw pointer to memory. The WebGPU EP could do something similar and have a struct with the handle and offset info. The user would need to know how to create that struct but could use CreateTensorWithDataAsOrtValue and pass it in as the void* p_data.

Adding a whole new API to ORT for this sort of thing just for the WebGPU EP feels like the wrong place to do it. The ORT API in general deals in a granularity of tensors, not small chunks of data within a tensor. What's the use-case where you need to copy a subset of data from an OrtValue?

@qjia7
Contributor Author

qjia7 commented Jan 15, 2026

Adding a whole new API to ORT this sort of thing just for the WebGPU EP feels like the wrong place to be doing it. The ORT API in general deals in a granularity of tensors not small chunks of data within a tensor. What's the use-case where you need to copy a subset of data from an OrtValue?

Two use cases in onnxruntime-genai:

  1. Support CopyFrom(size_t begin_dest, DeviceBuffer& source, size_t begin_source, size_t size_in_bytes) on GPU. I commented on this previously at https://github.com/microsoft/onnxruntime-genai/pull/1848/files#r2513245293.
  2. Support UpdateAttentionMask on GPU. When graph capture is enabled, the attention mask lives on the GPU with shape [batch_size, max_sequence_length]; we can use CopyTensors with an offset to update the attention mask content at a specific position. For example, just copy a 1 into attention_mask[total_sequence_length-1].
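For use case 2, the byte offset of that single-element update falls straight out of the mask's shape. A sketch of the arithmetic, with shapes and names taken from the comment above (the genai-side call itself is not modelled):

```cpp
#include <cassert>
#include <cstddef>

// For a mask of shape [batch_size, max_sequence_length], the element
// mask[batch][total_sequence_length - 1] lives at this byte offset. With an
// offset-aware copy, updating it is a one-element copy at this destination
// offset instead of re-uploading the whole mask.
size_t MaskElementByteOffset(size_t batch, size_t max_sequence_length,
                             size_t total_sequence_length, size_t element_size) {
  return (batch * max_sequence_length + (total_sequence_length - 1)) * element_size;
}
```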

@skottmckay
Contributor

skottmckay commented Jan 16, 2026

What's the motivation for other EP authors to handle offset-based copy in their IDataTransfer implementation? It would be a bit of a smell, suggesting this doesn't belong in the ORT API, if only the CPU and WebGPU EPs support it.

FWIW it feels a little loose to be taking arbitrary offsets and sizes given the source and destination are Tensor instances with specific shapes. Could we use something like an axis and index for the source and target locations to ensure the copy makes sense?
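To make the axis/index suggestion concrete: for a contiguous row-major tensor, an (axis, index) pair pins down a well-formed slice whose byte offset and length fall out of the shape, instead of accepting arbitrary offsets. The sketch below is purely illustrative, not an ORT API proposal:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slicing one index along `axis` of a contiguous row-major tensor selects
// blocks of `inner` trailing elements. Only for axis 0 is the slice a single
// contiguous range, which is what a plain offset-based copy can express.
struct SliceInfo {
  size_t byte_offset;  // start of the first (or only) contiguous block
  size_t byte_length;  // length of one contiguous block
  bool contiguous;     // true iff the whole slice is one block (axis == 0)
};

SliceInfo SliceAt(const std::vector<size_t>& shape, size_t axis, size_t index,
                  size_t element_size) {
  size_t inner = 1;
  for (size_t d = axis + 1; d < shape.size(); ++d) inner *= shape[d];
  return SliceInfo{index * inner * element_size, inner * element_size, axis == 0};
}
```

The `contiguous` flag also shows why arbitrary offset/size pairs are looser than axis/index: a slice along an inner axis is not a single byte range at all.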

If this is purely to enable some copies in genai, is another option to do that via a model, by either augmenting the original model with the model editor API or having a small helper model that is used? e.g. something like ScatterElements or ScatterND might be applicable. If the model input and output for the copy were the same buffer it should do the writes in place.

Is another alternative having a CreateTensor variant that takes an offset to support the void* not necessarily being an addressable pointer to memory? Or a CreateSubTensor that takes an OrtValue with Tensor, axis and index and returns a Tensor for that slice (ownership would obviously stay with the original OrtValue, might need const and non-const versions).

@qjia7
Contributor Author

qjia7 commented Jan 19, 2026

@skottmckay

What's the motivation for other EP authors to handle offset based copy in their IDataTransfer implementation? it would be a bit of a smell that it doesn't belong in the ORT API if only the CPU and WebGPU EPs support it.

You're absolutely right. From an implementation perspective, CPU/CUDA already support creating tensors with offsets implicitly (via pointer arithmetic), while handle-based EPs like WebGPU cannot. However, from a user's perspective, we should provide a unified API that works consistently across all EPs. Your suggestions of either a CreateTensor variant that takes an offset or a CreateSubTensor that takes an OrtValue would achieve this nicely.

FWIW it feels a little loose to be taking arbitrary offsets and size given the source and destination are Tensor instances with specific shapes. could we use something like an axis and index for the source and the target locations to ensure the copy makes sense?

I agree that using axis+index would provide better type safety. If we align on the overall direction, I'm happy to work out the specific API design with you, whether that's axis/index based or offset based with proper validation.

If this is purely to enable some copies in genai is another option to do that via a model by either augmenting the original model with the model editor API, or having a small helper model that is used? e.g. something like ScatterElements or ScatterND might be applicable. if the model input and output for the copy was the same buffer it should in-place the writes.

I've prototyped the model-based approach (in fact, I'm using it for Cast operations in https://github.com/microsoft/onnxruntime-genai/pull/1895). However, for frequent small updates like CopyFrom and UpdateAttentionMask, my benchmarks show that direct copying significantly outperforms running a dedicated ONNX model, even a single-op one.

Is another alternative having a CreateTensor variant that takes an offset to support the void* not necessarily being an addressable pointer to memory? Or a CreateSubTensor that takes an OrtValue with Tensor, axis and index and returns a Tensor for that slice (ownership would obviously stay with the original OrtValue, might need const and non-const versions).

Both approaches would work well for the WebGPU use case. From the EP's perspective, they achieve the same goal: enabling partial tensor updates for both pointer-based EPs (CPU, CUDA) and handle-based EPs (WebGPU, and potentially Vulkan/Metal).

To confirm the path forward: Do I understand correctly that you're supportive of adding new API functionality to enable partial tensor updates across all EPs, and we need to finalize whether to use:

  1. CreateTensor with offset + existing CopyTensors

  2. CreateSubTensor (with axis/index) + existing CopyTensors

  3. Enhanced CopyTensors with axis/index parameters

Happy to prototype whichever approach you think fits best with ORT's API design principles.

@skottmckay
Contributor

To confirm the path forward: Do I understand correctly that you're supportive of adding new API functionality to enable partial tensor updates across all EPs, and we need to finalize whether to use:

  1. CreateTensor with offset + existing CopyTensors
  2. CreateSubTensor (with axis/index) + existing CopyTensors
  3. Enhanced CopyTensors with axis/index parameters

One of the challenges with creating a new Tensor instance pointing to a subset of an existing tensor is the path to EP-specific logic to interpret the handle. Typically that sort of logic is in the IAllocator for the OrtDevice, but as we're not allocating in this scenario that doesn't quite fit. Doesn't quite fit to add to IDataTransfer either given we're not doing a data transfer at that point.

One option might be the new external resource importer where you can import memory and create an OrtValue from it. Added recently in #26828. That already supports a void* + offset and provides an EP specific implementation. Whilst we're technically not importing 'external' memory (e.g. D3D12 or Vulkan), it does seem to have the features required. It's also optional for an EP to implement so zero cost for an EP implementer if not needed. In terms of API changes it would possibly only require adding a value to OrtExternalMemoryHandleType. You could actually import the entire Tensor once, and call CreateTensorFromMemory for each slice as that supports an offset in OrtExternalTensorDescriptor.

Maybe a choice between that and #3. Would like to know what others on the ORT team think about the options. @adrianlizarraga @edgchen1 @yuslepukhin

@qjia7
Contributor Author

qjia7 commented Jan 22, 2026

Ping @adrianlizarraga @edgchen1 @yuslepukhin. Any suggestions on Scott's comment above?
