Declare Shape, Reshape, Transpose, Squeeze, Unsqueeze for opsets 21, 23 on CUDA#26075

Merged
xadupre merged 8 commits into main from xadupre/missingcudaop on Sep 25, 2025

@xadupre
Member

@xadupre xadupre commented Sep 18, 2025

Description

Fixes #26065.

@xadupre xadupre changed the title from "Declare Shape, Reshape, Transpose for opsets 21, 23 on CUDA" to "Declare Shape, Reshape, Transpose, Squeeze, Unsqueeze for opsets 21, 23 on CUDA" Sep 18, 2025
@yuslepukhin yuslepukhin requested a review from Copilot September 19, 2025 01:51
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds support for Shape, Reshape, Transpose, Squeeze, and Unsqueeze operations for ONNX opsets 21 and 23 on the CUDA execution provider. This addresses issue #26065 by declaring these tensor operations for the newer opset versions.

Key changes include:

  • Adding versioned kernel declarations for opsets 21-22 and new opset 23 kernel declarations
  • Updating existing opset 13 and 19 kernels to be versioned up to opset 20
  • Adding comprehensive test coverage for the new opset versions

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `onnxruntime/core/providers/cuda/tensor/*.cc` | Add CUDA kernel declarations for Shape, Reshape, Transpose, Squeeze, Unsqueeze ops for opsets 21-23 |
| `onnxruntime/core/providers/cuda/cuda_execution_provider.cc` | Update kernel class declarations and registrations to support versioned kernels and new opsets |
| `onnxruntime/test/providers/cpu/tensor/*.cc` | Add test cases for opsets 21 and 23 |
| `onnxruntime/test/optimizer/transpose_optimizer_test.cc` | Update transpose optimizer tests to include opsets 21 and 23 |

@yuslepukhin
Member

Does it fix it?

@xadupre
Member Author

xadupre commented Sep 22, 2025

It works fine on the model I used to test. I did not check the whole list of missing kernel declarations. We could also change the logic behind the kernel selection: an operator is often upgraded because it supports more types, but onnxruntime does not always implement the kernel for the new types.

@xadupre xadupre marked this pull request as ready for review September 23, 2025 11:29
@xadupre xadupre merged commit 6650e07 into main Sep 25, 2025
97 of 98 checks passed
@xadupre xadupre deleted the xadupre/missingcudaop branch September 25, 2025 15:39
fs-eire pushed a commit that referenced this pull request Oct 24, 2025
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025
Rishi-Dave added a commit to Rishi-Dave/onnxruntime that referenced this pull request Mar 18, 2026
…Loop, Scan, ConstantOfShape, Size (microsoft#27102)

When ONNX introduces a new version of an operator in opset 21, the kernel
registry's VerifyVersion rejects non-versioned (open-ended) CUDA kernels
because kernel_start_version != since_ver while kernel_end_version == INT_MAX.
This causes those operators to fall back from CUDA to CPU, introducing
unnecessary host↔device copies that can lead to value corruption on Windows.

PR microsoft#26075 previously fixed this for Shape, Reshape, Transpose, Squeeze, and
Unsqueeze. This commit extends the same fix to the remaining affected
operators: Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.

For each operator:
- Cap existing non-versioned kernel to opset 20 (VERSIONED)
- Add VERSIONED(21, 22) kernel with identical type constraints
- Add non-versioned opset 23 kernel for forward compatibility
Rishi-Dave added a commit to Rishi-Dave/onnxruntime that referenced this pull request Mar 20, 2026
…Loop, Scan, ConstantOfShape, Size (microsoft#27102)

tianleiwu pushed a commit that referenced this pull request Mar 21, 2026
…Loop, Scan, ConstantOfShape, Size (#27728)

## Summary
- Extend CUDA EP opset 21/23 kernel registrations to 7 additional
operators that were updated in ONNX opset 21 but lacked proper CUDA
kernel version declarations
- Operators fixed: **Flatten**, **Identity**, **If**, **Loop**,
**Scan**, **ConstantOfShape**, **Size**
- Follows the identical pattern established in PR #26075 for Shape,
Reshape, Transpose, Squeeze, Unsqueeze

## Motivation
Fixes #27102.

When ONNX introduces a new operator version in opset 21, ORT's
`VerifyVersion` function in `kernel_registry.cc` rejects non-versioned
(open-ended) CUDA kernels. The check at
[kernel_registry.cc:L126-L133](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/kernel_registry.cc#L126)
requires either an exact version match or a bounded version range — a
kernel registered as `since_version=N, end_version=INT_MAX` fails when
`since_ver` (from the opset 21 schema) differs from `N`.
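
The rule described above can be sketched in a few lines of C++. This is a simplified illustration based only on the description in this message, not a copy of ORT's code: `KernelVersionRange` and `VersionMatches` are hypothetical names, and the real `VerifyVersion` operates on ORT's own kernel definition type.

```cpp
#include <climits>

// Hypothetical stand-in for the version range a kernel is registered with.
struct KernelVersionRange {
  int start;  // first opset version the kernel claims to support
  int end;    // last supported version; INT_MAX marks an open-ended kernel
};

// Sketch of the acceptance rule: since_ver is the version of the operator
// schema that the model's opset resolves to.
bool VersionMatches(int since_ver, const KernelVersionRange& kernel) {
  // Case 1: the kernel starts exactly at the schema version.
  if (kernel.start == since_ver) {
    return true;
  }
  // Case 2: the kernel starts earlier, but only a bounded range that
  // still covers since_ver is accepted. Open-ended kernels are rejected.
  return kernel.start < since_ver && kernel.end != INT_MAX &&
         kernel.end >= since_ver;
}
```

Under this sketch, `VersionMatches(21, {13, INT_MAX})` is false, which models the rejected open-ended kernel, while `VersionMatches(21, {21, 22})` and `VersionMatches(23, {23, INT_MAX})` are both true.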

This causes the affected operators to fall back from CUDA to CPU,
introducing unnecessary host↔device memory copies. On Windows with CUDA
EP, this fallback path can produce corrupted shape computation values
(e.g., `124647109376` instead of `6`), leading to downstream Reshape
failures.

PR #26075 fixed this for Shape, Reshape, Transpose, Squeeze, and
Unsqueeze. This PR extends the same fix to the 7 remaining operators
that were updated in ONNX opset 21 and had non-versioned CUDA kernels.

## Changes
For each of the 7 operators:
1. **Cap existing non-versioned kernel** to opset 20
(`ONNX_OPERATOR_KERNEL` → `ONNX_OPERATOR_VERSIONED_KERNEL`)
2. **Add VERSIONED(21, 22) kernel** with identical type constraints
3. **Add non-versioned opset 23 kernel** for forward compatibility
(opset 23 introduced another schema update for these operators)
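
To see how the three registrations fit together, here is a small self-contained simulation. It assumes the acceptance rule described under Motivation (an exact version match, or a bounded range that covers the schema version). `Registration`, `Accepts`, `FindKernel`, the two tables, and the starting version 13 are hypothetical and chosen only for illustration; the actual change uses the ORT registration macros named above.

```cpp
#include <climits>
#include <string>
#include <vector>

// Hypothetical model of one kernel registration for a single operator.
struct Registration {
  std::string label;
  int start;  // first supported opset version
  int end;    // INT_MAX marks a non-versioned (open-ended) kernel
};

// Assumed acceptance rule: exact match, or a bounded range covering since_ver.
bool Accepts(const Registration& r, int since_ver) {
  return r.start == since_ver ||
         (r.start < since_ver && r.end != INT_MAX && r.end >= since_ver);
}

// Returns the first accepting registration, or nullptr, which here stands
// in for "no CUDA kernel found, fall back to CPU".
const Registration* FindKernel(const std::vector<Registration>& regs,
                               int since_ver) {
  for (const Registration& r : regs) {
    if (Accepts(r, since_ver)) {
      return &r;
    }
  }
  return nullptr;
}

// Before the fix: a single open-ended kernel (starting version is assumed).
const std::vector<Registration> kBefore = {
    {"open-ended from 13", 13, INT_MAX},
};

// After the fix: capped at 20, a bounded 21-22 range, open-ended from 23.
const std::vector<Registration> kAfter = {
    {"versioned 13-20", 13, 20},
    {"versioned 21-22", 21, 22},
    {"open-ended from 23", 23, INT_MAX},
};
```

With these tables, `FindKernel(kBefore, 21)` returns `nullptr`, while `FindKernel(kAfter, 21)` and `FindKernel(kAfter, 23)` each find a registration. Capping the old kernel at 20 keeps older schema versions covered without overlapping the new ranges.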

Files modified:
- `onnxruntime/core/providers/cuda/cuda_execution_provider.cc` — forward
declarations + `BuildKernelCreateInfo` registration
- `onnxruntime/core/providers/cuda/tensor/flatten.cc`
- `onnxruntime/core/providers/cuda/tensor/identity_op.cc`
- `onnxruntime/core/providers/cuda/tensor/size.cc`
- `onnxruntime/core/providers/cuda/generator/constant_of_shape.cc`
- `onnxruntime/core/providers/cuda/controlflow/if.cc`
- `onnxruntime/core/providers/cuda/controlflow/loop.cc`
- `onnxruntime/core/providers/cuda/controlflow/scan.cc`

## Test Plan
- [ ] Verify CUDA EP build compiles successfully (CI)
- [ ] Existing opset 21 tests for Shape/Reshape/Squeeze/Unsqueeze pass
(validates the pattern)
- [ ] Verify operators are no longer falling back to CPU when running
opset 21 models on CUDA
- [ ] No regression in existing CUDA EP tests

Development

Successfully merging this pull request may close these issues.

[Performance] Shape operator in a model defined with opset 21 falls back to CPU

4 participants