Add opset 21/23 CUDA kernel registrations for Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size by Rishi-Dave · Pull Request #27728 · microsoft/onnxruntime

Rishi-Dave · 2026-03-18T15:34:06Z

Summary

- Extend CUDA EP opset 21/23 kernel registrations to 7 additional operators that were updated in ONNX opset 21 but lacked proper CUDA kernel version declarations
- Operators fixed: Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size
- Follows the pattern established in PR #26075 ("Declare Shape, Reshape, Transpose, Squeeze, Unsqueeze for opsets 21, 23 on CUDA")

Motivation

When ONNX introduces a new operator version in opset 21, ORT's VerifyVersion function in kernel_registry.cc rejects non-versioned (open-ended) CUDA kernels. The check at kernel_registry.cc:L126-L133 requires either an exact version match or a bounded version range — a kernel registered as since_version=N, end_version=INT_MAX fails when since_ver (from the opset 21 schema) differs from N.
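The check described above can be sketched as a small Python model. This is not ORT's source: `verify_version` and `INT_MAX` are illustrative names, and the predicate is written from the description in this PR (an exact match, or a bounded range that covers the schema version).

```python
# Illustrative model of the version check described above; not ORT source.
INT_MAX = 2**31 - 1  # stand-in for an open-ended end_version


def verify_version(kernel_start: int, kernel_end: int, since_ver: int) -> bool:
    """Return True if a kernel registered for [kernel_start, kernel_end]
    is accepted for a schema whose since_version is since_ver."""
    exact_match = kernel_start == since_ver
    bounded_range_covers = (
        kernel_start < since_ver
        and kernel_end != INT_MAX
        and kernel_end >= since_ver
    )
    return exact_match or bounded_range_covers


# Open-ended kernel registered at opset 13, schema updated in opset 21: rejected.
print(verify_version(13, INT_MAX, 21))  # False
# Bounded 21-22 kernel against the opset 21 schema: accepted.
print(verify_version(21, 22, 21))  # True
```

Under this model an open-ended kernel is only accepted when its start version equals the schema's since_version, which is why each schema update needs its own registration.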

This causes the affected operators to fall back from CUDA to CPU, introducing unnecessary host↔device memory copies. On Windows with CUDA EP, this fallback path can produce corrupted shape computation values (e.g., 124647109376 instead of 6), leading to downstream Reshape failures.

PR #26075 fixed this for Shape, Reshape, Transpose, Squeeze, and Unsqueeze. This PR extends the same fix to the 7 remaining operators that were updated in ONNX opset 21 and had non-versioned CUDA kernels.

Changes

For each of the 7 operators:

- Cap existing non-versioned kernel to opset 20 (ONNX_OPERATOR_KERNEL → ONNX_OPERATOR_VERSIONED_KERNEL)
- Add VERSIONED(21, 22) kernel with identical type constraints
- Add non-versioned opset 23 kernel for forward compatibility (opset 23 introduced another schema update for these operators)
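As a rough illustration of these three steps, the sketch below models Flatten's registrations before and after the change and looks up which one accepts a given schema version. The start version 13 comes from the review summary in this PR; `accepts`, `find_kernel`, and the tuple layout are assumptions made for the example, not ORT APIs.

```python
INT_MAX = 2**31 - 1  # stand-in for an open-ended end_version


def accepts(start, end, since_ver):
    # Assumed rule: exact match, or a bounded range that covers since_ver.
    return start == since_ver or (start < since_ver <= end and end != INT_MAX)


def find_kernel(registrations, since_ver):
    """Return the first (start, end) registration that accepts since_ver, else None."""
    for start, end in registrations:
        if accepts(start, end, since_ver):
            return (start, end)
    return None  # no CUDA kernel matches, so the node would fall back to CPU


before = [(13, INT_MAX)]                     # single open-ended kernel
after = [(13, 20), (21, 22), (23, INT_MAX)]  # capped, versioned, open-ended

print(find_kernel(before, 21))  # None
print(find_kernel(after, 21))   # (21, 22)
print(find_kernel(after, 23))   # (23, 2147483647)
```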

Files modified:

- onnxruntime/core/providers/cuda/cuda_execution_provider.cc (forward declarations + BuildKernelCreateInfo registration)
- onnxruntime/core/providers/cuda/tensor/flatten.cc
- onnxruntime/core/providers/cuda/tensor/identity_op.cc
- onnxruntime/core/providers/cuda/tensor/size.cc
- onnxruntime/core/providers/cuda/generator/constant_of_shape.cc
- onnxruntime/core/providers/cuda/controlflow/if.cc
- onnxruntime/core/providers/cuda/controlflow/loop.cc
- onnxruntime/core/providers/cuda/controlflow/scan.cc

Test Plan

- Verify CUDA EP build compiles successfully (CI)
- Existing opset 21 tests for Shape/Reshape/Squeeze/Unsqueeze pass (validates the pattern)
- Verify operators are no longer falling back to CPU when running opset 21 models on CUDA
- No regression in existing CUDA EP tests

Copilot

Pull request overview

Extends CUDA Execution Provider kernel registrations to correctly declare bounded version ranges for several operators whose ONNX schema versions changed in opsets 21/23, preventing unintended CUDA→CPU fallback due to VerifyVersion rejecting open-ended kernels.

Changes:

- Add ONNX_OPERATOR_VERSIONED_KERNEL_EX registrations for opset ranges (e.g., 13–20 and 21–22) for Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.
- Move prior open-ended CUDA registrations to target opset 23 for these operators.
- Update CUDA EP forward declarations and BuildKernelCreateInfo registration list accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Updates kernel class forward-decls and registry entries for the newly versioned kernels. |
| onnxruntime/core/providers/cuda/tensor/flatten.cc | Adds version-bounded CUDA kernel registrations for Flatten for opsets 13–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/tensor/identity_op.cc | Adds version-bounded CUDA kernel registrations for Identity for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/tensor/size.cc | Adds version-bounded CUDA kernel registrations for Size for opsets 13–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/generator/constant_of_shape.cc | Adds version-bounded CUDA kernel registrations for ConstantOfShape for opsets 9–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/if.cc | Adds version-bounded CUDA kernel registrations for If for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/loop.cc | Adds version-bounded CUDA kernel registrations for Loop for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/scan.cc | Adds version-bounded CUDA kernel registrations for Scan for opsets 19–20 and 21–22; shifts open-ended to 23. |

Rishi-Dave · 2026-03-18T15:58:50Z

Good catch — updated in the latest push.

The opset-23 registrations were unbounded (end_version=INT_MAX), which meant VerifyVersion would only match since_version==23 and opset 24/25 models would still fall back to CPU.

Changes made:

- Changed all 7 opset-23 registrations from unbounded to versioned 23–24 (ONNX_OPERATOR_VERSIONED_KERNEL_EX)
- Added unbounded opset-25 registrations (ONNX_OPERATOR_KERNEL_EX) for all 7 operators
- Updated forward declarations and BuildKernelCreateInfo entries in cuda_execution_provider.cc accordingly

This now matches the CPU provider's version ranges (21–22, 23–24 bounded, 25 unbounded) for Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.
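To make the reasoning concrete, here is a small sketch comparing the earlier layout (open-ended at 23) with the updated one. It assumes, as the comment above implies, that schemas with since_version 24 or 25 may exist for these operators; the `accepts` rule and the `covered` helper are illustrative and are not ORT functions.

```python
INT_MAX = 2**31 - 1  # stand-in for an open-ended end_version


def accepts(start, end, since_ver):
    # Assumed rule: exact match, or a bounded range that covers since_ver.
    return start == since_ver or (start < since_ver <= end and end != INT_MAX)


def covered(registrations, since_versions):
    """Map each schema since_version to whether any registration accepts it."""
    return {v: any(accepts(s, e, v) for s, e in registrations) for v in since_versions}


first_push = [(21, 22), (23, INT_MAX)]             # opset 23 left open-ended
latest_push = [(21, 22), (23, 24), (25, INT_MAX)]  # 23-24 bounded, 25 open-ended

print(covered(first_push, [21, 23, 24, 25]))
# {21: True, 23: True, 24: False, 25: False}
print(covered(latest_push, [21, 23, 24, 25]))
# {21: True, 23: True, 24: True, 25: True}
```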

tianleiwu · 2026-03-18T20:17:56Z

/azup run Windows GPU Doc Gen CI Pipeline, Windows ARM64 QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI

Rishi-Dave · 2026-03-18T20:20:15Z

Rebased on latest upstream/main — no conflicts.

To summarize the current state of the PR:

- Opset version ranges now fully match the CPU provider pattern for all 7 operators (Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size):
  - Original range → end_version capped at opset 20
  - 21–22 bounded
  - 23–24 bounded
  - 25 unbounded (open-ended)
- The "Build Extended Minimal" CI failure on the previous run was a transient vcpkg infrastructure issue (asset cache unreachable → boost-mp11 download failed), not related to this change. The re-triggered CI from the rebase should clear it.
- All other CI checks (CUDA build, Linux build, Python format, etc.) passed cleanly.

Let me know if anything else needs adjustment — happy to iterate.

tianleiwu · 2026-03-18T22:53:30Z

/azup run Windows GPU Doc Gen CI Pipeline

Rishi-Dave · 2026-03-18T22:56:23Z

Rebased on latest upstream/main — no conflicts, no code changes needed.

All prior review feedback (unbounded opset-23 registrations) was addressed in the previous push. Current state for all 7 operators (Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size):

| Range | Registration Type |
| --- | --- |
| Original → 20 | `ONNX_OPERATOR_VERSIONED_KERNEL_EX` (bounded) |
| 21 → 22 | `ONNX_OPERATOR_VERSIONED_KERNEL_EX` (bounded) |
| 23 → 24 | `ONNX_OPERATOR_VERSIONED_KERNEL_EX` (bounded) |
| 25 → ∞ | `ONNX_OPERATOR_KERNEL_EX` (unbounded) |

This matches the CPU provider's version ranges. Forward declarations and BuildKernelCreateInfo entries in cuda_execution_provider.cc are aligned.

Ready for review when CI clears.

tianleiwu · 2026-03-18T23:28:41Z

/azp run Windows GPU Doc Gen CI Pipeline

tianleiwu · 2026-03-18T23:31:41Z

@Rishi-Dave, Please update docs/OperatorKernels.md.

You can wait until "Windows GPU Doc Gen CI Pipeline" job finishes, then download the file from artifacts of the job.

Rishi-Dave · 2026-03-18T23:33:52Z

Thanks for triggering the doc gen pipeline. I'll download the updated OperatorKernels.md from the pipeline artifacts once it completes and push the update.

azure-pipelines · 2026-03-19T02:37:33Z

Azure Pipelines successfully started running 1 pipeline(s).

…Loop, Scan, ConstantOfShape, Size (microsoft#27102)

When ONNX introduces a new version of an operator in opset 21, the kernel registry's VerifyVersion rejects non-versioned (open-ended) CUDA kernels because kernel_start_version != since_ver while kernel_end_version == INT_MAX. This causes those operators to fall back from CUDA to CPU, introducing unnecessary host↔device copies that can lead to value corruption on Windows.

PR microsoft#26075 previously fixed this for Shape, Reshape, Transpose, Squeeze, and Unsqueeze. This commit extends the same fix to the remaining affected operators: Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.

For each operator:
- Cap existing non-versioned kernel to opset 20 (VERSIONED)
- Add VERSIONED(21, 22) kernel with identical type constraints
- Add non-versioned opset 23 kernel for forward compatibility

Bound the opset-23 registrations to 23-24 and add unbounded opset-25 registrations for Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size, matching the CPU provider's version ranges. This prevents opset 24/25 models from falling back to CPU due to VerifyVersion rejecting open-ended opset-23 kernels.

Add version-bounded rows for ConstantOfShape, Flatten, Identity, If, Loop, Scan, and Size to reflect the new CUDA EP kernel registrations (21-22, 23-24 bounded, 25+ unbounded).

Rishi-Dave · 2026-03-20T20:23:29Z

Rebased on latest upstream/main and updated docs/OperatorKernels.md.

Changes in this push:

- Rebase conflict resolution: cuda_execution_provider.cc had conflicts with the Squeeze/Unsqueeze opset 24-25 registrations that landed upstream. Resolved by keeping upstream's Squeeze/Unsqueeze versioning (23→23, 24→24, 25→∞) and integrating our 7 operators alongside it.
- docs/OperatorKernels.md update: Updated the CUDA EP section for all 7 operators (ConstantOfShape, Flatten, Identity, If, Loop, Scan, Size) to reflect the new version ranges:

| Range | Registration Type |
| --- | --- |
| Original → 20 | `VERSIONED` (bounded) |
| 21 → 22 | `VERSIONED` (bounded) |
| 23 → 24 | `VERSIONED` (bounded) |
| 25 → ∞ | Unbounded |

This matches the CPU provider's version ranges and follows the same pattern as Shape, Reshape, Transpose, Squeeze, and Unsqueeze.

Note: OperatorKernels.md was updated manually since the doc gen script (gen_opkernel_doc.py) requires a compiled build. The version ranges and type constraints match the C++ kernel registrations exactly. The next doc gen pipeline run should produce an identical result — if there's any discrepancy, happy to pull the artifact and overwrite.
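For reference, the range labels that appear in this thread (`[23, 24]` for a bounded range, `25+` for an open-ended one) can be derived mechanically from the registrations. The helper below is a hypothetical sketch and is not taken from gen_opkernel_doc.py; how the real script renders other cases, such as a single-version range, is not shown here.

```python
INT_MAX = 2**31 - 1  # stand-in for an open-ended end_version


def range_label(start: int, end: int) -> str:
    """Render a registration as 'N+' (open-ended) or '[M, N]' (bounded)."""
    if end == INT_MAX:
        return f"{start}+"
    return f"[{start}, {end}]"


# Flatten's start version (13) is taken from the review summary in this PR.
flatten = [(13, 20), (21, 22), (23, 24), (25, INT_MAX)]
print([range_label(s, e) for s, e in flatten])
# ['[13, 20]', '[21, 22]', '[23, 24]', '25+']
```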

tianleiwu · 2026-03-20T22:01:53Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-03-20T22:02:11Z

Azure Pipelines successfully started running 4 pipeline(s).

### Description

Extends the CUDA Transpose kernel registration from opset 23 to opset 25.

- **`transpose.cc`**: Cap existing opset 23 kernel to versioned `(23, 24)`, add new non-versioned kernel at opset 25
- **`cuda_execution_provider.cc`**: Update forward declarations and `BuildKernelCreateInfo` entries to match; add new `// Opset 25` section
- **`docs/OperatorKernels.md`**: Update CUDA Transpose entry from `23+` to `25+` with new `[23, 24]` versioned range

No functional or type constraint changes. The kernel implementation is identical across these opsets.

### Motivation and Context

CUDA EP's Transpose registration stopped at opset 23 while the ONNX spec defines it through opset 25. This is one of the P1 gaps tracked in #27729, following the same pattern as #27728.

### Limitation

This PR does not add support for the new data types for Transpose:

- int2 (opset 25)
- float8e8m0 (opset 24)
- float4e2m1 (opset 23)
- float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, uint4, int4 (opset 21)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

tianleiwu requested a review from Copilot March 18, 2026 15:45

Copilot started reviewing on behalf of tianleiwu March 18, 2026 15:46

Copilot AI reviewed Mar 18, 2026

tianleiwu mentioned this pull request Mar 18, 2026

[Feature Request] Extend CUDA ONNX Ops to latest opset version #27729

Open

Rishi-Dave force-pushed the rishidave/fix/cuda-opset21-kernel-registrations branch from d6b72e4 to 8abced8 Compare March 18, 2026 20:20

Rishi-Dave force-pushed the rishidave/fix/cuda-opset21-kernel-registrations branch from 8abced8 to 0ce1020 Compare March 18, 2026 22:56

Rishi-Dave added 3 commits March 20, 2026 12:42

Update docs/OperatorKernels.md for new CUDA kernel version ranges

10138d5

Rishi-Dave force-pushed the rishidave/fix/cuda-opset21-kernel-registrations branch from 0ce1020 to 10138d5 Compare March 20, 2026 20:23

tianleiwu enabled auto-merge (squash) March 21, 2026 17:05

tianleiwu approved these changes Mar 21, 2026

tianleiwu merged commit ce3158a into microsoft:main Mar 21, 2026
88 of 89 checks passed

BrewTestBot mentioned this pull request Apr 20, 2026

onnxruntime 1.25.0 Homebrew/homebrew-core#278543

Merged
