Add opset 21/23 CUDA kernel registrations for Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size#27728

Merged
tianleiwu merged 3 commits into microsoft:main from Rishi-Dave:rishidave/fix/cuda-opset21-kernel-registrations
Mar 21, 2026

@Rishi-Dave
Contributor

Summary

Motivation

Fixes #27102.

When ONNX introduces a new operator version in opset 21, ORT's VerifyVersion function in kernel_registry.cc rejects non-versioned (open-ended) CUDA kernels. The check at kernel_registry.cc:L126-L133 requires either an exact version match or a bounded version range — a kernel registered as since_version=N, end_version=INT_MAX fails when since_ver (from the opset 21 schema) differs from N.
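
The rule described above can be sketched as a small predicate. This is a minimal model written from the description in this PR, not a copy of kernel_registry.cc; the function and parameter names are illustrative assumptions.

```cpp
#include <climits>

// Sketch of the version check as described in this PR (names are hypothetical).
// kernel_start / kernel_end: the range the kernel was registered with,
//   where kernel_end == INT_MAX means "open-ended".
// schema_since: the since_version of the ONNX schema resolved for the node.
constexpr bool VersionMatches(int kernel_start, int kernel_end, int schema_since) {
  // Accept an exact match on the start version, or a bounded range
  // that contains the schema's since_version.
  return kernel_start == schema_since ||
         (kernel_start < schema_since && kernel_end != INT_MAX &&
          kernel_end >= schema_since);
}

// An open-ended kernel registered at 13 still matches the opset 13 schema...
static_assert(VersionMatches(13, INT_MAX, 13), "exact match is accepted");
// ...but is rejected once the schema is bumped to since_version 21.
static_assert(!VersionMatches(13, INT_MAX, 21), "open-ended kernel is rejected");
// A bounded 21-22 registration matches the opset 21 schema.
static_assert(VersionMatches(21, 22, 21), "bounded range is accepted");
```

Under this model, the rejection comes from the open end, not from the start version: capping the same kernel at a finite end version is enough to make the range check apply.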

This causes the affected operators to fall back from CUDA to CPU, introducing unnecessary host↔device memory copies. On Windows with CUDA EP, this fallback path can produce corrupted shape computation values (e.g., 124647109376 instead of 6), leading to downstream Reshape failures.

PR #26075 fixed this for Shape, Reshape, Transpose, Squeeze, and Unsqueeze. This PR extends the same fix to the 7 remaining operators that were updated in ONNX opset 21 and had non-versioned CUDA kernels.

Changes

For each of the 7 operators:

  1. Cap existing non-versioned kernel to opset 20 (ONNX_OPERATOR_KERNEL → ONNX_OPERATOR_VERSIONED_KERNEL)
  2. Add VERSIONED(21, 22) kernel with identical type constraints
  3. Add non-versioned opset 23 kernel for forward compatibility (opset 23 introduced another schema update for these operators)
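
As a rough illustration of the three steps, the sketch below models Flatten's registrations as plain version ranges (13 as the original start, per the review table) and checks them against the same matching rule. `KernelRange`, `Matches`, and `AnyMatches` are hypothetical helpers for this example only; the real registrations use the ORT kernel macros named in the steps.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical stand-in for one kernel registration; end == INT_MAX is open-ended.
struct KernelRange {
  int since;
  int end;
};

// Same rule as described in the Motivation section (assumed, not copied from ORT).
constexpr bool Matches(KernelRange k, int schema_since) {
  return k.since == schema_since ||
         (k.since < schema_since && k.end != INT_MAX && k.end >= schema_since);
}

// True if any registration in the list accepts the given schema since_version.
template <std::size_t N>
constexpr bool AnyMatches(const KernelRange (&kernels)[N], int schema_since) {
  for (std::size_t i = 0; i < N; ++i) {
    if (Matches(kernels[i], schema_since)) return true;
  }
  return false;
}

// Before: one open-ended registration starting at 13.
constexpr KernelRange kBefore[] = {{13, INT_MAX}};
// After steps 1-3: 13-20 (capped), 21-22 (new, bounded), 23+ (new, open-ended).
constexpr KernelRange kAfter[] = {{13, 20}, {21, 22}, {23, INT_MAX}};

static_assert(!AnyMatches(kBefore, 21), "opset 21 schema finds no CUDA kernel");
static_assert(AnyMatches(kAfter, 13), "older models still resolve");
static_assert(AnyMatches(kAfter, 21), "opset 21 schema now resolves");
static_assert(AnyMatches(kAfter, 23), "opset 23 schema now resolves");
```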

Files modified:

  • onnxruntime/core/providers/cuda/cuda_execution_provider.cc — forward declarations + BuildKernelCreateInfo registration
  • onnxruntime/core/providers/cuda/tensor/flatten.cc
  • onnxruntime/core/providers/cuda/tensor/identity_op.cc
  • onnxruntime/core/providers/cuda/tensor/size.cc
  • onnxruntime/core/providers/cuda/generator/constant_of_shape.cc
  • onnxruntime/core/providers/cuda/controlflow/if.cc
  • onnxruntime/core/providers/cuda/controlflow/loop.cc
  • onnxruntime/core/providers/cuda/controlflow/scan.cc

Test Plan

  • Verify CUDA EP build compiles successfully (CI)
  • Existing opset 21 tests for Shape/Reshape/Squeeze/Unsqueeze pass (validates the pattern)
  • Verify operators are no longer falling back to CPU when running opset 21 models on CUDA
  • No regression in existing CUDA EP tests

Contributor

Copilot AI left a comment

Pull request overview

Extends CUDA Execution Provider kernel registrations to correctly declare bounded version ranges for several operators whose ONNX schema versions changed in opsets 21/23, preventing unintended CUDA→CPU fallback due to VerifyVersion rejecting open-ended kernels.

Changes:

  • Add ONNX_OPERATOR_VERSIONED_KERNEL_EX registrations for opset ranges (e.g., 13–20 and 21–22) for Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.
  • Move prior open-ended CUDA registrations to target opset 23 for these operators.
  • Update CUDA EP forward declarations and BuildKernelCreateInfo registration list accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Updates kernel class forward-decls and registry entries for the newly versioned kernels. |
| onnxruntime/core/providers/cuda/tensor/flatten.cc | Adds version-bounded CUDA kernel registrations for Flatten for opsets 13–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/tensor/identity_op.cc | Adds version-bounded CUDA kernel registrations for Identity for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/tensor/size.cc | Adds version-bounded CUDA kernel registrations for Size for opsets 13–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/generator/constant_of_shape.cc | Adds version-bounded CUDA kernel registrations for ConstantOfShape for opsets 9–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/if.cc | Adds version-bounded CUDA kernel registrations for If for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/loop.cc | Adds version-bounded CUDA kernel registrations for Loop for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/scan.cc | Adds version-bounded CUDA kernel registrations for Scan for opsets 19–20 and 21–22; shifts open-ended to 23. |

Comment thread onnxruntime/core/providers/cuda/generator/constant_of_shape.cc
Comment thread onnxruntime/core/providers/cuda/controlflow/if.cc
Comment thread onnxruntime/core/providers/cuda/controlflow/loop.cc
Comment thread onnxruntime/core/providers/cuda/controlflow/scan.cc
Comment thread onnxruntime/core/providers/cuda/cuda_execution_provider.cc Outdated
Comment thread onnxruntime/core/providers/cuda/tensor/size.cc
Comment thread onnxruntime/core/providers/cuda/tensor/flatten.cc
Comment thread onnxruntime/core/providers/cuda/tensor/identity_op.cc
@Rishi-Dave
Contributor Author

Good catch — updated in the latest push.

The opset-23 registrations were unbounded (end_version=INT_MAX), which meant VerifyVersion would only match since_version==23 and opset 24/25 models would still fall back to CPU.

Changes made:

  • Changed all 7 opset-23 registrations from unbounded to versioned 23–24 (ONNX_OPERATOR_VERSIONED_KERNEL_EX)
  • Added unbounded opset-25 registrations (ONNX_OPERATOR_KERNEL_EX) for all 7 operators
  • Updated forward declarations and BuildKernelCreateInfo entries in cuda_execution_provider.cc accordingly

This now matches the CPU provider's version ranges (21–22, 23–24 bounded, 25 unbounded) for Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.
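
The gap raised in review can be checked mechanically. The sketch below reuses the matching rule described in the PR summary and reports the first schema version that no registration accepts. The helper names and the list of schema since_versions (21, 23, 24, 25) are assumptions for illustration; which versions actually exist differs per operator.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical stand-in for one kernel registration; end == INT_MAX is open-ended.
struct KernelRange {
  int since;
  int end;
};

// Matching rule as described in the PR summary (assumed, not copied from ORT).
constexpr bool Matches(KernelRange k, int schema_since) {
  return k.since == schema_since ||
         (k.since < schema_since && k.end != INT_MAX && k.end >= schema_since);
}

// Returns the first schema since_version with no matching kernel, or -1 if none.
template <std::size_t K, std::size_t S>
constexpr int FirstUncovered(const KernelRange (&kernels)[K],
                             const int (&schema_versions)[S]) {
  for (std::size_t s = 0; s < S; ++s) {
    bool covered = false;
    for (std::size_t k = 0; k < K; ++k) {
      if (Matches(kernels[k], schema_versions[s])) covered = true;
    }
    if (!covered) return schema_versions[s];
  }
  return -1;
}

// Hypothetical schema since_versions to probe.
constexpr int kSchemaVersions[] = {21, 23, 24, 25};
// First push: open-ended registration at 23.
constexpr KernelRange kFirstPush[] = {{13, 20}, {21, 22}, {23, INT_MAX}};
// After review: 23-24 bounded, 25 open-ended.
constexpr KernelRange kFinal[] = {{13, 20}, {21, 22}, {23, 24}, {25, INT_MAX}};

static_assert(FirstUncovered(kFirstPush, kSchemaVersions) == 24,
              "open-ended 23 leaves later schema versions unmatched");
static_assert(FirstUncovered(kFinal, kSchemaVersions) == -1,
              "final ranges cover every probed schema version");
```

The same pattern repeats whenever a new opset bumps one of these schemas: the current open-ended registration has to be capped and a new open-ended one added at the new version.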

@tianleiwu
Contributor

/azup run Windows GPU Doc Gen CI Pipeline, Windows ARM64 QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI

@Rishi-Dave force-pushed the rishidave/fix/cuda-opset21-kernel-registrations branch from d6b72e4 to 8abced8 on March 18, 2026 20:20
@Rishi-Dave
Contributor Author

Rebased on latest upstream/main — no conflicts.

To summarize the current state of the PR:

  • Opset version ranges now fully match the CPU provider pattern for all 7 operators (Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size):

    • Original range → end_version capped at opset 20
    • 21–22 bounded
    • 23–24 bounded
    • 25 unbounded (open-ended)
  • The "Build Extended Minimal" CI failure on the previous run was a transient vcpkg infrastructure issue (asset cache unreachable → boost-mp11 download failed), not related to this change. The re-triggered CI from the rebase should clear it.

  • All other CI checks (CUDA build, Linux build, Python format, etc.) passed cleanly.

Let me know if anything else needs adjustment — happy to iterate.

@tianleiwu
Contributor

/azup run Windows GPU Doc Gen CI Pipeline

@Rishi-Dave force-pushed the rishidave/fix/cuda-opset21-kernel-registrations branch from 8abced8 to 0ce1020 on March 18, 2026 22:56
@Rishi-Dave
Contributor Author

Rebased on latest upstream/main — no conflicts, no code changes needed.

All prior review feedback (unbounded opset-23 registrations) was addressed in the previous push. Current state for all 7 operators (Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size):

| Range | Registration Type |
| --- | --- |
| Original → 20 | ONNX_OPERATOR_VERSIONED_KERNEL_EX (bounded) |
| 21 → 22 | ONNX_OPERATOR_VERSIONED_KERNEL_EX (bounded) |
| 23 → 24 | ONNX_OPERATOR_VERSIONED_KERNEL_EX (bounded) |
| 25 → ∞ | ONNX_OPERATOR_KERNEL_EX (unbounded) |

This matches the CPU provider's version ranges. Forward declarations and BuildKernelCreateInfo entries in cuda_execution_provider.cc are aligned.

Ready for review when CI clears.

@tianleiwu
Contributor

tianleiwu commented Mar 18, 2026

/azp run Windows GPU Doc Gen CI Pipeline

@tianleiwu
Contributor

@Rishi-Dave, Please update docs/OperatorKernels.md.

You can wait until "Windows GPU Doc Gen CI Pipeline" job finishes, then download the file from artifacts of the job.

@Rishi-Dave
Contributor Author

Thanks for triggering the doc gen pipeline. I'll download the updated OperatorKernels.md from the pipeline artifacts once it completes and push the update.

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…Loop, Scan, ConstantOfShape, Size (microsoft#27102)

When ONNX introduces a new version of an operator in opset 21, the kernel
registry's VerifyVersion rejects non-versioned (open-ended) CUDA kernels
because kernel_start_version != since_ver while kernel_end_version == INT_MAX.
This causes those operators to fall back from CUDA to CPU, introducing
unnecessary host↔device copies that can lead to value corruption on Windows.

PR microsoft#26075 previously fixed this for Shape, Reshape, Transpose, Squeeze, and
Unsqueeze. This commit extends the same fix to the remaining affected
operators: Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.

For each operator:
- Cap existing non-versioned kernel to opset 20 (VERSIONED)
- Add VERSIONED(21, 22) kernel with identical type constraints
- Add non-versioned opset 23 kernel for forward compatibility

Bound the opset-23 registrations to 23-24 and add unbounded opset-25
registrations for Flatten, Identity, If, Loop, Scan, ConstantOfShape,
and Size, matching the CPU provider's version ranges. This prevents
opset 24/25 models from falling back to CPU due to VerifyVersion
rejecting open-ended opset-23 kernels.

Add version-bounded rows for ConstantOfShape, Flatten, Identity, If,
Loop, Scan, and Size to reflect the new CUDA EP kernel registrations
(21-22, 23-24 bounded, 25+ unbounded).
@Rishi-Dave force-pushed the rishidave/fix/cuda-opset21-kernel-registrations branch from 0ce1020 to 10138d5 on March 20, 2026 20:23
@Rishi-Dave
Contributor Author

Rebased on latest upstream/main and updated docs/OperatorKernels.md.

Changes in this push:

  1. Rebase conflict resolution: cuda_execution_provider.cc had conflicts with the Squeeze/Unsqueeze opset 24-25 registrations that landed upstream. Resolved by keeping upstream's Squeeze/Unsqueeze versioning (23→23, 24→24, 25→∞) and integrating our 7 operators alongside it.

  2. docs/OperatorKernels.md update — Updated the CUDA EP section for all 7 operators (ConstantOfShape, Flatten, Identity, If, Loop, Scan, Size) to reflect the new version ranges:

| Range | Registration Type |
| --- | --- |
| Original → 20 | VERSIONED (bounded) |
| 21 → 22 | VERSIONED (bounded) |
| 23 → 24 | VERSIONED (bounded) |
| 25 → ∞ | Unbounded |

This matches the CPU provider's version ranges and follows the same pattern as Shape, Reshape, Transpose, Squeeze, and Unsqueeze.

Note: OperatorKernels.md was updated manually since the doc gen script (gen_opkernel_doc.py) requires a compiled build. The version ranges and type constraints match the C++ kernel registrations exactly. The next doc gen pipeline run should produce an identical result — if there's any discrepancy, happy to pull the artifact and overwrite.

@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@tianleiwu tianleiwu enabled auto-merge (squash) March 21, 2026 17:05
@tianleiwu tianleiwu merged commit ce3158a into microsoft:main Mar 21, 2026
88 of 89 checks passed
tianleiwu added a commit that referenced this pull request Apr 6, 2026
### Description

Extends the CUDA Transpose kernel registration from opset 23 to opset
25.

- **`transpose.cc`**: Cap existing opset 23 kernel to versioned `(23,
24)`, add new non-versioned kernel at opset 25
- **`cuda_execution_provider.cc`**: Update forward declarations and
`BuildKernelCreateInfo` entries to match; add new `// Opset 25` section
- **`docs/OperatorKernels.md`**: Update CUDA Transpose entry from `23+`
to `25+` with new `[23, 24]` versioned range

No functional or type constraint changes — the kernel implementation is
identical across these opsets.

### Motivation and Context

CUDA EP's Transpose registration stopped at opset 23 while the ONNX spec
defines it through opset 25. This is one of the P1 gaps tracked in
#27729, following the same pattern as #27728.

### Limitation

This PR does not add support for the new data types for Transpose:
- int2 (opset 25)
- float8e8m0 (opset 24)
- float4e2m1 (opset 23)
- float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, uint4, int4
(opset 21)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

Development

Successfully merging this pull request may close these issues.

ORT inference fails after upgrading model's opset from 20 to 21