Add opset 21/23 CUDA kernel registrations for Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size#27728
Pull request overview
Extends CUDA Execution Provider kernel registrations to correctly declare bounded version ranges for several operators whose ONNX schema versions changed in opsets 21/23, preventing unintended CUDA→CPU fallback due to VerifyVersion rejecting open-ended kernels.
Changes:
- Add `ONNX_OPERATOR_VERSIONED_KERNEL_EX` registrations for opset ranges (e.g., 13–20 and 21–22) for Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.
- Move prior open-ended CUDA registrations to target opset 23 for these operators.
- Update CUDA EP forward declarations and `BuildKernelCreateInfo` registration list accordingly.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Updates kernel class forward-decls and registry entries for the newly versioned kernels. |
| onnxruntime/core/providers/cuda/tensor/flatten.cc | Adds version-bounded CUDA kernel registrations for Flatten for opsets 13–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/tensor/identity_op.cc | Adds version-bounded CUDA kernel registrations for Identity for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/tensor/size.cc | Adds version-bounded CUDA kernel registrations for Size for opsets 13–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/generator/constant_of_shape.cc | Adds version-bounded CUDA kernel registrations for ConstantOfShape for opsets 9–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/if.cc | Adds version-bounded CUDA kernel registrations for If for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/loop.cc | Adds version-bounded CUDA kernel registrations for Loop for opsets 19–20 and 21–22; shifts open-ended to 23. |
| onnxruntime/core/providers/cuda/controlflow/scan.cc | Adds version-bounded CUDA kernel registrations for Scan for opsets 19–20 and 21–22; shifts open-ended to 23. |
|
Good catch — updated in the latest push. The opset-23 registrations were unbounded ( Changes made:
This now matches the CPU provider's version ranges (21–22, 23–24 bounded, 25 unbounded) for Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size. |
|
/azup run Windows GPU Doc Gen CI Pipeline, Windows ARM64 QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI |
d6b72e4 to
8abced8
|
Rebased on latest To summarize the current state of the PR:
Let me know if anything else needs adjustment — happy to iterate. |
|
/azup run Windows GPU Doc Gen CI Pipeline |
8abced8 to
0ce1020
|
Rebased on latest All prior review feedback (unbounded opset-23 registrations) was addressed in the previous push. Current state for all 7 operators (Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size):
This matches the CPU provider's version ranges. Forward declarations and Ready for review when CI clears. |
|
/azp run Windows GPU Doc Gen CI Pipeline |
|
@Rishi-Dave, Please update docs/OperatorKernels.md. You can wait until "Windows GPU Doc Gen CI Pipeline" job finishes, then download the file from artifacts of the job. |
|
Thanks for triggering the doc gen pipeline. I'll download the updated |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…Loop, Scan, ConstantOfShape, Size (microsoft#27102)

When ONNX introduces a new version of an operator in opset 21, the kernel registry's VerifyVersion rejects non-versioned (open-ended) CUDA kernels because kernel_start_version != since_ver while kernel_end_version == INT_MAX. This causes those operators to fall back from CUDA to CPU, introducing unnecessary host↔device copies that can lead to value corruption on Windows.

PR microsoft#26075 previously fixed this for Shape, Reshape, Transpose, Squeeze, and Unsqueeze. This commit extends the same fix to the remaining affected operators: Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size.

For each operator:
- Cap existing non-versioned kernel to opset 20 (VERSIONED)
- Add VERSIONED(21, 22) kernel with identical type constraints
- Add non-versioned opset 23 kernel for forward compatibility
Bound the opset-23 registrations to 23-24 and add unbounded opset-25 registrations for Flatten, Identity, If, Loop, Scan, ConstantOfShape, and Size, matching the CPU provider's version ranges. This prevents opset 24/25 models from falling back to CPU due to VerifyVersion rejecting open-ended opset-23 kernels.
Add version-bounded rows for ConstantOfShape, Flatten, Identity, If, Loop, Scan, and Size to reflect the new CUDA EP kernel registrations (21-22, 23-24 bounded, 25+ unbounded).
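The three commits above move each operator through three registration layouts. The following is a minimal sketch, not ORT code, of why the intermediate layout (open-ended at opset 23) still needed the follow-up commit. The `KernelRange` struct, the `Accepts` and `HasKernel` helpers, and the layout functions are hypothetical names introduced for illustration; the acceptance rule is an assumption based on the commit message's description of `VerifyVersion`, and the Flatten ranges are taken from the file summary table.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical model of one kernel registration: [start, end] opset range.
// end == INT_MAX stands for a non-versioned (open-ended) registration.
struct KernelRange {
  int start;
  int end;
};

// Assumed acceptance rule: exact start match, or a bounded range that
// covers the schema's since-version.
inline bool Accepts(const KernelRange& k, int since_ver) {
  if (k.start == since_ver) return true;
  return k.start < since_ver && k.end != INT_MAX && k.end >= since_ver;
}

// True if any registration in the layout accepts the given since-version.
inline bool HasKernel(const std::vector<KernelRange>& layout, int since_ver) {
  for (const KernelRange& k : layout) {
    if (Accepts(k, since_ver)) return true;
  }
  return false;
}

// Flatten layouts at each stage of this PR (ranges from the summary table).
inline std::vector<KernelRange> OriginalLayout() {
  return {{13, INT_MAX}};
}
inline std::vector<KernelRange> IntermediateLayout() {
  return {{13, 20}, {21, 22}, {23, INT_MAX}};
}
inline std::vector<KernelRange> FinalLayout() {
  return {{13, 20}, {21, 22}, {23, 24}, {25, INT_MAX}};
}

// Walks the three layouts against a few since-versions.
inline void CheckLayouts() {
  assert(!HasKernel(OriginalLayout(), 21));      // falls back to CPU
  assert(HasKernel(IntermediateLayout(), 21));   // fixed by the first commit
  assert(!HasKernel(IntermediateLayout(), 24));  // open-ended 23 rejected
  assert(HasKernel(FinalLayout(), 24));          // covered by bounded 23-24
  assert(HasKernel(FinalLayout(), 25));          // exact match on 25
}
```

Under this model, only the final layout finds a CUDA kernel for every since-version from 13 through 25, which is the behavior the second commit message describes.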
0ce1020 to
10138d5
|
Rebased on latest Changes in this push:
This matches the CPU provider's version ranges and follows the same pattern as Shape, Reshape, Transpose, Squeeze, and Unsqueeze. Note: |
|
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
### Description

Extends the CUDA Transpose kernel registration from opset 23 to opset 25.

- **`transpose.cc`**: Cap the existing opset 23 kernel to versioned `(23, 24)`; add a new non-versioned kernel at opset 25.
- **`cuda_execution_provider.cc`**: Update forward declarations and `BuildKernelCreateInfo` entries to match; add a new `// Opset 25` section.
- **`docs/OperatorKernels.md`**: Update the CUDA Transpose entry from `23+` to `25+` with a new `[23, 24]` versioned range.

No functional or type constraint changes: the kernel implementation is identical across these opsets.

### Motivation and Context

CUDA EP's Transpose registration stopped at opset 23 while the ONNX spec defines it through opset 25. This is one of the P1 gaps tracked in #27729, following the same pattern as #27728.

### Limitation

This PR does not add support for the new data types for Transpose:

- int2 (opset 25)
- float8e8m0 (opset 24)
- float4e2m1 (opset 23)
- float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, uint4, int4 (opset 21)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Summary
Motivation
Fixes #27102.
When ONNX introduces a new operator version in opset 21, ORT's `VerifyVersion` function in `kernel_registry.cc` rejects non-versioned (open-ended) CUDA kernels. The check at kernel_registry.cc:L126-L133 requires either an exact version match or a bounded version range: a kernel registered as `since_version=N, end_version=INT_MAX` fails when `since_ver` (from the opset 21 schema) differs from `N`.

This causes the affected operators to fall back from CUDA to CPU, introducing unnecessary host↔device memory copies. On Windows with CUDA EP, this fallback path can produce corrupted shape computation values (e.g., `124647109376` instead of `6`), leading to downstream Reshape failures.

PR #26075 fixed this for Shape, Reshape, Transpose, Squeeze, and Unsqueeze. This PR extends the same fix to the 7 remaining operators that were updated in ONNX opset 21 and had non-versioned CUDA kernels.
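The version rule described in the Motivation can be written as a single predicate. This is a simplified sketch under stated assumptions, not the code at kernel_registry.cc:L126-L133: the function name `VersionMatches` and its parameter names are hypothetical, and the real check takes more context than three integers. The start version 13 in the examples is illustrative.

```cpp
#include <cassert>
#include <climits>

// Simplified stand-in for the version rule: a kernel matches when it starts
// exactly at the schema's since-version, or when it has a bounded range
// (end != INT_MAX) that covers the since-version.
constexpr bool VersionMatches(int kernel_start, int kernel_end, int since_ver) {
  if (kernel_start == since_ver) return true;  // exact version match
  return kernel_start < since_ver &&           // bounded range that
         kernel_end != INT_MAX &&              // covers since_ver
         kernel_end >= since_ver;
}

// Illustrative cases for a kernel first registered at opset 13.
inline void CheckExamples() {
  // Open-ended kernel, schema still at 13: accepted by the exact match.
  assert(VersionMatches(13, INT_MAX, 13));
  // Open-ended kernel, schema moved to 21: rejected, so the node leaves CUDA.
  assert(!VersionMatches(13, INT_MAX, 21));
  // After the fix: a bounded 21-22 kernel accepts the opset 21 schema.
  assert(VersionMatches(21, 22, 21));
  // The capped 13-20 kernel does not claim the opset 21 schema.
  assert(!VersionMatches(13, 20, 21));
}
```

The second case is the failure this PR targets: the range check requires a finite end version, so an open-ended registration can only match by exact start version.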
Changes
For each of the 7 operators:
`ONNX_OPERATOR_KERNEL` → `ONNX_OPERATOR_VERSIONED_KERNEL`)
Files modified:
- `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`: forward declarations + `BuildKernelCreateInfo` registration
- `onnxruntime/core/providers/cuda/tensor/flatten.cc`
- `onnxruntime/core/providers/cuda/tensor/identity_op.cc`
- `onnxruntime/core/providers/cuda/tensor/size.cc`
- `onnxruntime/core/providers/cuda/generator/constant_of_shape.cc`
- `onnxruntime/core/providers/cuda/controlflow/if.cc`
- `onnxruntime/core/providers/cuda/controlflow/loop.cc`
- `onnxruntime/core/providers/cuda/controlflow/scan.cc`
Test Plan