CUDA Resize: add optimized 3D nearest resize kernel for 5D up/down sa… by johannes-rehm-snkeos · Pull Request #27578 · microsoft/onnxruntime

johannes-rehm-snkeos · 2026-03-06T10:03:05Z

Summary

This PR adds an optimized CUDA path for nearest-neighbor 3D resize (both index mapping and execution) to the Resize operator, along with targeted regression tests.

The implementation introduces a dedicated 3D fast path for nearest resize that efficiently handles the last three spatial dimensions (D/H/W) when all outer dimensions are unchanged.
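One way to picture a split into a "mapping" kernel and a "compute" kernel is sketched below. This is a minimal host-side C++ illustration, not the PR's CUDA code: it assumes `coordinate_transformation_mode = asymmetric` and `nearest_mode = floor`, and the names `NearestAxisMapping` and `ResizeNearest3D` are hypothetical. Because nearest resize is separable, the source index along each of D/H/W can be precomputed once per output coordinate; each output element then reduces to a single gather, which is the work one GPU thread would do.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: per-axis nearest source index for every output coordinate.
// Assumes asymmetric coordinates (x_in = x_out / scale) and floor rounding.
std::vector<int64_t> NearestAxisMapping(int64_t in_dim, int64_t out_dim, float scale) {
  std::vector<int64_t> map(static_cast<std::size_t>(out_dim));
  for (int64_t o = 0; o < out_dim; ++o) {
    auto src = static_cast<int64_t>(std::floor(static_cast<float>(o) / scale));
    map[o] = std::clamp<int64_t>(src, 0, in_dim - 1);  // stay inside the input axis
  }
  return map;
}

// Resizes the last three dims (D, H, W) of a contiguous tensor whose outer dims
// are unchanged; `outer` is the product of the outer dims (N*C for 5D input).
std::vector<float> ResizeNearest3D(const std::vector<float>& in, int64_t outer,
                                   int64_t in_d, int64_t in_h, int64_t in_w,
                                   int64_t out_d, int64_t out_h, int64_t out_w,
                                   float scale_d, float scale_h, float scale_w) {
  assert(static_cast<int64_t>(in.size()) == outer * in_d * in_h * in_w);
  // "Mapping" step: three small tables instead of per-element index math.
  const auto md = NearestAxisMapping(in_d, out_d, scale_d);
  const auto mh = NearestAxisMapping(in_h, out_h, scale_h);
  const auto mw = NearestAxisMapping(in_w, out_w, scale_w);
  // "Compute" step: each output element is a single gather.
  std::vector<float> out(static_cast<std::size_t>(outer * out_d * out_h * out_w));
  for (int64_t b = 0; b < outer; ++b)
    for (int64_t z = 0; z < out_d; ++z)
      for (int64_t y = 0; y < out_h; ++y)
        for (int64_t x = 0; x < out_w; ++x) {
          const int64_t src = ((b * in_d + md[z]) * in_h + mh[y]) * in_w + mw[x];
          const int64_t dst = ((b * out_d + z) * out_h + y) * out_w + x;
          out[dst] = in[src];  // on the GPU, one thread would handle one dst
        }
  return out;
}
```

For example, upsampling a 2x2x2 volume by 2 on every axis yields a 4x4x4 volume in which each input voxel is replicated into a 2x2x2 block.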

What Changed

CUDA Resize implementation

File: onnxruntime/core/providers/cuda/tensor/resize_impl.cu

- Added a 3D nearest mapping kernel: `_ResizeNearestMappingKernel3D`
- Added a 3D nearest compute kernel: `_ResizeNearestKernel3D`
- Added an optimized 3D dispatch path in `ResizeNearestImpl`, enabled when:
  - rank >= 3
  - `coordinate_transformation_mode != tf_crop_and_resize`
  - all outer scales (every dimension except the last three) are 1.0

This keeps existing behavior unchanged for other cases while using the optimized path for true 3D nearest resize workloads.
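The eligibility rule above can be expressed as a small predicate. This is a hedged C++ sketch, not the code in `ResizeNearestImpl`: `CoordTransformMode` and `CanUseNearest3DFastPath` are hypothetical names, and the check assumes the per-axis `scales` are available as floats, one entry per input dimension.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the coordinate transformation modes; the actual
// enum in onnxruntime may be named and laid out differently.
enum class CoordTransformMode { kHalfPixel, kAsymmetric, kAlignCorners, kTfCropAndResize };

// Returns true when only the last three dims are resized and the mode is not
// tf_crop_and_resize, i.e. the conditions listed for the 3D fast path.
bool CanUseNearest3DFastPath(const std::vector<float>& scales, CoordTransformMode mode) {
  const std::size_t rank = scales.size();
  if (rank < 3) return false;
  if (mode == CoordTransformMode::kTfCropAndResize) return false;
  for (std::size_t i = 0; i + 3 < rank; ++i) {
    if (scales[i] != 1.0f) return false;  // every outer dim must be left untouched
  }
  return true;
}
```

A 5D input with scales `{1, 1, 2, 2, 2}` qualifies, while `{1, 2, 2, 2, 2}` (which also resizes the channel dimension) falls back to the generic path.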

Regression tests

File: onnxruntime/test/providers/cpu/tensor/resize_op_test.cc

Added CUDA-targeted regression tests:

- `ResizeOpNearestUpSampleTest_5D_CudaRegression_Optimized3DMapping`
- `ResizeOpNearestDownSampleTest_5D_CudaRegression_Optimized3DMapping`

Why

Previously, these 3D cases fell back to the generic nearest implementation. This change introduces a dedicated CUDA 3D path to improve performance for 5D nearest resize workloads.

Fixes #14596

…mpling

johannes-rehm-snkeos · 2026-03-06T11:36:49Z

@microsoft-github-policy-service agree

tianleiwu · 2026-03-06T16:25:49Z

@johannes-rehm-snkeos, could you share benchmark/profiling results that show that the new kernel is better?

johannes-rehm-snkeos · 2026-03-09T09:42:17Z

@tianleiwu I used the script from your comment in #14596 and got the following results:

Profiling of Torch:

```text
           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.73%  4.98039s       503  9.9014ms  5.6153ms  39.858ms  void at::native::_GLOBAL__N__b9911c4e_20_UpSampleNearest3d_cu_2b4cf812::upsample_nearest3d_out_frame<float, __operator_&__(at::native::nearest_neighbor_compute_source_index(float, int, int))>(float const *, __int64, __int64, __int64, __int64, __int64, __int64, __int64, __int64, at::native::nearest_neighbor_compute_source_index*, float, float, float)
```

Profiling of onnxruntime-gpu==1.24.3:

```text
           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   90.56%  40.4292s       503  80.376ms  76.639ms  98.755ms  void onnxruntime::cuda::_ResizeNearestKernel<float>(int, onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::DivMod<int>, int=8>, float const *, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::DivMod<int>, int=8>*, __int64, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::DivMod<int>, int=8>, __int64 const *, onnxruntime::cuda::NearestMappingInfo const *)
```

Profiling of johannes-rehm-snkeos:cuda-resize-nearest-3d-kernel:

```text
           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   43.42%  3.75516s       503  7.4655ms  1.9272ms  37.947ms  void onnxruntime::cuda::_ResizeNearestKernel3D<float, bool=0>(__int64, __int64, __int64, __int64, __int64, int, onnxruntime::cuda::DivMod<int>, onnxruntime::cuda::DivMod, onnxruntime::cuda::DivMod, float const *, onnxruntime::cuda::DivMod<int>*, __int64, onnxruntime::cuda::DivMod<int>, onnxruntime::cuda::NearestMappingInfo const *)
```
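For a rough sense of scale, dividing the reported average per-call kernel times gives the ratios below. These are computed here from a single profiling run (503 calls each) and were not stated in the original comment; other shapes and GPUs may behave differently.

$$
\frac{80.376\ \text{ms}}{7.4655\ \text{ms}} \approx 10.8\ \ (\text{vs. onnxruntime-gpu 1.24.3}),
\qquad
\frac{9.9014\ \text{ms}}{7.4655\ \text{ms}} \approx 1.33\ \ (\text{vs. Torch})
$$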

Copilot

Pull request overview

This PR adds a CUDA optimized fast-path for nearest-neighbor 3D resize (mapping + execution) to improve performance on rank≥3 tensors where only the last three dimensions are resized and all outer-dimension scales are 1.0, and introduces CUDA-targeted regression tests to validate the new path.

Changes:

- Added CUDA nearest-neighbor 3D mapping and compute kernels and a dispatch fast-path in ResizeNearestImpl.
- Enabled the new 3D optimized path when coordinate_transformation_mode != tf_crop_and_resize and all outer scales (except last 3 dims) are exactly 1.0.
- Added CUDA regression tests covering 5D nearest upsample and downsample scenarios intended to hit the optimized 3D mapping path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| `onnxruntime/core/providers/cuda/tensor/resize_impl.cu` | Introduces optimized nearest-neighbor 3D mapping/compute CUDA kernels and a conditional fast-path in `ResizeNearestImpl`. |
| `onnxruntime/test/providers/cpu/tensor/resize_op_test.cc` | Adds CUDA-targeted regression tests for 5D nearest resize upsample/downsample intended to exercise the optimized 3D path. |


tianleiwu · 2026-03-18T04:56:08Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-03-18T04:56:28Z

Azure Pipelines successfully started running 4 pipeline(s).

CUDA Resize: add optimized 3D nearest resize kernel for 5D up/down sa…

a60c613

…mpling

tianleiwu requested a review from Copilot March 12, 2026 16:03

Copilot started reviewing on behalf of tianleiwu March 12, 2026 16:04

Copilot AI reviewed Mar 12, 2026


tianleiwu approved these changes Mar 18, 2026


tianleiwu enabled auto-merge (squash) March 18, 2026 05:07

tianleiwu merged commit 160af83 into microsoft:main Mar 18, 2026
103 of 105 checks passed

BrewTestBot mentioned this pull request Apr 20, 2026 in onnxruntime 1.25.0 (Homebrew/homebrew-core#278543, merged)

tianleiwu merged 1 commit into microsoft:main from johannes-rehm-snkeos:cuda-resize-nearest-3d-kernel