CUDA Resize: add optimized 3D nearest resize kernel for 5D up/down sa…#27578

Merged
tianleiwu merged 1 commit into microsoft:main from johannes-rehm-snkeos:cuda-resize-nearest-3d-kernel
Mar 18, 2026

@johannes-rehm-snkeos
Contributor

Summary

This PR adds an optimized CUDA path for nearest-neighbor 3D resize (both the mapping and the compute step) to the Resize operator, along with targeted regression tests.

The implementation introduces a dedicated 3D fast path for nearest resize to handle the last three spatial dimensions (D/H/W) efficiently when outer dimensions are unchanged.
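As a rough illustration of what "resizing only the last three spatial dimensions" means, here is a minimal CPU reference sketch. It is not the kernel from this PR: the names `BuildAxisMap` and `ResizeNearest3DRef` are hypothetical, and it assumes the `asymmetric` coordinate transform with `floor` rounding, which is only one of the combinations ONNX Resize allows. All unchanged leading dimensions (for example N and C of an NCDHW tensor) are flattened into a single `outer` count.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using i64 = std::int64_t;

// Per-axis lookup table: for each output index, the input index to copy from.
// Assumption: "asymmetric" coordinate transform with "floor" rounding.
std::vector<i64> BuildAxisMap(i64 in_size, i64 out_size) {
  const float scale = static_cast<float>(out_size) / static_cast<float>(in_size);
  std::vector<i64> map(static_cast<std::size_t>(out_size));
  for (i64 o = 0; o < out_size; ++o) {
    const i64 src = static_cast<i64>(std::floor(static_cast<float>(o) / scale));
    // Clamp so that rounding can never index outside the input axis.
    map[static_cast<std::size_t>(o)] = std::min<i64>(std::max<i64>(src, 0), in_size - 1);
  }
  return map;
}

// CPU reference: `outer` is the product of all unchanged leading dims (e.g. N*C).
std::vector<float> ResizeNearest3DRef(const std::vector<float>& x, i64 outer,
                                      i64 in_d, i64 in_h, i64 in_w,
                                      i64 out_d, i64 out_h, i64 out_w) {
  assert(static_cast<i64>(x.size()) == outer * in_d * in_h * in_w);
  const std::vector<i64> map_d = BuildAxisMap(in_d, out_d);
  const std::vector<i64> map_h = BuildAxisMap(in_h, out_h);
  const std::vector<i64> map_w = BuildAxisMap(in_w, out_w);

  std::vector<float> y(static_cast<std::size_t>(outer * out_d * out_h * out_w));
  std::size_t out_idx = 0;
  for (i64 n = 0; n < outer; ++n) {
    for (i64 d = 0; d < out_d; ++d) {
      for (i64 h = 0; h < out_h; ++h) {
        for (i64 w = 0; w < out_w; ++w) {
          // Only D/H/W go through the maps; the outer index is copied as-is.
          const i64 in_idx =
              ((n * in_d + map_d[static_cast<std::size_t>(d)]) * in_h +
               map_h[static_cast<std::size_t>(h)]) * in_w +
              map_w[static_cast<std::size_t>(w)];
          y[out_idx++] = x[static_cast<std::size_t>(in_idx)];
        }
      }
    }
  }
  return y;
}
```

Calling `ResizeNearest3DRef(x, 1, 2, 2, 2, 4, 4, 4)` on an 8-element input upsamples a 2x2x2 volume to 4x4x4. Precomputing one small table per axis keeps the inner loop free of floating-point work, which is the general idea behind separating a mapping step from a compute step.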

What Changed

CUDA Resize implementation

File: onnxruntime/core/providers/cuda/tensor/resize_impl.cu

  • Added 3D nearest mapping kernel:
    • _ResizeNearestMappingKernel3D
  • Added 3D nearest compute kernel:
    • _ResizeNearestKernel3D
  • Added optimized 3D dispatch path in ResizeNearestImpl:
    • Enabled when:
      • rank >= 3
      • coordinate_transformation_mode != tf_crop_and_resize
      • all outer scales (except last 3 dims) are 1.0

This keeps existing behavior unchanged for other cases while using the optimized path for true 3D nearest resize workloads.
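The three enabling conditions listed above can be expressed as a small host-side predicate. This is a sketch under stated assumptions, not the code in `resize_impl.cu`: the function name `CanUseNearest3DFastPath`, the use of a per-dimension `scales` vector, and passing the coordinate transformation mode as a string are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical predicate mirroring the conditions in the PR description.
// `scales` holds one scale factor per tensor dimension (so its size is the rank).
bool CanUseNearest3DFastPath(const std::vector<float>& scales,
                             const std::string& coord_mode) {
  const std::size_t rank = scales.size();
  for (float s : scales) {
    assert(s > 0.0f);  // Resize scales are expected to be positive.
  }
  if (rank < 3) return false;                              // need D/H/W
  if (coord_mode == "tf_crop_and_resize") return false;    // excluded mode
  // Every dimension before the last three must be left unscaled.
  for (std::size_t i = 0; i + 3 < rank; ++i) {
    if (scales[i] != 1.0f) return false;
  }
  return true;
}
```

For a 5D NCDHW tensor, scales of `{1, 1, 2, 2, 2}` would qualify under this predicate, while `{1, 2, 2, 2, 2}` would not, because the channel dimension is also being scaled and the case would fall back to the generic path.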

Regression tests

File: onnxruntime/test/providers/cpu/tensor/resize_op_test.cc

Added CUDA-targeted regression tests:

  • ResizeOpNearestUpSampleTest_5D_CudaRegression_Optimized3DMapping
  • ResizeOpNearestDownSampleTest_5D_CudaRegression_Optimized3DMapping

Why

The previous nearest implementation relied on the generic path for these 3D scenarios. This change introduces a dedicated CUDA 3D path to improve performance for 5D nearest resize workloads.

Fixes #14596

@johannes-rehm-snkeos
Contributor Author

@microsoft-github-policy-service agree

@tianleiwu
Contributor

@johannes-rehm-snkeos, could you share benchmark/profiling results that show that the new kernel is better?

@johannes-rehm-snkeos
Contributor Author

@tianleiwu I used your script from here: #14596 (comment) and got the following results:

Profiling of Torch:

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.73%  4.98039s       503  9.9014ms  5.6153ms  39.858ms  void at::native::_GLOBAL__N__b9911c4e_20_UpSampleNearest3d_cu_2b4cf812::upsample_nearest3d_out_frame<float, __operator_&__(at::native::nearest_neighbor_compute_source_index(float, int, int))>(float const *, __int64, __int64, __int64, __int64, __int64, __int64, __int64, __int64, at::native::nearest_neighbor_compute_source_index*, float, float, float)

Profiling of onnxruntime-gpu==1.24.3:

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   90.56%  40.4292s       503  80.376ms  76.639ms  98.755ms  void onnxruntime::cuda::_ResizeNearestKernel<float>(int, onnxruntime::cuda::TArray<__int64, int=8>, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::DivMod<int>, int=8>, float const *, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::DivMod<int>, int=8>*, __int64, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::DivMod<int>, int=8>, __int64 const *, onnxruntime::cuda::NearestMappingInfo const *)

Profiling of johannes-rehm-snkeos:cuda-resize-nearest-3d-kernel:

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   43.42%  3.75516s       503  7.4655ms  1.9272ms  37.947ms  void onnxruntime::cuda::_ResizeNearestKernel3D<float, bool=0>(__int64, __int64, __int64, __int64, __int64, int, onnxruntime::cuda::DivMod<int>, onnxruntime::cuda::DivMod, onnxruntime::cuda::DivMod, float const *, onnxruntime::cuda::DivMod<int>*, __int64, onnxruntime::cuda::DivMod<int>, onnxruntime::cuda::NearestMappingInfo const *)
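
To put the three profiles side by side, the average per-call kernel times can be turned into ratios. The sketch below only does that arithmetic; the `Speedup` helper and the constant names are hypothetical, and the numbers are the `Avg` values copied from the profiler output above.

```cpp
#include <cassert>

// Ratio of baseline time to candidate time; a value above 1 means the
// candidate kernel is faster than the baseline.
double Speedup(double baseline_ms, double candidate_ms) {
  assert(candidate_ms > 0.0);
  return baseline_ms / candidate_ms;
}

// Average per-call kernel times in milliseconds, taken from the Avg column.
constexpr double kTorchAvgMs = 9.9014;        // upsample_nearest3d_out_frame
constexpr double kOrtGenericAvgMs = 80.376;   // _ResizeNearestKernel (1.24.3)
constexpr double kOrt3DAvgMs = 7.4655;        // _ResizeNearestKernel3D (this PR)
```

With these figures, `Speedup(kOrtGenericAvgMs, kOrt3DAvgMs)` comes out at roughly 10.8x over the generic kernel in onnxruntime-gpu 1.24.3, and `Speedup(kTorchAvgMs, kOrt3DAvgMs)` at roughly 1.3x over the Torch kernel. These are averages over 503 calls from a single profiling run each, so the Min/Max spread in the tables is worth keeping in mind.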

Contributor

Copilot AI left a comment

Pull request overview

This PR adds a CUDA optimized fast-path for nearest-neighbor 3D resize (mapping + execution) to improve performance on rank≥3 tensors where only the last three dimensions are resized and all outer-dimension scales are 1.0, and introduces CUDA-targeted regression tests to validate the new path.

Changes:

  • Added CUDA nearest-neighbor 3D mapping and compute kernels and a dispatch fast-path in ResizeNearestImpl.
  • Enabled the new 3D optimized path when coordinate_transformation_mode != tf_crop_and_resize and all outer scales (except last 3 dims) are exactly 1.0.
  • Added CUDA regression tests covering 5D nearest upsample and downsample scenarios intended to hit the optimized 3D mapping path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File: onnxruntime/core/providers/cuda/tensor/resize_impl.cu
Description: Introduces optimized nearest-neighbor 3D mapping/compute CUDA kernels and a conditional fast-path in ResizeNearestImpl.

File: onnxruntime/test/providers/cpu/tensor/resize_op_test.cc
Description: Adds CUDA-targeted regression tests for 5D nearest resize upsample/downsample intended to exercise the optimized 3D path.

@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@tianleiwu enabled auto-merge (squash) on March 18, 2026 05:07
@tianleiwu merged commit 160af83 into microsoft:main on Mar 18, 2026
103 of 105 checks passed

Development

Successfully merging this pull request may close these issues.

[Performance] nearest neighbor Resize operator is significantly slower than pytorch for 3D tensors

3 participants