Conversation

@daijh
Contributor

@daijh daijh commented Jan 8, 2026

Description

This PR migrates the OIHW2OHWIProgram from Im2ColMatMul to the Transpose operator. Centralizing this logic lets the specialized shader optimize generic 4D transpositions (specifically the {0, 2, 3, 1} permutation pattern) while reducing code duplication.

While this shader is capable of supporting 2D/3D transpositions, those optimizations are reserved for follow-up PRs.
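
For readers unfamiliar with the {0, 2, 3, 1} pattern, here is a minimal CPU reference sketch of what the permutation computes on a dense OIHW tensor. It is not the WGSL shader from this PR; the function name `TransposeOIHWToOHWI` and the row-major `float` layout are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for perm = {0, 2, 3, 1} on a dense, row-major
// 4D tensor. in_dims is {O, I, H, W}; the result has dims {O, H, W, I}.
std::vector<float> TransposeOIHWToOHWI(const std::vector<float>& in,
                                       const std::array<std::size_t, 4>& in_dims) {
  const std::size_t O = in_dims[0], I = in_dims[1], H = in_dims[2], W = in_dims[3];
  assert(in.size() == O * I * H * W);
  std::vector<float> out(in.size());
  for (std::size_t o = 0; o < O; ++o) {
    for (std::size_t i = 0; i < I; ++i) {
      for (std::size_t h = 0; h < H; ++h) {
        for (std::size_t w = 0; w < W; ++w) {
          // Source offset in OIHW order, destination offset in OHWI order.
          const std::size_t src = ((o * I + i) * H + h) * W + w;
          const std::size_t dst = ((o * H + h) * W + w) * I + i;
          out[dst] = in[src];
        }
      }
    }
  }
  return out;
}
```

The input-channel axis moves from the second position to the innermost position, so for each output channel the H and W axes can be treated as one flattened axis that is swapped with I.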

Motivation and Context

See above.

@daijh daijh marked this pull request as draft January 8, 2026 06:58
@daijh daijh force-pushed the optimize-4D-weights-transpose branch from 1b8e605 to be0ea7b Compare January 9, 2026 04:59
@daijh daijh marked this pull request as ready for review January 9, 2026 07:36
@daijh
Contributor Author

daijh commented Jan 9, 2026

@guschmue @fs-eire @qjia7 PTAL

@guschmue guschmue self-assigned this Jan 9, 2026
@guschmue guschmue requested a review from Copilot January 9, 2026 17:41
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jan 9, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes generic 4D transpose operations by migrating the specialized OIHW2OHWIProgram shader from the Im2ColMatMul operator to the Transpose operator. The migration enables reuse of this optimized shader for any 4D tensor transpose with the {0, 2, 3, 1} permutation pattern, while also fixing a calculation bug in the process.

  • Moves OIHW2OHWIProgram class and implementation from im2col_matmul to transpose
  • Relocates the WGSL shader template from nn/ to tensor/ directory
  • Fixes a bug where H_W_tiles was calculated using kernel_height * kernel_height instead of kernel_height * kernel_width
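
To see why the `kernel_height * kernel_height` slip matters, the sketch below compares the two products under an assumed ceiling-division tiling. The exact formula behind `H_W_tiles` is not shown in this thread, so `CeilDiv`, `HWTilesFixed`, `HWTilesBuggy`, and the `tile` parameter are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division helper (assumed; the real code may use a different utility).
constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b) {
  return (a + b - 1) / b;
}

// Hypothetical tile count over the flattened H*W axis, using the corrected product.
constexpr std::uint32_t HWTilesFixed(std::uint32_t kernel_height,
                                     std::uint32_t kernel_width,
                                     std::uint32_t tile) {
  return CeilDiv(kernel_height * kernel_width, tile);
}

// Same computation with the slip described above: height is multiplied by itself.
constexpr std::uint32_t HWTilesBuggy(std::uint32_t kernel_height,
                                     std::uint32_t /*kernel_width*/,
                                     std::uint32_t tile) {
  return CeilDiv(kernel_height * kernel_height, tile);
}

// Square kernels give the same answer either way, which can hide the slip.
static_assert(HWTilesBuggy(3, 3, 4) == HWTilesFixed(3, 3, 4),
              "square kernels are unaffected");
```

Under these assumptions, a 3x5 kernel would get too few tiles from the buggy form and a 5x3 kernel too many, while any square kernel would be unaffected.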

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/tensor/transpose.h | Adds OIHW2OHWIProgram class declaration with uniform variable definitions |
| onnxruntime/core/providers/webgpu/tensor/transpose.cc | Implements OIHW2OHWIProgram shader generation and integrates the optimization into DoTranspose with proper threshold checks; fixes bug in H_W_tiles calculation |
| onnxruntime/core/providers/webgpu/tensor/oihw_to_ohwi.wgsl.template | Adds the WGSL shader template for the OIHW to OHWI transpose operation with proper bounds checking and workgroup synchronization |
| onnxruntime/core/providers/webgpu/nn/im2col_matmul.h | Removes OIHW2OHWIProgram class declaration as it's moved to transpose |
| onnxruntime/core/providers/webgpu/nn/im2col_matmul.cc | Replaces local OIHW2OHWI implementation with call to TransposeKernel; adds conv.h include for the function |

@daijh
Contributor Author

daijh commented Jan 12, 2026

Fixed the CI error related to gsl::span.

@guschmue
Contributor

image

@daijh daijh force-pushed the optimize-4D-weights-transpose branch from b0d9657 to 61b825a Compare January 13, 2026 05:58
@daijh
Contributor Author

daijh commented Jan 13, 2026

Thanks for analyzing the CI failure.

The webgpu_minimal_build_edge_build_x64_RelWithDebInfo environment uses different build options than my local setup.
Added AreSpansEqual to resolve the compiler errors regarding gsl::span comparisons.
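
As a rough illustration of the approach, a span comparison can be written element-wise with `std::equal` so that it does not rely on the span type providing `operator==`. The signature of the actual `AreSpansEqual` helper is not shown in this thread; `ElementsEqual` and `IsOIHWToOHWIPerm` below are hypothetical stand-ins, and standard containers are used in place of `gsl::span` to keep the sketch self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Element-wise comparison of two ranges. The four-iterator std::equal overload
// also returns false when the lengths differ.
template <typename A, typename B>
bool ElementsEqual(const A& a, const B& b) {
  return std::equal(std::begin(a), std::end(a), std::begin(b), std::end(b));
}

// Hypothetical check for the 4D {0, 2, 3, 1} pattern discussed in this PR.
inline bool IsOIHWToOHWIPerm(const std::vector<std::int64_t>& perm) {
  static const std::int64_t kPattern[] = {0, 2, 3, 1};
  return ElementsEqual(perm, kPattern);
}
```

Because the helper only needs `std::begin` and `std::end`, the same template should work for any range type that exposes iterators, independent of which comparison operators a given build makes available.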

@daijh
Contributor Author

daijh commented Jan 15, 2026

React Native CI Pipeline / React Native CI Android (pull_request)

This failure appears to be environment-specific and is unrelated to the changes in this PR.

/usr/local/lib/android/sdk/emulator/qemu/linux-x86_64/qemu-system-x86_64: error while loading shared libraries: libpulse.so.0: cannot open shared object file: No such file or directory
