[webgpu] Optimize generic 4D Transpose using OIHW2OHWI Program #26942

daijh · 2026-01-08T06:58:41Z

Description

This PR migrates the OIHW2OHWI Program from Im2ColMatMul to the Transpose operator. By centralizing this logic, we leverage the specialized shader to optimize generic 4D transpositions (specifically the {0, 2, 3, 1} permutation pattern) while reducing code duplication.

While this shader is capable of supporting 2D/3D transpositions, those optimizations are reserved for follow-up PRs.
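
For readers unfamiliar with the layout names, the {0, 2, 3, 1} permutation can be sketched as a plain CPU reference loop. This is only an illustration of the index mapping from [O, I, H, W] to [O, H, W, I]; the helper name `TransposeOIHWToOHWI` is hypothetical and is not the WGSL shader or any ONNX Runtime function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a 4D transpose with perm = {0, 2, 3, 1}:
// input layout [O, I, H, W] -> output layout [O, H, W, I].
// Hypothetical helper for illustration only; not part of ONNX Runtime.
std::vector<float> TransposeOIHWToOHWI(const std::vector<float>& src,
                                       const std::array<std::size_t, 4>& dims) {
  const std::size_t O = dims[0], I = dims[1], H = dims[2], W = dims[3];
  assert(src.size() == O * I * H * W);
  std::vector<float> dst(src.size());
  for (std::size_t o = 0; o < O; ++o) {
    for (std::size_t i = 0; i < I; ++i) {
      for (std::size_t h = 0; h < H; ++h) {
        for (std::size_t w = 0; w < W; ++w) {
          // Row-major offset in the source [O, I, H, W] tensor.
          const std::size_t src_idx = ((o * I + i) * H + h) * W + w;
          // Row-major offset in the destination [O, H, W, I] tensor.
          const std::size_t dst_idx = ((o * H + h) * W + w) * I + i;
          dst[dst_idx] = src[src_idx];
        }
      }
    }
  }
  return dst;
}
```

The input-channel index moves from the second-slowest to the fastest-varying position, so a naive copy is strided on one side; that access pattern is what a tiled shader is meant to improve.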

Motivation and Context

See above.

daijh · 2026-01-09T07:37:48Z

@guschmue @fs-eire @qjia7 PTAL

Copilot

Pull request overview

This PR optimizes generic 4D transpose operations by migrating the specialized OIHW2OHWIProgram shader from the Im2ColMatMul operator to the Transpose operator. The migration enables reuse of this optimized shader for any 4D tensor transpose with the {0, 2, 3, 1} permutation pattern, while also fixing a calculation bug in the process.

- Moves the OIHW2OHWIProgram class and implementation from im2col_matmul to transpose
- Relocates the WGSL shader template from the nn/ directory to the tensor/ directory
- Fixes a bug where H_W_tiles was calculated using kernel_height * kernel_height instead of kernel_height * kernel_width
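
A minimal sketch of why the H_W_tiles fix matters, assuming the tile count is a ceiling division of the flattened H*W extent by a tile size. The ceiling-division form, the `tile_size` parameter, and the function names are assumptions for illustration; the thread only states which product was used.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for a >= 0 and b > 0.
constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b) {
  return (a + b - 1) / b;
}

// Fixed form: tiles cover kernel_height * kernel_width elements.
// tile_size is a hypothetical parameter, not taken from the PR.
constexpr std::uint32_t HWTiles(std::uint32_t kernel_height,
                                std::uint32_t kernel_width,
                                std::uint32_t tile_size) {
  return CeilDiv(kernel_height * kernel_width, tile_size);
}

// Pre-fix form described in the review: the height is multiplied by itself.
constexpr std::uint32_t HWTilesBuggy(std::uint32_t kernel_height,
                                     std::uint32_t /*kernel_width*/,
                                     std::uint32_t tile_size) {
  return CeilDiv(kernel_height * kernel_height, tile_size);
}
```

Under these assumptions the two forms agree for square kernels and diverge only when the height and width differ, for example a 3x5 kernel.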

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/tensor/transpose.h | Adds OIHW2OHWIProgram class declaration with uniform variable definitions |
| onnxruntime/core/providers/webgpu/tensor/transpose.cc | Implements OIHW2OHWIProgram shader generation and integrates the optimization into DoTranspose with proper threshold checks; fixes bug in H_W_tiles calculation |
| onnxruntime/core/providers/webgpu/tensor/oihw_to_ohwi.wgsl.template | Adds the WGSL shader template for the OIHW to OHWI transpose operation with proper bounds checking and workgroup synchronization |
| onnxruntime/core/providers/webgpu/nn/im2col_matmul.h | Removes OIHW2OHWIProgram class declaration as it's moved to transpose |
| onnxruntime/core/providers/webgpu/nn/im2col_matmul.cc | Replaces local OIHW2OHWI implementation with call to TransposeKernel; adds conv.h include for the function |

daijh · 2026-01-12T01:34:25Z

Fix the CI error for span.

guschmue · 2026-01-12T16:25:37Z

daijh · 2026-01-13T06:14:56Z

Thanks for analyzing the CI failure.

The webgpu_minimal_build_edge_build_x64_RelWithDebInfo environment uses different build options than my local setup.
Added AreSpansEqual to resolve the compiler errors regarding gsl::span comparisons.
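
The actual signature of AreSpansEqual is not shown in this thread. As a rough sketch of the idea, an element-wise comparison can be written so that it does not depend on the span type providing `operator==`. The name `AreSpansEqualSketch` and its template form are assumptions, and standard containers stand in for gsl::span here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise equality for any two contiguous views or containers that
// expose size(), begin() and end(). Hypothetical stand-in for the helper
// added in the PR; it avoids relying on operator== of the span type.
template <typename A, typename B>
bool AreSpansEqualSketch(const A& a, const B& b) {
  return static_cast<std::size_t>(a.size()) == static_cast<std::size_t>(b.size()) &&
         std::equal(a.begin(), a.end(), b.begin());
}
```

A helper like this could be used, for instance, to test whether a permutation equals {0, 2, 3, 1} before choosing the specialized path.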

daijh · 2026-01-15T01:55:46Z

React Native CI Pipeline / React Native CI Android (pull_request)

This failure appears to be environment-specific and is unrelated to the changes in this PR.

/usr/local/lib/android/sdk/emulator/qemu/linux-x86_64/qemu-system-x86_64: error while loading shared libraries: libpulse.so.0: cannot open shared object file: No such file or directory

guschmue · 2026-01-16T17:35:25Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-01-16T17:35:44Z

Azure Pipelines successfully started running 4 pipeline(s).

daijh marked this pull request as draft January 8, 2026 06:58

daijh force-pushed the optimize-4D-weights-transpose branch from 1b8e605 to be0ea7b January 9, 2026 04:59

daijh marked this pull request as ready for review January 9, 2026 07:36

guschmue self-assigned this Jan 9, 2026

guschmue requested a review from Copilot January 9, 2026 17:41

guschmue added the ep:WebGPU ort-web webgpu provider label Jan 9, 2026

Copilot started reviewing on behalf of guschmue January 9, 2026 17:42

Copilot AI reviewed Jan 9, 2026

daijh added 3 commits January 13, 2026 10:49

Optimize 4D Transpose

ba15ccc

Fix span conversion

f7fc815

Add AreSpansEqual util

61b825a

daijh force-pushed the optimize-4D-weights-transpose branch from b0d9657 to 61b825a January 13, 2026 05:58

Move local functions to anonymous namespace

881e33a

guschmue approved these changes Jan 21, 2026

guschmue merged commit 2aaf21b into microsoft:main Jan 21, 2026
92 of 93 checks passed

daijh deleted the optimize-4D-weights-transpose branch January 22, 2026 01:51
