webgpu: optimize Gemm and MatMul using subgroup feature #26433
guschmue merged 17 commits into microsoft:main
Conversation
@guschmue @fs-eire @qjia7 @jchen10 @Jiawei-Shao
@xhcao It seems you didn't include your test cases in this PR. What's your concern? Also, please provide some performance data for reference.
Pull Request Overview
This PR adds Intel-specific optimizations for GEMM and MatMul operations in the WebGPU provider, targeting Intel Xe-2 GPU architectures (xe-2lpg and xe-2hpg) using subgroup operations.
- Implements subgroup-based GEMM and MatMul kernels optimized for Intel GPUs
- Adds vendor-specific code paths that are conditionally executed based on adapter info and matrix dimensions
- Integrates Intel optimizations into existing GEMM and MatMul operators
Reviewed Changes
Copilot reviewed 2 out of 10 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| matmul_intel.h | Defines MatMulSubgroupProgram class and Intel MatMul entry points |
| matmul_intel.cc | Implements Intel-optimized MatMul with subgroup support and batched optimizations |
| gemm_utils_intel.h | Declares helper functions for reading/writing matrix data in Intel kernels |
| gemm_utils_intel.cc | Implements shader generation helpers for Intel-specific GEMM operations |
| gemm_subgroup.h | Defines constants and functions for subgroup-based matrix operations |
| gemm_subgroup.cc | Implements vec4 and scalar variants of subgroup matrix multiplication |
| gemm_intel.h | Defines GemmSubgroupProgram class and Intel GEMM entry points |
| gemm_intel.cc | Implements Intel-optimized GEMM with subgroup support |
| matmul.cc | Integrates Intel MatMul optimization check into existing MatMul operator |
| gemm.cc | Integrates Intel GEMM optimization check into existing GEMM operator |
First, I wanted to get consent to create a vendor directory.
You don't need to depend on that. You can just improve your PR with more test cases and better perf data, and make it easier to review.
Could you please merge the latest main and push? The pipeline failures should be unrelated and may be fixed by a rerun.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
I understand the desire for a vendor directory. We can try it and see how it works; we just need to be careful with the testing.

@guschmue Thank you for your understanding and support. We will fully test our changes and closely monitor the status on LNL/BMG. Once they are mature enough, we can roll them out to more devices.
fs-eire left a comment
The code generally looks good to me, but I think a few changes are needed for code-maintenance reasons:
1. We can keep the directory structure (onnxruntime/core/providers/webgpu/vendor/intel), but there is no need to add an `_intel` suffix to file names, since the vendor is already part of each file's full path (folder name).
2. Use a sub-namespace `intel` instead of including it in function/class names (there are exceptions, see item 3). For example, use a namespace like this:
   ```cpp
   namespace onnxruntime { namespace webgpu { namespace intel { ...
   ```
   so that `MatMulReadFnSourceIntel(...)` becomes `intel::MatMulReadFnSource(...)`.
3. For `CanApplyXXXIntel` or `ApplyXXXIntel`, it is OK to keep them as-is, because these specific functions are considered the "entry" to vendor-specific code.
I haven't looked closely at all the code yet, but here are several high-level comments.
Attach shader files
qjia7 left a comment
Will review the shader part tomorrow.
@qjia7 I tested the PR with ~20 models on Rocket Lake; several models showed a performance improvement, though smaller than on Lunar Lake.
The performance data for some models on Lunar Lake is shown below.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Description
Motivation and Context