webgpu: optimize Gemm and MatMul using subgroup feature #26433
guschmue merged 17 commits into microsoft:main
Conversation
@guschmue @fs-eire @qjia7 @jchen10 @Jiawei-Shao
@xhcao It seems you didn't include your test cases in this PR. What's your concern? Also, please provide some performance data for reference.
Pull Request Overview
This PR adds Intel-specific optimizations for GEMM and MatMul operations in the WebGPU provider, targeting Intel Xe-2 GPU architectures (xe-2lpg and xe-2hpg) using subgroup operations.
- Implements subgroup-based GEMM and MatMul kernels optimized for Intel GPUs
- Adds vendor-specific code paths that are conditionally executed based on adapter info and matrix dimensions
- Integrates Intel optimizations into existing GEMM and MatMul operators
Reviewed Changes
Copilot reviewed 2 out of 10 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| matmul_intel.h | Defines MatMulSubgroupProgram class and Intel MatMul entry points |
| matmul_intel.cc | Implements Intel-optimized MatMul with subgroup support and batched optimizations |
| gemm_utils_intel.h | Declares helper functions for reading/writing matrix data in Intel kernels |
| gemm_utils_intel.cc | Implements shader generation helpers for Intel-specific GEMM operations |
| gemm_subgroup.h | Defines constants and functions for subgroup-based matrix operations |
| gemm_subgroup.cc | Implements vec4 and scalar variants of subgroup matrix multiplication |
| gemm_intel.h | Defines GemmSubgroupProgram class and Intel GEMM entry points |
| gemm_intel.cc | Implements Intel-optimized GEMM with subgroup support |
| matmul.cc | Integrates Intel MatMul optimization check into existing MatMul operator |
| gemm.cc | Integrates Intel GEMM optimization check into existing GEMM operator |
First, I wanted to get consent to create a vendor directory.
You don't need to depend on that. You can just improve your PR with more test cases and better perf data, and make it easier to review.
Could you please merge the latest main and push? The pipeline failures should be unrelated and may be fixed by a rerun.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
I understand the desire for a vendor directory. We can try it and see how it works; we just need to be careful with the testing.

@guschmue Thank you for your understanding and support. We will fully test our changes and closely monitor the status on LNL/BMG. Once they are mature enough, we can roll them out to more devices.
fs-eire left a comment
The code generally looks good to me, but I think a few changes are needed for code-maintenance reasons:
1. We can keep the directory structure (onnxruntime/core/providers/webgpu/vendor/intel), but there is no need to add an `_intel` suffix to file names, since the vendor is already part of each file's full path (folder name).
2. Use a sub-namespace `intel` instead of including it in function/class names (there are exceptions, see item 3). For example, use a namespace like this:
   ```cpp
   namespace onnxruntime { namespace webgpu { namespace intel { ...
   ```
   so that `MatMulReadFnSourceIntel(...)` becomes `intel::MatMulReadFnSource(...)`.
3. For `CanApplyXXXIntel` or `ApplyXXXIntel`, it is OK to keep them as-is, because these specific functions are considered the "entry" to vendor-specific code.
I haven't looked closely at all the code yet, but here are several high-level comments.
Attach shader files
qjia7 left a comment
Will review the shader part tomorrow.
@qjia7 I tested the PR with ~20 models on Rocket Lake; several models showed a performance improvement, though smaller than on Lunar Lake.
The performance data for some models on Lunar Lake is shown below.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Description
Motivation and Context