
webgpu: Refactor SubgroupMatrixMatMulNBits to vendor-agnostic config …#28109

Merged
guschmue merged 8 commits into main from opt/webgpu-vulkan-perf on Apr 25, 2026
Conversation

qjia7 (Contributor) commented Apr 17, 2026

…+ add NVIDIA 16x16x16

Refactor subgroup matrix MatMulNBits support from vendor-specific (Apple/Intel) to a vendor-agnostic config-based approach. Any GPU reporting a matching subgroup matrix config from Dawn is now automatically supported.

Key changes:

  • Replace vendor-specific config table with SupportedSubgroupMatrixConfig struct containing {componentType, resultComponentType, M, N, K, subgroupMinSize, subgroupMaxSize, needsPrepack}. No architecture or backendType required.
  • Remove vendor_ member from SubgroupMatrixMatMulNBitsProgram. Shader selection is now driven by config dimensions (8x8x8, 8x16x16, 16x16x16).
  • Remove vendor gate in matmul_nbits.cc call site.
  • Rename shader templates: _apple -> _8x8x8, _intel -> _8x16x16.
  • Add new 16x16x16 shader template for NVIDIA Blackwell (RTX 5080).
    • 4 subgroups x 32 lanes = 128 threads per workgroup
    • 64x64 tile with 16x16 subgroup matrices
    • Bounds-checked output via scratch buffer for partial M tiles
  • Fix prepack shader OOB reads: add scalar fallback with zero-fill for partial blocks where M is not a multiple of kSgMatM.
  • Prioritize larger configs (16x16x16 > 8x16x16 > 8x8x8) when multiple match.
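
The config entry and the prioritization rule above can be sketched in host code roughly as follows. The field names come from the PR description; the enum, types, helper name, and the volume-based (M*N*K) ordering are illustrative assumptions, not the exact ORT implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Vendor-agnostic config entry, per the PR description:
// {componentType, resultComponentType, M, N, K, subgroupMinSize,
//  subgroupMaxSize, needsPrepack}. Enum/type choices are assumptions.
enum class ComponentType { F16, F32 };

struct SupportedSubgroupMatrixConfig {
  ComponentType component_type;
  ComponentType result_component_type;
  uint32_t M, N, K;
  uint32_t subgroup_min_size;
  uint32_t subgroup_max_size;
  bool needs_prepack;
};

// Prefer larger configs (16x16x16 > 8x16x16 > 8x8x8) when several match;
// comparing M*N*K (4096 > 2048 > 512) reproduces that ordering for the
// three shapes in this PR.
std::optional<SupportedSubgroupMatrixConfig> PickLargestConfig(
    const std::vector<SupportedSubgroupMatrixConfig>& matching) {
  std::optional<SupportedSubgroupMatrixConfig> best;
  for (const auto& c : matching) {
    if (!best || c.M * c.N * c.K > best->M * best->N * best->K) {
      best = c;
    }
  }
  return best;
}
```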

Verified on NVIDIA RTX 5080 (Blackwell, Vulkan backend):

  • Correctness: model-qa.py with phi4-graph-prune produces identical output to D3D12 baseline
  • Prefill (phi4, l=1024):
| phi4-graph-prune | D3D12 DP4A | Vulkan DP4A | Vulkan TC (16x16x16) | Vulkan TC (16x16x16_128) |
| --- | --- | --- | --- | --- |
| Prefill (tps) | 3,134 | 6,389 | 7,089 | 10,744 |
  • NVIDIA reports ChromiumExperimentalSubgroupMatrix with F16/F16 16x16x16 config


qjia7 added 4 commits April 17, 2026 09:48
…+ add NVIDIA 16x16x16

Refactor subgroup matrix MatMulNBits support from vendor-specific (Apple/Intel)
to a vendor-agnostic config-based approach. Any GPU reporting a matching
subgroup matrix config from Dawn is now automatically supported.

Key changes:
- Replace vendor-specific config table with SupportedSubgroupMatrixConfig struct
  containing {componentType, resultComponentType, M, N, K, subgroupMinSize,
  subgroupMaxSize, needsPrepack}. No architecture or backendType required.
- Remove vendor_ member from SubgroupMatrixMatMulNBitsProgram. Shader selection
  is now driven by config dimensions (8x8x8, 8x16x16, 16x16x16).
- Remove vendor gate in matmul_nbits.cc call site.
- Rename shader templates: _apple -> _8x8x8, _intel -> _8x16x16.
- Add new 16x16x16 shader template for NVIDIA Blackwell (RTX 5080).
  - 4 subgroups x 32 lanes = 128 threads per workgroup
  - 64x64 tile with 16x16 subgroup matrices
  - Bounds-checked output via scratch buffer for partial M tiles
- Fix prepack shader OOB reads: add scalar fallback with zero-fill for
  partial blocks where M is not a multiple of kSgMatM.
- Prioritize larger configs (16x16x16 > 8x16x16 > 8x8x8) when multiple match.

Verified on NVIDIA RTX 5080 (Blackwell, Vulkan backend):
- Correctness: model-qa.py with phi4-graph-prune produces identical output
  to D3D12 baseline
- Prefill (phi4, l=1024):
  - D3D12 DP4A baseline: 3,006 tps
  - Vulkan DP4A baseline: 6,155 tps
  - Vulkan tensor core (this change): 6,759 tps (+10% vs Vulkan DP4A, +125% vs D3D12)
- NVIDIA reports ChromiumExperimentalSubgroupMatrix with F16/F16 16x16x16 config
…barrier placement

- Use fast subgroupMatrixStore directly to output for full M blocks
  (sg_m_base + kSgMatM <= M), avoiding scratch overhead for the common case.
- Use scratch + scalar write only for partial M blocks at the boundary.
- Move workgroupBarrier outside the if/else to avoid divergent barrier
  (WGSL disallows workgroupBarrier in non-uniform control flow).
- Make scratch array unconditional (needed for both bias and non-bias paths).

This fixes the Invalid ShaderModule crash that occurred when the barrier
was inside a branch that different subgroups could take different sides of.
Use larger 128x128 tiles (vs 64x64) for NVIDIA Blackwell 16x16x16 config
to improve prefill throughput. Key changes:
- New WGSL template with 2x2 subgroup grid, each handling 64x64 subtile
- Load A directly from prepacked global memory (no shared memory)
- Dequant B to shared memory with padded stride (SHMEM_STRIDE=40)
- Update dispatch to ceil(N/128) x ceil(M/128)
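
The dispatch arithmetic in the last bullet is a simple ceil-division over both output dimensions; a minimal sketch, where only the 128x128 tile size comes from the commit and the helper names are ours:

```cpp
#include <cassert>
#include <cstdint>

// ceil(a/b) for unsigned values.
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

struct DispatchSize { uint32_t x, y; };

// One workgroup per 128x128 output tile: grid is ceil(N/128) x ceil(M/128).
DispatchSize TileDispatch(uint32_t M, uint32_t N, uint32_t tile = 128) {
  return {CeilDiv(N, tile), CeilDiv(M, tile)};  // x over N, y over M
}
```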
…port

- Use exact subgroup size matching (==) instead of range (>=)
- Add F32 8x8x8 config for Apple parity
- Pass is_fp16 into IsSubgroupMatrixConfigSupported to correctly
  skip F16 configs when output is F32
- Simplify accuracy_level check to apply uniformly
- Fix missing closing brace in CanApplySubgroupMatrixMatMulNBits
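
The tightened checks in the first three bullets could look roughly like this inside the config-matching code. This is a standalone sketch; the function and parameter names are ours, not the exact ORT signature of IsSubgroupMatrixConfigSupported:

```cpp
#include <cassert>
#include <cstdint>

// cfg_min/cfg_max: subgroup size range of a reported config;
// cfg_is_f16: the config's component types are F16;
// op_is_fp16: the MatMulNBits op computes in FP16.
bool IsSubgroupMatrixConfigUsable(uint32_t cfg_min, uint32_t cfg_max,
                                  bool cfg_is_f16,
                                  uint32_t adapter_subgroup_size,
                                  bool op_is_fp16) {
  // Exact subgroup size matching (==), not a range check (>=).
  if (cfg_min != adapter_subgroup_size || cfg_max != adapter_subgroup_size) {
    return false;
  }
  // Skip F16 configs when output is F32 (and, per a later commit in this
  // PR, F32 configs when output is FP16).
  return cfg_is_f16 == op_is_fp16;
}
```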
qjia7 changed the title from "[Don't review] webgpu: Refactor SubgroupMatrixMatMulNBits to vendor-agnostic config …" to "webgpu: Refactor SubgroupMatrixMatMulNBits to vendor-agnostic config …" on Apr 22, 2026
@qjia7 qjia7 marked this pull request as ready for review April 22, 2026 10:25
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Apr 22, 2026
The prepack buffer was sized to ceil(M/sg_mat_m)*sg_mat_m rows, but the
matmul shader dispatches workgroups covering ceil(M/tile_size_a)*tile_size_a
rows. When M < tile_size_a (e.g. M=32 with 128x128 tiles), subgroups in
the matmul shader would read past the end of the prepack buffer, causing
a device-lost error.

Fix: move tile size computation before prepack allocation and pad the
prepack buffer to the workgroup tile size. Also remove unnecessary
zero-fill in the prepack shader for OOB rows — the matmul shader already
bounds-checks before storing output, so fully OOB prepack blocks can
skip entirely.
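
The sizing mismatch and its fix reduce to a few lines of host arithmetic. The constants in the test mirror the M=32 / 128x128-tile / 16x16-matrix example above; the function names are illustrative, not the ORT source's:

```cpp
#include <cassert>
#include <cstdint>

// Round v up to a multiple of m.
constexpr uint32_t CeilTo(uint32_t v, uint32_t m) { return (v + m - 1) / m * m; }

// Before the fix: prepack buffer rows padded only to the subgroup matrix
// size (kSgMatM).
uint32_t PrepackRowsBefore(uint32_t M, uint32_t sg_mat_m) {
  return CeilTo(M, sg_mat_m);
}

// After the fix: padded to the matmul shader's workgroup tile size
// (tile_size_a), which is how many rows the dispatched workgroups read.
uint32_t PrepackRowsAfter(uint32_t M, uint32_t tile_size_a) {
  return CeilTo(M, tile_size_a);
}
```

With M=32, kSgMatM=16, and tile_size_a=128, the old buffer held 32 rows while the shader read 128 — the OOB gap that caused the device-lost error.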
Copilot AI left a comment

Pull request overview

Refactors WebGPU SubgroupMatrix MatMulNBits support from vendor-gated (Apple/Intel) logic to a vendor-agnostic, adapter-reported subgroup-matrix-config matching approach, and adds a new 16x16x16 (128x128 tile) shader variant targeting NVIDIA Blackwell-class GPUs.

Changes:

  • Introduces a vendor-agnostic SupportedSubgroupMatrixConfig table and matching logic driven by Dawn-reported subgroup matrix configs.
  • Renames/rewrites subgroup-matrix WGSL templates into dimension-keyed variants (8x8x8, 8x16x16) and adds a new 16x16x16_128 shader.
  • Updates prepack WGSL to avoid OOB reads via a scalar fallback for partial M blocks, and removes vendor gating at the MatMulNBits call site.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits_prepack.wgsl.template | Adds partial-block handling in prepack to avoid OOB reads and updates documentation about padding behavior. |
| onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits_8x8x8.wgsl.template | Adds the 8x8x8 subgroup-matrix matmul shader template (formerly vendor-specific). |
| onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits_8x16x16.wgsl.template | Adds the 8x16x16 subgroup-matrix matmul shader template (formerly Intel-specific). |
| onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits_16x16x16_128.wgsl.template | Adds the 16x16x16 subgroup-matrix matmul shader template using 128x128 tiles and edge-tile bounds-checked stores. |
| onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.h | Removes vendor string plumbing from the program API. |
| onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.cc | Replaces vendor-specific config matching with a vendor-agnostic config table; selects shaders by config dimensions; updates prepack padding logic. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc | Removes vendor gate so subgroup-matrix path is enabled for any adapter reporting a supported config. |


Add M=100, N=256, K=128, block_size=32 test cases to Float16_4b_Accuracy0
and Float16_4b_Accuracy4. These dimensions meet SubgroupMatrix constraints
(block_size=32, N%64==0, K%32==0) and M=100 is non-128-aligned, exercising
the prepack buffer padding fix from commit 9581821.
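
The constraints quoted above reduce to a small predicate (the helper name is ours, not the ORT source's):

```cpp
#include <cassert>
#include <cstdint>

// SubgroupMatrix applicability constraints from the commit message:
// block_size == 32, N % 64 == 0, K % 32 == 0. M is unconstrained here,
// which is why M=100 (not 128-aligned) exercises the prepack padding path.
bool MeetsSubgroupMatrixConstraints(uint32_t block_size, uint32_t N, uint32_t K) {
  return block_size == 32 && N % 64 == 0 && K % 32 == 0;
}
```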
qjia7 added a commit that referenced this pull request Apr 23, 2026
- Skip F32 subgroup matrix configs when output is FP16 (symmetric with
  the existing F16-skip-when-F32 filter)
- Fix misleading prepack comment: not all matmul shaders bounds-check
  edge tiles, so don't promise they do

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@qjia7 qjia7 force-pushed the opt/webgpu-vulkan-perf branch 2 times, most recently from dab1ba9 to a3f7051 Compare April 23, 2026 06:34
qjia7 added 2 commits April 23, 2026 17:29
… N=192 test

Remove 63 workgroupBarrier() calls from the bias and non-bias edge-tile
output paths in the 16x16x16_128 shader. These barriers are unnecessary
because subgroupMatrixStore and subsequent scalar reads from coopmat_stage
execute within the same subgroup (lockstep), and each subgroup writes to
its own non-overlapping stage_base region.

Also add N=192 (partial N tile, not divisible by 128) test cases and
remove a duplicate #include in subgroup_matrix_matmul_nbits.h.
- Skip F32 subgroup matrix configs when output is FP16 (symmetric with
  the existing F16-skip-when-F32 filter)
- Fix misleading prepack comment: not all matmul shaders bounds-check
  edge tiles, so don't promise they do
qjia7 (Contributor, Author) commented Apr 23, 2026

cc @jchen10

@guschmue guschmue enabled auto-merge (squash) April 25, 2026 00:25
@guschmue guschmue merged commit 954dbce into main Apr 25, 2026
98 of 103 checks passed
@guschmue guschmue deleted the opt/webgpu-vulkan-perf branch April 25, 2026 02:29