webgpu: Optimize DP4A SmallM MatMulNBits tiling by qjia7 · Pull Request #27910 · microsoft/onnxruntime

qjia7 · 2026-03-31T07:27:50Z

This pull request adjusts the tiling strategy used for small M in the DP4A MatMulNBits kernel. The changes aim to improve performance and compatibility, especially on specific GPU vendors such as Qualcomm.

On Qualcomm, this improves token generation from ~20 tps to ~25 tps.

Change the default DP4A SmallM tile parameters from tile_size_k_vec=16, tile_size_n=32 to tile_size_k_vec=32, tile_size_n=4. This increases K-parallelism (32 threads along K vs 16) at the cost of fewer concurrent B rows (4 vs 32). This configuration was previously Intel-only and is now the default for all vendors.

Changes:
- dp4a_matmul_nbits.cc: Default tile params to k_vec=32, tile_n=4; remove Intel-specific override (now redundant).
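To make the trade-off concrete, here is a minimal arithmetic sketch of how the two tile configurations divide work. It is not the code from dp4a_matmul_nbits.cc: the struct, the helper names, and the workgroup size of 128 are all assumptions made for illustration, as is the premise that one workgroup covers tile_size_n rows of B. Only the two parameter pairs come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration; names below are NOT taken from the ONNX Runtime source.
struct TileConfig {
  uint32_t tile_size_k_vec;  // threads cooperating along K for one row of B
  uint32_t tile_size_n;      // rows of B covered by one workgroup (assumed)
};

// Assumed workgroup size, chosen only so the arithmetic is concrete.
constexpr uint32_t kWorkgroupSize = 128;

constexpr TileConfig kOldDefault{16, 32};  // previous default
constexpr TileConfig kNewDefault{32, 4};   // new default (previously Intel-only)

// How many groups of K-threads fit in one workgroup.
constexpr uint32_t SubTileCount(TileConfig c) {
  return kWorkgroupSize / c.tile_size_k_vec;
}

// Rows of B that each group of K-threads processes one after another.
constexpr uint32_t RowsPerSubTile(TileConfig c) {
  return c.tile_size_n / SubTileCount(c);
}

// Workgroups needed to cover n rows of B (ceiling division).
constexpr uint32_t WorkgroupsForN(uint32_t n, TileConfig c) {
  return (n + c.tile_size_n - 1) / c.tile_size_n;
}

// Runtime sanity check: the tile shape must divide evenly under these assumptions.
inline void CheckConfig(TileConfig c) {
  assert(kWorkgroupSize % c.tile_size_k_vec == 0);
  assert(c.tile_size_n % SubTileCount(c) == 0);
}
```

Under these assumptions, the old defaults give 8 thread groups that each walk 4 rows of B sequentially, while the new defaults give 4 thread groups that each handle a single row with twice as many threads along K. For a hypothetical N of 4096, the dispatch grows from 128 workgroups to 1024 smaller units of work.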

qjia7 marked this pull request as ready for review March 31, 2026 08:06

guschmue added the ep:WebGPU ort-web webgpu provider label Mar 31, 2026

guschmue approved these changes Mar 31, 2026

guschmue merged commit f22e3a9 into main Mar 31, 2026
98 of 99 checks passed

guschmue deleted the webgpu-dp4a-smallm-optimization branch March 31, 2026 21:29

BrewTestBot mentioned this pull request Apr 20, 2026

onnxruntime 1.25.0 Homebrew/homebrew-core#278543

Merged

guschmue merged 1 commit into main from webgpu-dp4a-smallm-optimization