webgpu: Optimize DP4A SmallM MatMulNBits tiling #27910

Merged
guschmue merged 1 commit into main from webgpu-dp4a-smallm-optimization on Mar 31, 2026

qjia7 commented Mar 31, 2026

This pull request adjusts the tiling strategy of the SmallM variant of the DP4A MatMulNBits kernel. The change is aimed at improving performance and compatibility across GPU vendors.

On Qualcomm, this improves token generation from ~20 tps to ~25 tps.

Change the default DP4A SmallM tile parameters from tile_size_k_vec=16,
tile_size_n=32 to tile_size_k_vec=32, tile_size_n=4. This increases
K-parallelism (32 threads along K instead of 16) at the cost of fewer
concurrent B rows (4 instead of 32). These values were previously
used only on Intel and are now the default for all vendors.
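
The selection logic before and after this change can be sketched roughly as follows. This is only an illustration under stated assumptions: `Vendor`, `TileParams`, `SelectTileParamsBefore`, `SelectTileParamsAfter`, and `NumTilesN` are hypothetical names, not identifiers from dp4a_matmul_nbits.cc, and the ceiling-division tile count is an assumed way of covering N rather than something stated in this PR.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; they are not the
// identifiers used in dp4a_matmul_nbits.cc.
enum class Vendor { kIntel, kQualcomm, kOther };

struct TileParams {
  std::uint32_t tile_size_k_vec;  // threads cooperating along K
  std::uint32_t tile_size_n;      // B rows handled per tile
};

// Before the change: 16/32 by default, with 32/4 only on Intel.
constexpr TileParams SelectTileParamsBefore(Vendor vendor) {
  return vendor == Vendor::kIntel ? TileParams{32, 4} : TileParams{16, 32};
}

// After the change: 32/4 for every vendor, so the override is redundant.
constexpr TileParams SelectTileParamsAfter(Vendor /*vendor*/) {
  return TileParams{32, 4};
}

// Assumption: the number of tiles along N is a ceiling division.
constexpr std::uint32_t NumTilesN(std::uint32_t n, TileParams p) {
  return (n + p.tile_size_n - 1) / p.tile_size_n;
}
```

Under these assumptions, a hypothetical N of 4096 would be split into 128 tiles with the old default (tile_size_n=32) and 1024 tiles with the new one (tile_size_n=4), with each tile using twice as many threads along K.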

Changes:
- dp4a_matmul_nbits.cc: Default tile params to k_vec=32, tile_n=4;
  remove Intel-specific override (now redundant).

qjia7 marked this pull request as ready for review March 31, 2026 08:06
guschmue added the ep:WebGPU (ort-web webgpu provider) label Mar 31, 2026
guschmue merged commit f22e3a9 into main Mar 31, 2026
98 of 99 checks passed
guschmue deleted the webgpu-dp4a-smallm-optimization branch March 31, 2026 21:29