
ggml-webgpu: Improve the mat-vec and mat-mat of MUL_MAT_ID #22464

Merged
reeselevine merged 4 commits into ggml-org:master from yomaytk:mul_mat_id_vec
Apr 30, 2026

Conversation

@yomaytk
Contributor

@yomaytk yomaytk commented Apr 28, 2026

Overview

This PR improves the mat-vec computation of MUL_MAT_ID. The computation logic is the same as mul_mat_vec.wgsl. I confirmed that this change significantly improves performance when running DeepSeek-V2 Q6_K (16B total / 2.4B active parameters) in my environment (M2, Metal 4). The three phases (pp1, tg1, tg128) that are dominated by mul-mat-id-vec improve as expected.

This PR supports only Q8_0 and Q6_K because the model uses those types for MUL_MAT_ID, but it would be valuable to support other types too.
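
For readers unfamiliar with the two variants benchmarked below, here is a minimal WGSL sketch of the idea. This is not the PR's actual shader: it assumes already-dequantized f32 data and invented binding names, and it omits combining the per-subgroup sums. wg_reduce combines per-thread partial sums through workgroup shared memory; sg_reduce uses a hardware subgroup reduction instead.

```wgsl
enable subgroups;

const WG_SIZE: u32 = 64u;

@group(0) @binding(0) var<storage, read> row: array<f32>;     // one matrix row
@group(0) @binding(1) var<storage, read> vec_in: array<f32>;  // the input vector
@group(0) @binding(2) var<storage, read_write> out: array<f32>;

var<workgroup> partials: array<f32, WG_SIZE>;

@compute @workgroup_size(WG_SIZE)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
    // Each thread accumulates a strided slice of the dot product.
    var acc = 0.0;
    for (var i = lid.x; i < arrayLength(&row); i += WG_SIZE) {
        acc += row[i] * vec_in[i];
    }

    // sg_reduce: one hardware reduction per subgroup (combining the
    // per-subgroup sums is omitted here).
    let sg_sum = subgroupAdd(acc);
    _ = sg_sum;

    // wg_reduce: tree reduction through shared memory.
    partials[lid.x] = acc;
    workgroupBarrier();
    for (var s = WG_SIZE / 2u; s > 0u; s /= 2u) {
        if (lid.x < s) { partials[lid.x] += partials[lid.x + s]; }
        workgroupBarrier();
    }
    if (lid.x == 0u) { out[0] = partials[0]; }
}
```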

Benchmark

command: llama-bench -m DeepSeek-V2-Lite-Chat-Q6_K.gguf -fa 1 -p 0,1,512 -n 0,1,128 -r 3

E2E test (t/s)

| Test | master | wg_reduce | sg_reduce | sg vs master |
|---|---|---|---|---|
| pp1 | 34.46 | 45.35 | 48.83 | 1.42x |
| pp512 | 647.03 | 653.54 | 659.72 | 1.02x |
| tg1 | 33.83 | 48.52 | 49.88 | 1.47x |
| tg128 | 34.13 | 48.33 | 49.40 | 1.45x |

MUL_MAT_ID kernel GPU time sum of tg128 (using GGML_WEBGPU_GPU_PROFILE)

| Kernel | master | wg_reduce | sg_reduce |
|---|---|---|---|
| q8_0 path | 941.1 ms (mul_mat_id_q8_0_f32) | 268.6 ms (id_vec_q8_0_f32_wg_reduce) | 215.2 ms (id_vec_q8_0_f32_sg_reduce) |
| q6_K path | 3112.1 ms (mul_mat_id_q6_K_f32) | 609.6 ms (id_vec_q6_K_f32_wg_reduce) | 463.1 ms (id_vec_q6_K_f32_sg_reduce) |
| Total | 4053.2 ms | 878.2 ms | 678.3 ms |

Requirements

- [x] I have read and agree with the contributing guidelines
- [x] AI usage disclosure: YES - I used AI to analyze the profiling data and to investigate how MoE model inference behaves in the code base.

@yomaytk yomaytk requested a review from a team as a code owner April 28, 2026 09:38
// gather the selected experts for the target token.
for (var col = thread_id; col < params.n_expert_used; col += WG_SIZE) {
let expert = ids[params.offset_ids + col];
gathered_count_ids[expert] = 1;
Contributor Author


The selected experts are determined by top-k routing, so we can assume the experts selected for a given token are distinct.
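
As a stand-alone illustration of that assumption (hypothetical bindings and names, not the PR's shader): because the ids chosen for one token are pairwise distinct, no two threads ever store to the same slot, so the flag array needs no atomics.

```wgsl
const WG_SIZE: u32 = 64u;

struct Params {
    offset_ids: u32,
    n_expert_used: u32,
}

@group(0) @binding(0) var<storage, read> ids: array<u32>;
@group(0) @binding(1) var<storage, read_write> expert_used: array<u32>;
@group(0) @binding(2) var<uniform> params: Params;

@compute @workgroup_size(WG_SIZE)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
    for (var col = lid.x; col < params.n_expert_used; col += WG_SIZE) {
        // Distinct top-k ids mean each iteration, on any thread, writes a
        // different slot, so a plain store is race-free. If duplicates were
        // possible, counting would need array<atomic<u32>> and atomicAdd.
        expert_used[ids[params.offset_ids + col]] = 1u;
    }
}
```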

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels Apr 28, 2026
#define BLOCK_SIZE_BYTES 34
#define THREADS_PER_BLOCK 4
#define ELEMS_PER_THREAD (BLOCK_SIZE/THREADS_PER_BLOCK)
fn accumulate_vec_dot(thread_id: u32, row_base: u32, src0_batch_offset: u32, src1_idx_base: u32) -> array<f32, OUTPUTS_PER_WG> {
Contributor Author

@yomaytk yomaytk Apr 28, 2026


@reeselevine accumulate_vec_dot is the same as the logic in mul_mat_vec.wgsl, so it might be good to extract all the accumulation logic from mul_mat_vec.wgsl into a shared function and reuse it between mul_mat_vec and mul_mat_id_vec. WDYT?

Contributor

@reeselevine reeselevine Apr 28, 2026


Yes, that makes sense to me. Having a template file like mul_mat_tmpl.wgsl, but specifically for the vector versions, might be a good approach.
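
For illustration, here is a sketch of the shape such a shared file could take (invented names and bindings, plain f32 in place of quantized blocks, and one output row per workgroup; the real template would substitute quant-specific accumulation per type):

```wgsl
const WG_SIZE: u32 = 64u;

@group(0) @binding(0) var<storage, read> src0: array<f32>;
@group(0) @binding(1) var<storage, read> src1: array<f32>;
@group(0) @binding(2) var<storage, read_write> dst: array<f32>;
@group(0) @binding(3) var<storage, read> ids: array<u32>;

struct Params {
    k: u32,
    offset_ids: u32,
}
@group(0) @binding(4) var<uniform> params: Params;

var<workgroup> partials: array<f32, WG_SIZE>;

// Shared accumulation: the inner dot-product loop is identical for
// mul_mat_vec and mul_mat_id_vec; in the template the body would be
// generated per quant type instead of reading plain f32.
fn accumulate_vec_dot(thread_id: u32, row_base: u32, k: u32) -> f32 {
    var acc = 0.0;
    for (var i = thread_id; i < k; i += WG_SIZE) {
        acc += src0[row_base + i] * src1[i];
    }
    return acc;
}

@compute @workgroup_size(WG_SIZE)
fn mul_mat_id_vec(@builtin(local_invocation_id) lid: vec3<u32>,
                  @builtin(workgroup_id) wid: vec3<u32>) {
    // The only ID-specific part: the routed expert picks which rows of
    // src0 this workgroup reads. A plain mul_mat_vec entry point would
    // compute row_base from wid.x directly.
    let expert = ids[params.offset_ids + wid.x];
    let partial = accumulate_vec_dot(lid.x, expert * params.k, params.k);

    partials[lid.x] = partial;
    workgroupBarrier();
    for (var s = WG_SIZE / 2u; s > 0u; s /= 2u) {
        if (lid.x < s) { partials[lid.x] += partials[lid.x + s]; }
        workgroupBarrier();
    }
    if (lid.x == 0u) { dst[wid.x] = partials[0]; }
}
```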

Contributor Author


OK, let me try that.

@yomaytk
Contributor Author

yomaytk commented Apr 29, 2026

I added mul_mat_vec_acc.tmpl and extended support to the other types that the existing mul_mat_vec supports; once the template is shared, adding a type is just a matter of registering it in the pipeline.
E2E benchmark on DeepSeek-V2-Lite Q4_K_M shows the following improvement:

E2E test (t/s)

| Test | master | yomaytk/mul_mat_id_vec (sg_reduce) | speedup |
|---|---|---|---|
| pp1 | 38.18 | 57.11 | 1.50x |
| pp512 | 659.41 | 664.59 | 1.01x |
| tg1 | 38.93 | 60.34 | 1.55x |
| tg128 | 39.30 | 60.73 | 1.55x |

Contributor Author

@yomaytk yomaytk Apr 29, 2026


This change is not related to this PR, but I removed it since test-backend-ops doesn't include this type for MUL_MAT, and Q8_1 is not listed in supports_op for MUL_MAT, making this case unreachable. Please let me know if there's a reason to keep it.

Contributor


yeah, my understanding is that Q8_1 is deprecated/not used, so we don't really need to support it.

@reeselevine
Contributor

Thanks, looks good. There are a few conflicts that need to be resolved; also, with #22504, you can probably add support for the i-quants to the mul_mat_id mat-mat path.

@yomaytk
Contributor Author

yomaytk commented Apr 30, 2026

Ok, thanks! After #22504 is merged, I'll resolve the conflicts and add i-quant support to the MUL_MAT_ID mat-mat path in this PR. Does that sound good?
Or, if you'd prefer to keep this PR scope smaller, I can address the i-quant mat-mat part in a follow-up PR.

@reeselevine
Contributor

Go ahead and add the I-quant support in this PR!

@yomaytk
Contributor Author

yomaytk commented Apr 30, 2026

I've finished the necessary updates.

@yomaytk yomaytk changed the title from "ggml-webgpu: Improve the mat-vec performance of MUL_MAT_ID" to "ggml-webgpu: Improve the mat-vec and mat-mat of MUL_MAT_ID" Apr 30, 2026
uint32_t sg_mat_n = 0;
uint32_t sg_mat_k = 0;
uint32_t max_subgroup_size = 0;
uint32_t n_experts = 0;
Contributor


Sorry, one last change here: I want to avoid putting any op-specific information in this context, and instead derive it in the library as needed from the context. In this case, we can derive n_experts from the src0 tensor directly.

Contributor Author


Ah, good point. I'll update.

@reeselevine reeselevine merged commit a95a11e into ggml-org:master Apr 30, 2026
42 of 46 checks passed
tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
@yomaytk yomaytk deleted the mul_mat_id_vec branch May 1, 2026 01:40
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)

* Add mat-vec fast path of MUL_MAT_ID.

* Add shared accumulation vec logic and the other types supports.

* Add i-quant mat-mat for MUL_MAT_ID and fix some parts

* Remove n_experts from shader_lib_context.
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), WebGPU
