
ggml-webgpu: Improve the mat-vec and mat-mat of MUL_MAT_ID #22464

Merged
reeselevine merged 4 commits into ggml-org:master from yomaytk:mul_mat_id_vec
Apr 30, 2026

Conversation

@yomaytk
Contributor

@yomaytk yomaytk commented Apr 28, 2026

Overview

This PR improves the mat-vec computation of MUL_MAT_ID. The computation logic is the same as mul_mat_vec.wgsl. I confirmed that this change significantly improves performance when running DeepSeek-V2 Q6_K (16B total / 2.4B active parameters) in my environment (M2, Metal 4). The three phases (pp1, tg1, tg128) that are dominated by mul-mat-id-vec improve as expected.

This PR supports only Q8_0 and Q6_K because the model uses those types for MUL_MAT_ID, but it would be valuable to support other types too.
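
For readers unfamiliar with the two variants benchmarked below, here is a minimal WGSL sketch of the idea. This is not the PR's actual shader: it assumes already-dequantized f32 data and invented binding names, and it omits combining the per-subgroup sums. wg_reduce combines per-thread partial sums through workgroup shared memory; sg_reduce uses a hardware subgroup reduction instead.

```wgsl
enable subgroups;

const WG_SIZE: u32 = 64u;

@group(0) @binding(0) var<storage, read> row: array<f32>;     // one matrix row
@group(0) @binding(1) var<storage, read> vec_in: array<f32>;  // the input vector
@group(0) @binding(2) var<storage, read_write> out: array<f32>;

var<workgroup> partials: array<f32, WG_SIZE>;

@compute @workgroup_size(WG_SIZE)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
    // Each thread accumulates a strided slice of the dot product.
    var acc = 0.0;
    for (var i = lid.x; i < arrayLength(&row); i += WG_SIZE) {
        acc += row[i] * vec_in[i];
    }

    // sg_reduce: one hardware reduction per subgroup (combining the
    // per-subgroup sums is omitted here).
    let sg_sum = subgroupAdd(acc);
    _ = sg_sum;

    // wg_reduce: tree reduction through shared memory.
    partials[lid.x] = acc;
    workgroupBarrier();
    for (var s = WG_SIZE / 2u; s > 0u; s /= 2u) {
        if (lid.x < s) { partials[lid.x] += partials[lid.x + s]; }
        workgroupBarrier();
    }
    if (lid.x == 0u) { out[0] = partials[0]; }
}
```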

Benchmark

command: llama-bench -m DeepSeek-V2-Lite-Chat-Q6_K.gguf -fa 1 -p 0,1,512 -n 0,1,128 -r 3

E2E test (t/s)

| Test | master | wg_reduce | sg_reduce | sg vs master |
|---|---|---|---|---|
| pp1 | 34.46 | 45.35 | 48.83 | 1.42x |
| pp512 | 647.03 | 653.54 | 659.72 | 1.02x |
| tg1 | 33.83 | 48.52 | 49.88 | 1.47x |
| tg128 | 34.13 | 48.33 | 49.40 | 1.45x |

MUL_MAT_ID kernel GPU time sum of tg128 (using GGML_WEBGPU_GPU_PROFILE)

| Kernel | master | wg_reduce | sg_reduce |
|---|---|---|---|
| q8_0 path | 941.1 ms (mul_mat_id_q8_0_f32) | 268.6 ms (id_vec_q8_0_f32_wg_reduce) | 215.2 ms (id_vec_q8_0_f32_sg_reduce) |
| q6_K path | 3112.1 ms (mul_mat_id_q6_K_f32) | 609.6 ms (id_vec_q6_K_f32_wg_reduce) | 463.1 ms (id_vec_q6_K_f32_sg_reduce) |
| Total | 4053.2 ms | 878.2 ms | 678.3 ms |

Requirements

- [x] I have read and agree with the contributing guidelines
- [x] AI usage disclosure: YES - I used AI to analyze the profiling data and to investigate how MoE model inference behaves in the code base.

@yomaytk yomaytk requested a review from a team as a code owner April 28, 2026 09:38
// gather the selected experts for the target token.
for (var col = thread_id; col < params.n_expert_used; col += WG_SIZE) {
let expert = ids[params.offset_ids + col];
gathered_count_ids[expert] = 1;
Contributor Author


The selected experts are determined by top-k routing, so we can assume the experts selected for a given token are distinct.
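
As a stand-alone illustration of that assumption (hypothetical bindings and names, not the PR's shader): because the ids chosen for one token are pairwise distinct, no two threads ever store to the same slot, so the flag array needs no atomics.

```wgsl
const WG_SIZE: u32 = 64u;

struct Params {
    offset_ids: u32,
    n_expert_used: u32,
}

@group(0) @binding(0) var<storage, read> ids: array<u32>;
@group(0) @binding(1) var<storage, read_write> expert_used: array<u32>;
@group(0) @binding(2) var<uniform> params: Params;

@compute @workgroup_size(WG_SIZE)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
    for (var col = lid.x; col < params.n_expert_used; col += WG_SIZE) {
        // Distinct top-k ids mean each iteration, on any thread, writes a
        // different slot, so a plain store is race-free. If duplicates were
        // possible, counting would need array<atomic<u32>> and atomicAdd.
        expert_used[ids[params.offset_ids + col]] = 1u;
    }
}
```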

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels Apr 28, 2026
#define BLOCK_SIZE_BYTES 34
#define THREADS_PER_BLOCK 4
#define ELEMS_PER_THREAD (BLOCK_SIZE/THREADS_PER_BLOCK)
fn accumulate_vec_dot(thread_id: u32, row_base: u32, src0_batch_offset: u32, src1_idx_base: u32) -> array<f32, OUTPUTS_PER_WG> {
Contributor Author

@yomaytk yomaytk Apr 28, 2026


@reeselevine accumulate_vec_dot is the same as the logic in mul_mat_vec.wgsl, so it might be good to extract all the accumulation logic from mul_mat_vec.wgsl into a shared function and reuse it between mul_mat_vec and mul_mat_id_vec. WDYT?

Contributor

@reeselevine reeselevine Apr 28, 2026


Yes, that makes sense to me. Having a template file like mul_mat_tmpl.wgsl, but specifically for the vector versions, might be a good approach.
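
For illustration, here is a sketch of the shape such a shared file could take (invented names and bindings, plain f32 in place of quantized blocks, and one output row per workgroup; the real template would substitute quant-specific accumulation per type):

```wgsl
const WG_SIZE: u32 = 64u;

@group(0) @binding(0) var<storage, read> src0: array<f32>;
@group(0) @binding(1) var<storage, read> src1: array<f32>;
@group(0) @binding(2) var<storage, read_write> dst: array<f32>;
@group(0) @binding(3) var<storage, read> ids: array<u32>;

struct Params {
    k: u32,
    offset_ids: u32,
}
@group(0) @binding(4) var<uniform> params: Params;

var<workgroup> partials: array<f32, WG_SIZE>;

// Shared accumulation: the inner dot-product loop is identical for
// mul_mat_vec and mul_mat_id_vec; in the template the body would be
// generated per quant type instead of reading plain f32.
fn accumulate_vec_dot(thread_id: u32, row_base: u32, k: u32) -> f32 {
    var acc = 0.0;
    for (var i = thread_id; i < k; i += WG_SIZE) {
        acc += src0[row_base + i] * src1[i];
    }
    return acc;
}

@compute @workgroup_size(WG_SIZE)
fn mul_mat_id_vec(@builtin(local_invocation_id) lid: vec3<u32>,
                  @builtin(workgroup_id) wid: vec3<u32>) {
    // The only ID-specific part: the routed expert picks which rows of
    // src0 this workgroup reads. A plain mul_mat_vec entry point would
    // compute row_base from wid.x directly.
    let expert = ids[params.offset_ids + wid.x];
    let partial = accumulate_vec_dot(lid.x, expert * params.k, params.k);

    partials[lid.x] = partial;
    workgroupBarrier();
    for (var s = WG_SIZE / 2u; s > 0u; s /= 2u) {
        if (lid.x < s) { partials[lid.x] += partials[lid.x + s]; }
        workgroupBarrier();
    }
    if (lid.x == 0u) { dst[wid.x] = partials[0]; }
}
```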

Contributor Author


OK, let me try that.

@yomaytk
Contributor Author

yomaytk commented Apr 29, 2026

I added mul_mat_vec_acc.tmpl and extended support to the other types that the existing mul_mat_vec supports; once the template is shared, adding a type is just a matter of registering it in the pipeline.
E2E benchmark on DeepSeek-V2-Lite Q4_K_M shows the following improvement:

E2E test (t/s)

| Test | master | yomaytk/mul_mat_id_vec (sg_reduce) | speedup |
|---|---|---|---|
| pp1 | 38.18 | 57.11 | 1.50x |
| pp512 | 659.41 | 664.59 | 1.01x |
| tg1 | 38.93 | 60.34 | 1.55x |
| tg128 | 39.30 | 60.73 | 1.55x |

Contributor Author

@yomaytk yomaytk Apr 29, 2026


This change is not related to this PR, but I removed it since test-backend-ops doesn't include this type for MUL_MAT, and Q8_1 is not listed in supports_op for MUL_MAT, making this case unreachable. Please let me know if there's a reason to keep it.

Contributor


yeah, my understanding is that Q8_1 is deprecated/not used, so we don't really need to support it.

@reeselevine
Contributor

Thanks, looks good. There are a few conflicts that need to be resolved; also, with #22504, you can probably add support for the i-quants to the mul_mat_id mat-mat path.

@yomaytk
Contributor Author

yomaytk commented Apr 30, 2026

Ok, thanks! After #22504 is merged, I'll resolve the conflicts and add i-quant support to the MUL_MAT_ID mat-mat path in this PR. Does that sound good?
Or, if you'd prefer to keep this PR scope smaller, I can address the i-quant mat-mat part in a follow-up PR.

@reeselevine
Contributor

Go ahead and add the I-quant support in this PR!

@yomaytk
Contributor Author

yomaytk commented Apr 30, 2026

I've finished the necessary updates.

@yomaytk yomaytk changed the title from "ggml-webgpu: Improve the mat-vec performance of MUL_MAT_ID" to "ggml-webgpu: Improve the mat-vec and mat-mat of MUL_MAT_ID" Apr 30, 2026
uint32_t sg_mat_n = 0;
uint32_t sg_mat_k = 0;
uint32_t max_subgroup_size = 0;
uint32_t n_experts = 0;
Contributor


Sorry, one last change here: I want to avoid putting any op-specific information in this context, and instead derive it in the library as needed from the context. In this case, we can derive n_experts from the src0 tensor directly.

Contributor Author


Ah, good point. I'll update.

@reeselevine reeselevine merged commit a95a11e into ggml-org:master Apr 30, 2026
42 of 46 checks passed
tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
@yomaytk yomaytk deleted the mul_mat_id_vec branch May 1, 2026 01:40
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)

* Add mat-vec fast path of MUL_MAT_ID.

* Add shared accumulation vec logic and the other types supports.

* Add i-quant mat-mat for MUL_MAT_ID and fix some parts

* Remove n_experts from shader_lib_context.
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), WebGPU
