ggml-webgpu: Improve the mat-vec and mat-mat of MUL_MAT_ID #22464
Conversation
// gather the selected experts for the target token.
for (var col = thread_id; col < params.n_expert_used; col += WG_SIZE) {
    let expert = ids[params.offset_ids + col];
    gathered_count_ids[expert] = 1;
The selected experts are determined by top-k routing, so we can assume that the selected experts for a given token are distinct.
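To illustrate the point, here is a minimal WGSL sketch of that assumption (the constant, binding types, and the fixed expert-array size are hypothetical simplifications, not the shader in this PR): because top-k routing never selects the same expert twice for one token, each thread can mark its expert with a plain store, with no atomics or per-expert counting.

```wgsl
// Sketch only: gather relying on distinct expert ids per token.
const WG_SIZE: u32 = 64u;                                   // assumed workgroup size

@group(0) @binding(0) var<storage, read> ids: array<u32>;   // routed expert ids (simplified to u32)

var<workgroup> gathered_count_ids: array<u32, 256>;         // assumed max expert count

fn gather_selected_experts(thread_id: u32, n_expert_used: u32, offset_ids: u32) {
    for (var col = thread_id; col < n_expert_used; col += WG_SIZE) {
        let expert = ids[offset_ids + col];
        // Distinct ids mean no two threads ever target the same slot,
        // so a plain (non-atomic) store is sufficient here.
        gathered_count_ids[expert] = 1u;
    }
}
```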
#define BLOCK_SIZE_BYTES 34
#define THREADS_PER_BLOCK 4
#define ELEMS_PER_THREAD (BLOCK_SIZE/THREADS_PER_BLOCK)
fn accumulate_vec_dot(thread_id: u32, row_base: u32, src0_batch_offset: u32, src1_idx_base: u32) -> array<f32, OUTPUTS_PER_WG> {
@reeselevine accumulate_vec_dot is the same as the logic in mul_mat_vec.wgsl, so it might be good to extract all the accumulation logic from mul_mat_vec.wgsl into a shared function and reuse it between mul_mat_vec and mul_mat_id_vec. WDYT?
yes that makes sense to me. Having a template file like mul_mat_tmpl.wgsl but specifically for the vector versions might be a good approach.
OK, let me try that.
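As a rough illustration of that direction (not the actual shader in this PR; the names, bindings, and the f32 simplification are assumptions), a shared per-thread accumulation helper that both mul_mat_vec and mul_mat_id_vec could call might look like:

```wgsl
// Illustrative sketch of a shared vector-accumulation helper; the real
// template would decode quantized blocks (Q8_0, Q6_K, ...) instead of f32.
const WG_SIZE: u32 = 64u;

@group(0) @binding(0) var<storage, read> src0: array<f32>;  // weight rows
@group(0) @binding(1) var<storage, read> src1: array<f32>;  // input vector

fn accumulate_vec_dot(thread_id: u32, row_base: u32, src1_base: u32, n_k: u32) -> f32 {
    var sum = 0.0;
    // Each thread strides over the K dimension of one row; the caller
    // reduces the per-thread partial sums across the workgroup or subgroup.
    for (var k = thread_id; k < n_k; k += WG_SIZE) {
        sum += src0[row_base + k] * src1[src1_base + k];
    }
    return sum;
}
```

In this sketch, mul_mat_vec would compute row_base directly from the output row, while mul_mat_id_vec would first resolve the routed expert id and fold it into row_base; everything after that point is identical, which is what makes a shared vec template attractive.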
I added an E2E test (t/s).
This change is not related to this PR, but I removed this since test-backend-ops doesn't include this type for MUL_MAT, and Q8_1 is not listed in supports_op for MUL_MAT, making this case
unreachable. Please let me know if there's a reason to keep it.
yeah, my understanding is that Q8_1 is deprecated/not used, so we don't really need to support it.
thanks, looks good. There are a few conflicts that need to be resolved. Also, with #22504, you can probably add support for the i-quants to the mul_mat_id mat-mat path.

Ok, thanks! After #22504 is merged, I'll resolve the conflicts and add i-quant support to the MUL_MAT_ID mat-mat path in this PR. Does that sound good?

Go ahead and add the I-quant support in this PR!

I've finished the necessary updates.
uint32_t sg_mat_n = 0;
uint32_t sg_mat_k = 0;
uint32_t max_subgroup_size = 0;
uint32_t n_experts = 0;
sorry one last change here: I want to try and avoid putting any op-specific information in this context, and instead derive it in the library as needed from the context. In this case, we can derive n_experts from the src0 tensor directly.
Ah, good point. I'll update.
* 'master' of github.com:tekintian/llama.cpp: (659 commits)
  ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)
  Update llama-mmap to use ftello/fseeko (ggml-org#22497)
  common : check for null getpwuid in hf-cache (ggml-org#22550)
  vulkan: add get/set tensor 2d functions (ggml-org#22514)
  spec: fix argument typo (ggml-org#22552)
  ci : bump ty to 0.0.33 (ggml-org#22535)
  vendor : update cpp-httplib to 0.43.2 (ggml-org#22548)
  CUDA: fix tile FA kernel on Pascal (ggml-org#22541)
  scripts : add wc2wt.sh - create worktree from current HEAD (ggml-org#22513)
  add fast matmul iquants (ggml-org#22504)
  spec : fix draft model checkpoints (ggml-org#22521)
  spec : fix vocab compat checks in spec example (ggml-org#22426)
  common : do not pass prompt tokens to reasoning budget sampler (ggml-org#22488)
  hexagon: make vmem and buffer-size configurable (ggml-org#22487)
  CUDA: fuse SSM_CONV + ADD(bias) + SILU (ggml-org#22478)
  spec : disacard last drafted token with low prob (ggml-org#22506)
  sync : ggml
  ggml : bump version to 0.10.1 (ggml/1469)
  webui: fix slow mic stop and WAV encode (ggml-org#22480)
  ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (ggml-org#22293)
  ...

# Conflicts:
#	.gitignore
ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)

* Add mat-vec fast path for MUL_MAT_ID.
* Add shared vec accumulation logic and support for the other types.
* Add i-quant mat-mat for MUL_MAT_ID and fix some parts.
* Remove n_experts from shader_lib_context.
Overview
This PR improves the mat-vec computation of MUL_MAT_ID. The computation logic is the same as mul_mat_vec.wgsl. I confirmed that this change significantly improves performance when running DeepSeek-V2-Q6_K (16B-2.4B) in my environment (M2, Metal 4). The three phases (pp1, tg1, tg128) that are largely occupied by mul-mat-id-vec are improved as expected.

This PR supports only Q8_0 and Q6_K because the model uses those types for MUL_MAT_ID, but it would be valuable to support other types too.
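For readers unfamiliar with the op, here is a heavily simplified WGSL sketch of the idea (the names, sizes, and f32 weights are assumptions for illustration, not this PR's shader): the only difference from a plain mat-vec is one extra indirection through the routed expert id before the usual dot-product loop.

```wgsl
// Simplified MUL_MAT_ID mat-vec sketch: pick the expert's weight slice via
// the ids tensor, then run the same per-row dot product as a normal mat-vec.
const ROWS: u32 = 128u;   // rows per expert (assumed)
const K: u32 = 256u;      // reduction dimension (assumed)

@group(0) @binding(0) var<storage, read> ids:  array<u32>;  // routed expert ids
@group(0) @binding(1) var<storage, read> src0: array<f32>;  // [n_expert][ROWS][K] weights
@group(0) @binding(2) var<storage, read> src1: array<f32>;  // input vector
@group(0) @binding(3) var<storage, read_write> dst: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let row = gid.x;
    if (row >= ROWS) { return; }
    let expert = ids[0];                       // expert selected for this token
    let row_base = (expert * ROWS + row) * K;  // start of this row in that expert's matrix
    var sum = 0.0;
    for (var k = 0u; k < K; k++) {
        sum += src0[row_base + k] * src1[k];
    }
    dst[row] = sum;
}
```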
Benchmark
command:
llama-bench -m DeepSeek-V2-Lite-Chat-Q6_K.gguf -fa 1 -p 0,1,512 -n 0,1,128 -r 3

E2E test (t/s)

MUL_MAT_ID kernel GPU time, sum of tg128 (using GGML_WEBGPU_GPU_PROFILE)

Kernels measured: mul_mat_id_q8_0_f32, id_vec_q8_0_f32_wg_reduce, id_vec_q8_0_f32_sg_reduce, mul_mat_id_q6_K_f32, id_vec_q6_K_f32_wg_reduce, id_vec_q6_K_f32_sg_reduce

Requirements