Patch perf regression for mmq kernels in ROCm #18442
jiachengjason wants to merge 2 commits into ggml-org:master
Conversation
Recovers the performance regression from ggml-org#17917.
Did a quick compile on gfx1100; looks like small 24B @ Q6_K matches pre-#17576 performance now without forcing cuBLAS. Qwen3 30B-A3B Unsloth Q5_K_XL lost 40% of its pp vs MMQ, though, so this might need an n_experts branch like the CDNA path?
Naively copying the CDNA path as presented in #18202:

```diff
diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
index 1a297971..0a0d440a 100644
--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
@@ -333,7 +333,10 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
     }
     if (amd_wmma_available(cc)) {
-        if (ne11 <= 128 || type == GGML_TYPE_Q4_0 || type == GGML_TYPE_Q4_1 || type == GGML_TYPE_Q5_0 || type == GGML_TYPE_Q5_1) {
+        if (n_experts > 64 || ne11 <= 128) {
+            return true;
+        }
+        if (type == GGML_TYPE_Q4_0 || type == GGML_TYPE_Q4_1 || type == GGML_TYPE_Q5_0 || type == GGML_TYPE_Q5_1) {
             return true;
         }
         if (ne11 <= 256 && (type == GGML_TYPE_Q4_K || type == GGML_TYPE_Q5_K)) {
```

This makes MMQ positive or equal on all the models I happen to have on disk, but I'm not sure whether RDNA should use a different expert count. Looking through #14949, it seems this block was tuned by A/B testing pretty much every quant type, which I don't have time for today, so this likely needs to be redone for RDNA.
For my New Year's resolution I've picked this up and created more granular switching in #18537, based on results from semi-automated benchmarks. Might want to check it on Strix Halo @jiachengjason
The testing methodology that I am using for matrix multiplications is something like this:

```shell
export mn=llama_3-8b
for q in q4_0 q4_1 q5_0 q5_1 q8_0 q2_k_s q3_k_s q4_k_s q5_k_s q6_k iq1_s iq2_xxs iq2_xs iq2_s iq3_xxs iq3_xs iq3_s iq3_m iq4_nl iq4_xs; do
    echo $q
    ./bench --model models/opt/${mn}-${q}.gguf -r 1 -fa 1 -n 0 -p 2048 -ub "1-2048*2" --progress -o sql | sqlite3 llama-bench.sqlite
    sleep 10
done
```

After doing the above for 2 commits,

```shell
scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite
```

can be used to create a table that compares the performance. That is the only format in which I will accept evidence for changes in performance. Previously, similar PRs inadvertently resulted in performance regressions, so I will not approve this or the other PR unless I can confirm the numbers myself. But I cannot do that until Monday because the Strix Halo system that AMD sent me comes with only a single USB-A port.
Sorry for the interruption: is the `./bench` utility `tools/llama-bench`, `tools/server/bench/bench.py`, or something else? Thank you so much.
Sorry, that is a symlink to the ELF binary built by the `llama-bench` make target. To be clear, when I said "only format" I meant the comparison table produced by `scripts/compare-llama-bench.py`.

Probably worth mentioning in CONTRIBUTING.md? I updated my PR #18537.
Superseded by #18537.
Recover the performance regression from #17917 for RDNA3 and RDNA4 by choosing more performant configs for the mmq kernels.
Using an approach similar to CDNA's config for now to patch the regression; will optimize further in upcoming PRs.
Strix Halo performance after the patch: