HIP: tune mmq/rocblas switching for RDNA4 #18816
jiachengjason wants to merge 7 commits into ggml-org:master from
Conversation
I don't think the kernel selection logic should be changed like this. For batch sizes < 1024 you are reporting at most a marginal speedup, which I think is not worth the increase in memory use from dequantizing the weights to FP16.
@jiachengjason Could you please repeat your testing with

@JohannesGaessler I built the latest llama.cpp with

With this PR included:
Without this PR included:
Okay, but clearly this is dependent on factors that this PR does not account for. As I've said before, the default kernel selection logic should be applicable to the default way of running the software. If it depends e.g. on environment variables being set, that needs an explicit check in the code.
Did some further tuning so that most models get a significant perf gain for micro-batch sizes > 256 and at micro-batch size 8 (+9% to +230% perf gain).
Hi @JohannesGaessler, just wanted to follow up on this PR: I did some further tuning so that most models get a bigger performance gain for micro-batch sizes > 256 and at micro-batch size 8, as mentioned above. Thank you.
When I do a quick test on my RX 9060 XT:
This is with ROCm 7.1.1 at default settings and environment variables, where this PR is clearly detrimental. If you are changing anything in your environment, that will need an explicit check in the code.
Hi @JohannesGaessler, I used this build command and the following default run command on ROCm 7.1.1. I don't see the huge regression that you see for micro-batch sizes 256 and 512, so I am wondering what build and run commands you used? This tuning increases the perf gain when used with flash attention, and should maintain default performance without it.
On Linux 6.12 I used this build command:

```shell
cmake -DCMAKE_BUILD_TYPE=Release -DGGML_HIP=ON .. && time cmake --build . -j 32 -- --quiet && echo -e "\a"
```

and this benchmark command:

```shell
export mn=llama_3-8b && export q=q6_k
./build/bin/llama-bench --model models/opt/${mn}-${q}.gguf -fa 1 -r 1 -n 0 -ub "1-512*2" -o sql | sqlite3 llama-bench.sqlite
```

Looking at the raw numbers, the MMQ performance you're reporting is very bad relative to the specs of the card, so I think that there is something else wrong.
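Since `llama-bench -o sql` emits SQL that can be loaded into SQLite, runs from two builds can be compared with a join. The sketch below is illustrative only: the schema is simplified (the real llama-bench output has many more columns), and the rows are made-up example numbers, not measured results.

```python
import sqlite3

# Simplified, assumed schema: build label, micro-batch size, avg tokens/s.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (build TEXT, n_ubatch INTEGER, avg_ts REAL)")
rows = [
    ("master", 8,   100.0), ("pr", 8,   109.0),  # made-up example numbers
    ("master", 256, 500.0), ("pr", 256, 650.0),
]
con.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)

# Relative gain of the PR build over master, per micro-batch size.
query = """
SELECT m.n_ubatch, 100.0 * (p.avg_ts - m.avg_ts) / m.avg_ts AS gain_pct
FROM test m JOIN test p ON m.n_ubatch = p.n_ubatch
WHERE m.build = 'master' AND p.build = 'pr'
ORDER BY m.n_ubatch
"""
for n_ubatch, gain in con.execute(query):
    print(n_ubatch, round(gain, 1))  # prints: 8 9.0 / 256 30.0
```

Pointing both sides of the join at databases produced by two actual runs (instead of the in-memory example rows) gives a per-batch-size speedup table directly from the benchmark output.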
Hi @JohannesGaessler, running your exact build and run commands gives me the following results. This is my environment: AMDSMI Tool 26.2.1+fc0010cf6a | AMDSMI Library version 26.2.1 | ROCm version 7.2.0 | amdgpu version 6.16.6 | hsmp version N/A.
Following a similar approach to #18537, this PR tunes the mmq/rocblas switching thresholds for RDNA4 to improve performance for micro-batch sizes > 256 and at micro-batch size 8 for most models (+9% to +230% perf gain).
Testing setup:
Performance result for llama-bench (revised)