Vulkan Repack PoC #21024
|
Hi, Ruben |
|
Off topic, but please look here: #20776 |
|
I was curious about the performance improvements:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared
main:
pr:
(I updated the values for the correct main branch comparison: same build tag as the pull request.) |
df488da to 13a55c8
|
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
main:
pr:
It seems that performance is now always slower after repacking for these quantized models on this hardware. |
|
The current state on some of my systems:
RTX 3090
AMD RX 9070 XT
AMD Radeon Pro VII
Intel A770
|
|
@inforithmics I think the commit you tested was broken, so your result might not be valid. Not sure how it even worked for you, I just got a segfault. |
|
Strange. I ran the benchmarks again (with the updated PR) and the results are similar, but I ran them on Windows. I also ran the same benchmarks on Windows on a Radeon VII, and there were some improvements.
ggml_vulkan: 1 = AMD Radeon Pro VII (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: none
data
main:
pr:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
data
main:
pr:
I formatted and updated the data for the 780M; on this chip the PR sometimes reduces pp (prompt processing) throughput. I saw that the other results were run with mmap 0, so I reran the tests with mmap off.
data
main:
pr:
Then I tested again with mmap off and flash attention on.
data
main:
pr:
|
|
Data point from a bandwidth-bound iGPU: AMD Radeon 780M (RDNA3/gfx1103, RADV PHOENIX, Mesa 26.0.6), Ryzen 7 PRO 250 laptop with single-channel DDR5-5600 (~44.8 GB/s pin rate), Linux 7.0.11.
Device caps:
Model: gpt-oss-20b MXFP4 (ggml-org GGUF; experts are MXFP4, attention and output are Q8_0, so this exercises both the mxfp4 and q8_0 repack paths).
Quality check: perplexity is identical to 4 decimal places on both builds (9.8898 ± 0.4984, same corpus and chunk count). Greedy outputs diverge after a few hundred tokens (near-tie token flips from FP reordering), which matches the expectation that repack changes summation order but not quality.
Context for why +2.4% is meaningful here: decode on this setup is hard against the memory wall. Per-token weight traffic is ~2.56 GB, and test-backend-ops shows the mul_mat_vec / mul_mat_id kernels already sustaining ~38–39 GB/s of the ~44.8 GB/s pin rate. So the gain reads as real bandwidth efficiency recovered, and it was consistent across rounds with tight stddev (±0.02–0.05 t/s). |
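The bandwidth figures quoted for the 780M can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope model, not a measurement: it assumes a 64-bit (8-byte) bus for single-channel DDR5-5600, and it treats the ~2.56 GB of weight traffic per token as the only memory traffic, ignoring KV cache and activations. The function names are hypothetical and exist only for this illustration.

```python
# Back-of-the-envelope decode ceiling for a bandwidth-bound GPU.
# Inputs are the figures quoted in the comment; nothing here is measured.

def peak_bandwidth_gbs(mega_transfers_per_s: float,
                       bus_bytes: int = 8,      # assumption: 64-bit channel
                       channels: int = 1) -> float:
    """Theoretical pin rate in GB/s: transfers/s * bytes per transfer * channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def decode_ceiling_tps(bandwidth_gbs: float, gb_per_token: float) -> float:
    """Upper bound on tokens/s if every token must stream gb_per_token of weights."""
    return bandwidth_gbs / gb_per_token


def utilization(sustained_gbs: float, pin_rate_gbs: float) -> float:
    """Fraction of the theoretical pin rate that a kernel actually sustains."""
    return sustained_gbs / pin_rate_gbs


if __name__ == "__main__":
    pin_rate = peak_bandwidth_gbs(5600)   # DDR5-5600, single channel
    traffic = 2.56                        # GB of weights read per token (quoted)
    print(f"pin rate:            {pin_rate:.1f} GB/s")
    print(f"ceiling at pin rate: {decode_ceiling_tps(pin_rate, traffic):.2f} t/s")
    for sustained in (38.0, 39.0):        # sustained kernel bandwidth (quoted)
        print(f"at {sustained:.0f} GB/s: "
              f"{decode_ceiling_tps(sustained, traffic):.2f} t/s, "
              f"{utilization(sustained, pin_rate):.1%} of pin rate")
```

Under these assumptions the kernels already run at roughly 85 to 87% of the pin rate, and the hard ceiling is about 17.5 t/s, so only a few percent of headroom is left for any layout change to recover.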
This is a basic PoC to see how much of a difference quant alignment makes across GPU vendors. It's not complete and performance is better in many cases, but not universally so. I'll post benchmarks later. I assume the alignment moved the deltas into other memory pages, which in some cases is worse than the previous unaligned state. I'll gather some more data and see if it can be improved.
Claude was used for assistance, but code was written by me.
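To illustrate what "quant alignment" in the description above can mean, here is a minimal sketch using Q8_0. In ggml, a Q8_0 block is a 2-byte fp16 delta followed by 32 int8 quants, 34 bytes in total, so in the interleaved layout every other block's quants start at an offset that is not 4-byte aligned. The split layout below (all quants first, then all deltas) is an assumption made for illustration; it is not necessarily the layout this PR uses, and the helper names are hypothetical.

```python
import struct

QK8_0 = 32                   # weights per Q8_0 block in ggml
BLOCK_BYTES = 2 + QK8_0      # fp16 delta + 32 int8 quants = 34 bytes


def make_blocks(n: int) -> bytes:
    """Build n interleaved Q8_0-style blocks with dummy contents (test data only)."""
    out = bytearray()
    for i in range(n):
        out += struct.pack("<e", 0.5 * (i + 1))            # fp16 delta
        out += bytes((i + j) % 256 for j in range(QK8_0))  # 32 quant bytes
    return bytes(out)


def misaligned_quant_offsets(n: int, align: int = 4) -> int:
    """Count blocks whose quants start at a non-aligned offset when interleaved."""
    return sum(1 for i in range(n) if (i * BLOCK_BYTES + 2) % align != 0)


def repack_split(blocks: bytes) -> tuple[bytes, bytes]:
    """Split interleaved blocks into (quants, deltas); quants land on 32-byte strides."""
    n, rem = divmod(len(blocks), BLOCK_BYTES)
    if rem:
        raise ValueError("buffer is not a whole number of blocks")
    quants, deltas = bytearray(), bytearray()
    for i in range(n):
        base = i * BLOCK_BYTES
        deltas += blocks[base:base + 2]
        quants += blocks[base + 2:base + BLOCK_BYTES]
    return bytes(quants), bytes(deltas)


def unpack_split(quants: bytes, deltas: bytes) -> bytes:
    """Inverse of repack_split: rebuild the interleaved layout."""
    n = len(deltas) // 2
    out = bytearray()
    for i in range(n):
        out += deltas[2 * i:2 * i + 2]
        out += quants[QK8_0 * i:QK8_0 * (i + 1)]
    return bytes(out)


if __name__ == "__main__":
    blocks = make_blocks(4)
    quants, deltas = repack_split(blocks)
    print("misaligned quant starts (interleaved):", misaligned_quant_offsets(4))
    print("round trip ok:", unpack_split(quants, deltas) == blocks)
```

In a split layout like this one, the deltas for a row live in a different region of the buffer than its quants, which is consistent with the suspicion in the description that moving the deltas into other memory pages can sometimes cost more than the aligned loads gain.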