vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment#22951
vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment#22951TheBlueMatt wants to merge 9 commits into
Conversation
ce56552 to
6ba4d99
Compare
|
Oh, forgot to mention, this brings us back to beating (by a huge margin) SYCL on Q3_K and Q6_K (and some mixed models as well) again. |
|
I do think there's a lot of potential to repacking unaligned types. Have you looked at #21024 to see if there's any overlap? |
|
Oops, yea, there is some overlap there, though I took a bit of a different approach on the shaders - instead of touching lots of shaders with a new offset constant I just padded the block structs which seems much simpler. Sadly #21024 is now also out of date because of the 2d set/get which is a bit more annoying to map. |
6ba4d99 to
daa64c5
Compare
|
I did a quick perf run on RTX 5090, and I do see a big improvement in the directed tests but not in a real model. I tested llama-7b.Q3_K_M.gguf, but I didn't try to debug why it's not faster. For the directed tests: I haven't reviewed the code. |
|
I'd also be curious to see perf numbers after |
daa64c5 to
59da539
Compare
|
I'm generally not a fan of wasting memory for alignment. The entire point of quantization is to cram as much data as possible into limited space, even small overhead should be avoided. I'm also not a fan of these huge performance claims, while you may see them, they are very specific to a broken/unoptimized driver and you make them look like we suddenly more than double performance when barely anyone would see this. I'll run some tests, but I think the approach I took in #21024 is better in concept, though unfinished. How and if we repack is still an open discussion with some of the CUDA backend developers. |
Yea, that's fair. We could also make it optional. It does seem that some people might want the tradeoff here, even if certainly not all. I'm also curious about the original comment - in which hardware was the 32-bit alignment the most problematic? On BMG using MMVQ is a win even at 2b alignment.
Apologies if I came across poorly here, certainly wasn't my intent. I certainly agree the problem is mostly on the mesa side (that's why I went to the effort to break down the performance changes by commit so we can see how much is the alignment vs not), but I also wanted to accurately report the performance changes as well - some of us do prefer to use open source drivers and as much as mesa misses some optimizations its generally pretty decent. I used mesa git to get a few improvements they've made upstream and a, I think, relatively common model design. Is there some way you'd prefer I communicate performance changes in the future?
I'm more than happy to tweak this to repack through actual bit unpacking and take that approach if you prefer, or can wait until you finish the work on #21024 if you prefer to tackle this. I imagine it would be kinda hard to get proper 4b-alignment with that approach on Q3/Q6, though, no? |
In the next commit we'll start padding Q3_K and Q6_K to make them 4-byte-aligned, so here we first enable MMVQ for these types to break out benchmarking. On Intel BMG on mesa, this leads to an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
59da539 to
e7c65b8
Compare
I saw regressions on all hardware I tested, meaning Nvidia, Intel Alchemist and AMD GCN + some form of RDNA.
I know and I appreciate that you are pushing the driver forwards. I would just prefer to avoid baseless hype about performance claims when they will barely affect anyone. I have the knowledge to recognize that this change only affects a very specific configuration, others may just read the title or title+descriptions and miss this entirely. Disclaimers are important to avoid this with huge claims.
I don't think we're ready to merge any of these approaches, but testing different variants across diverse hardware to gather more knowledge is good. I haven't looked into repacking approaches for k-quants at all yet. |
mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, because we have the high bits to ourselves after loading Q3/Q6 ints, we can do subtraction with bit twiddling and 32 bit intops rather than 4x8 bit intops. On Intel BMG on mesa, this leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
Interesting. Can you retest with just the block-load+32-bit-subtraction commit (the branch at https://github.com/TheBlueMatt/llama.cpp/commits/2026-05-q3-q6-load-block/)? While we think about how to handle padding it probably makes sense to land that separately but I'd need to update the hardware filter for MMVQ first. I went ahead and rebased this PR on that branch (without other changes) so that the separation is more clear.
Fair enough. I did update the title and I just went in to be more specific about the wins only applying to "pure" Q3_K/Q6_K models and how they're not so common these days. Hopefully that is more clear. |
Every call to `vk_tensor_offset` immediately adds the tensor `view_offs` so we might as well inline it in a single function.
It hasn't been used for some time, so just kill it off.
In the coming commits we'll pad Q3_K and Q6_K to make them 4-byte-aligned and allow for more effecient loads. This is the first step, moving to a type-size function that allows us to include the padding.
In the coming commits we'll pad Q3_K and Q6_K to make them 4-byte-aligned and allow for more effecient loads. This ensures we know the size of our buffers on the GPU, which will then differ from the size of our buffers on the host.
Q3_K and Q6_K are not 4-byte-aligned, which can be a performance issue on some devices. In the next commit we'll start padding them out to 4-byte-aligned blocks, and here support copying such tensors to/from device.
On some device 4-b aligned loads leads to a performance increase. Q3_K and Q6_K blocks are not 4-byte-aligned which resulted in us avoiding using MMVQ for them entirely. This isn't ideal and instead here we switch to padding them out to 4-byte alignemnt. FOr Q6_K this is a ~0.9% size overhead and for Q3_K a ~1.8% size overhead. The additional bloat of 0.9% for Q6_K users seems unlikely to bother those using larger quants. Q3_K is somewhat more debatable, but getting the option to use MMVQ there seems worth it in most cases. On Intel BMG on mesa, this leads to a ~28% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~3% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
Now that our Q3_K and Q6_K data blocks are 4-byte-aligned, we can simplify our loading logic a bit. This has no performance change on BMG using mesa.
e7c65b8 to
f717e22
Compare
|
I'm unfortunately not around much but I did play with this before, not creating a working implementation as that's too much work but rather only modifying the existing shaders assuming that the layout was changed in the background. I think I only got around a 10% improvement at most on my AMD cards, like @0cc4m said the big improvements are specifically for your card due to the bad Intel driver and compiler. We've had a lot of problem in the past with Intel graphics cards, and its only with int dot that Intel now has decent performance (i think the regular fp32 and fp16 code still has strange slowness on certain quants). Also for repacking I feel that it should either be enabled for everyone with an improvement for everyone, or disabled for all rather than having two different paths and two sets of shaders for different chips. We've already got way too many shaders out there. Personally I'm fine with using up a little bit more memory for alignment provided that it actually provides a big performance improvement for everyone. By the way repacking isn't worth it everywhere, for simpler quant types you'll get a larger advantage as most of the time is spent on the loads. For something like the i quants most of the time is spent unpacking the quant itself and going through those lookup tables and all that. |
|
Just a quick note because it wasn't all that clear in my OP (too many commits, sorry), the actual padding change was only a +2.9% on Q6_K matmul and +28% on Q3_K. Looking at the actual compiled shader I'm not really sure they're to blame there, though indeed the fact that the block-load commit is an improvement definitely is mesa's fault! I'm working through the slowness, and at this point most of the large/K-quant tg is getting 90%+ bandwidth (pp is a different story, but mesa needs cm2 really). |
|
Marking draft since we're still deciding on the padding. #23056 contains the non-padding speedups for Intel BMG. |
Bit of a conversation-starter. We currently disable the MMVQ path entirely for Q3_K and Q6_K because they're not 32-bit aligned and the comment indicates its performance-problematic on some machines. At least on Battlemage, for Q3_K/Q6_K models, its still a win, and something like a 50% performance win at that (for pure-Q3_K/Q6_K models, even if those are rare these days).
Then, if we switch to block-loading for Q3_K and Q6_K we get another material performance win, totaling +94% on a unsloth/Qwen3.5-9B-GGUF:Q3_K and +166% on unsloth/Qwen3.5-GGUF:Q6_K! Obviously this won't translate to more common models that aren't pure-Q3_K/Q6_K but I see an extra token per second on a "Q4_K_XL" model which has only a bit of Q6_K in it.
However, of course, 32-bit alignment is still better. Here, I propose padding out to 32 bits. This costs ~1.8% in overhead on Q3_K tensors and 0.9% on Q6_K tensors. On my BMG the Q6_K change is almost a wash (+2.9%) so maybe not worth it, but I'm curious about other devices. However, the Q3_K change is a large win - +28% on top of the previous +94% gain for a pure Q3_K model.
In total, the win here on BMG on mesa is +150% on unsloth/Qwen3.5-9B-GGUF:Q3_K and +174% on unsloth/Qwen3.5-GGUF:Q6_K.
Currently this just always enables the padding on Q3_K and Q6_K, but we could also make it device/env-dependent. I didn't do that because I didn't want to add yet more shaders to the binary but certainly could. I also did the packing on tensor load on the host-side, I think this is probably fine but if someone is doing async tensor load to do just-in-time expert offloading it might not be so free.
This probably breaks some of the test code in ggml-vulkan.cpp for Q3_K/Q6_K, but it wasn't clear to me that it was worth cleaning that up - is there anything left in that test code that isn't cover-able with test-backend-ops and should we just make sure to get the coverage we need that way instead of carrying our own test logic?
Stole the subtraction trick in the block-load commit from #22066.
ggml_type_sizecallsites and check my initial pass.