Skip to content

vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment#22951

Draft
TheBlueMatt wants to merge 9 commits into
ggml-org:masterfrom
TheBlueMatt:2026-05-pad-q3-q6
Draft

vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment#22951
TheBlueMatt wants to merge 9 commits into
ggml-org:masterfrom
TheBlueMatt:2026-05-pad-q3-q6

Conversation

@TheBlueMatt
Copy link
Copy Markdown
Contributor

@TheBlueMatt TheBlueMatt commented May 11, 2026

Bit of a conversation-starter. We currently disable the MMVQ path entirely for Q3_K and Q6_K because they're not 32-bit aligned and the comment indicates its performance-problematic on some machines. At least on Battlemage, for Q3_K/Q6_K models, its still a win, and something like a 50% performance win at that (for pure-Q3_K/Q6_K models, even if those are rare these days).

Then, if we switch to block-loading for Q3_K and Q6_K we get another material performance win, totaling +94% on a unsloth/Qwen3.5-9B-GGUF:Q3_K and +166% on unsloth/Qwen3.5-GGUF:Q6_K! Obviously this won't translate to more common models that aren't pure-Q3_K/Q6_K but I see an extra token per second on a "Q4_K_XL" model which has only a bit of Q6_K in it.

However, of course, 32-bit alignment is still better. Here, I propose padding out to 32 bits. This costs ~1.8% in overhead on Q3_K tensors and 0.9% on Q6_K tensors. On my BMG the Q6_K change is almost a wash (+2.9%) so maybe not worth it, but I'm curious about other devices. However, the Q3_K change is a large win - +28% on top of the previous +94% gain for a pure Q3_K model.

In total, the win here on BMG on mesa is +150% on unsloth/Qwen3.5-9B-GGUF:Q3_K and +174% on unsloth/Qwen3.5-GGUF:Q6_K.

Currently this just always enables the padding on Q3_K and Q6_K, but we could also make it device/env-dependent. I didn't do that because I didn't want to add yet more shaders to the binary but certainly could. I also did the packing on tensor load on the host-side, I think this is probably fine but if someone is doing async tensor load to do just-in-time expert offloading it might not be so free.

This probably breaks some of the test code in ggml-vulkan.cpp for Q3_K/Q6_K, but it wasn't clear to me that it was worth cleaning that up - is there anything left in that test code that isn't cover-able with test-backend-ops and should we just make sure to get the coverage we need that way instead of carrying our own test logic?

Stole the subtraction trick in the block-load commit from #22066.

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, claude suggested just switching to MMVQ which provides something and then used it to review ggml_type_size callsites and check my initial pass.

@TheBlueMatt TheBlueMatt requested a review from a team as a code owner May 11, 2026 17:57
@TheBlueMatt TheBlueMatt force-pushed the 2026-05-pad-q3-q6 branch 4 times, most recently from ce56552 to 6ba4d99 Compare May 11, 2026 18:08
@TheBlueMatt
Copy link
Copy Markdown
Contributor Author

Oh, forgot to mention, this brings us back to beating (by a huge margin) SYCL on Q3_K and Q6_K (and some mixed models as well) again.

@jeffbolznv
Copy link
Copy Markdown
Contributor

I do think there's a lot of potential to repacking unaligned types. Have you looked at #21024 to see if there's any overlap?

@TheBlueMatt
Copy link
Copy Markdown
Contributor Author

Oops, yea, there is some overlap there, though I took a bit of a different approach on the shaders - instead of touching lots of shaders with a new offset constant I just padded the block structs which seems much simpler. Sadly #21024 is now also out of date because of the 2d set/get which is a bit more annoying to map.

@TheBlueMatt
Copy link
Copy Markdown
Contributor Author

cc @0cc4m not sure if you prefer to rebase #21024 or take this instead. Either way the OP has answers for your "im curious about performance from padding" question :)

@jeffbolznv
Copy link
Copy Markdown
Contributor

I did a quick perf run on RTX 5090, and I do see a big improvement in the directed tests but not in a real model. I tested llama-7b.Q3_K_M.gguf, but I didn't try to debug why it's not faster.

For the directed tests:

before:
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                45156 runs -    22.45 us/run - 117.44 MFLOP/run -   5.23 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                31524 runs -    32.03 us/run - 234.88 MFLOP/run -   7.33 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                22152 runs -    45.62 us/run - 352.32 MFLOP/run -   7.72 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                15549 runs -    64.35 us/run - 469.76 MFLOP/run -   7.30 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                12996 runs -    77.70 us/run - 587.20 MFLOP/run -   7.56 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 7062 runs -   142.77 us/run - 939.52 MFLOP/run -   6.58 TFLOPS

after:
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                74124 runs -    13.51 us/run - 117.44 MFLOP/run -   8.69 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                54528 runs -    18.36 us/run - 234.88 MFLOP/run -  12.79 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                42600 runs -    23.52 us/run - 352.32 MFLOP/run -  14.98 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                34932 runs -    28.71 us/run - 469.76 MFLOP/run -  16.36 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                25308 runs -    39.64 us/run - 587.20 MFLOP/run -  14.81 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                16799 runs -    59.70 us/run - 939.52 MFLOP/run -  15.74 TFLOPS

I haven't reviewed the code.

@TheBlueMatt
Copy link
Copy Markdown
Contributor Author

I'd also be curious to see perf numbers after vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints before the actual padding commit.

@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels May 11, 2026
@0cc4m
Copy link
Copy Markdown
Contributor

0cc4m commented May 12, 2026

I'm generally not a fan of wasting memory for alignment. The entire point of quantization is to cram as much data as possible into limited space, even small overhead should be avoided.

I'm also not a fan of these huge performance claims, while you may see them, they are very specific to a broken/unoptimized driver and you make them look like we suddenly more than double performance when barely anyone would see this.

I'll run some tests, but I think the approach I took in #21024 is better in concept, though unfinished. How and if we repack is still an open discussion with some of the CUDA backend developers.

@TheBlueMatt TheBlueMatt changed the title vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment (+150% tg128) vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment May 12, 2026
@TheBlueMatt
Copy link
Copy Markdown
Contributor Author

TheBlueMatt commented May 12, 2026

I'm generally not a fan of wasting memory for alignment. The entire point of quantization is to cram as much data as possible into limited space, even small overhead should be avoided.

Yea, that's fair. We could also make it optional. It does seem that some people might want the tradeoff here, even if certainly not all. I'm also curious about the original comment - in which hardware was the 32-bit alignment the most problematic? On BMG using MMVQ is a win even at 2b alignment.

I'm also not a fan of these huge performance claims, while you may see them, they are very specific to a broken/unoptimized driver and you make them look like we suddenly more than double performance when barely anyone would see this.

Apologies if I came across poorly here, certainly wasn't my intent. I certainly agree the problem is mostly on the mesa side (that's why I went to the effort to break down the performance changes by commit so we can see how much is the alignment vs not), but I also wanted to accurately report the performance changes as well - some of us do prefer to use open source drivers and as much as mesa misses some optimizations its generally pretty decent. I used mesa git to get a few improvements they've made upstream and a, I think, relatively common model design. Is there some way you'd prefer I communicate performance changes in the future?

I'll run some tests, but I think the approach I took in #21024 is better in concept, though unfinished. How and if we repack is still an open discussion with some of the CUDA backend developers.

I'm more than happy to tweak this to repack through actual bit unpacking and take that approach if you prefer, or can wait until you finish the work on #21024 if you prefer to tackle this. I imagine it would be kinda hard to get proper 4b-alignment with that approach on Q3/Q6, though, no?

In the next commit we'll start padding Q3_K and Q6_K to make them
4-byte-aligned, so here we first enable MMVQ for these types to
break out benchmarking.

On Intel BMG on mesa, this leads to an immediate ~57% perf increase
in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase
in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
@0cc4m
Copy link
Copy Markdown
Contributor

0cc4m commented May 12, 2026

I'm also curious about the original comment - in which hardware was the 32-bit alignment the most problematic? On BMG using MMVQ is a win even at 2b alignment.

I saw regressions on all hardware I tested, meaning Nvidia, Intel Alchemist and AMD GCN + some form of RDNA.

I'm also not a fan of these huge performance claims, while you may see them, they are very specific to a broken/unoptimized driver and you make them look like we suddenly more than double performance when barely anyone would see this.

Apologies if I came across poorly here, certainly wasn't my intent. I certainly agree the problem is mostly on the mesa side (that's why I went to the effort to break down the performance changes by commit so we can see how much is the alignment vs not), but I also wanted to accurately report the performance changes as well - some of us do prefer to use open source drivers and as much as mesa misses some optimizations its generally pretty decent. I used mesa git to get a few improvements they've made upstream and a, I think, relatively common model design. Is there some way you'd prefer I communicate performance changes in the future?

I know and I appreciate that you are pushing the driver forwards. I would just prefer to avoid baseless hype about performance claims when they will barely affect anyone. I have the knowledge to recognize that this change only affects a very specific configuration, others may just read the title or title+descriptions and miss this entirely. Disclaimers are important to avoid this with huge claims.

I'll run some tests, but I think the approach I took in #21024 is better in concept, though unfinished. How and if we repack is still an open discussion with some of the CUDA backend developers.

I'm more than happy to tweak this to repack through actual bit unpacking and take that approach if you prefer, or can wait until you finish the work on #21024 if you prefer to tackle this. I imagine it would be kinda hard to get proper 4b-alignment with that approach on Q3/Q6, though, no?

I don't think we're ready to merge any of these approaches, but testing different variants across diverse hardware to gather more knowledge is good. I haven't looked into repacking approaches for k-quants at all yet.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, because we
have the high bits to ourselves after loading Q3/Q6 ints, we can
do subtraction with bit twiddling and 32 bit intops rather than
4x8 bit intops.

On Intel BMG on mesa, this leads to a ~24% perf increase in tg128
for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128
for unsloth/Qwen3.5-9B-GGUF:Q6_K.
@TheBlueMatt
Copy link
Copy Markdown
Contributor Author

TheBlueMatt commented May 12, 2026

I saw regressions on all hardware I tested, meaning Nvidia, Intel Alchemist and AMD GCN + some form of RDNA.

Interesting. Can you retest with just the block-load+32-bit-subtraction commit (the branch at https://github.com/TheBlueMatt/llama.cpp/commits/2026-05-q3-q6-load-block/)? While we think about how to handle padding it probably makes sense to land that separately but I'd need to update the hardware filter for MMVQ first.

I went ahead and rebased this PR on that branch (without other changes) so that the separation is more clear.

I know and I appreciate that you are pushing the driver forwards. I would just prefer to avoid baseless hype about performance claims when they will barely affect anyone. I have the knowledge to recognize that this change only affects a very specific configuration, others may just read the title or title+descriptions and miss this entirely. Disclaimers are important to avoid this with huge claims.

Fair enough. I did update the title and I just went in to be more specific about the wins only applying to "pure" Q3_K/Q6_K models and how they're not so common these days. Hopefully that is more clear.

Every call to `vk_tensor_offset` immediately adds the tensor
`view_offs` so we might as well inline it in a single function.
It hasn't been used for some time, so just kill it off.
In the coming commits we'll pad Q3_K and Q6_K to make them
4-byte-aligned and allow for more effecient loads. This is the
first step, moving to a type-size function that allows us to
include the padding.
In the coming commits we'll pad Q3_K and Q6_K to make them
4-byte-aligned and allow for more effecient loads. This ensures we
know the size of our buffers on the GPU, which will then differ
from the size of our buffers on the host.
Q3_K and Q6_K are not 4-byte-aligned, which can be a performance
issue on some devices. In the next commit we'll start padding them
out to 4-byte-aligned blocks, and here support copying such tensors
to/from device.
On some device 4-b aligned loads leads to a performance increase.
Q3_K and Q6_K blocks are not 4-byte-aligned which resulted in us
avoiding using MMVQ for them entirely. This isn't ideal and
instead here we switch to padding them out to 4-byte alignemnt.

FOr Q6_K this is a ~0.9% size overhead and for Q3_K a ~1.8% size
overhead. The additional bloat of 0.9% for Q6_K users seems
unlikely to bother those using larger quants. Q3_K is somewhat more
debatable, but getting the option to use MMVQ there seems worth it
in most cases.

On Intel BMG on mesa, this leads to a ~28% perf increase in tg128
for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~3% perf increase in tg128
for unsloth/Qwen3.5-9B-GGUF:Q6_K.
Now that our Q3_K and Q6_K data blocks are 4-byte-aligned, we can
simplify our loading logic a bit.

This has no performance change on BMG using mesa.
@netrunnereve
Copy link
Copy Markdown
Collaborator

I'm unfortunately not around much but I did play with this before, not creating a working implementation as that's too much work but rather only modifying the existing shaders assuming that the layout was changed in the background. I think I only got around a 10% improvement at most on my AMD cards, like @0cc4m said the big improvements are specifically for your card due to the bad Intel driver and compiler. We've had a lot of problem in the past with Intel graphics cards, and its only with int dot that Intel now has decent performance (i think the regular fp32 and fp16 code still has strange slowness on certain quants).

Also for repacking I feel that it should either be enabled for everyone with an improvement for everyone, or disabled for all rather than having two different paths and two sets of shaders for different chips. We've already got way too many shaders out there. Personally I'm fine with using up a little bit more memory for alignment provided that it actually provides a big performance improvement for everyone.

By the way repacking isn't worth it everywhere, for simpler quant types you'll get a larger advantage as most of the time is spent on the loads. For something like the i quants most of the time is spent unpacking the quant itself and going through those lookup tables and all that.

@TheBlueMatt
Copy link
Copy Markdown
Contributor Author

TheBlueMatt commented May 12, 2026

Just a quick note because it wasn't all that clear in my OP (too many commits, sorry), the actual padding change was only a +2.9% on Q6_K matmul and +28% on Q3_K. Looking at the actual compiled shader I'm not really sure they're to blame there, though indeed the fact that the block-load commit is an improvement definitely is mesa's fault! I'm working through the slowness, and at this point most of the large/K-quant tg is getting 90%+ bandwidth (pp is a different story, but mesa needs cm2 really).

@TheBlueMatt
Copy link
Copy Markdown
Contributor Author

Marking draft since we're still deciding on the padding. #23056 contains the non-padding speedups for Intel BMG.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Vulkan Issues specific to the Vulkan backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants