vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment by TheBlueMatt · Pull Request #22951 · ggml-org/llama.cpp

TheBlueMatt · 2026-05-11T17:57:09Z

Bit of a conversation-starter. We currently disable the MMVQ path entirely for Q3_K and Q6_K because they're not 32-bit aligned and the comment indicates its performance-problematic on some machines. At least on Battlemage, for Q3_K/Q6_K models, its still a win, and something like a 50% performance win at that (for pure-Q3_K/Q6_K models, even if those are rare these days).

Then, if we switch to block-loading for Q3_K and Q6_K we get another material performance win, totaling +94% on a unsloth/Qwen3.5-9B-GGUF:Q3_K and +166% on unsloth/Qwen3.5-GGUF:Q6_K! Obviously this won't translate to more common models that aren't pure-Q3_K/Q6_K but I see an extra token per second on a "Q4_K_XL" model which has only a bit of Q6_K in it.

However, of course, 32-bit alignment is still better. Here, I propose padding out to 32 bits. This costs ~1.8% in overhead on Q3_K tensors and 0.9% on Q6_K tensors. On my BMG the Q6_K change is almost a wash (+2.9%) so maybe not worth it, but I'm curious about other devices. However, the Q3_K change is a large win - +28% on top of the previous +94% gain for a pure Q3_K model.

In total, the win here on BMG on mesa is +150% on unsloth/Qwen3.5-9B-GGUF:Q3_K and +174% on unsloth/Qwen3.5-GGUF:Q6_K.

Currently this just always enables the padding on Q3_K and Q6_K, but we could also make it device/env-dependent. I didn't do that because I didn't want to add yet more shaders to the binary but certainly could. I also did the packing on tensor load on the host-side, I think this is probably fine but if someone is doing async tensor load to do just-in-time expert offloading it might not be so free.

This probably breaks some of the test code in ggml-vulkan.cpp for Q3_K/Q6_K, but it wasn't clear to me that it was worth cleaning that up - is there anything left in that test code that isn't cover-able with test-backend-ops and should we just make sure to get the coverage we need that way instead of carrying our own test logic?

Stole the subtraction trick in the block-load commit from #22066.

I have read and agree with the contributing guidelines
AI usage disclosure: YES, claude suggested just switching to MMVQ which provides something and then used it to review ggml_type_size callsites and check my initial pass.

TheBlueMatt · 2026-05-11T18:10:23Z

Oh, forgot to mention, this brings us back to beating (by a huge margin) SYCL on Q3_K and Q6_K (and some mixed models as well) again.

jeffbolznv · 2026-05-11T18:11:09Z

I do think there's a lot of potential to repacking unaligned types. Have you looked at #21024 to see if there's any overlap?

TheBlueMatt · 2026-05-11T18:21:04Z

Oops, yea, there is some overlap there, though I took a bit of a different approach on the shaders - instead of touching lots of shaders with a new offset constant I just padded the block structs which seems much simpler. Sadly #21024 is now also out of date because of the 2d set/get which is a bit more annoying to map.

TheBlueMatt · 2026-05-11T18:23:23Z

cc @0cc4m not sure if you prefer to rebase #21024 or take this instead. Either way the OP has answers for your "im curious about performance from padding" question :)

jeffbolznv · 2026-05-11T18:50:07Z

I did a quick perf run on RTX 5090, and I do see a big improvement in the directed tests but not in a real model. I tested llama-7b.Q3_K_M.gguf, but I didn't try to debug why it's not faster.

For the directed tests:

before:
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                45156 runs -    22.45 us/run - 117.44 MFLOP/run -   5.23 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                31524 runs -    32.03 us/run - 234.88 MFLOP/run -   7.33 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                22152 runs -    45.62 us/run - 352.32 MFLOP/run -   7.72 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                15549 runs -    64.35 us/run - 469.76 MFLOP/run -   7.30 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                12996 runs -    77.70 us/run - 587.20 MFLOP/run -   7.56 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 7062 runs -   142.77 us/run - 939.52 MFLOP/run -   6.58 TFLOPS

after:
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                74124 runs -    13.51 us/run - 117.44 MFLOP/run -   8.69 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                54528 runs -    18.36 us/run - 234.88 MFLOP/run -  12.79 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                42600 runs -    23.52 us/run - 352.32 MFLOP/run -  14.98 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                34932 runs -    28.71 us/run - 469.76 MFLOP/run -  16.36 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                25308 runs -    39.64 us/run - 587.20 MFLOP/run -  14.81 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                16799 runs -    59.70 us/run - 939.52 MFLOP/run -  15.74 TFLOPS

I haven't reviewed the code.

TheBlueMatt · 2026-05-11T18:52:19Z

I'd also be curious to see perf numbers after vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints before the actual padding commit.

0cc4m · 2026-05-12T09:53:54Z

I'm generally not a fan of wasting memory for alignment. The entire point of quantization is to cram as much data as possible into limited space, even small overhead should be avoided.

I'm also not a fan of these huge performance claims, while you may see them, they are very specific to a broken/unoptimized driver and you make them look like we suddenly more than double performance when barely anyone would see this.

I'll run some tests, but I think the approach I took in #21024 is better in concept, though unfinished. How and if we repack is still an open discussion with some of the CUDA backend developers.

TheBlueMatt · 2026-05-12T10:40:26Z

I'm generally not a fan of wasting memory for alignment. The entire point of quantization is to cram as much data as possible into limited space, even small overhead should be avoided.

Yea, that's fair. We could also make it optional. It does seem that some people might want the tradeoff here, even if certainly not all. I'm also curious about the original comment - in which hardware was the 32-bit alignment the most problematic? On BMG using MMVQ is a win even at 2b alignment.

I'm also not a fan of these huge performance claims, while you may see them, they are very specific to a broken/unoptimized driver and you make them look like we suddenly more than double performance when barely anyone would see this.

Apologies if I came across poorly here, certainly wasn't my intent. I certainly agree the problem is mostly on the mesa side (that's why I went to the effort to break down the performance changes by commit so we can see how much is the alignment vs not), but I also wanted to accurately report the performance changes as well - some of us do prefer to use open source drivers and as much as mesa misses some optimizations its generally pretty decent. I used mesa git to get a few improvements they've made upstream and a, I think, relatively common model design. Is there some way you'd prefer I communicate performance changes in the future?

I'll run some tests, but I think the approach I took in #21024 is better in concept, though unfinished. How and if we repack is still an open discussion with some of the CUDA backend developers.

I'm more than happy to tweak this to repack through actual bit unpacking and take that approach if you prefer, or can wait until you finish the work on #21024 if you prefer to tackle this. I imagine it would be kinda hard to get proper 4b-alignment with that approach on Q3/Q6, though, no?

In the next commit we'll start padding Q3_K and Q6_K to make them 4-byte-aligned, so here we first enable MMVQ for these types to break out benchmarking. On Intel BMG on mesa, this leads to an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

0cc4m · 2026-05-12T11:06:37Z

I'm also curious about the original comment - in which hardware was the 32-bit alignment the most problematic? On BMG using MMVQ is a win even at 2b alignment.

I saw regressions on all hardware I tested, meaning Nvidia, Intel Alchemist and AMD GCN + some form of RDNA.

I'm also not a fan of these huge performance claims, while you may see them, they are very specific to a broken/unoptimized driver and you make them look like we suddenly more than double performance when barely anyone would see this.

Apologies if I came across poorly here, certainly wasn't my intent. I certainly agree the problem is mostly on the mesa side (that's why I went to the effort to break down the performance changes by commit so we can see how much is the alignment vs not), but I also wanted to accurately report the performance changes as well - some of us do prefer to use open source drivers and as much as mesa misses some optimizations its generally pretty decent. I used mesa git to get a few improvements they've made upstream and a, I think, relatively common model design. Is there some way you'd prefer I communicate performance changes in the future?

I know and I appreciate that you are pushing the driver forwards. I would just prefer to avoid baseless hype about performance claims when they will barely affect anyone. I have the knowledge to recognize that this change only affects a very specific configuration, others may just read the title or title+descriptions and miss this entirely. Disclaimers are important to avoid this with huge claims.

I'll run some tests, but I think the approach I took in #21024 is better in concept, though unfinished. How and if we repack is still an open discussion with some of the CUDA backend developers.

I'm more than happy to tweak this to repack through actual bit unpacking and take that approach if you prefer, or can wait until you finish the work on #21024 if you prefer to tackle this. I imagine it would be kinda hard to get proper 4b-alignment with that approach on Q3/Q6, though, no?

I don't think we're ready to merge any of these approaches, but testing different variants across diverse hardware to gather more knowledge is good. I haven't looked into repacking approaches for k-quants at all yet.

mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, because we have the high bits to ourselves after loading Q3/Q6 ints, we can do subtraction with bit twiddling and 32 bit intops rather than 4x8 bit intops. On Intel BMG on mesa, this leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

TheBlueMatt · 2026-05-12T11:17:57Z

I saw regressions on all hardware I tested, meaning Nvidia, Intel Alchemist and AMD GCN + some form of RDNA.

Interesting. Can you retest with just the block-load+32-bit-subtraction commit (the branch at https://github.com/TheBlueMatt/llama.cpp/commits/2026-05-q3-q6-load-block/)? While we think about how to handle padding it probably makes sense to land that separately but I'd need to update the hardware filter for MMVQ first.

I went ahead and rebased this PR on that branch (without other changes) so that the separation is more clear.

I know and I appreciate that you are pushing the driver forwards. I would just prefer to avoid baseless hype about performance claims when they will barely affect anyone. I have the knowledge to recognize that this change only affects a very specific configuration, others may just read the title or title+descriptions and miss this entirely. Disclaimers are important to avoid this with huge claims.

Fair enough. I did update the title and I just went in to be more specific about the wins only applying to "pure" Q3_K/Q6_K models and how they're not so common these days. Hopefully that is more clear.

Every call to `vk_tensor_offset` immediately adds the tensor `view_offs` so we might as well inline it in a single function.

It hasn't been used for some time, so just kill it off.

In the coming commits we'll pad Q3_K and Q6_K to make them 4-byte-aligned and allow for more effecient loads. This is the first step, moving to a type-size function that allows us to include the padding.

In the coming commits we'll pad Q3_K and Q6_K to make them 4-byte-aligned and allow for more effecient loads. This ensures we know the size of our buffers on the GPU, which will then differ from the size of our buffers on the host.

Q3_K and Q6_K are not 4-byte-aligned, which can be a performance issue on some devices. In the next commit we'll start padding them out to 4-byte-aligned blocks, and here support copying such tensors to/from device.

On some device 4-b aligned loads leads to a performance increase. Q3_K and Q6_K blocks are not 4-byte-aligned which resulted in us avoiding using MMVQ for them entirely. This isn't ideal and instead here we switch to padding them out to 4-byte alignemnt. FOr Q6_K this is a ~0.9% size overhead and for Q3_K a ~1.8% size overhead. The additional bloat of 0.9% for Q6_K users seems unlikely to bother those using larger quants. Q3_K is somewhat more debatable, but getting the option to use MMVQ there seems worth it in most cases. On Intel BMG on mesa, this leads to a ~28% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~3% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Now that our Q3_K and Q6_K data blocks are 4-byte-aligned, we can simplify our loading logic a bit. This has no performance change on BMG using mesa.

netrunnereve · 2026-05-12T20:45:23Z

I'm unfortunately not around much but I did play with this before, not creating a working implementation as that's too much work but rather only modifying the existing shaders assuming that the layout was changed in the background. I think I only got around a 10% improvement at most on my AMD cards, like @0cc4m said the big improvements are specifically for your card due to the bad Intel driver and compiler. We've had a lot of problem in the past with Intel graphics cards, and its only with int dot that Intel now has decent performance (i think the regular fp32 and fp16 code still has strange slowness on certain quants).

Also for repacking I feel that it should either be enabled for everyone with an improvement for everyone, or disabled for all rather than having two different paths and two sets of shaders for different chips. We've already got way too many shaders out there. Personally I'm fine with using up a little bit more memory for alignment provided that it actually provides a big performance improvement for everyone.

By the way repacking isn't worth it everywhere, for simpler quant types you'll get a larger advantage as most of the time is spent on the loads. For something like the i quants most of the time is spent unpacking the quant itself and going through those lookup tables and all that.

TheBlueMatt · 2026-05-12T22:04:18Z

Just a quick note because it wasn't all that clear in my OP (too many commits, sorry), the actual padding change was only a +2.9% on Q6_K matmul and +28% on Q3_K. Looking at the actual compiled shader I'm not really sure they're to blame there, though indeed the fact that the block-load commit is an improvement definitely is mesa's fault! I'm working through the slowness, and at this point most of the large/K-quant tg is getting 90%+ bandwidth (pp is a different story, but mesa needs cm2 really).

TheBlueMatt · 2026-05-14T14:27:18Z

Marking draft since we're still deciding on the padding. #23056 contains the non-padding speedups for Intel BMG.

TheBlueMatt requested a review from a team as a code owner May 11, 2026 17:57

TheBlueMatt force-pushed the 2026-05-pad-q3-q6 branch 4 times, most recently from ce56552 to 6ba4d99 Compare May 11, 2026 18:08

TheBlueMatt force-pushed the 2026-05-pad-q3-q6 branch from 6ba4d99 to daa64c5 Compare May 11, 2026 18:49

TheBlueMatt force-pushed the 2026-05-pad-q3-q6 branch from daa64c5 to 59da539 Compare May 11, 2026 19:12

github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels May 11, 2026

TheBlueMatt changed the title ~~vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment (+150% tg128)~~ vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment May 12, 2026

TheBlueMatt force-pushed the 2026-05-pad-q3-q6 branch from 59da539 to e7c65b8 Compare May 12, 2026 10:49

TheBlueMatt added 7 commits May 12, 2026 11:20

vulkan: Swap vk_tensor_offset() + view_offs for a single function

72ed4a2

Every call to `vk_tensor_offset` immediately adds the tensor `view_offs` so we might as well inline it in a single function.

vulkan: Drop ggml_vk_buffer_write_nc_async

36aa093

It hasn't been used for some time, so just kill it off.

vulkan: Switch to a vulkan-specific type size function

fd49560

In the coming commits we'll pad Q3_K and Q6_K to make them 4-byte-aligned and allow for more effecient loads. This is the first step, moving to a type-size function that allows us to include the padding.

vulkan: support copying tensors with extra padding between blocks

a9da6df

Q3_K and Q6_K are not 4-byte-aligned, which can be a performance issue on some devices. In the next commit we'll start padding them out to 4-byte-aligned blocks, and here support copying such tensors to/from device.

vulkan: Rewrite MMVQ Q3_K/Q6_K block loads to load through 4xu32

f717e22

Now that our Q3_K and Q6_K data blocks are 4-byte-aligned, we can simplify our loading logic a bit. This has no performance change on BMG using mesa.

TheBlueMatt force-pushed the 2026-05-pad-q3-q6 branch from e7c65b8 to f717e22 Compare May 12, 2026 11:21

TheBlueMatt mentioned this pull request May 12, 2026

vulkan : transpose A-matrix data layout for K-quant mul_mat performance #22970

Open

TheBlueMatt mentioned this pull request May 14, 2026

vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints #23056

Merged

TheBlueMatt marked this pull request as draft May 14, 2026 14:26

Conversation

TheBlueMatt commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TheBlueMatt commented May 11, 2026

Uh oh!

jeffbolznv commented May 11, 2026

Uh oh!

TheBlueMatt commented May 11, 2026

Uh oh!

TheBlueMatt commented May 11, 2026

Uh oh!

jeffbolznv commented May 11, 2026

Uh oh!

TheBlueMatt commented May 11, 2026

Uh oh!

0cc4m commented May 12, 2026

Uh oh!

TheBlueMatt commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0cc4m commented May 12, 2026

Uh oh!

TheBlueMatt commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

netrunnereve commented May 12, 2026

Uh oh!

TheBlueMatt commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TheBlueMatt commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

TheBlueMatt commented May 11, 2026 •

edited

Loading

TheBlueMatt commented May 12, 2026 •

edited

Loading

TheBlueMatt commented May 12, 2026 •

edited

Loading

TheBlueMatt commented May 12, 2026 •

edited

Loading