
Enhance Metal operations for TQ weights and concurrency handling #52

Merged
TheTom merged 2 commits into TheTom:feature/turboquant-kv-cache from iamwavecut:feature/turboquant-kv-cache
Apr 6, 2026

Conversation

iamwavecut commented Apr 4, 2026

Overview

While trying to quantize and run the pruned 96-expert Gemma 4 checkpoint from Hugging Face (blascotobasco/Gemma-4-96E-A4B-Heretic) and validating it end-to-end with TurboQuant, I found several independent issues with the current implementation.

The main symptom was a severe post-quantization quality regression: the Gemma 4 TQ4_1S checkpoint could load, but on Apple Metal it produced degenerate, repetitive, or broken outputs, while the same quantized weights behaved correctly on CPU. That pointed to backend/runtime bugs in the Metal execution path.

Because this investigation started from a real Gemma 4 deployment target, I also ended up fixing the model-side integration pieces needed to make that checkpoint usable in practice, including template-related handling around Gemma 4 inference workflows.

Additional information

What was fixed in this PR:

  • fixed multiple Metal runtime issues affecting TurboQuant weight quantization (TQ3_1S / TQ4_1S)
  • added the missing Gemma 4 MoE Metal kernel instantiation required for routed expert execution
  • fixed incorrect small-batch TQ Metal kernel instantiations
  • added proper synchronization around rotated TQ activation paths to avoid races on shared activations
  • routed TQ weight matmul away from the unstable fused mul_mv path and through the correct rotated mul_mm / mul_mm_id paths
  • ensured Gemma 4 MoE expert tensors use the correct rotated TQ execution path on GPU
  • improved Gemma 4 model-side usability during inference, including the template/integration pieces needed to run this checkpoint correctly end-to-end

Result:

  • Gemma 4 TQ4_1S weight quantization now produces valid outputs on Apple Metal
  • the quantized model no longer falls into degenerate or repetitive generations on GPU
  • the model runs correctly together with TurboQuant KV cache settings during inference (for example -ctk q8_0 -ctv turbo4)
  • Gemma 4 is now much closer to working like the other supported TurboQuant models in real-world inference, not just in isolated conversion tests

In short, this PR is primarily a correctness fix for Gemma 4 TurboQuant support on Metal: it covers both quantized-weight execution and inference with the TurboQuant KV cache enabled, and it addresses the model-side integration details required by this 96-expert checkpoint.

Requirements

…Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
… and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
github-actions bot added the python label Apr 4, 2026
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
Full results for experiment TheTom#52: strength sweep, K-only vs K+V scaling,
RMS vs max-based mode, auto-detect on hd256, context scaling verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
46% PPL gap closure on hd128 (6.6340→6.5349), auto-disabled on hd256.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TheTom (Owner) commented Apr 6, 2026

Thank you @iamwavecut — this is a real fix. I spent the better part of a day debugging this exact issue (TQ4_1S producing garbage on Metal while CPU worked fine) and independently narrowed it to the fused mul_mv kernel producing wrong results, but didn't find the root cause.

Your dispatch change routing all TQ weights through the rotated mul_mm path is the right call. The fused kernel_mul_mv_tq4_1s_f32 has a deeper issue that I haven't root-caused yet — serial single-thread execution on GPU produces correct results but parallel lane execution produces ERR ≈ 0.938 regardless of which WHT implementation is used. That's a separate investigation.

Verified on M5 Max 128GB:

  • test-backend-ops test -o MUL_MAT -p "tq4_1s" — all n=1 through n=64 pass
  • test-backend-ops test -o GET_ROWS -p "tq4_1s" — passes
  • llama-server with Qwen2.5-1.5B TQ4_1S — coherent output, 49 t/s decode

One note: Qwen3.5-35B-A3B MoE has 256 experts and needs kernel_mul_mm_id_map0_ne20_256 instantiated (your PR adds ne20_96 for Gemma 4). I'll add that in a follow-up.

Merging — thanks for the thorough investigation and the Gemma 4 concurrency fix.

TheTom merged commit 7433ad9 into TheTom:feature/turboquant-kv-cache Apr 6, 2026
1 check passed