
Enhance Metal operations for TQ weights and concurrency handling #52

Merged
TheTom merged 2 commits into TheTom:feature/turboquant-kv-cache from iamwavecut:feature/turboquant-kv-cache
Apr 6, 2026

Conversation

iamwavecut commented Apr 4, 2026

Overview

While trying to quantize and run the pruned 96-expert Gemma 4 checkpoint from Hugging Face (blascotobasco/Gemma-4-96E-A4B-Heretic) and validating it end-to-end with TurboQuant, I found several independent issues with the current implementation.

The main symptom was a severe post-quantization quality regression: the Gemma 4 TQ4_1S checkpoint could load, but on Apple Metal it produced degenerate, repetitive, or broken outputs, while the same quantized weights behaved correctly on CPU. That pointed to backend/runtime bugs in the Metal execution path.

Because this investigation started from a real Gemma 4 deployment target, I also ended up fixing the model-side integration pieces needed to make that checkpoint usable in practice, including template-related handling around Gemma 4 inference workflows.

Additional information

What was fixed in this PR:

  • fixed multiple Metal runtime issues affecting TurboQuant weight quantization (TQ3_1S / TQ4_1S)
  • added the missing Gemma 4 MoE Metal kernel instantiation required for routed expert execution
  • fixed incorrect small-batch TQ Metal kernel instantiations
  • added proper synchronization around rotated TQ activation paths to avoid races on shared activations
  • routed TQ weight matmul away from the unstable fused mul_mv path and through the correct rotated mul_mm / mul_mm_id paths
  • ensured Gemma 4 MoE expert tensors use the correct rotated TQ execution path on GPU
  • improved Gemma 4 model-side usability during inference, including the template/integration pieces needed to run this checkpoint correctly end-to-end

Result:

  • Gemma 4 TQ4_1S weight quantization now produces valid outputs on Apple Metal
  • the quantized model no longer falls into degenerate or repetitive generations on GPU
  • the model runs correctly together with TurboQuant KV cache settings during inference (for example -ctk q8_0 -ctv turbo4)
  • Gemma 4 is now much closer to working like the other supported TurboQuant models in real-world inference, not just in isolated conversion tests

In short, this PR is primarily a correctness fix for Gemma 4 TurboQuant support on Metal: it covers both quantized-weight execution and inference with the TurboQuant KV cache enabled, and it addresses the model-side integration details required by this 96-expert checkpoint.

Requirements

…Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
… and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
github-actions bot added the python label Apr 4, 2026
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
Full results for experiment TheTom#52: strength sweep, K-only vs K+V scaling,
RMS vs max-based mode, auto-detect on hd256, context scaling verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
46% PPL gap closure on hd128 (6.6340→6.5349), auto-disabled on hd256.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TheTom (Owner) commented Apr 6, 2026

Thank you @iamwavecut — this is a real fix. I spent the better part of a day debugging this exact issue (TQ4_1S producing garbage on Metal while CPU worked fine) and independently narrowed it to the fused mul_mv kernel producing wrong results, but didn't find the root cause.

Your dispatch change routing all TQ weights through the rotated mul_mm path is the right call. The fused kernel_mul_mv_tq4_1s_f32 has a deeper issue that I haven't root-caused yet — serial single-thread execution on GPU produces correct results but parallel lane execution produces ERR ≈ 0.938 regardless of which WHT implementation is used. That's a separate investigation.

Verified on M5 Max 128GB:

  • test-backend-ops test -o MUL_MAT -p "tq4_1s" — all n=1 through n=64 pass
  • test-backend-ops test -o GET_ROWS -p "tq4_1s" — passes
  • llama-server with Qwen2.5-1.5B TQ4_1S — coherent output, 49 t/s decode

One note: Qwen3.5-35B-A3B MoE has 256 experts and needs kernel_mul_mm_id_map0_ne20_256 instantiated (your PR adds ne20_96 for Gemma 4). I'll add that in a follow-up.

Merging — thanks for the thorough investigation and the Gemma 4 concurrency fix.

TheTom merged commit 7433ad9 into TheTom:feature/turboquant-kv-cache Apr 6, 2026
1 check passed