Enhance Metal operations for TQ weights and concurrency handling #52
Conversation
…Gemma 4
- Introduced a new function to check whether TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights correctly under concurrency.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
… and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Full results for experiment TheTom#52: strength sweep, K-only vs K+V scaling, RMS vs max-based mode, auto-detect on hd256, context scaling verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
46% PPL gap closure on hd128 (6.6340→6.5349), auto-disabled on hd256. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thank you @iamwavecut — this is a real fix. I spent the better part of a day debugging this exact issue (TQ4_1S producing garbage on Metal while the CPU path worked fine) and independently narrowed it to the fused mul_mv kernel producing wrong results, but I didn't find the root cause. Your dispatch change routing all TQ weights through the rotated mul_mm path is the right call.

The fused `kernel_mul_mv_tq4_1s_f32` has a deeper issue that I haven't root-caused yet: serial single-thread execution on the GPU produces correct results, but parallel lane execution produces ERR ≈ 0.938 regardless of which WHT implementation is used. That's a separate investigation.

Verified on M5 Max 128GB:
One note: Qwen3.5-35B-A3B MoE has 256 experts and needs merging — thanks for the thorough investigation and the Gemma 4 concurrency fix.
Overview
While trying to quantize and run the pruned 96-expert Gemma 4 checkpoint from Hugging Face (blascotobasco/Gemma-4-96E-A4B-Heretic) and validating it end-to-end with TurboQuant, I found several independent issues with the current implementation.
The main symptom was a severe post-quantization quality regression: the Gemma 4 TQ4_1S checkpoint could load, but on Apple Metal it produced degenerate, repetitive, or broken outputs, while the same quantized weights behaved correctly on CPU. That pointed to backend/runtime bugs in the Metal execution path.
Because this investigation started from a real Gemma 4 deployment target, I also ended up fixing the model-side integration pieces needed to make that checkpoint usable in practice, including template-related handling around Gemma 4 inference workflows.
Additional information
What was fixed in this PR:
- All TQ weight types (TQ3_1S/TQ4_1S) are now routed away from the fused `mul_mv` path and through the correct rotated `mul_mm`/`mul_mm_id` paths

Result:
- Inference with TurboQuant KV cache enabled (`-ctk q8_0 -ctv turbo4`)

So this PR is primarily a correctness fix for Gemma 4 TurboQuant support on Metal, covering both quantized weight execution and inference with TurboQuant KV cache enabled, while also addressing the model-side integration details required by this 96-expert checkpoint.
Requirements