feat: HIP/ROCm support for turbo3/turbo2 (7900 XTX) #31

Merged
TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache from apollosenvy:pr/rocm-hip-port
Mar 30, 2026
Conversation

@apollosenvy

Summary

HIP/ROCm porting for the turbo3/turbo2 warp-cooperative kernels. Split from PR #5 per review feedback.

Single commit, minimal surface area:

  • HIP vendor header (hip.h): Added cudaMemcpyToSymbol/FromSymbol mappings. Fixed __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support 3-arg calls (CUDA defaults width to warpSize). Added __ballot_sync -> __ballot with uint32_t cast.
  • HIP CMakeLists: Added turbo3/turbo2 FA template instances. Excluded D>=576 fattn-tile kernels (exceed HIP's 64KB local memory limit).

Test Results (AMD 7900 XTX, ROCm 7.1)

| Model | KV Type | PPL | pp128 t/s | tg32 t/s |
|---|---|---|---|---|
| Qwen3.5-27B Q4_K_M | turbo3 | 7.58 | 654 | 25.2 |
| Mistral-Small-24B Q4_K_S | turbo3 | 5.28 | 600 | 24.2 |

turbo3 runs at ~98% of F16 speed. Mistral-Small (head_dim=128) is confirmed working.

What's NOT in this PR

  • Temporal decay (separate PR, needs Metal kernel)
  • Non-128 head_dim fallback changes (separate discussion)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rnels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner

TheTom commented Mar 30, 2026

Tested on M5 Max 128GB and M2 Pro 32GB (Metal). HIP-only changes, zero shared code — confirmed no Metal regressions.

M5 Max (Qwen3.5-35B-A3B Q8_0):

| Config | PPL | Baseline | pp512 t/s | tg128 t/s |
|---|---|---|---|---|
| turbo3 | 6.1756 | 6.1756 (match) | 2726 | 76.69 |
| turbo4 | 6.1250 | 6.1250 (match) | | |
| q8_0/turbo4 | | | 2760 | 81.17 |

M2 Pro (Qwen2.5-7B Q4_K_M, asymmetric — correct config for Q4_K_M models):

| Config | PPL | vs q8_0 | pp512 t/s | tg128 t/s |
|---|---|---|---|---|
| q8_0/q8_0 | 6.7938 | baseline | 334.91 | 32.51 |
| q8_0/turbo4 | 6.8281 | +0.5% | 332.72 | 28.39 |

Clean on both platforms. Nice minimal PR — the variadic shuffle macros and the D>=576 exclusion are both sensible. Thanks for the clean split from PR #5, this is exactly the right approach.

Merging.

@TheTom TheTom merged commit 64dd362 into TheTom:feature/turboquant-kv-cache Mar 30, 2026
1 check passed
shtaylor pushed a commit to shtaylor/llama-cpp-turboquant that referenced this pull request Mar 30, 2026
…heTom#31

Block 128: PPL=165.6 (same as block 32)
Disabled Q rotation: PPL=165.6 (same)
Root cause: dynamic_cast fails for MoE hybrid memory context.
Q rotation and V inverse rotation never execute.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
shtaylor pushed a commit to shtaylor/llama-cpp-turboquant that referenced this pull request Mar 30, 2026
…eTom#31 TheTom#30

ROOT CAUSE: pre-rotate-queries never executed because:
1. Q ne[0]=256 (GQA concatenated heads), rotation matrix ne[0]=128
2. mctx dynamic_cast failed for MoE hybrid memory

FIX: put inverse WHT rotation back in dequantize_full_block.
This is slower (10.7 tok/s vs 77.7) but produces CORRECT results.

PERPLEXITY RESULTS:
- f16:     6.121
- q8_0:    6.111
- q4_0:    6.142
- turbo3:  6.194 (+1.2% vs q8_0) ✅

The speed optimization (pre-rotate-queries) needs to be reimplemented
to work with GQA head layout and hybrid memory types.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Apr 6, 2026
- turbo4 K+V results on Qwen3.5-27B (-0.32% vs q8_0) and Qwen3-14B (+6.3%)
- Sparse V dequant benchmarks: MoE native dequant +10.9% at 8K
- Gemma-3 turbo3 results post-iSWA fix (+3.3%)
- KVLinC no-K-rotation negative result
- Speculative decoding negative result
- CUDA 13.2 compatibility verified
- Experiments #31, #39, #42, #45, #49, #50, #51 status updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Apr 9, 2026
On Gemma 4 26B-A4B (Ampere), the inverse-FWHT decode K dequant for K=turbo2
produces values that are correct in isolation but trigger degenerate
single-token output (e.g. <thought> repeat, 0000 repeat) when paired with
V types in {turbo3, turbo4, q8_0, f16}. The same K=turbo2 inv-FWHT path
works fine for V in {turbo2, turbo2_tcq, turbo3_tcq}. The native VEC turbo
path also works for the failing combos. Root cause is still undiagnosed
after deep instrumentation: K and V f16 buffers contain correct values,
strides are correct, the FA kernel template selection matches the working
configurations, and L2 norms / value distributions look healthy across
all 30+ FA calls. Yet the model output collapses on Gemma 4 globals.

The PREFILL path uses the rotated-domain dequant kernel (k_turbo2_dequant_f16,
no inverse FWHT) plus a Q rotation to keep K and Q in the same Hadamard
basis. That path works for every K/V combination. This commit conditionally
mirrors the prefill K dequant + Q rotation in the decode path for the
specific K=turbo2 ↔ V={turbo3,turbo4,q8_0,f16} cases, and symmetrically for
K=turbo3 ↔ V=turbo2. Same-type and TCQ-side configurations are unchanged.

Fixes (Gemma 4 26B-A4B, dorei RTX 3090):
  K=turbo2 V=turbo3   degenerate `<thought>` repeat → coherent factorial code
  K=turbo3 V=turbo2   degenerate `0000` repeat       → coherent factorial code
  K=turbo2 V=turbo4   degenerate `<start_of_turn>`   → coherent factorial code
  K=turbo2 V=q8_0     degenerate `<div>` echo        → coherent factorial code

Still broken (different code paths, separate root causes):
  K=q8_0 V=turbo2     `<|channel>` repeat — K=q8_0 has no rotated-domain
                      decode option (Q8_0 is naturally in original domain)
  K=turbo2 V=f16      crash in llama_decode (separate dispatch ASSERT)
  K=f16 V=turbo2      crash in llama_decode (separate dispatch ASSERT)

Verification:
- Qwen3.5-27B PPL (8 chunks @ 2K, wikitext-2, RTX 3090): bit-identical to
  master baseline across f16/q8_0/turbo3/turbo4/turbo3_tcq/turbo2/turbo2_tcq
  (5.8048 / 5.8385 / 5.8501 / 5.8579 / 5.8017 / 6.0786 / 6.0051).
- Gemma 4 26B-A4B same-type configs (f16, q8_0, turbo2..turbo4, turbo*_tcq):
  all coherent.
- Gemma 4 26B-A4B mixed K/V matrix: 19/21 working configs unchanged,
  4 previously-failing K=t2/t3 mixed configs newly fixed, K=q8_0+V=t2 still
  broken (separate bug).

The conditional dispatch is gated on K type AND V type so it only diverts
the empirically-failing combinations. Same-type configs and configs that
work with the inv-FWHT path are untouched.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
greatyingzi referenced this pull request in greatyingzi/buun-llama-cpp Apr 15, 2026
The previous commit optimized turbo4 decode by skipping the expensive
inv-FWHT butterfly on K dequant and using rotated-domain dequant with
Q pre-rotation instead. This extends the same optimization to all turbo
quantization types:

- turbo2_0: switched from k_turbo2_dequant_f16_inv_fwht to k_turbo2_dequant_f16
- turbo3_0: switched from k_turbo3_dequant_f16_inv_fwht to k_turbo3_dequant_f16
- turbo3_tcq: switched from k_turbo3_tcq_dequant_f16_inv_fwht to rotated-domain
- turbo2_tcq: switched from k_turbo2_tcq_dequant_f16_inv_fwht to rotated-domain

All turbo K types now always use rotated-domain dequant in decode, with
Q pre-rotated via FWHT to compensate (cheap: ~1 FWHT group per head vs
N FWHT groups per KV row). The Bug #31 conditional workaround is removed
since rotated-domain is now the default for all types.

Also adds shared-memory codebook (_cb) variants for turbo2_0 and turbo3_0
VEC path (K dot product and V dequant) to match turbo4 and TCQ patterns,
and adds turbo4 tile-path loading in the MMA/f16 kernel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jimbothigpen added a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
)

4-screen flow wrapping the existing pipeline:
  1. HF model selection + download
  2. Calibration corpus (preset or custom path)
  3. Weight quant format whitelist (checkboxes)
  4. Priority XYZ + budget GB slider

Each screen has an ASCII mock in its docstring + prints the equivalent
CLI invocation so users see the underlying pipeline. State is JSON-
serializable for --save/--resume mid-flow.

Stub status: argparse + state + screen functions wired; run_pipeline()
prints the plan but doesn't yet exec ../run-pipeline.sh. Cache key
hashing is in place; cache hit/miss detection is TODO.

Estimated 3-5 days to make production-ready. See wizard/README.md for
the next-steps list. Task TheTom#31 tracks.
jimbothigpen added a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
Hybrid auto-discovery + curated metadata overlay:

1. format_discovery.py — parses `<binary> --help` to enumerate supported
   types (handles fork-specific additions automatically). Caches result
   per binary hash in ~/.cache/prismaquant/binary-types/<sha256>.json so
   re-runs are instant. Standalone CLI for ad-hoc inspection.

2. format_metadata.json — curated overlay with bpw, family (k/i/ik/tq/fp),
   source (mainline/ikllama/frankenturbo), recommend flag, notes. Covers
   53+ formats spanning all four families. Internal-use types (F32,
   COPY) are denylisted.

3. wizard.py Screen 3 — uses discover_formats() to populate checkbox
   list. Default shows only recommend=true (typically ~15). --all-formats
   flag surfaces every format the binary supports. --binary flag picks
   which binary to query (LLAMA_QUANTIZE_BIN env var also honored).

Tested against frankenturbo2 build/bin/llama-quantize: discovers all 53
supported types, recommend filter narrows to 15. Hard-coded fallback if
--help parsing fails. Cache hit/miss verified.

Bumps task TheTom#31 status — Screen 3 fully wired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jimbothigpen added a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
calibration.py — fill metadata with measured data instead of relying on
the heuristic + curated overlay alone. Three modes ordered by cost:

  quick   size-only sweep: quantize ref_model with each format, derive
          bpw from output size. ~25 min for 53 formats. Resume-safe;
          per-format results persisted between iterations so a kill
          mid-sweep preserves progress. Peak disk ~= 1× output.

  deep    + llama-perplexity (PPL Δ vs f16) + llama-bench (PP/TG tps).
          ~6-12 hours overnight. Produces empirical "your binary, your
          hardware" cost+perf curve. Same resume + disk semantics.

  ingest  slurps prismaquant Stage D cost.csv and folds per-format MSE
          aggregates into the wizard's calibration cache. Free; can
          run automatically as part of the pipeline.

Output: ~/.cache/prismaquant-wizard/binary-types/<sha>-calibrated.json
keyed by binary SHA-256. Acts as Layer 1.5 in the metadata stack:
overrides heuristic auto-classification, falls under static maintainer
files (base + fork) when both are present.

count_gguf_params helper verified on Llama-3.2-3B Q6_K_PLLow:
  3.21B params, 2.48 GB → derived bpw = 6.64 (matches Q6_K nominal 6.5625).

Status: scaffold. Subprocess invocations + parsers + persistence
implemented; not yet executed end-to-end against a live binary because
ai00 GPU is busy with the comparison sweep. Ready to test post-sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>