feat: HIP/ROCm support for turbo3/turbo2 (7900 XTX) #31

Merged
TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache from apollosenvy:pr/rocm-hip-port
Mar 30, 2026
Conversation

@apollosenvy

Summary

HIP/ROCm porting for the turbo3/turbo2 warp-cooperative kernels. Split from PR #5 per review feedback.

Single commit, minimal surface area:

  • HIP vendor header (hip.h): Added cudaMemcpyToSymbol/FromSymbol mappings. Fixed __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support 3-arg calls (CUDA defaults width to warpSize). Added __ballot_sync -> __ballot with uint32_t cast.
  • HIP CMakeLists: Added turbo3/turbo2 FA template instances. Excluded D>=576 fattn-tile kernels (exceed HIP's 64KB local memory limit).

Test Results (AMD 7900 XTX, ROCm 7.1)

| Model | KV Type | PPL | pp128 t/s | tg32 t/s |
|---|---|---|---|---|
| Qwen3.5-27B Q4_K_M | turbo3 | 7.58 | 654 | 25.2 |
| Mistral-Small-24B Q4_K_S | turbo3 | 5.28 | 600 | 24.2 |

turbo3 runs at ~98% of F16 speed. Mistral-Small (head_dim=128) is confirmed working.

What's NOT in this PR

  • Temporal decay (separate PR, needs Metal kernel)
  • Non-128 head_dim fallback changes (separate discussion)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rnels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner

TheTom commented Mar 30, 2026

Tested on M5 Max 128GB and M2 Pro 32GB (Metal). HIP-only changes, zero shared code — confirmed no Metal regressions.

M5 Max (Qwen3.5-35B-A3B Q8_0):

| Config | PPL | Baseline | pp512 t/s | tg128 t/s |
|---|---|---|---|---|
| turbo3 | 6.1756 | 6.1756 (match) | 2726 | 76.69 |
| turbo4 | 6.1250 | 6.1250 (match) | | |
| q8_0/turbo4 | | | 2760 | 81.17 |

M2 Pro (Qwen2.5-7B Q4_K_M, asymmetric — correct config for Q4_K_M models):

| Config | PPL | vs q8_0 | pp512 t/s | tg128 t/s |
|---|---|---|---|---|
| q8_0/q8_0 | 6.7938 | baseline | 334.91 | 32.51 |
| q8_0/turbo4 | 6.8281 | +0.5% | 332.72 | 28.39 |

Clean on both platforms. Nice minimal PR — the variadic shuffle macros and the D>=576 exclusion are both sensible. Thanks for the clean split from PR #5, this is exactly the right approach.

Merging.

@TheTom TheTom merged commit 64dd362 into TheTom:feature/turboquant-kv-cache Mar 30, 2026
1 check passed
shtaylor pushed a commit to shtaylor/llama-cpp-turboquant that referenced this pull request Mar 30, 2026
…heTom#31

Block 128: PPL=165.6 (same as block 32)
Disabled Q rotation: PPL=165.6 (same)
Root cause: dynamic_cast fails for MoE hybrid memory context.
Q rotation and V inverse rotation never execute.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
shtaylor pushed a commit to shtaylor/llama-cpp-turboquant that referenced this pull request Mar 30, 2026
…eTom#31 TheTom#30

ROOT CAUSE: pre-rotate-queries never executed because:
1. Q ne[0]=256 (GQA concatenated heads), rotation matrix ne[0]=128
2. mctx dynamic_cast failed for MoE hybrid memory

FIX: put inverse WHT rotation back in dequantize_full_block.
This is slower (10.7 tok/s vs 77.7) but produces CORRECT results.

PERPLEXITY RESULTS:
- f16:     6.121
- q8_0:    6.111
- q4_0:    6.142
- turbo3:  6.194 (+1.2% vs q8_0) ✅

The speed optimization (pre-rotate-queries) needs to be reimplemented
to work with GQA head layout and hybrid memory types.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Apr 6, 2026
- turbo4 K+V results on Qwen3.5-27B (-0.32% vs q8_0) and Qwen3-14B (+6.3%)
- Sparse V dequant benchmarks: MoE native dequant +10.9% at 8K
- Gemma-3 turbo3 results post-iSWA fix (+3.3%)
- KVLinC no-K-rotation negative result
- Speculative decoding negative result
- CUDA 13.2 compatibility verified
- Experiments #31, #39, #42, #45, #49, #50, #51 status updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Apr 9, 2026
On Gemma 4 26B-A4B (Ampere), the inverse-FWHT decode K dequant for K=turbo2
produces values that are correct in isolation but trigger degenerate
single-token output (e.g. <thought> repeat, 0000 repeat) when paired with
V types in {turbo3, turbo4, q8_0, f16}. The same K=turbo2 inv-FWHT path
works fine for V in {turbo2, turbo2_tcq, turbo3_tcq}. The native VEC turbo
path also works for the failing combos. Root cause is still undiagnosed
after deep instrumentation: K and V f16 buffers contain correct values,
strides are correct, the FA kernel template selection matches the working
configurations, and L2 norms / value distributions look healthy across
all 30+ FA calls. Yet the model output collapses on Gemma 4 globals.

The PREFILL path uses the rotated-domain dequant kernel (k_turbo2_dequant_f16,
no inverse FWHT) plus a Q rotation to keep K and Q in the same Hadamard
basis. That path works for every K/V combination. This commit conditionally
mirrors the prefill K dequant + Q rotation in the decode path for the
specific K=turbo2 ↔ V={turbo3,turbo4,q8_0,f16} cases, and symmetrically for
K=turbo3 ↔ V=turbo2. Same-type and TCQ-side configurations are unchanged.

Fixes (Gemma 4 26B-A4B, dorei RTX 3090):
  K=turbo2 V=turbo3   degenerate `<thought>` repeat → coherent factorial code
  K=turbo3 V=turbo2   degenerate `0000` repeat       → coherent factorial code
  K=turbo2 V=turbo4   degenerate `<start_of_turn>`   → coherent factorial code
  K=turbo2 V=q8_0     degenerate `<div>` echo        → coherent factorial code

Still broken (different code paths, separate root causes):
  K=q8_0 V=turbo2     `<|channel>` repeat — K=q8_0 has no rotated-domain
                      decode option (Q8_0 is naturally in original domain)
  K=turbo2 V=f16      crash in llama_decode (separate dispatch ASSERT)
  K=f16 V=turbo2      crash in llama_decode (separate dispatch ASSERT)

Verification:
- Qwen3.5-27B PPL (8 chunks @ 2K, wikitext-2, RTX 3090): bit-identical to
  master baseline across f16/q8_0/turbo3/turbo4/turbo3_tcq/turbo2/turbo2_tcq
  (5.8048 / 5.8385 / 5.8501 / 5.8579 / 5.8017 / 6.0786 / 6.0051).
- Gemma 4 26B-A4B same-type configs (f16, q8_0, turbo2..turbo4, turbo*_tcq):
  all coherent.
- Gemma 4 26B-A4B mixed K/V matrix: 19/21 working configs unchanged,
  4 previously-failing K=t2/t3 mixed configs newly fixed, K=q8_0+V=t2 still
  broken (separate bug).

The conditional dispatch is gated on K type AND V type so it only diverts
the empirically-failing combinations. Same-type configs and configs that
work with the inv-FWHT path are untouched.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
greatyingzi referenced this pull request in greatyingzi/buun-llama-cpp Apr 15, 2026
The previous commit optimized turbo4 decode by skipping the expensive
inv-FWHT butterfly on K dequant and using rotated-domain dequant with
Q pre-rotation instead. This extends the same optimization to all turbo
quantization types:

- turbo2_0: switched from k_turbo2_dequant_f16_inv_fwht to k_turbo2_dequant_f16
- turbo3_0: switched from k_turbo3_dequant_f16_inv_fwht to k_turbo3_dequant_f16
- turbo3_tcq: switched from k_turbo3_tcq_dequant_f16_inv_fwht to rotated-domain
- turbo2_tcq: switched from k_turbo2_tcq_dequant_f16_inv_fwht to rotated-domain

All turbo K types now always use rotated-domain dequant in decode, with
Q pre-rotated via FWHT to compensate (cheap: ~1 FWHT group per head vs
N FWHT groups per KV row). The Bug #31 conditional workaround is removed
since rotated-domain is now the default for all types.

Also adds shared-memory codebook (_cb) variants for turbo2_0 and turbo3_0
VEC path (K dot product and V dequant) to match turbo4 and TCQ patterns,
and adds turbo4 tile-path loading in the MMA/f16 kernel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jimbothigpen added a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
)

4-screen flow wrapping the existing pipeline:
  1. HF model selection + download
  2. Calibration corpus (preset or custom path)
  3. Weight quant format whitelist (checkboxes)
  4. Priority XYZ + budget GB slider

Each screen has an ASCII mock in its docstring + prints the equivalent
CLI invocation so users see the underlying pipeline. State is JSON-
serializable for --save/--resume mid-flow.

Stub status: argparse + state + screen functions wired; run_pipeline()
prints the plan but doesn't yet exec ../run-pipeline.sh. Cache key
hashing is in place; cache hit/miss detection is TODO.

Estimated 3-5 days to make production-ready. See wizard/README.md for
the next-steps list. Task TheTom#31 tracks.
jimbothigpen added a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
Hybrid auto-discovery + curated metadata overlay:

1. format_discovery.py — parses `<binary> --help` to enumerate supported
   types (handles fork-specific additions automatically). Caches result
   per binary hash in ~/.cache/prismaquant/binary-types/<sha256>.json so
   re-runs are instant. Standalone CLI for ad-hoc inspection.

2. format_metadata.json — curated overlay with bpw, family (k/i/ik/tq/fp),
   source (mainline/ikllama/frankenturbo), recommend flag, notes. Covers
   53+ formats spanning all four families. Internal-use types (F32,
   COPY) are denylisted.

3. wizard.py Screen 3 — uses discover_formats() to populate checkbox
   list. Default shows only recommend=true (typically ~15). --all-formats
   flag surfaces every format the binary supports. --binary flag picks
   which binary to query (LLAMA_QUANTIZE_BIN env var also honored).

Tested against frankenturbo2 build/bin/llama-quantize: discovers all 53
supported types, recommend filter narrows to 15. Hard-coded fallback if
--help parsing fails. Cache hit/miss verified.

Bumps task TheTom#31 status — Screen 3 fully wired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jimbothigpen added a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
calibration.py — fill metadata with measured data instead of relying on
the heuristic + curated overlay alone. Three modes ordered by cost:

  quick   size-only sweep: quantize ref_model with each format, derive
          bpw from output size. ~25 min for 53 formats. Resume-safe;
          per-format results persisted between iterations so a kill
          mid-sweep preserves progress. Peak disk ~= 1× output.

  deep    + llama-perplexity (PPL Δ vs f16) + llama-bench (PP/TG tps).
          ~6-12 hours overnight. Produces empirical "your binary, your
          hardware" cost+perf curve. Same resume + disk semantics.

  ingest  slurps prismaquant Stage D cost.csv and folds per-format MSE
          aggregates into the wizard's calibration cache. Free; can
          run automatically as part of the pipeline.

Output: ~/.cache/prismaquant-wizard/binary-types/<sha>-calibrated.json
keyed by binary SHA-256. Acts as Layer 1.5 in the metadata stack:
overrides heuristic auto-classification, falls under static maintainer
files (base + fork) when both are present.

count_gguf_params helper verified on Llama-3.2-3B Q6_K_PLLow:
  3.21B params, 2.48 GB → derived bpw = 6.64 (matches Q6_K nominal 6.5625).

Status: scaffold. Subprocess invocations + parsers + persistence
implemented; not yet executed end-to-end against a live binary because
ai00 GPU is busy with the comparison sweep. Ready to test post-sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>