QVAC-18966 [TTS GGML] Fix CPU regression#21
Conversation
`apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5) was written assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`. In fact the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)` fallback) produces `ne=[L, HD]` with channel-major-flat memory (`data[t + c*L]`) — the bit-exact transpose of the helper's input contract. Every CPU synth with `--n-gpu-layers 0` against a GGUF carrying `vector_rope_theta` aborts at the helper's defensive assertion on the first denoise step: supertonic_internal.h:742: GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml → supertonic_vector_step_ggml → supertonic_vector_loop_ggml The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI. Fix (strict TDD): 1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]`. Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits. 2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit. 3. V has no RoPE to mask the layout flip — open-code the same `ggml_cont(ggml_transpose(...))` at the V matmul output in `build_group_graph_cache` and the front-block path in `supertonic_vector_trace_proj_ggml` so the GPU-bridge `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact bytes. Style sq/sk/sv left untouched — this branch has no GPU bridge for style attention, so the host-vector path via `tensor_to_time_channel` is already correct. 4. Legacy host-bridge downloads of post-RoPE Q/K and post-transpose V switched from `tensor_to_time_channel` to `tensor_raw_f32`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently. Verification: | Backend | Pre-fix | Post-fix | |---|---|---| | CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV | | CPU long-text synth | abort | writes 6.25s WAV | | Multi-voice (F1 / M1) | abort | both work | | Determinism (same seed × 2) | n/a | bit-identical | - `test-supertonic-rope-packed-qk`: 14 / 14 checks, `max_abs_err = 0.000e+00`. - CPU `ctest -L unit`: 12 / 12 tests, 0 regressions. Audio sanity on the exact QVAC-18966 reproduction command: 99.9% non-zero samples, rms=1406, abs_max=15984 — speech-like dynamics, not silence / clipping / garbage. Co-authored-by: Cursor <cursoragent@cursor.com>
ogad-tether
left a comment
There was a problem hiding this comment.
LGTM — approving.
Reviewed the full diff against the QVAC-18966 reproduction I posted in Slack. The root cause analysis is correct (matmul output is ne=[L, HD] channel-major-flat, not [HD, L] as the helper assumed), the fix is clean (head-of-pipeline ggml_cont(ggml_transpose) + parallel V transposes at the two GPU-bridge sites + host-bridge fallback download switches in lock-step), and the rewritten test pins both the input shape and the output-shape contract so a future regression of the same kind would be caught.
Particularly liked:
- The new
test_output_shape_contract()is the bit that would have caught this originally — pinning the public contract independent of the internal pipeline. - Pre-RoPE Q/K downloads stay on
tensor_to_time_channel(matmul output layout unchanged); only post-RoPE Q/K and post-transpose V switch totensor_raw_f32. The asymmetric documentation in the host-bridge call sites is clear about why. build_res_style_qkv_cachedeliberately untouched — the rationale (no GPU bridge for style on this branch, unlike the Vulkan branch) checks out.
One nit (non-blocking): the PR notes that OpenCL wasn't exercised on this branch's working tree, relying on Vulkan-branch equivalence. That's a reasonable claim since the helper uses universal ops only, but worth a manual OpenCL smoke before next release if someone has the hardware.
Unblocks #20 (sentence-aligned Supertonic streaming) for CPU — that PR's "CPU broken on this branch" known-limitation note can drop once this lands.
1e6a48d
into
tetherto:supertonic_optimizations
…sumption + voice cache threading + round-5 gap Pure docs / comments change. No production-logic surface modified. CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit` 25 / 25; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples, healthy rms). Addresses three reviewer asks on PR tetherto#18: 1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md). Adds an explicit "Note on the round 5 gap" section between round 4 and round 7 documenting that the round-4 plan reserved the name "Round 5 = pinned-host-buffer per-step uploads" as a placeholder, that the actual implementation was deferred behind round-7's bench observability prerequisite, and that it ultimately landed as round 12 tetherto#5. No code was dropped; round numbers stay contiguous so PR descriptions and CI logs match the round labels in this log without rebase churn. 2. UMA-bias assumption (supertonic_gguf.cpp — resolve_vulkan_device_index). Adds a long comment in the requested == -1 auto-pick branch documenting the assumption that is_uma_per_device[i] is sourced from ggml_backend_dev_get_props().type and the failure mode when a discrete adapter's driver mis-reports its type as _IGPU (some Thunderbolt eGPU configs; some ARM SoC dGPU paths). Three sub-cases enumerated: (a) discrete-only with mis-classification falls through to round-3 all-device argmax and still picks discrete by free-VRAM (coincidentally correct), (b) mixed UMA-iGPU + mis-classified-discrete picks iGPU silently (regression vs. round 3 — operator escape hatch: --vulkan-device N is UMA-agnostic and --vulkan-perf-logger exposes the choice). Future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be re-classified) tracked in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. 3. voice_host_cache threading model (supertonic_internal.h). Tightens the reference-stability docstring from "must NOT call clear() while holding the reference" to a full thread-safety section explicitly calling out single-threaded -per-Engine as the supported model (matches what the iOS load/unload race fix 36a2c56 enforces for s3gen). Explains why no internal lock today (cache exists to eliminate per -call GPU downloads; internal locking would give back the saving) and what a future thread-pool refactor must do (external mutex around get_or_load + downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Also clarifies the unordered_map guarantee: element references survive insert even when the table rehashes; only iterators are invalidated. Reviewer's fourth ask — "the round-11 fix is redone in PR tetherto#21" — was resolved by the rebase landing in this same branch state. After rebasing onto upstream/supertonic_optimizations (which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix), this branch's round-11 commit is a delta of only the 2 Vulkan-only V-transpose sites needed for round 8's front-block GPU bridge + round 9's style GPU bridge. No double-application; the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the new base. Co-authored-by: Cursor <cursoragent@cursor.com>
…sumption + voice cache threading + round-5 gap Pure docs / comments change. No production-logic surface modified. CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit` 25 / 25; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples, healthy rms). Addresses three reviewer asks on PR tetherto#18: 1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md). Adds an explicit "Note on the round 5 gap" section between round 4 and round 7 documenting that the round-4 plan reserved the name "Round 5 = pinned-host-buffer per-step uploads" as a placeholder, that the actual implementation was deferred behind round-7's bench observability prerequisite, and that it ultimately landed as round 12 tetherto#5. No code was dropped; round numbers stay contiguous so PR descriptions and CI logs match the round labels in this log without rebase churn. 2. UMA-bias assumption (supertonic_gguf.cpp — resolve_vulkan_device_index). Adds a long comment in the requested == -1 auto-pick branch documenting the assumption that is_uma_per_device[i] is sourced from ggml_backend_dev_get_props().type and the failure mode when a discrete adapter's driver mis-reports its type as _IGPU (some Thunderbolt eGPU configs; some ARM SoC dGPU paths). Three sub-cases enumerated: (a) discrete-only with mis-classification falls through to round-3 all-device argmax and still picks discrete by free-VRAM (coincidentally correct), (b) mixed UMA-iGPU + mis-classified-discrete picks iGPU silently (regression vs. round 3 — operator escape hatch: --vulkan-device N is UMA-agnostic and --vulkan-perf-logger exposes the choice). Future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be re-classified) tracked in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. 3. voice_host_cache threading model (supertonic_internal.h). Tightens the reference-stability docstring from "must NOT call clear() while holding the reference" to a full thread-safety section explicitly calling out single-threaded -per-Engine as the supported model (matches what the iOS load/unload race fix 36a2c56 enforces for s3gen). Explains why no internal lock today (cache exists to eliminate per -call GPU downloads; internal locking would give back the saving) and what a future thread-pool refactor must do (external mutex around get_or_load + downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Also clarifies the unordered_map guarantee: element references survive insert even when the table rehashes; only iterators are invalidated. Reviewer's fourth ask — "the round-11 fix is redone in PR tetherto#21" — was resolved by the rebase landing in this same branch state. After rebasing onto upstream/supertonic_optimizations (which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix), this branch's round-11 commit is a delta of only the 2 Vulkan-only V-transpose sites needed for round 8's front-block GPU bridge + round 9's style GPU bridge. No double-application; the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the new base. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Single-commit critical fix for the CPU regression on
supertonic_optimizationsreported in [QVAC-18966]. The packed-QK RoPE helper introduced by audit follow-up #5 (PR #16, commit5869231c) aborts on the first denoise step of every Supertonic synth run with--n-gpu-layers 0against any GGUF carryingvector_rope_theta:The regression is not on
master— it only affectssupertonic_optimizationsand branches derived from it. This PR is the one-commit fix; the rest of the branch is untouched.Root cause
apply_rope_to_packed_qk(PR #16 audit follow-up #5) was written under the assumption thatdense_matmul_time_ggmlreturns ane=[HD, L]tensor with channel-fastest-in-memory layout. In fact the matmul — both the CPUcblas_sgemmfast path (ggml_custom_4d(F32, x->ne[0] /* = L */, w->ne[0] /* = OC */, …)→[L, OC]) and the universalconv1d_f32(K=1)fallback (ggml_reshape_3d(result, im2col->ne[1] /* = L */, kernel->ne[2] /* = OC */, …)→ also[L, OC]) — producesne=[L, HD]with channel-major-flat memory (data[t + c*L]), the bit-exact transpose of the helper's input contract.The CPU unit test that landed alongside the helper (
test_supertonic_rope_packed_qk.cpp) hand-built Q under the wrong[HD, L]shape, matching the helper's internal layout assumption verbatim, so the failure mode was invisible to CI. The first end-to-end synth attempt against a GGUF withvector_rope_thetahits the defensive assertion.Fix (strict TDD)
test_supertonic_rope_packed_qk.cpprewritten under the production matmul shapene=[L, HD](channel-major-flat memory). Reference built in scalarapply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pinsy->ne[0] = HD, y->ne[1] = Lso the downstreamq_tc_inblit cannot regress on layout. Added an explicit output-shape contract test (test_output_shape_contract). Committed RED first, observed to abort at the same assertion the production crash hits.apply_rope_to_packed_qk(supertonic_internal.h): added a head-of-pipelineggml_cont(ggml_transpose(q))to flip fromne=[L, HD]channel-major-flat tone=[HD, L]time-major-flat (the layoutq_tc_inexpects). Rest of the pipeline (view-as-[D, H, L]+ cont +apply_rope_in_graph+ reshape) is unchanged. Outputne=[HD, L]time-major-flat bytes match scalarapply_rope's native layout ANDq_tc_in's blit target bit-for-bit. Universal-op only (ggml_transpose,ggml_cont,ggml_view_3d,ggml_reshape_2d); no new backend dispatch.V matmul layout fix at two GPU-bridge call sites (V has no RoPE to mask the layout flip):
build_group_graph_cache— per-group attention Vsupertonic_vector_trace_proj_ggml— front-block attention VOpen-coded the same
ggml_cont(ggml_transpose(...))at the matmul output so the GPU-bridgeggml_backend_tensor_copy(v_src, v_tc_in)lands bit-exact bytes inv_tc_in(ne=[HD, kv_len]time-major-flat). Style sq/sk/sv intentionally not touched — this branch (unlike the Vulkan branch) has no GPU bridge for style attention (sq_gpu/sk_gpu/sv_gpudon't exist here), so the existing host-vector path viatensor_to_time_channelis already correct and adding a transpose would break it.Legacy host-bridge fallbacks switched from
tensor_to_time_channel(<post-rope-or-v>)totensor_raw_f32(...)at:run_group_graph_cache:out.v,out.q_rope,out.k_ropesupertonic_vector_trace_proj_ggml:v_out,q_rotated,k_rotatedThe new graph-side layout puts the bytes already in the time-major-flat shape scalar
apply_rope/flash_attention_qkvhost references read, so the raw download is the correct call.tensor_to_time_channelwould now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently. Pre-RoPE Q/K downloads (stillne=[L, HD]channel-major-flat) stay ontensor_to_time_channel— correct as before.Test contract
Whole
ctest -L unit: 12 / 12 pass, 0 regressions, 0 flakes.Verification on the exact QVAC-18966 reproduction command
./build/supertonic-cli --model models/supertonic2.gguf \ --text "Hello world." --out /tmp/out.wav \ --voice F1 --n-gpu-layers 0 --seed 42wrote /tmp/out.wav (1.35s @ 44100 Hz, 59479 samples)md5summatch)Risks & mitigations
b54b7d43round 11) already shipped and validated end-to-end on Vulkan RTX 5090 + AMD RADV iGPU + Mesa lavapipe, with the deliberate exception ofbuild_res_style_qkv_cache(no GPU bridge for style on this branch). The helper itself uses universal-only ops, so no backend dispatch table change.q / k / q_rope / k_rope / vhost vectors. Pre-RoPE Q/K are still downloaded viatensor_to_time_channel(matmul output is unchanged). Post-RoPE Q/K and post-transpose V are downloaded viatensor_raw_f32— the new graph-side bytes are already in the time-major-flat shape every existing scalar-parity test reads (out[t*H*D + h*D + d]).vector_rope_thetaabsent).cache.apply_rope = falseshort-circuits the helper call entirely; the V transpose still runs but produces the same time-major-flat bytes the host bridge expects, just plumbed throughtensor_raw_f32instead oftensor_to_time_channel. No code path regresses.What this PR does NOT touch
masteris unaffected (regression isn't there).build_res_style_qkv_cache/run_res_style_qkv_cacheleft alone — see note above.#include <atomic>inchatterbox_tts.cppfrom upstream commit1819b681;#include <cstring>insupertonic_vector_estimator.cppfrom the Metal-port commit5403d10c) are not part of this commit. They live in the working tree only so the local build can compile; they should land via separate cherry-picks from master if needed.Commits in this PR
1 commit, 3 files changed, +366 / −129.
dfdbc924apply_rope_to_packed_qk+ V matmul transposes at the two GPU-bridge sites + host-bridge fallback download switches.Tested platforms
--n-gpu-layers 0(CPU). 12 / 12 unit tests pass; CLI end-to-end onmodels/supertonic2.ggufsucceeds.Other backends (OpenCL / Vulkan / Metal / CUDA) were not exercised on this branch's working tree (no local GPU available for this fix's verification); however the fix is byte-for-byte the same as what the Vulkan branch shipped after end-to-end validation on those backends, and the new ops in the helper are all universal — no backend can refuse them.