QVAC-18966 [TTS GGML] Fix CPU regression by Zbig9000 · Pull Request #21 · tetherto/qvac-ext-lib-whisper.cpp

Zbig9000 · 2026-05-15T15:59:14Z

Summary

Single-commit critical fix for the CPU regression on supertonic_optimizations reported in [QVAC-18966]. The packed-QK RoPE helper introduced by audit follow-up #5 (PR #16, commit 5869231c) aborts on the first denoise step of every Supertonic synth run with --n-gpu-layers 0 against any GGUF carrying vector_rope_theta:

supertonic_internal.h:742:
  GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
  in apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml
  → supertonic_vector_step_ggml → supertonic_vector_loop_ggml

The regression is not on master — it only affects supertonic_optimizations and branches derived from it. This PR is the one-commit fix; the rest of the branch is untouched.

Root cause

apply_rope_to_packed_qk (PR #16 audit follow-up #5) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] tensor with channel-fastest-in-memory layout. In fact the matmul — both the CPU cblas_sgemm fast path (ggml_custom_4d(F32, x->ne[0] /* = L */, w->ne[0] /* = OC */, …) → [L, OC]) and the universal conv1d_f32(K=1) fallback (ggml_reshape_3d(result, im2col->ne[1] /* = L */, kernel->ne[2] /* = OC */, …) → also [L, OC]) — produces ne=[L, HD] with channel-major-flat memory (data[t + c*L]), the bit-exact transpose of the helper's input contract.

The CPU unit test that landed alongside the helper (test_supertonic_rope_packed_qk.cpp) hand-built Q under the wrong [HD, L] shape, matching the helper's internal layout assumption verbatim, so the failure mode was invisible to CI. The first end-to-end synth attempt against a GGUF with vector_rope_theta hits the defensive assertion.

Fix (strict TDD)

test_supertonic_rope_packed_qk.cpp rewritten under the production matmul shape ne=[L, HD] (channel-major-flat memory). Reference built in scalar apply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins y->ne[0] = HD, y->ne[1] = L so the downstream q_tc_in blit cannot regress on layout. Added an explicit output-shape contract test (test_output_shape_contract). Committed RED first, observed to abort at the same assertion the production crash hits.
apply_rope_to_packed_qk (supertonic_internal.h): added a head-of-pipeline ggml_cont(ggml_transpose(q)) to flip from ne=[L, HD] channel-major-flat to ne=[HD, L] time-major-flat (the layout q_tc_in expects). Rest of the pipeline (view-as-[D, H, L] + cont + apply_rope_in_graph + reshape) is unchanged. Output ne=[HD, L] time-major-flat bytes match scalar apply_rope's native layout AND q_tc_in's blit target bit-for-bit. Universal-op only (ggml_transpose, ggml_cont, ggml_view_3d, ggml_reshape_2d); no new backend dispatch.
V matmul layout fix at two GPU-bridge call sites (V has no RoPE to mask the layout flip):
- build_group_graph_cache — per-group attention V
- supertonic_vector_trace_proj_ggml — front-block attention V
Open-coded the same ggml_cont(ggml_transpose(...)) at the matmul output so the GPU-bridge ggml_backend_tensor_copy(v_src, v_tc_in) lands bit-exact bytes in v_tc_in (ne=[HD, kv_len] time-major-flat). Style sq/sk/sv intentionally not touched — this branch (unlike the Vulkan branch) has no GPU bridge for style attention (sq_gpu / sk_gpu / sv_gpu don't exist here), so the existing host-vector path via tensor_to_time_channel is already correct and adding a transpose would break it.
Legacy host-bridge fallbacks switched from tensor_to_time_channel(<post-rope-or-v>) to tensor_raw_f32(...) at:
- run_group_graph_cache: out.v, out.q_rope, out.k_rope
- Front-block dispatch in supertonic_vector_trace_proj_ggml: v_out, q_rotated, k_rotated
The new graph-side layout puts the bytes already in the time-major-flat shape scalar apply_rope / flash_attention_qkv host references read, so the raw download is the correct call. tensor_to_time_channel would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently. Pre-RoPE Q/K downloads (still ne=[L, HD] channel-major-flat) stay on tensor_to_time_channel — correct as before.

Test contract

test-supertonic-rope-packed-qk
  test_production_layout("vector-estimator q", L=20, H=4, D=64)  → bit-exact
  test_production_layout("vector-estimator k", L=32, H=4, D=64)  → bit-exact
  test_production_layout_l1()                                    → bit-exact
  test_output_shape_contract()                                   → ne[0]=HD, ne[1]=L
14 / 14 checks pass, max_abs_err = 0.000e+00

Whole ctest -L unit: 12 / 12 pass, 0 regressions, 0 flakes.

Verification on the exact QVAC-18966 reproduction command

./build/supertonic-cli --model models/supertonic2.gguf \
  --text "Hello world." --out /tmp/out.wav \
  --voice F1 --n-gpu-layers 0 --seed 42

Check	Pre-fix	Post-fix
Short synth ("Hello world.")	abort on first step	`wrote /tmp/out.wav (1.35s @ 44100 Hz, 59479 samples)`
Long synth (~6 s of audio)	abort	works
Multi-voice (F1, M1)	abort	both work
Determinism (same seed × 2)	n/a	bit-identical (`md5sum` match)
Audio sanity (rms / abs_max / nonzero%)	n/a	rms=1406, abs_max=15984, 99.9% non-zero — speech-like dynamics

Risks & mitigations

Cross-backend safety. The four layout changes in this commit are byte-for-byte identical to what the Vulkan branch (b54b7d43 round 11) already shipped and validated end-to-end on Vulkan RTX 5090 + AMD RADV iGPU + Mesa lavapipe, with the deliberate exception of build_res_style_qkv_cache (no GPU bridge for style on this branch). The helper itself uses universal-only ops, so no backend dispatch table change.
Trace-mode contract preserved. Trace harnesses consume q / k / q_rope / k_rope / v host vectors. Pre-RoPE Q/K are still downloaded via tensor_to_time_channel (matmul output is unchanged). Post-RoPE Q/K and post-transpose V are downloaded via tensor_raw_f32 — the new graph-side bytes are already in the time-major-flat shape every existing scalar-parity test reads (out[t*H*D + h*D + d]).
Non-RoPE GGUFs (vector_rope_theta absent). cache.apply_rope = false short-circuits the helper call entirely; the V transpose still runs but produces the same time-major-flat bytes the host bridge expects, just plumbed through tensor_raw_f32 instead of tensor_to_time_channel. No code path regresses.
Per-group + front-block parity. The four affected call sites (front block + g1 + g2 + g3) all see the same helper output shape contract and the same V transpose, so the per-group ConvNeXt + style residual + tail chains downstream are byte-stable.

What this PR does NOT touch

master is unaffected (regression isn't there).
No new optimisation, no new probe, no new CLI flag, no new EngineOption.
build_res_style_qkv_cache / run_res_style_qkv_cache left alone — see note above.
The two unrelated pre-existing missing-include build fixes (#include <atomic> in chatterbox_tts.cpp from upstream commit 1819b681; #include <cstring> in supertonic_vector_estimator.cpp from the Metal-port commit 5403d10c) are not part of this commit. They live in the working tree only so the local build can compile; they should land via separate cherry-picks from master if needed.

Commits in this PR

1 commit, 3 files changed, +366 / −129.

#	Commit	Theme
1	`dfdbc924`	QVAC-18966 fix: head-of-pipeline transpose in `apply_rope_to_packed_qk` + V matmul transposes at the two GPU-bridge sites + host-bridge fallback download switches.

Tested platforms

Linux x86-64, GCC build, --n-gpu-layers 0 (CPU). 12 / 12 unit tests pass; CLI end-to-end on models/supertonic2.gguf succeeds.

Other backends (OpenCL / Vulkan / Metal / CUDA) were not exercised on this branch's working tree (no local GPU available for this fix's verification); however the fix is byte-for-byte the same as what the Vulkan branch shipped after end-to-end validation on those backends, and the new ops in the helper are all universal — no backend can refuse them.

`apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5) was written assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`. In fact the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)` fallback) produces `ne=[L, HD]` with channel-major-flat memory (`data[t + c*L]`) — the bit-exact transpose of the helper's input contract. Every CPU synth with `--n-gpu-layers 0` against a GGUF carrying `vector_rope_theta` aborts at the helper's defensive assertion on the first denoise step: supertonic_internal.h:742: GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml → supertonic_vector_step_ggml → supertonic_vector_loop_ggml The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI. Fix (strict TDD): 1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]`. Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits. 2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit. 3. V has no RoPE to mask the layout flip — open-code the same `ggml_cont(ggml_transpose(...))` at the V matmul output in `build_group_graph_cache` and the front-block path in `supertonic_vector_trace_proj_ggml` so the GPU-bridge `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact bytes. Style sq/sk/sv left untouched — this branch has no GPU bridge for style attention, so the host-vector path via `tensor_to_time_channel` is already correct. 4. Legacy host-bridge downloads of post-RoPE Q/K and post-transpose V switched from `tensor_to_time_channel` to `tensor_raw_f32`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently. Verification: | Backend | Pre-fix | Post-fix | |---|---|---| | CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV | | CPU long-text synth | abort | writes 6.25s WAV | | Multi-voice (F1 / M1) | abort | both work | | Determinism (same seed × 2) | n/a | bit-identical | - `test-supertonic-rope-packed-qk`: 14 / 14 checks, `max_abs_err = 0.000e+00`. - CPU `ctest -L unit`: 12 / 12 tests, 0 regressions. Audio sanity on the exact QVAC-18966 reproduction command: 99.9% non-zero samples, rms=1406, abs_max=15984 — speech-like dynamics, not silence / clipping / garbage. Co-authored-by: Cursor <cursoragent@cursor.com>

ogad-tether

LGTM — approving.

Reviewed the full diff against the QVAC-18966 reproduction I posted in Slack. The root cause analysis is correct (matmul output is ne=[L, HD] channel-major-flat, not [HD, L] as the helper assumed), the fix is clean (head-of-pipeline ggml_cont(ggml_transpose) + parallel V transposes at the two GPU-bridge sites + host-bridge fallback download switches in lock-step), and the rewritten test pins both the input shape and the output-shape contract so a future regression of the same kind would be caught.

Particularly liked:

The new test_output_shape_contract() is the bit that would have caught this originally — pinning the public contract independent of the internal pipeline.
Pre-RoPE Q/K downloads stay on tensor_to_time_channel (matmul output layout unchanged); only post-RoPE Q/K and post-transpose V switch to tensor_raw_f32. The asymmetric documentation in the host-bridge call sites is clear about why.
build_res_style_qkv_cache deliberately untouched — the rationale (no GPU bridge for style on this branch, unlike the Vulkan branch) checks out.

One nit (non-blocking): the PR notes that OpenCL wasn't exercised on this branch's working tree, relying on Vulkan-branch equivalence. That's a reasonable claim since the helper uses universal ops only, but worth a manual OpenCL smoke before next release if someone has the hardware.

Unblocks #20 (sentence-aligned Supertonic streaming) for CPU — that PR's "CPU broken on this branch" known-limitation note can drop once this lands.

…sumption + voice cache threading + round-5 gap Pure docs / comments change. No production-logic surface modified. CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit` 25 / 25; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples, healthy rms). Addresses three reviewer asks on PR tetherto#18: 1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md). Adds an explicit "Note on the round 5 gap" section between round 4 and round 7 documenting that the round-4 plan reserved the name "Round 5 = pinned-host-buffer per-step uploads" as a placeholder, that the actual implementation was deferred behind round-7's bench observability prerequisite, and that it ultimately landed as round 12 tetherto#5. No code was dropped; round numbers stay contiguous so PR descriptions and CI logs match the round labels in this log without rebase churn. 2. UMA-bias assumption (supertonic_gguf.cpp — resolve_vulkan_device_index). Adds a long comment in the requested == -1 auto-pick branch documenting the assumption that is_uma_per_device[i] is sourced from ggml_backend_dev_get_props().type and the failure mode when a discrete adapter's driver mis-reports its type as _IGPU (some Thunderbolt eGPU configs; some ARM SoC dGPU paths). Three sub-cases enumerated: (a) discrete-only with mis-classification falls through to round-3 all-device argmax and still picks discrete by free-VRAM (coincidentally correct), (b) mixed UMA-iGPU + mis-classified-discrete picks iGPU silently (regression vs. round 3 — operator escape hatch: --vulkan-device N is UMA-agnostic and --vulkan-perf-logger exposes the choice). Future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be re-classified) tracked in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. 3. voice_host_cache threading model (supertonic_internal.h). Tightens the reference-stability docstring from "must NOT call clear() while holding the reference" to a full thread-safety section explicitly calling out single-threaded -per-Engine as the supported model (matches what the iOS load/unload race fix 36a2c56 enforces for s3gen). Explains why no internal lock today (cache exists to eliminate per -call GPU downloads; internal locking would give back the saving) and what a future thread-pool refactor must do (external mutex around get_or_load + downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Also clarifies the unordered_map guarantee: element references survive insert even when the table rehashes; only iterators are invalidated. Reviewer's fourth ask — "the round-11 fix is redone in PR tetherto#21" — was resolved by the rebase landing in this same branch state. After rebasing onto upstream/supertonic_optimizations (which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix), this branch's round-11 commit is a delta of only the 2 Vulkan-only V-transpose sites needed for round 8's front-block GPU bridge + round 9's style GPU bridge. No double-application; the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the new base. Co-authored-by: Cursor <cursoragent@cursor.com>

Zbig9000 requested review from GustavoA1604, freddy311082, ishanvohra2 and ogad-tether May 15, 2026 15:59

Zbig9000 requested review from a team as code owners May 15, 2026 15:59

freddy311082 approved these changes May 15, 2026

View reviewed changes

GustavoA1604 approved these changes May 15, 2026

View reviewed changes

GustavoA1604 mentioned this pull request May 15, 2026

Supertonic optimizations qvac 18605 tts ggml add and optimize vulkan for supertonic #18

Merged

ogad-tether approved these changes May 15, 2026

View reviewed changes

GustavoA1604 merged commit 1e6a48d into tetherto:supertonic_optimizations May 15, 2026
59 of 66 checks passed

ogad-tether mentioned this pull request May 18, 2026

tts-cpp: supertonic Engine streaming via multilingual chunker + callback #20

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QVAC-18966 [TTS GGML] Fix CPU regression#21

QVAC-18966 [TTS GGML] Fix CPU regression#21
GustavoA1604 merged 1 commit into
tetherto:supertonic_optimizationsfrom
Zbig9000:supertonic_optimizations-QVAC-18966-TTS-GGML-Fix-CPU-regression

Zbig9000 commented May 15, 2026

Uh oh!

ogad-tether left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Zbig9000 commented May 15, 2026

Summary

Root cause

Fix (strict TDD)

Test contract

Verification on the exact QVAC-18966 reproduction command

Risks & mitigations

What this PR does NOT touch

Commits in this PR

Tested platforms

Uh oh!

ogad-tether left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants