Skip to content

QVAC-18966 [TTS GGML] Fix CPU regression#21

Merged
GustavoA1604 merged 1 commit into
tetherto:supertonic_optimizationsfrom
Zbig9000:supertonic_optimizations-QVAC-18966-TTS-GGML-Fix-CPU-regression
May 15, 2026
Merged

QVAC-18966 [TTS GGML] Fix CPU regression#21
GustavoA1604 merged 1 commit into
tetherto:supertonic_optimizationsfrom
Zbig9000:supertonic_optimizations-QVAC-18966-TTS-GGML-Fix-CPU-regression

Conversation

@Zbig9000

Copy link
Copy Markdown

Summary

Single-commit critical fix for the CPU regression on supertonic_optimizations reported in [QVAC-18966]. The packed-QK RoPE helper introduced by audit follow-up #5 (PR #16, commit 5869231c) aborts on the first denoise step of every Supertonic synth run with --n-gpu-layers 0 against any GGUF carrying vector_rope_theta:

supertonic_internal.h:742:
  GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
  in apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml
  → supertonic_vector_step_ggml → supertonic_vector_loop_ggml

The regression is not on master — it only affects supertonic_optimizations and branches derived from it. This PR is the one-commit fix; the rest of the branch is untouched.

Root cause

apply_rope_to_packed_qk (PR #16 audit follow-up #5) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] tensor with channel-fastest-in-memory layout. In fact the matmul — both the CPU cblas_sgemm fast path (ggml_custom_4d(F32, x->ne[0] /* = L */, w->ne[0] /* = OC */, …)[L, OC]) and the universal conv1d_f32(K=1) fallback (ggml_reshape_3d(result, im2col->ne[1] /* = L */, kernel->ne[2] /* = OC */, …) → also [L, OC]) — produces ne=[L, HD] with channel-major-flat memory (data[t + c*L]), the bit-exact transpose of the helper's input contract.

The CPU unit test that landed alongside the helper (test_supertonic_rope_packed_qk.cpp) hand-built Q under the wrong [HD, L] shape, matching the helper's internal layout assumption verbatim, so the failure mode was invisible to CI. The first end-to-end synth attempt against a GGUF with vector_rope_theta hits the defensive assertion.

Fix (strict TDD)

  1. test_supertonic_rope_packed_qk.cpp rewritten under the production matmul shape ne=[L, HD] (channel-major-flat memory). Reference built in scalar apply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins y->ne[0] = HD, y->ne[1] = L so the downstream q_tc_in blit cannot regress on layout. Added an explicit output-shape contract test (test_output_shape_contract). Committed RED first, observed to abort at the same assertion the production crash hits.

  2. apply_rope_to_packed_qk (supertonic_internal.h): added a head-of-pipeline ggml_cont(ggml_transpose(q)) to flip from ne=[L, HD] channel-major-flat to ne=[HD, L] time-major-flat (the layout q_tc_in expects). Rest of the pipeline (view-as-[D, H, L] + cont + apply_rope_in_graph + reshape) is unchanged. Output ne=[HD, L] time-major-flat bytes match scalar apply_rope's native layout AND q_tc_in's blit target bit-for-bit. Universal-op only (ggml_transpose, ggml_cont, ggml_view_3d, ggml_reshape_2d); no new backend dispatch.

  3. V matmul layout fix at two GPU-bridge call sites (V has no RoPE to mask the layout flip):

    • build_group_graph_cache — per-group attention V
    • supertonic_vector_trace_proj_ggml — front-block attention V

    Open-coded the same ggml_cont(ggml_transpose(...)) at the matmul output so the GPU-bridge ggml_backend_tensor_copy(v_src, v_tc_in) lands bit-exact bytes in v_tc_in (ne=[HD, kv_len] time-major-flat). Style sq/sk/sv intentionally not touched — this branch (unlike the Vulkan branch) has no GPU bridge for style attention (sq_gpu / sk_gpu / sv_gpu don't exist here), so the existing host-vector path via tensor_to_time_channel is already correct and adding a transpose would break it.

  4. Legacy host-bridge fallbacks switched from tensor_to_time_channel(<post-rope-or-v>) to tensor_raw_f32(...) at:

    • run_group_graph_cache: out.v, out.q_rope, out.k_rope
    • Front-block dispatch in supertonic_vector_trace_proj_ggml: v_out, q_rotated, k_rotated

    The new graph-side layout puts the bytes already in the time-major-flat shape scalar apply_rope / flash_attention_qkv host references read, so the raw download is the correct call. tensor_to_time_channel would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently. Pre-RoPE Q/K downloads (still ne=[L, HD] channel-major-flat) stay on tensor_to_time_channel — correct as before.

Test contract

test-supertonic-rope-packed-qk
  test_production_layout("vector-estimator q", L=20, H=4, D=64)  → bit-exact
  test_production_layout("vector-estimator k", L=32, H=4, D=64)  → bit-exact
  test_production_layout_l1()                                    → bit-exact
  test_output_shape_contract()                                   → ne[0]=HD, ne[1]=L
14 / 14 checks pass, max_abs_err = 0.000e+00

Whole ctest -L unit: 12 / 12 pass, 0 regressions, 0 flakes.

Verification on the exact QVAC-18966 reproduction command

./build/supertonic-cli --model models/supertonic2.gguf \
  --text "Hello world." --out /tmp/out.wav \
  --voice F1 --n-gpu-layers 0 --seed 42
Check Pre-fix Post-fix
Short synth ("Hello world.") abort on first step wrote /tmp/out.wav (1.35s @ 44100 Hz, 59479 samples)
Long synth (~6 s of audio) abort works
Multi-voice (F1, M1) abort both work
Determinism (same seed × 2) n/a bit-identical (md5sum match)
Audio sanity (rms / abs_max / nonzero%) n/a rms=1406, abs_max=15984, 99.9% non-zero — speech-like dynamics

Risks & mitigations

  • Cross-backend safety. The four layout changes in this commit are byte-for-byte identical to what the Vulkan branch (b54b7d43 round 11) already shipped and validated end-to-end on Vulkan RTX 5090 + AMD RADV iGPU + Mesa lavapipe, with the deliberate exception of build_res_style_qkv_cache (no GPU bridge for style on this branch). The helper itself uses universal-only ops, so no backend dispatch table change.
  • Trace-mode contract preserved. Trace harnesses consume q / k / q_rope / k_rope / v host vectors. Pre-RoPE Q/K are still downloaded via tensor_to_time_channel (matmul output is unchanged). Post-RoPE Q/K and post-transpose V are downloaded via tensor_raw_f32 — the new graph-side bytes are already in the time-major-flat shape every existing scalar-parity test reads (out[t*H*D + h*D + d]).
  • Non-RoPE GGUFs (vector_rope_theta absent). cache.apply_rope = false short-circuits the helper call entirely; the V transpose still runs but produces the same time-major-flat bytes the host bridge expects, just plumbed through tensor_raw_f32 instead of tensor_to_time_channel. No code path regresses.
  • Per-group + front-block parity. The four affected call sites (front block + g1 + g2 + g3) all see the same helper output shape contract and the same V transpose, so the per-group ConvNeXt + style residual + tail chains downstream are byte-stable.

What this PR does NOT touch

  • master is unaffected (regression isn't there).
  • No new optimisation, no new probe, no new CLI flag, no new EngineOption.
  • build_res_style_qkv_cache / run_res_style_qkv_cache left alone — see note above.
  • The two unrelated pre-existing missing-include build fixes (#include <atomic> in chatterbox_tts.cpp from upstream commit 1819b681; #include <cstring> in supertonic_vector_estimator.cpp from the Metal-port commit 5403d10c) are not part of this commit. They live in the working tree only so the local build can compile; they should land via separate cherry-picks from master if needed.

Commits in this PR

1 commit, 3 files changed, +366 / −129.

# Commit Theme
1 dfdbc924 QVAC-18966 fix: head-of-pipeline transpose in apply_rope_to_packed_qk + V matmul transposes at the two GPU-bridge sites + host-bridge fallback download switches.

Tested platforms

  • Linux x86-64, GCC build, --n-gpu-layers 0 (CPU). 12 / 12 unit tests pass; CLI end-to-end on models/supertonic2.gguf succeeds.

Other backends (OpenCL / Vulkan / Metal / CUDA) were not exercised on this branch's working tree (no local GPU available for this fix's verification); however the fix is byte-for-byte the same as what the Vulkan branch shipped after end-to-end validation on those backends, and the new ops in the helper are all universal — no backend can refuse them.

`apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5) was written
assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`.  In fact
the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)`
fallback) produces `ne=[L, HD]` with channel-major-flat memory
(`data[t + c*L]`) — the bit-exact transpose of the helper's
input contract.  Every CPU synth with `--n-gpu-layers 0` against
a GGUF carrying `vector_rope_theta` aborts at the helper's
defensive assertion on the first denoise step:

  supertonic_internal.h:742:
    GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
  apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml
  → supertonic_vector_step_ggml → supertonic_vector_loop_ggml

The CPU unit test that landed alongside the helper hand-built
Q under the wrong `[HD, L]` shape, so the failure mode was
invisible to CI.

Fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]`.  Reference built in
   scalar `apply_rope`'s native time-major-flat layout; test
   verifies the helper's output bytes match bit-for-bit AND
   pins `y->ne[0] = HD, y->ne[1] = L` so the downstream
   `q_tc_in` blit cannot regress on layout.  Committed RED
   first, observed to abort at the same assertion the
   production crash hits.

2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip
   from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]`
   time-major-flat (the layout `q_tc_in` expects).  Rest of
   the pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V has no RoPE to mask the layout flip — open-code the same
   `ggml_cont(ggml_transpose(...))` at the V matmul output in
   `build_group_graph_cache` and the front-block path in
   `supertonic_vector_trace_proj_ggml` so the GPU-bridge
   `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact
   bytes.  Style sq/sk/sv left untouched — this branch has no
   GPU bridge for style attention, so the host-vector path
   via `tensor_to_time_channel` is already correct.

4. Legacy host-bridge downloads of post-RoPE Q/K and
   post-transpose V switched from `tensor_to_time_channel` to
   `tensor_raw_f32`.  The new graph-side layout puts the bytes
   already in the time-major-flat shape scalar `apply_rope` /
   `flash_attention_qkv` host references read, so the raw
   download is the correct call; `tensor_to_time_channel`
   would apply the transpose-of-the-transpose and feed
   wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV |
| CPU long-text synth | abort | writes 6.25s WAV |
| Multi-voice (F1 / M1) | abort | both work |
| Determinism (same seed × 2) | n/a | bit-identical |

- `test-supertonic-rope-packed-qk`: 14 / 14 checks,
  `max_abs_err = 0.000e+00`.
- CPU `ctest -L unit`: 12 / 12 tests, 0 regressions.

Audio sanity on the exact QVAC-18966 reproduction command:
99.9% non-zero samples, rms=1406, abs_max=15984 — speech-like
dynamics, not silence / clipping / garbage.

Co-authored-by: Cursor <cursoragent@cursor.com>

@ogad-tether ogad-tether left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — approving.

Reviewed the full diff against the QVAC-18966 reproduction I posted in Slack. The root cause analysis is correct (matmul output is ne=[L, HD] channel-major-flat, not [HD, L] as the helper assumed), the fix is clean (head-of-pipeline ggml_cont(ggml_transpose) + parallel V transposes at the two GPU-bridge sites + host-bridge fallback download switches in lock-step), and the rewritten test pins both the input shape and the output-shape contract so a future regression of the same kind would be caught.

Particularly liked:

  • The new test_output_shape_contract() is the bit that would have caught this originally — pinning the public contract independent of the internal pipeline.
  • Pre-RoPE Q/K downloads stay on tensor_to_time_channel (matmul output layout unchanged); only post-RoPE Q/K and post-transpose V switch to tensor_raw_f32. The asymmetric documentation in the host-bridge call sites is clear about why.
  • build_res_style_qkv_cache deliberately untouched — the rationale (no GPU bridge for style on this branch, unlike the Vulkan branch) checks out.

One nit (non-blocking): the PR notes that OpenCL wasn't exercised on this branch's working tree, relying on Vulkan-branch equivalence. That's a reasonable claim since the helper uses universal ops only, but worth a manual OpenCL smoke before next release if someone has the hardware.

Unblocks #20 (sentence-aligned Supertonic streaming) for CPU — that PR's "CPU broken on this branch" known-limitation note can drop once this lands.

@GustavoA1604 GustavoA1604 merged commit 1e6a48d into tetherto:supertonic_optimizations May 15, 2026
59 of 66 checks passed
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…sumption + voice cache threading + round-5 gap

Pure docs / comments change.  No production-logic surface
modified.  CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit`
25 / 25; CPU + Vulkan end-to-end synth produce valid speech
WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR tetherto#18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md).
   Adds an explicit "Note on the round 5 gap" section between
   round 4 and round 7 documenting that the round-4 plan
   reserved the name "Round 5 = pinned-host-buffer per-step
   uploads" as a placeholder, that the actual implementation
   was deferred behind round-7's bench observability
   prerequisite, and that it ultimately landed as round 12 tetherto#5.
   No code was dropped; round numbers stay contiguous so PR
   descriptions and CI logs match the round labels in this log
   without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp —
   resolve_vulkan_device_index).  Adds a long comment in the
   requested == -1 auto-pick branch documenting the assumption
   that is_uma_per_device[i] is sourced from
   ggml_backend_dev_get_props().type and the failure mode when
   a discrete adapter's driver mis-reports its type as _IGPU
   (some Thunderbolt eGPU configs; some ARM SoC dGPU paths).
   Three sub-cases enumerated: (a) discrete-only with
   mis-classification falls through to round-3 all-device
   argmax and still picks discrete by free-VRAM (coincidentally
   correct), (b) mixed UMA-iGPU + mis-classified-discrete picks
   iGPU silently (regression vs. round 3 — operator escape
   hatch: --vulkan-device N is UMA-agnostic and
   --vulkan-perf-logger exposes the choice).  Future-work
   pointer to a "free-VRAM ceiling" heuristic (UMA reports
   system-RAM-scale; a discrete reporting > 256 GB is
   implausible and can be re-classified) tracked in
   aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h).
   Tightens the reference-stability docstring from "must NOT
   call clear() while holding the reference" to a full
   thread-safety section explicitly calling out single-threaded
   -per-Engine as the supported model (matches what the iOS
   load/unload race fix 36a2c56 enforces for s3gen).  Explains
   why no internal lock today (cache exists to eliminate per
   -call GPU downloads; internal locking would give back the
   saving) and what a future thread-pool refactor must do
   (external mutex around get_or_load + downstream .data()
   capture, OR switch to a std::shared_mutex-guarded internal
   lock).  Also clarifies the unordered_map guarantee: element
   references survive insert even when the table rehashes;
   only iterators are invalidated.

Reviewer's fourth ask — "the round-11 fix is redone in PR
tetherto#21" — was resolved by the rebase landing in this same branch
state.  After rebasing onto upstream/supertonic_optimizations
(which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix),
this branch's round-11 commit is a delta of only the 2
Vulkan-only V-transpose sites needed for round 8's front-block
GPU bridge + round 9's style GPU bridge.  No double-application;
the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the
new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
…sumption + voice cache threading + round-5 gap

Pure docs / comments change.  No production-logic surface
modified.  CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit`
25 / 25; CPU + Vulkan end-to-end synth produce valid speech
WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR tetherto#18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md).
   Adds an explicit "Note on the round 5 gap" section between
   round 4 and round 7 documenting that the round-4 plan
   reserved the name "Round 5 = pinned-host-buffer per-step
   uploads" as a placeholder, that the actual implementation
   was deferred behind round-7's bench observability
   prerequisite, and that it ultimately landed as round 12 tetherto#5.
   No code was dropped; round numbers stay contiguous so PR
   descriptions and CI logs match the round labels in this log
   without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp —
   resolve_vulkan_device_index).  Adds a long comment in the
   requested == -1 auto-pick branch documenting the assumption
   that is_uma_per_device[i] is sourced from
   ggml_backend_dev_get_props().type and the failure mode when
   a discrete adapter's driver mis-reports its type as _IGPU
   (some Thunderbolt eGPU configs; some ARM SoC dGPU paths).
   Three sub-cases enumerated: (a) discrete-only with
   mis-classification falls through to round-3 all-device
   argmax and still picks discrete by free-VRAM (coincidentally
   correct), (b) mixed UMA-iGPU + mis-classified-discrete picks
   iGPU silently (regression vs. round 3 — operator escape
   hatch: --vulkan-device N is UMA-agnostic and
   --vulkan-perf-logger exposes the choice).  Future-work
   pointer to a "free-VRAM ceiling" heuristic (UMA reports
   system-RAM-scale; a discrete reporting > 256 GB is
   implausible and can be re-classified) tracked in
   aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h).
   Tightens the reference-stability docstring from "must NOT
   call clear() while holding the reference" to a full
   thread-safety section explicitly calling out single-threaded
   -per-Engine as the supported model (matches what the iOS
   load/unload race fix 36a2c56 enforces for s3gen).  Explains
   why no internal lock today (cache exists to eliminate per
   -call GPU downloads; internal locking would give back the
   saving) and what a future thread-pool refactor must do
   (external mutex around get_or_load + downstream .data()
   capture, OR switch to a std::shared_mutex-guarded internal
   lock).  Also clarifies the unordered_map guarantee: element
   references survive insert even when the table rehashes;
   only iterators are invalidated.

Reviewer's fourth ask — "the round-11 fix is redone in PR
tetherto#21" — was resolved by the rebase landing in this same branch
state.  After rebasing onto upstream/supertonic_optimizations
(which now contains PR tetherto#21's QVAC-18966 narrower 2-site fix),
this branch's round-11 commit is a delta of only the 2
Vulkan-only V-transpose sites needed for round 8's front-block
GPU bridge + round 9's style GPU bridge.  No double-application;
the QVAC-18966 fix is applied exactly once via PR tetherto#21 in the
new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants