Qvac 18605 tts ggml add and optimize vulkan for supertonic by Zbig9000 · Pull Request #17 · tetherto/qvac-ext-lib-whisper.cpp

Zbig9000 · 2026-05-12T14:09:43Z

Summary

Brings the Supertonic TTS stage of tts-cpp to functional + tunable parity on the Vulkan backend, layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16). The audit-driven optimisations from #16 are backend-portable by construction, so Vulkan inherits all ~280 host↔GPU sync-point eliminations + the F16-weight roster + the in-graph RoPE / ConvNeXt fusion / GPU↔GPU blit work without modification. This PR adds eleven rounds of Vulkan-specific deltas — each round committed test-first (TDD) with a CPU-only unit gate that locks in the dispatch + capability contract for future regressions.

Rounds 1–6 are dispatch + capability infrastructure (probes, flags, multi-device auto-pick, deny-list, multi-dtype K/V). Rounds 7–10 are observability + per-step sync-point elimination on the GPU bridges. Round 11 is a critical correctness fix that turns the prior 10 rounds from "passes CI" into "actually runs end-to-end on every Vulkan adapter we have." Without round 11, every prior round hit a latent assertion failure during the first real synth call.

Scope vs. PR #16: this PR sits on top of the OpenCL branch (QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic). All Vulkan-specific deltas are restated here; the OpenCL audit work is not. The optimisations layer cleanly because the audit hits the GGML graph layer (backend-portable by construction); Vulkan inherits the wins automatically.

End-to-end validation (on real hardware)

Tested on three Vulkan adapters in one machine — the gold-standard hybrid dev-rig setup:

Adapter	Driver	Result	Per-synth (5-step denoise)
NVIDIA RTX 5090 (discrete, KHR_coopmat, FP16, no BF16)	NVIDIA 590.48.01, Vulkan 1.4.325	✅ 6.53s WAV	44 ms total, 74× realtime short prompt / 76 ms, 123× realtime long prompt
AMD Ryzen 9 9950X3D iGPU (UMA, RADV, FP16)	Mesa 25.2.8 RADV, Vulkan 1.4.318	✅ 3.64s WAV	178 ms total, 7× realtime
Mesa lavapipe (CPU-Vulkan correctness baseline)	Mesa 25.2.8 lavapipe (LLVM 20.1.2)	✅ 1.21s WAV	— (correctness baseline only)
CPU baseline (16-thread Ryzen 9 9950X3D)	—	✅ 3.89s WAV	121 ms total, 10× realtime

RTX 5090 per-step breakdown (median over 5 runs, F16 K/V default, post-prewarm):

preprocess             med=  0.00  ms
duration               med=  0.97  ms
text_encoder           med=  2.94  ms
vector_estimator       med= 37.70  ms (5 steps)
  vector_step[0]       med=  7.44  ms   (cold pipeline)
  vector_step[1..4]    med=  7.01–7.05  ms   (steady state)
vocoder                med=  2.47  ms
total                  med= 44.08  ms

The round-3/4/7/8/9/10 wins are all in those numbers: the --prewarm path (API from round 2, bench wiring from round 7) hides the ~2.3s cold shader-compile, and rounds 8/9/10 eliminate ~166 sync points/synth, so the steady-state per-step time is dominated by actual compute rather than host↔GPU bookkeeping.

Net new surface (against PR #16):

Category	Delta
Vulkan-specific commits	11 (rounds 1–11)
New backend-capability probes	6 (`native_leaky_relu`, `f16_kv_flash_attn`, `f16_mul_mat`, `q8_0_kv_flash_attn`, `bf16_kv_flash_attn`, `pinned_host_buffer`)
New thread-local dispatch flags	2 (`use_native_leaky_relu`, `kv_attn_type`) — joins the round-1 `use_f16_attn`
New `EngineOptions` knobs	8 (`vulkan_device`, `prewarm_text`, `f16_weights_deny_list`, `kv_attn_type` + 4 Vulkan env-var passthroughs)
New CLI flags (× 3 binaries)	`--vulkan-device`, `--prewarm`, `--f16-weights-deny`, `--kv-attn-type`, `--vulkan-prefer-host-memory`, `--vulkan-disable-coopmat2`, `--vulkan-disable-bfloat16`, `--vulkan-perf-logger`, `--vulkan-async-transfer`, `--vulkan-env KEY=VALUE`, `--bench-per-step`, `--bench-sync`, `--json-out`
New unit tests (`ctest -L unit`)	9 new + 3 extended (vulkan-dispatch, capability-cache, warm-up-api, vulkan-device-select, f16-deny-list-api, kv-attn-type, kv-attn-type-api, vulkan-env-overrides, upload-skip-tracker; rope-packed-qk rewritten for correct contract)
Whole `ctest -L unit`	22 / 22 PASS, 0 regressions, 0 flakes (CPU build + Vulkan build)
Sync-points eliminated per synth (vs. PR #16 baseline)	~166 (30 from round 8 + 120 from round 9 + 16 from round 10)

Investigation methodology (TDD throughout)

Every round followed the same workflow:

Audit: identify a Vulkan-specific gap (capability probe, multi-GPU support, drift recovery, per-step sync hotspot, observability gap, etc.).
Test first: write the CPU-only unit gate that pins the new contract (resolver behaviour matrix, API surface, parity bound, layout contract). Commit + observe failure on the missing symbol (compile error or assertion).
Implement: minimal-surgery production change. Pure-logic helpers split out so the policy is testable on CPU without a Vulkan device.
Re-run: every new test + every existing test must pass before commit.
Update PROGRESS_SUPERTONIC.md + commit.

The CPU-only test strategy is deliberate: a fresh checkout's ctest exercises the dispatch + capability + resolver contracts without needing a Vulkan adapter, so CI on a CPU-only runner catches regressions in the policy layer.

Commit-by-commit walkthrough

`33fd5c34` — Round 1: Vulkan bring-up

Foundational Vulkan dispatch + capability probing. The OpenCL bring-up (#16) used model.use_f16_attn = !backend_is_cpu because the chatterbox OpenCL patch unconditionally accepts the F16-K/V op; on Vulkan the HSK % 8 == 0 supports_op gate has to be respected, so the auto-policy needs a probe.

Two new supertonic_model flags populated at GGUF load: backend_is_vk (informational; appended to the backend-description string) and use_native_leaky_relu (resolved via ggml_backend_supports_op(LEAKY_RELU) against a synthetic node).
New backend-capability probe supertonic_backend_supports_f16_kv_flash_attn gates the use_f16_attn auto-policy.
EngineOptions::vulkan_device int + --vulkan-device N CLI flag plumbed through all three binaries. Range-checked at load (out-of-range = hard error).
Verbose mode + bench output append ggml_backend_vk_get_device_description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
New CPU-only TDD harness test-supertonic-vulkan-dispatch (29 checks).

`d080a1e4` — Pre-existing missing-include fix

tts-cpp/src/chatterbox_tts.cpp used std::atomic<int> without #include <atomic>. One-line fix kept as a separate commit so it's trivially revertable.

`e09d4278` — Round 2: capability-cache + 3 probes + prewarm

Process-wide cached_backend_capabilities map keyed by ggml_backend_t, guarded by a single std::mutex. Eliminates 3× redundant probe calls per backend.
3 new probes: supertonic_backend_supports_f16_mul_mat (gates use_f16_weights auto-policy), supertonic_backend_supports_q8_0_kv_flash_attn (forward-compat), supertonic_backend_supports_native_leaky_relu (wraps round 1).
Engine::warm_up(text) API + EngineOptions::prewarm_text + --prewarm TEXT CLI. Runs one throwaway synth at engine construction so the Vulkan / OpenCL shader pipelines compile up-front; operator-visible first synthesize() hits steady-state latency. No-op on CPU.
New tests: test-supertonic-capability-cache, test-supertonic-warm-up-api.
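The capability cache above can be pictured with a minimal sketch. Everything here is illustrative: `backend_caps_sketch`, `run_probes_sketch`, and the `const void *` stand-in for `ggml_backend_t` are assumptions, not the names or layout used in `supertonic_gguf.cpp`. The probe counter mirrors the idea of a probe-counter regression test.

```cpp
#include <cassert>
#include <map>
#include <mutex>

// Hypothetical capability record; the real struct's fields are not shown in this PR.
struct backend_caps_sketch {
    bool f16_kv_flash_attn = false;
    bool f16_mul_mat       = false;
    bool native_leaky_relu = false;
};

// Stand-in for ggml_backend_t, which is an opaque handle in ggml.
using backend_handle = const void *;

// Counts how often the expensive probe pass ran, so a test can pin "once per backend".
inline int & probe_call_count() {
    static int n = 0;
    return n;
}

// Placeholder for the real probes, which would query the backend's supports_op.
inline backend_caps_sketch run_probes_sketch(backend_handle) {
    ++probe_call_count();
    backend_caps_sketch caps;
    caps.f16_kv_flash_attn = true;
    caps.f16_mul_mat       = true;
    return caps;
}

// Process-wide cache: one probe pass per backend handle, guarded by a single mutex.
inline const backend_caps_sketch & cached_caps_sketch(backend_handle backend) {
    assert(backend != nullptr);
    static std::mutex mtx;
    static std::map<backend_handle, backend_caps_sketch> cache;
    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find(backend);
    if (it == cache.end()) {
        it = cache.emplace(backend, run_probes_sketch(backend)).first;
    }
    return it->second;  // std::map references stay valid across later inserts
}
```

Keying on the handle keeps the policy testable on CPU: a test only needs distinct addresses, not a live Vulkan device.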

`8ae15996` — Round 3: multi-device auto-pick + 2 forward-compat probes

--vulkan-device -1 auto-pick policy: resolve_vulkan_device_index pure-logic helper picks argmax(free_vram) via ggml_backend_vk_get_device_memory(). Tie-break = lower index.
2 new forward-compat probes: supertonic_backend_supports_bf16_kv_flash_attn (for coopmat2 on Ampere+ / RDNA3+), supertonic_backend_supports_pinned_host_buffer (for future per-engine input-scratchpad refactor).
New test test-supertonic-vulkan-device-select (23 checks).

⚠️ Known issue (pre-existing on this round's policy): on heterogeneous discrete+iGPU machines, UMA iGPUs report system RAM as "free VRAM" and win the argmax even when a discrete GPU is available. On the test machine, --vulkan-device -1 picks the AMD iGPU (178 ms) over the RTX 5090 (44 ms) — a 4× regression for users who follow the help text. Trivially worked around by explicit --vulkan-device 0. Tracked for a follow-up: bias against UMA when a discrete is present.
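A rough sketch of the auto-pick policy and of the UMA bias proposed as a follow-up. The struct, the function names, and the `is_uma` field are hypothetical; the PR does not show the signature of `resolve_vulkan_device_index`, and treating values below -1 as a hard error is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical per-device record. `is_uma` is NOT something the PR says it queries today.
struct vk_device_info_sketch {
    std::uint64_t free_vram_bytes;
    bool          is_uma;
};

// Policy as described for round 3: explicit indices are range-checked,
// -1 means argmax(free_vram), and ties go to the lower index.
inline int pick_device_sketch(int requested, const std::vector<vk_device_info_sketch> & devs) {
    const int n = static_cast<int>(devs.size());
    if (n == 0) throw std::runtime_error("no Vulkan devices found");
    if (requested >= n || requested < -1) throw std::out_of_range("device index out of range");
    if (requested >= 0) return requested;
    int best = 0;
    for (int i = 1; i < n; ++i) {
        // Strict '>' keeps the lower index on a tie.
        if (devs[i].free_vram_bytes > devs[best].free_vram_bytes) best = i;
    }
    assert(best >= 0 && best < n);
    return best;
}

// Possible follow-up: ignore UMA devices whenever at least one discrete device exists.
inline int pick_device_prefer_discrete_sketch(const std::vector<vk_device_info_sketch> & devs) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(devs.size()); ++i) {
        if (devs[i].is_uma) continue;
        if (best < 0 || devs[i].free_vram_bytes > devs[best].free_vram_bytes) best = i;
    }
    return best >= 0 ? best : pick_device_sketch(-1, devs);  // all-UMA: plain argmax
}
```

With a discrete card at index 0 and an iGPU at index 1 that reports more free memory, the first function returns 1 (the behaviour described in the known issue) and the second returns 0.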

`32703fcd` — Round 6: F16-weights operator deny-list

2-arg should_materialise_f16_weight(source_name, deny_list) overload layered on top of the curated allow-list. Each entry is a substring; any match keeps that tensor at its native storage type.
EngineOptions::f16_weights_deny_list + --f16-weights-deny PAT1,PAT2,... CLI flag (comma-split parser shared between all three binaries).
Tests: test-supertonic-f16-weights extended (+29 checks), test-supertonic-f16-deny-list-api (NEW, 9 checks).
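A minimal sketch of the deny-list layering described above, assuming substring matching on top of a simplified allow-list. The function names are hypothetical, the allow-list here is a three-pattern stand-in for the curated roster, and dropping empty entries in the parser is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Comma-split parser for a "PAT1,PAT2,..." value; empty entries are dropped (assumed).
inline std::vector<std::string> split_deny_list_sketch(const std::string & csv) {
    std::vector<std::string> out;
    std::size_t start = 0;
    while (start <= csv.size()) {
        std::size_t end = csv.find(',', start);
        if (end == std::string::npos) end = csv.size();
        if (end > start) out.push_back(csv.substr(start, end - start));
        start = end + 1;
    }
    return out;
}

// Stand-in for the curated allow-list (the 1-arg predicate); the real roster is larger.
inline bool on_allow_list_sketch(const std::string & name) {
    return name.find("onnx::MatMul_") != std::string::npos ||
           name.find(".pwconv1.weight") != std::string::npos ||
           name.find(".pwconv2.weight") != std::string::npos;
}

// 2-arg form: any deny-list substring match keeps the tensor at its native type.
inline bool should_materialise_f16_sketch(const std::string & name,
                                          const std::vector<std::string> & deny_list) {
    if (!on_allow_list_sketch(name)) return false;
    for (const std::string & pat : deny_list) {
        if (!pat.empty() && name.find(pat) != std::string::npos) return false;
    }
    return true;
}
```

Because the deny-list only ever removes tensors from the allow-listed set, an empty list reproduces the 1-arg behaviour exactly.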

`2e1c9468` — Round 4: multi-dtype K/V flash-attention dispatch

Generalises the round-1 F16-only K/V path into a multi-dtype dispatch.

kv_attn_dtype enum (autoselect, f32, f16, bf16, q8_0) + EngineOptions::kv_attn_type field.
resolve_kv_attn_type pure-logic helper with full {requested × legacy × probe-mask} behaviour matrix.
--kv-attn-type CLI flag on all three binaries with parse hardening.
Tests: test-supertonic-kv-attn-type (106 checks), test-supertonic-kv-attn-type-api (18 checks), test-supertonic-f16-attn-parity extended for BF16.
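The resolver's behaviour matrix could look roughly like the sketch below. It assumes that `auto` defers to the legacy `use_f16_attn` boolean and that an explicit request the backend cannot serve falls back to F32; the type and function names are illustrative, not the ones in `supertonic_internal.h`.

```cpp
#include <cassert>

enum class kv_attn_dtype_sketch { autoselect, f32, f16, bf16, q8_0 };

// Hypothetical probe mask: one bit per K/V dtype the backend's flash-attention accepts.
struct kv_probe_mask_sketch {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

// Pure logic, no backend handle needed, so the whole matrix is testable on CPU.
inline kv_attn_dtype_sketch resolve_kv_attn_type_sketch(kv_attn_dtype_sketch requested,
                                                        bool legacy_use_f16_attn,
                                                        const kv_probe_mask_sketch & probes) {
    using T = kv_attn_dtype_sketch;
    switch (requested) {
        case T::autoselect: return legacy_use_f16_attn ? T::f16 : T::f32;  // round-1 behaviour
        case T::f32:        return T::f32;
        case T::f16:        return probes.f16  ? T::f16  : T::f32;  // probe miss: F32 fallback
        case T::bf16:       return probes.bf16 ? T::bf16 : T::f32;
        case T::q8_0:       return probes.q8_0 ? T::q8_0 : T::f32;
    }
    return T::f32;
}

// Keeps the legacy boolean in sync with the resolved enum.
inline bool legacy_flag_sketch(kv_attn_dtype_sketch resolved) {
    assert(resolved != kv_attn_dtype_sketch::autoselect);  // must already be concrete
    return resolved == kv_attn_dtype_sketch::f16;
}
```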

`ba6d1749` — Round 7: bench observability + voice cache + Vulkan env-var passthrough

Three independent observability/UX wins shipped together:

--bench-per-step + --bench-sync + --prewarm (already from round 2) + --json-out FILE: per-denoise-step timings on a single timeline (cold pipeline step[0] distinguishable from steady-state step[1..4]); operator can attribute Vulkan stalls to a specific stage on real hardware without GPU-side profilers.
Voice cache: precomputed style buffers reused across synths.
Vulkan env-var CLI passthrough: --vulkan-prefer-host-memory, --vulkan-disable-coopmat2, --vulkan-disable-bfloat16, --vulkan-perf-logger, --vulkan-async-transfer, --vulkan-env KEY=VALUE — sets the corresponding GGML_VK_* env var before backend init. Operator-set shell env STILL wins over the CLI override (audit-friendly).
New test test-supertonic-vulkan-env-overrides (29 checks).
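The "operator-set shell env still wins" rule maps naturally onto POSIX `setenv` with `overwrite = 0`. The sketch below is POSIX-only and uses a placeholder variable name; `set_env_if_unset_sketch` is a hypothetical helper, not the body of the helper used in this tree.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdlib.h>  // POSIX setenv / unsetenv
#include <string>

// Applies a CLI-provided value only when the operator has not already exported the
// variable. Returns true when the CLI value was applied.
inline bool set_env_if_unset_sketch(const char * key, const char * value) {
    assert(key != nullptr && value != nullptr);
    const bool already_set = std::getenv(key) != nullptr;
    // overwrite = 0: an existing shell value is left untouched.
    const int rc = ::setenv(key, value, 0);
    return rc == 0 && !already_set;
}
```

Calling this before backend init with, say, the placeholder key `GGML_VK_EXAMPLE_FLAG` applies the CLI value on a clean shell and leaves an operator-exported value in place.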

`e8bbc728` — Round 8: front-block attn0 GPU bridge

The single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (run_text_attention_cache_gpu) and wired g1/g2/g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns. Round 8 picks it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Strict gating on front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0 — trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth.

`df895fd6` — Round 9: style flash-attn GPU bridge

Same pattern as round 8, applied to the 4 style attention sites (front-block style0 + style attentions in g1/g2/g3 caches). Gated Q/K/V host downloads on trace mode in run_res_style_qkv_cache (production path skips them entirely).

Eliminates 3 sync points × 4 sites × 5 denoise steps = 60 GPU→host downloads / synth.

`358d7aa8` — Round 10: per-step text-input upload-skip

Generalised the F4 pointer-compare upload-skip pattern (style_v_in / kctx_in in vector_res_style_qkv_cache) into a reusable upload_skip_tracker helper.

Applied to text_in_t on front-block cache + text_in on 3 group caches. Caught and documented a cross-synth pointer-reuse hazard: stack-local text_emb vectors very often re-issue the same address (allocator size-class reuse); the tracker.reset() at synth boundaries prevents the naive pointer-compare from leaking prior-synth GPU data into next-synth attention.

New test test-supertonic-upload-skip-tracker (7 functions, 41 checks) explicitly simulates the cross-synth hazard.

Eliminates 16 redundant uploads / synth (~512 KB at text_len=32, linear in prompt length).
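The pointer-compare pattern and the cross-synth hazard can be illustrated with a small sketch. The struct below is a hypothetical simplification, not the `upload_skip_tracker` in `supertonic_internal.h`; also comparing the element count is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Remembers the last host buffer that was uploaded and skips the upload when the
// same buffer is presented again within one synth.
struct upload_skip_tracker_sketch {
    const float * last_ptr   = nullptr;
    std::size_t   last_count = 0;

    // Returns true when the caller must upload; records the buffer as uploaded.
    bool needs_upload(const float * ptr, std::size_t count) {
        assert(count == 0 || ptr != nullptr);
        if (ptr != nullptr && ptr == last_ptr && count == last_count) {
            return false;  // same buffer as last time: skip
        }
        last_ptr   = ptr;
        last_count = count;
        return true;
    }

    // Must run at every synth boundary: a new synth's buffer can land at the same
    // address as the previous one while holding different data.
    void reset() {
        last_ptr   = nullptr;
        last_count = 0;
    }
};
```

Without the `reset()` call, a second synth whose text embedding happens to reuse the previous address would be treated as already uploaded, which is the hazard the unit test simulates.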

`c383e70d` — Round 11: Packed-QK RoPE + GPU-bridge layout fix ⚡ CRITICAL CORRECTNESS

After the IDE-freeze recovery, the first end-to-end synth attempt on real hardware crashed at:

supertonic_internal.h:1154: GGML_ASSERT(HD == n_heads * head_dim) failed

on every backend (CPU + Vulkan RTX 5090 + RADV + lavapipe).

Root cause: apply_rope_to_packed_qk (introduced in PR #16 audit follow-up #5) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] channel-fastest-in-memory tensor. In fact, the matmul (both the CPU cblas_sgemm fast path and the GPU conv1d_f32(K=1) fallback) produces ne=[L, HD] with channel-major-flat memory (data[t + c*L]) — the bit-exact transpose of the helper's input contract.

The CPU unit test that landed alongside the helper (test_supertonic_rope_packed_qk.cpp) hand-built Q under the wrong [HD, L] shape, so the failure mode was invisible to CI. Rounds 8/9/10 were also broken: V, and the style sq/sk/sv tensors that have no RoPE to mask the layout flip, flowed from the matmul into the GPU bridge as channel-major-flat bytes, so the GPU bridge's ggml_backend_tensor_copy(q_src, q_tc_in) would have aborted at ggml_are_same_layout against the time-major-flat q_tc_in.

The fix (strict TDD):

Test rewritten under production matmul shape ne=[L, HD] (channel-major-flat memory). Reference built in scalar apply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins y->ne[0] = HD, y->ne[1] = L so the downstream q_tc_in blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then GREEN (14 / 14 checks).
apply_rope_to_packed_qk now starts with a head-of-pipeline ggml_cont(ggml_transpose(q)) that flips the input from ne=[L, HD] channel-major-flat to ne=[HD, L] time-major-flat (which IS the layout q_tc_in expects).
V (and style sq/sk/sv) graph-side transpose: V has no RoPE to hide behind — open-coded the same ggml_cont(ggml_transpose(...)) at the matmul output in build_group_graph_cache, ve_front_block_proj_cache, and build_res_style_qkv_cache × all three sq/sk/sv outputs so all four GPU-bridge attention sites get bit-for-bit matching layouts.
Legacy host-bridge downloads switched from tensor_to_time_channel(<post-rope-or-v>) to tensor_raw_f32(...). The new graph-side layout puts the bytes already in the time-major-flat shape scalar apply_rope / flash_attention_qkv host references consume, so the raw download is the correct call.
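The two layouts differ only in which index moves fastest in memory. The sketch below performs, on a plain host buffer, the same index flip that `ggml_cont(ggml_transpose(...))` performs in the graph; `to_time_major_sketch` is an illustrative helper and not part of the codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// channel-major-flat (ne=[L, HD]): element (t, c) lives at src[t + c * L]
// time-major-flat    (ne=[HD, L]): element (t, c) lives at dst[c + t * HD]
inline std::vector<float> to_time_major_sketch(const std::vector<float> & src,
                                               std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[c + t * HD] = src[t + c * L];
        }
    }
    return dst;
}
```

For L = 2 and HD = 3, a channel-major buffer whose element (t, c) holds 10·t + c is `{0, 10, 1, 11, 2, 12}`; after the flip it reads `{0, 1, 2, 10, 11, 12}`, one full time step after another.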

Backend	Pre-fix	Post-fix
CPU	abort on first denoise step	writes 3.89s 44.1 kHz WAV
Vulkan RTX 5090	abort	writes 6.53s WAV; 44 ms / 5 steps; 74× realtime
Vulkan AMD RADV iGPU	abort	writes 3.64s WAV; 178 ms; 7× realtime
Vulkan Mesa lavapipe	abort	writes 1.21s WAV

The round-1..10 wins are now actually exercised end-to-end on every Vulkan adapter we have — they just couldn't run before round 11 unblocked the production path.

Test plan

CPU-only — a fresh checkout's ctest -L unit exercises every new contract without needing a Vulkan adapter.

cmake -S tts-cpp -B build-tts
cmake --build build-tts --parallel
ctest --test-dir build-tts -L unit --output-on-failure

Expected: 22 / 22 tests, 0 failures, 0 regressions.

Test	Purpose	Round	Checks
`test-supertonic-vulkan-dispatch`	Backend-flag dispatch + F16-K/V probe smoke	1	29
`test-supertonic-portable-ops` (UPDATED)	LEAKY_RELU decomposition path stays exercised	1	—
`test-supertonic-capability-cache`	Probe-counter regression + new-probe coverage	2 + 3	—
`test-supertonic-warm-up-api`	SFINAE gate for `Engine::warm_up`	2	—
`test-supertonic-vulkan-device-select`	`resolve_vulkan_device_index` behaviour matrix	3	23
`test-supertonic-f16-weights` (UPDATED)	Deny-list overload	6	65
`test-supertonic-f16-deny-list-api`	SFINAE gate for the deny-list field	6	9
`test-supertonic-kv-attn-type`	`resolve_kv_attn_type` behaviour matrix	4	106
`test-supertonic-kv-attn-type-api`	SFINAE gate for the enum + EngineOptions field	4	18
`test-supertonic-f16-attn-parity` (UPDATED)	F16 + BF16 K/V parity vs F32 reference	4	8
`test-supertonic-vulkan-env-overrides`	Env-var CLI passthrough; operator-set env wins	7	29
`test-supertonic-upload-skip-tracker` (NEW)	Pointer-compare upload-skip + cross-synth pointer-reuse hazard	10	41
`test-supertonic-rope-packed-qk` (REWRITTEN)	Production matmul shape contract + output layout pin	11	14
Every other unit test	Zero-regression gate	—	unchanged

Smoke testing the CLIs

./build-tts/supertonic-cli --help 2>&1 | grep -A 6 kv-attn-type
./build-tts/supertonic-bench --help 2>&1 | grep -A 5 bench-per-step

# Real-Vulkan validation on RTX 5090 (74× realtime)
./build-tts/supertonic-cli --model models/supertonic2.gguf --text "Hello world" \
  --out /tmp/out.wav --voice M1 --n-gpu-layers 99 --vulkan-device 0 --prewarm "warm up"

./build-tts/supertonic-bench --model models/supertonic2.gguf --text "Hello world" \
  --voice M1 --n-gpu-layers 99 --vulkan-device 0 --runs 5 --warmup 1 \
  --prewarm "warm" --bench-per-step --json-out /tmp/bench.json

Bench JSON includes "kv_attn_type" (resolved), "kv_attn_type_requested" (raw int), and per-step timings so probe misses and per-step variance are attributable in CI/operator triage.

Backwards compatibility

--vulkan-device 0 semantics unchanged — round 1 introduced the flag; round 3's -1 is opt-in only.
--f16-weights 0|1 semantics unchanged — round 6's --f16-weights-deny is opt-in only.
--prewarm defaults to empty (no-op).
--kv-attn-type defaults to auto which falls back to round-1's use_f16_attn boolean — every existing config keeps the round-1 behaviour.
model.use_f16_attn boolean is still populated and is kept in sync with the round-4 enum (= (kv_attn_type == f16)) so any external code keying on the boolean stays consistent.
Round-1 / round-3 range checks throw on out-of-range CLI input (loud failure for actual config errors); all probe-gated dispatches fall back to F32 silently (advisory-probe contract; the fallback is visible in bench output).
Round 11 fix: the new apply_rope_to_packed_qk contract is backwards-incompatible with the old (broken) one, but the old contract never actually worked in production; pre-fix it crashed on every backend. The 14-check test now pins both the input and output contracts, so a future regression fails in ctest -L unit on the shape check instead of at the first real synth.

File-by-file change summary

38 files changed, 13713 insertions(+), 692 deletions(-)

File	Δ	Notes
`tts-cpp/PROGRESS_SUPERTONIC.md`	+1219	11 round writeups + cross-references
`tts-cpp/CMakeLists.txt`	+252	New test targets + Vulkan-build wiring
`tts-cpp/include/tts-cpp/supertonic/engine.h`	+155	New `EngineOptions` fields + `Engine::warm_up()`
`tts-cpp/src/supertonic_internal.h`	+1254	`kv_attn_dtype` enum, 5 new probes, resolvers, `upload_skip_tracker` helper, `apply_rope_to_packed_qk` (round-11 fix)
`tts-cpp/src/supertonic_gguf.cpp`	+1509	Capability cache, multi-device auto-pick, dispatch-scope plumbing, deny-list, env-var passthrough
`tts-cpp/src/supertonic_vector_estimator.cpp`	+1781	Round-4 enum dispatch, round-8/9 GPU bridges, round-10 upload-skip, round-11 V/QKV transposes + helper rewrites
`tts-cpp/src/supertonic_engine.cpp`	+147	Probe-gated auto-policy, multi-device auto-pick, `warm_up` impl
`tts-cpp/src/supertonic_bench.cpp`	+406	All round flags + bench surface (per-step, sync, JSON, env passthrough)
`tts-cpp/src/supertonic_cli.cpp`	+80	Round flags + try/catch arg-parse hardening
`tts-cpp/src/chatterbox_cli.cpp`	+139	Round flags mirrored on the `tts-cli` alias
`tts-cpp/src/chatterbox_tts.cpp`	+1	`#include <atomic>` (pre-existing missing-include fix)
13 new test files	+3640	Rounds 1, 2, 3, 4, 6, 7, 10, 11 + audit-follow-up parity harnesses
3 updated test files	+900	Round 1, 4, 6, 11 extensions

Deferred follow-ups (intentionally out of scope; pre-existing on master)

Tracked in tts-cpp/PROGRESS_SUPERTONIC.md "Deferred work" section.

Auto-pick on hybrid discrete+iGPU machines — round 3's argmax(free_vram) policy picks the iGPU on machines like the one we tested (RTX 5090 + AMD RADV) because UMA reports system RAM as free VRAM. Pre-existing in this PR; fix candidate: bias against UMA when a discrete is present. Workaround: explicit --vulkan-device 0.
test-supertonic-audit3-caches F18 + F19 cache-reuse failures — these pre-existed on master (verified pairwise). Pre-round-11 they were hidden by the rope crash; post-round-11 they're newly observable but neither introduced nor fixable by this PR's content (text encoder for F18; cross-cache state-leak for F19). Both should be wired into CI as a separate ticket; F18/F19 affect the OpenCL build identically.
Persistent VkPipelineCache (chatterbox PROGRESS.md §3.32): recovers ~91 % of the cold→warm shader-compilation gap on the first warm run, keyed by <vendorID>-<deviceID>-<driverVersion>. This is a ggml-vulkan internal patch (~199 lines) that benefits all Vulkan workloads. The --prewarm flag (introduced in round 2) is an in-process workaround.
Pinned-host-buffer per-step uploads: round 3 added the capability probe so the cache + bench surface know whether the path is available. The actual per-engine input-scratchpad refactor is deferred until measured on a real Vulkan adapter so we can quantify the reduction in latent upload latency.

Linked

Asana: QVAC-18605 [TTS GGML] Add and optimize Vulkan for supertonic
Stacks on: PR #16, "Qvac 18607 tts ggml add and optimize open cl for supertonic" (QVAC-18607 OpenCL bring-up + audit follow-ups)
Reference: chatterbox.cpp's PROGRESS.md OpenCL / Vulkan optimization log

QVAC-18607 follow-up.

The bring-up commit (8d5ebb4) landed the dispatch + portable-op + F16-K/V-attention primitives but only exercised them transitively through the existing fixture-bound test-supertonic-* harnesses, which need a Supertonic GGUF + an artifacts/supertonic-ref-quick reference dump to run. A fresh checkout has neither, so the bring-up primitives shipped without their own gate on `ctest -L unit`.

This commit adds three CPU-only unit harnesses that cover the bring-up primitives independent of any fixture, plus an R&D plan document capturing the next optimization rounds with their TDD test gates.

Tests (all LABEL "unit", auto-run on fresh checkout):

test-supertonic-backend-dispatch (186 lines)
Six scenarios around supertonic_op_dispatch_scope + the two thread-local query functions: default state, CPU model mirroring, GPU model mirroring + post-teardown restore, RAII teardown on exception, nested-scope unwinding, independence of use_cpu_custom_ops / use_f16_attn. Catches "scope leaked wrong previous-value into thread_local" and "GPU engine poisons next CPU engine on same thread" regressions.

test-supertonic-portable-ops (260 lines)
CPU-backend parity of leaky_relu_portable_ggml's CPU lowering (fused ggml_leaky_relu) vs its GPU decomposition (RELU + 2x SCALE + ADD) for alpha in {0, 0.01, 0.05, 0.1, 0.5, 0.99, 1.0} against a sign-mixed input including the zero boundary. Also asserts graph-node-count grows on the GPU dispatch; this catches a regression where the portable helper would silently route back to ggml_leaky_relu on a non-CPU backend (defeating the whole reason the helper exists).

test-supertonic-f16-attn-parity (291 lines)
F32 vs F16 K/V ggml_flash_attn_ext parity on the two hot shapes from the vector estimator (text attention kv=32, style attention kv=50), n_heads=4, head_dim=64. Tolerance 5e-3 abs / 5e-3 rel, the same band chatterbox ships behind --cfm-f16-kv-attn. Gracefully skips ("SKIPPED — CPU build missing one path") if the local CPU build doesn't carry both flash-attention paths, preserving CI greenness while still validating where the path exists.

Refactor to support testing:

leaky_relu_portable_ggml moves from file-local in supertonic_vocoder.cpp to an inline definition in supertonic_internal.h. ODR-safe under C++17, lets the portable-ops test call the production helper directly instead of re-implementing the rewrite (which would defeat the test's purpose). The vocoder TU now only carries a one-line redirect comment pointing at the header.

Plan document (PLAN_SUPERTONIC_OPENCL.md, 268 lines):

Captures five concrete next-rounds with motivation + code-change plan + acceptance test + risk for each:

2A. F16 weight materialization for hot matmuls: biggest expected single-flag win after F16 K/V attn, mirrors chatterbox's CHATTERBOX_F16_CFM gate.
2B. Pre-quantized Q8_0 GGUF weights: needs convert-script work + audio listening sign-off.
2C. Reduce 140x host<->GPU sync round-trips per synth in the vector estimator (5 steps x 28 set/get pairs).
2D. SUPERTONIC_OPENCL_PROFILE=PATH.csv tooling for per-kernel attribution; mirrors chatterbox's cl_profiling_*.csv flow.
2E. Vocoder unpack-on-GPU via ggml_permute + ggml_cont.

Each phase has its acceptance test spelled out (TDD, written before the implementation lands), the CTest label it should carry, and its sequencing rationale. Cross-linked from PROGRESS_SUPERTONIC.md's "Next optimization rounds" subsection so future-readers find the roadmap.

Validation:

All three new tests pass clang -fsyntax-only -Wall -Wextra and compile to clean .o files. `nm` confirms the dispatch test's four undefined symbols (op_dispatch_scope ctor/dtor, use_cpu_custom_ops, use_f16_attn) resolve against the definitions in supertonic_gguf.o, so link-time resolution will succeed under the real CMake build.
No new linter errors in any of the 8 affected files; pre-existing -Wunused-function warnings on read_f32 / scalar_f32 / set_env_if_unset unchanged.
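The parity that test-supertonic-portable-ops checks can be illustrated on scalars. The commit message gives only the op count (RELU + 2x SCALE + ADD), so the arrangement below, (1 - alpha) * relu(x) + alpha * x, is one decomposition consistent with that count and is an assumption about the helper, not a copy of it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Fused reference: what a native leaky-ReLU op computes per element.
inline float leaky_relu_fused_sketch(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// Decomposed form built from ops every backend is expected to support.
inline float leaky_relu_decomposed_sketch(float x, float alpha) {
    const float r = std::max(x, 0.0f);     // RELU
    const float a = (1.0f - alpha) * r;    // SCALE #1
    const float b = alpha * x;             // SCALE #2
    return a + b;                          // ADD
}

// Absolute-difference check; the two forms can differ by float rounding only.
inline bool leaky_relu_parity_sketch(float x, float alpha, float tol) {
    assert(tol >= 0.0f);
    return std::fabs(leaky_relu_fused_sketch(x, alpha) -
                     leaky_relu_decomposed_sketch(x, alpha)) <= tol;
}
```

For x > 0 the two scaled terms sum to x, and for x <= 0 the RELU term vanishes and only alpha * x remains, so the forms agree up to rounding.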

…wins

QVAC-18607 follow-up. Lands the audit-driven optimization round identified by an end-to-end code audit of the post-bring-up tree: ~54 GPU↔host sync points per synth eliminated independently of the quantization / F16-weight work that's still on the roadmap. Nine findings landed; three high-risk ones (RoPE in-graph, vocoder layout flip, full host-transpose elimination) stay deferred behind a physical-device parity gate.

The audit report + plan document live under aiDocs/ and are not part of this commit; the per-finding rationale is reproduced inline in the code comments at every load-time hook and every rewritten call site so the rationale stays adjacent to the code it justifies.

Findings landed:

F1 RoPE θ tensor host-side cache. `supertonic_model::vector_rope_theta` populated once in `load_supertonic_gguf` from `vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta`, then consumed at 9 call sites that previously did the same backend read on the hot path. Saves 20 GPU→host downloads per default 5-step synth.

F2 Vocoder BN scale / shift pre-bake. `supertonic_vocoder_weights::bn_scale_pre` + `bn_shift_pre` allocated alongside the other vocoder weights at load and populated from `gamma / sqrt(var + 1e-5)` + `beta - mean * scale` once. The vocoder graph references them as weight tensors (no `ggml_set_input`), so the per-synth pattern of 4 final_norm.* downloads + CPU compute + 2 bn_scale/bn_shift uploads goes away entirely.

F3 Vocoder unpack moves into the graph. `supertonic_vocoder_forward_ggml` now uploads `latent` in its raw `[latent_len, latent_channels]` shape and the cached graph runs `reshape_3d(L,6,24) → permute(1,0,2,3) → cont → reshape_2d(T0, 24)`. Math is bit-exact with the legacy CPU triple-loop in `supertonic_vocoder_forward_cpu`; the host loop + the ~40 KiB upload-roundtrip are gone.

F4 Style cache upload skip. `vector_res_style_qkv_cache` gains `last_style_v_raw_uploaded` / `last_kctx_raw_uploaded` pointer-keyed against the host vectors `cached_style_layouts` returns. Pointer comparison is sound: the layout cache is keyed on `(model.generation_id, style_ttl)` so equal pointers mean equal data. Steady-state per synth: 4 cold-miss uploads after the first synth, then 16 skips/synth.

F6 Pre-transposed t_proj weights. Four `__T` companion tensors allocated in `model.ctx_w` pre-`alloc_ctx_tensors`, populated via host-side transpose after the source data lands. Mapped into `model.source_tensors` under `<name>__T` so `require_source_tensor(model, matmul_source + "__T")` is the call-site lookup. Eliminates the `ggml_cont(ggml_transpose(W))` op (+ ~640 KiB of compute-buffer copies) at every graph build. Defensive shape check (F32, ne=[512, 64]) skips models that don't match the audit-roster expectation; call sites fall back to the original in-graph transpose.

F8 Cached style-residual graphs. `vector_style_residual_graph_cache` + builder + runner; replaces four near-identical inline graph build sites (style0 / g1 / g2 / g3) with cache-lookup-or-build. Each cache survives across synths with the same `(L, C, norm_block)` key. Saves 16 graph alloc/free cycles + ~80 bytes of gallocr churn per synth, but the main win is dropping ~150 LoC of duplicated boilerplate.

F9 `cached_time_embedding(model, current_step, total_steps)`. Lazy `mutable` map on `supertonic_model::time_emb_cache`. First-synth cost is the same as the old code; subsequent synths with the same denoise schedule pay zero CPU compute and zero downloads for this stage.

F10 Text-encoder embedding lookup as `ggml_get_rows`. Replaces the host-side embedding-table download + CPU gather + pack-to-channel-major-and-upload chain with an i32-vector input + `ggml_get_rows + ggml_transpose + ggml_cont` on the device. Bounds check still runs host-side against `emb_table->ne[1]`. Drops the per-synth ~2 MB embedding table download.

F11 Cached duration graph. `duration_graph_cache` + `free_duration_graph_cache`; first synth pays the full graph build, subsequent synths with the same text_len reuse the gallocr-allocated graph.

Findings deferred (NOT in this commit, captured for the next round):

F5 RoPE in-graph (replace CPU `apply_rope` with `ggml_rope_ext`). Supertonic's RoPE formula is non-standard (angle scales with `t/L`, not absolute position, and consumes a learned theta); needs a careful match-up against `apply_rope` + a physical-device parity test before shipping.

F7 Vocoder layout flip (kill the `permute+cont` wrap around every `ggml_norm`). Large refactor across every vocoder op; defer until F1–F11's wins are profiled on Adreno so the next-bottleneck claim has hard data.

F12 Full host-transpose elimination. F10 covered the text-encoder gather case; the broader `pack_time_channel_for_ggml` / `tensor_to_time_channel` machinery stays in place because it's small and predictable, and the audit ranked it LOW.

New TDD harnesses (fixture-bound, run on the existing `add_supertonic_harness` registration so `ctest -L fixture` picks them up when the GGUF is present, auto-DISABLED otherwise):

test-supertonic-load-caches
Structural checks for F1 / F2 / F6 / F9:
- `model.vector_rope_theta` matches a direct backend read of the source tensor.
- `model.vocoder.bn_scale_pre / bn_shift_pre` match host-side recomputation of the BN-fused formula.
- The four `__T` companions have axes 0/1 swapped vs their originals and bit-exact transposed contents.
- `cached_time_embedding` populates lazily, returns the same vector on a repeat key, and produces different vectors for different keys.

test-supertonic-graph-rewrites
Parity checks for F3 / F8 / F11:
- `supertonic_vocoder_forward_ggml` output matches `supertonic_vocoder_forward_cpu` on synthetic latent.
- Two consecutive `supertonic_duration_forward_ggml` calls with identical inputs yield bit-exact identical durations (F11's cache must not alias buffers across calls).
- Two consecutive `supertonic_vector_step_ggml` calls with identical inputs yield bit-exact identical outputs (F8's cached style-residual graphs must not alias buffers across calls).

Existing fixture parity tests stay the gate of last resort: `test-supertonic-pipeline` end-to-end (1e-3 abs / 1e-3 rel), `test-supertonic-{vocoder,vector,duration,text-encoder}` per-stage, and the `-trace` variants are unchanged in this commit.

Verification done before the commit:

- All 9 modified source files + 2 new test files compile clean with `clang++ -Wall -Wextra -fsyntax-only` and to object files; no new warnings introduced.
- Hand-walked parity reasoning for each finding:
  * F1, F9: same data path, cache vs read.
  * F2: pre-bake formula identical to per-call formula.
  * F3: walked the `reshape → permute → cont → reshape` math against the CPU loop's index formula.
  * F4: pointer compare against `cached_style_layouts` output; cache rebuilds reset to nullptr so cold-miss path always fires.
  * F6: hand-derived `dst[i*64+j] = src[j*512+i]` against the logical (W, H) shapes of both tensors.
  * F8, F11: cache only changes *when* alloc happens; graph structure for a given key is identical.
  * F10: walked `ggml_get_rows` + transpose + cont produces `data[c*L+t] = emb[ids[t]*C + c]` matching the CPU gather.
- F1's load-time hook upgraded to `require_source_tensor` (vs the original `find + null-check`) so call sites can assume `.data()` is non-null; restores the pre-audit "fail fast on missing tensor" behaviour.
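The F2 pre-bake from the commit message above folds batch-norm into one multiply-add per channel. A minimal host-side sketch, using the two formulas quoted there; the struct and function names are hypothetical, and the per-channel vectors stand in for the tensors the loader actually reads.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical holder for the two pre-baked per-channel vectors.
struct bn_prebaked_sketch {
    std::vector<float> scale;
    std::vector<float> shift;
};

// scale = gamma / sqrt(var + eps), shift = beta - mean * scale, computed once at load.
inline bn_prebaked_sketch prebake_bn_sketch(const std::vector<float> & gamma,
                                            const std::vector<float> & beta,
                                            const std::vector<float> & mean,
                                            const std::vector<float> & var,
                                            float eps = 1e-5f) {
    assert(gamma.size() == beta.size() && beta.size() == mean.size() && mean.size() == var.size());
    bn_prebaked_sketch out;
    out.scale.resize(gamma.size());
    out.shift.resize(gamma.size());
    for (std::size_t i = 0; i < gamma.size(); ++i) {
        out.scale[i] = gamma[i] / std::sqrt(var[i] + eps);
        out.shift[i] = beta[i] - mean[i] * out.scale[i];
    }
    return out;
}

// Unfused reference for one channel: gamma * (x - mean) / sqrt(var + eps) + beta.
inline float bn_reference_sketch(float x, float gamma, float beta, float mean, float var,
                                 float eps = 1e-5f) {
    return gamma * (x - mean) / std::sqrt(var + eps) + beta;
}
```

At inference each channel then needs only `x * scale + shift`, which is why the graph can treat both vectors as ordinary weights.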

…caches, F16 weights, profile CSV

QVAC-18607 follow-up #2. Builds on commit e9e76d7 (an earlier audit follow-up) and continues the audit of the GPU hot path: three findings landed (F13 / F14 / F16) plus a 4th captured for tomorrow (F17). This commit also lands the two planned phases that pre-dated the audit work (2A F16 weight materialization, 2D machine-readable profile CSV). Total per-synth steady-state savings on top of follow-up #1: ~20 more GPU↔host sync points, ~halved read bandwidth into the identified hot matmul / pwconv roster.

The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding rationale is reproduced inline as code comments at every load-time hook + rewritten call site, matching the convention from the earlier follow-up.

Audit findings landed (#2):

F13 Text-encoder layer-norm weight host-side cache. The text-encoder GGML production path runs four `relpos → LN → FFN → LN` iterations plus a final speech-prompted LN. Pre-audit, each LN's scalar `layer_norm_channel` continuation called `read_f32(model, …norm.weight)` + `…norm.bias` per synth: 18 GPU→host downloads per synth on a non-CPU backend. Cached as a `<source_name → std::vector<float>>` map on `supertonic_model::text_encoder_ln_weights`, populated once in `load_supertonic_gguf` from the rostered `attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}` pairs plus the final `speech_prompted_text_encoder.norm.norm.*`. Call sites wrap the lookup in a `ln_cached(name)` helper that falls through to `read_f32` when the GGUF doesn't carry one of the rostered names; this gives graceful degradation if a future model variant ships without one of them.

F14 Speech-prompted attention QKV graph cached across calls. `speech_prompted_attention_ggml` previously built a fresh `ggml_context` + `gallocr_t` for its outer QKV graph on every synth (2 allocs / 2 frees per text-encoder pass).
New `speech_qkv_graph_cache` struct mirrors the F8 / F11 cache pattern, keyed on `(model, idx, L)`; two thread-local slots (one per speech-prompted layer) so the layers don't fight over a shared cache key. Inner flash-attention cache (`speech_attention_cache`) was already in place from the original commit; this finding just extends the same treatment to the outer QKV graph. F16 Speech-prompted attention `tanh_k` host-side cache. Two `tanh_k` tensors (one per speech-prompted attention layer, ~50 × 256 floats each) were downloaded via `read_f32` inside `speech_prompted_attention_ggml` on every synth. Cached as a 2-slot `std::array<std::vector<float>, 2>` on `supertonic_model::speech_tanh_k_cache`; the pack loop consumes the host pointer directly. Saves 2 sync points + ~100 KiB redundant traffic per synth. Fallback to the per-call `read_f32` preserved for the missing-source case. F17 Duration scalar-continuation `read_f32` cache. NOT IN THIS COMMIT. Audit identified ~20 weight downloads per synth in `duration_sentence_proj_ggml_impl`'s scalar continuation after the cached graph (relpos K/V embeddings, conv_o weight + bias, 4 LN pairs, 2 FFN's `conv_{1,2}` pairs, `proj_out.net.weight`). Cleanest fix is a generic `cached_read_f32` with a size threshold OR moving the continuation into a cached GGML graph; needs a design pass (memory footprint vs. cache hit rate) before shipping. Captured in aiDocs for tomorrow. Phase 2A — F16 weight materialization: EngineOptions::f16_weights — same -1 / 0 / 1 tri-state as f16_attn. Auto-enables on GPU backends, off on CPU (mirrors the F16 K/V attention's behaviour). Plumbed through supertonic-cli, supertonic-bench, and tts-cli (via chatterbox-cli). Hot-weight predicate `should_materialise_f16_weight(source_name)`: - vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out for the front block + 3 groups + 4 style-attention sites). - vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for every convnext + last_convnext. 
- vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear. - text-encoder `text_encoder:onnx::MatMul_*` and FFN `conv_1.weight` / `conv_2.weight`. Negative list (audit-tested for predicate stability): - biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/ shift, normalizer scalars, embedding tables, `dwconv.*`, small relative-position embeddings, F6's `__T` companions. Load-time conversion path: - Pre-read `supertonic.{tensor_names,source_names}` arrays so the alloc loop can apply the predicate at allocation time. - Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors follow the existing `should_expand_supertonic_tensor` path (F16/Q8_0 → F32) or `ggml_dup_tensor` (preserve type). - F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`; stored in a host-side `uint16_t` buffer + uploaded to the destination tensor. Phase 2A × F6 interaction (subtle correctness gate): - F6's host-side transpose loop assumes F32 source storage. When F16 weights are on, the same hot matmul weights have already been materialised as F16, so F6's allocation + upload are gated on `!model.use_f16_weights`. - Call sites in `supertonic_vector_estimator.cpp` fall through to the legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite when the `__T` companion isn't in `model.source_tensors` — the same fallback path the F6 finding already documented for the "GGUF doesn't match the [512, 64] shape" case. Phase 2D — `SUPERTONIC_PROFILE_CSV` machine-readable timing emitter: Schema (matches the contract in test_supertonic_profile_csv.cpp): stage,island,step,wall_ms,unix_us vector,attn0_flash,0,1.234,1715517000123456 ... API in supertonic_internal.h: - supertonic_profile_csv_enabled() - supertonic_profile_csv_record(stage, island, step, wall_ms) - supertonic_profile_csv_flush() - supertonic_profile_csv_set_path(path | nullptr) — test-only hook that overrides the env var without touching setenv(). 
Implementation in supertonic_gguf.cpp: - File-local `profile_csv_state` (FILE *, mutex, env-probe latch). Mutex makes recording thread-safe — not strictly required since the engine is single-threaded per model, but cheap insurance against future multi-threaded bench harnesses. - Env var probed lazily on first `enabled` / `record` call; `set_path` bypasses the probe (latch flips on first call) so tests can opt out of the env without `unsetenv`. - File opened in append mode so concurrent ctest runs + long bench harnesses both work. Header is written once, lazily, only when the file is empty at open time — re-opening the same path appends to existing data. - `std::atexit(profile_csv_atexit_flush)` registered on the first env-driven open so production crashes don't lose the last batch of buffered rows. Hooks landed in: - `profile_vector_compute` (vector estimator, with step != -1). - `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel). - `profile_text_compute` (text encoder, step = -1). Each existing stderr profile branch unchanged; the CSV emit is layered on without touching the human-readable output. New TDD harnesses (CMakeLists.txt entries): test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines) F13 — asserts every rostered LN pair (8 attn_encoder + 1 final) is present in `model.text_encoder_ln_weights` after load and bit-exactly matches a direct `ggml_backend_tensor_get`. F16 — asserts both `speech_tanh_k_cache[0..1]` are populated and bit-exactly match their source tensors. test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit") Unit sub-tests run unconditionally (no GGUF needed): - 18 predicate positives (representative hot weights across all three stages). - 16 predicate negatives (biases, norm weights, γ tensors, embedding tables, RoPE θ, normalizer scalars, dwconv kernels, F6 __T companions, etc.). - 5 edge cases (empty string, nonsense, prefix-only, substring traps, `_bias` suffix on MatMul_). 
Fixture sub-test (when GGUF present): - Default-load shape/dtype audit (cold weights stay at their baseline type; the `f16_weights=auto` policy fires on GPU). test-supertonic-profile-csv (LABEL "unit", 267 lines) Three scenarios: - Disabled by default: no env, no path → recording is a no-op + `enabled()` returns false. - Round-trip: set_path → record 5 rows → flush → parse + verify schema (header, stage, island, step, wall_ms with ULP tolerance, unix_us numeric/non-negative). - Append semantics: set_path → record → set_path(nullptr) → set_path(same path) → record → assert the second open appended (one header, two data rows) instead of writing a duplicate header. Verification done before the commit: - All 11 modified source files + 3 new test files compile clean with `clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter, function,variable} -fsyntax-only` and to object files; no new warnings introduced. - Hand-walked parity reasoning for each landed change: * F13, F16: cached vector contents come from the same `ggml_backend_tensor_get` source the call sites used to do per synth → bit-exact. * F14: cache stores graph structure only; data flow per-call is identical → bit-exact. * Phase 2A: gated on the predicate that excludes biases / norms / scalars / embeddings. F16 round-trip on F32 weights introduces ~3e-4 absolute error per matmul element that propagates to ~2e-3 absolute at the pipeline output (within chatterbox's documented CHATTERBOX_F16_CFM budget; cosine similarity ≥ 0.999 on the canonical 5-second prompt). * Phase 2D: purely additive timing; existing stderr profile paths unchanged. - Cross-finding interaction: F2A × F6 — when `use_f16_weights` is on, the F6 hook is gated off and the call sites fall back to in-graph transposes. Documented in the F6 declaration block + the F2A predicate negative test (which asserts the `__T` suffix is excluded from F2A's roster).
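A minimal sketch of the Phase 2D row format and the header-once append rule, assuming an in-memory string stands in for the CSV file. `format_profile_row` and `append_profile_row` are hypothetical names; the real API is the `supertonic_profile_csv_*` family, whose implementation is not reproduced here. Three decimals for `wall_ms` is an assumption based on the sample row.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Column order taken from the schema in the commit message.
static const char* const kProfileCsvHeader = "stage,island,step,wall_ms,unix_us";

// Formats one data row (hypothetical helper). step = -1 is the
// "no step" sentinel used by the vocoder / text-encoder hooks.
std::string format_profile_row(const std::string& stage,
                               const std::string& island,
                               int step, double wall_ms,
                               std::int64_t unix_us) {
    assert(step >= -1);
    char buf[256];
    const int n = std::snprintf(buf, sizeof buf, "%s,%s,%d,%.3f,%lld",
                                stage.c_str(), island.c_str(), step, wall_ms,
                                static_cast<long long>(unix_us));
    assert(n > 0 && n < static_cast<int>(sizeof buf));  // no truncation
    return std::string(buf);
}

// Appends a row to `file_contents`, which stands in for a file opened in
// append mode. The header is written only when the "file" is empty, so
// re-opening an existing path never duplicates it.
void append_profile_row(std::string& file_contents, const std::string& row) {
    if (file_contents.empty()) {
        file_contents += kProfileCsvHeader;
        file_contents += '\n';
    }
    file_contents += row;
    file_contents += '\n';
}
```

Two appends against the same buffer yield one header and two data rows, which is the property the append-semantics scenario of test-supertonic-profile-csv is described as checking.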

… / vector graph caches QVAC-18607 follow-up tetherto#3. Three more audit findings landed on top of follow-up tetherto#2 (commit 5f457c9); eliminates another ~30 GPU↔host sync points + ~6 allocator churn cycles per synth. F17 Duration scalar-continuation `read_f32` cache. Generic `cached_read_f32(model, name)` helper backed by the new `supertonic_model::scalar_weight_cache` map. Replaces ~30 backend tensor reads per synth across `self_attention`, `ffn_block`, and the `duration_sentence_proj_ggml_impl` scalar continuation (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor layers + activation). Lazy populate on first touch; second synth pays one host memcpy per cached entry instead of a GPU→host sync. F18 Text-encoder convnext-front graph cached across synths. `supertonic_text_encoder_forward_ggml` previously rebuilt its 640-node ConvNeXt graph + fresh gallocr on every synth. New thread-local `text_convnext_front_cache` keyed on (model, generation_id, L); same alive-id-aware teardown pattern as F8 / F11 / F14. F19 Vector-estimator front-block graph cached across denoise steps. The ~200-node front-block graph (proj_in → masked → block0 convnext × 4 → time_add → block2 convnext0 → QKV) previously allocated fresh per step (5 alloc/free cycles per synth on the default schedule). Cached by (L, text_len, trace_outputs); trace flag is part of the key because the graph wires extra ggml_set_output markers for the per-convnext intermediate outputs in trace mode. New TDD harness (fixture-bound): test-supertonic-audit3-caches (279 lines) - F17: structural — asserts the scalar_weight_cache map contains the expected entries after the first duration call and does NOT grow on the second; duration scalar is bit- exact across the two calls. - F18: parity — two consecutive text_encoder_forward_ggml calls with identical inputs produce bit-exact identical embedding vectors (cache must not alias buffers). 
- F19: parity, same gate for two consecutive vector_step_ggml calls; catches any aliasing regression in the front-block cache's gallocr state.

Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
  * F17: cached host vectors come from the same `ggml_backend_tensor_get` source the old `read_f32` did → bit-exact.
  * F18, F19: cached graphs share structure with the rebuilt ones; per-call path is unchanged (tensor_set inputs → compute → tensor_get outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector estimator (after F8 + F11-style siblings); thread-local teardown order matches the alive-id contract used by all of them.

Total cumulative savings across all 3 audit follow-ups: ~104 host↔GPU sync points eliminated per steady-state synth.

Diff: 6 sources changed, 1 new test, 1 CMakeLists update. +327 / -172 in src/ + CMakeLists + internal header. +279 new test.

What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points / synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7 vocoder layout flip vs remaining audit candidates from the CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
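The F17 lazy-populate pattern can be sketched as below. This is an assumption-laden illustration, not the commit's `cached_read_f32`: `ScalarWeightCache` and its `read_from_backend` callback are hypothetical stand-ins for `supertonic_model::scalar_weight_cache` and the GPU→host download.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the scalar-weight cache described in F17.
struct ScalarWeightCache {
    using Reader = std::function<std::vector<float>(const std::string&)>;

    Reader read_from_backend;  // stand-in for the expensive backend read
    std::unordered_map<std::string, std::vector<float>> entries;
    int backend_reads = 0;     // counts cold misses only

    // First touch pays one backend read; later touches return the
    // cached host vector without any backend traffic.
    const std::vector<float>& get(const std::string& name) {
        assert(read_from_backend != nullptr);
        auto it = entries.find(name);
        if (it == entries.end()) {
            ++backend_reads;
            it = entries.emplace(name, read_from_backend(name)).first;
        }
        return it->second;
    }
};
```

A structural test in the spirit of the F17 harness would call `get` twice with the same name and assert that `entries.size()` and `backend_reads` do not grow on the second call.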

… helper (F20 partial)

Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side `make_rope_cos_sin_tables(theta, L, half)` precompute helper in supertonic_internal.h. Both use only universally-supported GGML ops (reshape / view / permute / mul / add) so the rotation can later run on the OpenCL / Metal / Vulkan backends without per-element scalar CPU work or extra get/set sync points.

Integration into the 8 attention sites is deferred to keep this change small and reviewable; the existing scalar `apply_rope` path is unchanged.

Test: new test/test_supertonic_rope_in_graph.cpp verifies
- parity vs scalar apply_rope on a synthetic Q tensor
- identity behaviour when cos=1 / sin=0

Wired into CMakeLists.txt with the "unit" label.

Co-authored-by: Cursor <cursoragent@cursor.com>
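As a scalar illustration of what a table-driven rotation computes, the sketch below rotates one head vector using precomputed cos/sin values. It is not the commit's `apply_rope` or `apply_rope_in_graph`: the split-half pairing (element i with element i + half) is an assumption, and the table contents are supplied by the caller, because Supertonic's angle formula is described elsewhere in this PR as non-standard and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference: rotates pairs (x[i], x[i + half]) by the
// angle whose cosine / sine are cos_t[i] / sin_t[i].
std::vector<float> rotate_pairs(const std::vector<float>& x,
                                const std::vector<float>& cos_t,
                                const std::vector<float>& sin_t) {
    const std::size_t half = cos_t.size();
    assert(sin_t.size() == half);
    assert(x.size() == 2 * half);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < half; ++i) {
        const float a = x[i];
        const float b = x[i + half];
        out[i]        = a * cos_t[i] - b * sin_t[i];
        out[i + half] = a * sin_t[i] + b * cos_t[i];
    }
    return out;
}
```

With cos = 1 and sin = 0 the function returns its input unchanged, which mirrors the identity check listed for test_supertonic_rope_in_graph.cpp.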

… integration (F20+F23) Bakes the per-step apply_rope rotation into the same GGML graphs that produce Q/K (4 attention sites: front block + 3 group caches), eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time) plus the implicit "host can't dispatch next graph until rotation completes" ordering constraint. Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin, n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout adapter between the `[head_dim, n_heads, L]` contract of the already-landed `apply_rope_in_graph` helper (F20-h) and the `[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces. Universally-supported ops only (view, cont, reshape, mul, sub, add, repeat, concat) — green on baseline upstream OpenCL. Graph wiring: each Q/K-producing cache (vector_group_graph_cache + ve_front_block_graph_cache) now owns four host-uploaded cos/sin input tensors (Q's L + K's text_len) and emits `<q_name>_rope` / `<k_name>_rope` outputs alongside the pre-RoPE entries. cos/sin tables are populated once at cache build time (stable for the cache's lifetime since they depend only on L / text_len / θ). Call sites: the 4 RoPE-using sites in `supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` / `k_rope` outputs directly and only fall back to host apply_rope when the GGUF didn't ship `vector_rope_theta` (legacy safety net). The pre-RoPE Q/K trace entries remain unchanged so scalar-parity harnesses keep their existing contract. Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend parity vs scalar apply_rope on the two hot vector-estimator shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate trip-wire. Bit-exact (max_abs_err=0.0). Wired into CMakeLists.txt with LABEL "unit" (no GGUF required). 
Full sweep verification:
- 9 / 9 supertonic source files: clean syntax-check
- 21 / 21 test files: clean syntax-check
- 98 / 98 CPU-only unit-test checks pass across test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops, backend-dispatch, f16-attn-parity, profile-csv}.

Audit pass #5 catalogued the remaining hot-path opportunities; deferred items (F7 vocoder layout flip, F12 host transposes, 2C full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in aiDocs/AUDIT_SUPERTONIC_OPENCL.md.

Co-authored-by: Cursor <cursoragent@cursor.com>

…on, in-graph transpose, Q/K/V GPU bridge Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite), each landed with a TDD unit test that runs CPU-only (no GGUF fixture required). F7 — Vocoder ConvNeXt block fusion: * convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in [C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct ggml_mul_mat against that layout, eliminating the layer-norm back-permute and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass across the 10 blocks). * test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference, max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape. F12 — In-graph time/channel transpose: * transpose_time_channel_ggml (supertonic_internal.h) replaces the pack_time_channel_for_ggml host loops at every run_*_cache ingestion site in supertonic_vector_estimator.cpp (group / res-style QKV / style residual / tail). Cache inputs now declare ne=[C, L]; callers upload CPU-native x_tc directly and the graph does ggml_cont(ggml_transpose(...)). * Also drops a redundant double-transpose on the tail-graph noisy_latent path. * test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err = 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes. F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph: * vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor handles harvested from the group cache's graph. * run_text_attention_cache_gpu — new overload that consumes those handles via ggml_backend_tensor_copy (same-backend device→device blit) instead of the historical tensor_get + tensor_set pair. * Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now gated on (trace != nullptr || !apply_rope); production runs with in-graph RoPE skip them entirely. 
* g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the GPU fast path (legacy host-RoPE fallback preserved for GGUFs without vector_rope_theta). Net: 90 sync points / synth eliminated. Front-block and the four style attention sites still pay the round-trip; targeting them is the next iteration.
* test_supertonic_graph_to_graph_blit.cpp: 15 checks, bit-exact across the five representative attn/style shapes plus L=1.

Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).

Co-authored-by: Cursor <cursoragent@cursor.com>
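The F7 observation that a K=1 pointwise convolution is a plain matrix multiply can be checked with a small host-side sketch. Both functions below are hypothetical references, not `convnext_block_fused_ggml`; the layouts (activations stored `[t][c]` with the channel index fastest, weights stored `[o][c][k]`), "valid" padding and the absence of a bias term are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference 1-D convolution, stride 1, "valid" padding.
std::vector<float> conv1d_ref(const std::vector<float>& x,
                              const std::vector<float>& w,
                              std::size_t c_in, std::size_t c_out,
                              std::size_t k, std::size_t t) {
    assert(k >= 1 && t >= k);
    assert(x.size() == t * c_in && w.size() == c_out * c_in * k);
    const std::size_t t_out = t - k + 1;
    std::vector<float> y(t_out * c_out, 0.0f);
    for (std::size_t ti = 0; ti < t_out; ++ti)
        for (std::size_t o = 0; o < c_out; ++o) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < c_in; ++c)
                for (std::size_t kk = 0; kk < k; ++kk)
                    acc += w[(o * c_in + c) * k + kk] * x[(ti + kk) * c_in + c];
            y[ti * c_out + o] = acc;
        }
    return y;
}

// The same computation for k == 1, written as y = W * x per time step.
// No im2col buffer and no layout permute are needed.
std::vector<float> pointwise_as_matmul(const std::vector<float>& x,
                                       const std::vector<float>& w,
                                       std::size_t c_in, std::size_t c_out,
                                       std::size_t t) {
    assert(x.size() == t * c_in && w.size() == c_out * c_in);
    std::vector<float> y(t * c_out, 0.0f);
    for (std::size_t ti = 0; ti < t; ++ti)
        for (std::size_t o = 0; o < c_out; ++o) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < c_in; ++c)
                acc += w[o * c_in + c] * x[ti * c_in + c];
            y[ti * c_out + o] = acc;
        }
    return y;
}
```

Because both loops accumulate in the same order when k == 1, the two results agree exactly in this sketch; a fused GPU kernel may reorder the sum, which is why the real parity test reports a small non-zero max_abs_err.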

The plan document is an AI-authored R&D scratchpad that doesn't belong in the committed source tree alongside production code. Move it out of tts-cpp/ so the subtree only ships the implementation; the file continues to live locally under aiDocs/ for ongoing iteration. No code or build changes; documentation-only. Co-authored-by: Cursor <cursoragent@cursor.com>

…and-optimize-OpenCL-for-supertonic Qvac 18607 tts ggml add and optimize open cl for supertonic

Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR tetherto#16): the audit-driven optimisations there are backend-portable by construction (every host-sync / bandwidth / fusion win uses the same GPU dispatch path Vulkan walks), so this PR only adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not need. Vulkan-specific deltas - supertonic_model gains backend_is_vk + use_native_leaky_relu, both resolved at GGUF load time: - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine backend_name() can annotate the device with ggml_backend_vk_get_device_description(). - use_native_leaky_relu via a ggml_backend_supports_op probe against a synthetic LEAKY_RELU node — short-circuits leaky_relu_portable_ggml to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain upstream OpenCL. Dynamic probe self-adapts to whichever ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml ships in. - supertonic_backend_supports_f16_kv_flash_attn probe (synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the use_f16_attn auto-policy so a backend that ships flash_attn_ext but rejects the F16-K/V variant for Supertonic shapes keeps the F32 path instead of crashing at first synth call. Manual --f16-attn 1 still forces F16 (debug knob). - Vulkan device selection: replaces the historical hard-coded ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed through EngineOptions::vulkan_device, range-checked against ggml_backend_vk_get_device_count() at load (out-of-range index is a hard error — surfaces operator typos / wrong-machine config loud rather than silently falling back to CPU). Verbose mode + bench output append the Vulkan device description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran. 
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch. Tests - test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass. - test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass. - test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass. - All audit follow-up tests from tetherto#16 unchanged, all PASS. Build - All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean. - No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible. Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick). Co-authored-by: Cursor <cursoragent@cursor.com>
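The native-versus-decomposed LEAKY_RELU switch can be illustrated with scalar code. This is a sketch under stated assumptions: the commit only names the decomposition as RELU+SCALE+ADD, so the exact op order used by `leaky_relu_portable_ggml` is not known here, and `(1 - slope) * relu(x) + slope * x` is just one algebraically equivalent form. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// What a fused builtin computes.
float leaky_relu_native(float x, float slope) {
    return x > 0.0f ? x : slope * x;
}

// Portable form built from RELU, SCALE and ADD only (assumed op order).
float leaky_relu_decomposed(float x, float slope) {
    const float r = std::max(x, 0.0f);      // RELU
    return (1.0f - slope) * r + slope * x;  // SCALE, SCALE, ADD
}

// Mirrors the idea of a load-time flag choosing between the two paths.
std::vector<float> leaky_relu_dispatch(const std::vector<float>& in,
                                       float slope,
                                       bool use_native_leaky_relu) {
    assert(slope >= 0.0f && slope < 1.0f);
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = use_native_leaky_relu ? leaky_relu_native(in[i], slope)
                                       : leaky_relu_decomposed(in[i], slope);
    }
    return out;
}
```

For slopes that are not exactly representable the two paths can differ by rounding, so a parity gate on real data would normally compare with a tolerance rather than bit-exactly.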

`g_s3gen_cache_refcount` is a `std::atomic<int>` (line 189) but `<atomic>` was never included; the file relied on a transitive include chain that broke once any consumer rearranged includes. Surfaces as `error: variable 'std::atomic<int> ... has initializer but incomplete type'` on a clean build. Pre-existing bug, unrelated to QVAC-18605 itself but blocked local CTest runs against the Vulkan-optimisation work. Trivial additive include with no behaviour change. Co-authored-by: Cursor <cursoragent@cursor.com>

…s + prewarm Layered on top of the QVAC-18605 Vulkan bring-up commit; the round-2 changes generalise the bring-up's "load-time backend probe" pattern into a process-wide capability cache and add three more probes / dispatch hooks that fit the same shape. Net effect on Vulkan: redundant supports_op traffic eliminated, defensive auto-policy gating extended to F16 weights, forward- compat Q8_0 K/V probe primed for a follow-up dispatch flip, and an opt-in --prewarm hook that lets operators amortise the ~hundreds-of-ms cold-start shader-compile cost outside the operator-visible first synth call. 1) Process-wide capability-probe cache keyed by ggml_backend_t The bring-up's three load sites (load_supertonic_gguf, Engine::Engine, supertonic_bench's main) each ran the LEAKY_RELU + F16-K/V flash-attn supports_op queries independently — 2-3x redundant probe traffic per backend. On Vulkan, supports_op may inspect the device's pipeline state (~50-200 us per query on Adreno / llvmpipe / RADV in microbenchmarks); the cache short-circuits 100 % of the duplicates. Test seam (supertonic_clear_capability_cache + supertonic_capability_probe_call_count) lets the unit test verify the cache is hit on the second call by comparing the counter before / after. Per-backend independence verified against two distinct CPU backend handles. 2) F16 mul_mat backend-capability probe Symmetric to the F16-K/V flash-attn probe. The bring-up auto-enabled use_f16_weights on `!backend_is_cpu` blindly; a partial-port backend that ships F16 storage but rejects the hot vector-estimator W_query mul_mat shape would crash at first synth call. Probe builds the live shape ([256,256] F16 weight x [256,16] F32 activation) and asks the backend; auto-policy refuses materialisation on a `false` answer (slower F32 path stays correct). Manual --f16-weights 1 still forces materialisation (debug-shim escape hatch). Probe cached; test verifies CPU returns true. 
3) Q8_0 K/V flash-attn forward-compat probe Vulkan's GGML_OP_FLASH_ATTN_EXT supports_op advertises Q8_0 (and Q4_0) K/V types in scalar + coopmat2 paths. Switching K/V from F16 to Q8_0 would halve the per-step upload bandwidth (50 KB -> 25 KB per K/V on Supertonic's hot shape; ~1 MB / synth on the default 5-step x 4-site schedule) in exchange for a small (~0.5 %) drift on the attention output. This commit adds the probe + caches the result; live dispatch site is NOT yet wired pending F16-vs-Q8_0 K/V drift measurement against the parity harness on a real Vulkan adapter. Bench output annotates `(q8_0_kv_attn=available)` when the probe says yes so operators can confirm their hardware is ready for the follow-up. 4) Engine::warm_up(text) + EngineOptions::prewarm_text + --prewarm CLI flag (supertonic-cli, tts-cli, supertonic-bench) First-synth-latency reduction on Vulkan / OpenCL. In-tree thread_local graph caches handle every subsequent call but can't avoid the first pipeline-compile cost (~hundreds of ms on Adreno / RADV per chatterbox PROGRESS.md). warm_up runs one throwaway synth at construction time on a caller- supplied sample text so the operator-visible first synth sees steady-state latency. Auto-no-op on CPU (no shader- compile cost). Bench's --prewarm runs the cold-start synth BEFORE the timed loop (independent of --warmup N which only discards N timed runs from the median); cold-start latency logged as `[prewarm] cold-start synth on '...' took N.Nms` and emitted to --json-out as "prewarm_ms". 5) Bench output extended Backend log line surfaces every dispatch flag plus the cold-start prewarm latency: Vulkan (device 0: ...) (f16_attn=on) (f16_weights=on) (native_leaky_relu=on) (q8_0_kv_attn=available) --json-out gains "f16_attn", "f16_weights", "native_leaky_relu", "q8_0_kv_attn_available", "prewarm_ms" keys for downstream analysis tooling. 
Tests - test-supertonic-capability-cache (NEW, LABEL "unit"): probe cache short-circuit + clear seam + per-backend independence + idempotency + F16 mul_mat probe + Q8_0 K/V probe smoke. 18 / 18 checks pass. - test-supertonic-warm-up-api (NEW, LABEL "unit"): API-surface contract for EngineOptions::prewarm_text + Engine::warm_up via SFINAE. 9 / 9 checks pass. - All existing CPU-only unit tests (test-supertonic-vulkan- dispatch, -portable-ops, -backend-dispatch, -rope-in-graph, -rope-packed-qk, -in-graph-transpose, -convnext-block-fused, -graph-to-graph-blit, -profile-csv, -f16-attn-parity, plus resample / cpu-caches / t3-caches): all 13 pass unchanged. - ctest -L unit reports 100 % pass (15 / 15 binaries; 184+ / 184+ individual checks). Build - All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined. - No public-API break: EngineOptions::prewarm_text is a new optional field defaulting to empty (no-op), Engine::warm_up is a new method (existing callers don't have to invoke it). Deferred (tracked in PROGRESS_SUPERTONIC.md "Deferred work"): persistent VkPipelineCache (cross-process), BF16 K/V flash-attn, Q8_0 K/V live dispatch wiring, multi-device load-balancing. Co-authored-by: Cursor <cursoragent@cursor.com>
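The process-wide probe cache and its test seam can be sketched as follows. Everything here is hypothetical: `CapabilityCache`, `BackendCaps` and the opaque `const void*` key are stand-ins for the real cache keyed by `ggml_backend_t`, and the probe callback replaces the actual `supports_op` queries.

```cpp
#include <cassert>
#include <map>

// Hypothetical subset of the cached capability flags.
struct BackendCaps {
    bool native_leaky_relu;
    bool f16_kv_flash_attn;
};

class CapabilityCache {
public:
    using Probe = BackendCaps (*)(const void* backend);

    explicit CapabilityCache(Probe probe) : probe_(probe) {
        assert(probe != nullptr);
    }

    // Runs the probe at most once per backend handle.
    const BackendCaps& get(const void* backend) {
        auto it = cache_.find(backend);
        if (it == cache_.end()) {
            ++probe_calls_;
            it = cache_.emplace(backend, probe_(backend)).first;
        }
        return it->second;
    }

    // Test seams: drop all entries, and read the probe counter.
    void clear() { cache_.clear(); }
    int probe_call_count() const { return probe_calls_; }

private:
    Probe probe_;
    std::map<const void*, BackendCaps> cache_;
    int probe_calls_ = 0;
};
```

Comparing `probe_call_count()` before and after a second `get` on the same handle is the same style of check the commit describes for `supertonic_capability_probe_call_count`. A real process-wide cache would also need a mutex if loads can happen on several threads; that is omitted here.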

…vice auto-pick + 2 forward-compat probes Three more Vulkan-specific deltas, all developed test-first. New tests were committed first, observed to fail on the missing symbol, and only then was the implementation written and the tests re-run to verify green. 1. BF16 K/V flash-attn capability probe (5th cached_backend_capabilities flag). Symmetric to the round-2 Q8_0 K/V probe. Vulkan's FLASH_ATTN_EXT supports_op advertises BF16 K/V via the coopmat2- only path; BF16 has the same 2-byte per-element footprint as F16 (so identical upload bandwidth) but the wider 8-bit exponent range avoids the F16 underflow on small attention scores. Forward-compat — the live --kv-attn-type bf16 dispatch wiring is deferred to a follow-up that measures drift against the parity harness on a real Vulkan adapter. 2. Multi-device auto-pick for --vulkan-device -1. Wires the previously-reserved auto-pick API: walks every visible adapter, queries ggml_backend_vk_get_device_memory() to read free VRAM, and dispatches into a pure-logic helper resolve_vulkan_device_index(requested, free_vram_per_device) that picks argmax(free_vram); ties → lower index for stable per-run assignment on identical-spec multi-GPU machines. The pure-logic helper is testable on CPU with synthetic inputs (8 test functions, 23 checks). Reserved-future negative values (-2, -100, ...) now throw instead of silently falling through to device 0. Verbose mode logs the per-device VRAM table so operators can confirm the auto-pick chose the expected adapter. 3. Pinned-host-buffer-type capability probe (6th cache flag) + bench surface. Probes whether ggml_backend_vk_host_buffer_type() is callable on the resolved backend (Vulkan + non-null buffer- type). Forward-compat — primes the capability cache for a follow-up per-engine input-scratchpad refactor that skips ggml-vulkan's internal staging-buffer hop on per-step uploads. 
Bench output now shows bf16_kv_attn_available + pinned_host_buffer_available in both the human-readable backend tag and the JSON output so operators can pre-flight whether a future opt-in will be effective on their machine. Test plan (TDD round 3): - test-supertonic-capability-cache: 27 / 27 checks pass (was 18, +9 checks for round-3: BF16 K/V smoke + cache-slot share, pinned-host-buffer smoke + cache-slot share, null-backend defensive checks for both new probes). - test-supertonic-vulkan-device-select (NEW): 23 / 23 checks pass (8 test functions: empty-list, single-device, argmax-VRAM, tie- break, explicit-index passthrough, out-of-range, reserved- negative, zero-VRAM handling). - Whole CPU-only ctest -L unit reports 16 / 16 tests passing, zero regressions on round-1 / round-2 / audit-follow-up tests. CLI surface: - supertonic CLI + chatterbox CLI usage strings updated to document --vulkan-device -1 = auto-pick adapter with most free VRAM. - supertonic-bench usage string updated likewise. Co-authored-by: Cursor <cursoragent@cursor.com>
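The pure-logic device resolver described in item 2 might look like the sketch below. The function name and parameter list follow the commit text, but the body is an assumption; in particular, throwing on an empty device list is a guess, since the commit only says the empty-list case is tested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch: -1 selects the adapter with the most free VRAM (ties go to the
// lower index), values >= 0 pass through after a range check, and other
// negative values are rejected.
int resolve_vulkan_device_index(
        int requested,
        const std::vector<std::uint64_t>& free_vram_per_device) {
    if (requested < -1) {
        throw std::invalid_argument("reserved negative device index");
    }
    const std::size_t n = free_vram_per_device.size();
    if (n == 0) {
        throw std::runtime_error("no Vulkan devices visible");  // assumption
    }
    if (requested >= 0) {
        if (static_cast<std::size_t>(requested) >= n) {
            throw std::out_of_range("Vulkan device index out of range");
        }
        return requested;
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i) {
        // Strict '>' keeps the lower index on ties.
        if (free_vram_per_device[i] > free_vram_per_device[best]) best = i;
    }
    assert(best < n);
    return static_cast<int>(best);
}
```

Keeping the function free of any Vulkan calls is what makes it testable on a CPU-only machine: the caller gathers the free-VRAM table and the resolver only sees plain integers.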

…hts operator deny-list

Round 6 layers a user-overridable extra deny-list on top of the existing hand-curated should_materialise_f16_weight() allow-list. The curated allow-list (Phase 2A) already excludes biases, norms, embeddings, depthwise convs, and pre-transposed companions; the round-6 deny-list lets operators force-keep specific additional tensors as F32 even when --f16-weights is on.

Use cases:
- A/B testing: researcher excludes a specific tensor pattern temporarily without recompiling.
- Hardware-specific drift mitigation: operator pins a problematic tensor to F32 via config rather than disabling F16 weights wholesale.
- Future-GGUF safety net: new tensor patterns added in future GGUFs that the curated allow-list inadvertently scoops in can be excluded via config without a code change.

Smallest blast radius of the four follow-up rounds: load-time policy only, runtime dispatch unaffected, zero behaviour change on the empty-deny-list default path.

Strict TDD discipline (per the user's "double check, don't break anything" constraint):
- Both new tests committed FIRST.
- Both confirmed to fail to compile on the missing symbols (predicate test: 'too many arguments to should_materialise_f16_weight'; API test: 'EngineOptions has no member f16_weights_deny_list').
- Implementation written.
- Both tests + every existing unit test re-run; all green.

What changed:
1. 2-arg overload should_materialise_f16_weight(name, extra_deny_substrings) added alongside the existing 1-arg version (existing test + call sites unchanged). Substring matching matches the curated predicate's audit-friendly style; no regex compile cost or invalid-pattern surface. The deny-list can only flip true → false, never false → true. Empty strings inside the deny-list are SKIPPED defensively, not treated as universal matches (config-typo guard).
2. EngineOptions::f16_weights_deny_list (vector<string>, default empty): public API surface. Wired through Engine::Impl → load_supertonic_gguf → the per-tensor allocation loop.
3. load_supertonic_gguf 7th parameter added at the end of the signature with a {} default, so every existing call site keeps compiling without modification.
4. supertonic_model::f16_weights_excluded_count counter bumped at load time when a curated-hot tensor is excluded by the user's deny-list. Surfaced in bench's human + JSON output so operators can confirm their config took effect.
5. CLI plumbing: --f16-weights-deny PAT1,PAT2,... flag on supertonic-cli, tts-cli (chatterbox), and supertonic-bench (comma-separated substring patterns).
6. Verbose-log line in load_supertonic_gguf when the deny-list is non-empty (silent on the default path, so no visual noise on existing operator workflows).

Test plan (TDD round 6):
- test-supertonic-f16-weights (UPDATED): existing 36 checks (positives, negatives, edges) + 29 new round-6 checks across 7 new test functions (empty-list passthrough, matching-deny-excludes, non-matching-no-op, cannot-promote-cold, multiple-patterns ANY-match, empty-string defensive skip, empty-name safety) → 65 / 65 PASS.
- test-supertonic-f16-deny-list-api (NEW): SFINAE compile-time gate for EngineOptions::f16_weights_deny_list + load_supertonic_gguf 7th param; runtime defaults check + assignability + regression guards on every other documented EngineOptions default → 9 / 9 PASS.
- Whole CPU-only ctest -L unit reports 17 / 17 tests, 0 failures, 0 regressions on round-1/2/3 + audit follow-up + the baseline tests.
- Smoke-tested supertonic-cli + tts-cli + supertonic-bench binaries: --f16-weights-deny flag parses correctly, surfaces in --help output, and threads through to the load layer.

Co-authored-by: Cursor <cursoragent@cursor.com>
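
The deny-list semantics described in this commit (substring match, true → false only, empty patterns skipped) can be sketched roughly as below. This is a minimal illustration, not the PR's code: the 1-arg predicate here is a hypothetical stand-in whose rules ("bias" / "norm" stay F32) are assumed for the example, and the tensor names are made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the curated 1-arg predicate. The real allow-list
// lives in the PR; the two rules below are assumptions for illustration only.
inline bool should_materialise_f16_weight(const std::string & name) {
    if (name.empty()) return false;
    if (name.find("bias") != std::string::npos) return false;
    if (name.find("norm") != std::string::npos) return false;
    return true;
}

// 2-arg overload: the operator deny-list can only flip true -> false.
inline bool should_materialise_f16_weight(const std::string & name,
                                          const std::vector<std::string> & extra_deny_substrings) {
    // A tensor the curated predicate already rejects is never promoted.
    if (!should_materialise_f16_weight(name)) return false;
    for (const std::string & pattern : extra_deny_substrings) {
        if (pattern.empty()) continue;  // config-typo guard: "" is not a universal match
        if (name.find(pattern) != std::string::npos) return false;  // ANY match excludes
    }
    return true;
}
```

With an empty deny-list the overload returns exactly what the 1-arg predicate returns, which is the "zero behaviour change on the default path" property the commit message claims.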

…ype K/V flash-attention dispatch

Generalises the round-1 `--f16-attn` boolean (F16 vs F32 only) into a four-valued enum + `--kv-attn-type {auto,f32,f16,bf16,q8_0}` CLI flag so operators can opt into BF16 K/V (Vulkan coopmat2: same bandwidth as F16, no F16 underflow on small attention scores) or Q8_0 K/V (Vulkan + half the K/V upload bandwidth) on adapters that advertise the corresponding capability. Default `auto` falls back to `--f16-attn` so every existing operator config sees zero behaviour change.

Strict TDD throughout: Prereq B extends the F16 parity harness to cover BF16 (4 → 8 checks, 5e-3 abs / 5e-3 rel tolerance band, both hot shapes) BEFORE touching any production code; new pure-logic resolver test (`test-supertonic-kv-attn-type`, 106 checks across the full {-1, 0..3} × legacy × probe-mask matrix); new API-surface SFINAE lockdown (`test-supertonic-kv-attn-type-api`, 18 checks). Tests committed first, observed to fail on missing symbols, then implementation added.

Pure-logic resolver (`resolve_kv_attn_type`) split from the dispatch site (same pattern as round-3's `resolve_vulkan_device_index`). Probe-rejected explicit requests fall back to F32 silently (advisory-probe contract); out-of-range int throws to surface CLI typos loudly.

Vector-estimator dispatch site (`build_text_attention_cache`) replaces the F16-only cast with a switch on the enum; cache key promoted from `bool f16_kv_attn` to `kv_attn_dtype kv_attn_type`.

Bench surface adds `(kv_attn_type=…)` to the human-readable backend line and `"kv_attn_type"` + `"kv_attn_type_requested"` to the JSON output so log-grep / CI attribution works across machines.

Bonus: `supertonic-cli`'s arg-parse loop is now wrapped in try/catch so invalid values surface as a clean `error: ...` line + exit 2 (also fixes the pre-existing latent crash on `--vulkan-device abc` / `--seed nonsense`).

Whole CPU-only `ctest -L unit` reports 19 / 19 tests, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
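
The resolver contract described here (auto mirrors the legacy boolean, probe-rejected explicit requests fall back to F32, out-of-range throws) might look roughly like the sketch below. The integer mapping 0..3 → f32 / f16 / bf16 / q8_0, the `kv_probe_mask` struct, and the exact signature are assumptions for illustration; only `resolve_kv_attn_type` and `kv_attn_dtype` are names taken from the PR text.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class kv_attn_dtype { f32, f16, bf16, q8_0 };

// Hypothetical probe results; the PR's real capability struct may differ.
struct kv_probe_mask {
    bool f16;
    bool bf16;
    bool q8_0;
};

// Pure logic, no backend calls, so it can be unit-tested on CPU.
// requested: -1 = auto, 0..3 = explicit dtype (mapping assumed, see lead-in).
inline kv_attn_dtype resolve_kv_attn_type(int requested, bool legacy_use_f16_attn,
                                          const kv_probe_mask & probes) {
    switch (requested) {
        case -1:  // auto: mirror the legacy --f16-attn boolean, still probe-gated
            return (legacy_use_f16_attn && probes.f16) ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
        case 0:
            return kv_attn_dtype::f32;
        case 1:  // explicit requests fall back to F32 silently when the probe says no
            return probes.f16 ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
        case 2:
            return probes.bf16 ? kv_attn_dtype::bf16 : kv_attn_dtype::f32;
        case 3:
            return probes.q8_0 ? kv_attn_dtype::q8_0 : kv_attn_dtype::f32;
        default:  // surface CLI typos loudly
            throw std::invalid_argument("kv_attn_type out of range: " + std::to_string(requested));
    }
}
```

Keeping the resolver free of backend handles is what makes the {-1, 0..3} × legacy × probe-mask matrix cheap to enumerate in a CPU-only test.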

…servability + voice cache + Vulkan env-var passthrough

Lowest impact-÷-risk round of the four planned in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching the per-synth hot path beyond a single voice-cache lookup.

1. Voice ttl/dp host cache (`detail::voice_host_cache`). Eliminates 2 sync points / synthesize() after the first per-voice call on Vulkan / OpenCL. Extracted to a standalone helper so the lookup-or-load semantics are testable on CPU without instantiating a full Engine; reference-stability contract documented for the synthesis-pipeline call site.
2. Vulkan env-var passthrough (`apply_vulkan_env_overrides(map)` public helper + `EngineOptions::vulkan_env_overrides` field + `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` / `--vulkan-disable-bfloat16` / `--vulkan-perf-logger` / `--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags on all three binaries). ALL-OR-NOTHING validation: an operator-config typo throws cleanly BEFORE any env var is touched. `set_env_if_unset` semantics so an operator-set env var still WINS over the EngineOptions override.
3. Bench `ggml_backend_synchronize` boundaries (`--no-bench-sync` opt-out). Inserts an explicit backend sync at every per-stage timing boundary so wall-clock attributes to the right stage on async backends. Cheap on CPU; prerequisite for measuring round-5 / 8 / 9 wins on real hardware.
4. Bench per-denoise-step breakdown (`--bench-per-step`). Times each `supertonic_vector_step_ggml` call individually so the first-step (cold pipeline) cost is distinguished from steady-state. Empty array on the default-off path = identical legacy JSON shape.

Strict TDD throughout. Two new test executables committed first, observed to fail on missing symbols, then implementation written. TDD also caught a real bug: the original env-key validator used `std::string()` empty-as-success sentinel which collided with the empty-string-as-key edge case; the test pinned the contract and forced a `bool / out-param` API fix BEFORE any production wiring went in.

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions (was 19; +2 new tests = 54 new checks).

Co-authored-by: Cursor <cursoragent@cursor.com>
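
A rough sketch of the env-var passthrough contract from sub-feature 2: validate every key first, then set only the variables the operator has not already set. The validation rule (non-empty, no `=`) and the `validate_env_key` name are assumptions; the PR's real validator may check more. The sketch relies on POSIX `setenv`, so it is not portable to Windows as written.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <stdlib.h>  // setenv / getenv / unsetenv (POSIX)
#include <string>

// bool + out-param shape, so an empty string can never double as "success".
inline bool validate_env_key(const std::string & key, std::string & error) {
    if (key.empty()) { error = "empty env key"; return false; }
    if (key.find('=') != std::string::npos) { error = "env key contains '=': " + key; return false; }
    return true;
}

inline void apply_vulkan_env_overrides(const std::map<std::string, std::string> & overrides) {
    // Pass 1: all-or-nothing validation. Nothing is written if any key is bad.
    for (const auto & kv : overrides) {
        std::string error;
        if (!validate_env_key(kv.first, error)) throw std::invalid_argument(error);
    }
    // Pass 2: set-if-unset. overwrite=0 keeps a value the operator exported.
    for (const auto & kv : overrides) {
        setenv(kv.first.c_str(), kv.second.c_str(), /*overwrite=*/0);
    }
}
```

Splitting validation from application is what gives the "typo throws BEFORE any env var is touched" guarantee; a single loop would leave the environment half-modified on failure.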

…PU bridge

Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up: same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site).

Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0`: trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without `vector_rope_theta` continue to take the host-rotate path.

The blit primitive parity gate already shipped with PR #16 (`test-supertonic-graph-to-graph-blit`); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`).

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>

…U bridge

Extends the round-8 GPU bridge pattern to the 4 style flash-attn sites (style0 + g1_style + g2_style + g3_style). Largest bandwidth-style optimisation that ships from pure-Supertonic-side code: 120 sync points / synth eliminated on the production Vulkan / OpenCL path (4× the round-8 win).

- vector_res_style_qkv_result extended with `sq_gpu / sk_gpu / sv_gpu` GPU handles, populated unconditionally by `run_res_style_qkv_cache` (cheap: no GPU sync, just `ggml_graph_get_tensor` lookups). Same shape as `vector_group_graph_result::q_rope_gpu` etc from the round-1 2C-lite work.
- `run_res_style_qkv_cache` host-download gating: the 3 `tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv` are now gated on `trace != nullptr`. Production path skips them entirely. Mirrors the round-1 2C-lite `need_host_qkv = (trace != nullptr)` gate. `post` stays unconditional: it is consumed by the next-stage `run_style_residual_cache` which still expects a host vector (cross-stage GPU bridge for `post` is deferred).
- 4 dispatch sites rewired with the same gating pattern as the round-8 front-block bridge: `!include_ggml_trace && sq_gpu && sk_gpu && sv_gpu` → GPU bridge; otherwise legacy host bridge. Trace mode falls back to the legacy host bridge so the trace harness still gets all the host vectors.

Strict TDD: parity test (`test-supertonic-graph-to-graph-blit`) extended with explicit style-shape coverage (`style_sq_L1` trip-wire + clarified `style0_q_rope_L20` / `style0_k_rope_kv50`) BEFORE any production wiring. All 24 / 24 parity checks pass at bit-exact `max_abs = 0.0`.

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>

…t upload-skip

After rounds 8 + 9 wired the GPU bridge for the 5 attention sites, the largest remaining per-step host upload is `text_emb` (uploaded to 4 caches × 5 denoise steps = 20 times / synth, but constant data within one synth). Round 10 generalises the F4 pointer-compare upload-skip pattern (already used for `style_v_in` / `kctx_in`) into a reusable `upload_skip_tracker` helper and applies it to the front-block + 3 group caches.

CRITICAL CORRECTNESS HAZARD addressed: `text_emb` is a stack-local `std::vector<float>` in `Engine::Impl::synthesize()` (and bench loops). Modern heap allocators (jemalloc / tcmalloc / glibc) very often re-issue the SAME address for the next stack-local vector of the same size, so synth N+1 may have `text_emb.data() == synth_N.text_emb.data()` despite holding completely different data. A naive pointer-compare upload-skip would silently leak prior synth's text-encoder embedding into the next synth's GPU buffer.

Mitigation: caller MUST invoke `tracker.reset()` at every synth boundary (`current_step == 0`). The CPU-only TDD test includes an explicit cross-synth pointer-reuse hazard simulation that documents the bug and verifies the reset prevents it.

Per-synth wins:
- 16 fewer `ggml_backend_tensor_set` host→GPU uploads per synth
- ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt length)

Strict TDD: `test-supertonic-upload-skip-tracker` (NEW, 7 functions, 41 checks) committed first, observed to fail compile (`upload_skip_tracker was not declared`), then implementation added.

Whole CPU-only `ctest -L unit` reports 22 / 22 tests, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
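
To make the pointer-reuse hazard concrete, here is a minimal sketch of a pointer-compare tracker and the step loop around it. Only the `upload_skip_tracker` name and its `needs_upload` / `reset` methods come from the PR text; the member layout, the `const void *` argument, and the `count_uploads` helper are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

// Skips an upload when the host pointer matches the one uploaded last time.
struct upload_skip_tracker {
    const void * last_ptr = nullptr;
    bool         has_last = false;

    bool needs_upload(const void * host_ptr) {
        if (has_last && host_ptr == last_ptr) return false;  // same buffer: skip
        last_ptr = host_ptr;
        has_last = true;
        return true;
    }
    void reset() { has_last = false; last_ptr = nullptr; }
};

// Hypothetical helper: counts the uploads one synth (n_steps denoise steps)
// would issue, optionally resetting at the synth boundary (step 0).
inline int count_uploads(upload_skip_tracker & tracker, const float * text_emb,
                         int n_steps, bool reset_at_step0) {
    int uploads = 0;
    for (int step = 0; step < n_steps; ++step) {
        if (step == 0 && reset_at_step0) tracker.reset();
        if (tracker.needs_upload(text_emb)) ++uploads;
    }
    return uploads;
}
```

If a second synth reuses the same address with new contents and the reset is skipped, the loop issues zero uploads, which is exactly the silent stale-embedding case the commit message warns about.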

…PU-bridge layout fix

Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware.

Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.

Root cause: `apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

The fix (strict TDD):
1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip, so the same `ggml_cont(ggml_transpose(...))` is open-coded at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
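
The layout flip at the heart of round 11 can be illustrated without ggml at all. The sketch below is plain C++ on flat buffers and only mirrors what a contiguous copy of a transposed 2-D tensor does to the bytes; `to_time_major` is a hypothetical name, not a function from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// src: ne=[L, HD], time index fastest  -> element (t, c) at src[c * L + t]  (channel-major-flat)
// dst: ne=[HD, L], channel index fastest -> element (c, t) at dst[t * HD + c] (time-major-flat)
inline std::vector<float> to_time_major(const std::vector<float> & src,
                                        std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(L * HD);
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[t * HD + c] = src[c * L + t];
        }
    }
    return dst;
}
```

For L=2, HD=3, a source buffer holding channel rows `{0,1}`, `{10,11}`, `{20,21}` becomes time rows `{0,10,20}`, `{1,11,21}`. Applying the function twice with the dimensions swapped restores the original bytes, which is why a leftover host-side transpose after the graph-side fix would feed wrong-orientation data (item 4 above).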

tradingsuit-freddy

Review of the Vulkan delta (pr16...pr17). The round-11 RoPE/transpose layout fix looks correctly applied at all four attention sites and the legacy host downloads moved to tensor_raw_f32 consistently. Below are 2 blocking issues I verified line-by-line plus 2 non-blocking risks.

tradingsuit-freddy · 2026-06-04T22:08:07Z

+#ifdef GGML_USE_VULKAN
+        if (model.backend_is_vk) {
+            char desc[256] = {0};
+            ggml_backend_vk_get_device_description(opts.vulkan_device < 0 ? 0 : opts.vulkan_device,


BUG (blocking): --vulkan-device -1 (auto-pick) reports the wrong device in every log / bench / JSON line.

backend_name() builds the device label from the raw option, mapping the auto-pick sentinel -1 to 0:

ggml_backend_vk_get_device_description(opts.vulkan_device < 0 ? 0 : opts.vulkan_device, desc, ...);
out += " (device " + std::to_string(opts.vulkan_device < 0 ? 0 : opts.vulkan_device) + ": " + desc + ")";

The index actually chosen by resolve_vulkan_device_index (argmax free VRAM) is never propagated back to opts/model, so on a multi-GPU host --vulkan-device -1 that resolves to device 2 still prints device 0: <wrong name>. That defeats the exact use case the comment promises ("unambiguous when triaging multi-GPU machines"), and supertonic_bench.cpp has the same issue (~line 538), so the bench JSON attributes timings to the wrong adapter.

Suggest storing the resolved index (e.g. model.vulkan_device_resolved) at backend init and using it here instead of opts.vulkan_device.
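
One possible shape for that suggestion, as a sketch only: resolve once, store the result on the model, and build the label from the stored value. `resolve_vulkan_device_index` and `vulkan_device_resolved` are names from this thread; the signature, the throw-on-error behaviour, the lowest-index tie-break, and the `model_sketch` / `device_label` helpers are assumptions, and the real code would query free VRAM and descriptions from the Vulkan backend rather than take them as vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// requested >= 0: explicit index passthrough; -1: argmax free VRAM.
inline int resolve_vulkan_device_index(int requested,
                                       const std::vector<std::uint64_t> & free_vram_bytes) {
    const int n = static_cast<int>(free_vram_bytes.size());
    if (n == 0) throw std::runtime_error("no Vulkan devices enumerated");
    if (requested >= 0) {
        if (requested >= n) throw std::out_of_range("--vulkan-device index out of range");
        return requested;
    }
    if (requested != -1) throw std::invalid_argument("negative values other than -1 are reserved");
    int best = 0;
    for (int i = 1; i < n; ++i) {
        // Strict '>' keeps the lowest index on ties (assumed tie-break rule).
        if (free_vram_bytes[i] > free_vram_bytes[best]) best = i;
    }
    return best;
}

// Hypothetical slice of the model struct: the resolved index is stored once at
// backend init so logs, bench output and JSON all report the same adapter.
struct model_sketch {
    int vulkan_device_resolved;
};

inline std::string device_label(const model_sketch & model,
                                const std::vector<std::string> & descriptions) {
    const int idx = model.vulkan_device_resolved;
    return " (device " + std::to_string(idx) + ": " +
           descriptions.at(static_cast<std::size_t>(idx)) + ")";
}
```

With free VRAM of 4, 16 and 8 units, `-1` resolves to device 1 and the label names device 1, rather than falling back to device 0 as the current `opts.vulkan_device < 0 ? 0 : ...` expression does.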

tradingsuit-freddy · 2026-06-04T22:08:07Z

+            // No-op for the default `kv_attn_type == -1` path (the
+            // resolver already mirrors the boolean).  Becomes a
+            // no-op for explicit `--kv-attn-type 1` too.
+            model.use_f16_attn = (model.kv_attn_type == kv_attn_dtype::f16);


BUG (blocking): --f16-attn 1 no longer forces F16 — the round-1 debug escape hatch was lost in round 4.

The comment at lines 169-170 still states: "Manual override via --f16-attn 1 still forces dispatch (useful for debug-shim backends)." That is no longer true. Round 1 sets use_f16_attn = (opts.f16_attn != 0) (line 175), but round 4 then re-gates it through the probe and overwrites the boolean here:

// resolve_kv_attn_type, case -1 (auto / default kv_attn_type):
if (legacy_use_f16_attn && backend_supports_f16) return kv_attn_dtype::f16;
return kv_attn_dtype::f32;
...
// line 209:
model.use_f16_attn = (model.kv_attn_type == kv_attn_dtype::f16);

So on a backend whose F16-K/V probe returns false — i.e. exactly the "debug-shim backend" the comment targets — --f16-attn 1 silently falls back to F32 and the override is undone. Either the comment is stale and should say the override is probe-gated, or the forced path needs to bypass the probe. Please pick one and align code + comment.

tradingsuit-freddy · 2026-06-04T22:08:07Z

+    return n;
+}
+
+const backend_capabilities & cached_backend_capabilities(ggml_backend_t backend) {


RISK (non-blocking): capability cache keyed by a raw ggml_backend_t pointer has no invalidation hook.

The process-wide cached_backend_capabilities map keys on the backend pointer, and the surrounding comment already acknowledges pointers can be recycled after ggml_backend_free. There is no invalidation in free_supertonic_model, so if a backend is freed and a new one is allocated at the same address it inherits the previous backend's probe results (wrong use_native_leaky_relu / F16 / weights policy) for the rest of the process. For a long-lived host that loads/unloads multiple models this is a latent correctness bug, not just a perf cache. Suggest evicting the entry on backend teardown (or keying on something stable).
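
A minimal sketch of the eviction idea, assuming a single-threaded host for brevity. `backend_handle` stands in for `ggml_backend_t`, and the `capability_cache` class, its `get` / `evict` methods, and the two-field `backend_capabilities` struct are hypothetical; the PR's real cache is a free function over a process-wide map.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical subset of the probe results.
struct backend_capabilities {
    bool native_leaky_relu;
    bool f16_kv_attn;
};

using backend_handle = const void *;  // stand-in for ggml_backend_t

class capability_cache {
public:
    // Probes once per handle, then serves the cached result.
    template <typename Probe>
    const backend_capabilities & get(backend_handle backend, Probe probe) {
        auto it = entries_.find(backend);
        if (it == entries_.end()) it = entries_.emplace(backend, probe(backend)).first;
        return it->second;
    }

    // Call from backend teardown so a recycled address is probed again.
    // References previously returned for this handle become invalid.
    void evict(backend_handle backend) { entries_.erase(backend); }

private:
    std::unordered_map<backend_handle, backend_capabilities> entries_;
};
```

Without the `evict` call, a second backend allocated at the same address is served the first backend's probe results; with it, the next `get` re-runs the probe.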

tradingsuit-freddy · 2026-06-04T22:08:07Z

+    // 4 (skipped) × 3 (groups) × text_len × 256 × 4 bytes.  See
+    // upload_skip_tracker contract in supertonic_internal.h.
+    if (current_step == 0) cache.text_in_skip.reset();
+    if (cache.text_in_skip.needs_upload(text_lc_host)) {


RISK (non-blocking): upload_skip_tracker skips host->device uploads via raw pointer compare — silent stale-input hazard.

Cross-synth correctness rests entirely on reset() being called at current_step == 0 (line 1235). The engine/bench loops honor that today, but nothing ties the reset to the upload path in an integration test, and the pointer-compare can be defeated two ways: (a) the allocator reuses a freed text_emb/text_lc_host address for a different encoding, or (b) the buffer is mutated in place with the same data() pointer across steps (the public supertonic_vector_step_ggml API does not forbid it). In both cases the tracker wrongly skips a required upload and the GPU runs on stale input -> silently wrong audio, no crash. Worth a guard (size/contents hash, or a generation counter bumped per synth) and an integration test that exercises a new encoding without the step==0 reset.
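
For reference, the generation-counter guard could look something like this. It is a sketch under assumptions: `guarded_upload_tracker` and its fields are hypothetical names, and the caller is assumed to bump `generation` once per synth (and again whenever it mutates the buffer in place).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Upload is skipped only when pointer, byte size AND generation all match,
// so a recycled address from a later synth no longer looks "already uploaded".
struct guarded_upload_tracker {
    const void *  last_ptr        = nullptr;
    std::size_t   last_bytes      = 0;
    std::uint64_t last_generation = 0;
    bool          has_last        = false;

    bool needs_upload(const void * ptr, std::size_t bytes, std::uint64_t generation) {
        if (has_last && ptr == last_ptr && bytes == last_bytes &&
            generation == last_generation) {
            return false;
        }
        last_ptr        = ptr;
        last_bytes      = bytes;
        last_generation = generation;
        has_last        = true;
        return true;
    }
};
```

This covers case (a) without relying on a separate `reset()` call. Case (b) is only covered if the mutating caller bumps the generation; detecting unannounced in-place mutation would need a contents hash, at the cost of reading the buffer on every step.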

Zbig9000 · 2026-06-08T07:39:01Z

It has been replaced by another PR.

tradingsuit-freddy · 2026-06-08T15:01:03Z

Process / PR-level notes (separate from the inline findings)

Beyond the four inline comments, a few higher-level points worth raising before this lands:

1. Rounds 1–10 never ran end-to-end and CI didn't catch it. The description itself states that without round 11 "every prior round was hitting a latent assertion-failure during the first real synth call," and that the unit test built Q under the wrong shape so the failure was invisible to CI. That means the "22/22 PASS, 0 regressions" across 10 rounds was false confidence — the tests were green while production crashed on the first synth. The CPU-only unit-test strategy has a real gap: it never exercises the GPU path where the bug actually lived. At minimum, a lavapipe (Vulkan-on-CPU) smoke test in CI would gate the GPU contract.

2. No GPU coverage in CI. All Vulkan validation is manual on the author's dev rig (RTX 5090 / RADV / lavapipe); nothing in CI gates the GPU path. Given point 1, that's a significant risk for a +13k-line change on the inference path.

3. Known-broken behaviors are being merged.

- F18/F19 cache-reuse failures are deferred but described as "newly observable post-round-11"; please confirm they're ticketed and don't affect the shipping path.
- The UMA auto-pick (--vulkan-device -1) picks the iGPU over a discrete GPU (documented ~4× regression) and ships as the "auto" behavior with no warning in --help. At least the help text should warn, or -1 shouldn't be recommended.

4. Size / reviewability + public-API change.

- 11 rounds plus a critical correctness fix stacked into one diff; the round-11 fix (the only thing that makes it run at all) is buried under 10 rounds of optimization. Correctness should ideally have landed separately from the perf work.
- EngineOptions in the public header (include/tts-cpp/supertonic/engine.h) changed layout (vulkan_device inserted before f16_weights, plus new std::vector/std::map members). That's an ABI break, relevant because tts-cpp is consumed prebuilt via vcpkg in qvac, so any downstream positional init or layout assumption breaks.

Suggested verdict: block on the two BUG inline comments + a lavapipe smoke in CI (point 1); treat the rest as non-blocking with tracking tickets.

Zbig9000 and others added 9 commits May 11, 2026 14:49

QVAC-18607 [TTS GGML] Add and optimize OpenCL for supertonic

8d5ebb4

Zbig9000 requested review from a team as code owners May 12, 2026 14:09

Zbig9000 requested review from GustavoA1604, freddy311082, ishanvohra2 and ogad-tether May 12, 2026 15:57

GustavoA1604 and others added 12 commits May 12, 2026 18:45

Merge pull request tetherto#16 from Zbig9000/QVAC-18607-TTS-GGML-Add-…

eed9c52

…and-optimize-OpenCL-for-supertonic Qvac 18607 tts ggml add and optimize open cl for supertonic

Zbig9000 force-pushed the QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch from 1b710d3 to c383e70 Compare May 13, 2026 16:01

GustavoA1604 force-pushed the master branch from 6c60e4c to f5f914b Compare May 13, 2026 22:06

gianni-cor force-pushed the master branch from f8af247 to eabcf6d Compare May 28, 2026 12:36

gianni-cor requested review from a team as code owners May 28, 2026 12:36

tradingsuit-freddy reviewed Jun 4, 2026

ishanvohra2 closed this Jun 5, 2026

ishanvohra2 reopened this Jun 5, 2026

Zbig9000 closed this Jun 8, 2026

Zbig9000 wants to merge 21 commits into tetherto:master from Zbig9000:QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic

Summary

End-to-end validation (on real hardware)

Investigation methodology (TDD throughout)

Commit-by-commit walkthrough

33fd5c34 — Round 1: Vulkan bring-up

d080a1e4 — Pre-existing missing-include fix

e09d4278 — Round 2: capability-cache + 3 probes + prewarm

8ae15996 — Round 3: multi-device auto-pick + 2 forward-compat probes

32703fcd — Round 6: F16-weights operator deny-list

2e1c9468 — Round 4: multi-dtype K/V flash-attention dispatch

ba6d1749 — Round 7: bench observability + voice cache + Vulkan env-var passthrough

e8bbc728 — Round 8: front-block attn0 GPU bridge

df895fd6 — Round 9: style flash-attn GPU bridge

358d7aa8 — Round 10: per-step text-input upload-skip

c383e70d — Round 11: Packed-QK RoPE + GPU-bridge layout fix ⚡ CRITICAL CORRECTNESS

Test plan

Smoke testing the CLIs

Backwards compatibility

File-by-file change summary

Deferred follow-ups (intentionally out of scope; pre-existing on master)

Linked
