Qvac 18607 tts ggml add and optimize open cl for supertonic by Zbig9000 · Pull Request #16 · tetherto/qvac-ext-lib-whisper.cpp

Zbig9000 · 2026-05-11T16:01:40Z

Summary

Brings the Supertonic TTS stage of tts-cpp to functional + optimized parity with the existing Chatterbox OpenCL story, then iterates on the resulting baseline through six audit-driven optimization rounds. Each round eliminates one or more host↔GPU synchronization points or redundant memory copies from the per-synth hot path, gated by a new CPU-only TDD test that locks in the bit-exact contract for future regressions.

Net steady-state impact (vs. the unoptimized post-bring-up tree, 5-step default denoise schedule):

Category	Sync points / synth eliminated
Host caches replacing per-step `read_f32` (F1 / F13 / F17 / F9)	~80
Pre-bake / move-to-graph of CPU continuations (F2 / F3 / F10)	~10
Cached graph contexts replacing per-call gallocr churn (F8 / F11 / F14 / F18 / F19)	~30
In-graph RoPE rotation (F20 + F23)	40 (+ ~2 ms host CPU)
GPU→GPU Q/K/V blit for g1/g2/g3 attn (F24 / 2C-lite)	90
Host transpose elimination at hot ingestion sites (F12)	~30
Subtotal	~280 sync points / synth

Plus ~16.8 MiB of redundant vocoder memory traffic removed (F7) and weight bandwidth ~halved on the identified hot matmul / pwconv roster (2A F16 weights).

Investigation methodology

Bring-up first. Commit 8d5ebb4 ports the OpenCL backend-dispatch / portable-op / F16 K-V-attention primitives from Chatterbox to Supertonic and wires them through the CLI / bench / engine layer.
Bring-up TDD safety net. Commit ad1ef07 adds the CPU-only unit harnesses that didn't exist for the bring-up primitives (so ctest -L unit is green on a fresh checkout without needing a Supertonic GGUF + reference dump fixture).
End-to-end audit. Performed a full audit of the post-bring-up tree (text-encoder + duration + vector estimator + vocoder) measuring GPU↔host sync points and bandwidth on each per-synth path. Findings catalogued as F1…F24 with HIGH / MEDIUM / LOW impact tags. Audit report + R&D plan live under aiDocs/ (out-of-tree by design).
Land in phases. Each follow-up commit lands a coherent batch of findings with the same pattern:
- Per-finding rationale reproduced inline as a comment at every load-time hook + rewritten call site (so the rationale stays adjacent to the code it justifies).
- New CPU-only TDD test gates the optimization before implementation lands.
- Existing fixture-bound test-supertonic-* parity harnesses continue to enforce end-to-end correctness.

Commits in this PR

9 commits, 27 files changed, +6966 / −620.

#	Commit	Theme
1	`8d5ebb4`	Bring-up. OpenCL backend dispatch + portable ops + F16 K/V attention.
2	`ad1ef07`	Bring-up safety net. 3 new CPU-only unit harnesses (`backend-dispatch`, `portable-ops`, `f16-attn-parity`) + R&D plan.
3	`e9e76d7`	Audit #1. 9 findings — F1 RoPE θ cache, F2 vocoder BN pre-bake, F3 vocoder unpack in graph, F4 style attention cache re-upload, F5 apply_rope CPU pre-stage, F6 hot-weight transpose, F15/F16 alive-id / generation-id cache hygiene.
4	`5f457c9`	Audit #2. F13 text-encoder LN weight cache, F14 speech-prompted attention QKV cached, 2A F16 weight materialization, 2D profile CSV emitter.
5	`ccec592`	Audit #3. F17 generic scalar `read_f32` cache, F18 text-encoder ConvNeXt graph cached, F19 vector-estimator front-block graph cached.
6	`a0b4e5a`	Audit #4 (F20 partial). `apply_rope_in_graph` helper + universal-op `make_rope_cos_sin_tables` precompute, with TDD test. Integration deferred to keep the change reviewable.
7	`5869231`	Audit #5 (F23). Bake RoPE rotation into the 4 Q/K-producing graphs (front block + 3 group caches); 40 host CPU rotations / synth eliminated.
8	`f74e057`	Audit #6. F7 vocoder ConvNeXt block fusion, F12 in-graph time/channel transpose, F24 (2C-lite) GPU→GPU Q/K/V blit for g1/g2/g3 attn.
9	`cf4aa0e`	Tidying. Remove the in-tree R&D plan doc (moved to local `aiDocs/`).

Code change highlights

tts-cpp/src/supertonic_gguf.cpp (+~700 lines): All host-side caches are populated here at load time — vector_rope_theta (F1), bn_scale_pre / bn_shift_pre (F2), text_encoder_ln_weights (F13), scalar_weight_cache (F17), time_emb_cache (F9). Materializes F16 weight variants for the hot matmul / pwconv roster (2A) with the GGUF-roster-driven name list mirrored from chatterbox.

tts-cpp/src/supertonic_vector_estimator.cpp (+1326 lines, by far the heaviest single file). New graph-cache types (vector_group_graph_cache, vector_text_attention_cache, vector_res_style_qkv_cache, vector_style_residual_graph_cache, vector_tail_graph_cache) replace the historical pattern of building a fresh ggml_context + gallocr per call. Each cache is keyed on its shape parameters + generation_id for safe model swap. Caches also expose GPU tensor pointers (q_rope_gpu, k_rope_gpu, v_gpu) so downstream consumers can ggml_backend_tensor_copy instead of round-tripping through host vectors.

tts-cpp/src/supertonic_internal.h (+~610 lines): All header-only GGML graph helpers — apply_rope_in_graph, apply_rope_to_packed_qk, convnext_block_fused_ggml, transpose_time_channel_ggml, leaky_relu_portable_ggml, plus the dispatch / generation-id / alive-id machinery shared across stages.

tts-cpp/src/supertonic_vocoder.cpp (+200 lines): Pre-baked BN weights consumed directly as graph weights (F2). Latent unpack moved into the cached graph (F3). ConvNeXt blocks rewired through convnext_block_fused_ggml (F7).

tts-cpp/src/supertonic_text_encoder.cpp (+312 lines): LN weight cache lookups (F13). Speech-prompted attention QKV graph cached (F14). ConvNeXt graph cached across synths (F18).

tts-cpp/src/supertonic_duration.cpp (+237 lines): cached_read_f32 lookups replace read_f32 everywhere it previously ran on the hot path (F17). The helper is generic and falls through to read_f32 when the GGUF lacks a rostered name.
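The lazy host-side cache behind F17 can be pictured with the minimal sketch below. It is not the PR's code: `ModelSketch`, `backend_read`, `backend_reads`, and `cached_read_f32_sketch` are hypothetical stand-ins, and the real helper works against GGML backend tensors rather than a callback. The point is only the control flow: the first lookup pays the backend read, later lookups are served from host memory.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the model; only the pieces this sketch needs.
struct ModelSketch {
    // Host-side copies of weights, keyed by source tensor name, filled lazily.
    mutable std::unordered_map<std::string, std::vector<float>> scalar_weight_cache;
    // Stand-in for the backend download (the role read_f32 plays in the PR).
    std::function<std::vector<float>(const std::string &)> backend_read;
    // Counts backend downloads so a test can observe the saved sync points.
    mutable int backend_reads = 0;
};

// Returns the cached host copy; only a cache miss falls through to the backend.
inline const std::vector<float> & cached_read_f32_sketch(const ModelSketch & m,
                                                         const std::string & name) {
    auto it = m.scalar_weight_cache.find(name);
    if (it == m.scalar_weight_cache.end()) {
        ++m.backend_reads;  // one device-to-host download on first touch
        it = m.scalar_weight_cache.emplace(name, m.backend_read(name)).first;
    }
    return it->second;  // references into unordered_map stay valid across inserts
}
```

Calling the helper twice with the same name leaves `backend_reads` at 1, which is the property the structural cache tests described in this PR assert on the real cache.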

Testing strategy

14 new test files (tts-cpp/test/test_supertonic_*), all wired into CMake with LABEL "unit".

CPU-only, no GGUF needed — green on a fresh checkout under ctest -L unit:

backend_dispatch, portable_ops, f16_attn_parity (bring-up primitives)
f16_weights, graph_rewrites, profile_csv (audit #2 primitives)
rope_in_graph, rope_packed_qk (RoPE helpers)
convnext_block_fused, in_graph_transpose, graph_to_graph_blit (audit #6)

Fixture-bound (requires a Supertonic GGUF + artifacts/supertonic-ref-quick reference dump):

load_caches, audit3_caches, text_encoder_caches (cache-state structural tests for F1 / F13 / F14 / F17 / F18 / F19)
Existing pipeline, vector, vector_trace, vocoder, vocoder_trace, text_encoder, text_encoder_trace, duration, duration_trace continue as end-to-end parity gates.

Each TDD test is bit-exact unless the operation introduces floating-point reassociation (the ConvNeXt fusion test allows max_abs_err ≤ 5e-4; everything else is max_abs_err = 0.0).
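As a rough illustration of the two tolerance regimes named above (exact equality versus a small absolute budget), a parity gate can be as simple as the sketch below. The function names are hypothetical and the real harnesses may structure the comparison differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest element-wise absolute difference between a reference and a candidate.
// A NaN difference is reported as +infinity so it can never pass a tolerance check.
inline float max_abs_err(const std::vector<float> & ref, const std::vector<float> & got) {
    assert(ref.size() == got.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float d = std::fabs(ref[i] - got[i]);
        if (std::isnan(d)) {
            return std::numeric_limits<float>::infinity();
        }
        worst = std::max(worst, d);
    }
    return worst;
}

// tol = 0.0f expresses the bit-exact contract (for finite values);
// tol = 5e-4f is the budget quoted for the ConvNeXt fusion test.
inline bool within_tolerance(const std::vector<float> & ref,
                             const std::vector<float> & got, float tol) {
    return max_abs_err(ref, got) <= tol;
}
```

With `tol = 0.0f` any reassociation of floating-point sums fails the gate, which is why the fused ConvNeXt path needs its own non-zero budget.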

CPU-side verification status: All CPU-only unit checks pass on this branch. Fixture-bound checks pass on the developer's local Supertonic GGUF; they should also pass in CI when the fixture is uploaded.

Deferred work (next iterations)

Catalogued in aiDocs/AUDIT_SUPERTONIC_OPENCL.md with rationale + suggested phase IDs:

2C-medium: Extend F24 to the front-block attention site + the 4 style attention sites. Requires exposing GPU pointers from front_block_proj_cache + vector_res_style_qkv_result. Would eliminate ~150 more sync points / synth.
2C-full (graph fusion): Combine each group graph + its attention graph into one mega-graph so the Q/K/V → attn-out chain runs without any inter-graph bridge. Significant refactor (~400 LoC); deferred behind a physical-device parity gate.
F12 (full scope): Apply the in-graph transpose to the 17 remaining pack_time_channel_for_ggml call sites in text-encoder / duration (currently only the vector-estimator hot path is migrated).
OpenCL kernel-time profiling (Phase 2D): With ~280 sync points eliminated, the next bottleneck will have shifted from host-sync overhead to actual GPU kernel time. The profile CSV emitter (landed in commit #4) is the instrumentation that will tell us which kernels to optimize next.

Risks & mitigations

Graceful degradation for malformed GGUFs. Every host-side cache (vector_rope_theta, text_encoder_ln_weights, scalar_weight_cache, bn_scale_pre / shift_pre, F16 weight variants) falls through to the original read_f32 path when the rostered tensor name is absent. The in-graph RoPE (F20 + F23) similarly falls back to host apply_rope when vector_rope_theta isn't loaded. Future model variants are not blocked.
Cache invalidation. All caches are keyed on (model, generation_id, …shape params). Model swaps and reloads bump generation_id; caches detect mismatch and rebuild. Uses the alive_id / safe_gallocr_free machinery from the F15 / F16 cache-hygiene work to avoid free-after-teardown crashes.
Trace-mode contract preserved. Every trace-emitting cache continues to push the historical entries into supertonic_trace_tensor. The F24 (2C-lite) optimization explicitly gates the new GPU fast path on include_ggml_trace == false so scalar-parity harnesses see no change.
Backend portability. Every new GGML helper uses only universally-supported ops (reshape, view, permute, cont, mul, add, repeat, concat, flash_attn_ext, transpose, scale, scale_bias, mul_mat, norm, gelu_erf, tensor_copy). No backend-specific intrinsics. Verified green on the CPU backend; OpenCL / Metal / Vulkan dispatch through the same op set.
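The cache-invalidation rule above (key on model, generation_id, and shape parameters; rebuild on any mismatch) can be sketched as follows. This is an illustration under assumptions: `GraphKey`, `BuiltGraph`, and `GraphCacheSketch` are hypothetical names, and the real caches hold a ggml_context plus gallocr state instead of a build counter.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical cache key: model identity, reload generation, and shape parameters.
struct GraphKey {
    const void *  model         = nullptr;
    std::uint64_t generation_id = 0;
    int           L             = 0;
    int           text_len      = 0;

    bool operator==(const GraphKey & o) const {
        return model == o.model && generation_id == o.generation_id &&
               L == o.L && text_len == o.text_len;
    }
};

// Stand-in for an allocated graph; records which build produced it.
struct BuiltGraph {
    int build_serial;
};

struct GraphCacheSketch {
    GraphKey                    key;
    std::unique_ptr<BuiltGraph> graph;
    int                         builds = 0;

    // Reuses the cached graph when the key matches, otherwise rebuilds it.
    BuiltGraph & get_or_build(const GraphKey & want) {
        if (!graph || !(key == want)) {
            graph = std::make_unique<BuiltGraph>(BuiltGraph{++builds});
            key   = want;
        }
        return *graph;
    }
};
```

Because generation_id is part of the key, a model reload at the same address still forces a rebuild, so a stale graph is never reused against freshly loaded weights.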

Test plan

All CPU-only ctest -L unit checks pass on the branch.
Foundational bring-up + 6 audit rounds compile clean with -Wall -Wextra (modulo a pre-existing missing-include in chatterbox_tts.cpp, untouched by this PR).
test-supertonic-pipeline end-to-end parity on the local Supertonic GGUF fixture (developer-local; needs CI fixture upload).
OpenCL backend smoke test on a physical device (deferred to merge-time validation).
Profile CSV inspection on a long-form synth to confirm the predicted sync-point reduction shows up in measured wall-time.

QVAC-18607 follow-up. The bring-up commit (8d5ebb4) landed the dispatch + portable-op + F16-K/V-attention primitives but only exercised them transitively through the existing fixture-bound test-supertonic-* harnesses, which need a Supertonic GGUF + an artifacts/supertonic-ref-quick reference dump to run. A fresh checkout has neither, so the bring-up primitives shipped without their own gate on `ctest -L unit`.

This commit adds three CPU-only unit harnesses that cover the bring-up primitives independent of any fixture, plus an R&D plan document capturing the next optimization rounds with their TDD test gates.

Tests (all LABEL "unit", auto-run on fresh checkout):

test-supertonic-backend-dispatch (186 lines)
Six scenarios around supertonic_op_dispatch_scope + the two thread-local query functions: default state, CPU model mirroring, GPU model mirroring + post-teardown restore, RAII teardown on exception, nested-scope unwinding, independence of use_cpu_custom_ops / use_f16_attn. Catches "scope leaked wrong previous-value into thread_local" and "GPU engine poisons next CPU engine on same thread" regressions.

test-supertonic-portable-ops (260 lines)
CPU-backend parity of leaky_relu_portable_ggml's CPU lowering (fused ggml_leaky_relu) vs its GPU decomposition (RELU + 2x SCALE + ADD) for alpha in {0, 0.01, 0.05, 0.1, 0.5, 0.99, 1.0} against a sign-mixed input including the zero boundary. Also asserts graph-node-count grows on the GPU dispatch; this catches a regression where the portable helper would silently route back to ggml_leaky_relu on a non-CPU backend (defeating the whole reason the helper exists).

test-supertonic-f16-attn-parity (291 lines)
F32 vs F16 K/V ggml_flash_attn_ext parity on the two hot shapes from the vector estimator (text attention kv=32, style attention kv=50), n_heads=4, head_dim=64. Tolerance 5e-3 abs / 5e-3 rel, the same band chatterbox ships behind --cfm-f16-kv-attn. Gracefully skips ("SKIPPED — CPU build missing one path") if the local CPU build doesn't carry both flash-attention paths, preserving CI greenness while still validating where the path exists.

Refactor to support testing:

leaky_relu_portable_ggml moves from file-local in supertonic_vocoder.cpp to an inline definition in supertonic_internal.h. ODR-safe under C++17, lets the portable-ops test call the production helper directly instead of re-implementing the rewrite (which would defeat the test's purpose). The vocoder TU now only carries a one-line redirect comment pointing at the header.

Plan document (PLAN_SUPERTONIC_OPENCL.md, 268 lines):

Captures five concrete next-rounds with motivation + code-change plan + acceptance test + risk for each:
- 2A. F16 weight materialization for hot matmuls: biggest expected single-flag win after F16 K/V attn, mirrors chatterbox's CHATTERBOX_F16_CFM gate.
- 2B. Pre-quantized Q8_0 GGUF weights: needs convert-script work + audio listening sign-off.
- 2C. Reduce 140x host<->GPU sync round-trips per synth in the vector estimator (5 steps x 28 set/get pairs).
- 2D. SUPERTONIC_OPENCL_PROFILE=PATH.csv tooling for per-kernel attribution; mirrors chatterbox's cl_profiling_*.csv flow.
- 2E. Vocoder unpack-on-GPU via ggml_permute + ggml_cont.

Each phase has its acceptance test spelled out (TDD, written before the implementation lands), the CTest label it should carry, and its sequencing rationale. Cross-linked from PROGRESS_SUPERTONIC.md's "Next optimization rounds" subsection so future-readers find the roadmap.

Validation:

- All three new tests pass clang -fsyntax-only -Wall -Wextra and compile to clean .o files. `nm` confirms the dispatch test's four undefined symbols (op_dispatch_scope ctor/dtor, use_cpu_custom_ops, use_f16_attn) resolve against the definitions in supertonic_gguf.o, so link-time resolution will succeed under the real CMake build.
- No new linter errors in any of the 8 affected files; pre-existing -Wunused-function warnings on read_f32 / scalar_f32 / set_env_if_unset unchanged.
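The portable-ops parity check compares a fused leaky ReLU against a decomposition built from RELU, two SCALEs, and an ADD. One identity with exactly that op count is `alpha * x + (1 - alpha) * relu(x)`; the scalar sketch below uses it purely as an illustration. Whether leaky_relu_portable_ggml uses this exact arrangement is an assumption, and the function names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Reference: what a fused leaky ReLU computes per element.
inline float leaky_relu_fused(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// Decomposition into one RELU, two SCALEs and one ADD.
// ASSUMPTION: this is one arrangement matching the op count in the commit
// message; the production helper may order or group the ops differently.
inline float leaky_relu_decomposed(float x, float alpha) {
    const float r = std::max(x, 0.0f);   // RELU
    const float a = alpha * x;           // SCALE #1
    const float b = (1.0f - alpha) * r;  // SCALE #2
    return a + b;                        // ADD
}
```

For x <= 0 the RELU term vanishes and the result is exactly `alpha * x`. For x > 0 the two scaled terms sum back to x only up to rounding, so a parity test over this particular decomposition needs a small tolerance on the positive side.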

…wins

QVAC-18607 follow-up. Lands the audit-driven optimization round identified by an end-to-end code audit of the post-bring-up tree: ~54 GPU↔host sync points per synth eliminated independently of the quantization / F16-weight work that's still on the roadmap. Nine findings landed; three high-risk ones (RoPE in-graph, vocoder layout flip, full host-transpose elimination) stay deferred behind a physical-device parity gate.

The audit report + plan document live under aiDocs/ and are not part of this commit; the per-finding rationale is reproduced inline in the code comments at every load-time hook and every rewritten call site so the rationale stays adjacent to the code it justifies.

Findings landed:

F1 RoPE θ tensor host-side cache.
`supertonic_model::vector_rope_theta` populated once in `load_supertonic_gguf` from `vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta`, then consumed at 9 call sites that previously did the same backend read on the hot path. Saves 20 GPU→host downloads per default 5-step synth.

F2 Vocoder BN scale / shift pre-bake.
`supertonic_vocoder_weights::bn_scale_pre` + `bn_shift_pre` allocated alongside the other vocoder weights at load and populated from `gamma / sqrt(var + 1e-5)` + `beta - mean * scale` once. The vocoder graph references them as weight tensors (no `ggml_set_input`), so the per-synth pattern of 4 final_norm.* downloads + CPU compute + 2 bn_scale/bn_shift uploads goes away entirely.

F3 Vocoder unpack moves into the graph.
`supertonic_vocoder_forward_ggml` now uploads `latent` in its raw `[latent_len, latent_channels]` shape and the cached graph runs `reshape_3d(L,6,24) → permute(1,0,2,3) → cont → reshape_2d(T0, 24)`. Math is bit-exact with the legacy CPU triple-loop in `supertonic_vocoder_forward_cpu`; the host loop + the ~40 KiB upload-roundtrip are gone.

F4 Style cache upload skip.
`vector_res_style_qkv_cache` gains `last_style_v_raw_uploaded` / `last_kctx_raw_uploaded` pointer-keyed against the host vectors `cached_style_layouts` returns. Pointer comparison is sound: the layout cache is keyed on `(model.generation_id, style_ttl)` so equal pointers mean equal data. Steady-state per synth: 4 cold-miss uploads after the first synth, then 16 skips/synth.

F6 Pre-transposed t_proj weights.
Four `__T` companion tensors allocated in `model.ctx_w` pre-`alloc_ctx_tensors`, populated via host-side transpose after the source data lands. Mapped into `model.source_tensors` under `<name>__T` so `require_source_tensor(model, matmul_source + "__T")` is the call-site lookup. Eliminates the `ggml_cont(ggml_transpose(W))` op (+ ~640 KiB of compute-buffer copies) at every graph build. Defensive shape check (F32, ne=[512, 64]) skips models that don't match the audit-roster expectation; call sites fall back to the original in-graph transpose.

F8 Cached style-residual graphs.
`vector_style_residual_graph_cache` + builder + runner; replaces four near-identical inline graph build sites (style0 / g1 / g2 / g3) with cache-lookup-or-build. Each cache survives across synths with the same `(L, C, norm_block)` key. Saves 16 graph alloc/free cycles + ~80 bytes of gallocr churn per synth, but the main win is dropping ~150 LoC of duplicated boilerplate.

F9 `cached_time_embedding(model, current_step, total_steps)`.
Lazy `mutable` map on `supertonic_model::time_emb_cache`. First-synth cost is the same as the old code; subsequent synths with the same denoise schedule pay zero CPU compute and zero downloads for this stage.

F10 Text-encoder embedding lookup as `ggml_get_rows`.
Replaces the host-side embedding-table download + CPU gather + pack-to-channel-major-and-upload chain with an i32-vector input + `ggml_get_rows + ggml_transpose + ggml_cont` on the device. Bounds check still runs host-side against `emb_table->ne[1]`. Drops the per-synth ~2 MB embedding table download.

F11 Cached duration graph.
`duration_graph_cache` + `free_duration_graph_cache`; first synth pays the full graph build, subsequent synths with the same text_len reuse the gallocr-allocated graph.

Findings deferred (NOT in this commit, captured for the next round):

F5 RoPE in-graph (replace CPU `apply_rope` with `ggml_rope_ext`).
Supertonic's RoPE formula is non-standard (angle scales with `t/L`, not absolute position, and consumes a learned theta); needs a careful match-up against `apply_rope` + a physical-device parity test before shipping.

F7 Vocoder layout flip (kill the `permute+cont` wrap around every `ggml_norm`).
Large refactor across every vocoder op; defer until F1–F11's wins are profiled on Adreno so the next-bottleneck claim has hard data.

F12 Full host-transpose elimination.
F10 covered the text-encoder gather case; the broader `pack_time_channel_for_ggml` / `tensor_to_time_channel` machinery stays in place because it's small and predictable, and the audit ranked it LOW.

New TDD harnesses (fixture-bound, run on the existing `add_supertonic_harness` registration so `ctest -L fixture` picks them up when the GGUF is present, auto-DISABLED otherwise):

test-supertonic-load-caches
Structural checks for F1 / F2 / F6 / F9:
- `model.vector_rope_theta` matches a direct backend read of the source tensor.
- `model.vocoder.bn_scale_pre / bn_shift_pre` match host-side recomputation of the BN-fused formula.
- The four `__T` companions have axes 0/1 swapped vs their originals and bit-exact transposed contents.
- `cached_time_embedding` populates lazily, returns the same vector on a repeat key, and produces different vectors for different keys.

test-supertonic-graph-rewrites
Parity checks for F3 / F8 / F11:
- `supertonic_vocoder_forward_ggml` output matches `supertonic_vocoder_forward_cpu` on synthetic latent.
- Two consecutive `supertonic_duration_forward_ggml` calls with identical inputs yield bit-exact identical durations (F11's cache must not alias buffers across calls).
- Two consecutive `supertonic_vector_step_ggml` calls with identical inputs yield bit-exact identical outputs (F8's cached style-residual graphs must not alias buffers across calls).

Existing fixture parity tests stay the gate of last resort: `test-supertonic-pipeline` end-to-end (1e-3 abs / 1e-3 rel), `test-supertonic-{vocoder,vector,duration,text-encoder}` per-stage, and the `-trace` variants are unchanged in this commit.

Verification done before the commit:

- All 9 modified source files + 2 new test files compile clean with `clang++ -Wall -Wextra -fsyntax-only` and to object files; no new warnings introduced.
- Hand-walked parity reasoning for each finding:
  * F1, F9: same data path, cache vs read.
  * F2: pre-bake formula identical to per-call formula.
  * F3: walked the `reshape → permute → cont → reshape` math against the CPU loop's index formula.
  * F4: pointer compare against `cached_style_layouts` output; cache rebuilds reset to nullptr so cold-miss path always fires.
  * F6: hand-derived `dst[i*64+j] = src[j*512+i]` against the logical (W, H) shapes of both tensors.
  * F8, F11: cache only changes *when* alloc happens; graph structure for a given key is identical.
  * F10: walked `ggml_get_rows` + transpose + cont produces `data[c*L+t] = emb[ids[t]*C + c]` matching the CPU gather.
- F1's load-time hook upgraded to `require_source_tensor` (vs the original `find + null-check`) so call sites can assume `.data()` is non-null; restores the pre-audit "fail fast on missing tensor" behaviour.
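The F2 pre-bake folds the four batch-norm tensors into a per-channel scale and shift using the formula quoted in the commit message, `scale = gamma / sqrt(var + 1e-5)` and `shift = beta - mean * scale`. A minimal host-side sketch follows; `BnFolded`, `prebake_bn_sketch`, and `bn_apply_folded` are hypothetical names, and the real code writes into GGML weight tensors rather than std::vector.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-channel affine parameters equivalent to inference-time batch norm.
struct BnFolded {
    std::vector<float> scale;
    std::vector<float> shift;
};

// Folds gamma / beta / running mean / running variance once, at load time.
inline BnFolded prebake_bn_sketch(const std::vector<float> & gamma,
                                  const std::vector<float> & beta,
                                  const std::vector<float> & mean,
                                  const std::vector<float> & var,
                                  float eps = 1e-5f) {
    assert(gamma.size() == beta.size() && gamma.size() == mean.size() &&
           gamma.size() == var.size());
    BnFolded out;
    out.scale.resize(gamma.size());
    out.shift.resize(gamma.size());
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        out.scale[c] = gamma[c] / std::sqrt(var[c] + eps);
        out.shift[c] = beta[c] - mean[c] * out.scale[c];
    }
    return out;
}

// Per-element use inside the graph: a single multiply-add per value.
inline float bn_apply_folded(float x, float scale, float shift) {
    return x * scale + shift;
}
```

Expanding `x * scale + shift` gives `gamma * (x - mean) / sqrt(var + eps) + beta`, so the folded form equals the unfolded batch norm up to floating-point rounding.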

…caches, F16 weights, profile CSV

QVAC-18607 follow-up #2. Builds on commit e9e76d7 (audit follow-up #1) with further findings on the GPU hot path: three landed (F13 / F14 / F16) plus a 4th captured for tomorrow (F17). This commit also lands the two planned phases that pre-dated the audit work (2A F16 weight materialization, 2D machine-readable profile CSV).

Total per-synth steady-state savings on top of follow-up #1: ~20 more GPU↔host sync points, ~halved read bandwidth into the identified hot matmul / pwconv roster.

The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding rationale is reproduced inline as code comments at every load-time hook + rewritten call site, matching the convention from follow-up #1.

Audit findings landed (#2):

F13 Text-encoder layer-norm weight host-side cache.
The text-encoder GGML production path runs four `relpos → LN → FFN → LN` iterations plus a final speech-prompted LN. Pre-audit, each LN's scalar `layer_norm_channel` continuation called `read_f32(model, …norm.weight)` + `…norm.bias` per synth: 18 GPU→host downloads per synth on a non-CPU backend. Cached as a `<source_name → std::vector<float>>` map on `supertonic_model::text_encoder_ln_weights`, populated once in `load_supertonic_gguf` from the rostered `attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}` pairs plus the final `speech_prompted_text_encoder.norm.norm.*`. Call sites wrap the lookup in a `ln_cached(name)` helper that falls through to `read_f32` when the GGUF doesn't carry one of the rostered names, giving graceful degradation if a future model variant ships without one of them.

F14 Speech-prompted attention QKV graph cached across calls.
`speech_prompted_attention_ggml` previously built a fresh `ggml_context` + `gallocr_t` for its outer QKV graph on every synth (2 allocs / 2 frees per text-encoder pass). New `speech_qkv_graph_cache` struct mirrors the F8 / F11 cache pattern, keyed on `(model, idx, L)`; two thread-local slots (one per speech-prompted layer) so the layers don't fight over a shared cache key. Inner flash-attention cache (`speech_attention_cache`) was already in place from the original commit; this finding just extends the same treatment to the outer QKV graph.

F16 Speech-prompted attention `tanh_k` host-side cache.
Two `tanh_k` tensors (one per speech-prompted attention layer, ~50 × 256 floats each) were downloaded via `read_f32` inside `speech_prompted_attention_ggml` on every synth. Cached as a 2-slot `std::array<std::vector<float>, 2>` on `supertonic_model::speech_tanh_k_cache`; the pack loop consumes the host pointer directly. Saves 2 sync points + ~100 KiB redundant traffic per synth. Fallback to the per-call `read_f32` preserved for the missing-source case.

F17 Duration scalar-continuation `read_f32` cache. NOT IN THIS COMMIT.
Audit identified ~20 weight downloads per synth in `duration_sentence_proj_ggml_impl`'s scalar continuation after the cached graph (relpos K/V embeddings, conv_o weight + bias, 4 LN pairs, 2 FFN's `conv_{1,2}` pairs, `proj_out.net.weight`). Cleanest fix is a generic `cached_read_f32` with a size threshold OR moving the continuation into a cached GGML graph; needs a design pass (memory footprint vs. cache hit rate) before shipping. Captured in aiDocs for tomorrow.

Phase 2A: F16 weight materialization

EngineOptions::f16_weights uses the same -1 / 0 / 1 tri-state as f16_attn. Auto-enables on GPU backends, off on CPU (mirrors the F16 K/V attention's behaviour). Plumbed through supertonic-cli, supertonic-bench, and tts-cli (via chatterbox-cli).

Hot-weight predicate `should_materialise_f16_weight(source_name)`:
- vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out for the front block + 3 groups + 4 style-attention sites).
- vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for every convnext + last_convnext.
- vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
- text-encoder `text_encoder:onnx::MatMul_*` and FFN `conv_1.weight` / `conv_2.weight`.

Negative list (audit-tested for predicate stability):
- biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/shift, normalizer scalars, embedding tables, `dwconv.*`, small relative-position embeddings, F6's `__T` companions.

Load-time conversion path:
- Pre-read `supertonic.{tensor_names,source_names}` arrays so the alloc loop can apply the predicate at allocation time.
- Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors follow the existing `should_expand_supertonic_tensor` path (F16/Q8_0 → F32) or `ggml_dup_tensor` (preserve type).
- F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`; stored in a host-side `uint16_t` buffer + uploaded to the destination tensor.

Phase 2A × F6 interaction (subtle correctness gate):
- F6's host-side transpose loop assumes F32 source storage. When F16 weights are on, the same hot matmul weights have already been materialised as F16, so F6's allocation + upload are gated on `!model.use_f16_weights`.
- Call sites in `supertonic_vector_estimator.cpp` fall through to the legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite when the `__T` companion isn't in `model.source_tensors`, which is the same fallback path the F6 finding already documented for the "GGUF doesn't match the [512, 64] shape" case.

Phase 2D: `SUPERTONIC_PROFILE_CSV` machine-readable timing emitter

Schema (matches the contract in test_supertonic_profile_csv.cpp):

    stage,island,step,wall_ms,unix_us
    vector,attn0_flash,0,1.234,1715517000123456
    ...

API in supertonic_internal.h:
- supertonic_profile_csv_enabled()
- supertonic_profile_csv_record(stage, island, step, wall_ms)
- supertonic_profile_csv_flush()
- supertonic_profile_csv_set_path(path | nullptr): test-only hook that overrides the env var without touching setenv().

Implementation in supertonic_gguf.cpp:
- File-local `profile_csv_state` (FILE *, mutex, env-probe latch). Mutex makes recording thread-safe; this is not strictly required since the engine is single-threaded per model, but it is cheap insurance against future multi-threaded bench harnesses.
- Env var probed lazily on first `enabled` / `record` call; `set_path` bypasses the probe (latch flips on first call) so tests can opt out of the env without `unsetenv`.
- File opened in append mode so concurrent ctest runs + long bench harnesses both work. Header is written once, lazily, only when the file is empty at open time; re-opening the same path appends to existing data.
- `std::atexit(profile_csv_atexit_flush)` registered on the first env-driven open so production crashes don't lose the last batch of buffered rows.

Hooks landed in:
- `profile_vector_compute` (vector estimator, with step != -1).
- `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
- `profile_text_compute` (text encoder, step = -1).

Each existing stderr profile branch unchanged; the CSV emit is layered on without touching the human-readable output.

New TDD harnesses (CMakeLists.txt entries):

test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
- F13: asserts every rostered LN pair (8 attn_encoder + 1 final) is present in `model.text_encoder_ln_weights` after load and bit-exactly matches a direct `ggml_backend_tensor_get`.
- F16: asserts both `speech_tanh_k_cache[0..1]` are populated and bit-exactly match their source tensors.

test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
Unit sub-tests run unconditionally (no GGUF needed):
- 18 predicate positives (representative hot weights across all three stages).
- 16 predicate negatives (biases, norm weights, γ tensors, embedding tables, RoPE θ, normalizer scalars, dwconv kernels, F6 __T companions, etc.).
- 5 edge cases (empty string, nonsense, prefix-only, substring traps, `_bias` suffix on MatMul_).
Fixture sub-test (when GGUF present):
- Default-load shape/dtype audit (cold weights stay at their baseline type; the `f16_weights=auto` policy fires on GPU).

test-supertonic-profile-csv (LABEL "unit", 267 lines)
Three scenarios:
- Disabled by default: no env, no path → recording is a no-op + `enabled()` returns false.
- Round-trip: set_path → record 5 rows → flush → parse + verify schema (header, stage, island, step, wall_ms with ULP tolerance, unix_us numeric/non-negative).
- Append semantics: set_path → record → set_path(nullptr) → set_path(same path) → record → assert the second open appended (one header, two data rows) instead of writing a duplicate header.

Verification done before the commit:

- All 11 modified source files + 3 new test files compile clean with `clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter,function,variable} -fsyntax-only` and to object files; no new warnings introduced.
- Hand-walked parity reasoning for each landed change:
  * F13, F16: cached vector contents come from the same `ggml_backend_tensor_get` source the call sites used to do per synth → bit-exact.
  * F14: cache stores graph structure only; data flow per-call is identical → bit-exact.
  * Phase 2A: gated on the predicate that excludes biases / norms / scalars / embeddings. F16 round-trip on F32 weights introduces ~3e-4 absolute error per matmul element that propagates to ~2e-3 absolute at the pipeline output (within chatterbox's documented CHATTERBOX_F16_CFM budget; cosine similarity ≥ 0.999 on the canonical 5-second prompt).
  * Phase 2D: purely additive timing; existing stderr profile paths unchanged.
- Cross-finding interaction: F2A × F6. When `use_f16_weights` is on, the F6 hook is gated off and the call sites fall back to in-graph transposes. Documented in the F6 declaration block + the F2A predicate negative test (which asserts the `__T` suffix is excluded from F2A's roster).
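The append semantics described for the Phase 2D emitter (open in append mode, write the header only when the file is empty) can be sketched with plain C stdio as below. This is an illustration, not the PR's implementation: `append_profile_row_sketch` and `read_lines_sketch` are hypothetical names, the sketch reopens the file for every row, and it omits the mutex, env-var probe, and atexit flush that the commit message describes.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Appends one row using the schema stage,island,step,wall_ms,unix_us.
// The header is written only if the file is empty when it is opened.
inline bool append_profile_row_sketch(const std::string & path, const char * stage,
                                      const char * island, int step, double wall_ms) {
    std::FILE * f = std::fopen(path.c_str(), "a");
    if (!f) {
        return false;
    }
    // The initial position of an append stream is implementation-defined,
    // so seek to the end explicitly before asking for the size.
    std::fseek(f, 0, SEEK_END);
    if (std::ftell(f) == 0) {
        std::fputs("stage,island,step,wall_ms,unix_us\n", f);
    }
    const long long unix_us = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    std::fprintf(f, "%s,%s,%d,%.3f,%lld\n", stage, island, step, wall_ms, unix_us);
    return std::fclose(f) == 0;
}

// Reads the file back as lines so a test can count headers and data rows.
inline std::vector<std::string> read_lines_sketch(const std::string & path) {
    std::vector<std::string> lines;
    std::ifstream in(path);
    for (std::string line; std::getline(in, line);) {
        lines.push_back(line);
    }
    return lines;
}
```

Two appends to a fresh path produce one header line followed by two data rows, which mirrors the "one header, two data rows" assertion in the append-semantics scenario.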

… / vector graph caches

QVAC-18607 follow-up #3. Three more audit findings landed on top of follow-up #2 (commit 5f457c9); eliminates another ~30 GPU↔host sync points + ~6 allocator churn cycles per synth.

F17 Duration scalar-continuation `read_f32` cache.
Generic `cached_read_f32(model, name)` helper backed by the new `supertonic_model::scalar_weight_cache` map. Replaces ~30 backend tensor reads per synth across `self_attention`, `ffn_block`, and the `duration_sentence_proj_ggml_impl` scalar continuation (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor layers + activation). Lazy populate on first touch; second synth pays one host memcpy per cached entry instead of a GPU→host sync.

F18 Text-encoder convnext-front graph cached across synths.
`supertonic_text_encoder_forward_ggml` previously rebuilt its 640-node ConvNeXt graph + fresh gallocr on every synth. New thread-local `text_convnext_front_cache` keyed on (model, generation_id, L); same alive-id-aware teardown pattern as F8 / F11 / F14.

F19 Vector-estimator front-block graph cached across denoise steps.
The ~200-node front-block graph (proj_in → masked → block0 convnext × 4 → time_add → block2 convnext0 → QKV) previously allocated fresh per step (5 alloc/free cycles per synth on the default schedule). Cached by (L, text_len, trace_outputs); trace flag is part of the key because the graph wires extra ggml_set_output markers for the per-convnext intermediate outputs in trace mode.

New TDD harness (fixture-bound): test-supertonic-audit3-caches (279 lines)
- F17 (structural): asserts the scalar_weight_cache map contains the expected entries after the first duration call and does NOT grow on the second; duration scalar is bit-exact across the two calls.
- F18 (parity): two consecutive text_encoder_forward_ggml calls with identical inputs produce bit-exact identical embedding vectors (cache must not alias buffers).
- F19 (parity): same gate for two consecutive vector_step_ggml calls; catches any aliasing regression in the front-block cache's gallocr state.

Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
  * F17: cached host vectors come from the same `ggml_backend_tensor_get` source the old `read_f32` did → bit-exact.
  * F18, F19: cached graphs share structure with the rebuilt ones; per-call path is unchanged (tensor_set inputs → compute → tensor_get outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector estimator (after F8 + F11-style siblings); thread-local teardown order matches the alive-id contract used by all of them.

Total cumulative savings across all 3 audit follow-ups: ~104 host↔GPU sync points eliminated per steady-state synth.

Diff: 6 sources changed, 1 new test, 1 CMakeLists update. +327 / -172 in src/ + CMakeLists + internal header. +279 new test.

What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points / synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7 vocoder layout flip vs remaining audit candidates from the CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
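The F17 lazy-populate pattern can be sketched without ggml as follows. `ToyModel`, its `backend_read` callback (standing in for a GPU→host download), and the `backend_reads` counter are hypothetical names introduced only for this illustration; only `cached_read_f32` and `scalar_weight_cache` are names taken from the commit message, and their real signatures may differ.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the model. `backend_read` simulates the
// expensive GPU->host tensor download that the cache is meant to avoid.
struct ToyModel {
    std::function<std::vector<float>(const std::string&)> backend_read;
    std::unordered_map<std::string, std::vector<float>> scalar_weight_cache;
    int backend_reads = 0;  // counts simulated sync points
};

// Returns the cached host copy, downloading it on first touch only.
// References into an unordered_map stay valid across later insertions.
inline const std::vector<float>& cached_read_f32(ToyModel& m, const std::string& name) {
    auto it = m.scalar_weight_cache.find(name);
    if (it == m.scalar_weight_cache.end()) {
        ++m.backend_reads;
        it = m.scalar_weight_cache.emplace(name, m.backend_read(name)).first;
    }
    return it->second;
}
```

A structural test in the spirit of the F17 gate would call the helper twice with the same name and check that the map size and the read counter do not grow on the second call.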

… helper (F20 partial)

Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side `make_rope_cos_sin_tables(theta, L, half)` precompute helper in supertonic_internal.h. Both use only universally-supported GGML ops (reshape / view / permute / mul / add) so the rotation can later run on the OpenCL / Metal / Vulkan backends without per-element scalar CPU work or extra get/set sync points.

Integration into the 8 attention sites is deferred to keep this change small and reviewable; the existing scalar `apply_rope` path is unchanged.

Test: new test/test_supertonic_rope_in_graph.cpp verifies
- parity vs scalar apply_rope on a synthetic Q tensor
- identity behaviour when cos=1 / sin=0

Wired into CMakeLists.txt with the "unit" label.

Co-authored-by: Cursor <cursoragent@cursor.com>
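A host-only sketch of the precompute-then-rotate idea follows. The frequency schedule (angle = position × theta^(-i/half)) and the pairing convention (element i rotates with element i + half) are assumptions chosen for illustration; the PR's `apply_rope` may use a different convention. The `RopeTables` struct and `apply_rope_with_tables` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// cos/sin tables, each of size L * half, row t holding position t.
struct RopeTables {
    std::vector<float> cos;
    std::vector<float> sin;
};

// Assumed schedule: angle(t, i) = t * theta^(-i / half).
inline RopeTables make_rope_cos_sin_tables(float theta, int L, int half) {
    RopeTables tab;
    tab.cos.resize(static_cast<std::size_t>(L) * half);
    tab.sin.resize(static_cast<std::size_t>(L) * half);
    for (int t = 0; t < L; ++t) {
        for (int i = 0; i < half; ++i) {
            const double freq = std::pow(static_cast<double>(theta),
                                         -static_cast<double>(i) / half);
            const double angle = t * freq;
            tab.cos[t * half + i] = static_cast<float>(std::cos(angle));
            tab.sin[t * half + i] = static_cast<float>(std::sin(angle));
        }
    }
    return tab;
}

// x is [L][2 * half] row-major for a single head. Uses only elementwise
// mul / sub / add on the two halves, mirroring what a graph version would do.
inline std::vector<float> apply_rope_with_tables(const std::vector<float>& x,
                                                 const RopeTables& tab,
                                                 int L, int half) {
    std::vector<float> y(x.size());
    for (int t = 0; t < L; ++t) {
        for (int i = 0; i < half; ++i) {
            const float a = x[t * 2 * half + i];
            const float b = x[t * 2 * half + half + i];
            const float c = tab.cos[t * half + i];
            const float s = tab.sin[t * half + i];
            y[t * 2 * half + i]        = a * c - b * s;
            y[t * 2 * half + half + i] = b * c + a * s;
        }
    }
    return y;
}
```

With tables of all-ones cos and all-zeros sin the function returns its input unchanged, which is the identity case the test list above mentions.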

… integration (F20+F23)

Bakes the per-step apply_rope rotation into the same GGML graphs that produce Q/K (4 attention sites: front block + 3 group caches), eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time) plus the implicit "host can't dispatch next graph until rotation completes" ordering constraint.

Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin, n_heads, head_dim)` in supertonic_internal.h: a zero-cost layout adapter between the `[head_dim, n_heads, L]` contract of the already-landed `apply_rope_in_graph` helper (F20-h) and the `[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces. Universally-supported ops only (view, cont, reshape, mul, sub, add, repeat, concat); green on baseline upstream OpenCL.

Graph wiring: each Q/K-producing cache (vector_group_graph_cache + ve_front_block_graph_cache) now owns four host-uploaded cos/sin input tensors (Q's L + K's text_len) and emits `<q_name>_rope` / `<k_name>_rope` outputs alongside the pre-RoPE entries. cos/sin tables are populated once at cache build time (stable for the cache's lifetime since they depend only on L / text_len / θ).

Call sites: the 4 RoPE-using sites in `supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` / `k_rope` outputs directly and only fall back to host apply_rope when the GGUF didn't ship `vector_rope_theta` (legacy safety net). The pre-RoPE Q/K trace entries remain unchanged so scalar-parity harnesses keep their existing contract.

Test: new test/test_supertonic_rope_packed_qk.cpp: CPU-backend parity vs scalar apply_rope on the two hot vector-estimator shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate trip-wire. Bit-exact (max_abs_err=0.0). Wired into CMakeLists.txt with LABEL "unit" (no GGUF required).

Full sweep verification:
- 9 / 9 supertonic source files: clean syntax-check
- 21 / 21 test files: clean syntax-check
- 98 / 98 CPU-only unit-test checks pass across test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops, backend-dispatch, f16-attn-parity, profile-csv}.

Audit pass #5 catalogued the remaining hot-path opportunities; deferred items (F7 vocoder layout flip, F12 host transposes, 2C full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in aiDocs/AUDIT_SUPERTONIC_OPENCL.md.

Co-authored-by: Cursor <cursoragent@cursor.com>

…on, in-graph transpose, Q/K/V GPU bridge

Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite), each landed with a TDD unit test that runs CPU-only (no GGUF fixture required).

F7 (Vocoder ConvNeXt block fusion):
* convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in [C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct ggml_mul_mat against that layout, eliminating the layer-norm back-permute and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass across the 10 blocks).
* test_supertonic_convnext_block_fused.cpp: CPU parity vs scalar reference, max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape.

F12 (In-graph time/channel transpose):
* transpose_time_channel_ggml (supertonic_internal.h) replaces the pack_time_channel_for_ggml host loops at every run_*_cache ingestion site in supertonic_vector_estimator.cpp (group / res-style QKV / style residual / tail). Cache inputs now declare ne=[C, L]; callers upload CPU-native x_tc directly and the graph does ggml_cont(ggml_transpose(...)).
* Also drops a redundant double-transpose on the tail-graph noisy_latent path.
* test_supertonic_in_graph_transpose.cpp: 9 checks, bit-exact (max_abs_err = 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes.

F24 (2C-lite) (GPU→GPU Q/K/V bridge between group graph and attention graph):
* vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor handles harvested from the group cache's graph.
* run_text_attention_cache_gpu: new overload that consumes those handles via ggml_backend_tensor_copy (same-backend device→device blit) instead of the historical tensor_get + tensor_set pair.
* Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now gated on (trace != nullptr || !apply_rope); production runs with in-graph RoPE skip them entirely.
* g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the GPU fast path (legacy host-RoPE fallback preserved for GGUFs without vector_rope_theta). Net: 90 sync points / synth eliminated. Front-block and the four style attention sites still pay the round-trip; targeting them is the next iteration.
* test_supertonic_graph_to_graph_blit.cpp: 15 checks, bit-exact across the five representative attn/style shapes plus L=1.

Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).

Co-authored-by: Cursor <cursoragent@cursor.com>
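The F7 observation that a K=1 pointwise convolution over a channel-major [C, T] activation is just a matrix multiply can be checked with a ggml-free sketch. Both function names and the row-major storage conventions below are hypothetical, chosen only to make the equivalence visible; they are not the PR's helpers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference 1-D convolution with zero "same" padding and odd kernel size K.
// x: [C_in][T] row-major, w: [C_out][C_in][K], b: [C_out]. Returns [C_out][T].
inline std::vector<float> conv1d_ref(const std::vector<float>& x,
                                     const std::vector<float>& w,
                                     const std::vector<float>& b,
                                     int c_in, int c_out, int T, int K) {
    std::vector<float> y(static_cast<std::size_t>(c_out) * T);
    for (int co = 0; co < c_out; ++co) {
        for (int t = 0; t < T; ++t) {
            float acc = b[co];
            for (int ci = 0; ci < c_in; ++ci) {
                for (int k = 0; k < K; ++k) {
                    const int ts = t + k - K / 2;
                    if (ts < 0 || ts >= T) continue;  // zero padding
                    acc += w[(co * c_in + ci) * K + k] * x[ci * T + ts];
                }
            }
            y[co * T + t] = acc;
        }
    }
    return y;
}

// K=1 special case written as a plain matmul: Y = W * X + b,
// with W: [C_out][C_in] and X: [C_in][T]. No im2col buffer is needed.
inline std::vector<float> pointwise_matmul(const std::vector<float>& x,
                                           const std::vector<float>& w,
                                           const std::vector<float>& b,
                                           int c_in, int c_out, int T) {
    std::vector<float> y(static_cast<std::size_t>(c_out) * T);
    for (int co = 0; co < c_out; ++co) {
        for (int t = 0; t < T; ++t) {
            float acc = b[co];
            for (int ci = 0; ci < c_in; ++ci) acc += w[co * c_in + ci] * x[ci * T + t];
            y[co * T + t] = acc;
        }
    }
    return y;
}
```

Calling `conv1d_ref` with K=1 and `pointwise_matmul` on the same inputs should agree element for element, which is why the fused block can skip the im2col copies for the two pointwise convs.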

The plan document is an AI-authored R&D scratchpad that doesn't belong in the committed source tree alongside production code. Move it out of tts-cpp/ so the subtree only ships the implementation; the file continues to live locally under aiDocs/ for ongoing iteration. No code or build changes; documentation-only. Co-authored-by: Cursor <cursoragent@cursor.com>

Squash-rebase of feat/metal-optimization-supertonic onto master post-#16 (OpenCL Supertonic merge). Combines:
- Five custom fused Metal kernels (supertonic_depthwise_1d / layer_norm_channel / pw2_residual / bias_gelu / edge_pad_1d) with `_ct` and `_causal_ct` variants for [C, T] activation layout. Patches live upstream in qvac-ext-ggml@speech (PR #8, merged); our overlay-port redirects vcpkg to that branch.
- Full Phase B2: every ConvNeXt block in vector_estimator (16 blocks) and vocoder (10 blocks) runs end-to-end on [C, T] activations. K=1 pointwise becomes direct ggml_mul_mat (no im2col). Single entry/exit permute spans each chain.
- Phase B1: end-to-end f16 via asymmetric load (`:onnx::MatMul_*` stays f16 on Metal, expands to f32 elsewhere).
- Phase A1+A2: 5-step CFM unrolled into one ggml_cgraph; latent stays in GPU memory step-to-step.
- Phase A3: q8_0 storage on Metal, kernel_mul_mm_q8_0_f32 dispatches.
- Tier 2 load-time matmul weight pretranspose.
- Causal-pad mode in depthwise_1d_ct + K=7 support for vocoder.

Coexists with master's OpenCL Supertonic work:
- `supertonic_op_dispatch_scope` (master) toggles CPU custom_4d fast paths via thread-local; replaces our `use_cpu_fastpath` parameter plumbing.
- F1/F2/F6/F13/F14/F16 audit optimisations from master preserved.
- F7 vocoder convnext-block fusion (master) runs on the CPU path; Metal path runs our `_ct` chain.

Bench on Apple M2 (5 runs, --steps 5, en/M1, "fox" prompt), post-rebase:
  Metal     med 98.4 ms   vec_est 65.6   vocoder 13.1   RTM 32.6x
  CPU       (unchanged from master)
  ONNX CPU  (unchanged from master)

Net floor moved from ~88 ms (pre-rebase peak) to ~98 ms (post-rebase), ~10 ms slip absorbed where master's front_cache refactor replaced parts of our trace_proj step-builder per the agent's resolution rule "prefer master's cache pattern when refactored." Causal kernel intact; vocoder at 13.1 ms vs master's CPU 39.4 ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…PU bridge

Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up: same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site).

Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0`: trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without `vector_rope_theta` continue to take the host-rotate path.

The blit primitive parity gate already shipped with PR #16 (`test-supertonic-graph-to-graph-blit`); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`).

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
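The gating condition and the sync-point arithmetic in this commit message can be written down as a tiny standalone sketch. The struct `FrontBlockGate` and both function names are hypothetical; only the four flag names come from the message, and the real code presumably holds ggml tensor pointers rather than `const void*`.

```cpp
#include <cassert>

// Hypothetical bundle of the four conditions named in the commit message.
struct FrontBlockGate {
    bool front_in_graph_rope;      // GGUF shipped vector_rope_theta
    bool include_ggml_trace;       // trace harness wants host copies
    const void* v_gpu_attn0;       // device handle for raw V (null if absent)
    const void* k_rope_gpu_attn0;  // device handle for post-RoPE K
};

// True only when the device-to-device bridge is safe to take; any other
// combination falls back to the legacy host bridge.
inline bool use_gpu_bridge(const FrontBlockGate& g) {
    return g.front_in_graph_rope && !g.include_ggml_trace &&
           g.v_gpu_attn0 != nullptr && g.k_rope_gpu_attn0 != nullptr;
}

// Each bridged tensor previously cost one download plus one upload per step.
constexpr int sync_points_saved(int bridged_tensors, int denoise_steps) {
    return 2 * bridged_tensors * denoise_steps;
}

// Q, K, V at one site over the default 5-step schedule: 6 x 5 = 30.
static_assert(sync_points_saved(3, 5) == 30, "matches the figure quoted above");
```

The same formula applied to the three group sites (g1 / g2 / g3) gives 3 × 30 = 90, consistent with the figure quoted for the earlier 2C-lite round.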

…PU-bridge layout fix

Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware. Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.

Root cause: `apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

The fix (strict TDD):
1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip; open-code the same `ggml_cont(ggml_transpose(...))` at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
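The layout mismatch described above is easy to reproduce on plain arrays. In ggml, `ne[0]` is the fastest-varying dimension, so a tensor with `ne=[L, HD]` stores each channel's L time samples contiguously, while `ne=[HD, L]` stores each time step's HD channels contiguously. The helper below is a hypothetical host-side equivalent of `ggml_cont(ggml_transpose(x))` for a 2-D tensor, written only to show the byte movement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// src has ne=[n0, n1]: element (i0, i1) lives at offset i1 * n0 + i0.
// The result has ne=[n1, n0]: the same element lives at offset i0 * n1 + i1.
inline std::vector<float> transpose_cont(const std::vector<float>& src, int n0, int n1) {
    std::vector<float> dst(src.size());
    for (int i1 = 0; i1 < n1; ++i1) {
        for (int i0 = 0; i0 < n0; ++i0) {
            dst[static_cast<std::size_t>(i0) * n1 + i1] =
                src[static_cast<std::size_t>(i1) * n0 + i0];
        }
    }
    return dst;
}
```

For L=2 and HD=3 with value(l, c) = 10·l + c, the `ne=[L, HD]` buffer reads `{0, 10, 1, 11, 2, 12}` and the transposed `ne=[HD, L]` buffer reads `{0, 1, 2, 10, 11, 12}`. Applying the transpose a second time restores the first buffer, which is the "transpose-of-the-transpose" hazard item 4 warns about when a host fallback transposes data the graph has already flipped.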

Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR tetherto#16): the audit-driven optimisations there are backend-portable by construction (every host-sync / bandwidth / fusion win uses the same GPU dispatch path Vulkan walks), so this PR only adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas
- supertonic_model gains backend_is_vk + use_native_leaky_relu, both resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine backend_name() can annotate the device with ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against a synthetic LEAKY_RELU node; short-circuits leaky_relu_portable_ggml to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain upstream OpenCL. Dynamic probe self-adapts to whichever ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the use_f16_attn auto-policy so a backend that ships flash_attn_ext but rejects the F16-K/V variant for Supertonic shapes keeps the F32 path instead of crashing at first synth call. Manual --f16-attn 1 still forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed through EngineOptions::vulkan_device, range-checked against ggml_backend_vk_get_device_count() at load (out-of-range index is a hard error; surfaces operator typos / wrong-machine config loud rather than silently falling back to CPU). Verbose mode + bench output append the Vulkan device description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch.

Tests
- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from tetherto#16 unchanged, all PASS.

Build
- All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean.
- No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible.

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
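For readers unfamiliar with the portable fallback, here is one way a leaky ReLU can be expressed with only RELU, SCALE, and ADD. This is an illustrative decomposition, leaky(x) = (1 - a)·relu(x) + a·x, and is an assumption: the commit message names the three ops but does not spell out the exact formula `leaky_relu_portable_ggml` uses. Both function names below are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Direct definition: x for x > 0, slope * x otherwise.
inline std::vector<float> leaky_relu_direct(const std::vector<float>& x, float slope) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] > 0.0f ? x[i] : slope * x[i];
    return y;
}

// Assumed decomposition using only relu, scale, and add:
//   r = relu(x); y = scale(r, 1 - slope) + scale(x, slope)
inline std::vector<float> leaky_relu_decomposed(const std::vector<float>& x, float slope) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float r = std::max(x[i], 0.0f);    // RELU
        const float a = (1.0f - slope) * r;      // SCALE
        const float b = slope * x[i];            // SCALE
        y[i] = a + b;                            // ADD
    }
    return y;
}
```

The decomposed form is not guaranteed to be bit-identical to the direct one for positive inputs because of the extra rounding steps, so a parity check should use a small tolerance rather than exact equality.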







…PU-bridge layout fix

Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware. Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip: open-code the same `ggml_cont(ggml_transpose(...))` at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
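To make the two layouts concrete, here is a minimal host-side sketch on plain `std::vector<float>` buffers, with no ggml involved. It assumes channel-major-flat means element (t, c) lives at `data[t + c*L]` and time-major-flat means it lives at `data[c + t*HD]` (following ggml's convention that `ne[0]` is the fastest-varying dimension); the function names are hypothetical and are not the project's helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Channel-major-flat (ne = [L, HD]):  element (t, c) at src[t + c * L].
// Time-major-flat    (ne = [HD, L]):  element (t, c) at dst[c + t * HD].
inline std::vector<float> channel_major_to_time_major(const std::vector<float>& src,
                                                      std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < HD; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[c + t * HD] = src[t + c * L];
        }
    }
    return dst;
}

// Inverse mapping: applying it to already-transposed data restores the
// original channel-major bytes (the "transpose-of-the-transpose").
inline std::vector<float> time_major_to_channel_major(const std::vector<float>& src,
                                                      std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < HD; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[t + c * L] = src[c + t * HD];
        }
    }
    return dst;
}
```

With L = 3 and HD = 2, the channel-major buffer `{0, 1, 2, 10, 11, 12}` becomes `{0, 10, 1, 11, 2, 12}` in time-major order. Transposing a second time returns the original bytes, which illustrates why a host path that still transposed after the graph-side flip would hand attention the wrong orientation without any error.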

`apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`. In fact the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)` fallback) produces `ne=[L, HD]` with channel-major-flat memory (`data[t + c*L]`), the bit-exact transpose of the helper's input contract. Every CPU synth with `--n-gpu-layers 0` against a GGUF carrying `vector_rope_theta` aborts at the helper's defensive assertion on the first denoise step:

    supertonic_internal.h:742: GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
    apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml → supertonic_vector_step_ggml → supertonic_vector_loop_ggml

The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

Fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]`. Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits.
2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V has no RoPE to mask the layout flip: open-code the same `ggml_cont(ggml_transpose(...))` at the V matmul output in `build_group_graph_cache` and the front-block path in `supertonic_vector_trace_proj_ggml` so the GPU-bridge `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact bytes. Style sq/sk/sv left untouched: this branch has no GPU bridge for style attention, so the host-vector path via `tensor_to_time_channel` is already correct.
4. Legacy host-bridge downloads of post-RoPE Q/K and post-transpose V switched from `tensor_to_time_channel` to `tensor_raw_f32`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV |
| CPU long-text synth | abort | writes 6.25s WAV |
| Multi-voice (F1 / M1) | abort | both work |
| Determinism (same seed × 2) | n/a | bit-identical |

- `test-supertonic-rope-packed-qk`: 14 / 14 checks, `max_abs_err = 0.000e+00`.
- CPU `ctest -L unit`: 12 / 12 tests, 0 regressions.

Audio sanity on the exact QVAC-18966 reproduction command: 99.9% non-zero samples, rms=1406, abs_max=15984 (speech-like dynamics, not silence / clipping / garbage).

Co-authored-by: Cursor <cursoragent@cursor.com>
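The audio sanity figures quoted above (non-zero fraction, rms, abs_max) are the kind of statistics that can be computed with a few lines of code. The sketch below assumes 16-bit signed PCM samples, which the reported magnitudes suggest but the commit message does not state; the struct and function names are hypothetical and this is not the tool actually used for the check.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Summary statistics for a buffer of PCM samples.
struct PcmStats {
    double nonzero_fraction;  // share of samples that are not exactly zero
    double rms;               // root mean square amplitude
    int    abs_max;           // largest absolute sample value
};

// Assumption: samples are 16-bit signed PCM.
inline PcmStats pcm_stats(const std::vector<std::int16_t>& samples) {
    assert(!samples.empty());
    std::size_t nonzero = 0;
    double sum_sq = 0.0;
    int abs_max = 0;
    for (std::int16_t s : samples) {
        const int v = static_cast<int>(s);  // widen so abs(-32768) is safe
        if (v != 0) {
            ++nonzero;
        }
        sum_sq += static_cast<double>(v) * static_cast<double>(v);
        abs_max = std::max(abs_max, std::abs(v));
    }
    const double n = static_cast<double>(samples.size());
    return PcmStats{static_cast<double>(nonzero) / n, std::sqrt(sum_sq / n), abs_max};
}
```

Silence would show up as a near-zero non-zero fraction and rms, while hard clipping would pin abs_max at 32767 or 32768; values in between, like those reported, are consistent with ordinary speech.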



Zbig9000 added 3 commits May 11, 2026 14:49

QVAC-18607 [TTS GGML] Add and optimize OpenCL for supertonic

8d5ebb4

Zbig9000 requested review from GustavoA1604, freddy311082, ishanvohra2, mario-rei and ogad-tether May 11, 2026 16:01

Zbig9000 requested review from a team as code owners May 11, 2026 16:01

Zbig9000 and others added 6 commits May 12, 2026 10:26

Zbig9000 mentioned this pull request May 12, 2026

Qvac 18605 tts ggml add and optimize vulkan for supertonic #17

Closed

GustavoA1604 merged commit eed9c52 into tetherto:master May 12, 2026
59 of 66 checks passed

Zbig9000 mentioned this pull request May 14, 2026

Supertonic optimizations qvac 18605 tts ggml add and optimize vulkan for supertonic #18

Merged

ogad-tether mentioned this pull request May 15, 2026

tts-cpp: supertonic Engine streaming via multilingual chunker + callback #20

Merged

10 tasks

Zbig9000 mentioned this pull request May 15, 2026

QVAC-18966 [TTS GGML] Fix CPU regression #21

Merged

GustavoA1604 merged 9 commits into tetherto:master from Zbig9000:QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic
