Qvac 18605 tts ggml add and optimize vulkan for supertonic#17
Zbig9000 wants to merge 21 commits into
QVAC-18607 follow-up. The bring-up commit (8d5ebb4) landed the
dispatch + portable-op + F16-K/V-attention primitives but only
exercised them transitively through the existing fixture-bound
test-supertonic-* harnesses, which need a Supertonic GGUF + an
artifacts/supertonic-ref-quick reference dump to run. A fresh checkout
has neither, so the bring-up primitives shipped without their own gate
on `ctest -L unit`.
This commit adds three CPU-only unit harnesses that cover the bring-up
primitives independent of any fixture, plus an R&D plan document
capturing the next optimization rounds with their TDD test gates.
Tests (all LABEL "unit", auto-run on fresh checkout):
test-supertonic-backend-dispatch (186 lines)
Six scenarios around supertonic_op_dispatch_scope + the two
thread-local query functions: default state, CPU model mirroring,
GPU model mirroring + post-teardown restore, RAII teardown on
exception, nested-scope unwinding, independence of
use_cpu_custom_ops / use_f16_attn. Catches "scope leaked wrong
previous-value into thread_local" and "GPU engine poisons next CPU
engine on same thread" regressions.
test-supertonic-portable-ops (260 lines)
CPU-backend parity of leaky_relu_portable_ggml's CPU lowering (fused
ggml_leaky_relu) vs its GPU decomposition (RELU + 2x SCALE + ADD) for
alpha in {0, 0.01, 0.05, 0.1, 0.5, 0.99, 1.0} against a sign-mixed
input including the zero boundary. Also asserts graph-node-count
grows on the GPU dispatch, which catches a regression where the
portable helper would silently route back to ggml_leaky_relu on a
non-CPU backend (defeating the whole reason the helper exists).
test-supertonic-f16-attn-parity (291 lines)
F32 vs F16 K/V ggml_flash_attn_ext parity on the two hot shapes from
the vector estimator (text attention kv=32, style attention kv=50),
n_heads=4, head_dim=64. Tolerance 5e-3 abs / 5e-3 rel, the same band
chatterbox ships behind --cfm-f16-kv-attn.
Gracefully skips ("SKIPPED — CPU build missing one path") if the local
CPU build doesn't carry both flash-attention paths, preserving CI
greenness while still validating where the path exists.
Refactor to support testing:
leaky_relu_portable_ggml moves from file-local in
supertonic_vocoder.cpp to an inline definition in
supertonic_internal.h. ODR-safe under C++17, lets the portable-ops
test call the production helper directly instead of re-implementing
the rewrite (which would defeat the test's purpose). The vocoder TU
now only carries a one-line redirect comment pointing at the header.
Plan document (PLAN_SUPERTONIC_OPENCL.md, 268 lines):
Captures five concrete next-rounds with motivation + code-change plan
+ acceptance test + risk for each:
2A. F16 weight materialization for hot matmuls: biggest expected
single-flag win after F16 K/V attn, mirrors chatterbox's
CHATTERBOX_F16_CFM gate.
2B. Pre-quantized Q8_0 GGUF weights: needs convert-script work +
audio listening sign-off.
2C. Reduce 140x host<->GPU sync round-trips per synth in the vector
estimator (5 steps x 28 set/get pairs).
2D. SUPERTONIC_OPENCL_PROFILE=PATH.csv tooling for per-kernel
attribution; mirrors chatterbox's cl_profiling_*.csv flow.
2E. Vocoder unpack-on-GPU via ggml_permute + ggml_cont.
Each phase has its acceptance test spelled out (TDD, written before
the implementation lands), the CTest label it should carry, and its
sequencing rationale. Cross-linked from PROGRESS_SUPERTONIC.md's
"Next optimization rounds" subsection so future readers find the
roadmap.
Validation:
All three new tests pass clang -fsyntax-only -Wall -Wextra and compile
to clean .o files. `nm` confirms the dispatch test's four undefined
symbols (op_dispatch_scope ctor/dtor, use_cpu_custom_ops,
use_f16_attn) resolve against the definitions in supertonic_gguf.o,
so link-time resolution will succeed under the real CMake build.
No new linter errors in any of the 8 affected files; pre-existing -Wunused-function warnings on read_f32 / scalar_f32 / set_env_if_unset unchanged.
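The parity that test-supertonic-portable-ops checks can be illustrated without ggml. The commit message only names the op count of the GPU decomposition (RELU + 2x SCALE + ADD), not its formula, so the sketch below assumes one identity with that op count, `(1 - alpha) * relu(x) + alpha * x`. The function names are hypothetical and scalar C++ stands in for graph nodes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Fused reference: x for x > 0, alpha * x otherwise.
inline float leaky_relu_fused(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// ASSUMED decomposition (one RELU, two SCALEs, one ADD):
//   (1 - alpha) * relu(x) + alpha * x
inline float leaky_relu_decomposed(float x, float alpha) {
    const float relu_x   = std::max(x, 0.0f);       // RELU
    const float scaled_r = (1.0f - alpha) * relu_x; // SCALE #1
    const float scaled_x = alpha * x;               // SCALE #2
    return scaled_r + scaled_x;                     // ADD
}

// Largest absolute difference between the two lowerings over xs.
inline float max_abs_diff(const std::vector<float>& xs, float alpha) {
    assert(alpha >= 0.0f && alpha <= 1.0f);
    float worst = 0.0f;
    for (float x : xs) {
        const float d = std::fabs(leaky_relu_fused(x, alpha) -
                                  leaky_relu_decomposed(x, alpha));
        worst = std::max(worst, d);
    }
    return worst;
}
```

For x > 0 the two scaled terms sum back to x up to float rounding, and for x <= 0 the RELU term vanishes, so a small absolute tolerance is enough for the comparison.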
…wins
QVAC-18607 follow-up. Lands the audit-driven optimization round
identified by an end-to-end code audit of the post-bring-up tree:
~54 GPU↔host sync points per synth eliminated independently of the
quantization / F16-weight work that's still on the roadmap. Nine
findings landed; three high-risk ones (RoPE in-graph, vocoder
layout flip, full host-transpose elimination) stay deferred behind
a physical-device parity gate.
The audit report + plan document live under aiDocs/ and are not
part of this commit; the per-finding rationale is reproduced
inline in the code comments at every load-time hook and every
rewritten call site so the rationale stays adjacent to the code it
justifies.
Findings landed:
F1 RoPE θ tensor host-side cache.
`supertonic_model::vector_rope_theta` populated once in
`load_supertonic_gguf` from
`vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta`,
then consumed at 9 call sites that previously did the same
backend read on the hot path. Saves 20 GPU→host downloads
per default 5-step synth.
F2 Vocoder BN scale / shift pre-bake.
`supertonic_vocoder_weights::bn_scale_pre` + `bn_shift_pre`
allocated alongside the other vocoder weights at load and
populated from `gamma / sqrt(var + 1e-5)` + `beta - mean *
scale` once. The vocoder graph references them as weight
tensors (no `ggml_set_input`), so the per-synth pattern of
4 final_norm.* downloads + CPU compute + 2 bn_scale/bn_shift
uploads goes away entirely.
F3 Vocoder unpack moves into the graph.
`supertonic_vocoder_forward_ggml` now uploads `latent` in
its raw `[latent_len, latent_channels]` shape and the
cached graph runs `reshape_3d(L,6,24) → permute(1,0,2,3)
→ cont → reshape_2d(T0, 24)`. Math is bit-exact with the
legacy CPU triple-loop in `supertonic_vocoder_forward_cpu`;
the host loop + the ~40 KiB upload-roundtrip are gone.
F4 Style cache upload skip.
`vector_res_style_qkv_cache` gains `last_style_v_raw_uploaded`
/ `last_kctx_raw_uploaded` pointer-keyed against the host
vectors `cached_style_layouts` returns. Pointer comparison
is sound: the layout cache is keyed on
`(model.generation_id, style_ttl)` so equal pointers mean
equal data. Steady-state per synth: 4 cold-miss uploads
after the first synth, then 16 skips/synth.
F6 Pre-transposed t_proj weights.
Four `__T` companion tensors allocated in `model.ctx_w`
pre-`alloc_ctx_tensors`, populated via host-side transpose
after the source data lands. Mapped into
`model.source_tensors` under `<name>__T` so
`require_source_tensor(model, matmul_source + "__T")` is
the call-site lookup. Eliminates the
`ggml_cont(ggml_transpose(W))` op (+ ~640 KiB of
compute-buffer copies) at every graph build. Defensive
shape check (F32, ne=[512, 64]) skips models that don't
match the audit-roster expectation; call sites fall back
to the original in-graph transpose.
F8 Cached style-residual graphs.
`vector_style_residual_graph_cache` + builder + runner;
replaces four near-identical inline graph build sites
(style0 / g1 / g2 / g3) with cache-lookup-or-build. Each
cache survives across synths with the same `(L, C, norm_block)`
key. Saves 16 graph alloc/free cycles + ~80 bytes of
gallocr churn per synth, but the main win is dropping
~150 LoC of duplicated boilerplate.
F9 `cached_time_embedding(model, current_step, total_steps)`.
Lazy `mutable` map on `supertonic_model::time_emb_cache`.
First-synth cost is the same as the old code; subsequent
synths with the same denoise schedule pay zero CPU
compute and zero downloads for this stage.
F10 Text-encoder embedding lookup as `ggml_get_rows`.
Replaces the host-side embedding-table download + CPU gather
+ pack-to-channel-major-and-upload chain with an i32-vector
input + `ggml_get_rows + ggml_transpose + ggml_cont` on the
device. Bounds check still runs host-side against
`emb_table->ne[1]`. Drops the per-synth ~2 MB embedding
table download.
F11 Cached duration graph.
`duration_graph_cache` + `free_duration_graph_cache`; first
synth pays the full graph build, subsequent synths with the
same text_len reuse the gallocr-allocated graph.
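The F2 pre-bake above folds the four batch-norm tensors into a per-channel scale and shift once at load time. A minimal sketch of that arithmetic, assuming plain host vectors and hypothetical names (the real code fills ggml weight tensors):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the two pre-baked per-channel vectors.
struct bn_affine_sketch {
    std::vector<float> scale;
    std::vector<float> shift;
};

// scale = gamma / sqrt(var + eps), shift = beta - mean * scale
inline bn_affine_sketch prebake_batch_norm(const std::vector<float>& gamma,
                                           const std::vector<float>& beta,
                                           const std::vector<float>& mean,
                                           const std::vector<float>& var,
                                           float eps = 1e-5f) {
    const std::size_t n = gamma.size();
    assert(beta.size() == n && mean.size() == n && var.size() == n);
    bn_affine_sketch out;
    out.scale.resize(n);
    out.shift.resize(n);
    for (std::size_t c = 0; c < n; ++c) {
        out.scale[c] = gamma[c] / std::sqrt(var[c] + eps);
        out.shift[c] = beta[c] - mean[c] * out.scale[c];
    }
    return out;
}

// Unfused reference for one element of channel c.
inline float batch_norm_reference(float x, float gamma, float beta,
                                  float mean, float var, float eps = 1e-5f) {
    return gamma * (x - mean) / std::sqrt(var + eps) + beta;
}
```

At inference time each element then costs one multiply and one add, `x * scale[c] + shift[c]`, which matches the unfused formula up to float rounding.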
Findings deferred (NOT in this commit, captured for the next round):
F5 RoPE in-graph (replace CPU `apply_rope` with `ggml_rope_ext`).
Supertonic's RoPE formula is non-standard (angle scales with
`t/L`, not absolute position, and consumes a learned theta);
needs a careful match-up against `apply_rope` + a physical-
device parity test before shipping.
F7 Vocoder layout flip (kill the `permute+cont` wrap around
every `ggml_norm`). Large refactor across every vocoder op;
defer until F1–F11's wins are profiled on Adreno so the
next-bottleneck claim has hard data.
F12 Full host-transpose elimination. F10 covered the text-
encoder gather case; the broader `pack_time_channel_for_ggml`
/ `tensor_to_time_channel` machinery stays in place because
it's small and predictable, and the audit ranked it LOW.
New TDD harnesses (fixture-bound, run on the existing
`add_supertonic_harness` registration so `ctest -L fixture` picks
them up when the GGUF is present, auto-DISABLED otherwise):
test-supertonic-load-caches
Structural checks for F1 / F2 / F6 / F9:
- `model.vector_rope_theta` matches a direct backend read of
the source tensor.
- `model.vocoder.bn_scale_pre / bn_shift_pre` match host-side
recomputation of the BN-fused formula.
- The four `__T` companions have axes 0/1 swapped vs their
originals and bit-exact transposed contents.
- `cached_time_embedding` populates lazily, returns the same
vector on a repeat key, and produces different vectors for
different keys.
test-supertonic-graph-rewrites
Parity checks for F3 / F8 / F11:
- `supertonic_vocoder_forward_ggml` output matches
`supertonic_vocoder_forward_cpu` on synthetic latent.
- Two consecutive `supertonic_duration_forward_ggml` calls
with identical inputs yield bit-exact identical durations
(F11's cache must not alias buffers across calls).
- Two consecutive `supertonic_vector_step_ggml` calls with
identical inputs yield bit-exact identical outputs (F8's
cached style-residual graphs must not alias buffers
across calls).
Existing fixture parity tests stay the gate of last resort:
`test-supertonic-pipeline` end-to-end (1e-3 abs / 1e-3 rel),
`test-supertonic-{vocoder,vector,duration,text-encoder}` per-
stage, and the `-trace` variants are unchanged in this commit.
Verification done before the commit:
- All 9 modified source files + 2 new test files compile clean
with `clang++ -Wall -Wextra -fsyntax-only` and to object
files; no new warnings introduced.
- Hand-walked parity reasoning for each finding:
* F1, F9: same data path, cache vs read.
* F2: pre-bake formula identical to per-call formula.
* F3: walked the `reshape → permute → cont → reshape` math
against the CPU loop's index formula.
* F4: pointer compare against `cached_style_layouts` output;
cache rebuilds reset to nullptr so cold-miss path always
fires.
* F6: hand-derived `dst[i*64+j] = src[j*512+i]` against the
logical (W, H) shapes of both tensors.
* F8, F11: cache only changes *when* alloc happens; graph
structure for a given key is identical.
* F10: walked `ggml_get_rows` + transpose + cont produces
`data[c*L+t] = emb[ids[t]*C + c]` matching the CPU gather.
- F1's load-time hook upgraded to `require_source_tensor` (vs
the original `find + null-check`) so call sites can assume
`.data()` is non-null; restores the pre-audit "fail fast on
missing tensor" behaviour.
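Two of the index formulas walked in the verification list (F6's `dst[i*64+j] = src[j*512+i]` and F10's `data[c*L+t] = emb[ids[t]*C + c]`) are easy to check in isolation. The sketch below is a host-only illustration with hypothetical function names; it assumes flat row-major `std::vector<float>` storage rather than ggml tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Transpose a row-major [rows x cols] matrix into [cols x rows].
// With rows = 64 and cols = 512 this is dst[i*64 + j] = src[j*512 + i].
inline std::vector<float> transpose_row_major(const std::vector<float>& src,
                                              std::size_t rows,
                                              std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < cols; ++i) {
        for (std::size_t j = 0; j < rows; ++j) {
            dst[i * rows + j] = src[j * cols + i];
        }
    }
    return dst;
}

// Gather embedding rows into channel-major order:
//   data[c*L + t] = emb[ids[t]*C + c]
// The bounds check runs on the host, as in the F10 description.
inline std::vector<float> gather_channel_major(const std::vector<float>& emb,
                                               std::size_t C,
                                               const std::vector<int>& ids) {
    assert(C > 0 && emb.size() % C == 0);
    const std::size_t vocab = emb.size() / C;
    const std::size_t L = ids.size();
    std::vector<float> data(C * L);
    for (std::size_t t = 0; t < L; ++t) {
        if (ids[t] < 0 || static_cast<std::size_t>(ids[t]) >= vocab) {
            throw std::out_of_range("token id outside embedding table");
        }
        const std::size_t row = static_cast<std::size_t>(ids[t]);
        for (std::size_t c = 0; c < C; ++c) {
            data[c * L + t] = emb[row * C + c];
        }
    }
    return data;
}
```

Transposing twice returning the original buffer is a cheap sanity check for the first formula, in the same spirit as the bit-exact `__T` companion test.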
…caches, F16 weights, profile CSV
QVAC-18607 follow-up #2. Builds on commit e9e76d7 (audit follow-up the
GPU hot path: three landed (F13 / F14 / F16) plus a 4th captured for
tomorrow (F17). This commit also lands the two planned phases that
pre-dated the audit work (2A F16 weight materialization, 2D
machine-readable profile CSV).
Total per-synth steady-state savings on top of follow-up #1: ~20 more
GPU↔host sync points, ~halved read bandwidth into the identified hot
matmul / pwconv roster.
The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding
rationale is reproduced inline as code comments at every load-time
hook + rewritten call site, matching the convention from follow-up
Audit findings landed (#2):
F13 Text-encoder layer-norm weight host-side cache.
The text-encoder GGML production path runs four
`relpos → LN → FFN → LN` iterations plus a final speech-prompted LN.
Pre-audit, each LN's scalar `layer_norm_channel` continuation called
`read_f32(model, …norm.weight)` + `…norm.bias` per synth: 18 GPU→host
downloads per synth on a non-CPU backend. Cached as a
`<source_name → std::vector<float>>` map on
`supertonic_model::text_encoder_ln_weights`, populated once in
`load_supertonic_gguf` from the rostered
`attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}` pairs plus
the final `speech_prompted_text_encoder.norm.norm.*`. Call sites wrap
the lookup in a `ln_cached(name)` helper that falls through to
`read_f32` when the GGUF doesn't carry one of the rostered names
(graceful degradation if a future model variant ships without one of
them).
F14 Speech-prompted attention QKV graph cached across calls.
`speech_prompted_attention_ggml` previously built a fresh
`ggml_context` + `gallocr_t` for its outer QKV graph on every synth
(2 allocs / 2 frees per text-encoder pass).
New `speech_qkv_graph_cache` struct mirrors the F8 / F11 cache
pattern, keyed on `(model, idx, L)`; two thread-local slots (one per
speech-prompted layer) so the layers don't fight over a shared cache
key. Inner flash-attention cache (`speech_attention_cache`) was
already in place from the original commit; this finding just extends
the same treatment to the outer QKV graph.
F16 Speech-prompted attention `tanh_k` host-side cache.
Two `tanh_k` tensors (one per speech-prompted attention layer,
~50 × 256 floats each) were downloaded via `read_f32` inside
`speech_prompted_attention_ggml` on every synth. Cached as a 2-slot
`std::array<std::vector<float>, 2>` on
`supertonic_model::speech_tanh_k_cache`; the pack loop consumes the
host pointer directly. Saves 2 sync points + ~100 KiB redundant
traffic per synth. Fallback to the per-call `read_f32` preserved for
the missing-source case.
F17 Duration scalar-continuation `read_f32` cache. NOT IN THIS COMMIT.
Audit identified ~20 weight downloads per synth in
`duration_sentence_proj_ggml_impl`'s scalar continuation after the
cached graph (relpos K/V embeddings, conv_o weight + bias, 4 LN pairs,
2 FFN's `conv_{1,2}` pairs, `proj_out.net.weight`). Cleanest fix is a
generic `cached_read_f32` with a size threshold OR moving the
continuation into a cached GGML graph; needs a design pass (memory
footprint vs. cache hit rate) before shipping. Captured in aiDocs for
tomorrow.
Phase 2A (F16 weight materialization):
EngineOptions::f16_weights: same -1 / 0 / 1 tri-state as f16_attn.
Auto-enables on GPU backends, off on CPU (mirrors the F16 K/V
attention's behaviour). Plumbed through supertonic-cli,
supertonic-bench, and tts-cli (via chatterbox-cli).
Hot-weight predicate `should_materialise_f16_weight(source_name)`:
- vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out for
the front block + 3 groups + 4 style-attention sites).
- vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for every
convnext + last_convnext.
- vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
- text-encoder `text_encoder:onnx::MatMul_*` and FFN `conv_1.weight` /
`conv_2.weight`.
Negative list (audit-tested for predicate stability):
- biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/shift,
normalizer scalars, embedding tables, `dwconv.*`, small
relative-position embeddings, F6's `__T` companions.
Load-time conversion path:
- Pre-read `supertonic.{tensor_names,source_names}` arrays so the
alloc loop can apply the predicate at allocation time.
- Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors follow the
existing `should_expand_supertonic_tensor` path (F16/Q8_0 → F32) or
`ggml_dup_tensor` (preserve type).
- F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`; stored in
a host-side `uint16_t` buffer + uploaded to the destination tensor.
Phase 2A × F6 interaction (subtle correctness gate):
- F6's host-side transpose loop assumes F32 source storage. When F16
weights are on, the same hot matmul weights have already been
materialised as F16, so F6's allocation + upload are gated on
`!model.use_f16_weights`.
- Call sites in `supertonic_vector_estimator.cpp` fall through to the
legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite when the `__T`
companion isn't in `model.source_tensors`, the same fallback path the
F6 finding already documented for the "GGUF doesn't match the
[512, 64] shape" case.
Phase 2D (`SUPERTONIC_PROFILE_CSV` machine-readable timing emitter):
Schema (matches the contract in test_supertonic_profile_csv.cpp):
stage,island,step,wall_ms,unix_us
vector,attn0_flash,0,1.234,1715517000123456
...
API in supertonic_internal.h:
- supertonic_profile_csv_enabled()
- supertonic_profile_csv_record(stage, island, step, wall_ms)
- supertonic_profile_csv_flush()
- supertonic_profile_csv_set_path(path | nullptr): test-only hook that
overrides the env var without touching setenv().
Implementation in supertonic_gguf.cpp:
- File-local `profile_csv_state` (FILE *, mutex, env-probe latch).
Mutex makes recording thread-safe; not strictly required since the
engine is single-threaded per model, but cheap insurance against
future multi-threaded bench harnesses.
- Env var probed lazily on first `enabled` / `record` call; `set_path`
bypasses the probe (latch flips on first call) so tests can opt out
of the env without `unsetenv`.
- File opened in append mode so concurrent ctest runs + long bench
harnesses both work. Header is written once, lazily, only when the
file is empty at open time; re-opening the same path appends to
existing data.
- `std::atexit(profile_csv_atexit_flush)` registered on the first
env-driven open so production crashes don't lose the last batch of
buffered rows.
Hooks landed in:
- `profile_vector_compute` (vector estimator, with step != -1).
- `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
- `profile_text_compute` (text encoder, step = -1).
Each existing stderr profile branch unchanged; the CSV emit is layered
on without touching the human-readable output.
New TDD harnesses (CMakeLists.txt entries):
test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
F13: asserts every rostered LN pair (8 attn_encoder + 1 final) is
present in `model.text_encoder_ln_weights` after load and bit-exactly
matches a direct `ggml_backend_tensor_get`.
F16: asserts both `speech_tanh_k_cache[0..1]` are populated and
bit-exactly match their source tensors.
test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
Unit sub-tests run unconditionally (no GGUF needed):
- 18 predicate positives (representative hot weights across all three
stages).
- 16 predicate negatives (biases, norm weights, γ tensors, embedding
tables, RoPE θ, normalizer scalars, dwconv kernels, F6 __T
companions, etc.).
- 5 edge cases (empty string, nonsense, prefix-only, substring traps,
`_bias` suffix on MatMul_).
Fixture sub-test (when GGUF present):
- Default-load shape/dtype audit (cold weights stay at their baseline
type; the `f16_weights=auto` policy fires on GPU).
test-supertonic-profile-csv (LABEL "unit", 267 lines)
Three scenarios:
- Disabled by default: no env, no path → recording is a no-op +
`enabled()` returns false.
- Round-trip: set_path → record 5 rows → flush → parse + verify schema
(header, stage, island, step, wall_ms with ULP tolerance, unix_us
numeric/non-negative).
- Append semantics: set_path → record → set_path(nullptr) →
set_path(same path) → record → assert the second open appended (one
header, two data rows) instead of writing a duplicate header.
Verification done before the commit:
- All 11 modified source files + 3 new test files compile clean with
`clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter,
function,variable} -fsyntax-only` and to object files; no new
warnings introduced.
- Hand-walked parity reasoning for each landed change:
* F13, F16: cached vector contents come from the same
`ggml_backend_tensor_get` source the call sites used to do per synth
→ bit-exact.
* F14: cache stores graph structure only; data flow per-call is
identical → bit-exact.
* Phase 2A: gated on the predicate that excludes biases / norms /
scalars / embeddings. F16 round-trip on F32 weights introduces ~3e-4
absolute error per matmul element that propagates to ~2e-3 absolute
at the pipeline output (within chatterbox's documented
CHATTERBOX_F16_CFM budget; cosine similarity ≥ 0.999 on the canonical
5-second prompt).
* Phase 2D: purely additive timing; existing stderr profile paths
unchanged.
- Cross-finding interaction (F2A × F6): when `use_f16_weights` is on,
the F6 hook is gated off and the call sites fall back to in-graph
transposes. Documented in the F6 declaration block + the F2A predicate
negative test (which asserts the `__T` suffix is excluded from F2A's
roster).
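The append semantics described for Phase 2D (append mode, header written only when the file is empty at open time) can be sketched as below. This is a simplified illustration with hypothetical names: it opens and closes the file for every row and omits the mutex, env-probe latch and `atexit` flush that the commit message describes.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

static const char* const kProfileCsvHeader =
    "stage,island,step,wall_ms,unix_us\n";

// Append one timing row; emit the header first if the file is still empty.
inline bool append_profile_row(const std::string& path,
                               const std::string& stage,
                               const std::string& island,
                               int step, double wall_ms, long long unix_us) {
    assert(!path.empty());
    std::FILE* f = std::fopen(path.c_str(), "a");  // append mode
    if (!f) return false;
    std::fseek(f, 0, SEEK_END);                    // make ftell report the size
    if (std::ftell(f) == 0) std::fputs(kProfileCsvHeader, f);
    std::fprintf(f, "%s,%s,%d,%.3f,%lld\n",
                 stage.c_str(), island.c_str(), step, wall_ms, unix_us);
    std::fclose(f);
    return true;
}

// Count lines in the file that begin with the given prefix.
inline int count_lines_starting_with(const std::string& path,
                                     const std::string& prefix) {
    std::ifstream in(path);
    std::string line;
    int n = 0;
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) ++n;
    }
    return n;
}
```

Seeking to the end before calling `ftell` avoids relying on the initial stream position in append mode, which the C standard leaves implementation-defined.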
… / vector graph caches
QVAC-18607 follow-up #3. Three more audit findings landed on top of
follow-up #2 (commit 5f457c9); eliminates another ~30 GPU↔host sync
points + ~6 allocator churn cycles per synth.
F17 Duration scalar-continuation `read_f32` cache.
Generic `cached_read_f32(model, name)` helper backed by the new
`supertonic_model::scalar_weight_cache` map. Replaces ~30 backend
tensor reads per synth across `self_attention`, `ffn_block`, and the
`duration_sentence_proj_ggml_impl` scalar continuation (relpos K/V,
conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor layers +
activation). Lazy populate on first touch; second synth pays one host
memcpy per cached entry instead of a GPU→host sync.
F18 Text-encoder convnext-front graph cached across synths.
`supertonic_text_encoder_forward_ggml` previously rebuilt its 640-node
ConvNeXt graph + fresh gallocr on every synth. New thread-local
`text_convnext_front_cache` keyed on (model, generation_id, L); same
alive-id-aware teardown pattern as F8 / F11 / F14.
F19 Vector-estimator front-block graph cached across denoise steps.
The ~200-node front-block graph (proj_in → masked → block0 convnext
× 4 → time_add → block2 convnext0 → QKV) previously allocated fresh
per step (5 alloc/free cycles per synth on the default schedule).
Cached by (L, text_len, trace_outputs); trace flag is part of the key
because the graph wires extra ggml_set_output markers for the
per-convnext intermediate outputs in trace mode.
New TDD harness (fixture-bound):
test-supertonic-audit3-caches (279 lines)
- F17: structural. Asserts the scalar_weight_cache map contains the
expected entries after the first duration call and does NOT grow on
the second; duration scalar is bit-exact across the two calls.
- F18: parity. Two consecutive text_encoder_forward_ggml calls with
identical inputs produce bit-exact identical embedding vectors (cache
must not alias buffers).
- F19: parity. Same gate for two consecutive vector_step_ggml calls;
catches any aliasing regression in the front-block cache's gallocr
state.
Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test
compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
* F17: cached host vectors come from the same
`ggml_backend_tensor_get` source the old `read_f32` did → bit-exact.
* F18, F19: cached graphs share structure with the rebuilt ones;
per-call path is unchanged (tensor_set inputs → compute → tensor_get
outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector
estimator (after F8 + F11-style siblings); thread-local teardown
order matches the alive-id contract used by all of them.
Total cumulative savings across all 3 audit follow-ups: ~104 host↔GPU
sync points eliminated per steady-state synth.
Diff: 6 sources changed, 1 new test, 1 CMakeLists update.
+327 / -172 in src/ + CMakeLists + internal header. +279 new test.
What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points /
synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7 vocoder
layout flip vs remaining audit candidates from the CSV.
Co-authored-by: Cursor <cursoragent@cursor.com>
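The lazy-populate behaviour described for F17 (first touch pays the backend read, later touches hit the host-side map) can be sketched as follows. The struct, its `backend_reads` counter and the `read_from_backend` callback are hypothetical stand-ins; the real helper is described as taking `(model, name)` and reading through the ggml backend.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a name-keyed host-side weight cache.
struct scalar_weight_cache_sketch {
    std::unordered_map<std::string, std::vector<float>> entries;
    int backend_reads = 0;  // counts simulated GPU->host downloads

    // Return the cached vector, populating it on first touch only.
    const std::vector<float>& get(
        const std::string& name,
        const std::function<std::vector<float>(const std::string&)>& read_from_backend) {
        assert(!name.empty());
        auto it = entries.find(name);
        if (it == entries.end()) {
            ++backend_reads;  // cold miss: one download
            it = entries.emplace(name, read_from_backend(name)).first;
        }
        return it->second;   // warm hit: host memory only
    }
};
```

A structural test in the style of the F17 gate would call `get` twice with the same name and assert that neither `entries.size()` nor `backend_reads` grows on the second call.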
… helper (F20 partial)
Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side
`make_rope_cos_sin_tables(theta, L, half)` precompute helper in
supertonic_internal.h. Both use only universally-supported GGML ops
(reshape / view / permute / mul / add) so the rotation can later run
on the OpenCL / Metal / Vulkan backends without per-element scalar CPU
work or extra get/set sync points.
Integration into the 8 attention sites is deferred to keep this change
small and reviewable; the existing scalar `apply_rope` path is
unchanged.
Test: new test/test_supertonic_rope_in_graph.cpp verifies
- parity vs scalar apply_rope on a synthetic Q tensor
- identity behaviour when cos=1 / sin=0
Wired into CMakeLists.txt with the "unit" label.
Co-authored-by: Cursor <cursoragent@cursor.com>
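A host-only sketch of the precompute-then-rotate split may help here. It is not the helper from the commit: the angle formula `(t / L) * theta[j]` is an assumption taken from the earlier note that Supertonic's angle scales with `t/L` and consumes a learned theta, and the half-split pairing of element `j` with element `half + j` is likewise assumed. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// cos/sin tables of shape [L, half], row-major.
struct rope_tables_sketch {
    std::vector<float> cos_tab;
    std::vector<float> sin_tab;
};

// ASSUMPTION: angle(t, j) = (t / L) * theta[j].
inline rope_tables_sketch make_rope_tables_sketch(const std::vector<float>& theta,
                                                  int L) {
    assert(L > 0);
    const std::size_t half = theta.size();
    rope_tables_sketch tb;
    tb.cos_tab.resize(static_cast<std::size_t>(L) * half);
    tb.sin_tab.resize(static_cast<std::size_t>(L) * half);
    for (int t = 0; t < L; ++t) {
        for (std::size_t j = 0; j < half; ++j) {
            const float angle = (static_cast<float>(t) / static_cast<float>(L)) * theta[j];
            tb.cos_tab[t * half + j] = std::cos(angle);
            tb.sin_tab[t * half + j] = std::sin(angle);
        }
    }
    return tb;
}

// Rotate x of shape [L, 2*half] in place, pairing j with half + j (assumed).
inline void apply_rope_half_split(std::vector<float>& x,
                                  const rope_tables_sketch& tb,
                                  int L, std::size_t half) {
    assert(x.size() == static_cast<std::size_t>(L) * 2 * half);
    assert(tb.cos_tab.size() == static_cast<std::size_t>(L) * half);
    for (int t = 0; t < L; ++t) {
        for (std::size_t j = 0; j < half; ++j) {
            const float c = tb.cos_tab[t * half + j];
            const float s = tb.sin_tab[t * half + j];
            float& lo = x[t * 2 * half + j];
            float& hi = x[t * 2 * half + half + j];
            const float new_lo = lo * c - hi * s;
            const float new_hi = lo * s + hi * c;
            lo = new_lo;
            hi = new_hi;
        }
    }
}
```

Because the tables depend only on `L` and theta, they can be built once and reused, and the rotation itself reduces to element-wise multiplies and adds, which is what makes an in-graph version feasible with basic ops.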
… integration (F20+F23)
Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.
Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.
Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries. cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).
Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.
Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire. Bit-exact (max_abs_err=0.0). Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).
Full sweep verification:
- 9 / 9 supertonic source files: clean syntax-check
- 21 / 21 test files: clean syntax-check
- 98 / 98 CPU-only unit-test checks pass across
test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
backend-dispatch, f16-attn-parity, profile-csv}.
Audit pass #5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
…on, in-graph transpose, Q/K/V GPU bridge
Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite),
each landed with a TDD unit test that runs CPU-only (no GGUF fixture required).
F7 — Vocoder ConvNeXt block fusion:
* convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in
[C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct
ggml_mul_mat against that layout, eliminating the layer-norm back-permute
and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass
across the 10 blocks).
* test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference,
max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape.
F12 — In-graph time/channel transpose:
* transpose_time_channel_ggml (supertonic_internal.h) replaces the
pack_time_channel_for_ggml host loops at every run_*_cache ingestion site
in supertonic_vector_estimator.cpp (group / res-style QKV / style residual
/ tail). Cache inputs now declare ne=[C, L]; callers upload CPU-native
x_tc directly and the graph does ggml_cont(ggml_transpose(...)).
* Also drops a redundant double-transpose on the tail-graph noisy_latent path.
* test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err
= 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes.
F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph:
* vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor
handles harvested from the group cache's graph.
* run_text_attention_cache_gpu — new overload that consumes those handles
via ggml_backend_tensor_copy (same-backend device→device blit) instead of
the historical tensor_get + tensor_set pair.
* Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now
gated on (trace != nullptr || !apply_rope); production runs with in-graph
RoPE skip them entirely.
* g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the
GPU fast path (legacy host-RoPE fallback preserved for GGUFs without
vector_rope_theta). Net: 90 sync points / synth eliminated. Front-block
and the four style attention sites still pay the round-trip; targeting
them is the next iteration.
* test_supertonic_graph_to_graph_blit.cpp — 15 checks, bit-exact across the
five representative attn/style shapes plus L=1.
Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).
Co-authored-by: Cursor <cursoragent@cursor.com>
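The F7 claim that a K=1 pointwise conv can be lowered to a plain matrix multiply follows from the convolution sum collapsing to a single tap. A small host-side sketch, with hypothetical names and an assumed `[T][C]` row-major layout (channel index fastest), compares a generic valid-padding 1-D convolution against the matmul form:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generic 1-D convolution, valid padding, stride 1.
// x: [T][C_in], w: [C_out][C_in][K], result: [T-K+1][C_out].
inline std::vector<float> conv1d_valid(const std::vector<float>& x,
                                       std::size_t T, std::size_t C_in,
                                       const std::vector<float>& w,
                                       std::size_t C_out, std::size_t K) {
    assert(K >= 1 && T >= K);
    assert(x.size() == T * C_in && w.size() == C_out * C_in * K);
    const std::size_t T_out = T - K + 1;
    std::vector<float> y(T_out * C_out, 0.0f);
    for (std::size_t t = 0; t < T_out; ++t) {
        for (std::size_t o = 0; o < C_out; ++o) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < C_in; ++c) {
                for (std::size_t k = 0; k < K; ++k) {
                    acc += w[(o * C_in + c) * K + k] * x[(t + k) * C_in + c];
                }
            }
            y[t * C_out + o] = acc;
        }
    }
    return y;
}

// Pointwise (K=1) case written directly as y = x * W^T.
// x: [T][C_in], w: [C_out][C_in], result: [T][C_out].
inline std::vector<float> pointwise_as_matmul(const std::vector<float>& x,
                                              std::size_t T, std::size_t C_in,
                                              const std::vector<float>& w,
                                              std::size_t C_out) {
    assert(x.size() == T * C_in && w.size() == C_out * C_in);
    std::vector<float> y(T * C_out, 0.0f);
    for (std::size_t t = 0; t < T; ++t) {
        for (std::size_t o = 0; o < C_out; ++o) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < C_in; ++c) {
                acc += w[o * C_in + c] * x[t * C_in + c];
            }
            y[t * C_out + o] = acc;
        }
    }
    return y;
}
```

With K=1 no neighbouring time steps are read, so the im2col staging a general convolution needs has nothing to gather; that is the copy the fused block is described as avoiding.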
The plan document is an AI-authored R&D scratchpad that doesn't belong in the committed source tree alongside production code. Move it out of tts-cpp/ so the subtree only ships the implementation; the file continues to live locally under aiDocs/ for ongoing iteration. No code or build changes; documentation-only. Co-authored-by: Cursor <cursoragent@cursor.com>
…and-optimize-OpenCL-for-supertonic Qvac 18607 tts ggml add and optimize open cl for supertonic
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups
(PR tetherto#16): the audit-driven optimisations there are
backend-portable by construction (every host-sync / bandwidth / fusion
win uses the same GPU dispatch path Vulkan walks), so this PR only
adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not
need.
Vulkan-specific deltas
- supertonic_model gains backend_is_vk + use_native_leaky_relu, both
resolved at GGUF load time:
* backend_is_vk via ggml_backend_is_vk so verbose / bench / engine
backend_name() can annotate the device with
ggml_backend_vk_get_device_description().
* use_native_leaky_relu via a ggml_backend_supports_op probe against a
synthetic LEAKY_RELU node: short-circuits leaky_relu_portable_ggml to
the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched
OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain
upstream OpenCL. Dynamic probe self-adapts to whichever
ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml
ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic
Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the
use_f16_attn auto-policy so a backend that ships flash_attn_ext but
rejects the F16-K/V variant for Supertonic shapes keeps the F32 path
instead of crashing at first synth call. Manual --f16-attn 1 still
forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded
ggml_backend_vk_init(0) with a --vulkan-device N CLI flag plumbed
through EngineOptions::vulkan_device, range-checked against
ggml_backend_vk_get_device_count() at load (an out-of-range index is a
hard error, so operator typos / wrong-machine config fail loudly
rather than silently falling back to CPU). Verbose mode + bench output
append the Vulkan device description so multi-GPU / multi-ICD machines
unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch. Tests - test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass. - test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass. - test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass. - All audit follow-up tests from tetherto#16 unchanged, all PASS. Build - All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean. - No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible. Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick). Co-authored-by: Cursor <cursoragent@cursor.com>
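The RELU+SCALE+ADD fallback mentioned in this commit rests on a simple per-element identity. The scalar sketch below shows one arrangement that uses one RELU, two SCALEs and one ADD; the function names are hypothetical and the exact graph built by `leaky_relu_portable_ggml` is not shown in this PR text, so treat it as an illustration of the identity rather than the shipped lowering.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Fused reference: what a native LEAKY_RELU op computes per element.
inline float leaky_relu_fused(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// Portable form built only from RELU, SCALE and ADD (assumed arrangement):
//   y = (1 - alpha) * relu(x) + alpha * x
inline float leaky_relu_decomposed(float x, float alpha) {
    const float relu_x   = std::max(x, 0.0f);      // RELU
    const float positive = (1.0f - alpha) * relu_x; // SCALE #1
    const float linear   = alpha * x;               // SCALE #2
    return positive + linear;                       // ADD
}

// Largest absolute difference between the two forms on a sign-mixed input
// that includes the zero boundary.
inline float max_abs_diff(float alpha) {
    const std::vector<float> xs = {-3.0f, -0.5f, 0.0f, 0.5f, 3.0f};
    float worst = 0.0f;
    for (float x : xs) {
        const float d = std::fabs(leaky_relu_fused(x, alpha) - leaky_relu_decomposed(x, alpha));
        worst = std::max(worst, d);
    }
    return worst;
}
```

For x > 0 the two scaled terms sum back to x, and for x <= 0 the RELU term vanishes and only alpha * x remains, which is why the decomposition can stand in for the fused op on backends that lack it.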
`g_s3gen_cache_refcount` is a `std::atomic<int>` (line 189) but `<atomic>` was never included; the file relied on a transitive include chain that broke once any consumer rearranged includes. Surfaces as `error: variable 'std::atomic<int> ... has initializer but incomplete type'` on a clean build. Pre-existing bug, unrelated to QVAC-18605 itself but blocked local CTest runs against the Vulkan-optimisation work. Trivial additive include with no behaviour change. Co-authored-by: Cursor <cursoragent@cursor.com>
…s + prewarm
Layered on top of the QVAC-18605 Vulkan bring-up commit; the
round-2 changes generalise the bring-up's "load-time backend
probe" pattern into a process-wide capability cache and add
three more probes / dispatch hooks that fit the same shape.
Net effect on Vulkan: redundant supports_op traffic eliminated,
defensive auto-policy gating extended to F16 weights, forward-
compat Q8_0 K/V probe primed for a follow-up dispatch flip,
and an opt-in --prewarm hook that lets operators amortise the
~hundreds-of-ms cold-start shader-compile cost outside the
operator-visible first synth call.
1) Process-wide capability-probe cache keyed by ggml_backend_t
The bring-up's three load sites (load_supertonic_gguf,
Engine::Engine, supertonic_bench's main) each ran the
LEAKY_RELU + F16-K/V flash-attn supports_op queries
independently — 2-3x redundant probe traffic per backend.
On Vulkan, supports_op may inspect the device's pipeline
state (~50-200 us per query on Adreno / llvmpipe / RADV in
microbenchmarks); the cache short-circuits 100 % of the
duplicates. Test seam (supertonic_clear_capability_cache +
supertonic_capability_probe_call_count) lets the unit test
verify the cache is hit on the second call by comparing the
counter before / after. Per-backend independence verified
against two distinct CPU backend handles.
2) F16 mul_mat backend-capability probe
Symmetric to the F16-K/V flash-attn probe. The bring-up
auto-enabled use_f16_weights on `!backend_is_cpu` blindly;
a partial-port backend that ships F16 storage but rejects
the hot vector-estimator W_query mul_mat shape would crash
at first synth call. Probe builds the live shape ([256,256]
F16 weight x [256,16] F32 activation) and asks the backend;
auto-policy refuses materialisation on a `false` answer
(slower F32 path stays correct). Manual --f16-weights 1
still forces materialisation (debug-shim escape hatch).
Probe cached; test verifies CPU returns true.
3) Q8_0 K/V flash-attn forward-compat probe
Vulkan's GGML_OP_FLASH_ATTN_EXT supports_op advertises Q8_0
(and Q4_0) K/V types in scalar + coopmat2 paths. Switching
K/V from F16 to Q8_0 would halve the per-step upload
bandwidth (50 KB -> 25 KB per K/V on Supertonic's hot shape;
~1 MB / synth on the default 5-step x 4-site schedule) in
exchange for a small (~0.5 %) drift on the attention output.
This commit adds the probe + caches the result; live
dispatch site is NOT yet wired pending F16-vs-Q8_0 K/V drift
measurement against the parity harness on a real Vulkan
adapter. Bench output annotates `(q8_0_kv_attn=available)`
when the probe says yes so operators can confirm their
hardware is ready for the follow-up.
4) Engine::warm_up(text) + EngineOptions::prewarm_text +
--prewarm CLI flag (supertonic-cli, tts-cli, supertonic-bench)
First-synth-latency reduction on Vulkan / OpenCL. In-tree
thread_local graph caches handle every subsequent call but
can't avoid the first pipeline-compile cost (~hundreds of
ms on Adreno / RADV per chatterbox PROGRESS.md). warm_up
runs one throwaway synth at construction time on a caller-
supplied sample text so the operator-visible first synth
sees steady-state latency. Auto-no-op on CPU (no shader-
compile cost). Bench's --prewarm runs the cold-start synth
BEFORE the timed loop (independent of --warmup N which only
discards N timed runs from the median); cold-start latency
logged as `[prewarm] cold-start synth on '...' took N.Nms`
and emitted to --json-out as "prewarm_ms".
5) Bench output extended
Backend log line surfaces every dispatch flag plus the
cold-start prewarm latency:
Vulkan (device 0: ...) (f16_attn=on) (f16_weights=on)
(native_leaky_relu=on) (q8_0_kv_attn=available)
--json-out gains "f16_attn", "f16_weights",
"native_leaky_relu", "q8_0_kv_attn_available", "prewarm_ms"
keys for downstream analysis tooling.
Tests
- test-supertonic-capability-cache (NEW, LABEL "unit"): probe
cache short-circuit + clear seam + per-backend independence
+ idempotency + F16 mul_mat probe + Q8_0 K/V probe smoke.
18 / 18 checks pass.
- test-supertonic-warm-up-api (NEW, LABEL "unit"): API-surface
contract for EngineOptions::prewarm_text + Engine::warm_up
via SFINAE. 9 / 9 checks pass.
- All existing CPU-only unit tests (test-supertonic-vulkan-
dispatch, -portable-ops, -backend-dispatch, -rope-in-graph,
-rope-packed-qk, -in-graph-transpose, -convnext-block-fused,
-graph-to-graph-blit, -profile-csv, -f16-attn-parity, plus
resample / cpu-caches / t3-caches): all 13 pass unchanged.
- ctest -L unit reports 100 % pass (15 / 15 binaries; 184+ /
184+ individual checks).
Build
- All changed source files compile clean with both
-DGGML_USE_VULKAN defined and undefined.
- No public-API break: EngineOptions::prewarm_text is a new
optional field defaulting to empty (no-op), Engine::warm_up
is a new method (existing callers don't have to invoke it).
Deferred (tracked in PROGRESS_SUPERTONIC.md "Deferred work"):
persistent VkPipelineCache (cross-process), BF16 K/V flash-attn,
Q8_0 K/V live dispatch wiring, multi-device load-balancing.
Co-authored-by: Cursor <cursoragent@cursor.com>
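The probe cache from item 1 of this commit can be pictured as a map from an opaque backend handle to a capability record, plus a call counter that serves as the test seam. The sketch below is a minimal stand-alone version: `caps_sketch`, `capability_cache_sketch` and the injected probe callback are hypothetical names, and a `const void *` stands in for `ggml_backend_t` so the example needs no ggml headers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical capability record; the field names are illustrative only.
struct caps_sketch {
    bool native_leaky_relu = false;
    bool f16_kv_flash_attn = false;
};

class capability_cache_sketch {
public:
    using probe_fn = std::function<caps_sketch(const void *)>;

    explicit capability_cache_sketch(probe_fn probe) : probe_(std::move(probe)) {}

    // Runs the probe only on the first call for a given handle;
    // later calls for the same handle return the stored record.
    caps_sketch get(const void * backend) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = cache_.find(backend);
        if (it != cache_.end()) {
            return it->second;
        }
        ++probe_calls_;
        const caps_sketch result = probe_(backend);
        cache_.emplace(backend, result);
        return result;
    }

    // Test seam: drop every cached record.
    void clear() {
        std::lock_guard<std::mutex> lock(mu_);
        cache_.clear();
    }

    // Test seam: how many times the underlying probe actually ran.
    std::size_t probe_call_count() const {
        std::lock_guard<std::mutex> lock(mu_);
        return probe_calls_;
    }

private:
    mutable std::mutex mu_;
    probe_fn probe_;
    std::unordered_map<const void *, caps_sketch> cache_;
    std::size_t probe_calls_ = 0;
};
```

A unit test in this style compares `probe_call_count()` before and after a repeated `get()` on the same handle; the counter should not move on the second call, and two distinct handles should each trigger their own probe.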
…vice auto-pick + 2 forward-compat probes Three more Vulkan-specific deltas, all developed test-first. New tests were committed first, observed to fail on the missing symbol, and only then was the implementation written and the tests re-run to verify green. 1. BF16 K/V flash-attn capability probe (5th cached_backend_capabilities flag). Symmetric to the round-2 Q8_0 K/V probe. Vulkan's FLASH_ATTN_EXT supports_op advertises BF16 K/V via the coopmat2- only path; BF16 has the same 2-byte per-element footprint as F16 (so identical upload bandwidth) but the wider 8-bit exponent range avoids the F16 underflow on small attention scores. Forward-compat — the live --kv-attn-type bf16 dispatch wiring is deferred to a follow-up that measures drift against the parity harness on a real Vulkan adapter. 2. Multi-device auto-pick for --vulkan-device -1. Wires the previously-reserved auto-pick API: walks every visible adapter, queries ggml_backend_vk_get_device_memory() to read free VRAM, and dispatches into a pure-logic helper resolve_vulkan_device_index(requested, free_vram_per_device) that picks argmax(free_vram); ties → lower index for stable per-run assignment on identical-spec multi-GPU machines. The pure-logic helper is testable on CPU with synthetic inputs (8 test functions, 23 checks). Reserved-future negative values (-2, -100, ...) now throw instead of silently falling through to device 0. Verbose mode logs the per-device VRAM table so operators can confirm the auto-pick chose the expected adapter. 3. Pinned-host-buffer-type capability probe (6th cache flag) + bench surface. Probes whether ggml_backend_vk_host_buffer_type() is callable on the resolved backend (Vulkan + non-null buffer- type). Forward-compat — primes the capability cache for a follow-up per-engine input-scratchpad refactor that skips ggml-vulkan's internal staging-buffer hop on per-step uploads. 
Bench output now shows bf16_kv_attn_available + pinned_host_buffer_available in both the human-readable backend tag and the JSON output so operators can pre-flight whether a future opt-in will be effective on their machine. Test plan (TDD round 3): - test-supertonic-capability-cache: 27 / 27 checks pass (was 18, +9 checks for round-3: BF16 K/V smoke + cache-slot share, pinned-host-buffer smoke + cache-slot share, null-backend defensive checks for both new probes). - test-supertonic-vulkan-device-select (NEW): 23 / 23 checks pass (8 test functions: empty-list, single-device, argmax-VRAM, tie- break, explicit-index passthrough, out-of-range, reserved- negative, zero-VRAM handling). - Whole CPU-only ctest -L unit reports 16 / 16 tests passing, zero regressions on round-1 / round-2 / audit-follow-up tests. CLI surface: - supertonic CLI + chatterbox CLI usage strings updated to document --vulkan-device -1 = auto-pick adapter with most free VRAM. - supertonic-bench usage string updated likewise. Co-authored-by: Cursor <cursoragent@cursor.com>
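The auto-pick rules described for `--vulkan-device -1` (argmax free VRAM, ties to the lower index, reserved negatives rejected, out-of-range explicit indices rejected) fit in a small pure function. The sketch below uses a hypothetical name, `pick_vulkan_device_sketch`, and assumes that an empty device list is an error; the real `resolve_vulkan_device_index` may treat that case differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// requested >= 0  : explicit index, range-checked against the device list.
// requested == -1 : auto-pick the device with the most free VRAM.
// requested < -1  : reserved values, rejected.
inline int pick_vulkan_device_sketch(int requested,
                                     const std::vector<std::uint64_t> & free_vram) {
    if (free_vram.empty()) {
        // Assumption: no visible device is treated as a hard error.
        throw std::runtime_error("no Vulkan devices visible");
    }
    if (requested >= 0) {
        if (static_cast<std::size_t>(requested) >= free_vram.size()) {
            throw std::out_of_range("vulkan device index out of range");
        }
        return requested;
    }
    if (requested != -1) {
        throw std::invalid_argument("reserved negative vulkan device value");
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < free_vram.size(); ++i) {
        // Strict '>' keeps the lower index when two devices tie.
        if (free_vram[i] > free_vram[best]) {
            best = i;
        }
    }
    return static_cast<int>(best);
}

// Small helper for tests: true if the call above throws any std::exception.
inline bool pick_throws_sketch(int requested, const std::vector<std::uint64_t> & free_vram) {
    try {
        (void) pick_vulkan_device_sketch(requested, free_vram);
    } catch (const std::exception &) {
        return true;
    }
    return false;
}
```

Keeping the selection logic free of any Vulkan call is what makes it testable on a CPU-only machine with synthetic VRAM tables.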
…hts operator deny-list
Round 6 layers a user-overridable extra deny-list on top of the
existing hand-curated should_materialise_f16_weight() allow-list.
The curated allow-list (Phase 2A) already excludes biases, norms,
embeddings, depthwise convs, and pre-transposed companions; the
round-6 deny-list lets operators force-keep specific additional
tensors as F32 even when --f16-weights is on. Use cases:
- A/B testing: researcher excludes a specific tensor pattern
temporarily without recompiling.
- Hardware-specific drift mitigation: operator pins a problematic
tensor to F32 via config rather than disabling F16 weights
wholesale.
- Future-GGUF safety net: new tensor patterns added in future
GGUFs that the curated allow-list inadvertently scoops in can
be excluded via config without a code change.
Smallest blast radius of the four follow-up rounds — load-time
policy only, runtime dispatch unaffected, zero behaviour change
on the empty-deny-list default path.
Strict TDD discipline (per the user's "double check, don't break
anything" constraint):
- Both new tests committed FIRST.
- Both confirmed to fail to compile on the missing symbols
(predicate test: 'too many arguments to should_materialise_f16_weight';
API test: 'EngineOptions has no member f16_weights_deny_list').
- Implementation written.
- Both tests + every existing unit test re-run; all green.
What changed:
1. 2-arg overload should_materialise_f16_weight(name,
extra_deny_substrings) added alongside the existing 1-arg
version (existing test + call sites unchanged). Substring
matching matches the curated predicate's audit-friendly style;
no regex compile cost or invalid-pattern surface. The deny-
list can only flip true → false, never false → true. Empty
strings inside the deny-list are SKIPPED defensively, not
treated as universal matches (config-typo guard).
2. EngineOptions::f16_weights_deny_list (vector<string>, default
empty) — public API surface. Wired through Engine::Impl →
load_supertonic_gguf → the per-tensor allocation loop.
3. load_supertonic_gguf 7th parameter added at the end of the
signature with a {} default — every existing call site keeps
compiling without modification.
4. supertonic_model::f16_weights_excluded_count counter bumped at
load time when a curated-hot tensor is excluded by the user's
deny-list. Surfaced in bench's human + JSON output so
operators can confirm their config took effect.
5. CLI plumbing: --f16-weights-deny PAT1,PAT2,... flag on
supertonic-cli, tts-cli (chatterbox), and supertonic-bench
(comma-separated substring patterns).
6. Verbose-log line in load_supertonic_gguf when the deny-list is
non-empty (silent on the default path — no visual noise on
existing operator workflows).
Test plan (TDD round 6):
- test-supertonic-f16-weights (UPDATED): existing 36 checks
(positives, negatives, edges) + 29 new round-6 checks across 7
new test functions (empty-list passthrough, matching-deny-
excludes, non-matching-no-op, cannot-promote-cold, multiple-
patterns ANY-match, empty-string defensive skip, empty-name
safety) → 65 / 65 PASS.
- test-supertonic-f16-deny-list-api (NEW): SFINAE compile-time
gate for EngineOptions::f16_weights_deny_list +
load_supertonic_gguf 7th param; runtime defaults check +
assignability + regression guards on every other documented
EngineOptions default → 9 / 9 PASS.
- Whole CPU-only ctest -L unit reports 17 / 17 tests, 0
failures, 0 regressions on round-1/2/3 + audit follow-up + the
baseline tests.
- Smoke-tested supertonic-cli + tts-cli + supertonic-bench
binaries: --f16-weights-deny flag parses correctly, surfaces in
--help output, and threads through to the load layer.
Co-authored-by: Cursor <cursoragent@cursor.com>
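The deny-list contract in this commit (substring match, can only flip true to false, empty patterns skipped) can be sketched in a few lines. The curated predicate below is a stand-in with made-up rules, since the real `should_materialise_f16_weight` allow-list is not reproduced in this PR text; all names ending in `_sketch` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the curated allow-list. The real rules are not shown in the
// PR text; this version just accepts "weight" tensors that are not norms.
inline bool curated_allows_sketch(const std::string & name) {
    return name.find("weight") != std::string::npos &&
           name.find("norm") == std::string::npos;
}

// Two-argument form: the extra deny-list is consulted only for tensors the
// curated predicate already accepted, so it can never promote a tensor.
inline bool should_materialise_sketch(const std::string & name,
                                      const std::vector<std::string> & extra_deny) {
    if (!curated_allows_sketch(name)) {
        return false;
    }
    for (const std::string & pattern : extra_deny) {
        if (pattern.empty()) {
            continue; // an empty pattern would match every name; skip it
        }
        if (name.find(pattern) != std::string::npos) {
            return false;
        }
    }
    return true;
}
```

Skipping empty patterns matters because `std::string::find("")` succeeds on every input, so a stray trailing comma in a config would otherwise disable F16 weights for the whole model.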
…ype K/V flash-attention dispatch
Generalises the round-1 `--f16-attn` boolean (F16 vs F32 only) into a
four-valued enum + `--kv-attn-type {auto,f32,f16,bf16,q8_0}` CLI flag
so operators can opt into BF16 K/V (Vulkan coopmat2 — same bandwidth
as F16, no F16 underflow on small attention scores) or Q8_0 K/V
(Vulkan + half the K/V upload bandwidth) on adapters that advertise
the corresponding capability. Default `auto` falls back to
`--f16-attn` so every existing operator config sees zero behaviour
change.
Strict TDD throughout: Prereq B extends the F16 parity harness to
cover BF16 (4 → 8 checks, 5e-3 abs / 5e-3 rel tolerance band, both
hot shapes) BEFORE touching any production code; new pure-logic
resolver test (`test-supertonic-kv-attn-type`, 106 checks across the
full {-1, 0..3} × legacy × probe-mask matrix); new API-surface
SFINAE lockdown (`test-supertonic-kv-attn-type-api`, 18 checks).
Tests committed first, observed to fail on missing symbols, then
implementation added.
Pure-logic resolver (`resolve_kv_attn_type`) split from the dispatch
site (same pattern as round-3's `resolve_vulkan_device_index`).
Probe-rejected explicit requests fall back to F32 silently
(advisory-probe contract); out-of-range int throws to surface CLI
typos loudly. Vector-estimator dispatch site
(`build_text_attention_cache`) replaces the F16-only cast with a
switch on the enum; cache key promoted from `bool f16_kv_attn` to
`kv_attn_dtype kv_attn_type`. Bench surface adds `(kv_attn_type=…)`
to the human-readable backend line and `"kv_attn_type"` +
`"kv_attn_type_requested"` to the JSON output so log-grep / CI
attribution works across machines.
Bonus: `supertonic-cli`'s arg-parse loop is now wrapped in try/catch
so invalid values surface as a clean `error: ...` line + exit 2
(also fixes the pre-existing latent crash on `--vulkan-device abc` /
`--seed nonsense`).
Whole CPU-only `ctest -L unit` reports 19 / 19 tests, 0 failures, 0
regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
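The resolver contract described above (auto mirrors the legacy boolean, probe-rejected explicit requests fall back to F32, out-of-range integers throw) can be sketched as a pure function. The names below are hypothetical, and the integer codes for BF16 (2) and Q8_0 (3) are assumptions; only `-1` = auto and `1` = F16 are stated in the surrounding text.

```cpp
#include <cassert>
#include <stdexcept>

enum class kv_dtype_sketch { f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

// Results of the per-backend capability probes (hypothetical layout).
struct kv_probe_mask_sketch {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

inline kv_dtype_sketch resolve_kv_sketch(int requested,
                                         bool legacy_f16_attn,
                                         const kv_probe_mask_sketch & probes) {
    switch (requested) {
        case -1: // auto: mirror the legacy boolean, still gated on the probe
            return (legacy_f16_attn && probes.f16) ? kv_dtype_sketch::f16
                                                   : kv_dtype_sketch::f32;
        case 0:
            return kv_dtype_sketch::f32;
        case 1: // explicit requests fall back to F32 when the probe says no
            return probes.f16 ? kv_dtype_sketch::f16 : kv_dtype_sketch::f32;
        case 2:
            return probes.bf16 ? kv_dtype_sketch::bf16 : kv_dtype_sketch::f32;
        case 3:
            return probes.q8_0 ? kv_dtype_sketch::q8_0 : kv_dtype_sketch::f32;
        default: // surface CLI typos loudly
            throw std::invalid_argument("kv_attn_type out of range");
    }
}
```

Because every path is gated on a probe result, a manual override can never force a dtype the backend rejected; whether that is the intended behaviour for debug-shim backends is a design decision for the caller.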
…servability + voice cache + Vulkan env-var passthrough Lowest impact-÷-risk round of the four planned in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching the per-synth hot path beyond a single voice-cache lookup. 1. Voice ttl/dp host cache (`detail::voice_host_cache`). Eliminates 2 sync points / synthesize() after the first per-voice call on Vulkan / OpenCL. Extracted to a standalone helper so the lookup-or-load semantics are testable on CPU without instantiating a full Engine; reference-stability contract documented for the synthesis-pipeline call site. 2. Vulkan env-var passthrough (`apply_vulkan_env_overrides(map)` public helper + `EngineOptions::vulkan_env_overrides` field + `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` / `--vulkan-disable-bfloat16` / `--vulkan-perf-logger` / `--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags on all three binaries). ALL-OR-NOTHING validation: an operator-config typo throws cleanly BEFORE any env var is touched. `set_env_if_unset` semantics so an operator-set env var still WINS over the EngineOptions override. 3. Bench `ggml_backend_synchronize` boundaries (`--no-bench-sync` opt-out). Inserts an explicit backend sync at every per-stage timing boundary so wall-clock attributes to the right stage on async backends. Cheap on CPU; prerequisite for measuring round-5 / 8 / 9 wins on real hardware. 4. Bench per-denoise-step breakdown (`--bench-per-step`). Times each `supertonic_vector_step_ggml` call individually so the first-step (cold pipeline) cost is distinguished from steady-state. Empty array on the default-off path = identical legacy JSON shape. Strict TDD throughout. Two new test executables committed first, observed to fail on missing symbols, then implementation written. 
TDD also caught a real bug: the original env-key validator used `std::string()` empty-as-success sentinel which collided with the empty-string-as-key edge case; the test pinned the contract and forced a `bool / out-param` API fix BEFORE any production wiring went in. Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions (was 19; +2 new tests = 54 new checks). Co-authored-by: Cursor <cursoragent@cursor.com>
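The all-or-nothing validation and set-if-unset semantics described for the env-var passthrough can be sketched with POSIX `setenv`. The function names and the key-validation rule below (non-empty, no `=`) are assumptions made for illustration; the PR's real validator and its accepted key set are not shown here. The validator uses the bool-plus-out-parameter shape that the commit message says replaced the empty-string sentinel.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <stdlib.h> // setenv / unsetenv (POSIX)
#include <string>

// Hypothetical validation rule: a key must be non-empty and contain no '='.
// Returns false and fills *error on failure, so an empty string is never
// overloaded to mean "success".
inline bool validate_env_key_sketch(const std::string & key, std::string * error) {
    if (key.empty()) {
        if (error) *error = "empty environment key";
        return false;
    }
    if (key.find('=') != std::string::npos) {
        if (error) *error = "environment key contains '=': " + key;
        return false;
    }
    return true;
}

// Pass 1 validates every key; pass 2 applies them. If any key is invalid,
// the function returns before a single variable has been touched.
inline bool apply_env_overrides_sketch(const std::map<std::string, std::string> & overrides,
                                       std::string * error) {
    for (const auto & kv : overrides) {
        if (!validate_env_key_sketch(kv.first, error)) {
            return false;
        }
    }
    for (const auto & kv : overrides) {
        // overwrite = 0: a value already set by the operator wins.
        ::setenv(kv.first.c_str(), kv.second.c_str(), 0);
    }
    return true;
}
```

Splitting validation and application into two passes is what gives the all-or-nothing guarantee: a typo in the last entry cannot leave the process with half of the overrides applied.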
…PU bridge Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR tetherto#16's audit follow-up tetherto#6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one function. Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site). Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0` — trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without `vector_rope_theta` continue to take the host- rotate path. The blit primitive parity gate already shipped with PR tetherto#16 (`test-supertonic-graph-to-graph-blit`); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`). Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>
…U bridge Extends the round-8 GPU bridge pattern to the 4 style flash-attn sites (style0 + g1_style + g2_style + g3_style). Largest bandwidth-style optimisation that ships from pure-Supertonic-side code: 120 sync points / synth eliminated on the production Vulkan / OpenCL path (4× the round-8 win). - vector_res_style_qkv_result extended with `sq_gpu / sk_gpu / sv_gpu` GPU handles, populated unconditionally by `run_res_style_qkv_cache` (cheap — no GPU sync; just `ggml_graph_get_tensor` lookups). Same shape as `vector_group_graph_result::q_rope_gpu` etc from the round-1 2C-lite work. - `run_res_style_qkv_cache` host-download gating: the 3 `tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv` are now gated on `trace != nullptr`. Production path skips them entirely. Mirrors the round-1 2C-lite `need_host_qkv = (trace != nullptr)` gate. `post` stays unconditional — consumed by the next-stage `run_style_residual_cache` which still expects a host vector (cross-stage GPU bridge for `post` is deferred). - 4 dispatch sites rewired with the same gating pattern as the round-8 front-block bridge: `!include_ggml_trace && sq_gpu && sk_gpu && sv_gpu` → GPU bridge; otherwise legacy host bridge. Trace mode falls back to the legacy host bridge so the trace harness still gets all the host vectors. Strict TDD: parity test (`test-supertonic-graph-to-graph-blit`) extended with explicit style-shape coverage (`style_sq_L1` trip-wire + clarified `style0_q_rope_L20` / `style0_k_rope_kv50`) BEFORE any production wiring. All 24 / 24 parity checks pass at bit-exact `max_abs = 0.0`. Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>
…t upload-skip After rounds 8 + 9 wired the GPU bridge for the 5 attention sites, the largest remaining per-step host upload is `text_emb` (uploaded to 4 caches × 5 denoise steps = 20 times / synth, but constant data within one synth). Round 10 generalises the F4 pointer-compare upload-skip pattern (already used for `style_v_in` / `kctx_in`) into a reusable `upload_skip_tracker` helper and applies it to the front-block + 3 group caches. CRITICAL CORRECTNESS HAZARD addressed: `text_emb` is a stack-local `std::vector<float>` in `Engine::Impl::synthesize()` (and bench loops). Modern heap allocators (jemalloc / tcmalloc / glibc) very often re-issue the SAME address for the next stack-local vector of the same size — so synth N+1 may have `text_emb.data() == synth_N.text_emb.data()` despite holding completely different data. A naive pointer-compare upload-skip would silently leak prior synth's text-encoder embedding into the next synth's GPU buffer. Mitigation: caller MUST invoke `tracker.reset()` at every synth boundary (`current_step == 0`). The CPU-only TDD test includes an explicit cross-synth pointer-reuse hazard simulation that documents the bug and verifies the reset prevents it. Per-synth wins: - 16 fewer `ggml_backend_tensor_set` host→GPU uploads per synth - ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt length) Strict TDD: `test-supertonic-upload-skip-tracker` (NEW, 7 functions, 41 checks) committed first, observed to fail compile (`upload_skip_tracker was not declared`), then implementation added. Whole CPU-only `ctest -L unit` reports 22 / 22 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>
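The pointer-compare upload-skip pattern and its reset requirement can be shown in a few lines. This is a minimal sketch with a hypothetical type name; the real `upload_skip_tracker` in `supertonic_internal.h` may carry more state.

```cpp
#include <cassert>

// Remembers the last host pointer that was uploaded and reports whether a
// new upload is needed. It compares addresses only, never contents.
struct upload_skip_tracker_sketch {
    const void * last = nullptr;

    bool needs_upload(const void * host_ptr) {
        if (host_ptr != nullptr && host_ptr == last) {
            return false; // same address as the previous upload: skip
        }
        last = host_ptr;
        return true;
    }

    // Must be called at every synth boundary; otherwise a buffer that
    // reuses the previous address would be treated as already uploaded.
    void reset() { last = nullptr; }
};
```

The hazard described above is visible directly: two calls with the same address return `true` then `false` even if the bytes behind the pointer changed in between, and only `reset()` restores the first-call behaviour.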
…PU-bridge layout fix
Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware.
Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.
Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.
The fix (strict TDD):
1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip; open-code the same `ggml_cont(ggml_transpose(...))` at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.
Verification:
| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |
CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.
The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.
Co-authored-by: Cursor <cursoragent@cursor.com>
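The two memory layouts at the heart of the round-11 fix can be made concrete with plain index arithmetic. In ggml, `ne[0]` is the fastest-varying dimension, so `ne=[L, HD]` stores each channel's L time samples contiguously while `ne=[HD, L]` stores each time step's HD channels contiguously. The sketch below (hypothetical function name, plain `std::vector` instead of ggml tensors) converts one into the other, which is what a materialised transpose does to the bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// src is channel-major-flat (ggml ne=[L, HD]): element (t, c) lives at c*L + t.
// The result is time-major-flat (ggml ne=[HD, L]): element (t, c) lives at t*HD + c.
inline std::vector<float> channel_major_to_time_major(const std::vector<float> & src,
                                                      std::size_t L,
                                                      std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[t * HD + c] = src[c * L + t];
        }
    }
    return dst;
}
```

Applying the same conversion a second time with the dimensions swapped returns the original bytes, which is the "transpose-of-the-transpose" effect item 4 warns about: once the graph already emits time-major-flat data, a host helper that transposes again silently hands the attention the wrong orientation.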
1b710d3 to
c383e70
tradingsuit-freddy
left a comment
Review of the Vulkan delta (pr16...pr17). The round-11 RoPE/transpose layout fix looks correctly applied at all four attention sites and the legacy host downloads moved to tensor_raw_f32 consistently. Below are 2 blocking issues I verified line-by-line plus 2 non-blocking risks.
#ifdef GGML_USE_VULKAN
    if (model.backend_is_vk) {
        char desc[256] = {0};
        ggml_backend_vk_get_device_description(opts.vulkan_device < 0 ? 0 : opts.vulkan_device,
BUG (blocking): --vulkan-device -1 (auto-pick) reports the wrong device in every log / bench / JSON line.
backend_name() builds the device label from the raw option, mapping the auto-pick sentinel -1 to 0:
ggml_backend_vk_get_device_description(opts.vulkan_device < 0 ? 0 : opts.vulkan_device, desc, ...);
out += " (device " + std::to_string(opts.vulkan_device < 0 ? 0 : opts.vulkan_device) + ": " + desc + ")";
The index actually chosen by resolve_vulkan_device_index (argmax free VRAM) is never propagated back to opts/model, so on a multi-GPU host --vulkan-device -1 that resolves to device 2 still prints device 0: <wrong name>. That defeats the exact use case the comment promises ("unambiguous when triaging multi-GPU machines"), and supertonic_bench.cpp has the same issue (~line 538), so the bench JSON attributes timings to the wrong adapter.
Suggest storing the resolved index (e.g. model.vulkan_device_resolved) at backend init and using it here instead of opts.vulkan_device.
// No-op for the default `kv_attn_type == -1` path (the
// resolver already mirrors the boolean). Becomes a
// no-op for explicit `--kv-attn-type 1` too.
model.use_f16_attn = (model.kv_attn_type == kv_attn_dtype::f16);
BUG (blocking): --f16-attn 1 no longer forces F16 — the round-1 debug escape hatch was lost in round 4.
The comment at lines 169-170 still states: "Manual override via --f16-attn 1 still forces dispatch (useful for debug-shim backends)." That is no longer true. Round 1 sets use_f16_attn = (opts.f16_attn != 0) (line 175), but round 4 then re-gates it through the probe and overwrites the boolean here:
// resolve_kv_attn_type, case -1 (auto / default kv_attn_type):
if (legacy_use_f16_attn && backend_supports_f16) return kv_attn_dtype::f16;
return kv_attn_dtype::f32;
...
// line 209:
model.use_f16_attn = (model.kv_attn_type == kv_attn_dtype::f16);
So on a backend whose F16-K/V probe returns false (i.e. exactly the "debug-shim backend" the comment targets), --f16-attn 1 silently falls back to F32 and the override is undone. Either the comment is stale and should say the override is probe-gated, or the forced path needs to bypass the probe. Please pick one and align code + comment.
    return n;
}

const backend_capabilities & cached_backend_capabilities(ggml_backend_t backend) {
RISK (non-blocking): capability cache keyed by a raw ggml_backend_t pointer has no invalidation hook.
The process-wide cached_backend_capabilities map keys on the backend pointer, and the surrounding comment already acknowledges pointers can be recycled after ggml_backend_free. There is no invalidation in free_supertonic_model, so if a backend is freed and a new one is allocated at the same address it inherits the previous backend's probe results (wrong use_native_leaky_relu / F16 / weights policy) for the rest of the process. For a long-lived host that loads/unloads multiple models this is a latent correctness bug, not just a perf cache. Suggest evicting the entry on backend teardown (or keying on something stable).
// 4 (skipped) × 3 (groups) × text_len × 256 × 4 bytes. See
// upload_skip_tracker contract in supertonic_internal.h.
if (current_step == 0) cache.text_in_skip.reset();
if (cache.text_in_skip.needs_upload(text_lc_host)) {
RISK (non-blocking): upload_skip_tracker skips host->device uploads via raw pointer compare — silent stale-input hazard.
Cross-synth correctness rests entirely on reset() being called at current_step == 0 (line 1235). The engine/bench loops honor that today, but nothing ties the reset to the upload path in an integration test, and the pointer-compare can be defeated two ways: (a) the allocator reuses a freed text_emb/text_lc_host address for a different encoding, or (b) the buffer is mutated in place with the same data() pointer across steps (the public supertonic_vector_step_ggml API does not forbid it). In both cases the tracker wrongly skips a required upload and the GPU runs on stale input -> silently wrong audio, no crash. Worth a guard (size/contents hash, or a generation counter bumped per synth) and an integration test that exercises a new encoding without the step==0 reset.
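One of the guards suggested above, a generation counter bumped per synth, could look like the sketch below. The type and parameter names are hypothetical, and this is only one possible shape: it protects against allocator address reuse across synths without relying on an explicit reset, but it does not detect in-place mutation within a single generation (case b), for which a contents hash would be needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Skips an upload only when pointer, byte size AND synth generation all
// match the previous upload. Callers bump the generation once per synth.
struct guarded_upload_tracker_sketch {
    const void *  last_ptr        = nullptr;
    std::size_t   last_bytes      = 0;
    std::uint64_t last_generation = 0;
    bool          has_last        = false;

    bool needs_upload(const void * ptr, std::size_t bytes, std::uint64_t generation) {
        const bool same = has_last &&
                          ptr == last_ptr &&
                          bytes == last_bytes &&
                          generation == last_generation;
        if (same) {
            return false;
        }
        last_ptr        = ptr;
        last_bytes      = bytes;
        last_generation = generation;
        has_last        = true;
        return true;
    }
};
```

With this shape a recycled address in the next synth still forces an upload because the generation differs, so correctness no longer depends on every caller remembering the step-0 reset.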
It has been replaced by another PR.
Process / PR-level notes (separate from the inline findings)
Beyond the four inline comments, a few higher-level points worth raising before this lands:
1. Rounds 1–10 never ran end-to-end and CI didn't catch it. The description itself states that without round 11 "every prior round was hitting a latent assertion-failure during the first real synth call," and that the unit test built Q under the wrong shape so the failure was invisible to CI. That means the "22/22 PASS, 0 regressions" across 10 rounds was false confidence: the tests were green while production crashed on the first synth. The CPU-only unit-test strategy has a real gap: it never exercises the GPU path where the bug actually lived. At minimum, a lavapipe (Vulkan-on-CPU) smoke test in CI would gate the GPU contract.
2. No GPU coverage in CI. All Vulkan validation is manual on the author's dev rig (RTX 5090 / RADV / lavapipe); nothing in CI gates the GPU path. Given point 1, that's a significant risk for a +13k-line change on the inference path.
3. Known-broken behaviors are being merged.
4. Size / reviewability + public-API change.
Suggested verdict: block on the two BUG inline comments + a lavapipe smoke in CI (point 1); treat the rest as non-blocking with tracking tickets.
Summary
Brings the Supertonic TTS stage of tts-cpp to functional + tunable parity on the Vulkan backend, layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16). The audit-driven optimisations from #16 are backend-portable by construction, so Vulkan inherits all ~280 host↔GPU sync-point eliminations + the F16-weight roster + the in-graph RoPE / ConvNeXt fusion / GPU↔GPU blit work without modification. This PR adds eleven rounds of Vulkan-specific deltas, each round committed test-first (TDD) with a CPU-only unit gate that locks in the dispatch + capability contract against future regressions.

Rounds 1–6 are dispatch + capability infrastructure (probes, flags, multi-device auto-pick, deny-list, multi-dtype K/V). Rounds 7–10 are observability + per-step sync-point elimination on the GPU bridges. Round 11 is a critical correctness fix that turns the prior 10 rounds from "passes CI" into "actually runs end-to-end on every Vulkan adapter we have." Without round 11, every prior round was hitting a latent assertion-failure during the first real synth call.

Scope vs. PR #16: this PR sits on top of the OpenCL branch (QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic). All Vulkan-specific deltas are restated here; the OpenCL audit work is not. The optimisations layer cleanly because the audit hits the GGML graph layer (backend-portable by construction); Vulkan inherits the wins automatically.

End-to-end validation (on real hardware)
Tested on three Vulkan adapters in one machine — the gold-standard hybrid dev-rig setup:
RTX 5090 per-step breakdown (median over 5 runs, F16 K/V default, post-prewarm):
The round-3/4/7/8/9/10 wins are all in those numbers — round 7's prewarm hides the ~2.3s cold shader-compile, round 8/9/10 eliminate ~166 sync points/synth so the steady-state per-step time is dominated by actual compute rather than host↔GPU bookkeeping.
Net new surface (against PR #16):

- Capability probes: native_leaky_relu, f16_kv_flash_attn, f16_mul_mat, q8_0_kv_flash_attn, bf16_kv_flash_attn, pinned_host_buffer
- Model flags: use_native_leaky_relu, kv_attn_type (joins the round-1 use_f16_attn)
- EngineOptions knobs: vulkan_device, prewarm_text, f16_weights_deny_list, kv_attn_type + 4 Vulkan env-var passthroughs
- CLI flags: --vulkan-device, --prewarm, --f16-weights-deny, --kv-attn-type, --vulkan-prefer-host-memory, --vulkan-disable-coopmat2, --vulkan-disable-bfloat16, --vulkan-perf-logger, --vulkan-async-transfer, --vulkan-env KEY=VALUE, --bench-per-step, --bench-sync, --json-out
- Unit tests under ctest -L unit

Investigation methodology (TDD throughout)
Every round followed the same workflow:
PROGRESS_SUPERTONIC.md + commit.

The CPU-only test strategy is deliberate: a fresh checkout's ctest exercises the dispatch + capability + resolver contracts without needing a Vulkan adapter, so CI on a CPU-only runner catches regressions in the policy layer.

Commit-by-commit walkthrough
33fd5c34 - Round 1: Vulkan bring-up

Foundational Vulkan dispatch + capability probing. The OpenCL bring-up (#16) used model.use_f16_attn = !backend_is_cpu because the chatterbox OpenCL patch unconditionally accepts the F16-K/V op; on Vulkan the HSK % 8 == 0 supports_op gate has to be respected, so the auto-policy needs a probe.

- supertonic_model flags populated at GGUF load: backend_is_vk (informational; appended to the backend-description string) and use_native_leaky_relu (resolved via ggml_backend_supports_op(LEAKY_RELU) against a synthetic node).
- supertonic_backend_supports_f16_kv_flash_attn gates the use_f16_attn auto-policy.
- EngineOptions::vulkan_device int + --vulkan-device N CLI flag plumbed through all three binaries. Range-checked at load (out-of-range = hard error).
- ggml_backend_vk_get_device_description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
- test-supertonic-vulkan-dispatch (29 checks).

d080a1e4 - Pre-existing missing-include fix

tts-cpp/src/chatterbox_tts.cpp used std::atomic<int> without #include <atomic>. One-line fix kept as a separate commit so it's trivially revertable.

e09d4278 - Round 2: capability-cache + 3 probes + prewarm

- cached_backend_capabilities map keyed by ggml_backend_t, guarded by a single std::mutex. Eliminates 3× redundant probe calls per backend.
- supertonic_backend_supports_f16_mul_mat (gates use_f16_weights auto-policy), supertonic_backend_supports_q8_0_kv_flash_attn (forward-compat), supertonic_backend_supports_native_leaky_relu (wraps round 1).
- Engine::warm_up(text) API + EngineOptions::prewarm_text + --prewarm TEXT CLI. Runs one throwaway synth at engine construction so the Vulkan / OpenCL shader pipelines compile up-front; the operator-visible first synthesize() hits steady-state latency. No-op on CPU.
- test-supertonic-capability-cache, test-supertonic-warm-up-api.

8ae15996 - Round 3: multi-device auto-pick + 2 forward-compat probes

- --vulkan-device -1 auto-pick policy: resolve_vulkan_device_index pure-logic helper picks argmax(free_vram) via ggml_backend_vk_get_device_memory(). Tie-break = lower index.
- supertonic_backend_supports_bf16_kv_flash_attn (for coopmat2 on Ampere+ / RDNA3+), supertonic_backend_supports_pinned_host_buffer (for future per-engine input-scratchpad refactor).
- test-supertonic-vulkan-device-select (23 checks).

32703fcd - Round 6: F16-weights operator deny-list

- should_materialise_f16_weight(source_name, deny_list) overload layered on top of the curated allow-list. Each entry is a substring; any match keeps that tensor at its native storage type.
- EngineOptions::f16_weights_deny_list + --f16-weights-deny PAT1,PAT2,... CLI flag (comma-split parser shared between all three binaries).
- test-supertonic-f16-weights extended (+29 checks), test-supertonic-f16-deny-list-api (NEW, 9 checks).

2e1c9468 - Round 4: multi-dtype K/V flash-attention dispatch

Generalises the round-1 F16-only K/V path into a multi-dtype dispatch.

- kv_attn_dtype enum (autoselect, f32, f16, bf16, q8_0) + EngineOptions::kv_attn_type field.
- resolve_kv_attn_type pure-logic helper with full {requested × legacy × probe-mask} behaviour matrix.
- --kv-attn-type CLI flag on all three binaries with parse hardening.
- test-supertonic-kv-attn-type (106 checks), test-supertonic-kv-attn-type-api (18 checks), test-supertonic-f16-attn-parity extended for BF16.

ba6d1749 - Round 7: bench observability + voice cache + Vulkan env-var passthrough

Three independent observability/UX wins shipped together:

- --bench-per-step + --bench-sync + --prewarm (already from round 2) + --json-out FILE: per-denoise-step timings on a single timeline (cold pipeline step[0] distinguishable from steady-state step[1..4]); operator can attribute Vulkan stalls to a specific stage on real hardware without GPU-side profilers.
- --vulkan-prefer-host-memory, --vulkan-disable-coopmat2, --vulkan-disable-bfloat16, --vulkan-perf-logger, --vulkan-async-transfer, --vulkan-env KEY=VALUE: each sets the corresponding GGML_VK_* env var before backend init. Operator-set shell env STILL wins over the CLI override (audit-friendly).
- test-supertonic-vulkan-env-overrides (29 checks).

e8bbc728 - Round 8: front-block attn0 GPU bridge

The single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (run_text_attention_cache_gpu) and wired g1/g2/g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns. Round 8 picks it up: same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Strict gating on front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0: trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth.
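The round-8 gate and its sync-point accounting can be read as a single predicate. The sketch below models each condition as a plain bool, which is an assumption (in the PR they may be flags or GPU tensor handles), and the struct and function names are illustrative only.

```cpp
#include <cassert>

// Assumption: every gate input is modelled as a bool for illustration.
struct attn0_bridge_inputs {
    bool front_in_graph_rope;
    bool include_ggml_trace;
    bool v_gpu_attn0;
    bool k_rope_gpu_attn0;
};

// Mirrors the gate expression quoted in the round-8 description.
inline bool use_gpu_bridge(const attn0_bridge_inputs & in) {
    return in.front_in_graph_rope && !in.include_ggml_trace &&
           in.v_gpu_attn0 && in.k_rope_gpu_attn0;
}

// Sync points avoided per synth when the bridge is active on every step;
// the defaults are the 6-per-step and 5-step figures stated in the text.
inline int sync_points_saved(const attn0_bridge_inputs & in,
                             int per_step = 6, int denoise_steps = 5) {
    return use_gpu_bridge(in) ? per_step * denoise_steps : 0;
}
```

With tracing off and all GPU-side inputs present the helper reports 30; enabling trace mode drops it to 0, matching the fallback to the legacy host bridge.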
df895fd6 - Round 9: style flash-attn GPU bridge

Same pattern as round 8, applied to the 4 style attention sites (front-block style0 + style attentions in g1/g2/g3 caches). Gated Q/K/V host downloads on trace mode in run_res_style_qkv_cache (production path skips them entirely).

Eliminates 3 sync points × 4 sites × 5 denoise steps = 60 GPU→host downloads / synth.
358d7aa8 - Round 10: per-step text-input upload-skip

Generalised the F4 pointer-compare upload-skip pattern (style_v_in/kctx_in in vector_res_style_qkv_cache) into a reusable upload_skip_tracker helper.

Applied to text_in_t on front-block cache + text_in on 3 group caches. Caught and documented a cross-synth pointer-reuse hazard: stack-local text_emb vectors very often re-issue the same address (allocator size-class reuse); the tracker.reset() at synth boundaries prevents the naive pointer-compare from leaking prior-synth GPU data into next-synth attention.

New test test-supertonic-upload-skip-tracker (7 functions, 41 checks) explicitly simulates the cross-synth hazard.

Eliminates 16 redundant uploads / synth (~512 KB at text_len=32, linear in prompt length).
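The pointer-compare pattern and the cross-synth hazard it creates can be shown in a few lines. This is a minimal sketch: `ptr_upload_tracker`, `should_upload`, and `count_uploads` are illustrative names, not the actual `upload_skip_tracker` interface, which is not reproduced in this description.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for the upload-skip idea (hypothetical interface).
struct ptr_upload_tracker {
    const void * last_ptr   = nullptr;
    std::size_t  last_bytes = 0;

    // True when the host buffer must be uploaded again.
    bool should_upload(const void * p, std::size_t bytes) {
        if (p == last_ptr && bytes == last_bytes) return false;
        last_ptr = p;
        last_bytes = bytes;
        return true;
    }

    // Forget the previous buffer; intended for synth boundaries.
    void reset() { last_ptr = nullptr; last_bytes = 0; }
};

// Counts uploads over two synths of n_steps that reuse one buffer address
// while the buffer contents change between synths.
inline int count_uploads(bool reset_between_synths, int n_steps) {
    float text_emb[8] = {};
    ptr_upload_tracker t;
    int uploads = 0;
    for (int synth = 0; synth < 2; ++synth) {
        if (reset_between_synths) t.reset();
        text_emb[0] = float(synth + 1);  // new encoding, same address
        for (int s = 0; s < n_steps; ++s) {
            if (t.should_upload(text_emb, sizeof(text_emb))) ++uploads;
        }
    }
    return uploads;
}
```

With the reset, each synth uploads once and skips the remaining steps. Without it, the second synth never uploads even though the contents changed, which is the stale-input case the reset exists to prevent.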
c383e70d - Round 11: Packed-QK RoPE + GPU-bridge layout fix ⚡ CRITICAL CORRECTNESS

After the IDE-freeze recovery, the first end-to-end synth attempt on real hardware crashed at:

on every backend (CPU + Vulkan RTX 5090 + RADV + lavapipe).

Root cause: apply_rope_to_packed_qk (introduced in PR #16 audit follow-up #5) was written under the assumption that dense_matmul_time_ggml returns an ne=[HD, L] channel-fastest-in-memory tensor. In fact, the matmul (both the CPU cblas_sgemm fast path and the GPU conv1d_f32(K=1) fallback) produces ne=[L, HD] with channel-major-flat memory (data[t + c*L]), the bit-exact transpose of the helper's input contract.

The CPU unit test that landed alongside the helper (test_supertonic_rope_packed_qk.cpp) hand-built Q under the wrong [HD, L] shape, so the failure mode was invisible to CI. Rounds 8/9/10 were ALSO broken: the GPU bridge ggml_backend_tensor_copy(q_src, q_tc_in) would have aborted at ggml_are_same_layout because V (and the style sq/sk/sv, which have no RoPE to mask the layout flip) flowed into the GPU bridge from matmul as channel-major-flat bytes, a layout mismatch against the time-major-flat q_tc_in.

The fix (strict TDD):

- ne=[L, HD] (channel-major-flat memory). Reference built in scalar apply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins y->ne[0] = HD, y->ne[1] = L so the downstream q_tc_in blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then GREEN (14 / 14 checks).
- apply_rope_to_packed_qk head-of-pipeline ggml_cont(ggml_transpose(q)) to flip from ne=[L, HD] channel-major-flat to ne=[HD, L] time-major-flat (which IS the layout q_tc_in expects).
- ggml_cont(ggml_transpose(...)) at the matmul output in build_group_graph_cache, ve_front_block_proj_cache, and build_res_style_qkv_cache × all three sq/sk/sv outputs so all four GPU-bridge attention sites get bit-for-bit matching layouts.
- tensor_to_time_channel(<post-rope-or-v>) to tensor_raw_f32(...). The new graph-side layout puts the bytes already in the time-major-flat shape scalar apply_rope / flash_attention_qkv host references consume, so the raw download is the correct call.

The round-1..10 wins are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.
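The two memory layouts involved in the round-11 fix differ only in index arithmetic. The host-side helper below is a sketch for illustration (the name `channel_major_to_time_major` is hypothetical, and it is not the ggml graph op): it copies a channel-major-flat buffer, indexed `src[t + c*L]`, into time-major-flat order, indexed `dst[c + t*HD]`, which is the reordering the description attributes to `ggml_cont(ggml_transpose(...))`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side illustration only: converts channel-major-flat (src[t + c*L])
// into time-major-flat (dst[c + t*HD]) for L time steps and HD channels.
inline std::vector<float> channel_major_to_time_major(
        const std::vector<float> & src, std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < HD; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[c + t * HD] = src[t + c * L];
        }
    }
    return dst;
}
```

For L=3 and HD=2, an input holding channel 0 as {0, 1, 2} followed by channel 1 as {10, 11, 12} comes out as {0, 10, 1, 11, 2, 12}: each time step's channels become contiguous.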
Test plan
CPU-only: a fresh checkout's ctest -L unit exercises every new contract without needing a Vulkan adapter.

Expected: 22 / 22 tests, 0 failures, 0 regressions.

- test-supertonic-vulkan-dispatch
- test-supertonic-portable-ops (UPDATED)
- test-supertonic-capability-cache
- test-supertonic-warm-up-api: Engine::warm_up
- test-supertonic-vulkan-device-select: resolve_vulkan_device_index behaviour matrix
- test-supertonic-f16-weights (UPDATED)
- test-supertonic-f16-deny-list-api
- test-supertonic-kv-attn-type: resolve_kv_attn_type behaviour matrix
- test-supertonic-kv-attn-type-api
- test-supertonic-f16-attn-parity (UPDATED)
- test-supertonic-vulkan-env-overrides
- test-supertonic-upload-skip-tracker (NEW)
- test-supertonic-rope-packed-qk (REWRITTEN)

Smoke testing the CLIs
Bench JSON includes "kv_attn_type" (resolved), "kv_attn_type_requested" (raw int), and per-step timings so probe misses and per-step variance are attributable in CI/operator triage.

Backwards compatibility
- --vulkan-device 0 semantics unchanged: round 1 introduced the flag; round 3's -1 is opt-in only.
- --f16-weights 0|1 semantics unchanged: round 6's --f16-weights-deny is opt-in only.
- --prewarm defaults to empty (no-op).
- --kv-attn-type defaults to auto, which falls back to round-1's use_f16_attn boolean, so every existing config keeps the round-1 behaviour.
- model.use_f16_attn boolean is still populated and is kept in sync with the round-4 enum (= (kv_attn_type == f16)) so any external code keying on the boolean stays consistent.
- apply_rope_to_packed_qk contract is backwards-incompatible with the old (broken) one, but the old contract never actually worked in production: pre-fix it crashed on every backend. The 14-check test now pins both the input and output contracts so a future regression fails the unit test's shape check.

File-by-file change summary
- tts-cpp/PROGRESS_SUPERTONIC.md
- tts-cpp/CMakeLists.txt
- tts-cpp/include/tts-cpp/supertonic/engine.h: EngineOptions fields + Engine::warm_up()
- tts-cpp/src/supertonic_internal.h: kv_attn_dtype enum, 5 new probes, resolvers, upload_skip_tracker helper, apply_rope_to_packed_qk (round-11 fix)
- tts-cpp/src/supertonic_gguf.cpp
- tts-cpp/src/supertonic_vector_estimator.cpp
- tts-cpp/src/supertonic_engine.cpp: warm_up impl
- tts-cpp/src/supertonic_bench.cpp
- tts-cpp/src/supertonic_cli.cpp
- tts-cpp/src/chatterbox_cli.cpp: tts-cli alias
- tts-cpp/src/chatterbox_tts.cpp: #include <atomic> (pre-existing missing-include fix)

Deferred follow-ups (intentionally out of scope; pre-existing on master)
Tracked in tts-cpp/PROGRESS_SUPERTONIC.md "Deferred work" section.

- argmax(free_vram) policy picks the iGPU on machines like the one we tested (RTX 5090 + AMD RADV) because UMA reports system RAM as free VRAM. Pre-existing in this PR; fix candidate: bias against UMA when a discrete is present. Workaround: explicit --vulkan-device 0.
- test-supertonic-audit3-caches F18 + F19 cache-reuse failures: these pre-existed on master (verified pairwise). Pre-round-11 they were hidden by the rope crash; post-round-11 they're newly observable but neither introduced nor fixable by this PR's content (text encoder for F18; cross-cache state-leak for F19). Both should be wired into CI as a separate ticket; F18/F19 affect the OpenCL build identically.
- VkPipelineCache (chatterbox PROGRESS.md §3.32): recovers ~91 % of cold→warm shader-compilation gap on first warm run, keyed by <vendorID>-<deviceID>-<driverVersion>. This is a ggml-vulkan internal patch (~199 lines) that benefits all Vulkan workloads. Round 7's --prewarm is an in-process workaround.
- latent upload latency.

Linked