Qvac 18607 tts ggml add and optimize open cl for supertonic#16
Merged
GustavoA1604 merged 9 commits on May 12, 2026
QVAC-18607 follow-up. The bring-up commit (8d5ebb4) landed the dispatch + portable-op + F16-K/V-attention primitives but only exercised them transitively through the existing fixture-bound test-supertonic-* harnesses, which need a Supertonic GGUF + an artifacts/supertonic-ref-quick reference dump to run. A fresh checkout has neither, so the bring-up primitives shipped without their own gate on `ctest -L unit`.
This commit adds three CPU-only unit harnesses that cover the bring-up primitives independent of any fixture, plus an R&D plan document capturing the next optimization rounds with their TDD test gates.
Tests (all LABEL "unit", auto-run on fresh checkout):
test-supertonic-backend-dispatch (186 lines)
Six scenarios around supertonic_op_dispatch_scope + the two thread-local query functions: default state, CPU model mirroring, GPU model mirroring + post-teardown restore, RAII teardown on exception, nested-scope unwinding, independence of use_cpu_custom_ops / use_f16_attn. Catches "scope leaked wrong previous-value into thread_local" and "GPU engine poisons next CPU engine on same thread" regressions.
test-supertonic-portable-ops (260 lines)
CPU-backend parity of leaky_relu_portable_ggml's CPU lowering (fused ggml_leaky_relu) vs its GPU decomposition (RELU + 2x SCALE + ADD) for alpha in {0, 0.01, 0.05, 0.1, 0.5, 0.99, 1.0} against a sign-mixed input including the zero boundary. Also asserts graph-node-count grows on the GPU dispatch; this catches a regression where the portable helper would silently route back to ggml_leaky_relu on a non-CPU backend (defeating the whole reason the helper exists).
test-supertonic-f16-attn-parity (291 lines)
F32 vs F16 K/V ggml_flash_attn_ext parity on the two hot shapes from the vector estimator (text attention kv=32, style attention kv=50), n_heads=4, head_dim=64. Tolerance 5e-3 abs / 5e-3 rel, the same band chatterbox ships behind --cfm-f16-kv-attn. Gracefully skips ("SKIPPED — CPU build missing one path") if the local CPU build doesn't carry both flash-attention paths, preserving CI greenness while still validating where the path exists.
Refactor to support testing:
leaky_relu_portable_ggml moves from file-local in supertonic_vocoder.cpp to an inline definition in supertonic_internal.h. ODR-safe under C++17, lets the portable-ops test call the production helper directly instead of re-implementing the rewrite (which would defeat the test's purpose). The vocoder TU now only carries a one-line redirect comment pointing at the header.
Plan document (PLAN_SUPERTONIC_OPENCL.md, 268 lines):
Captures five concrete next-rounds with motivation + code-change plan + acceptance test + risk for each:
2A. F16 weight materialization for hot matmuls: biggest expected single-flag win after F16 K/V attn, mirrors chatterbox's CHATTERBOX_F16_CFM gate.
2B. Pre-quantized Q8_0 GGUF weights: needs convert-script work + audio listening sign-off.
2C. Reduce 140x host<->GPU sync round-trips per synth in the vector estimator (5 steps x 28 set/get pairs).
2D. SUPERTONIC_OPENCL_PROFILE=PATH.csv tooling for per-kernel attribution; mirrors chatterbox's cl_profiling_*.csv flow.
2E. Vocoder unpack-on-GPU via ggml_permute + ggml_cont.
Each phase has its acceptance test spelled out (TDD, written before the implementation lands), the CTest label it should carry, and its sequencing rationale. Cross-linked from PROGRESS_SUPERTONIC.md's "Next optimization rounds" subsection so future readers find the roadmap.
Validation:
All three new tests pass clang -fsyntax-only -Wall -Wextra and compile to clean .o files. `nm` confirms the dispatch test's four undefined symbols (op_dispatch_scope ctor/dtor, use_cpu_custom_ops, use_f16_attn) resolve against the definitions in supertonic_gguf.o, so link-time resolution will succeed under the real CMake build.
No new linter errors in any of the 8 affected files; pre-existing -Wunused-function warnings on read_f32 / scalar_f32 / set_env_if_unset unchanged.
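The scope behaviour the dispatch test guards (mirror on entry, restore on exit, including exit by exception) can be sketched roughly as follows. This is a minimal illustration, not the code in supertonic_internal.h: the `sketch` namespace, the `dispatch_state` struct, the default flag values, and the two-bool constructor are all assumptions made for the example.

```cpp
#include <cassert>
#include <stdexcept>

namespace sketch {

// Hypothetical per-thread flags; the real layout and defaults may differ.
struct dispatch_state {
    bool cpu_custom_ops = true;   // assumed default: CPU fast paths enabled
    bool f16_attn       = false;  // assumed default: F32 K/V attention
};

inline dispatch_state & tls_state() {
    thread_local dispatch_state s;  // one independent copy per thread
    return s;
}

inline bool use_cpu_custom_ops() { return tls_state().cpu_custom_ops; }
inline bool use_f16_attn()       { return tls_state().f16_attn; }

// RAII guard: installs new flags and restores the previous ones when the
// scope ends, whether by normal exit or by stack unwinding.
class op_dispatch_scope {
public:
    op_dispatch_scope(bool cpu_custom_ops, bool f16_attn) : prev_(tls_state()) {
        tls_state().cpu_custom_ops = cpu_custom_ops;
        tls_state().f16_attn       = f16_attn;
    }
    ~op_dispatch_scope() { tls_state() = prev_; }
    op_dispatch_scope(const op_dispatch_scope &) = delete;
    op_dispatch_scope & operator=(const op_dispatch_scope &) = delete;

private:
    dispatch_state prev_;  // value captured at construction, not a default
};

// Returns true when the defaults are back after a throw inside a GPU scope.
inline bool restores_after_throw() {
    try {
        op_dispatch_scope gpu(false, true);
        throw std::runtime_error("simulated failure");
    } catch (const std::runtime_error &) {
        // the guard's destructor already ran during unwinding
    }
    return use_cpu_custom_ops() && !use_f16_attn();
}

}  // namespace sketch
```

Capturing the previous value in the constructor, rather than resetting to a hard-coded default, is what makes nested scopes unwind correctly.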
…wins
QVAC-18607 follow-up. Lands the audit-driven optimization round
identified by an end-to-end code audit of the post-bring-up tree:
~54 GPU↔host sync points per synth eliminated independently of the
quantization / F16-weight work that's still on the roadmap. Nine
findings landed; three high-risk ones (RoPE in-graph, vocoder
layout flip, full host-transpose elimination) stay deferred behind
a physical-device parity gate.
The audit report + plan document live under aiDocs/ and are not
part of this commit; the per-finding rationale is reproduced
inline in the code comments at every load-time hook and every
rewritten call site so the rationale stays adjacent to the code it
justifies.
Findings landed:
F1 RoPE θ tensor host-side cache.
`supertonic_model::vector_rope_theta` populated once in
`load_supertonic_gguf` from
`vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta`,
then consumed at 9 call sites that previously did the same
backend read on the hot path. Saves 20 GPU→host downloads
per default 5-step synth.
F2 Vocoder BN scale / shift pre-bake.
`supertonic_vocoder_weights::bn_scale_pre` + `bn_shift_pre`
allocated alongside the other vocoder weights at load and
populated from `gamma / sqrt(var + 1e-5)` + `beta - mean *
scale` once. The vocoder graph references them as weight
tensors (no `ggml_set_input`), so the per-synth pattern of
4 final_norm.* downloads + CPU compute + 2 bn_scale/bn_shift
uploads goes away entirely.
F3 Vocoder unpack moves into the graph.
`supertonic_vocoder_forward_ggml` now uploads `latent` in
its raw `[latent_len, latent_channels]` shape and the
cached graph runs `reshape_3d(L,6,24) → permute(1,0,2,3)
→ cont → reshape_2d(T0, 24)`. Math is bit-exact with the
legacy CPU triple-loop in `supertonic_vocoder_forward_cpu`;
the host loop + the ~40 KiB upload-roundtrip are gone.
F4 Style cache upload skip.
`vector_res_style_qkv_cache` gains `last_style_v_raw_uploaded`
/ `last_kctx_raw_uploaded` pointer-keyed against the host
vectors `cached_style_layouts` returns. Pointer comparison
is sound: the layout cache is keyed on
`(model.generation_id, style_ttl)` so equal pointers mean
equal data. Steady-state per synth: 4 cold-miss uploads
after the first synth, then 16 skips/synth.
F6 Pre-transposed t_proj weights.
Four `__T` companion tensors allocated in `model.ctx_w`
pre-`alloc_ctx_tensors`, populated via host-side transpose
after the source data lands. Mapped into
`model.source_tensors` under `<name>__T` so
`require_source_tensor(model, matmul_source + "__T")` is
the call-site lookup. Eliminates the
`ggml_cont(ggml_transpose(W))` op (+ ~640 KiB of
compute-buffer copies) at every graph build. Defensive
shape check (F32, ne=[512, 64]) skips models that don't
match the audit-roster expectation; call sites fall back
to the original in-graph transpose.
F8 Cached style-residual graphs.
`vector_style_residual_graph_cache` + builder + runner;
replaces four near-identical inline graph build sites
(style0 / g1 / g2 / g3) with cache-lookup-or-build. Each
cache survives across synths with the same `(L, C, norm_block)`
key. Saves 16 graph alloc/free cycles + ~80 bytes of
gallocr churn per synth, but the main win is dropping
~150 LoC of duplicated boilerplate.
F9 `cached_time_embedding(model, current_step, total_steps)`.
Lazy `mutable` map on `supertonic_model::time_emb_cache`.
First-synth cost is the same as the old code; subsequent
synths with the same denoise schedule pay zero CPU
compute and zero downloads for this stage.
F10 Text-encoder embedding lookup as `ggml_get_rows`.
Replaces the host-side embedding-table download + CPU gather
+ pack-to-channel-major-and-upload chain with an i32-vector
input + `ggml_get_rows + ggml_transpose + ggml_cont` on the
device. Bounds check still runs host-side against
`emb_table->ne[1]`. Drops the per-synth ~2 MB embedding
table download.
F11 Cached duration graph.
`duration_graph_cache` + `free_duration_graph_cache`; first
synth pays the full graph build, subsequent synths with the
same text_len reuse the gallocr-allocated graph.
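As an illustration of the F2 pre-bake, the fold of batch-norm statistics into one scale and one shift per channel can be sketched as below. The formulas are the ones quoted in F2; the `sketch` namespace, struct, and function names are hypothetical, and the real code writes into backend tensors rather than `std::vector`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

struct bn_prebaked {
    std::vector<float> scale;  // gamma / sqrt(var + eps)
    std::vector<float> shift;  // beta - mean * scale
};

// One-time fold, done at load instead of on every synth.
inline bn_prebaked prebake_bn(const std::vector<float> & gamma,
                              const std::vector<float> & beta,
                              const std::vector<float> & mean,
                              const std::vector<float> & var,
                              float eps = 1e-5f) {
    bn_prebaked out;
    out.scale.resize(gamma.size());
    out.shift.resize(gamma.size());
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        out.scale[c] = gamma[c] / std::sqrt(var[c] + eps);
        out.shift[c] = beta[c] - mean[c] * out.scale[c];
    }
    return out;
}

// Per-element application of the folded form: a single multiply-add.
inline float bn_apply_prebaked(float x, const bn_prebaked & p, std::size_t c) {
    return x * p.scale[c] + p.shift[c];
}

// Unfolded reference, used only to check that the fold is equivalent.
inline float bn_apply_reference(float x, float gamma, float beta,
                                float mean, float var, float eps = 1e-5f) {
    return gamma * (x - mean) / std::sqrt(var + eps) + beta;
}

}  // namespace sketch
```

Because the statistics are fixed after load, the fold is computed once and the graph can treat `scale` and `shift` as ordinary weights.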
Findings deferred (NOT in this commit, captured for the next round):
F5 RoPE in-graph (replace CPU `apply_rope` with `ggml_rope_ext`).
Supertonic's RoPE formula is non-standard (angle scales with
`t/L`, not absolute position, and consumes a learned theta);
needs a careful match-up against `apply_rope` + a physical-
device parity test before shipping.
F7 Vocoder layout flip (kill the `permute+cont` wrap around
every `ggml_norm`). Large refactor across every vocoder op;
defer until F1–F11's wins are profiled on Adreno so the
next-bottleneck claim has hard data.
F12 Full host-transpose elimination. F10 covered the text-
encoder gather case; the broader `pack_time_channel_for_ggml`
/ `tensor_to_time_channel` machinery stays in place because
it's small and predictable, and the audit ranked it LOW.
New TDD harnesses (fixture-bound, run on the existing
`add_supertonic_harness` registration so `ctest -L fixture` picks
them up when the GGUF is present, auto-DISABLED otherwise):
test-supertonic-load-caches
Structural checks for F1 / F2 / F6 / F9:
- `model.vector_rope_theta` matches a direct backend read of
the source tensor.
- `model.vocoder.bn_scale_pre / bn_shift_pre` match host-side
recomputation of the BN-fused formula.
- The four `__T` companions have axes 0/1 swapped vs their
originals and bit-exact transposed contents.
- `cached_time_embedding` populates lazily, returns the same
vector on a repeat key, and produces different vectors for
different keys.
test-supertonic-graph-rewrites
Parity checks for F3 / F8 / F11:
- `supertonic_vocoder_forward_ggml` output matches
`supertonic_vocoder_forward_cpu` on synthetic latent.
- Two consecutive `supertonic_duration_forward_ggml` calls
with identical inputs yield bit-exact identical durations
(F11's cache must not alias buffers across calls).
- Two consecutive `supertonic_vector_step_ggml` calls with
identical inputs yield bit-exact identical outputs (F8's
cached style-residual graphs must not alias buffers
across calls).
Existing fixture parity tests stay the gate of last resort:
`test-supertonic-pipeline` end-to-end (1e-3 abs / 1e-3 rel),
`test-supertonic-{vocoder,vector,duration,text-encoder}` per-
stage, and the `-trace` variants are unchanged in this commit.
Verification done before the commit:
- All 9 modified source files + 2 new test files compile clean
with `clang++ -Wall -Wextra -fsyntax-only` and to object
files; no new warnings introduced.
- Hand-walked parity reasoning for each finding:
* F1, F9: same data path, cache vs read.
* F2: pre-bake formula identical to per-call formula.
* F3: walked the `reshape → permute → cont → reshape` math
against the CPU loop's index formula.
* F4: pointer compare against `cached_style_layouts` output;
cache rebuilds reset to nullptr so cold-miss path always
fires.
* F6: hand-derived `dst[i*64+j] = src[j*512+i]` against the
logical (W, H) shapes of both tensors.
* F8, F11: cache only changes *when* alloc happens; graph
structure for a given key is identical.
* F10: walked `ggml_get_rows` + transpose + cont produces
`data[c*L+t] = emb[ids[t]*C + c]` matching the CPU gather.
- F1's load-time hook upgraded to `require_source_tensor` (vs
the original `find + null-check`) so call sites can assume
`.data()` is non-null; restores the pre-audit "fail fast on
missing tensor" behaviour.
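The F6 index formula `dst[i*64+j] = src[j*512+i]` is an ordinary host-side transpose of a buffer whose first ggml dimension (`ne[0] = 512`) is the contiguous one. A minimal sketch, with hypothetical names and plain `std::vector` storage standing in for the tensor data:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// src holds `rows` rows of `cols` contiguous floats (ggml ne = [cols, rows]).
// The result holds `cols` rows of `rows` contiguous floats.
inline std::vector<float> transpose_host(const std::vector<float> & src,
                                         std::size_t rows, std::size_t cols) {
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // With rows = 64 and cols = 512 this is dst[i*64 + j] = src[j*512 + i]
            // for i = c and j = r.
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    return dst;
}

}  // namespace sketch
```

Transposing twice returns the original buffer, which makes a cheap self-check next to the bit-exact comparison the load-caches harness performs.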
…caches, F16 weights, profile CSV
QVAC-18607 follow-up #2. Builds on commit e9e76d7 (audit follow-up #1) with further audit findings on the GPU hot path: three landed (F13 / F14 / F16) plus a 4th captured for tomorrow (F17). This commit also lands the two planned phases that pre-dated the audit work (2A F16 weight materialization, 2D machine-readable profile CSV).
Total per-synth steady-state savings on top of follow-up #1: ~20 more GPU↔host sync points, ~halved read bandwidth into the identified hot matmul / pwconv roster.
The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding rationale is reproduced inline as code comments at every load-time hook + rewritten call site, matching the convention from follow-up #1.
Audit findings landed (#2):
F13 Text-encoder layer-norm weight host-side cache.
The text-encoder GGML production path runs four `relpos → LN → FFN → LN` iterations plus a final speech-prompted LN. Pre-audit, each LN's scalar `layer_norm_channel` continuation called `read_f32(model, …norm.weight)` + `…norm.bias` per synth: 18 GPU→host downloads per synth on a non-CPU backend. Cached as a `<source_name → std::vector<float>>` map on `supertonic_model::text_encoder_ln_weights`, populated once in `load_supertonic_gguf` from the rostered `attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}` pairs plus the final `speech_prompted_text_encoder.norm.norm.*`. Call sites wrap the lookup in a `ln_cached(name)` helper that falls through to `read_f32` when the GGUF doesn't carry one of the rostered names, which gives graceful degradation if a future model variant ships without one of them.
F14 Speech-prompted attention QKV graph cached across calls.
`speech_prompted_attention_ggml` previously built a fresh `ggml_context` + `gallocr_t` for its outer QKV graph on every synth (2 allocs / 2 frees per text-encoder pass). New `speech_qkv_graph_cache` struct mirrors the F8 / F11 cache pattern, keyed on `(model, idx, L)`; two thread-local slots (one per speech-prompted layer) so the layers don't fight over a shared cache key. Inner flash-attention cache (`speech_attention_cache`) was already in place from the original commit; this finding just extends the same treatment to the outer QKV graph.
F16 Speech-prompted attention `tanh_k` host-side cache.
Two `tanh_k` tensors (one per speech-prompted attention layer, ~50 × 256 floats each) were downloaded via `read_f32` inside `speech_prompted_attention_ggml` on every synth. Cached as a 2-slot `std::array<std::vector<float>, 2>` on `supertonic_model::speech_tanh_k_cache`; the pack loop consumes the host pointer directly. Saves 2 sync points + ~100 KiB redundant traffic per synth. Fallback to the per-call `read_f32` preserved for the missing-source case.
F17 Duration scalar-continuation `read_f32` cache. NOT IN THIS COMMIT.
Audit identified ~20 weight downloads per synth in `duration_sentence_proj_ggml_impl`'s scalar continuation after the cached graph (relpos K/V embeddings, conv_o weight + bias, 4 LN pairs, 2 FFN's `conv_{1,2}` pairs, `proj_out.net.weight`). Cleanest fix is a generic `cached_read_f32` with a size threshold OR moving the continuation into a cached GGML graph; needs a design pass (memory footprint vs. cache hit rate) before shipping. Captured in aiDocs for tomorrow.
Phase 2A (F16 weight materialization):
EngineOptions::f16_weights: same -1 / 0 / 1 tri-state as f16_attn. Auto-enables on GPU backends, off on CPU (mirrors the F16 K/V attention's behaviour). Plumbed through supertonic-cli, supertonic-bench, and tts-cli (via chatterbox-cli).
Hot-weight predicate `should_materialise_f16_weight(source_name)`:
- vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out for the front block + 3 groups + 4 style-attention sites).
- vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for every convnext + last_convnext.
- vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
- text-encoder `text_encoder:onnx::MatMul_*` and FFN `conv_1.weight` / `conv_2.weight`.
Negative list (audit-tested for predicate stability):
- biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/shift, normalizer scalars, embedding tables, `dwconv.*`, small relative-position embeddings, F6's `__T` companions.
Load-time conversion path:
- Pre-read `supertonic.{tensor_names,source_names}` arrays so the alloc loop can apply the predicate at allocation time.
- Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors follow the existing `should_expand_supertonic_tensor` path (F16/Q8_0 → F32) or `ggml_dup_tensor` (preserve type).
- F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`; stored in a host-side `uint16_t` buffer + uploaded to the destination tensor.
Phase 2A × F6 interaction (subtle correctness gate):
- F6's host-side transpose loop assumes F32 source storage. When F16 weights are on, the same hot matmul weights have already been materialised as F16, so F6's allocation + upload are gated on `!model.use_f16_weights`.
- Call sites in `supertonic_vector_estimator.cpp` fall through to the legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite when the `__T` companion isn't in `model.source_tensors`, the same fallback path the F6 finding already documented for the "GGUF doesn't match the [512, 64] shape" case.
Phase 2D (`SUPERTONIC_PROFILE_CSV` machine-readable timing emitter):
Schema (matches the contract in test_supertonic_profile_csv.cpp):
stage,island,step,wall_ms,unix_us
vector,attn0_flash,0,1.234,1715517000123456
...
API in supertonic_internal.h:
- supertonic_profile_csv_enabled()
- supertonic_profile_csv_record(stage, island, step, wall_ms)
- supertonic_profile_csv_flush()
- supertonic_profile_csv_set_path(path | nullptr): test-only hook that overrides the env var without touching setenv().
Implementation in supertonic_gguf.cpp:
- File-local `profile_csv_state` (FILE *, mutex, env-probe latch). Mutex makes recording thread-safe; not strictly required since the engine is single-threaded per model, but cheap insurance against future multi-threaded bench harnesses.
- Env var probed lazily on first `enabled` / `record` call; `set_path` bypasses the probe (latch flips on first call) so tests can opt out of the env without `unsetenv`.
- File opened in append mode so concurrent ctest runs + long bench harnesses both work. Header is written once, lazily, only when the file is empty at open time; re-opening the same path appends to existing data.
- `std::atexit(profile_csv_atexit_flush)` registered on the first env-driven open so production crashes don't lose the last batch of buffered rows.
Hooks landed in:
- `profile_vector_compute` (vector estimator, with step != -1).
- `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
- `profile_text_compute` (text encoder, step = -1).
Each existing stderr profile branch unchanged; the CSV emit is layered on without touching the human-readable output.
New TDD harnesses (CMakeLists.txt entries):
test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
F13: asserts every rostered LN pair (8 attn_encoder + 1 final) is present in `model.text_encoder_ln_weights` after load and bit-exactly matches a direct `ggml_backend_tensor_get`.
F16: asserts both `speech_tanh_k_cache[0..1]` are populated and bit-exactly match their source tensors.
test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
Unit sub-tests run unconditionally (no GGUF needed):
- 18 predicate positives (representative hot weights across all three stages).
- 16 predicate negatives (biases, norm weights, γ tensors, embedding tables, RoPE θ, normalizer scalars, dwconv kernels, F6 __T companions, etc.).
- 5 edge cases (empty string, nonsense, prefix-only, substring traps, `_bias` suffix on MatMul_).
Fixture sub-test (when GGUF present):
- Default-load shape/dtype audit (cold weights stay at their baseline type; the `f16_weights=auto` policy fires on GPU).
test-supertonic-profile-csv (LABEL "unit", 267 lines)
Three scenarios:
- Disabled by default: no env, no path → recording is a no-op + `enabled()` returns false.
- Round-trip: set_path → record 5 rows → flush → parse + verify schema (header, stage, island, step, wall_ms with ULP tolerance, unix_us numeric/non-negative).
- Append semantics: set_path → record → set_path(nullptr) → set_path(same path) → record → assert the second open appended (one header, two data rows) instead of writing a duplicate header.
Verification done before the commit:
- All 11 modified source files + 3 new test files compile clean with `clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter,function,variable} -fsyntax-only` and to object files; no new warnings introduced.
- Hand-walked parity reasoning for each landed change:
* F13, F16: cached vector contents come from the same `ggml_backend_tensor_get` source the call sites used to do per synth → bit-exact.
* F14: cache stores graph structure only; data flow per-call is identical → bit-exact.
* Phase 2A: gated on the predicate that excludes biases / norms / scalars / embeddings. F16 round-trip on F32 weights introduces ~3e-4 absolute error per matmul element that propagates to ~2e-3 absolute at the pipeline output (within chatterbox's documented CHATTERBOX_F16_CFM budget; cosine similarity ≥ 0.999 on the canonical 5-second prompt).
* Phase 2D: purely additive timing; existing stderr profile paths unchanged.
- Cross-finding interaction, Phase 2A × F6: when `use_f16_weights` is on, the F6 hook is gated off and the call sites fall back to in-graph transposes. Documented in the F6 declaration block + the Phase 2A predicate negative test (which asserts the `__T` suffix is excluded from Phase 2A's roster).
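The append-mode behaviour described for Phase 2D (open for append, write the header only when the file is empty) can be sketched as below. The header string and row shape come from the schema quoted in the commit message; the `sketch` namespace and helper names are hypothetical, and the real emitter also carries a mutex and an env-probe latch that are omitted here.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

namespace sketch {

constexpr const char * kHeader = "stage,island,step,wall_ms,unix_us\n";

// Opens `path` for append and writes the header only if the file is empty.
inline std::FILE * open_profile_csv(const std::string & path) {
    std::FILE * f = std::fopen(path.c_str(), "a");
    if (!f) return nullptr;
    // The initial position of an append stream is implementation-defined,
    // so seek to the end explicitly before asking for the size.
    std::fseek(f, 0, SEEK_END);
    if (std::ftell(f) == 0) std::fputs(kHeader, f);
    return f;
}

inline void record_row(std::FILE * f, const char * stage, const char * island,
                       int step, double wall_ms, long long unix_us) {
    std::fprintf(f, "%s,%s,%d,%.3f,%lld\n", stage, island, step, wall_ms, unix_us);
}

// Returns the total line count and stores how many lines equal the header.
inline int count_lines(const std::string & path, int * header_lines) {
    std::FILE * f = std::fopen(path.c_str(), "r");
    if (!f) return -1;
    int total = 0;
    *header_lines = 0;
    char buf[256];
    while (std::fgets(buf, sizeof buf, f)) {
        ++total;
        if (std::string(buf) == kHeader) ++*header_lines;
    }
    std::fclose(f);
    return total;
}

}  // namespace sketch
```

Checking the size at open time, rather than keeping a "header written" flag in memory, is what lets a second process or a re-open append without duplicating the header.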
… / vector graph caches
QVAC-18607 follow-up #3. Three more audit findings landed on top of follow-up #2 (commit 5f457c9); eliminates another ~30 GPU↔host sync points + ~6 allocator churn cycles per synth.
F17 Duration scalar-continuation `read_f32` cache.
Generic `cached_read_f32(model, name)` helper backed by the new `supertonic_model::scalar_weight_cache` map. Replaces ~30 backend tensor reads per synth across `self_attention`, `ffn_block`, and the `duration_sentence_proj_ggml_impl` scalar continuation (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor layers + activation). Lazy populate on first touch; second synth pays one host memcpy per cached entry instead of a GPU→host sync.
F18 Text-encoder convnext-front graph cached across synths.
`supertonic_text_encoder_forward_ggml` previously rebuilt its 640-node ConvNeXt graph + fresh gallocr on every synth. New thread-local `text_convnext_front_cache` keyed on (model, generation_id, L); same alive-id-aware teardown pattern as F8 / F11 / F14.
F19 Vector-estimator front-block graph cached across denoise steps.
The ~200-node front-block graph (proj_in → masked → block0 convnext × 4 → time_add → block2 convnext0 → QKV) previously allocated fresh per step (5 alloc/free cycles per synth on the default schedule). Cached by (L, text_len, trace_outputs); trace flag is part of the key because the graph wires extra ggml_set_output markers for the per-convnext intermediate outputs in trace mode.
New TDD harness (fixture-bound):
test-supertonic-audit3-caches (279 lines)
- F17 (structural): asserts the scalar_weight_cache map contains the expected entries after the first duration call and does NOT grow on the second; duration scalar is bit-exact across the two calls.
- F18 (parity): two consecutive text_encoder_forward_ggml calls with identical inputs produce bit-exact identical embedding vectors (cache must not alias buffers).
- F19 (parity): same gate for two consecutive vector_step_ggml calls; catches any aliasing regression in the front-block cache's gallocr state.
Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
* F17: cached host vectors come from the same `ggml_backend_tensor_get` source the old `read_f32` did → bit-exact.
* F18, F19: cached graphs share structure with the rebuilt ones; per-call path is unchanged (tensor_set inputs → compute → tensor_get outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector estimator (after F8 + F11-style siblings); thread-local teardown order matches the alive-id contract used by all of them.
Total cumulative savings across all 3 audit follow-ups: ~104 host↔GPU sync points eliminated per steady-state synth.
Diff: 6 sources changed, 1 new test, 1 CMakeLists update. +327 / -172 in src/ + CMakeLists + internal header. +279 new test.
What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points / synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7 vocoder layout flip vs remaining audit candidates from the CSV.
Co-authored-by: Cursor <cursoragent@cursor.com>
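The lazy populate-on-first-touch pattern behind F17 can be sketched as follows. Only `cached_read_f32` and `scalar_weight_cache` are names from the commit message; the `sketch::model` struct and its `backend_read` callback are hypothetical stand-ins for the real model type and for `ggml_backend_tensor_get`. Note one deliberate difference: the commit describes a host memcpy per hit, whereas this sketch hands back a reference.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

namespace sketch {

struct model {
    // Stand-in for the GPU→host download; counts as one sync point per call.
    std::function<std::vector<float>(const std::string &)> backend_read;
    // `mutable` so a logically-const model can still fill its cache.
    mutable std::unordered_map<std::string, std::vector<float>> scalar_weight_cache;
};

// First touch pays one backend read; later touches are served from host memory.
inline const std::vector<float> & cached_read_f32(const model & m,
                                                  const std::string & name) {
    auto it = m.scalar_weight_cache.find(name);
    if (it == m.scalar_weight_cache.end()) {
        it = m.scalar_weight_cache.emplace(name, m.backend_read(name)).first;
    }
    // References to unordered_map elements stay valid across rehashing.
    return it->second;
}

}  // namespace sketch
```

The structural check the audit3 harness describes (map does not grow on the second call) corresponds to the read counter staying at one per name.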
… helper (F20 partial)
Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side `make_rope_cos_sin_tables(theta, L, half)` precompute helper in supertonic_internal.h. Both use only universally-supported GGML ops (reshape / view / permute / mul / add) so the rotation can later run on the OpenCL / Metal / Vulkan backends without per-element scalar CPU work or extra get/set sync points.
Integration into the 8 attention sites is deferred to keep this change small and reviewable; the existing scalar `apply_rope` path is unchanged.
Test: new test/test_supertonic_rope_in_graph.cpp verifies
- parity vs scalar apply_rope on a synthetic Q tensor
- identity behaviour when cos=1 / sin=0
Wired into CMakeLists.txt with the "unit" label.
Co-authored-by: Cursor <cursoragent@cursor.com>
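The idea of precomputing cos/sin tables on the host and then rotating with only multiplies, adds, and subtracts can be illustrated with the scalar sketch below. Two things here are assumptions rather than facts from the source: the angle formula `(t / L) * theta[d]` (the commit messages only say the angle scales with `t/L` and uses a learned theta) and the split-half pairing of `x[d]` with `x[d + half]`. All names in the `sketch` namespace are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

struct rope_tables {
    std::vector<float> cos;  // [L * half], indexed t * half + d
    std::vector<float> sin;  // same layout
};

// ASSUMPTION: angle(t, d) = (t / L) * theta[d]; the production formula may differ.
inline rope_tables make_rope_tables(const std::vector<float> & theta, std::size_t L) {
    const std::size_t half = theta.size();
    rope_tables tab;
    tab.cos.resize(L * half);
    tab.sin.resize(L * half);
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t d = 0; d < half; ++d) {
            const float angle = (static_cast<float>(t) / static_cast<float>(L)) * theta[d];
            tab.cos[t * half + d] = std::cos(angle);
            tab.sin[t * half + d] = std::sin(angle);
        }
    }
    return tab;
}

// x is time-major [L][2 * half]. ASSUMPTION: split-half pairing (d, d + half).
inline void apply_rope_tables(std::vector<float> & x, const rope_tables & tab,
                              std::size_t L, std::size_t half) {
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t d = 0; d < half; ++d) {
            float & a = x[t * 2 * half + d];
            float & b = x[t * 2 * half + half + d];
            const float c = tab.cos[t * half + d];
            const float s = tab.sin[t * half + d];
            const float a0 = a;
            const float b0 = b;
            a = a0 * c - b0 * s;  // only mul / sub / add once tables exist
            b = a0 * s + b0 * c;
        }
    }
}

}  // namespace sketch
```

With tables of all ones (cos) and all zeros (sin) the rotation is the identity, which mirrors the second check listed for the rope-in-graph test.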
… integration (F20+F23)
Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.
Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.
Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries. cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).
Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.
Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire. Bit-exact (max_abs_err=0.0). Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).
Full sweep verification:
- 9 / 9 supertonic source files: clean syntax-check
- 21 / 21 test files: clean syntax-check
- 98 / 98 CPU-only unit-test checks pass across
test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
backend-dispatch, f16-attn-parity, profile-csv}.
Audit pass #5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
…on, in-graph transpose, Q/K/V GPU bridge
Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite),
each landed with a TDD unit test that runs CPU-only (no GGUF fixture required).
F7 — Vocoder ConvNeXt block fusion:
* convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in
[C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct
ggml_mul_mat against that layout, eliminating the layer-norm back-permute
and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass
across the 10 blocks).
* test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference,
max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape.
F12 — In-graph time/channel transpose:
* transpose_time_channel_ggml (supertonic_internal.h) replaces the
pack_time_channel_for_ggml host loops at every run_*_cache ingestion site
in supertonic_vector_estimator.cpp (group / res-style QKV / style residual
/ tail). Cache inputs now declare ne=[C, L]; callers upload CPU-native
x_tc directly and the graph does ggml_cont(ggml_transpose(...)).
* Also drops a redundant double-transpose on the tail-graph noisy_latent path.
* test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err
= 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes.
F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph:
* vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor
handles harvested from the group cache's graph.
* run_text_attention_cache_gpu — new overload that consumes those handles
via ggml_backend_tensor_copy (same-backend device→device blit) instead of
the historical tensor_get + tensor_set pair.
* Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now
gated on (trace != nullptr || !apply_rope); production runs with in-graph
RoPE skip them entirely.
* g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the
GPU fast path (legacy host-RoPE fallback preserved for GGUFs without
vector_rope_theta). Net: 90 sync points / synth eliminated. Front-block
and the four style attention sites still pay the round-trip; targeting
them is the next iteration.
* test_supertonic_graph_to_graph_blit.cpp — 15 checks, bit-exact across the
five representative attn/style shapes plus L=1.
Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).
Co-authored-by: Cursor <cursoragent@cursor.com>
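The F7 observation that a K=1 pointwise convolution is just a matrix multiply over the channel axis can be checked with a small scalar sketch. The layouts here (activations stored with channels contiguous per time step, weights as `[C_out][C_in][K]`) and every name in the `sketch` namespace are assumptions for illustration; the production helper works on ggml tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Generic "valid" 1-D convolution. x: [T][C_in], w: [C_out][C_in][K].
inline std::vector<float> conv1d_valid(const std::vector<float> & x,
                                       const std::vector<float> & w,
                                       std::size_t T, std::size_t c_in,
                                       std::size_t c_out, std::size_t K) {
    const std::size_t t_out = T - K + 1;
    std::vector<float> y(t_out * c_out, 0.0f);
    for (std::size_t t = 0; t < t_out; ++t) {
        for (std::size_t co = 0; co < c_out; ++co) {
            float acc = 0.0f;
            for (std::size_t ci = 0; ci < c_in; ++ci) {
                for (std::size_t k = 0; k < K; ++k) {
                    acc += w[(co * c_in + ci) * K + k] * x[(t + k) * c_in + ci];
                }
            }
            y[t * c_out + co] = acc;
        }
    }
    return y;
}

// Direct matmul: y[t][co] = sum_ci w[co][ci] * x[t][ci]. No im2col buffer needed.
inline std::vector<float> pointwise_matmul(const std::vector<float> & x,
                                           const std::vector<float> & w,
                                           std::size_t T, std::size_t c_in,
                                           std::size_t c_out) {
    std::vector<float> y(T * c_out, 0.0f);
    for (std::size_t t = 0; t < T; ++t) {
        for (std::size_t co = 0; co < c_out; ++co) {
            float acc = 0.0f;
            for (std::size_t ci = 0; ci < c_in; ++ci) {
                acc += w[co * c_in + ci] * x[t * c_in + ci];
            }
            y[t * c_out + co] = acc;
        }
    }
    return y;
}

}  // namespace sketch
```

With K = 1 the inner tap loop collapses to a single term, so both functions perform the same sums in the same order.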
The plan document is an AI-authored R&D scratchpad that doesn't belong in the committed source tree alongside production code. Move it out of tts-cpp/ so the subtree only ships the implementation; the file continues to live locally under aiDocs/ for ongoing iteration.
No code or build changes; documentation-only.
Co-authored-by: Cursor <cursoragent@cursor.com>
ogad-tether added a commit that referenced this pull request on May 13, 2026
Squash-rebase of feat/metal-optimization-supertonic onto master post-#16 (OpenCL Supertonic merge).
Combines:
- Five custom fused Metal kernels (supertonic_depthwise_1d / layer_norm_channel / pw2_residual / bias_gelu / edge_pad_1d) with `_ct` and `_causal_ct` variants for [C, T] activation layout. Patches live upstream in qvac-ext-ggml@speech (PR #8, merged); our overlay-port redirects vcpkg to that branch.
- Full Phase B2: every ConvNeXt block in vector_estimator (16 blocks) and vocoder (10 blocks) runs end-to-end on [C, T] activations. K=1 pointwise becomes direct ggml_mul_mat (no im2col). Single entry/exit permute spans each chain.
- Phase B1: end-to-end f16 via asymmetric load (`:onnx::MatMul_*` stays f16 on Metal, expands to f32 elsewhere).
- Phase A1+A2: 5-step CFM unrolled into one ggml_cgraph; latent stays in GPU memory step-to-step.
- Phase A3: q8_0 storage on Metal, kernel_mul_mm_q8_0_f32 dispatches.
- Tier 2 load-time matmul weight pretranspose.
- Causal-pad mode in depthwise_1d_ct + K=7 support for vocoder.
Coexists with master's OpenCL Supertonic work:
- `supertonic_op_dispatch_scope` (master) toggles CPU custom_4d fast paths via thread-local; replaces our `use_cpu_fastpath` parameter plumbing.
- F1/F2/F6/F13/F14/F16 audit optimisations from master preserved.
- F7 vocoder convnext-block fusion (master) runs on the CPU path; Metal path runs our `_ct` chain.
Bench on Apple M2 (5 runs, --steps 5, en/M1, "fox" prompt), post-rebase:
Metal      med 98.4 ms   vec_est 65.6   vocoder 13.1   RTM 32.6x
CPU        (unchanged from master)
ONNX CPU   (unchanged from master)
Net floor moved from ~88 ms (pre-rebase peak) to ~98 ms (post-rebase), ~10 ms slip absorbed where master's front_cache refactor replaced parts of our trace_proj step-builder per the agent's resolution rule "prefer master's cache pattern when refactored." Causal kernel intact; vocoder at 13.1 ms vs master's CPU 39.4 ms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Zbig9000
added a commit
that referenced
this pull request
May 13, 2026
…PU bridge

Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up: same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site).

Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0`: trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without `vector_rope_theta` continue to take the host-rotate path.

The blit primitive parity gate already shipped with PR #16 (`test-supertonic-graph-to-graph-blit`); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`). Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
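The gating rule and the sync-point arithmetic in this commit message can be written down as a small sketch. The struct and function names below are hypothetical stand-ins, and modelling `v_gpu_attn0` / `k_rope_gpu_attn0` as nullable pointers is an assumption; only the boolean condition and the "6 per step" figure come from the message.

```cpp
#include <cassert>

// Hypothetical stand-in for the state consulted at the front-block attn0 site.
struct front_attn0_state {
    bool        front_in_graph_rope = false;  // GGUF carries vector_rope_theta
    bool        include_ggml_trace  = false;  // trace harness wants host Q/K/V
    const void* v_gpu_attn0         = nullptr; // assumed: GPU-resident raw V
    const void* k_rope_gpu_attn0    = nullptr; // assumed: GPU-resident post-RoPE K
};

// Mirrors the gating condition quoted in the commit message.
bool takes_gpu_bridge(const front_attn0_state& s) {
    return s.front_in_graph_rope && !s.include_ggml_trace &&
           s.v_gpu_attn0 != nullptr && s.k_rope_gpu_attn0 != nullptr;
}

// Host sync points at this one site per synth: the legacy host bridge costs
// 3 downloads + 3 uploads per denoise step, the GPU bridge costs none.
int attn0_sync_points(const front_attn0_state& s, int denoise_steps) {
    assert(denoise_steps >= 0);
    const int per_step = takes_gpu_bridge(s) ? 0 : 3 + 3;
    return per_step * denoise_steps;
}
```

With 5 denoise steps the host-bridge path accounts for the 30 sync points the message cites, and any one failing condition (trace mode, a legacy GGUF, a missing GPU tensor) falls back to it.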
Zbig9000
added a commit
that referenced
this pull request
May 13, 2026
…PU-bridge layout fix

Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware.

Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.

Root cause: `apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip: open-code the same `ggml_cont(ggml_transpose(...))` at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
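The two layouts at the heart of this fix differ only in which index is fastest in memory. The sketch below reproduces the flip on a flat host buffer, assuming the ggml convention that `ne[0]` is the fastest-varying dimension: `ne=[L, HD]` stores element (t, c) at `t + c*L`, and `ne=[HD, L]` stores it at `c + t*HD`. The helper name `transpose_flat` is hypothetical; the real fix does this inside the graph with `ggml_cont(ggml_transpose(...))`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side equivalent of a contiguous transpose.
// Input:  n_fast x n_slow, element (i, j) stored at src[i + j * n_fast].
// Output: n_slow x n_fast, element (j, i) stored at dst[j + i * n_slow].
std::vector<float> transpose_flat(const std::vector<float>& src,
                                  std::size_t n_fast, std::size_t n_slow) {
    assert(src.size() == n_fast * n_slow);
    std::vector<float> dst(src.size());
    for (std::size_t j = 0; j < n_slow; ++j)
        for (std::size_t i = 0; i < n_fast; ++i)
            dst[j + i * n_slow] = src[i + j * n_fast];
    return dst;
}

// Matmul output ne=[L, HD] (channel-major-flat, data[t + c*L]) to the
// ne=[HD, L] time-major-flat layout (data[c + t*HD]).
std::vector<float> channel_major_to_time_major(const std::vector<float>& q,
                                               std::size_t L, std::size_t HD) {
    return transpose_flat(q, L, HD);
}
```

Applying the flip a second time (`transpose_flat(out, HD, L)`) returns the original bytes, which is the "transpose-of-the-transpose" hazard item 4 describes: once the graph already emits time-major-flat data, a host helper that transposes again silently restores the wrong orientation.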
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 13, 2026
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR tetherto#16): the audit-driven optimisations there are backend-portable by construction (every host-sync / bandwidth / fusion win uses the same GPU dispatch path Vulkan walks), so this PR only adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas

- supertonic_model gains backend_is_vk + use_native_leaky_relu, both resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine backend_name() can annotate the device with ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against a synthetic LEAKY_RELU node: short-circuits leaky_relu_portable_ggml to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain upstream OpenCL. Dynamic probe self-adapts to whichever ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the use_f16_attn auto-policy so a backend that ships flash_attn_ext but rejects the F16-K/V variant for Supertonic shapes keeps the F32 path instead of crashing at first synth call. Manual --f16-attn 1 still forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed through EngineOptions::vulkan_device, range-checked against ggml_backend_vk_get_device_count() at load (out-of-range index is a hard error; surfaces operator typos / wrong-machine config loud rather than silently falling back to CPU). Verbose mode + bench output append the Vulkan device description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch.

Tests

- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from tetherto#16 unchanged, all PASS.

Build

- All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean.
- No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible.

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
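The interaction between the dispatch scope and the portable LeakyReLU helper can be sketched in scalar C++. Everything below is a hypothetical stand-in written for illustration: the `sketch_` names are not the PR's types, the scope tracks a single flag where the real one mirrors several, and `(1 - alpha) * relu(x) + alpha * x` is one decomposition that uses one RELU, two SCALEs and one ADD; the message does not spell out which form the helper actually builds.

```cpp
#include <algorithm>
#include <cassert>

// Thread-local dispatch flag consulted by op helpers (stand-in).
thread_local bool sketch_tl_use_native_leaky_relu = false;

struct sketch_model_flags {
    bool use_native_leaky_relu = false;  // resolved at load time in the PR
};

// RAII scope: mirrors the model's flag into the thread-local and restores the
// previous value on destruction, including during exception unwinding.
class sketch_dispatch_scope {
public:
    explicit sketch_dispatch_scope(const sketch_model_flags& m)
        : prev_use_native_leaky_relu_(sketch_tl_use_native_leaky_relu) {
        sketch_tl_use_native_leaky_relu = m.use_native_leaky_relu;
    }
    ~sketch_dispatch_scope() {
        sketch_tl_use_native_leaky_relu = prev_use_native_leaky_relu_;
    }
    sketch_dispatch_scope(const sketch_dispatch_scope&) = delete;
    sketch_dispatch_scope& operator=(const sketch_dispatch_scope&) = delete;

private:
    bool prev_use_native_leaky_relu_;
};

// Fused form: what a backend with a native LEAKY_RELU op computes.
float sketch_leaky_relu_fused(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// Decomposed form: RELU, two SCALEs, one ADD.
float sketch_leaky_relu_decomposed(float x, float alpha) {
    const float r = std::max(x, 0.0f);        // RELU
    return (1.0f - alpha) * r + alpha * x;    // SCALE + SCALE, then ADD
}

// Portable helper: picks a lowering from the thread-local flag.
float sketch_leaky_relu_portable(float x, float alpha) {
    return sketch_tl_use_native_leaky_relu
               ? sketch_leaky_relu_fused(x, alpha)
               : sketch_leaky_relu_decomposed(x, alpha);
}
```

Saving the previous value in the scope, instead of resetting to a fixed default, is what lets nested scopes unwind correctly and keeps one engine from leaking its setting into the next engine on the same thread.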
GustavoA1604
pushed a commit
that referenced
this pull request
May 15, 2026
`apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`. In fact the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)` fallback) produces `ne=[L, HD]` with channel-major-flat memory (`data[t + c*L]`), the bit-exact transpose of the helper's input contract. Every CPU synth with `--n-gpu-layers 0` against a GGUF carrying `vector_rope_theta` aborts at the helper's defensive assertion on the first denoise step:

    supertonic_internal.h:742: GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
    apply_rope_to_packed_qk
      → supertonic_vector_trace_proj_ggml
      → supertonic_vector_step_ggml
      → supertonic_vector_loop_ggml

The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

Fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]`. Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits.
2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V has no RoPE to mask the layout flip: open-code the same `ggml_cont(ggml_transpose(...))` at the V matmul output in `build_group_graph_cache` and the front-block path in `supertonic_vector_trace_proj_ggml` so the GPU-bridge `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact bytes. Style sq/sk/sv left untouched: this branch has no GPU bridge for style attention, so the host-vector path via `tensor_to_time_channel` is already correct.
4. Legacy host-bridge downloads of post-RoPE Q/K and post-transpose V switched from `tensor_to_time_channel` to `tensor_raw_f32`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV |
| CPU long-text synth | abort | writes 6.25s WAV |
| Multi-voice (F1 / M1) | abort | both work |
| Determinism (same seed × 2) | n/a | bit-identical |

- `test-supertonic-rope-packed-qk`: 14 / 14 checks, `max_abs_err = 0.000e+00`.
- CPU `ctest -L unit`: 12 / 12 tests, 0 regressions.

Audio sanity on the exact QVAC-18966 reproduction command: 99.9% non-zero samples, rms=1406, abs_max=15984; speech-like dynamics, not silence / clipping / garbage.

Co-authored-by: Cursor <cursoragent@cursor.com>
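The audio sanity figures quoted above (non-zero fraction, rms, abs_max) are simple statistics over the PCM samples. A minimal sketch of how such numbers could be computed follows, assuming 16-bit signed samples already loaded into memory; the struct and function names are hypothetical, and the tool the author actually used is not stated in the message.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of a decoded 16-bit PCM buffer.
struct pcm_stats {
    double nonzero_fraction = 0.0;  // share of samples that are not exactly 0
    double rms              = 0.0;  // root mean square amplitude
    int    abs_max          = 0;    // largest absolute sample value
};

pcm_stats compute_pcm_stats(const std::vector<std::int16_t>& samples) {
    pcm_stats out;
    if (samples.empty()) return out;  // nothing to measure

    std::size_t nonzero = 0;
    double sum_sq = 0.0;
    for (std::int16_t s : samples) {
        const int v = static_cast<int>(s);   // widen first so -32768 is safe
        const int a = v < 0 ? -v : v;
        if (v != 0) ++nonzero;
        if (a > out.abs_max) out.abs_max = a;
        sum_sq += static_cast<double>(v) * static_cast<double>(v);
    }
    const double n = static_cast<double>(samples.size());
    out.nonzero_fraction = static_cast<double>(nonzero) / n;
    out.rms = std::sqrt(sum_sq / n);
    return out;
}
```

Read against the reported values: a non-zero fraction near 1 rules out silence, an abs_max well below 32767 rules out hard clipping, and an rms far below abs_max is consistent with speech-like dynamics. These are coarse checks and do not prove the audio is intelligible.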
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 18, 2026
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR tetherto#16): the audit-driven optimisations there are backend-portable by construction (every host-sync / bandwidth / fusion win uses the same GPU dispatch path Vulkan walks), so this PR only adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas

- supertonic_model gains backend_is_vk + use_native_leaky_relu, both resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine backend_name() can annotate the device with ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against a synthetic LEAKY_RELU node: short-circuits leaky_relu_portable_ggml to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain upstream OpenCL. Dynamic probe self-adapts to whichever ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the use_f16_attn auto-policy so a backend that ships flash_attn_ext but rejects the F16-K/V variant for Supertonic shapes keeps the F32 path instead of crashing at first synth call. Manual --f16-attn 1 still forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed through EngineOptions::vulkan_device, range-checked against ggml_backend_vk_get_device_count() at load (out-of-range index is a hard error; surfaces operator typos / wrong-machine config loud rather than silently falling back to CPU). Verbose mode + bench output append the Vulkan device description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch.

Tests

- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from tetherto#16 unchanged, all PASS.

Build

- All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean.
- No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible.

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
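The dispatch scope described above saves the previous thread-local flags and restores them on teardown, including on exceptions and for nested scopes. A minimal sketch of that RAII pattern follows, assuming three boolean flags; `dispatch_flags`, `current_flags` and `dispatch_scope_sketch` are hypothetical names, and the real `supertonic_op_dispatch_scope` may be structured differently.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the thread-local dispatch state (illustrative only).
struct dispatch_flags {
    bool use_cpu_custom_ops    = true;
    bool use_f16_attn          = false;
    bool use_native_leaky_relu = false;
};

inline dispatch_flags & current_flags() {
    thread_local dispatch_flags flags;  // one independent copy per thread
    return flags;
}

// RAII guard: installs a model's flags on construction and restores the
// previously active values on destruction, including during stack unwinding.
class dispatch_scope_sketch {
public:
    explicit dispatch_scope_sketch(const dispatch_flags & model_flags)
        : prev_(current_flags()) {
        current_flags() = model_flags;
    }
    ~dispatch_scope_sketch() { current_flags() = prev_; }
    dispatch_scope_sketch(const dispatch_scope_sketch &) = delete;
    dispatch_scope_sketch & operator=(const dispatch_scope_sketch &) = delete;

private:
    dispatch_flags prev_;  // one saved slot per mirrored model field
};

// True when nested scopes unwind to the values that were active before them.
inline bool nested_scopes_restore() {
    const bool before = current_flags().use_native_leaky_relu;
    dispatch_flags gpu;
    gpu.use_cpu_custom_ops    = false;
    gpu.use_f16_attn          = true;
    gpu.use_native_leaky_relu = true;
    bool ok = true;
    {
        dispatch_scope_sketch outer(gpu);
        ok = ok && current_flags().use_native_leaky_relu;
        {
            dispatch_scope_sketch inner(dispatch_flags{});  // CPU-style defaults
            ok = ok && !current_flags().use_native_leaky_relu;
        }
        ok = ok && current_flags().use_native_leaky_relu;  // inner restored outer
    }
    return ok && current_flags().use_native_leaky_relu == before;
}

// True when an exception thrown inside a scope still restores the flags.
inline bool scope_restores_on_exception() {
    const bool before = current_flags().use_native_leaky_relu;
    dispatch_flags gpu;
    gpu.use_native_leaky_relu = true;
    try {
        dispatch_scope_sketch scope(gpu);
        throw std::runtime_error("simulated failure inside the scope");
    } catch (const std::runtime_error &) {
        // the destructor has already run by the time control reaches here
    }
    return current_flags().use_native_leaky_relu == before;
}
```

Saving the whole flag struct by value means a newly added field is mirrored automatically in this sketch; a design with one explicit `prev_*` slot per field, as the commit message describes, needs a new slot whenever the model gains a flag.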
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…PU bridge

Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR tetherto#16's audit follow-up tetherto#6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up: same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site).

Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0`: trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without `vector_rope_theta` continue to take the host-rotate path.

The blit primitive parity gate already shipped with PR tetherto#16 (`test-supertonic-graph-to-graph-blit`); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`).

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
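The gating condition and the sync-point count from this commit message can be written out as a small sketch. The four conditions are modelled as plain booleans here, which is an assumption (in the real code `v_gpu_attn0` and `k_rope_gpu_attn0` are presumably tensor pointers), and the struct and function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical bundle of the four conditions named in the commit message.
struct attn0_bridge_state {
    bool front_in_graph_rope;  // RoPE was applied in-graph (assumed: GGUF has vector_rope_theta)
    bool include_ggml_trace;   // trace harness wants host copies of Q/K/V
    bool v_gpu_attn0;          // stand-in for "V tensor is available on the device"
    bool k_rope_gpu_attn0;     // stand-in for "post-RoPE K tensor is available on the device"
};

// GPU fast path only when every condition holds; otherwise the legacy host bridge runs.
constexpr bool use_gpu_bridge(const attn0_bridge_state & s) {
    return s.front_in_graph_rope && !s.include_ggml_trace &&
           s.v_gpu_attn0 && s.k_rope_gpu_attn0;
}

// Sync points removed per synth = (downloads + uploads per step) * denoise steps.
constexpr int sync_points_saved(int downloads, int uploads, int denoise_steps) {
    return (downloads + uploads) * denoise_steps;
}

static_assert(sync_points_saved(3, 3, 5) == 30,
              "3 downloads + 3 uploads over 5 denoise steps");
```

Any single false condition drops back to the host bridge, so trace runs and legacy GGUFs keep their previous behaviour.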
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…PU-bridge layout fix

Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware. Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip; open-code the same `ggml_cont(ggml_transpose(...))` at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
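The layout mismatch behind this fix can be reproduced on plain host buffers, without ggml. The sketch below assumes the usual ggml convention that `ne[0]` is the contiguous dimension, so `ne=[L, HD]` stores element (t, c) at offset `c * L + t` and `ne=[HD, L]` stores it at `t * HD + c`; the function names are hypothetical and only illustrate the index arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Channel-major-flat (ne=[L, HD]): element (t, c) lives at c * L + t.
// Time-major-flat    (ne=[HD, L]): element (t, c) lives at t * HD + c.
inline std::vector<float> channel_major_to_time_major(const std::vector<float> & src,
                                                      std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[t * HD + c] = src[c * L + t];
        }
    }
    return dst;
}

// Inverse permutation: time-major-flat back to channel-major-flat.
inline std::vector<float> time_major_to_channel_major(const std::vector<float> & src,
                                                      std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[c * L + t] = src[t * HD + c];
        }
    }
    return dst;
}
```

With L=3 and HD=2, the channel-major buffer `{0, 1, 2, 10, 11, 12}` becomes `{0, 10, 1, 11, 2, 12}` in time-major order. Running the inverse on data that is already time-major hands back the channel-major bytes, which is the "transpose-of-the-transpose" hazard from item 4: the values are all present, only in the wrong orientation, so nothing fails loudly.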
Summary
Brings the Supertonic TTS stage of `tts-cpp` to functional + optimized parity with the existing Chatterbox OpenCL story, then iterates on the resulting baseline through six audit-driven optimization rounds. Each round eliminates one or more host↔GPU synchronization points or redundant memory copies from the per-synth hot path, gated by a new CPU-only TDD test that locks in the bit-exact contract for future regressions.

Net steady-state impact (vs. the unoptimized post-bring-up tree, 5-step default denoise schedule):

`read_f32` (F1 / F13 / F17 / F9)

Plus ~16.8 MiB of redundant vocoder memory traffic removed (F7) and weight bandwidth ~halved on the identified hot matmul / pwconv roster (2A F16 weights).
Investigation methodology
- `8d5ebb4` ports the OpenCL backend-dispatch / portable-op / F16 K-V-attention primitives from Chatterbox to Supertonic and wires them through the CLI / bench / engine layer.
- `ad1ef07` adds the CPU-only unit harnesses that didn't exist for the bring-up primitives (so `ctest -L unit` is green on a fresh checkout without needing a Supertonic GGUF + reference dump fixture).
- … `aiDocs/` (out-of-tree by design).
- … `test-supertonic-*` parity harnesses continue to enforce end-to-end correctness.

Commits in this PR
9 commits, 27 files changed, +6966 / −620.
- `8d5ebb4`
- `ad1ef07` … `backend-dispatch`, `portable-ops`, `f16-attn-parity`) + R&D plan.
- `e9e76d7`
- `5f457c9`
- `ccec592` … `read_f32` cache, F18 text-encoder ConvNeXt graph cached, F19 vector-estimator front-block graph cached.
- `a0b4e5a` … `apply_rope_in_graph` helper + universal-op `make_rope_cos_sin_tables` precompute, with TDD test. Integration deferred to keep the change reviewable.
- `5869231`
- `f74e057`
- `cf4aa0e` … `aiDocs/`).

Code change highlights
- `tts-cpp/src/supertonic_gguf.cpp` (+~700 lines): All host-side caches are populated here at load time: `vector_rope_theta` (F1), `bn_scale_pre` / `bn_shift_pre` (F2), `text_encoder_ln_weights` (F13), `scalar_weight_cache` (F17), `time_emb_cache` (F9). Materializes F16 weight variants for the hot matmul / pwconv roster (2A) with the GGUF-roster-driven name list mirrored from chatterbox.
- `tts-cpp/src/supertonic_vector_estimator.cpp` (+1326 lines, by far the heaviest single file). New graph-cache types (`vector_group_graph_cache`, `vector_text_attention_cache`, `vector_res_style_qkv_cache`, `vector_style_residual_graph_cache`, `vector_tail_graph_cache`) replace the historical pattern of building a fresh `ggml_context` + gallocr per call. Each cache is keyed on its shape parameters + `generation_id` for safe model swap. Caches also expose GPU tensor pointers (`q_rope_gpu`, `k_rope_gpu`, `v_gpu`) so downstream consumers can `ggml_backend_tensor_copy` instead of round-tripping through host vectors.
- `tts-cpp/src/supertonic_internal.h` (+~610 lines): All header-only GGML graph helpers: `apply_rope_in_graph`, `apply_rope_to_packed_qk`, `convnext_block_fused_ggml`, `transpose_time_channel_ggml`, `leaky_relu_portable_ggml`, plus the dispatch / generation-id / alive-id machinery shared across stages.
- `tts-cpp/src/supertonic_vocoder.cpp` (+200 lines): Pre-baked BN weights consumed directly as graph weights (F2). Latent unpack moved into the cached graph (F3). ConvNeXt blocks rewired through `convnext_block_fused_ggml` (F7).
- `tts-cpp/src/supertonic_text_encoder.cpp` (+312 lines): LN weight cache lookups (F13). Speech-prompted attention QKV graph cached (F14). ConvNeXt graph cached across synths (F18).
- `tts-cpp/src/supertonic_duration.cpp` (+237 lines): Cached `cached_read_f32` lookups everywhere `read_f32` previously ran on the hot path (F17). Generic helper, fall-through to `read_f32` when the GGUF lacks a rostered name.

Testing strategy
14 new test files (`tts-cpp/test/test_supertonic_*`), all wired into CMake with `LABEL "unit"`.

CPU-only, no GGUF needed; green on a fresh checkout under `ctest -L unit`:

- `backend_dispatch`, `portable_ops`, `f16_attn_parity` (bring-up primitives)
- `f16_weights`, `graph_rewrites`, `profile_csv` (audit #2 primitives)
- `rope_in_graph`, `rope_packed_qk` (RoPE helpers)
- `convnext_block_fused`, `in_graph_transpose`, `graph_to_graph_blit` (audit #6)

Fixture-bound (requires a Supertonic GGUF + `artifacts/supertonic-ref-quick` reference dump):

- `load_caches`, `audit3_caches`, `text_encoder_caches` (cache-state structural tests for F1 / F13 / F14 / F17 / F18 / F19)
- `pipeline`, `vector`, `vector_trace`, `vocoder`, `vocoder_trace`, `text_encoder`, `text_encoder_trace`, `duration`, `duration_trace` continue as end-to-end parity gates.

Each TDD test is bit-exact unless the operation introduces floating-point reassociation (the ConvNeXt fusion test allows `max_abs_err ≤ 5e-4`; everything else is `max_abs_err = 0.0`).

CPU-side verification status: All CPU-only unit checks pass on this branch. Fixture-bound checks pass on the developer's local Supertonic GGUF; they should also pass in CI when the fixture is uploaded.
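The parity contract above reduces to a maximum-absolute-error comparison between a candidate output and a reference, with a tolerance of 0.0 for bit-exact tests and 5e-4 where reassociation is allowed. A minimal sketch of such a check follows; the helper names are hypothetical and the actual test harnesses may compute the metric differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest element-wise |got - ref|. A NaN anywhere is reported as +infinity
// so that it can never slip under a tolerance.
inline float max_abs_err(const std::vector<float> & got, const std::vector<float> & ref) {
    assert(got.size() == ref.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float d = std::fabs(got[i] - ref[i]);
        if (std::isnan(d)) {
            return std::numeric_limits<float>::infinity();
        }
        worst = std::max(worst, d);
    }
    return worst;
}

// tolerance == 0.0f expresses the bit-exact contract; 5e-4f the relaxed one.
inline bool passes_parity(const std::vector<float> & got, const std::vector<float> & ref,
                          float tolerance) {
    return max_abs_err(got, ref) <= tolerance;
}
```

A fused kernel that sums the same terms in a different order can differ from the reference in the last few bits, which is why only the fusion test needs a non-zero tolerance while pure data-movement tests (transposes, blits) can demand exactly 0.0.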
Deferred work (next iterations)
Catalogued in `aiDocs/AUDIT_SUPERTONIC_OPENCL.md` with rationale + suggested phase IDs:

- … `front_block_proj_cache` + `vector_res_style_qkv_result`. Would eliminate ~150 more sync points / synth.
- … `pack_time_channel_for_ggml` call sites in text-encoder / duration (currently only the vector-estimator hot path is migrated).

Risks & mitigations
- … `vector_rope_theta`, `text_encoder_ln_weights`, `scalar_weight_cache`, `bn_scale_pre` / `shift_pre`, F16 weight variants) falls through to the original `read_f32` path when the rostered tensor name is absent. The in-graph RoPE (F20 + F23) similarly falls back to host `apply_rope` when `vector_rope_theta` isn't loaded. Future model variants are not blocked.
- … `(model, generation_id, …shape params)`. Model swaps and reloads bump `generation_id`; caches detect mismatch and rebuild. Uses the `alive_id` / `safe_gallocr_free` machinery from the F15 / F16 cache-hygiene work to avoid free-after-teardown crashes.
- … `supertonic_trace_tensor`. The F24 (2C-lite) optimization explicitly gates the new GPU fast path on `include_ggml_trace == false` so scalar-parity harnesses see no change.
- … `reshape`, `view`, `permute`, `cont`, `mul`, `add`, `repeat`, `concat`, `flash_attn_ext`, `transpose`, `scale`, `scale_bias`, `mul_mat`, `norm`, `gelu_erf`, `tensor_copy`). No backend-specific intrinsics. Verified green on the CPU backend; OpenCL / Metal / Vulkan dispatch through the same op set.

Test plan
- … `ctest -L unit` checks pass on the branch.
- … `-Wall -Wextra` (modulo a pre-existing missing-include in `chatterbox_tts.cpp`, untouched by this PR).
- `test-supertonic-pipeline` end-to-end parity on the local Supertonic GGUF fixture (developer-local; needs CI fixture upload).
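The cache-invalidation mitigation listed under Risks & mitigations (caches keyed on model, `generation_id` and shape parameters, rebuilt on mismatch) can be illustrated with a small sketch. Everything here is hypothetical: `cache_key`, `graph_cache_sketch` and the single `seq_len` shape parameter are stand-ins, and the real caches also own a ggml context and allocator that this sketch leaves out.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical cache key mirroring (model, generation_id, ...shape params).
struct cache_key {
    const void *  model         = nullptr;
    std::uint64_t generation_id = 0;
    int           seq_len       = 0;  // stand-in for the shape parameters

    bool operator==(const cache_key & o) const {
        return std::tie(model, generation_id, seq_len) ==
               std::tie(o.model, o.generation_id, o.seq_len);
    }
};

struct graph_cache_sketch {
    cache_key key;
    bool      built         = false;
    int       rebuild_count = 0;

    // Returns true when the cached graph was reused, false when it was rebuilt.
    bool ensure(const cache_key & wanted) {
        if (built && key == wanted) {
            return true;
        }
        // A real cache would release the old graph resources and build anew here.
        key   = wanted;
        built = true;
        ++rebuild_count;
        return false;
    }
};

// Walks through first build, reuse, a generation bump, and a shape change.
inline int rebuilds_after_model_swap() {
    int dummy_model = 0;  // only its address is used, as an identity token
    graph_cache_sketch cache;
    cache_key k;
    k.model         = &dummy_model;
    k.generation_id = 1;
    k.seq_len       = 32;
    cache.ensure(k);      // first use: build
    cache.ensure(k);      // same key: reuse
    k.generation_id = 2;  // model reloaded
    cache.ensure(k);      // mismatch: rebuild
    k.seq_len = 50;       // different shape
    cache.ensure(k);      // mismatch: rebuild
    return cache.rebuild_count;
}
```

Including the generation counter in the key matters because a reloaded model can reuse the same address as the old one; comparing pointers alone would then treat a stale graph as valid.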