Merge supertonic_optimizations into master: QVAC-18605 rounds 1-13 + master reconcile #31
Two latent bugs surfaced together when whisper.cpp is built with
-DWHISPER_COREML=ON, both reproducible at CMake configure time:
1. install(TARGETS whisper.coreml) did not join the whisper-targets
export set. Since whisper PRIVATE-links to whisper.coreml and is
itself in whisper-targets, CMake refuses to generate with
install(EXPORT "whisper-targets" ...) includes target "whisper"
which requires target "whisper.coreml" that is not in any
export set.
Add EXPORT whisper-targets to the install (must come before LIBRARY
in CMake's install(TARGETS ...) signature).
2. Once whisper.coreml is in the export set, its PUBLIC include dirs
are validated against the install interface. The current "."
include dir is a raw source-tree path with no
$<BUILD_INTERFACE>/$<INSTALL_INTERFACE> guards and CMake refuses
with
INTERFACE_INCLUDE_DIRECTORIES property contains path "..."
which is prefixed in the source directory.
The headers under coreml/ are internal implementation details only
consumed by whisper.cpp (in the same directory), so the correct fix
is to mark them PRIVATE rather than wrapping them in install/build
generator expressions.
Verified locally with -DWHISPER_COREML=ON -DGGML_METAL=ON: configure
clean, whisper.coreml + libwhisper.dylib build end-to-end.
This unblocks the ios-xcode-build CI job on PR #12.
QVAC-18300
Co-authored-by: Cursor <cursoragent@cursor.com>
The bindings-java tests testGetDefaultFullParams_Greedy /
testGetDefaultFullParams_BeamSearch on PR #12 fail with
expected: <5> but was: <0> (greedy.best_of)
expected: <5> but was: <-1> (beam_search.beam_size)
while whisper_full_default_params() still returns 5 for both; the
actual transcription test (testFullTranscribe) produces correct text.
Diagnosis: the Java JNA WhisperFullParams Structure is missing fields
that exist in the C whisper_full_params struct, so JNA computes wrong
offsets and reads garbage at greedy.best_of / beam_search.beam_size.
Specifically the Java layout was missing:
1. int32_t seed, added by tetherto's local seed patch between
no_speech_thold and greedy (include/whisper.h:553). This single
omission shifts every subsequent field by 4 bytes and is the
proximate cause of both failing assertions.
2. bool vad, added by upstream
3. const char * vad_model_path
4. whisper_vad_params vad_params (struct)
Fix:
* New WhisperVadParams.java JNA Structure mirroring
whisper_vad_params {threshold, min_speech_duration_ms,
min_silence_duration_ms, max_speech_duration_s, speech_pad_ms,
samples_overlap}.
* Add `public int seed`, `public CBool vad`,
`public String vad_model_path`, `public WhisperVadParams vad_params`
fields and thread them into getFieldOrder() at the matching
positions.
Field order in WhisperFullParams.getFieldOrder() now matches the C
struct in include/whisper.h field-for-field, so JNA-computed offsets
agree with the native side.
QVAC-18300
Co-authored-by: Cursor <cursoragent@cursor.com>
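The offset shift described in the diagnosis can be reproduced in a few lines. This is a minimal sketch, not the real struct: `native_params`, `stale_params` and `read_best_of_with_stale_layout` are hypothetical, heavily trimmed stand-ins that model only the neighbourhood of the missing `seed` field.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the region of whisper_full_params around `seed`.
struct greedy_params { int32_t best_of; };

// Layout the native side uses: `seed` sits between no_speech_thold and greedy.
struct native_params {
    float         no_speech_thold;
    int32_t       seed;
    greedy_params greedy;
};

// Layout a stale binding assumes: `seed` is missing.
struct stale_params {
    float         no_speech_thold;
    greedy_params greedy;
};

// Reads best_of the way a stale binding would: from native memory, but at the
// offset computed from the stale layout.
inline int32_t read_best_of_with_stale_layout(const native_params & p) {
    const unsigned char * base = reinterpret_cast<const unsigned char *>(&p);
    int32_t value = 0;
    std::memcpy(&value, base + offsetof(stale_params, greedy), sizeof value);
    return value;
}
```

With `native_params{0.6f, 0, {5}}` the stale read lands on `seed` and returns 0 rather than 5, which is the same shape of failure as the `greedy.best_of` assertion above.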
QVAC-18607 follow-up. The bring-up commit (8d5ebb4) landed the
dispatch + portable-op + F16-K/V-attention primitives but only
exercised them transitively through the existing fixture-bound
test-supertonic-* harnesses, which need a Supertonic GGUF + an
artifacts/supertonic-ref-quick reference dump to run. A fresh
checkout has neither, so the bring-up primitives shipped without
their own gate on `ctest -L unit`.
This commit adds three CPU-only unit harnesses that cover the
bring-up primitives independent of any fixture, plus an R&D plan
document capturing the next optimization rounds with their TDD test
gates.
Tests (all LABEL "unit", auto-run on fresh checkout):
test-supertonic-backend-dispatch (186 lines)
Six scenarios around supertonic_op_dispatch_scope + the two
thread-local query functions: default state, CPU model mirroring,
GPU model mirroring + post-teardown restore, RAII teardown on
exception, nested-scope unwinding, independence of
use_cpu_custom_ops / use_f16_attn. Catches "scope leaked wrong
previous-value into thread_local" and "GPU engine poisons next CPU
engine on same thread" regressions.
test-supertonic-portable-ops (260 lines)
CPU-backend parity of leaky_relu_portable_ggml's CPU lowering
(fused ggml_leaky_relu) vs its GPU decomposition (RELU + 2x SCALE +
ADD) for alpha in {0, 0.01, 0.05, 0.1, 0.5, 0.99, 1.0} against a
sign-mixed input including the zero boundary. Also asserts
graph-node-count grows on the GPU dispatch, which catches a
regression where the portable helper would silently route back to
ggml_leaky_relu on a non-CPU backend (defeating the whole reason the
helper exists).
test-supertonic-f16-attn-parity (291 lines)
F32 vs F16 K/V ggml_flash_attn_ext parity on the two hot shapes from
the vector estimator (text attention kv=32, style attention kv=50),
n_heads=4, head_dim=64. Tolerance 5e-3 abs / 5e-3 rel, the same band
chatterbox ships behind --cfm-f16-kv-attn. Gracefully skips
("SKIPPED — CPU build missing one path") if the local CPU build
doesn't carry both flash-attention paths, preserving CI greenness
while still validating where the path exists.
Refactor to support testing:
leaky_relu_portable_ggml moves from file-local in
supertonic_vocoder.cpp to an inline definition in
supertonic_internal.h. ODR-safe under C++17, lets the portable-ops
test call the production helper directly instead of re-implementing
the rewrite (which would defeat the test's purpose). The vocoder TU
now only carries a one-line redirect comment pointing at the header.
Plan document (PLAN_SUPERTONIC_OPENCL.md, 268 lines):
Captures five concrete next-rounds with motivation + code-change
plan + acceptance test + risk for each:
2A. F16 weight materialization for hot matmuls: biggest expected
single-flag win after F16 K/V attn, mirrors chatterbox's
CHATTERBOX_F16_CFM gate.
2B. Pre-quantized Q8_0 GGUF weights: needs convert-script work +
audio listening sign-off.
2C. Reduce 140x host<->GPU sync round-trips per synth in the vector
estimator (5 steps x 28 set/get pairs).
2D. SUPERTONIC_OPENCL_PROFILE=PATH.csv tooling for per-kernel
attribution; mirrors chatterbox's cl_profiling_*.csv flow.
2E. Vocoder unpack-on-GPU via ggml_permute + ggml_cont.
Each phase has its acceptance test spelled out (TDD, written before
the implementation lands), the CTest label it should carry, and its
sequencing rationale. Cross-linked from PROGRESS_SUPERTONIC.md's
"Next optimization rounds" subsection so future-readers find the
roadmap.
Validation:
All three new tests pass clang -fsyntax-only -Wall -Wextra and
compile to clean .o files. `nm` confirms the dispatch test's four
undefined symbols (op_dispatch_scope ctor/dtor, use_cpu_custom_ops,
use_f16_attn) resolve against the definitions in supertonic_gguf.o,
so link-time resolution will succeed under the real CMake build.
No new linter errors in any of the 8 affected files; pre-existing
-Wunused-function warnings on read_f32 / scalar_f32 /
set_env_if_unset unchanged.
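The portable-ops parity check rests on a simple identity: for any alpha, leaky ReLU equals `alpha * x + (1 - alpha) * relu(x)`. The scalar sketch below illustrates that identity only. The commit states the op count (RELU + 2x SCALE + ADD) but not the exact wiring, so this particular decomposition is an assumption, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Fused form, as a single leaky-ReLU op would compute it.
inline float leaky_relu_fused(float x, float alpha) {
    return x >= 0.0f ? x : alpha * x;
}

// Decomposed form: one RELU, two SCALEs, one ADD (assumed wiring).
inline float leaky_relu_decomposed(float x, float alpha) {
    const float relu     = std::max(x, 0.0f);      // RELU
    const float scaled_x = alpha * x;              // SCALE #1
    const float scaled_r = (1.0f - alpha) * relu;  // SCALE #2
    return scaled_x + scaled_r;                    // ADD
}

// Largest absolute disagreement between the two forms over a set of inputs.
inline float max_abs_diff(const std::vector<float> & xs, float alpha) {
    float worst = 0.0f;
    for (float x : xs) {
        worst = std::max(worst, std::fabs(leaky_relu_fused(x, alpha) -
                                          leaky_relu_decomposed(x, alpha)));
    }
    return worst;
}
```

For x >= 0 the two scaled terms sum back to x, and for x < 0 the ReLU term vanishes and only `alpha * x` remains, so the two forms agree up to float rounding.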
…wins
QVAC-18607 follow-up. Lands the audit-driven optimization round
identified by an end-to-end code audit of the post-bring-up tree:
~54 GPU↔host sync points per synth eliminated independently of the
quantization / F16-weight work that's still on the roadmap. Nine
findings landed; three high-risk ones (RoPE in-graph, vocoder
layout flip, full host-transpose elimination) stay deferred behind
a physical-device parity gate.
The audit report + plan document live under aiDocs/ and are not
part of this commit; the per-finding rationale is reproduced
inline in the code comments at every load-time hook and every
rewritten call site so the rationale stays adjacent to the code it
justifies.
Findings landed:
F1 RoPE θ tensor host-side cache.
`supertonic_model::vector_rope_theta` populated once in
`load_supertonic_gguf` from
`vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta`,
then consumed at 9 call sites that previously did the same
backend read on the hot path. Saves 20 GPU→host downloads
per default 5-step synth.
F2 Vocoder BN scale / shift pre-bake.
`supertonic_vocoder_weights::bn_scale_pre` + `bn_shift_pre`
allocated alongside the other vocoder weights at load and
populated from `gamma / sqrt(var + 1e-5)` + `beta - mean *
scale` once. The vocoder graph references them as weight
tensors (no `ggml_set_input`), so the per-synth pattern of
4 final_norm.* downloads + CPU compute + 2 bn_scale/bn_shift
uploads goes away entirely.
F3 Vocoder unpack moves into the graph.
`supertonic_vocoder_forward_ggml` now uploads `latent` in
its raw `[latent_len, latent_channels]` shape and the
cached graph runs `reshape_3d(L,6,24) → permute(1,0,2,3)
→ cont → reshape_2d(T0, 24)`. Math is bit-exact with the
legacy CPU triple-loop in `supertonic_vocoder_forward_cpu`;
the host loop + the ~40 KiB upload-roundtrip are gone.
F4 Style cache upload skip.
`vector_res_style_qkv_cache` gains `last_style_v_raw_uploaded`
/ `last_kctx_raw_uploaded` pointer-keyed against the host
vectors `cached_style_layouts` returns. Pointer comparison
is sound: the layout cache is keyed on
`(model.generation_id, style_ttl)` so equal pointers mean
equal data. Steady-state per synth: 4 cold-miss uploads
after the first synth, then 16 skips/synth.
F6 Pre-transposed t_proj weights.
Four `__T` companion tensors allocated in `model.ctx_w`
pre-`alloc_ctx_tensors`, populated via host-side transpose
after the source data lands. Mapped into
`model.source_tensors` under `<name>__T` so
`require_source_tensor(model, matmul_source + "__T")` is
the call-site lookup. Eliminates the
`ggml_cont(ggml_transpose(W))` op (+ ~640 KiB of
compute-buffer copies) at every graph build. Defensive
shape check (F32, ne=[512, 64]) skips models that don't
match the audit-roster expectation; call sites fall back
to the original in-graph transpose.
F8 Cached style-residual graphs.
`vector_style_residual_graph_cache` + builder + runner;
replaces four near-identical inline graph build sites
(style0 / g1 / g2 / g3) with cache-lookup-or-build. Each
cache survives across synths with the same `(L, C, norm_block)`
key. Saves 16 graph alloc/free cycles + ~80 bytes of
gallocr churn per synth, but the main win is dropping
~150 LoC of duplicated boilerplate.
F9 `cached_time_embedding(model, current_step, total_steps)`.
Lazy `mutable` map on `supertonic_model::time_emb_cache`.
First-synth cost is the same as the old code; subsequent
synths with the same denoise schedule pay zero CPU
compute and zero downloads for this stage.
F10 Text-encoder embedding lookup as `ggml_get_rows`.
Replaces the host-side embedding-table download + CPU gather
+ pack-to-channel-major-and-upload chain with an i32-vector
input + `ggml_get_rows + ggml_transpose + ggml_cont` on the
device. Bounds check still runs host-side against
`emb_table->ne[1]`. Drops the per-synth ~2 MB embedding
table download.
F11 Cached duration graph.
`duration_graph_cache` + `free_duration_graph_cache`; first
synth pays the full graph build, subsequent synths with the
same text_len reuse the gallocr-allocated graph.
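F2's pre-bake folds the four batch-norm tensors into one scale and one shift per channel. A minimal host-side sketch of that arithmetic follows; `bn_affine` and `prebake_batch_norm` are hypothetical names, and only the two formulas and the 1e-5 epsilon come from the commit message.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-channel affine form of an inference-time batch norm.
struct bn_affine {
    std::vector<float> scale;  // gamma / sqrt(var + eps)
    std::vector<float> shift;  // beta - mean * scale
};

// Computes scale/shift once, so a per-call path only needs x * scale + shift.
inline bn_affine prebake_batch_norm(const std::vector<float> & gamma,
                                    const std::vector<float> & beta,
                                    const std::vector<float> & mean,
                                    const std::vector<float> & var,
                                    float eps = 1e-5f) {
    bn_affine out;
    out.scale.resize(gamma.size());
    out.shift.resize(gamma.size());
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        out.scale[c] = gamma[c] / std::sqrt(var[c] + eps);
        out.shift[c] = beta[c] - mean[c] * out.scale[c];
    }
    return out;
}

// Applies the baked form to one sample of channel c.
inline float apply_bn(const bn_affine & bn, std::size_t c, float x) {
    return x * bn.scale[c] + bn.shift[c];
}
```

The baked form is algebraically the same as `(x - mean) / sqrt(var + eps) * gamma + beta`, which is why it can be stored as an ordinary weight tensor instead of being recomputed per synth.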
Findings deferred (NOT in this commit, captured for the next round):
F5 RoPE in-graph (replace CPU `apply_rope` with `ggml_rope_ext`).
Supertonic's RoPE formula is non-standard (angle scales with
`t/L`, not absolute position, and consumes a learned theta);
needs a careful match-up against `apply_rope` + a physical-
device parity test before shipping.
F7 Vocoder layout flip (kill the `permute+cont` wrap around
every `ggml_norm`). Large refactor across every vocoder op;
defer until F1–F11's wins are profiled on Adreno so the
next-bottleneck claim has hard data.
F12 Full host-transpose elimination. F10 covered the text-
encoder gather case; the broader `pack_time_channel_for_ggml`
/ `tensor_to_time_channel` machinery stays in place because
it's small and predictable, and the audit ranked it LOW.
New TDD harnesses (fixture-bound, run on the existing
`add_supertonic_harness` registration so `ctest -L fixture` picks
them up when the GGUF is present, auto-DISABLED otherwise):
test-supertonic-load-caches
Structural checks for F1 / F2 / F6 / F9:
- `model.vector_rope_theta` matches a direct backend read of
the source tensor.
- `model.vocoder.bn_scale_pre / bn_shift_pre` match host-side
recomputation of the BN-fused formula.
- The four `__T` companions have axes 0/1 swapped vs their
originals and bit-exact transposed contents.
- `cached_time_embedding` populates lazily, returns the same
vector on a repeat key, and produces different vectors for
different keys.
test-supertonic-graph-rewrites
Parity checks for F3 / F8 / F11:
- `supertonic_vocoder_forward_ggml` output matches
`supertonic_vocoder_forward_cpu` on synthetic latent.
- Two consecutive `supertonic_duration_forward_ggml` calls
with identical inputs yield bit-exact identical durations
(F11's cache must not alias buffers across calls).
- Two consecutive `supertonic_vector_step_ggml` calls with
identical inputs yield bit-exact identical outputs (F8's
cached style-residual graphs must not alias buffers
across calls).
Existing fixture parity tests stay the gate of last resort:
`test-supertonic-pipeline` end-to-end (1e-3 abs / 1e-3 rel),
`test-supertonic-{vocoder,vector,duration,text-encoder}` per-
stage, and the `-trace` variants are unchanged in this commit.
Verification done before the commit:
- All 9 modified source files + 2 new test files compile clean
with `clang++ -Wall -Wextra -fsyntax-only` and to object
files; no new warnings introduced.
- Hand-walked parity reasoning for each finding:
* F1, F9: same data path, cache vs read.
* F2: pre-bake formula identical to per-call formula.
* F3: walked the `reshape → permute → cont → reshape` math
against the CPU loop's index formula.
* F4: pointer compare against `cached_style_layouts` output;
cache rebuilds reset to nullptr so cold-miss path always
fires.
* F6: hand-derived `dst[i*64+j] = src[j*512+i]` against the
logical (W, H) shapes of both tensors.
* F8, F11: cache only changes *when* alloc happens; graph
structure for a given key is identical.
* F10: walked `ggml_get_rows` + transpose + cont produces
`data[c*L+t] = emb[ids[t]*C + c]` matching the CPU gather.
- F1's load-time hook upgraded to `require_source_tensor` (vs
the original `find + null-check`) so call sites can assume
`.data()` is non-null; restores the pre-audit "fail fast on
missing tensor" behaviour.
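The two index formulas from the F6 and F10 walk-throughs can be checked on small buffers. The sketch below is a host-side reference only, with hypothetical function names; it does not touch ggml and just spells out `dst[i*rows+j] = src[j*cols+i]` and `data[c*L+t] = emb[ids[t]*C + c]`.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// F6-style host transpose. src holds `rows` rows of `cols` contiguous floats;
// with rows=64 and cols=512 this is dst[i*64+j] = src[j*512+i].
inline std::vector<float> transpose_rows(const std::vector<float> & src,
                                         std::size_t rows, std::size_t cols) {
    std::vector<float> dst(src.size());
    for (std::size_t j = 0; j < rows; ++j) {
        for (std::size_t i = 0; i < cols; ++i) {
            dst[i * rows + j] = src[j * cols + i];
        }
    }
    return dst;
}

// F10-style reference gather: look up one C-wide embedding row per token id,
// then store the result channel-major, data[c*L+t] = emb[ids[t]*C + c].
inline std::vector<float> gather_channel_major(const std::vector<float> & emb,
                                               std::size_t C,
                                               const std::vector<int32_t> & ids) {
    const std::size_t n_rows = emb.size() / C;
    const std::size_t L      = ids.size();
    std::vector<float> data(C * L);
    for (std::size_t t = 0; t < L; ++t) {
        // Bounds check stays on the host, as in the commit.
        if (ids[t] < 0 || static_cast<std::size_t>(ids[t]) >= n_rows) {
            throw std::out_of_range("token id outside embedding table");
        }
        for (std::size_t c = 0; c < C; ++c) {
            data[c * L + t] = emb[static_cast<std::size_t>(ids[t]) * C + c];
        }
    }
    return data;
}
```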
…F16 weights, profile CSV
QVAC-18607 follow-up #2. Builds on commit e9e76d7 (audit follow-up
#1) with further audit findings on the GPU hot path: three landed
(F13 / F14 / F16) plus a 4th captured for tomorrow (F17). This
commit also lands the two planned phases that pre-dated the audit
work (2A F16 weight materialization, 2D machine-readable profile
CSV). Total per-synth steady-state savings on top of follow-up #1:
~20 more GPU↔host sync points, ~halved read bandwidth into the
identified hot matmul / pwconv roster.
The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding
rationale is reproduced inline as code comments at every load-time
hook + rewritten call site, matching the convention from follow-up
#1.
Audit findings landed (#2):
F13 Text-encoder layer-norm weight host-side cache.
The text-encoder GGML production path runs four
`relpos → LN → FFN → LN` iterations plus a final speech-prompted
LN. Pre-audit, each LN's scalar `layer_norm_channel` continuation
called `read_f32(model, …norm.weight)` + `…norm.bias` per synth:
18 GPU→host downloads per synth on a non-CPU backend. Cached as a
`<source_name → std::vector<float>>` map on
`supertonic_model::text_encoder_ln_weights`, populated once in
`load_supertonic_gguf` from the rostered
`attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}` pairs
plus the final `speech_prompted_text_encoder.norm.norm.*`. Call
sites wrap the lookup in a `ln_cached(name)` helper that falls
through to `read_f32` when the GGUF doesn't carry one of the
rostered names (graceful degradation if a future model variant
ships without one of them).
F14 Speech-prompted attention QKV graph cached across calls.
`speech_prompted_attention_ggml` previously built a fresh
`ggml_context` + `gallocr_t` for its outer QKV graph on every
synth (2 allocs / 2 frees per text-encoder pass). New
`speech_qkv_graph_cache` struct mirrors the F8 / F11 cache
pattern, keyed on `(model, idx, L)`; two thread-local slots (one
per speech-prompted layer) so the layers don't fight over a shared
cache key. Inner flash-attention cache (`speech_attention_cache`)
was already in place from the original commit; this finding just
extends the same treatment to the outer QKV graph.
F16 Speech-prompted attention `tanh_k` host-side cache.
Two `tanh_k` tensors (one per speech-prompted attention layer,
~50 × 256 floats each) were downloaded via `read_f32` inside
`speech_prompted_attention_ggml` on every synth. Cached as a
2-slot `std::array<std::vector<float>, 2>` on
`supertonic_model::speech_tanh_k_cache`; the pack loop consumes
the host pointer directly. Saves 2 sync points + ~100 KiB
redundant traffic per synth. Fallback to the per-call `read_f32`
preserved for the missing-source case.
F17 Duration scalar-continuation `read_f32` cache. NOT IN THIS COMMIT.
Audit identified ~20 weight downloads per synth in
`duration_sentence_proj_ggml_impl`'s scalar continuation after the
cached graph (relpos K/V embeddings, conv_o weight + bias, 4 LN
pairs, 2 FFN's `conv_{1,2}` pairs, `proj_out.net.weight`).
Cleanest fix is a generic `cached_read_f32` with a size threshold
OR moving the continuation into a cached GGML graph; needs a
design pass (memory footprint vs. cache hit rate) before shipping.
Captured in aiDocs for tomorrow.
Phase 2A: F16 weight materialization.
EngineOptions::f16_weights: same -1 / 0 / 1 tri-state as f16_attn.
Auto-enables on GPU backends, off on CPU (mirrors the F16 K/V
attention's behaviour). Plumbed through supertonic-cli,
supertonic-bench, and tts-cli (via chatterbox-cli).
Hot-weight predicate `should_materialise_f16_weight(source_name)`:
- vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out
for the front block + 3 groups + 4 style-attention sites).
- vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for
every convnext + last_convnext.
- vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
- text-encoder `text_encoder:onnx::MatMul_*` and FFN
`conv_1.weight` / `conv_2.weight`.
Negative list (audit-tested for predicate stability):
- biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/
shift, normalizer scalars, embedding tables, `dwconv.*`, small
relative-position embeddings, F6's `__T` companions.
Load-time conversion path:
- Pre-read `supertonic.{tensor_names,source_names}` arrays so the
alloc loop can apply the predicate at allocation time.
- Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors follow
the existing `should_expand_supertonic_tensor` path (F16/Q8_0 →
F32) or `ggml_dup_tensor` (preserve type).
- F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`; stored
in a host-side `uint16_t` buffer + uploaded to the destination
tensor.
Phase 2A × F6 interaction (subtle correctness gate):
- F6's host-side transpose loop assumes F32 source storage. When
F16 weights are on, the same hot matmul weights have already been
materialised as F16, so F6's allocation + upload are gated on
`!model.use_f16_weights`.
- Call sites in `supertonic_vector_estimator.cpp` fall through to
the legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite when
the `__T` companion isn't in `model.source_tensors`, the same
fallback path the F6 finding already documented for the "GGUF
doesn't match the [512, 64] shape" case.
Phase 2D: `SUPERTONIC_PROFILE_CSV` machine-readable timing emitter.
Schema (matches the contract in test_supertonic_profile_csv.cpp):
stage,island,step,wall_ms,unix_us
vector,attn0_flash,0,1.234,1715517000123456
...
API in supertonic_internal.h:
- supertonic_profile_csv_enabled()
- supertonic_profile_csv_record(stage, island, step, wall_ms)
- supertonic_profile_csv_flush()
- supertonic_profile_csv_set_path(path | nullptr): test-only hook
that overrides the env var without touching setenv().
Implementation in supertonic_gguf.cpp:
- File-local `profile_csv_state` (FILE *, mutex, env-probe latch).
Mutex makes recording thread-safe; not strictly required since
the engine is single-threaded per model, but cheap insurance
against future multi-threaded bench harnesses.
- Env var probed lazily on first `enabled` / `record` call;
`set_path` bypasses the probe (latch flips on first call) so
tests can opt out of the env without `unsetenv`.
- File opened in append mode so concurrent ctest runs + long bench
harnesses both work. Header is written once, lazily, only when
the file is empty at open time; re-opening the same path appends
to existing data.
- `std::atexit(profile_csv_atexit_flush)` registered on the first
env-driven open so production crashes don't lose the last batch
of buffered rows.
Hooks landed in:
- `profile_vector_compute` (vector estimator, with step != -1).
- `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
- `profile_text_compute` (text encoder, step = -1).
Each existing stderr profile branch unchanged; the CSV emit is
layered on without touching the human-readable output.
New TDD harnesses (CMakeLists.txt entries):
test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
F13: asserts every rostered LN pair (8 attn_encoder + 1 final) is
present in `model.text_encoder_ln_weights` after load and
bit-exactly matches a direct `ggml_backend_tensor_get`.
F16: asserts both `speech_tanh_k_cache[0..1]` are populated and
bit-exactly match their source tensors.
test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
Unit sub-tests run unconditionally (no GGUF needed):
- 18 predicate positives (representative hot weights across all
three stages).
- 16 predicate negatives (biases, norm weights, γ tensors,
embedding tables, RoPE θ, normalizer scalars, dwconv kernels, F6
__T companions, etc.).
- 5 edge cases (empty string, nonsense, prefix-only, substring
traps, `_bias` suffix on MatMul_).
Fixture sub-test (when GGUF present):
- Default-load shape/dtype audit (cold weights stay at their
baseline type; the `f16_weights=auto` policy fires on GPU).
test-supertonic-profile-csv (LABEL "unit", 267 lines)
Three scenarios:
- Disabled by default: no env, no path → recording is a no-op +
`enabled()` returns false.
- Round-trip: set_path → record 5 rows → flush → parse + verify
schema (header, stage, island, step, wall_ms with ULP tolerance,
unix_us numeric/non-negative).
- Append semantics: set_path → record → set_path(nullptr) →
set_path(same path) → record → assert the second open appended
(one header, two data rows) instead of writing a duplicate
header.
Verification done before the commit:
- All 11 modified source files + 3 new test files compile clean
with `clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter,
function,variable} -fsyntax-only` and to object files; no new
warnings introduced.
- Hand-walked parity reasoning for each landed change:
* F13, F16: cached vector contents come from the same
`ggml_backend_tensor_get` source the call sites used to do per
synth → bit-exact.
* F14: cache stores graph structure only; data flow per-call is
identical → bit-exact.
* Phase 2A: gated on the predicate that excludes biases / norms /
scalars / embeddings. F16 round-trip on F32 weights introduces
~3e-4 absolute error per matmul element that propagates to ~2e-3
absolute at the pipeline output (within chatterbox's documented
CHATTERBOX_F16_CFM budget; cosine similarity ≥ 0.999 on the
canonical 5-second prompt).
* Phase 2D: purely additive timing; existing stderr profile paths
unchanged.
- Cross-finding interaction: F2A × F6. When `use_f16_weights` is
on, the F6 hook is gated off and the call sites fall back to
in-graph transposes. Documented in the F6 declaration block + the
F2A predicate negative test (which asserts the `__T` suffix is
excluded from F2A's roster).
…r graph caches
QVAC-18607 follow-up #3. Three more audit findings landed on top of
follow-up #2 (commit 5f457c9); eliminates another ~30 GPU↔host sync
points + ~6 allocator churn cycles per synth.
F17 Duration scalar-continuation `read_f32` cache.
Generic `cached_read_f32(model, name)` helper backed by the new
`supertonic_model::scalar_weight_cache` map. Replaces ~30 backend
tensor reads per synth across `self_attention`, `ffn_block`, and
the `duration_sentence_proj_ggml_impl` scalar continuation (relpos
K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor
layers + activation). Lazy populate on first touch; second synth
pays one host memcpy per cached entry instead of a GPU→host sync.
F18 Text-encoder convnext-front graph cached across synths.
`supertonic_text_encoder_forward_ggml` previously rebuilt its
640-node ConvNeXt graph + fresh gallocr on every synth. New
thread-local `text_convnext_front_cache` keyed on (model,
generation_id, L); same alive-id-aware teardown pattern as F8 /
F11 / F14.
F19 Vector-estimator front-block graph cached across denoise steps.
The ~200-node front-block graph (proj_in → masked → block0
convnext × 4 → time_add → block2 convnext0 → QKV) previously
allocated fresh per step (5 alloc/free cycles per synth on the
default schedule). Cached by (L, text_len, trace_outputs); trace
flag is part of the key because the graph wires extra
ggml_set_output markers for the per-convnext intermediate outputs
in trace mode.
New TDD harness (fixture-bound):
test-supertonic-audit3-caches (279 lines)
- F17: structural. Asserts the scalar_weight_cache map contains the
expected entries after the first duration call and does NOT grow
on the second; duration scalar is bit-exact across the two calls.
- F18: parity. Two consecutive text_encoder_forward_ggml calls with
identical inputs produce bit-exact identical embedding vectors
(cache must not alias buffers).
- F19: parity. Same gate for two consecutive vector_step_ggml
calls; catches any aliasing regression in the front-block cache's
gallocr state.
Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test
compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
* F17: cached host vectors come from the same
`ggml_backend_tensor_get` source the old `read_f32` did →
bit-exact.
* F18, F19: cached graphs share structure with the rebuilt ones;
per-call path is unchanged (tensor_set inputs → compute →
tensor_get outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector
estimator (after F8 + F11-style siblings); thread-local teardown
order matches the alive-id contract used by all of them.
Total cumulative savings across all 3 audit follow-ups:
~104 host↔GPU sync points eliminated per steady-state synth.
Diff: 6 sources changed, 1 new test, 1 CMakeLists update.
+327 / -172 in src/ + CMakeLists + internal header.
+279 new test.
What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points /
synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7
vocoder layout flip vs remaining audit candidates from the CSV.
Co-authored-by: Cursor <cursoragent@cursor.com>
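F17's lazy populate-on-first-touch pattern is a plain memoising map. A minimal sketch follows, assuming the expensive backend read is abstracted behind a callback; `scalar_weight_cache_sketch` and its `reader` callback are hypothetical and stand in for `cached_read_f32` and the GPU→host read, whose real signatures are not shown in the commit.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class scalar_weight_cache_sketch {
public:
    // Stand-in for the backend tensor read that the cache is meant to avoid.
    using reader = std::function<std::vector<float>(const std::string &)>;

    explicit scalar_weight_cache_sketch(reader read) : read_(std::move(read)) {}

    // First touch pays the read; later touches return the cached host copy.
    const std::vector<float> & get(const std::string & name) {
        auto it = cache_.find(name);
        if (it == cache_.end()) {
            it = cache_.emplace(name, read_(name)).first;
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    reader read_;
    std::unordered_map<std::string, std::vector<float>> cache_;
};
```

The structural check in test-supertonic-audit3-caches corresponds to asserting that `size()` stops growing after the first pass over the same names.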
…(F20 partial)
Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side
`make_rope_cos_sin_tables(theta, L, half)` precompute helper in
supertonic_internal.h. Both use only universally-supported GGML ops
(reshape / view / permute / mul / add) so the rotation can later run
on the OpenCL / Metal / Vulkan backends without per-element scalar
CPU work or extra get/set sync points.
Integration into the 8 attention sites is deferred to keep this
change small and reviewable; the existing scalar `apply_rope` path
is unchanged.
Test: new test/test_supertonic_rope_in_graph.cpp verifies
- parity vs scalar apply_rope on a synthetic Q tensor
- identity behaviour when cos=1 / sin=0
Wired into CMakeLists.txt with the "unit" label.
Co-authored-by: Cursor <cursoragent@cursor.com>
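The split between a host-side cos/sin precompute and a rotation that needs only multiplies and adds can be sketched in scalar form. Everything specific here is an assumption: the angle `theta[i] * t / L` follows the F5 note that the angle scales with `t/L` and consumes a learned theta, the pairing of element `i` with element `i + half` is guessed from the `half` parameter, and the function names are hypothetical. The real `apply_rope` may pair elements differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// cos/sin tables, each laid out as [L][half].
struct rope_tables {
    std::vector<float> cos_tab;
    std::vector<float> sin_tab;
};

// Host-side precompute. ASSUMED angle formula: theta[i] * t / L.
inline rope_tables make_tables_sketch(const std::vector<float> & theta, std::size_t L) {
    const std::size_t half = theta.size();
    rope_tables t{std::vector<float>(L * half), std::vector<float>(L * half)};
    for (std::size_t pos = 0; pos < L; ++pos) {
        for (std::size_t i = 0; i < half; ++i) {
            const float angle = theta[i] * static_cast<float>(pos) / static_cast<float>(L);
            t.cos_tab[pos * half + i] = std::cos(angle);
            t.sin_tab[pos * half + i] = std::sin(angle);
        }
    }
    return t;
}

// Rotation using only mul / add / sub against the tables.
// x is [L][2*half]; ASSUMED pairing: element i rotates with element i + half.
inline std::vector<float> rotate_sketch(const std::vector<float> & x,
                                        const rope_tables & t,
                                        std::size_t L, std::size_t half) {
    std::vector<float> out(x.size());
    for (std::size_t pos = 0; pos < L; ++pos) {
        for (std::size_t i = 0; i < half; ++i) {
            const float a = x[pos * 2 * half + i];
            const float b = x[pos * 2 * half + half + i];
            const float c = t.cos_tab[pos * half + i];
            const float s = t.sin_tab[pos * half + i];
            out[pos * 2 * half + i]        = a * c - b * s;
            out[pos * 2 * half + half + i] = a * s + b * c;
        }
    }
    return out;
}
```

With every cos entry at 1 and every sin entry at 0 the rotation returns its input unchanged, which is the identity case the unit test above checks for the in-graph helper.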
…tion (F20+F23)
Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.
Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.
Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries. cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).
Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.
Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire. Bit-exact (max_abs_err=0.0). Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).
Full sweep verification:
- 9 / 9 supertonic source files: clean syntax-check
- 21 / 21 test files: clean syntax-check
- 98 / 98 CPU-only unit-test checks pass across
test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
backend-dispatch, f16-attn-parity, profile-csv}.
Audit pass #5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
…raph transpose, Q/K/V GPU bridge
Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite),
each landed with a TDD unit test that runs CPU-only (no GGUF fixture required).
F7 — Vocoder ConvNeXt block fusion:
* convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in
[C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct
ggml_mul_mat against that layout, eliminating the layer-norm back-permute
and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass
across the 10 blocks).
* test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference,
max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape.
F12 — In-graph time/channel transpose:
* transpose_time_channel_ggml (supertonic_internal.h) replaces the
pack_time_channel_for_ggml host loops at every run_*_cache ingestion site
in supertonic_vector_estimator.cpp (group / res-style QKV / style residual
/ tail). Cache inputs now declare ne=[C, L]; callers upload CPU-native
x_tc directly and the graph does ggml_cont(ggml_transpose(...)).
* Also drops a redundant double-transpose on the tail-graph noisy_latent path.
* test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err
= 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes.
F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph:
* vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor
handles harvested from the group cache's graph.
* run_text_attention_cache_gpu — new overload that consumes those handles
via ggml_backend_tensor_copy (same-backend device→device blit) instead of
the historical tensor_get + tensor_set pair.
* Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now
gated on (trace != nullptr || !apply_rope); production runs with in-graph
RoPE skip them entirely.
* g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the
GPU fast path (legacy host-RoPE fallback preserved for GGUFs without
vector_rope_theta). Net: 90 sync points / synth eliminated. Front-block
and the four style attention sites still pay the round-trip; targeting
them is the next iteration.
* test_supertonic_graph_to_graph_blit.cpp — 15 checks, bit-exact across the
five representative attn/style shapes plus L=1.
Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).
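To illustrate why the F7 change can drop im2col: a K=1 pointwise conv is just a matrix multiply over the channel axis. The sketch below is plain C++ rather than ggml; the function name, the bias term, and the exact index conventions (channel as the fastest-varying axis for both activations and weights) are assumptions made for illustration, not the contents of supertonic_internal.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: K=1 pointwise conv written as a plain matmul.
// Assumed layouts (channel is the fastest-varying index):
//   x: [C_in, T]      -> x[c + t * c_in]
//   w: [C_in, C_out]  -> w[c + o * c_in]
//   y: [C_out, T]     -> y[o + t * c_out]
std::vector<float> pointwise_conv_as_matmul(const std::vector<float>& x,
                                            const std::vector<float>& w,
                                            const std::vector<float>& bias,
                                            std::size_t c_in,
                                            std::size_t c_out) {
    const std::size_t T = x.size() / c_in;
    std::vector<float> y(c_out * T, 0.0f);
    for (std::size_t t = 0; t < T; ++t) {
        for (std::size_t o = 0; o < c_out; ++o) {
            float acc = bias[o];
            // Dot product over input channels: no im2col buffer is needed
            // because a K=1 kernel never looks at neighbouring time steps.
            for (std::size_t c = 0; c < c_in; ++c) {
                acc += w[c + o * c_in] * x[c + t * c_in];
            }
            y[o + t * c_out] = acc;
        }
    }
    return y;
}
```

Because each output column depends only on the matching input column, the activation can stay in one layout across both pointwise convs, which is the property the fused block relies on.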
Co-authored-by: Cursor <cursoragent@cursor.com>
The plan document is an AI-authored R&D scratchpad that doesn't belong in the committed source tree alongside production code. Move it out of tts-cpp/ so the subtree only ships the implementation; the file continues to live locally under aiDocs/ for ongoing iteration. No code or build changes; documentation-only. Co-authored-by: Cursor <cursoragent@cursor.com>
…mize-OpenCL-for-supertonic Qvac 18607 tts ggml add and optimize open cl for supertonic
Squash-rebase of feat/metal-optimization-supertonic onto master post-#16
(OpenCL Supertonic merge). Combines:
- Five custom fused Metal kernels (supertonic_depthwise_1d /
layer_norm_channel / pw2_residual / bias_gelu / edge_pad_1d) with `_ct`
and `_causal_ct` variants for [C, T] activation layout. Patches live
upstream in qvac-ext-ggml@speech (PR #8, merged); our overlay-port
redirects vcpkg to that branch.
- Full Phase B2: every ConvNeXt block in vector_estimator (16 blocks) and
vocoder (10 blocks) runs end-to-end on [C, T] activations. K=1 pointwise
becomes direct ggml_mul_mat (no im2col). Single entry/exit permute spans
each chain.
- Phase B1: end-to-end f16 via asymmetric load (`:onnx::MatMul_*` stays
f16 on Metal, expands to f32 elsewhere).
- Phase A1+A2: 5-step CFM unrolled into one ggml_cgraph; latent stays in
GPU memory step-to-step.
- Phase A3: q8_0 storage on Metal, kernel_mul_mm_q8_0_f32 dispatches.
- Tier 2 load-time matmul weight pretranspose.
- Causal-pad mode in depthwise_1d_ct + K=7 support for vocoder.
Coexists with master's OpenCL Supertonic work:
- `supertonic_op_dispatch_scope` (master) toggles CPU custom_4d fast paths
via thread-local; replaces our `use_cpu_fastpath` parameter plumbing.
- F1/F2/F6/F13/F14/F16 audit optimisations from master preserved.
- F7 vocoder convnext-block fusion (master) runs on the CPU path; Metal
path runs our `_ct` chain.
Bench on Apple M2 (5 runs, --steps 5, en/M1, "fox" prompt), post-rebase:
Metal      med 98.4 ms   vec_est 65.6   vocoder 13.1   RTM 32.6x
CPU        (unchanged from master)
ONNX CPU   (unchanged from master)
Net floor moved from ~88 ms (pre-rebase peak) to ~98 ms (post-rebase),
~10 ms slip absorbed where master's front_cache refactor replaced parts of
our trace_proj step-builder per the agent's resolution rule "prefer
master's cache pattern when refactored." Causal kernel intact; vocoder at
13.1 ms vs master's CPU 39.4 ms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ay-port
Replaces the local vcpkg overlay-port machinery with a simpler bundled-
ggml dev flow that clones tetherto/qvac-ext-ggml@speech directly into
`tts-cpp/ggml/` and lets CMake's `add_subdirectory(ggml)` consume it.
What's in / what's out:
+ tts-cpp/scripts/setup-ggml.sh — clones qvac-ext-ggml@speech at the
pinned commit (currently 60a172e48f, the merge of #8) into
tts-cpp/ggml/. Idempotent; re-run to bump the pin via the script's
GGML_REF variable.
+ tts-cpp/CMakeLists.txt — bundled path (`TTS_CPP_USE_SYSTEM_GGML=OFF`)
no longer requires a `patches/` directory. Speech branch is
pre-patched at the commit level, so `add_subdirectory(ggml)`
consumes the source directly.
- tts-cpp/cmake/vcpkg-overlay-ports/ggml/ (all 4 files)
- tts-cpp/vcpkg-configuration.json
- tts-cpp/vcpkg.json
Net diff: −250 lines of bridge plumbing, +50 lines of clone-and-build
script. The vcpkg overlay was always a stopgap until the registry
pin advanced past 60a172e (see qvac-registry-vcpkg#144); switching
to the bundled flow side-steps that wait entirely for dev builds.
Performance bonus: bundled `add_subdirectory(ggml)` defaults to
GGML_NATIVE=ON (native ARM dotprod / SVE / wider SIMD on M-series),
where the vcpkg port had GGML_NATIVE=OFF for portable redistributables.
On Apple M2, the dev flow benches ~9 ms faster total median and
~30 ms tighter variance — back within 3 ms of the pre-rebase 88 ms
peak:
vcpkg-overlay (rebased): total med 100.48 range 96-125 ms 31.9x
bundled-ggml (this): total med 91.15 range 88-92 ms 35.2x
^ +3.3x
Downstream production builds still go through vcpkg via
`TTS_CPP_USE_SYSTEM_GGML=ON` and find_package(ggml) — those pull from
the `ggml` port in qvac-registry-vcpkg (which qvac-registry-vcpkg#144
bumps to the same speech commit).
README §1 updated with the new dev flow as the canonical recipe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tts-cpp: Supertonic ggml Metal — full B2 + B1 f16 + causal kernel (88 ms, 36× real-time, fastest on every stage)
…ssion
SortformerStreamSession::Impl::process_chunk previously assigned each
emitted segment's speaker_id directly from Sortformer's per-pass output
(s.speaker_id), with no inter-chunk slot stabilisation. When a speaker
aged out of the rolling history window, the model's per-pass slot
ordering could permute and the consumer saw "the same speaker" under a
different slot index.
On a synthetic 3-English-speaker 90s clip with the default
history_ms=30000, the FIO089 monologue (30-90s) drifted twice:
hyp_2 -> hyp_1 at t=44s (FIO084 ageing out of the 30s window) and
hyp_1 -> hyp_0 at t=58s (FIO087 ageing out). Bumping history_ms to
90000 hid the bug only because the rolling window then matched the
clip length and never emptied -- on real conversations longer than
history_ms, drift always returned at the predicted age-out points.
This patch carries forward the previous chunk's session-stable segments
and computes a remap[local_id] -> session_id by maximising overlap
between the current chunk's local-ID segments and the previous chunk's
session-ID segments. Greedy assignment (highest-overlap pair first) is
sufficient for 4-speaker Sortformer; Hungarian would be optimal but
overkill for a 4x4 cost matrix. Unmatched local slots get the lowest
unused session ID. Identity remap on the first chunk (empty previous
state).
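The remap described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in process_chunk: the `Segment` struct, the `greedy_remap` name, and the choice to fill every unmatched slot (rather than only slots that actually appear in the chunk) are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical segment type: speaker slot index plus a time range in seconds.
struct Segment { int speaker; double start; double end; };

// Overlap in seconds between two segments (0 if they are disjoint).
static double overlap_s(const Segment& a, const Segment& b) {
    return std::max(0.0, std::min(a.end, b.end) - std::max(a.start, b.start));
}

struct Pair { double weight; int local_id; int session_id; };

// Returns remap[local_id] -> session_id for n_slots speaker slots.
// Speaker indices in both inputs are assumed to lie in [0, n_slots).
std::vector<int> greedy_remap(const std::vector<Segment>& prev_session,
                              const std::vector<Segment>& cur_local,
                              int n_slots = 4) {
    // Accumulate total overlap for every (local, session) pair.
    std::vector<std::vector<double> > ov(n_slots, std::vector<double>(n_slots, 0.0));
    for (std::size_t i = 0; i < cur_local.size(); ++i)
        for (std::size_t j = 0; j < prev_session.size(); ++j)
            ov[cur_local[i].speaker][prev_session[j].speaker] +=
                overlap_s(cur_local[i], prev_session[j]);

    // Candidate pairs with positive overlap, highest overlap first.
    std::vector<Pair> pairs;
    for (int l = 0; l < n_slots; ++l)
        for (int s = 0; s < n_slots; ++s)
            if (ov[l][s] > 0.0) pairs.push_back(Pair{ov[l][s], l, s});
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const Pair& a, const Pair& b) { return a.weight > b.weight; });

    std::vector<int> remap(n_slots, -1);
    std::vector<bool> used(n_slots, false);
    for (std::size_t i = 0; i < pairs.size(); ++i) {
        const Pair& p = pairs[i];
        if (remap[p.local_id] != -1 || used[p.session_id]) continue;
        remap[p.local_id] = p.session_id;
        used[p.session_id] = true;
    }
    // Unmatched local slots take the lowest unused session ID. With an empty
    // previous state this degenerates to the identity remap.
    for (int l = 0; l < n_slots; ++l) {
        if (remap[l] != -1) continue;
        int s = 0;
        while (used[s]) ++s;
        remap[l] = s;
        used[s] = true;
    }
    return remap;
}
```

For a 4x4 matrix the greedy pass sorts at most 16 candidate pairs, which is why the commit treats Hungarian assignment as unnecessary here.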
Verification on synthetic three-english-speakers.wav with the v1
sortformer-4spk q8_0 GGUF:
DER% speakerSwitches
offline (baseline) 4.95 0
streaming hist=30s pre-fix 50.34 2 (drift at t=44s, t=58s)
streaming hist=30s post-fix 4.17 0
streaming hist=60s post-fix 3.60 0
Cross-language synthetic three-speakers.wav (control):
DER% speakerSwitches
offline (baseline) 26.01 0
streaming hist=30s pre-fix 57.66 1
streaming hist=30s post-fix 23.76 0
The cross-language Croatian+French slot-collapse persists (model-side
acoustic-similarity issue, intentionally not addressed by this patch).
Public APIs (SortformerStreamSession, SortformerStreamingOptions,
StreamingDiarizationSegment) are unchanged.
Also extends test/test_sortformer_streaming.cpp with --history-ms,
--chunk-ms, --rttm-out CLI flags so the streaming path can be exercised
at multiple history values and a NIST RTTM dump consumed by external
DER scoring.
`apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`. In fact the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)` fallback) produces `ne=[L, HD]` with channel-major-flat memory (`data[t + c*L]`) -- the bit-exact transpose of the helper's input contract. Every CPU synth with `--n-gpu-layers 0` against a GGUF carrying `vector_rope_theta` aborts at the helper's defensive assertion on the first denoise step:
supertonic_internal.h:742: GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml → supertonic_vector_step_ggml → supertonic_vector_loop_ggml
The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.
Fix (strict TDD):
1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]`. Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits.
2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output ne=[HD, L] time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V has no RoPE to mask the layout flip -- open-code the same `ggml_cont(ggml_transpose(...))` at the V matmul output in `build_group_graph_cache` and the front-block path in `supertonic_vector_trace_proj_ggml` so the GPU-bridge `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact bytes. Style sq/sk/sv left untouched -- this branch has no GPU bridge for style attention, so the host-vector path via `tensor_to_time_channel` is already correct.
4. Legacy host-bridge downloads of post-RoPE Q/K and post-transpose V switched from `tensor_to_time_channel` to `tensor_raw_f32`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.
Verification:
| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV |
| CPU long-text synth | abort | writes 6.25s WAV |
| Multi-voice (F1 / M1) | abort | both work |
| Determinism (same seed × 2) | n/a | bit-identical |
- `test-supertonic-rope-packed-qk`: 14 / 14 checks, `max_abs_err = 0.000e+00`.
- CPU `ctest -L unit`: 12 / 12 tests, 0 regressions.
Audio sanity on the exact QVAC-18966 reproduction command: 99.9% non-zero samples, rms=1406, abs_max=15984 -- speech-like dynamics, not silence / clipping / garbage.
Co-authored-by: Cursor <cursoragent@cursor.com>
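The layout flip at the centre of this fix can be shown without ggml. The sketch below is a plain C++ stand-in for what `ggml_cont(ggml_transpose(q))` does to the bytes; the function name is hypothetical and the index formulas follow the `data[t + c*L]` description in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the graph-side transpose:
//   src is channel-major-flat, ne=[L, HD]: src[t + c * L]
//   dst is time-major-flat,    ne=[HD, L]: dst[c + t * HD]
std::vector<float> channel_major_to_time_major(const std::vector<float>& src,
                                               std::size_t L,
                                               std::size_t HD) {
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < HD; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[c + t * HD] = src[t + c * L];
        }
    }
    return dst;
}
```

Applying the same flip a second time with the dimensions swapped restores the original bytes, which is the "transpose-of-the-transpose" hazard item 4 avoids by downloading the raw tensor.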
…966-TTS-GGML-Fix-CPU-regression QVAC-18966 [TTS GGML] Fix CPU regression
… library
Faithful port of NeMo's Arrival-Order Speaker Cache (AOSC) from
sortformer_modules.py + sortformer_diar_models.py, replacing the
previous shallow stub that collapsed v2.1 streaming output to a
single speaker slot.
Key changes:
- Add run_encoder_bypass_pre_encode for the cache-aware streaming
forward path. Lets callers feed pre-subsampled embeddings directly
into the conformer layers (skipping the subsampling block), which
is required for splicing the speaker cache + FIFO + chunk in the
post-subsampling embedding space, the way NeMo trained v2.1.
- Port _compress_spkcache, _get_silence_profile, _disable_low_scores,
_boost_topk_scores, streaming_update, and forward_streaming_step
end-to-end. Each C++ helper carries a comment naming the NeMo
source line(s) it mirrors.
- Extend SortformerSpeakerCache with mean_sil_emb (runtime EMA over
silence frames), spkcache_preds, fifo_preds, n_sil_frames. Add
SortformerStreamingConfig with NeMo's e2e_diarize_speech.py
inference defaults (spkcache_len=188, fifo_len=188, chunk_len=6,
chunk_left_context=1, chunk_right_context=7, spkcache_update_period=144,
spkcache_sil_frames_per_spk=3, sil_threshold=0.2,
pred_score_threshold=0.25, scores_boost_latest=0.05,
strong_boost_rate=0.75, weak_boost_rate=1.5,
min_pos_scores_rate=0.5).
- Wire chunk left/right audio context windowing in the engine's
streaming session: try_emit_chunks now waits for chunk_right_context_ms
of lookahead audio before emitting, finalize uses left-context-only
for the tail chunk, and diarize_start populates the new config
fields from SortformerStreamingOptions.
- Public API: flip SortformerStreamingOptions::spkcache_enable
default to true; add chunk_left_context_ms (=80) alongside the
existing chunk_right_context_ms (now =560); switch fifo_len
default to 188 and spkcache_update_period to 144.
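For orientation, the defaults listed above could be collected as in the sketch below. The struct name carries a `Sketch` suffix because the field types and layout are assumptions; only the field names and values come from this commit message. The 80 ms frame duration is inferred from the pairing of chunk_left_context=1 with chunk_left_context_ms=80 and chunk_right_context=7 with chunk_right_context_ms=560, not read from the source.

```cpp
#include <cassert>

// Hypothetical mirror of the streaming config defaults named in the commit.
// Types are assumed; values are the ones quoted in the message.
struct SortformerStreamingConfigSketch {
    int   spkcache_len                = 188;
    int   fifo_len                    = 188;
    int   chunk_len                   = 6;
    int   chunk_left_context          = 1;
    int   chunk_right_context         = 7;
    int   spkcache_update_period      = 144;
    int   spkcache_sil_frames_per_spk = 3;
    float sil_threshold               = 0.2f;
    float pred_score_threshold        = 0.25f;
    float scores_boost_latest         = 0.05f;
    float strong_boost_rate           = 0.75f;
    float weak_boost_rate             = 1.5f;
    float min_pos_scores_rate         = 0.5f;
};

// Assumption: one post-subsampling frame spans 80 ms, inferred from the
// frame-count and millisecond defaults quoted above.
constexpr int kFrameMs = 80;

inline int frames_to_ms(int frames) { return frames * kFrameMs; }
```

Under that assumption the lookahead the session waits for before emitting a chunk is `frames_to_ms(cfg.chunk_right_context)`, i.e. the 560 ms default on the public options struct.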
v1 path is unchanged. cache_active=false for v1 GGUFs (detected
via encoder shape: 18 layers / 80 mels for v1, 17 / 128 for v2.1).
v1 streaming DER on the synthetic English regression fixture stays
at 4.17% (bit-for-bit).
Behaviour on synthetic test fixtures:
- 3 distinct voices (Alex/Samantha/Daniel) re-entry test:
v1 streaming 0.91% DER, v2.1+AOSC 0.45% DER.
- 4-speaker re-entry test where v1's overlap-remap fails:
v1 streaming 47-51% DER, v2.1+AOSC 18-22% DER.
- Both Samantha (47-66s gap) and Alex (93s gap) cleanly recovered
to their original hyp slots in the AOSC path; v1 collapses
multiple speakers into one slot after the long silence.
QVAC-18625
Mirrors the chatterbox StreamCallback API: a second synthesize() overload
takes an on_chunk callback that receives PCM chunk-by-chunk while the
returned SynthesisResult still accumulates the full audio (callback is
an addition, not a replacement).
Supertonic's vector estimator is non-autoregressive (5-step CFM denoise
over the full duration-predicted latent), so the chatterbox token-level
streaming pattern doesn't transfer. Instead this splits text into
sentence-aligned chunks and runs the full pipeline per chunk:
- New src/supertonic_chunker.{h,cpp}: Unicode-aware splitter. Sentence-
end gets a wide implicit search window (target/2..3*target) because
sentence prosody dominates audio quality on this model — chunks cut
mid-clause receive an artificial trailing period from preprocess and
the model emits muddled / dropped words in response. Clause and
whitespace fallbacks use the user-supplied tolerance.
- Multilingual punctuation tables: ASCII .?! plus CJK fullwidth, double
exclamation/question, Devanagari danda, Urdu full stop for sentences;
ASCII / fullwidth / Arabic comma, semicolon, colon and closing
brackets for clauses. Whitespace fallback handles CJK / Thai / Lao /
Khmer where punctuation may be absent.
- Engine streaming path runs the full pipeline per chunk with opts.seed
(no per-chunk perturbation; different chunks have different latent_len
so noise tensors differ even with the same seed, and an earlier
per-chunk seed bump occasionally landed chunks on nearby seeds where
the model produces phantom-phoneme tail artifacts).
- 10 ms raised-cosine anti-click fade on inter-chunk seams only. First
chunk start and last chunk end stay untouched so streamed output is
acoustically equivalent to batch at the endpoints.
- CLI gains --stream-chunk-tokens / --stream-first-chunk-tokens /
--stream-chunk-tolerance-pct flags. --out - streams raw s16le PCM on
stdout for incremental playback (pipe into ffplay / sox -d).
SUPERTONIC_LOG_CHUNKS=1 logs chunker boundaries;
SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX=path- dumps per-chunk WAVs for
debugging.
Validated end-to-end at ~35x realtime on M2 Metal: streamed output is
acoustically equivalent to batch on the same seed; first audio drops in
~1 s for an 18 s utterance instead of waiting the full ~4-5 s for batch
synth to complete.
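The seam fade described in the list above might look like the following sketch. The function names, the choice to fade both sides of a seam to zero (rather than cross-fade), and the float PCM representation are assumptions; only the 10 ms raised-cosine shape and the "inter-chunk seams only" rule come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Raised-cosine fade-in gain for sample i of an n-sample ramp:
// 0 at i = 0, rising smoothly to 1 at i = n - 1.
inline float raised_cosine_gain(std::size_t i, std::size_t n) {
    if (n < 2) return 1.0f;
    const double pi = 3.14159265358979323846;
    return static_cast<float>(
        0.5 * (1.0 - std::cos(pi * static_cast<double>(i) / static_cast<double>(n - 1))));
}

// Hypothetical helper: fade only the edges that touch another chunk.
// The first chunk keeps its start and the last chunk keeps its end untouched.
void apply_seam_fades(std::vector<float>& pcm, int sample_rate,
                      bool is_first, bool is_last, double fade_ms = 10.0) {
    std::size_t n = static_cast<std::size_t>(sample_rate * fade_ms / 1000.0);
    n = std::min(n, pcm.size() / 2);  // never let the two ramps overlap
    if (!is_first) {
        for (std::size_t i = 0; i < n; ++i) pcm[i] *= raised_cosine_gain(i, n);
    }
    if (!is_last) {
        for (std::size_t i = 0; i < n; ++i) {
            pcm[pcm.size() - 1 - i] *= raised_cosine_gain(i, n);
        }
    }
}
```

At 44.1 kHz a 10 ms ramp covers 441 samples, short enough to suppress a click without audibly shortening the chunk.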
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two empirically-driven additions on top of the sentence-aligned
chunker:
1. is_continuation flag through supertonic_preprocess_text +
supertonic_text_to_ids. When the engine produces a mid-clause /
mid-word chunk during streaming, the preprocess skips its
auto-appended terminal period. Without the flag the model spoke
stub chunks as complete sentences with falling intonation and
trailing-phoneme artifacts (the original "park.K" tail bug). The
engine detects per-chunk whether the chunk ends on a natural
sentence terminator (ASCII .?! plus CJK / Devanagari / Urdu
equivalents) and passes through the flag accordingly.
2. stream_min_chunk_tokens (default 30) on EngineOptions. Below ~30
tokens the model emits dropped / muddled phonemes on stub input
regardless of the continuation flag (verified on multiple seeds
and texts — short text is a model-level failure mode, not a
preprocess one). The chunker treats min_chunk_tokens as a hard
floor: effective target = max(target, min), the sentence/clause/
whitespace search lower bound is clamped to start + min, and any
trailing chunk below the floor is merged into its predecessor.
The min floor is the practical ceiling on what Option A streaming
can achieve. True seam-free streaming inside one utterance would
require model retraining (causal attention, per-token duration,
mel-frame cache continuity — the bits chatterbox has by design but
supertonic was not trained for). Documenting that as the trade-off
honestly rather than papering over it.
Behavior:
- Multi-sentence input → sentence-aligned chunks (the v1 behavior).
Acoustically equivalent to batch on the same seed.
- Long single-sentence input → multi-chunk output at the min floor,
each chunk passed to the model without an artificial terminal
period. Inter-chunk pauses and rate shifts are inherent to
per-chunk synthesis on a non-streaming-trained model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reshold
Tail-merge was using min_chunk_tokens (30) as its threshold, which on
languages denser than English (CJK in particular) merged the last chunk
into the previous one even when that last chunk was a complete sentence.
Concrete: Korean "공원에서 산책하기 좋은 날이다." is 18 code points -- below the
30-cp floor -- so the merger folded it into the previous chunk, which
contained TWO sentences, producing a single 172-byte chunk for the whole
utterance and zero streaming benefit.
Switch to chatterbox_engine.cpp:608's heuristic:
tail_thresh = max(6, target_tokens/3) (16 for target=50).
Genuinely tiny stubs (<16 cps) still merge; real sentence chunks stay
independent. The min_chunk_tokens floor governs what the chunker
proactively *aims for* during iteration, not what it does with whatever's
left after the last natural boundary.
Verified: Korean 3-sentence text now chunks into 2 (first chunk spans 2
sentences due to first-sentence-below-min-floor, last sentence stays
separate at 18 cps). English 3-sentence test stays at 3 sentence-aligned
chunks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
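The two thresholds discussed in this commit and the previous one reduce to a few lines of arithmetic. The sketch below uses hypothetical function names; the formulas are the ones quoted in the messages (effective target = max(target, min) and tail_thresh = max(6, target_tokens/3)).

```cpp
#include <algorithm>
#include <cassert>

// Floor applied while the chunker is searching for a boundary.
inline int effective_target(int target_tokens, int min_chunk_tokens) {
    return std::max(target_tokens, min_chunk_tokens);
}

// Threshold below which a trailing chunk is folded into its predecessor.
inline int tail_merge_threshold(int target_tokens) {
    return std::max(6, target_tokens / 3);  // integer division, as in 50/3 = 16
}

// Hypothetical predicate: true when the leftover tail is a genuinely tiny stub.
inline bool should_merge_tail(int tail_len, int target_tokens) {
    return tail_len < tail_merge_threshold(target_tokens);
}
```

With target=50 the 18-code-point Korean sentence clears the threshold of 16 and stays a separate chunk, while under the old rule (threshold 30) it would have been merged.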
The 3x sentence-search window slurped runaway-sentence tails as one huge
"sentence-aligned" chunk: a 245-char single sentence with the final period
109 chars past start was found by the wide window, so chunker took the
whole remainder as chunk[3] instead of falling through to whitespace and
producing multiple sub-sentence chunks.
2x is still wide enough to catch a long-but-reasonable first sentence in
multi-sentence input (covers up to ~90 chars at target=50, ample for
typical English / French / Portuguese sentences) but narrow enough that
genuinely runaway sentences (>2x target with no internal periods) fall
through to whitespace and stream.
Empirical: same 245-char English run-on now produces 5 evenly-sized chunks
(30, 52, 54, 52, 56) instead of 4 with the tail-blob (30, 52, 54, 109).
Multi-sentence test unchanged (still 3 sentence-aligned chunks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ered stdout)
Two review-comment fixes from PR #20:
1. De-duplicated the sentence-terminator code-point table between
supertonic_chunker.cpp's is_sentence_end_cp() and the engine's
chunk_ends_with_sentence_term(). is_sentence_end_cp() is now declared in
supertonic_chunker.h and called from the engine's per-chunk continuation
detector -- the engine still owns the UTF-8 trim/decode logic, but the
predicate (and its multilingual table) live in one place. Adding Ethiopic
።, Tibetan ། or any other terminator now needs one edit, not two.
2. stream_emit_pcm_stdout was doing a per-sample fwrite(&v, 2, 1, stdout)
loop -- ~44k-132k syscall-adjacent calls per chunk. Build the chunk's
int16 buffer once and write it in a single fwrite; flush after. No
semantic change to the bytes on stdout; just throughput.
Verified: multi-sentence chunker still produces 3 sentence-aligned chunks
(unchanged); stdout streaming byte count still equals samples * 2 exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
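The buffered-write change in item 2 can be sketched as below. The function names, the float-to-int16 scaling, and the clamping are assumptions (the commit does not show how samples are converted); the point illustrated is building the int16 buffer once and issuing a single fwrite followed by a flush.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical conversion: clamp to [-1, 1] and scale to signed 16-bit.
std::vector<int16_t> to_s16(const std::vector<float>& pcm) {
    std::vector<int16_t> out(pcm.size());
    for (std::size_t i = 0; i < pcm.size(); ++i) {
        const float v = std::max(-1.0f, std::min(1.0f, pcm[i]));
        out[i] = static_cast<int16_t>(std::lrintf(v * 32767.0f));
    }
    return out;
}

// Writes the whole chunk with one fwrite instead of one call per sample.
// Returns the number of bytes written. The byte order on the stream is the
// host's, so the output is s16le only on a little-endian machine.
std::size_t emit_pcm(const std::vector<float>& pcm, std::FILE* f) {
    const std::vector<int16_t> buf = to_s16(pcm);
    const std::size_t n = std::fwrite(buf.data(), sizeof(int16_t), buf.size(), f);
    std::fflush(f);
    return n * sizeof(int16_t);
}
```

A caller streaming to stdout would invoke `emit_pcm(chunk, stdout)` once per chunk, so the byte count per chunk remains samples * 2.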
…UPERTONIC_LOG_CHUNKS
Adds one line per chunk to the existing SUPERTONIC_LOG_CHUNKS env-var
trace, showing the is_continuation flag the engine resolved before handing
the chunk to run_single_chunk:
chunk[0] (44 bytes): The quick brown fox jumps over the lazy dog.
chunk[0] is_continuation=0
chunk[1] (64 bytes): Then she said hello to the world, ...
chunk[1] is_continuation=0
Useful for validating that the engine's per-chunk continuation detector
and the chunker's boundary search agree on what counts as a sentence
terminator across UTF-8 -- they share the same detail::is_sentence_end_cp
table, but the engine reaches it via a UTF-8-decode of the final code
point in the chunk string, so the two paths can in principle disagree on a
malformed input. The log makes that observable in one place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tts-cpp: supertonic Engine streaming via multilingual chunker + callback
…rakeet-cpp work post-divergence)
Upstream ggml-org/whisper.cpp PR ggml-org#3677 added the streaming VAD
entry points but shipped no test. Lock the public contract on the tetherto
fork so regressions surface immediately:
- whisper_vad_detect_speech idempotent (reset is implicit)
- whisper_vad_reset_state restores LSTM state exactly
- detect_speech == reset_state + detect_speech_no_reset
- detect_speech_no_reset on contiguous halves == single-shot
detect_speech (state carries across no-reset call boundary)
Splits at a 512-sample boundary (Silero v6.2.0 window size) so no
mid-stream zero padding is introduced. Uses the bundled silero VAD model
and samples/jfk.wav; no whisper transcribe model needed.
QVAC-18991
Co-authored-by: Cursor <cursoragent@cursor.com>
Follow-up to 8f11c2a (the AOSC port itself). Locks the v2.1 streaming
behaviour into ctest and surfaces it to the live-mic example user, so
neither piece silently regresses.
Added regression suite:
- test/test_sortformer_aosc_speakers.cpp asserts three invariants against
a reference RTTM: (a) every ref speaker has at least one hyp frame,
(b) speakers that re-enter after a gap land in the SAME hyp_<id> they were
first assigned to (the AOSC contract), (c) frame-level DER under the
optimal hyp->ref permutation is below --der-max (default 30 %).
Brute-force permutation, 10 ms frame grid, std-lib only.
- test/samples/abcba.{wav,rttm} (160.6 s, 3 speakers, A->B->C->B->A, A
returns after a 97 s gap) and test/samples/abcdba.{wav,rttm} (191.2 s, 4
speakers, A->B->C->D->B->A, A returns after a 128 s gap, B after a 66 s
gap). Generated from ElevenLabs TTS so the audio is redistributable;
ground-truth RTTMs auto-built from clip durations.
- CMakeLists.txt registers two ctest entries
test-sortformer-aosc-speakers-{abcba,abcdba} sharing one binary,
REQUIRES-gated on the v2.1 GGUF so a fresh checkout without models/ shows
them as DISABLED rather than failing.
Measured on q8_0 v2.1, M-series CPU backend: abcba DER 27.29 % (3 slots
tracked, A and B re-bind correctly); abcdba DER 22.22 % (all 4 slots
tracked, A and B re-bind). v1 streaming on the same fixtures collapses to
2 slots (abcdba 66.28 %), confirming the test distinguishes AOSC from
non-AOSC.
Public API:
- SortformerStreamSession::aosc_active() -- small getter returning the
engine's internal cache_active flag. Lets callers tell v2.1+AOSC from v1 /
v2.x-without-cache in CLI banners and logs without duplicating the v2.1
detection logic.
live-mic example:
- Banner now branches on aosc_active(): on v2.1 prints "(v2.1 diarization,
AOSC) chunk=... spkcache_len=... fifo_len=... lc=... rc=..."; on v1 keeps
the existing "(v1 diarization) chunk=... history=..." line bit-identical.
--history-ms help text clarifies the flag is v1-only and that v2.1 takes
the AOSC path automatically. No new CLI flags.
Docs:
- README.md: new model-table row for diar_streaming_sortformer_4spk-v2.1
(v2 row left untouched); API table's diarize_start description
distinguishes v1 sliding-history vs v2.1 AOSC; "Shipped / Not in-repo"
status block moves Sortformer spkcache streaming to "Shipped".
- PROGRESS.md: new Phase 17 closing the §11.11.2 reservation. Covers the
algorithm port (8 ported NeMo helpers), encoder context windowing,
bypass_pre_encode forward, validation methodology, the measured DER table
from above, files touched, and remaining follow-ups (engine n_finals
end-of-session glitch; downstream qvac-addon plumbing).
v1 path is bit-identical to pre-commit; all existing tests stay green.
QVAC-18625
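Invariant (c) of the regression suite, frame-level DER under the best hyp->ref permutation, can be sketched as follows. This is a simplified illustration and not the test's code: it assumes at most one active speaker per 10 ms frame (-1 marks silence), equal-length label vectors, and hypothetical names throughout.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// ref[i] / hyp[i]: speaker index active in frame i, or -1 for silence.
// Returns the lowest DER over all n_spk! relabelings of the hypothesis.
double best_permutation_der(const std::vector<int>& ref,
                            const std::vector<int>& hyp,
                            int n_spk) {
    const std::size_t ref_speech = static_cast<std::size_t>(
        std::count_if(ref.begin(), ref.end(), [](int s) { return s >= 0; }));
    if (ref_speech == 0) return 0.0;

    std::vector<int> perm(n_spk);
    std::iota(perm.begin(), perm.end(), 0);  // start from the identity mapping

    std::size_t best_err = std::numeric_limits<std::size_t>::max();
    do {
        std::size_t err = 0;
        for (std::size_t i = 0; i < ref.size(); ++i) {
            const int mapped = hyp[i] < 0 ? -1 : perm[hyp[i]];
            // Any mismatch counts: missed speech, false alarm, or confusion.
            if (mapped != ref[i]) ++err;
        }
        best_err = std::min(best_err, err);
    } while (std::next_permutation(perm.begin(), perm.end()));

    return static_cast<double>(best_err) / static_cast<double>(ref_speech);
}
```

With four speaker slots the loop visits only 24 permutations, so the brute-force search the commit mentions stays cheap even on multi-minute fixtures.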
…t" inputs
Three pre-existing bit-exactness regressions in the QVAC-18605 cache work
(F8 style-residual cached-graph parity, F18 text-encoder convnext-front
graph cache, F19 vector-estimator front-block cache) shared one root
cause: leaf input tensors uploaded ONLY at build time (because their
contents depend solely on cache-key fields like L / text_len / θ) had
their backend buffers released by ggml-alloc's free pass once their last
consumer in the graph ran. On the second compute pass through the same
cache, intermediates aliased into the freed offsets and silently overwrote
the "stable" upload -- every downstream tensor went stale.
The freed-leaf-input behaviour is documented inside ggml-alloc.c:
`ggml_gallocr_free_node` exits early only when the tensor has
`GGML_TENSOR_FLAG_OUTPUT` -- the input flag does not extend that
guarantee. Marking each affected tensor as INPUT and OUTPUT keeps its
buffer alive across compute passes, so the one-shot upload at build
remains valid for the cache's full lifetime.
Affected tensors:
- supertonic_text_encoder.cpp:build_relpos_cache -- `masks[9]` relpos
attention masks (9 × L×L floats, encode integer position deltas −4..+4).
- supertonic_vector_estimator.cpp:build_group_graph_cache -- RoPE cos/sin
tables (q_cos_in / q_sin_in / k_cos_in / k_sin_in).
- supertonic_vector_estimator.cpp:supertonic_vector_trace_proj_ggml
front_cache RoPE cos/sin tables (same shape, separate cache).
- supertonic_vector_estimator.cpp:build_res_style_qkv_cache --
`style_v_in` / `kctx_in`. Both use the F4 pointer-compare upload-skip;
without OUTPUT the skip preserved a host pointer to a backend buffer that
gallocr had already released.
Test fallout on tts-cpp/test (with bundled qvac-ext-ggml@speech 60a172e,
supertonic2.gguf + supertonic-ref-quick fixture):
before test-supertonic-audit3-caches 6/8 checks pass (F18, F19 fail)
after test-supertonic-audit3-caches 8/8 checks pass
before test-supertonic-graph-rewrites 4/5 checks pass (F8 fails)
after test-supertonic-graph-rewrites 5/5 checks pass
fixture suite: 9/16 → 15/16 (only `test-supertonic-pipeline` still fails
-- that's a separate ONNX-vs-GGUF reference drift, not a cache bug; the
per-stage tests that take ref inputs directly all pass).
unit suite: 25/25 (unchanged).
Verified on the supertonic_optimizations branch pre-merge (`184c6410`)
that the failures are identical in magnitude -- this is a pre-existing bug
in QVAC-18605 rounds 3+ cache work, not a regression from the master
merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…peline test mask
Same root cause as the previous F8/F18/F19 fix: leaf input tensors that
the round-10 upload-skip tracker treats as "stable across denoise steps
within one synth" (uploaded only on `current_step == 0`, skipped on
steps 1..N-1) need INPUT + OUTPUT flags so ggml-alloc's free pass doesn't
release the buffer after step 0 and silently corrupt the skipped uploads
on subsequent steps.
Two more affected tensors found by tracing the pipeline parity test's
per-step divergence:
- supertonic_vector_estimator.cpp:supertonic_vector_trace_proj_ggml
front_cache.text_in_t (vector-estimator front-block text input)
- supertonic_vector_estimator.cpp:build_group_graph_cache
cache.text_in (vector-estimator group 1/2/3 text input)
Pipeline test (`test-supertonic-pipeline`) per-step max_abs_err:
before: step0 1.4e-05, step1 8.5e-01, step2 1.7e+00, … final 3.28e-01
after: step0 1.4e-05, step1 3.9e-05, step2 6.8e-05, … final 1.11e-04
The step-by-step error is now pure floating-point round-off
accumulation (~1e-5 per step); the final 1.11e-04 sits roughly an order
of magnitude under the test's 1e-3 threshold.
Also: align the pipeline test's input prep with the
`dump-supertonic-reference.py` harness — the Python script feeds the
ONNX vector_step a pre-masked input (`xt = noise * latent_mask`) and
the vocoder a pre-masked latent (`vocoder({"latent": xt * latent_mask})`).
For the supertonic-ref-quick fixture the mask is all 1.0 so this is a
no-op today, but a fixture with padded tail latents would otherwise
diverge from the reference at every padded position.
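The pre-masking step described above amounts to an elementwise multiply broadcast over channels. The sketch below assumes a flat [C][L] row-major latent and a per-time-step mask of length L; the function name and the layout are assumptions, and only the formula `xt = noise * latent_mask` comes from the reference script as quoted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: x is [C][L] row-major (x[c * L + t]),
// mask holds one value per time step and is broadcast across channels.
std::vector<float> apply_latent_mask(const std::vector<float>& x,
                                     const std::vector<float>& mask,
                                     std::size_t n_channels) {
    const std::size_t L = mask.size();
    std::vector<float> out(x.size());
    for (std::size_t c = 0; c < n_channels; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            out[c * L + t] = x[c * L + t] * mask[t];
        }
    }
    return out;
}
```

An all-ones mask leaves the input unchanged, which matches the note that the step is a no-op for the supertonic-ref-quick fixture; a mask with zeros in the tail clears the padded positions before they reach the vector step or the vocoder.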
Fixture suite on tts-cpp/build (bundled qvac-ext-ggml@speech 60a172e,
supertonic2.gguf + supertonic-ref-quick):
before: 15/16 fixture tests passing (test-supertonic-pipeline FAIL)
after: 16/16 fixture tests passing
Unit suite unchanged (25/25).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing for ggml_reshape_2d
CodeQL cpp/integer-multiplication-cast-to-long flagged
`n_heads * head_dim` (both `int`, multiplied as `int` and then implicitly
converted to `int64_t` for `ggml_reshape_2d`'s shape argument). For
Supertonic's vector-estimator the values are 4 × 64 = 256 so there is no
actual overflow risk today, but a tts-cpp callsite that ever uses larger
n_heads / head_dim would silently truncate. Cast first to make the
multiplication 64-bit. No behaviour change for any current caller.
Alert was not introduced by this PR (line dates back to the original
tts-cpp add `ef840d5c3`) but surfaces on PR #31 because the surrounding
file was touched. Fixing here keeps the PR's CodeQL gate green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
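The pattern behind the CodeQL fix is to widen one operand before multiplying rather than widening the product. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Widen first, then multiply: the product is computed in 64 bits.
// Writing static_cast<int64_t>(n_heads * head_dim) instead would multiply
// in `int` and only widen the (possibly already overflowed) result.
inline int64_t packed_head_dim(int n_heads, int head_dim) {
    return static_cast<int64_t>(n_heads) * head_dim;
}
```

For the current 4 × 64 case both forms give 256; the difference only shows once the product no longer fits in a 32-bit `int`.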
1. backend_selection.cpp -- missing #include <stdexcept>
Throws std::runtime_error in 4 places, compiles on macOS libc++ via transitive include, fails on libstdc++ (Linux / MSYS2-GCC). One line:
 #include <mutex>
+#include <stdexcept>
 #include <string>
2. Android GGML_BACKEND_DL=ON must keep the supertonic Vulkan optimisations -- please don't ship them gated off
The PR currently lists this as a known follow-up, but Mali / non-Adreno-700+ Snapdragon / Exynos Xclipse are exactly the targets where the round-10 pinned-host-buffer + round-12 F16-KV bandwidth wins matter most; silently turning them off on DL undoes the QVAC-18605 business case on mobile.
Every direct ggml_backend_vk_* call in this PR has a public registry-API equivalent today at the 60a172e4 ggml pin:
- ggml_backend_is_vk(backend) → strcmp(ggml_backend_reg_name(ggml_backend_dev_backend_reg(ggml_backend_get_device(backend))), "Vulkan") == 0
- ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type(ggml_backend_get_device(backend))
- ggml_backend_vk_get_device_description(...) → ggml_backend_dev_description(ggml_backend_get_device(backend))
- F16-KV / Q8_0-KV / BF16-KV FA capability predicates → build a probe tensor and call ggml_backend_dev_supports_op(dev, op)
Please migrate the four call-site classes in this PR, drop the NOT GGML_BACKEND_DL clause from the GGML_USE_VULKAN define in tts-cpp/CMakeLists.txt:180-181, and add a Snapdragon DL smoke test confirming the round-10 / 12 logs fire on the dynamic-loader build. init_gpu_backend already proves the registry-only pattern works — extending it the rest of the way is mechanical and keeps tts-cpp's source under the same "no direct backend symbols" invariant parakeet-cpp ships today.
#1, #2) Addresses PR #31 review feedback from @GustavoA1604: 1. backend_selection.cpp — missing `#include <stdexcept>`. Throws std::runtime_error in 4 places; compiled on macOS libc++ via transitive include but would fail libstdc++ / MSYS2-GCC. 2. Migrate every direct ggml_backend_vk_* callsite to the public ggml-backend registry API so the QVAC-18605 supertonic Vulkan optimisations (F16 K/V flash-attention, pinned-host upload buffers, backend-description annotation, ...) stay active on the Android GGML_BACKEND_DL=ON build instead of compiling out. Migrations: - ggml_backend_is_vk(b) → tts_cpp::detail::backend_is_vulkan(b) — strcmp against ggml_backend_reg_name(ggml_backend_dev_backend_reg( ggml_backend_get_device(b))). Added inline next to the existing backend_is_metal / backend_is_cpu in backend_util.h (mirrors parakeet-cpp's helper module). - ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type( ggml_backend_get_device(b)). Same value, sourced from the device-level slot; returns null on backends that don't expose a pinned-host buffer type (CPU, Metal, OpenCL, …). Affects: * backend_supports_pinned_host_buffer_uncached * try_alloc_inputs_in_pinned_host_buffer - ggml_backend_vk_get_device_description(idx, buf, len) → ggml_backend_dev_description( ggml_backend_get_device(b)). Same string, no host buf round-trip. Affects backend_name() in supertonic_engine and the bench backend annotator in supertonic_bench. Drop: - The `#include "ggml-vulkan.h"` includes in supertonic_engine.cpp and supertonic_bench.cpp (no longer needed; registry API lives in ggml-backend.h). - Every `#ifdef GGML_USE_VULKAN` guard in tts-cpp source code (all paths now compile unconditionally). - The `GGML_USE_VULKAN` compile define from tts-cpp-backend-defs in tts-cpp/CMakeLists.txt — no code references it any more. tts-cpp now mirrors parakeet-cpp's "no direct backend symbols" invariant. 
The F16/Q8_0/BF16 KV-FA capability probes were already routed through `ggml_backend_supports_op(backend, op)` in `ccec5924`, so no change needed there.

Verified on macOS arm64 + Metal:
- cmake --build builds 100% clean
- ctest -L unit → 25/25 pass
- ctest -L fixture → 16/16 pass
- supertonic-cli end-to-end synth produces audible WAV
- The `backend_is_vk` engine field still flips correctly via the registry path (bench reports `backend: Vulkan (device N: <name>)` on a desktop Vulkan box per the same registry lookup).

Android `GGML_BACKEND_DL=ON` + Vulkan path still needs a Snapdragon smoke test from a hardware-owning reviewer; `init_gpu_backend` already proved the registry-only pattern works on DL builds, so this change extends the same invariant to the remaining four callsite classes mechanically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
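The registry-name check described in the migration list can be sketched roughly as follows. This is a minimal illustration, not the code in backend_util.h: the `Mock*` types and `mock_*` accessors are hypothetical stand-ins for the opaque ggml handles and for `ggml_backend_get_device` / `ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` / `ggml_backend_dev_host_buffer_type`, so that the sketch compiles without ggml. The literal registry name `"Vulkan"` is also an assumption.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for the opaque ggml handle types (assumption: the
// real ones come from ggml-backend.h and are only reachable via accessors).
struct MockReg     { const char * name; };
struct MockDevice  { const MockReg * reg; const void * host_buft; };
struct MockBackend { const MockDevice * dev; };

// Stand-in for ggml_backend_get_device(b).
inline const MockDevice * mock_get_device(const MockBackend * b) {
    return b ? b->dev : nullptr;
}

// Stand-in for ggml_backend_reg_name(ggml_backend_dev_backend_reg(dev)).
inline const char * mock_reg_name(const MockDevice * dev) {
    return (dev && dev->reg) ? dev->reg->name : nullptr;
}

// Identify the backend by its registry name instead of calling a
// backend-specific symbol such as ggml_backend_is_vk().
inline bool backend_is_vulkan(const MockBackend * b) {
    const char * name = mock_reg_name(mock_get_device(b));
    return name != nullptr && std::strcmp(name, "Vulkan") == 0;
}

// Pinned-host support is read from the device-level slot; a null buffer
// type means the backend does not expose one.
inline bool backend_has_pinned_host_buffer(const MockBackend * b) {
    const MockDevice * dev = mock_get_device(b);
    return dev != nullptr && dev->host_buft != nullptr;
}
```

Because both helpers only walk device and registry handles, nothing in them needs a Vulkan symbol at link time, which is the property the dynamic-loader build depends on.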
@GustavoA1604 thanks for the review — both items addressed in 1. 2. Direct Concrete swaps:
Local verification on macOS arm64 + Metal:
Android

Heads-up: branch was DIRTY against master (the v1.8.5 sync + EOU work merged in while this PR was open). Resolving that next, then will re-request review.
Pulls in the master-side activity since PR #31 opened:
- QVAC-19386: v1.8.5 + sync vendored whisper.cpp + ggml to ggml-org upstream (#33). Bumps whisper version, refreshes the in-tree ggml, re-adds tts-cpp from a fresh snapshot of chatterbox.cpp's port.
- QVAC-19270: parakeet EOU streaming mid-stream-boundary handling.
- QVAC-19213: Adreno Vulkan fixes (mul_mat_vec subgroup->shmem, get_max_size cap scoped to Qualcomm/Adreno).

Conflict resolution (all 24 conflicts were `add/add` because the merge-base, `4bf733672` `talk-llama : sync llama.cpp`, predates QVAC adding `tts-cpp/` and `parakeet-cpp/`):
- tts-cpp/* → kept HEAD (`--ours`). This branch is the canonical home of the QVAC-18605 supertonic Vulkan optimisation rounds 1-13 + the registry-API migration + the cache-state-leak fixes. The chatterbox.cpp-mirrored fixes that master's `fce9d211 Add tts-cpp files` brought in (N1-N7 docstrings, ggml-quants.h fix, backend_device() public API) are already present in HEAD's starting point and surface as no-op diffs.
- parakeet-cpp/* → took master (`--theirs`). Master is the canonical home of QVAC-19270 EOU streaming work; this branch has no parakeet-cpp changes to defend.
- .github/CODEOWNERS → took master (team rename to `qvac-internal-dev` / `qvac-internal-merge`).

Verified on macOS arm64 + Metal:
- cmake --build completes cleanly
- ctest -L unit → 25/25 pass
- ctest -L fixture → 16/16 pass (incl. test-supertonic-pipeline end-to-end vs ONNX reference, max_abs_err = 1.1e-04 ≪ 1e-3 threshold)

The branch is now in sync with origin/master at `eabcf6da`; the mergeStateStatus on PR #31 should flip from DIRTY back to UNSTABLE (then green, once the pre-existing master CI failures resolve too).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…view)

Addresses PR #31 review comments from @freddy311082:

1. (#3355973146) `vulkan_device != 0` aborted `init_gpu_backend` on a machine with no Vulkan adapter. `pick_vulkan_device_index` throws on an empty device list, so a host wiring `vulkan_device = -1` as a generic "auto-pick GPU" would crash on Metal-only macOS or CUDA-only Linux instead of falling through the tier policy to the available backend.

   Guard the Vulkan-pick block on `!vulkan_devs.empty()`. Also log a one-shot warn when the override is requested but no Vulkan adapter is visible (so the silent fall-through is debuggable).

2. (#3355995666) `vulkan_device > 0` was silently shadowed by the OpenCL-Adreno-700+ tier preference. On a Snapdragon device that exposes both backends, the chosen Vulkan adapter is moved to the front of `other_gpu` but the dispatch tries `opencl_adreno_700plus` FIRST, so an explicit `--vulkan-device N` would silently end up on OpenCL anyway. Operators explicitly pinning a Vulkan adapter almost certainly want Vulkan.

   When `vulkan_device > 0`, try `other_gpu` BEFORE `opencl_adreno_700plus`. `vulkan_device == -1` (auto-pick across Vulkan adapters) leaves the tier policy unchanged: the user asked for "best Vulkan device", not "must be Vulkan over OpenCL". `vulkan_device == 0` (default) is unchanged.

Verified locally on macOS arm64 + Metal:
- cmake --build completes cleanly
- ctest -L unit → 25/25 pass
- ctest -L fixture → 16/16 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
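The two review fixes amount to a small decision table over `vulkan_device` and the number of visible Vulkan adapters. The sketch below models only that table; `DispatchPlan`, `Tier`, and `plan_gpu_dispatch` are hypothetical names introduced for illustration, and the real `init_gpu_backend` is assumed to do considerably more (device enumeration, the actual adapter pick, backend initialisation).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical labels for the two GPU tiers named in the commit message.
enum class Tier { OpenclAdreno700Plus, OtherGpu };

// Hypothetical summary of the decisions taken before any backend is created.
struct DispatchPlan {
    bool              run_vulkan_pick;  // safe to call the Vulkan device picker
    bool              warn_no_vulkan;   // override requested, no adapter visible
    std::vector<Tier> order;            // tiers in the order they are tried
};

inline DispatchPlan plan_gpu_dispatch(int vulkan_device, std::size_t n_vulkan_devs) {
    DispatchPlan plan{};
    const bool override_requested = vulkan_device != 0;

    // Fix 1: the picker throws on an empty list, so only run it when at
    // least one Vulkan adapter exists; otherwise fall through with a warning.
    plan.run_vulkan_pick = override_requested && n_vulkan_devs > 0;
    plan.warn_no_vulkan  = override_requested && n_vulkan_devs == 0;

    // Fix 2: an explicit pin (N > 0) puts other_gpu ahead of the Adreno
    // OpenCL tier; 0 (default) and -1 (auto-pick) keep the original order.
    if (vulkan_device > 0) {
        plan.order = {Tier::OtherGpu, Tier::OpenclAdreno700Plus};
    } else {
        plan.order = {Tier::OpenclAdreno700Plus, Tier::OtherGpu};
    }
    return plan;
}
```

For example, `plan_gpu_dispatch(-1, 0)` skips the picker and sets the warning flag, while `plan_gpu_dispatch(2, 3)` runs the picker and tries `other_gpu` first.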
@freddy311082 both addressed in `6ac6f073`:

#3355973146: guard the Vulkan-pick block on `!vulkan_devs.empty()` so a `vulkan_device != 0` config falls through to the tier policy on no-Vulkan hosts (Metal-only Mac / CUDA-only Linux / Adreno-OpenCL-only Snapdragon) instead of aborting via `pick_vulkan_device_index`'s throw. Added a verbose-mode warn line so the silent fall-through stays debuggable.

#3355995666: distinguish `vulkan_device > 0` (explicit operator pin) from `vulkan_device == -1` (auto-pick). On explicit pin, `other_gpu` is tried BEFORE `opencl_adreno_700plus` so Snapdragon devices honour the override. On `-1` auto-pick the tier policy is unchanged (the operator asked for "best Vulkan device", not "Vulkan over OpenCL"), so Adreno 700+ still wins where it should.

Both review threads resolved. ctest -L unit + -L fixture still 25/25 + 16/16 on macOS arm64 + Metal.
… QVAC-18605 rounds 1-13)

Reconciles HEAD's supertonic Vulkan/Metal optimisations (F1-F23 caches, pre-baked weights, pinned-host scratchpad, front_cache architecture) with master's QVAC-19254 GPU-scheduler refactor (model.sched / model.cpu_backend, supertonic_sched_alloc / supertonic_sched_compute, direct vs sched runtime routing) and QVAC-19213 Adreno regex include.

Conflict resolution highlights:
- parakeet_ctc.cpp / backend_selection.cpp: kept master's regex include alongside HEAD's stdexcept.
- supertonic_internal.h: kept HEAD's model_prefers_cpu_kernels alongside master's sched helpers.
- engine.h: kept HEAD's six EngineOptions fields.
- supertonic_engine.cpp: kept HEAD's chunker include and the extended load_supertonic_gguf call.
- supertonic_gguf.cpp: kept HEAD's F1/F2/F6 pre-bakes + capability / debug probes; layered master's scheduler init/teardown on top of HEAD's extra ctx_w / buffer_w lifetime tracking.
- supertonic_vector_estimator.cpp: combined cache-key checks, per-cache gallocr usage (F4/F8/F12/F18/F19/F23) with master's direct/sched runtime routing; profile_vector_compute keeps calling supertonic_graph_compute directly because the per-cache graphs are bound to gallocr storage, not the model scheduler.
- supertonic_vocoder.cpp: kept HEAD's F2/F3 latent-only upload (BN pre-baked into model tensors); used supertonic_sched_compute for the trace-mode pairing required by QVAC-19254.

Validation: all 38 supertonic ctest fixtures + audit3 caches pass (test-supertonic-vector, test-supertonic-vector-trace, test-supertonic-pipeline, F18/F19 bit-exact); mtl-synth tests remain gated on multilingual fixtures unavailable in this environment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rofiling

Follow-up to the master sync (077bbcb). The merge accidentally created two issues in the per-cache run helpers (`run_text_attention_cache`, `_gpu`, `run_group_graph_cache`, `run_res_style_qkv_cache`, `run_tail_graph_cache`):

1. On the `direct=true` hot path, the compute call became a raw `supertonic_graph_compute(...)`, silently dropping the QVAC-18605 `profile_vector_compute` wrapper, so per-stage CSV / stderr timings were no longer emitted on the live backend.
2. The currently-dead `direct=false` branch called `profile_vector_compute(...)` *after* a `supertonic_sched_alloc`, but the post-merge `profile_vector_compute` hard-coded `supertonic_graph_compute`, i.e. sched-alloc paired with graph-compute, which would silently corrupt the output the first time a future op forced the routing.

Fix:
* Parameterise `profile_vector_compute` with `bool use_sched = false`. Internal `dispatch()` lambda picks `supertonic_sched_compute` when `use_sched`, else `supertonic_graph_compute`. Both early-return fast-path and timed path use the same dispatch, so profiling behaviour is identical for the two compute primitives.
* The five call sites now read:

      if (direct) profile_vector_compute(model, gf, step, island);
      else        profile_vector_compute(model, gf, step, island, /*use_sched=*/true);

  so the alloc + compute pair is consistent on both branches, and profiling is restored on the active path.
* The two non-direct/sched call sites (`run_style_residual_cache`, `front_proj_attn0_qkv` graph in `supertonic_vector_trace_proj_ggml`) keep the 4-arg form and rely on the default `use_sched=false`; both compute graphs are gallocr-bound, which is the correct path.

Validation:
* All 38 supertonic ctests pass (16 fixture + 22 unit, serial run).
* Adversarial subagent review SAFE on all 10 invariants.
* Metal n=10 bench: F1 33.5x realtime / 93.6 ms median, M1 34.9x / 91.9 ms. CPU n=10: 13.7x / 229 ms median. No measurable regression vs pre-fix (the noisy n=3 numbers were inside thermal / warmup variance).
* "The quick brown fox jumps over the lazy dog." synthesises cleanly on Metal with both F1 and M1 voices.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
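The shape of the `use_sched` fix can be sketched in isolation. Everything below is a simplified model under stated assumptions: `MockModel`, `AllocKind`, `mock_graph_compute`, `mock_sched_compute`, and `run_cache` are hypothetical stand-ins, and the real `profile_vector_compute` also takes the graph, step, and island arguments and emits CSV / stderr timings rather than storing a single number.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical record of which allocator prepared the graph last.
enum class AllocKind { None, Gallocr, Sched };

struct MockModel {
    bool      profiling   = false;
    AllocKind last_alloc  = AllocKind::None;
    int       graph_calls = 0;
    int       sched_calls = 0;
    double    last_ms     = 0.0;
};

// Stand-ins for supertonic_graph_compute / supertonic_sched_compute. Each
// asserts the pairing invariant: compute must match the preceding alloc.
inline void mock_graph_compute(MockModel & m) {
    assert(m.last_alloc == AllocKind::Gallocr);
    ++m.graph_calls;
}
inline void mock_sched_compute(MockModel & m) {
    assert(m.last_alloc == AllocKind::Sched);
    ++m.sched_calls;
}

// One dispatch lambda serves both the untimed fast path and the timed path,
// so profiling behaves the same for either compute primitive.
inline void profile_vector_compute(MockModel & m, bool use_sched = false) {
    auto dispatch = [&]() {
        if (use_sched) mock_sched_compute(m);
        else           mock_graph_compute(m);
    };
    if (!m.profiling) { dispatch(); return; }
    const auto t0 = std::chrono::steady_clock::now();
    dispatch();
    const auto t1 = std::chrono::steady_clock::now();
    m.last_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical per-cache run helper: alloc and compute stay on the same side.
inline void run_cache(MockModel & m, bool direct) {
    m.last_alloc = direct ? AllocKind::Gallocr : AllocKind::Sched;
    if (direct) profile_vector_compute(m);
    else        profile_vector_compute(m, /*use_sched=*/true);
}
```

Defaulting `use_sched` to `false` is what lets the gallocr-bound call sites keep their shorter call form unchanged.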
Summary
- `GGML_BACKEND_DL=ON` dynamic-loader path.
- `tts_cpp::detail::init_gpu_backend()` with an optional `vulkan_device` arg (0 = first adapter, N > 0 = explicit index, -1 = free-VRAM auto-pick with UMA bias) so the round-3 / round-12 Vulkan device-selection policy survives master's registry-only refactor without bringing back direct `ggml_backend_vk_*` calls. Implemented via the public registry APIs (`ggml_backend_dev_memory` + `ggml_backend_dev_type`) so it works in both `GGML_BACKEND_DL=ON` and `=OFF` builds. Default value is 0, so chatterbox / s3gen / parakeet call sites are unaffected.
- `GGML_USE_VULKAN` compile define is re-enabled on `tts-cpp-backend-defs` only when `GGML_VULKAN AND NOT GGML_BACKEND_DL`: the supertonic optimisation paths (F16 K/V flash-attention, pinned-host upload buffers, `ggml_backend_vk_host_buffer_type()` per-step uploads, `backend_name()` device-description annotation) call direct ggml-vulkan symbols that are only linkable when Vulkan is statically linked. On the Android DL build those paths fall back to the registry-walked non-Vulkan code, matching master's design intent.
- `init_gpu_backend` / `init_cpu_backend`. Adds `src/backend_selection.cpp` to each.
- Cache-state-leak fix (`ccec5924`, round 10/12): leaf input tensors uploaded once at build / once per synth via the round-10 upload-skip tracker had their backend buffers released by ggml-alloc's free pass once their last consumer in the graph ran. On the second compute pass through the same cache, intermediates aliased into the freed offsets and silently overwrote the "stable" upload. Fix is to mark each affected tensor as `INPUT + OUTPUT` (the OUTPUT flag is what gallocr's `ggml_gallocr_free_node` checks before releasing). Affects: relpos attention masks, per-group RoPE cos/sin tables, front-block RoPE cos/sin tables, `style_v_in` / `kctx_in` in `build_res_style_qkv_cache`, `text_in_t` in `supertonic_vector_trace_proj_ggml`, and `text_in` in `build_group_graph_cache`.

Update 2026-06-04: master sync (QVAC-19254 + QVAC-19213)
Pulled `origin/master` back into the branch (077bbcb5) to pick up:
- QVAC-19254 GPU-scheduler refactor (`model.sched` / `model.cpu_backend`, `supertonic_sched_alloc` / `supertonic_sched_compute`, `direct` vs `sched` runtime routing).

Reconciled: every QVAC-18605 optimisation (F1/F2/F3/F4/F6/F8/F12/F18/F19/F23 + round-10 upload-skip + round-12 pinned-host scratchpad + UMA bias + per-cache `ggml_gallocr_t` storage) was preserved alongside master's scheduler refactor. Conflict resolution highlights:
- `supertonic_internal.h`: kept HEAD's `model_prefers_cpu_kernels` and master's `supertonic_sched_alloc` / `supertonic_sched_compute` declarations.
- `engine.h`: kept all six HEAD `EngineOptions` fields (`precision`, `f16_attn`, `vulkan_device`, `f16_weights`, `f16_weights_deny_list`, `kv_attn_type`).
- `supertonic_gguf.cpp`: HEAD's F1/F2/F6 pre-bakes execute first, then master's scheduler init (`sched` / `cpu_backend`); free order is sched-first → backends → ctx_w_extra (avoids dangling refs).
- `supertonic_vector_estimator.cpp`: combined cache-key checks + per-cache gallocr + master's `direct vs sched` routing. `profile_vector_compute` deliberately calls `supertonic_graph_compute` (not `supertonic_sched_compute`): the per-cache graphs are bound to gallocr storage; routing them through the model scheduler silently corrupts outputs.
- `supertonic_vocoder.cpp`: kept HEAD's F2/F3 direct-latent upload (BN pre-baked into model tensors, no per-call BN upload); used `supertonic_sched_compute` for trace-mode's QVAC-19254 pairing.

Validation (Apple M-series + Metal)
- `test-supertonic-vector` (rel = 2.1e-06), `test-supertonic-vector-trace`, `test-supertonic-pipeline`, and `test-supertonic-audit3-caches` (F18/F19 bit-exact 8/8).
- `ctest -j` can produce sporadic fixture-file collisions across `test-supertonic-*` binaries that share `/tmp` artifacts. This is unrelated to merge correctness; serial run is clean.
- `mtl-synth-*` fixtures remain gated on multilingual ASR fixtures that aren't shipped in-tree (same status as on master).

Known follow-up (non-blocking)
- `direct=false` branch in the per-cache run helpers (`run_text_attention_cache`, `_gpu`, `run_group_graph_cache`, `run_res_style_qkv_cache`, `run_tail_graph_cache`) calls `supertonic_sched_alloc` then `profile_vector_compute` (which routes to `supertonic_graph_compute`). The branch is currently dead (with the present backends `direct` is always true), but the routing inside is inconsistent. Follow-up: either delete the dead branch or switch the compute call to `supertonic_sched_compute` so it becomes coherent. Tracked outside this PR.

Conflict resolution notes (original merge)
Three conflict files, all in `tts-cpp/supertonic_*`:
- `include/tts-cpp/supertonic/engine.h`: `EngineOptions` fields (HEAD's `precision` / `f16_attn` / `vulkan_device` / `f16_weights` / `kv_attn_type` / `vulkan_env_overrides` + master's `backends_dir` / `opencl_cache_dir`).
- `src/supertonic_engine.cpp`: `apply_vulkan_env_overrides()` → `load_supertonic_gguf()` with HEAD's extra args. Order matters: all setters must precede `init_supertonic_backend()`.
- `src/supertonic_gguf.cpp`: `#ifdef GGML_USE_VULKAN` cascade with delegation to `tts_cpp::detail::init_gpu_backend()`, threading `vulkan_device` through. Kept `convert_supertonic_tensor_data` (HEAD-only addition).

Test plan
- `cmake -S tts-cpp -B build -DTTS_CPP_USE_SYSTEM_GGML=OFF` configures cleanly (bundled `qvac-ext-ggml@speech` pin `60a172e`)
- `cmake --build build -j` builds 100% clean (library + supertonic-cli + tts-cli + all unit/integration test binaries on macOS arm64 + Metal)
- `ctest -L unit -j 4` → 25/25 passing, including every QVAC-18605 logic harness: `test-supertonic-vulkan-device-select`, `vulkan-env-overrides`, `kv-attn-type` (+ `-api`), `capability-cache`, `pinned-host-buffer`, `text-encoder-gpu-bridge`, `upload-skip-tracker`, `voice-host-cache`, `f16-deny-list-api`, `f16-attn-parity`, `warm-up-api`, `input-scratchpad`, `backend-dispatch`, `portable-ops`, `vulkan-dispatch`, `in-graph-transpose`, `graph-to-graph-blit`, `rope-in-graph`, `rope-packed-qk`, `profile-csv`, `convnext-block-fused`
- `ctest -L fixture` (serial) → 16/16 passing (supertonic-ref-quick fixture, pointed via `-DTTS_CPP_TEST_MODEL_DIR` + `-DTTS_CPP_TEST_REF_DIR`). Including `test-supertonic-pipeline` (end-to-end vs ONNX reference WAV, max_abs_err = 1.1e-04 against 1e-3 threshold), `test-supertonic-graph-rewrites` (F3/F8/F11 5/5), `test-supertonic-audit3-caches` (F17/F18/F19 8/8)
- `supertonic-cli` against `supertonic2.gguf` on Metal: 8.15 s of 44.1 kHz mono PCM produced; Metal pipeline log shows the QVAC custom kernels (e.g. `kernel_supertonic_edge_pad_1d_f32`) compiling and running. WAV length matches the ONNX reference to within 2 samples (136 970 vs 136 972 samples; EOF rounding).
- `supertonic-bench` on Apple M-series + Metal: 43.5× realtime (RTF 0.023, median over 3 runs). All QVAC-18605 auto-policies engaged: `f16_attn=on / f16_weights=on / native_leaky_relu=on / kv_attn_type=f16 / q8_0_kv_attn=available / bf16_kv_attn=available`.
- `supertonic-cli --n-gpu-layers 99 --vulkan-device -1 --vulkan-perf-logger` against `supertonic2.gguf` and confirm the auto-pick log line + steady-state perf numbers match round 12.
- `GGML_BACKEND_DL=ON` smoke test. The merge accepts that the supertonic Vulkan-specific code paths compile out under `GGML_USE_VULKAN`-disabled (Android DL); registry-walked fallback should remain functional. Recommend a smoke test on a Snapdragon / non-Apple Android target before tagging.
- `direct=false` path is exercised there.
- `chatterbox-s3gen.gguf` / `chatterbox-t3-mtl.gguf` / `s3gen-ref/` / `streaming-ref/` / `t3-mtl-ref/` etc. are still auto-disabled because those fixtures aren't shipped in-tree. Out of scope for this PR but worth tracking.

Known follow-ups (not blocking merge)
- `supertonic_engine.cpp:backend_name()` Vulkan device-description annotation is inert under `GGML_BACKEND_DL=ON` (depends on `ggml_backend_vk_get_device_description`). Cheap fix: route through `ggml_backend_dev_description(ggml_backend_get_device(backend))`.
- `supertonic_gguf.cpp:backend_supports_pinned_host_buffer_uncached` and the F16-KV flash-attn capability probes similarly use direct ggml-vulkan entries. Same registry-API fix would let those optimisations stay active on Android DL too.
- `GGML_ASSERT([rsets->data count] == 0)` fires on Metal device shutdown at process exit (post-synth, doesn't affect output). Tracked separately; appears to live in `qvac-ext-ggml@speech` (`ggml-metal-device.m:612`), not in this merge.
- `ggml_set_input` call sites in tts-cpp for the same cache-state-leak pattern. Only sites with constant inputs OR upload-skip trackers are at risk; no other tests are failing today, so any latent same-shape bugs there don't surface in the current harness.
- `direct=false` branch in the per-cache run helpers post-QVAC-19254 sync (see "Known follow-up" under the 2026-06-04 update).

🤖 Generated with Claude Code