
Merge supertonic_optimizations into master — QVAC-18605 rounds 1-13 + master reconcile#31

Merged
gianni-cor merged 73 commits into master from supertonic_optimizations on Jun 5, 2026

ogad-tether commented May 22, 2026

Summary

  • Merge of QVAC-18605 supertonic Vulkan optimisation rounds 1-13 onto current master, reconciling with master's ggml-backend registry refactor + Android GGML_BACKEND_DL=ON dynamic-loader path.
  • Extends tts_cpp::detail::init_gpu_backend() with an optional vulkan_device arg (0 = first adapter, N > 0 = explicit index, -1 = free-VRAM auto-pick with UMA bias) so the round-3 / round-12 Vulkan device-selection policy survives master's registry-only refactor without bringing back direct ggml_backend_vk_* calls. Implemented via the public registry APIs (ggml_backend_dev_memory + ggml_backend_dev_type) so it works in both GGML_BACKEND_DL=ON and =OFF builds. Default value is 0, so chatterbox / s3gen / parakeet call sites are unaffected.
  • GGML_USE_VULKAN compile define is re-enabled on tts-cpp-backend-defs only when GGML_VULKAN AND NOT GGML_BACKEND_DL — the supertonic optimisation paths (F16 K/V flash-attention, pinned-host upload buffers, ggml_backend_vk_host_buffer_type() per-step uploads, backend_name() device-description annotation) call direct ggml-vulkan symbols that are only linkable when Vulkan is statically linked. On the Android DL build those paths fall back to the registry-walked non-Vulkan code, matching master's design intent.
  • Fixes a pre-existing build issue in master where 8 test executables (test-voice-features / -resample / -voice-encoder / -fbank / -voice-embedding / -s3tokenizer / -streaming / -cpu-caches) compile internal sources directly without linking libtts-cpp.a — undefined references to init_gpu_backend / init_cpu_backend. Adds src/backend_selection.cpp to each.
  • Fixes a pre-existing bit-exactness bug in QVAC-18605 cache work (commits ccec5924, round 10/12): leaf input tensors uploaded once at build / once per synth via the round-10 upload-skip tracker had their backend buffers released by ggml-alloc's free pass once their last consumer in the graph ran. On the second compute pass through the same cache, intermediates aliased into the freed offsets and silently overwrote the "stable" upload. Fix is to mark each affected tensor as INPUT + OUTPUT (the OUTPUT flag is what gallocr's ggml_gallocr_free_node checks before releasing). Affects: relpos attention masks, per-group RoPE cos/sin tables, front-block RoPE cos/sin tables, style_v_in / kctx_in in build_res_style_qkv_cache, text_in_t in supertonic_vector_trace_proj_ggml, and text_in in build_group_graph_cache.
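
The vulkan_device policy from the second bullet above can be sketched in isolation. This is a minimal illustration and not the code in init_gpu_backend(): DeviceInfo, pick_device, and uma_weight are hypothetical names, and both the size and the direction of the UMA bias are assumptions, since the summary does not spell them out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one device as seen through a registry.
struct DeviceInfo {
    std::uint64_t free_bytes;  // free memory reported for the device
    bool          integrated;  // true for a UMA / integrated adapter
};

// Returns the index of the device to use, or -1 if none qualifies.
//   vulkan_device == 0  -> first adapter
//   vulkan_device  > 0  -> explicit index (rejected here if out of range)
//   vulkan_device == -1 -> adapter with the highest weighted free memory
// uma_weight scales the free memory of integrated adapters. Its value and
// direction are assumptions made for this sketch only.
inline int pick_device(const std::vector<DeviceInfo> & devs,
                       int vulkan_device, double uma_weight) {
    if (devs.empty()) {
        return -1;
    }
    if (vulkan_device >= 0) {
        return static_cast<std::size_t>(vulkan_device) < devs.size() ? vulkan_device : -1;
    }
    if (vulkan_device != -1) {
        return -1;  // only -1 is treated as "auto" in this sketch
    }
    int    best       = -1;
    double best_score = -1.0;
    for (std::size_t i = 0; i < devs.size(); ++i) {
        const double weight = devs[i].integrated ? uma_weight : 1.0;
        const double score  = static_cast<double>(devs[i].free_bytes) * weight;
        if (score > best_score) {
            best_score = score;
            best       = static_cast<int>(i);
        }
    }
    return best;
}
```

With uma_weight = 1.0 the auto-pick reduces to "most free memory wins"; any other value tilts the choice toward or away from integrated adapters.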

Update 2026-06-04 — master sync (QVAC-19254 + QVAC-19213)

Pulled origin/master back into the branch (077bbcb5) to pick up:

  • QVAC-19254 — Supertonic + Chatterbox/S3Gen GPU sched for Adreno OpenCL (model.sched / model.cpu_backend, supertonic_sched_alloc / supertonic_sched_compute, direct vs sched runtime routing).
  • QVAC-19213 — Adreno-generation parse fix (regex include) and Parakeet EOU streaming (already on the branch from the prior merge but kept current).

Reconciled — every QVAC-18605 optimisation (F1/F2/F3/F4/F6/F8/F12/F18/F19/F23 + round-10 upload-skip + round-12 pinned-host scratchpad + UMA bias + per-cache ggml_gallocr_t storage) was preserved alongside master's scheduler refactor. Conflict resolution highlights:

  • supertonic_internal.h: kept HEAD's model_prefers_cpu_kernels and master's supertonic_sched_alloc / supertonic_sched_compute declarations.
  • engine.h: kept all six HEAD EngineOptions fields (precision, f16_attn, vulkan_device, f16_weights, f16_weights_deny_list, kv_attn_type).
  • supertonic_gguf.cpp: HEAD's F1/F2/F6 pre-bakes execute first, then master's scheduler init (sched/cpu_backend); free order is sched-first → backends → ctx_w_extra (avoids dangling refs).
  • supertonic_vector_estimator.cpp: combined cache-key checks + per-cache gallocr + master's direct vs sched routing. profile_vector_compute deliberately calls supertonic_graph_compute (not supertonic_sched_compute) — the per-cache graphs are bound to gallocr storage; routing them through the model scheduler silently corrupts outputs.
  • supertonic_vocoder.cpp: kept HEAD's F2/F3 direct-latent upload (BN pre-baked into model tensors, no per-call BN upload); used supertonic_sched_compute for trace-mode's QVAC-19254 pairing.

Validation (Apple M-series + Metal)

  • All 38 supertonic ctests pass serially (16 fixture + 22 unit), including test-supertonic-vector (rel = 2.1e-06), test-supertonic-vector-trace, test-supertonic-pipeline, and test-supertonic-audit3-caches (F18/F19 bit-exact 8/8).
  • Parallel ctest -j can produce sporadic fixture-file collisions across test-supertonic-* binaries that share /tmp artifacts — unrelated to merge correctness; serial run is clean.
  • The 22 mtl-synth-* fixtures remain gated on multilingual ASR fixtures that aren't shipped in-tree (same status as on master).

Known follow-up (non-blocking)

  • The direct=false branch in the per-cache run helpers (run_text_attention_cache, _gpu, run_group_graph_cache, run_res_style_qkv_cache, run_tail_graph_cache) calls supertonic_sched_alloc then profile_vector_compute (which routes to supertonic_graph_compute). The branch is currently dead — with the present backends direct is always true — but the routing inside is inconsistent. Follow-up: either delete the dead branch or switch the compute call to supertonic_sched_compute so it becomes coherent. Tracked outside this PR.

Conflict resolution notes (original merge)

Three conflict files, all in tts-cpp/supertonic_*:

| File | Resolution |
| --- | --- |
| include/tts-cpp/supertonic/engine.h | Kept both sets of EngineOptions fields (HEAD's precision/f16_attn/vulkan_device/f16_weights/kv_attn_type/vulkan_env_overrides + master's backends_dir/opencl_cache_dir). |
| src/supertonic_engine.cpp | Combined: backends_dir/opencl_cache_dir setters → precision mapping → apply_vulkan_env_overrides() → load_supertonic_gguf() with HEAD's extra args. Order matters: all setters must precede init_supertonic_backend(). |
| src/supertonic_gguf.cpp | Replaced HEAD's hand-rolled #ifdef GGML_USE_VULKAN cascade with delegation to tts_cpp::detail::init_gpu_backend(), threading vulkan_device through. Kept convert_supertonic_tensor_data (HEAD-only addition). |

Test plan

  • cmake -S tts-cpp -B build -DTTS_CPP_USE_SYSTEM_GGML=OFF configures cleanly (bundled qvac-ext-ggml@speech pin 60a172e)
  • cmake --build build -j builds 100% clean (library + supertonic-cli + tts-cli + all unit/integration test binaries on macOS arm64 + Metal)
  • ctest -L unit -j 4 → 25/25 passing, including every QVAC-18605 logic harness: test-supertonic-vulkan-device-select, vulkan-env-overrides, kv-attn-type (+ -api), capability-cache, pinned-host-buffer, text-encoder-gpu-bridge, upload-skip-tracker, voice-host-cache, f16-deny-list-api, f16-attn-parity, warm-up-api, input-scratchpad, backend-dispatch, portable-ops, vulkan-dispatch, in-graph-transpose, graph-to-graph-blit, rope-in-graph, rope-packed-qk, profile-csv, convnext-block-fused
  • ctest -L fixture (serial) → 16/16 passing (supertonic-ref-quick fixture, pointed via -DTTS_CPP_TEST_MODEL_DIR + -DTTS_CPP_TEST_REF_DIR). Including test-supertonic-pipeline (end-to-end vs ONNX reference WAV, max_abs_err = 1.1e-04 against 1e-3 threshold), test-supertonic-graph-rewrites (F3/F8/F11 5/5), test-supertonic-audit3-caches (F17/F18/F19 8/8)
  • End-to-end Supertonic synthesis via supertonic-cli against supertonic2.gguf on Metal: 8.15 s of 44.1 kHz mono PCM produced; Metal pipeline log shows the QVAC custom kernels (e.g. kernel_supertonic_edge_pad_1d_f32) compiling and running. WAV length matches the ONNX reference to within 2 samples (136 970 vs 136 972 samples; the difference is EOF rounding).
  • supertonic-bench on Apple M-series + Metal: 43.5× realtime (RTF 0.023, median over 3 runs). All QVAC-18605 auto-policies engaged: f16_attn=on / f16_weights=on / native_leaky_relu=on / kv_attn_type=f16 / q8_0_kv_attn=available / bf16_kv_attn=available.
  • Vulkan validation on a multi-adapter desktop (auto-pick + UMA bias path). The macOS reviewer environment can't exercise the Vulkan branch; recommend a desktop reviewer with > 1 Vulkan adapter run supertonic-cli --n-gpu-layers 99 --vulkan-device -1 --vulkan-perf-logger against supertonic2.gguf and confirm the auto-pick log line + steady-state perf numbers match round 12.
  • Android GGML_BACKEND_DL=ON smoke test. The merge accepts that the supertonic Vulkan-specific code paths compile out when GGML_USE_VULKAN is not defined (the Android DL build); the registry-walked fallback should remain functional. Recommend a smoke test on a Snapdragon / non-Apple Android target before tagging.
  • Adreno OpenCL smoke test for the QVAC-19254 sched routing on a real Snapdragon device — the new direct=false path is exercised there.
  • Test fixture regeneration for chatterbox harnesses. The 14 fixture tests that need chatterbox-s3gen.gguf / chatterbox-t3-mtl.gguf / s3gen-ref/ / streaming-ref/ / t3-mtl-ref/ etc. are still auto-disabled because those fixtures aren't shipped in-tree. Out of scope for this PR but worth tracking.

Known follow-ups (not blocking merge)

  • supertonic_engine.cpp:backend_name() Vulkan device-description annotation is inert under GGML_BACKEND_DL=ON (depends on ggml_backend_vk_get_device_description). Cheap fix: route through ggml_backend_dev_description(ggml_backend_get_device(backend)).
  • supertonic_gguf.cpp:backend_supports_pinned_host_buffer_uncached and the F16-KV flash-attn capability probes similarly use direct ggml-vulkan entries. Same registry-API fix would let those optimisations stay active on Android DL too.
  • A trailing GGML_ASSERT([rsets->data count] == 0) fires on Metal device shutdown at process exit (post-synth, doesn't affect output). Tracked separately; appears to live in qvac-ext-ggml@speech (ggml-metal-device.m:612), not in this merge.
  • Audit the remaining 142 ggml_set_input call sites in tts-cpp for the same cache-state-leak pattern. Only sites with constant inputs OR upload-skip trackers are at risk; no other tests are failing today, so any latent same-shape bugs there don't surface in the current harness.
  • Clean up the vestigial direct=false branch in the per-cache run helpers post-QVAC-19254 sync (see "Known follow-up" under the 2026-06-04 update).

🤖 Generated with Claude Code

reichert-dev and others added 30 commits May 4, 2026 15:39
Two latent bugs surfaced together when whisper.cpp is built with
-DWHISPER_COREML=ON, both reproducible at CMake configure time:

1. install(TARGETS whisper.coreml) did not join the whisper-targets
   export set. Since whisper PRIVATE-links to whisper.coreml and is
   itself in whisper-targets, CMake refuses to generate with
       install(EXPORT "whisper-targets" ...) includes target "whisper"
       which requires target "whisper.coreml" that is not in any
       export set.
   Add EXPORT whisper-targets to the install (must come before LIBRARY
   in CMake's install(TARGETS ...) signature).

2. Once whisper.coreml is in the export set, its PUBLIC include dirs
   are validated against the install interface. The current "."
   include dir is a raw source-tree path with no
   $<BUILD_INTERFACE>/$<INSTALL_INTERFACE> guards and CMake refuses
   with
       INTERFACE_INCLUDE_DIRECTORIES property contains path "..."
       which is prefixed in the source directory.
   The headers under coreml/ are internal implementation details only
   consumed by whisper.cpp (in the same directory), so the correct fix
   is to mark them PRIVATE rather than wrapping them in install/build
   generator expressions.

Verified locally with -DWHISPER_COREML=ON -DGGML_METAL=ON: configure
clean, whisper.coreml + libwhisper.dylib build end-to-end.

This unblocks the ios-xcode-build CI job on PR #12.

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>
The bindings-java tests testGetDefaultFullParams_Greedy /
testGetDefaultFullParams_BeamSearch on PR #12 fail with

    expected: <5> but was: <0>     (greedy.best_of)
    expected: <5> but was: <-1>    (beam_search.beam_size)

while whisper_full_default_params() still returns 5 for both — the
actual transcription test (testFullTranscribe) produces correct text.

Diagnosis: the Java JNA WhisperFullParams Structure is missing fields
that exist in the C whisper_full_params struct, so JNA computes wrong
offsets and reads garbage at greedy.best_of / beam_search.beam_size.

Specifically the Java layout was missing:

  1. int32_t seed           — added by tetherto's local seed patch
                              between no_speech_thold and greedy
                              (include/whisper.h:553). This single
                              omission shifts every subsequent field
                              by 4 bytes and is the proximate cause of
                              both failing assertions.
  2. bool vad               — added by upstream
  3. const char * vad_model_path
  4. whisper_vad_params vad_params (struct)

Fix:

* New WhisperVadParams.java JNA Structure mirroring
  whisper_vad_params {threshold, min_speech_duration_ms,
  min_silence_duration_ms, max_speech_duration_s, speech_pad_ms,
  samples_overlap}.
* Add `public int seed`, `public CBool vad`, `public String
  vad_model_path`, `public WhisperVadParams vad_params` fields and
  thread them into getFieldOrder() at the matching positions.

Field order in WhisperFullParams.getFieldOrder() now matches the C
struct in include/whisper.h field-for-field, so JNA-computed offsets
agree with the native side.
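
The offset shift described in the diagnosis can be reproduced with a small C++ sketch. The structs below are hypothetical, heavily reduced stand-ins for whisper_full_params and for the layout the stale binding assumed; only the relative position of seed and greedy is taken from the text above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct GreedyParams {
    std::int32_t best_of;
};

// Reduced stand-in for the native layout: seed sits between
// no_speech_thold and greedy.
struct NativeParams {
    float        no_speech_thold;
    std::int32_t seed;
    GreedyParams greedy;
};

// Reduced stand-in for the stale binding layout: seed is missing.
struct StaleParams {
    float        no_speech_thold;
    GreedyParams greedy;
};

// On mainstream ABIs the missing field moves greedy by exactly 4 bytes.
static_assert(offsetof(NativeParams, greedy) - offsetof(StaleParams, greedy)
                  == sizeof(std::int32_t),
              "omitting seed shifts every later field by sizeof(int32_t)");

// Reads an int32 out of the native struct at an arbitrary byte offset,
// which is effectively what a binding does with its own computed offsets.
inline std::int32_t read_i32_at(const NativeParams & p, std::size_t offset) {
    std::int32_t v;
    std::memcpy(&v, reinterpret_cast<const unsigned char *>(&p) + offset, sizeof v);
    return v;
}
```

Reading at offsetof(StaleParams, greedy) returns whatever is stored in seed rather than greedy.best_of, which is the kind of mismatch the failing assertions point to.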

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>
QVAC-18607 follow-up.  The bring-up commit (8d5ebb4) landed the
dispatch + portable-op + F16-K/V-attention primitives but only
exercised them transitively through the existing fixture-bound
test-supertonic-* harnesses, which need a Supertonic GGUF + an
artifacts/supertonic-ref-quick reference dump to run.  A fresh
checkout has neither, so the bring-up primitives shipped without
their own gate on `ctest -L unit`.

This commit adds three CPU-only unit harnesses that cover the
bring-up primitives independent of any fixture, plus an R&D plan
document capturing the next optimization rounds with their TDD test
gates.

Tests (all LABEL "unit", auto-run on fresh checkout):

  test-supertonic-backend-dispatch (186 lines)
    Six scenarios around supertonic_op_dispatch_scope + the two
    thread-local query functions: default state, CPU model
    mirroring, GPU model mirroring + post-teardown restore, RAII
    teardown on exception, nested-scope unwinding, independence
    of use_cpu_custom_ops / use_f16_attn.  Catches "scope leaked
    wrong previous-value into thread_local" and "GPU engine
    poisons next CPU engine on same thread" regressions.

  test-supertonic-portable-ops (260 lines)
    CPU-backend parity of leaky_relu_portable_ggml's CPU lowering
    (fused ggml_leaky_relu) vs its GPU decomposition (RELU + 2x
    SCALE + ADD) for alpha in {0, 0.01, 0.05, 0.1, 0.5, 0.99, 1.0}
    against a sign-mixed input including the zero boundary.  Also
    asserts graph-node-count grows on the GPU dispatch — catches
    a regression where the portable helper would silently route
    back to ggml_leaky_relu on a non-CPU backend (defeating the
    whole reason the helper exists).

  test-supertonic-f16-attn-parity (291 lines)
    F32 vs F16 K/V ggml_flash_attn_ext parity on the two hot
    shapes from the vector estimator (text attention kv=32,
    style attention kv=50), n_heads=4, head_dim=64.  Tolerance
    5e-3 abs / 5e-3 rel — the same band chatterbox ships behind
    --cfm-f16-kv-attn.  Gracefully skips ("SKIPPED — CPU build
    missing one path") if the local CPU build doesn't carry both
    flash-attention paths, preserving CI greenness while still
    validating where the path exists.
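
For reference, the parity that test-supertonic-portable-ops checks can be illustrated on scalars. This is a sketch under an assumption: (1 - alpha) * relu(x) + alpha * x is one decomposition that uses exactly one RELU, two SCALEs, and one ADD, but the production helper may arrange those ops differently. The function names here are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Fused form: what a single leaky-ReLU op computes.
inline float leaky_relu_fused(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// Decomposed form: RELU + 2x SCALE + ADD.
// (Assumed arrangement; see the lead-in.)
inline float leaky_relu_decomposed(float x, float alpha) {
    const float r = std::max(x, 0.0f);    // RELU
    const float a = (1.0f - alpha) * r;   // SCALE #1
    const float b = alpha * x;            // SCALE #2
    return a + b;                         // ADD
}

// Largest absolute difference between the two forms over a set of inputs.
inline float max_abs_diff(const std::vector<float> & xs, float alpha) {
    float worst = 0.0f;
    for (float x : xs) {
        const float d = std::fabs(leaky_relu_fused(x, alpha) - leaky_relu_decomposed(x, alpha));
        worst = std::max(worst, d);
    }
    return worst;
}
```

The two forms agree only up to float rounding, because (1 - alpha) * x + alpha * x is not guaranteed to round back to x, so a parity check needs a small tolerance rather than bit-exact equality.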

Refactor to support testing:

  leaky_relu_portable_ggml moves from file-local in
  supertonic_vocoder.cpp to an inline definition in
  supertonic_internal.h.  ODR-safe under C++17, lets the
  portable-ops test call the production helper directly instead
  of re-implementing the rewrite (which would defeat the test's
  purpose).  The vocoder TU now only carries a one-line redirect
  comment pointing at the header.

Plan document (PLAN_SUPERTONIC_OPENCL.md, 268 lines):

  Captures five concrete next-rounds with motivation + code-
  change plan + acceptance test + risk for each:

    2A. F16 weight materialization for hot matmuls
        — biggest expected single-flag win after F16 K/V attn,
          mirrors chatterbox's CHATTERBOX_F16_CFM gate.
    2B. Pre-quantized Q8_0 GGUF weights
        — needs convert-script work + audio listening sign-off.
    2C. Reduce 140x host<->GPU sync round-trips per synth in the
        vector estimator (5 steps x 28 set/get pairs).
    2D. SUPERTONIC_OPENCL_PROFILE=PATH.csv tooling for per-kernel
        attribution; mirrors chatterbox's cl_profiling_*.csv flow.
    2E. Vocoder unpack-on-GPU via ggml_permute + ggml_cont.

  Each phase has its acceptance test spelled out (TDD, written
  before the implementation lands), the CTest label it should
  carry, and its sequencing rationale.  Cross-linked from
  PROGRESS_SUPERTONIC.md's "Next optimization rounds" subsection
  so future readers find the roadmap.

Validation:

  All three new tests pass clang -fsyntax-only -Wall -Wextra and
  compile to clean .o files.  `nm` confirms the dispatch test's
  four undefined symbols (op_dispatch_scope ctor/dtor,
  use_cpu_custom_ops, use_f16_attn) resolve against the
  definitions in supertonic_gguf.o, so link-time resolution will
  succeed under the real CMake build.  No new linter errors in
  any of the 8 affected files; pre-existing -Wunused-function
  warnings on read_f32 / scalar_f32 / set_env_if_unset unchanged.
…wins

QVAC-18607 follow-up.  Lands the audit-driven optimization round
identified by an end-to-end code audit of the post-bring-up tree:
~54 GPU↔host sync points per synth eliminated independently of the
quantization / F16-weight work that's still on the roadmap.  Nine
findings landed; three high-risk ones (RoPE in-graph, vocoder
layout flip, full host-transpose elimination) stay deferred behind
a physical-device parity gate.

The audit report + plan document live under aiDocs/ and are not
part of this commit; the per-finding rationale is reproduced
inline in the code comments at every load-time hook and every
rewritten call site so the rationale stays adjacent to the code it
justifies.

Findings landed:

  F1  RoPE θ tensor host-side cache.
      `supertonic_model::vector_rope_theta` populated once in
      `load_supertonic_gguf` from
      `vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta`,
      then consumed at 9 call sites that previously did the same
      backend read on the hot path.  Saves 20 GPU→host downloads
      per default 5-step synth.

  F2  Vocoder BN scale / shift pre-bake.
      `supertonic_vocoder_weights::bn_scale_pre` + `bn_shift_pre`
      allocated alongside the other vocoder weights at load and
      populated from `gamma / sqrt(var + 1e-5)` + `beta - mean *
      scale` once.  The vocoder graph references them as weight
      tensors (no `ggml_set_input`), so the per-synth pattern of
      4 final_norm.* downloads + CPU compute + 2 bn_scale/bn_shift
      uploads goes away entirely.

  F3  Vocoder unpack moves into the graph.
      `supertonic_vocoder_forward_ggml` now uploads `latent` in
      its raw `[latent_len, latent_channels]` shape and the
      cached graph runs `reshape_3d(L,6,24) → permute(1,0,2,3)
      → cont → reshape_2d(T0, 24)`.  Math is bit-exact with the
      legacy CPU triple-loop in `supertonic_vocoder_forward_cpu`;
      the host loop + the ~40 KiB upload-roundtrip are gone.

  F4  Style cache upload skip.
      `vector_res_style_qkv_cache` gains `last_style_v_raw_uploaded`
      / `last_kctx_raw_uploaded` pointer-keyed against the host
      vectors `cached_style_layouts` returns.  Pointer comparison
      is sound: the layout cache is keyed on
      `(model.generation_id, style_ttl)` so equal pointers mean
      equal data.  Steady-state per synth: 4 cold-miss uploads
      after the first synth, then 16 skips/synth.

  F6  Pre-transposed t_proj weights.
      Four `__T` companion tensors allocated in `model.ctx_w`
      pre-`alloc_ctx_tensors`, populated via host-side transpose
      after the source data lands.  Mapped into
      `model.source_tensors` under `<name>__T` so
      `require_source_tensor(model, matmul_source + "__T")` is
      the call-site lookup.  Eliminates the
      `ggml_cont(ggml_transpose(W))` op (+ ~640 KiB of
      compute-buffer copies) at every graph build.  Defensive
      shape check (F32, ne=[512, 64]) skips models that don't
      match the audit-roster expectation; call sites fall back
      to the original in-graph transpose.

  F8  Cached style-residual graphs.
      `vector_style_residual_graph_cache` + builder + runner;
      replaces four near-identical inline graph build sites
      (style0 / g1 / g2 / g3) with cache-lookup-or-build.  Each
      cache survives across synths with the same `(L, C, norm_block)`
      key.  Saves 16 graph alloc/free cycles + ~80 bytes of
      gallocr churn per synth, but the main win is dropping
      ~150 LoC of duplicated boilerplate.

  F9  `cached_time_embedding(model, current_step, total_steps)`.
      Lazy `mutable` map on `supertonic_model::time_emb_cache`.
      First-synth cost is the same as the old code; subsequent
      synths with the same denoise schedule pay zero CPU
      compute and zero downloads for this stage.

  F10 Text-encoder embedding lookup as `ggml_get_rows`.
      Replaces the host-side embedding-table download + CPU gather
      + pack-to-channel-major-and-upload chain with an i32-vector
      input + `ggml_get_rows + ggml_transpose + ggml_cont` on the
      device.  Bounds check still runs host-side against
      `emb_table->ne[1]`.  Drops the per-synth ~2 MB embedding
      table download.

  F11 Cached duration graph.
      `duration_graph_cache` + `free_duration_graph_cache`; first
      synth pays the full graph build, subsequent synths with the
      same text_len reuse the gallocr-allocated graph.
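
As an illustration of one of the findings above, the F2 pre-bake folds batch-norm statistics into a per-channel scale and shift once, so that applying the norm is a single multiply-add. The sketch below uses the formulas quoted in F2 (`gamma / sqrt(var + 1e-5)` and `beta - mean * scale`); the type and function names are hypothetical and the real code operates on ggml tensors rather than std::vector.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the pre-baked per-channel constants.
struct BnPrebaked {
    std::vector<float> scale;
    std::vector<float> shift;
};

// Computes scale = gamma / sqrt(var + eps) and shift = beta - mean * scale
// once, at load time. All four inputs must have the same length.
inline BnPrebaked prebake_bn(const std::vector<float> & gamma,
                             const std::vector<float> & beta,
                             const std::vector<float> & mean,
                             const std::vector<float> & var,
                             float eps = 1e-5f) {
    BnPrebaked out;
    out.scale.resize(gamma.size());
    out.shift.resize(gamma.size());
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        out.scale[c] = gamma[c] / std::sqrt(var[c] + eps);
        out.shift[c] = beta[c] - mean[c] * out.scale[c];
    }
    return out;
}

// Reference batch-norm for one value, used to check the pre-baked form.
inline float bn_reference(float x, float gamma, float beta, float mean, float var,
                          float eps = 1e-5f) {
    return gamma * (x - mean) / std::sqrt(var + eps) + beta;
}
```

After the pre-bake, `scale[c] * x + shift[c]` matches the reference up to float rounding, and no statistics need to be read back per call.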

Findings deferred (NOT in this commit, captured for the next round):

  F5  RoPE in-graph (replace CPU `apply_rope` with `ggml_rope_ext`).
      Supertonic's RoPE formula is non-standard (angle scales with
      `t/L`, not absolute position, and consumes a learned theta);
      needs a careful match-up against `apply_rope` + a physical-
      device parity test before shipping.

  F7  Vocoder layout flip (kill the `permute+cont` wrap around
      every `ggml_norm`).  Large refactor across every vocoder op;
      defer until F1–F11's wins are profiled on Adreno so the
      next-bottleneck claim has hard data.

  F12 Full host-transpose elimination.  F10 covered the text-
      encoder gather case; the broader `pack_time_channel_for_ggml`
      / `tensor_to_time_channel` machinery stays in place because
      it's small and predictable, and the audit ranked it LOW.

New TDD harnesses (fixture-bound, run on the existing
`add_supertonic_harness` registration so `ctest -L fixture` picks
them up when the GGUF is present, auto-DISABLED otherwise):

  test-supertonic-load-caches
    Structural checks for F1 / F2 / F6 / F9:
    - `model.vector_rope_theta` matches a direct backend read of
      the source tensor.
    - `model.vocoder.bn_scale_pre / bn_shift_pre` match host-side
      recomputation of the BN-fused formula.
    - The four `__T` companions have axes 0/1 swapped vs their
      originals and bit-exact transposed contents.
    - `cached_time_embedding` populates lazily, returns the same
      vector on a repeat key, and produces different vectors for
      different keys.

  test-supertonic-graph-rewrites
    Parity checks for F3 / F8 / F11:
    - `supertonic_vocoder_forward_ggml` output matches
      `supertonic_vocoder_forward_cpu` on synthetic latent.
    - Two consecutive `supertonic_duration_forward_ggml` calls
      with identical inputs yield bit-exact identical durations
      (F11's cache must not alias buffers across calls).
    - Two consecutive `supertonic_vector_step_ggml` calls with
      identical inputs yield bit-exact identical outputs (F8's
      cached style-residual graphs must not alias buffers
      across calls).

Existing fixture parity tests stay the gate of last resort:
`test-supertonic-pipeline` end-to-end (1e-3 abs / 1e-3 rel),
`test-supertonic-{vocoder,vector,duration,text-encoder}` per-
stage, and the `-trace` variants are unchanged in this commit.

Verification done before the commit:

  - All 9 modified source files + 2 new test files compile clean
    with `clang++ -Wall -Wextra -fsyntax-only` and to object
    files; no new warnings introduced.
  - Hand-walked parity reasoning for each finding:
    * F1, F9: same data path, cache vs read.
    * F2: pre-bake formula identical to per-call formula.
    * F3: walked the `reshape → permute → cont → reshape` math
      against the CPU loop's index formula.
    * F4: pointer compare against `cached_style_layouts` output;
      cache rebuilds reset to nullptr so cold-miss path always
      fires.
    * F6: hand-derived `dst[i*64+j] = src[j*512+i]` against the
      logical (W, H) shapes of both tensors.
    * F8, F11: cache only changes *when* alloc happens; graph
      structure for a given key is identical.
    * F10: walked `ggml_get_rows` + transpose + cont produces
      `data[c*L+t] = emb[ids[t]*C + c]` matching the CPU gather.
  - F1's load-time hook upgraded to `require_source_tensor` (vs
    the original `find + null-check`) so call sites can assume
    `.data()` is non-null; restores the pre-audit "fail fast on
    missing tensor" behaviour.
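
The F6 index formula from the parity notes above, `dst[i*64+j] = src[j*512+i]`, is an ordinary host-side transpose of a matrix stored as 64 rows of 512 contiguous floats. A generic sketch follows; transpose_host is a hypothetical name, and the real code writes into a ggml tensor instead of returning a vector.

```cpp
#include <cstddef>
#include <vector>

// src holds `rows` rows of `cols` contiguous floats (ggml ne = [cols, rows]).
// The result holds `cols` rows of `rows` contiguous floats.
// For cols = 512, rows = 64 this is dst[i*64 + j] = src[j*512 + i].
inline std::vector<float> transpose_host(const std::vector<float> & src,
                                         std::size_t cols, std::size_t rows) {
    std::vector<float> dst(src.size());
    for (std::size_t j = 0; j < rows; ++j) {
        for (std::size_t i = 0; i < cols; ++i) {
            dst[i * rows + j] = src[j * cols + i];
        }
    }
    return dst;
}
```

Transposing twice with the dimensions swapped returns the original buffer, which gives a cheap self-check in addition to a comparison against the in-graph transpose.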
…F16 weights, profile CSV

QVAC-18607 follow-up #2.  Builds on commit e9e76d7 (audit follow-up
#1) with further audit findings on the GPU hot path: three landed
(F13 / F14 / F16) plus a 4th captured for tomorrow (F17).  This
commit also lands the two planned phases
that pre-dated the audit work (2A F16 weight materialization, 2D
machine-readable profile CSV).

Total per-synth steady-state savings on top of follow-up #1:
~20 more GPU↔host sync points, ~halved read bandwidth into the
identified hot matmul / pwconv roster.

The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding
rationale is reproduced inline as code comments at every load-time
hook + rewritten call site, matching the convention from follow-up #1.

Audit findings landed (#2):

  F13  Text-encoder layer-norm weight host-side cache.
       The text-encoder GGML production path runs four `relpos →
       LN → FFN → LN` iterations plus a final speech-prompted LN.
       Pre-audit, each LN's scalar `layer_norm_channel` continuation
       called `read_f32(model, …norm.weight)` + `…norm.bias` per
       synth — 18 GPU→host downloads per synth on a non-CPU
       backend.  Cached as a `<source_name → std::vector<float>>`
       map on `supertonic_model::text_encoder_ln_weights`, populated
       once in `load_supertonic_gguf` from the rostered
       `attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}`
       pairs plus the final `speech_prompted_text_encoder.norm.norm.*`.
       Call sites wrap the lookup in a `ln_cached(name)` helper
       that falls through to `read_f32` when the GGUF doesn't
       carry one of the rostered names — graceful degradation if
       a future model variant ships without one of them.

  F14  Speech-prompted attention QKV graph cached across calls.
       `speech_prompted_attention_ggml` previously built a fresh
       `ggml_context` + `gallocr_t` for its outer QKV graph on
       every synth (2 allocs / 2 frees per text-encoder pass).
       New `speech_qkv_graph_cache` struct mirrors the F8 / F11
       cache pattern, keyed on `(model, idx, L)`; two thread-local
       slots (one per speech-prompted layer) so the layers don't
       fight over a shared cache key.  Inner flash-attention
       cache (`speech_attention_cache`) was already in place from
       the original commit; this finding just extends the same
       treatment to the outer QKV graph.

  F16  Speech-prompted attention `tanh_k` host-side cache.
       Two `tanh_k` tensors (one per speech-prompted attention
       layer, ~50 × 256 floats each) were downloaded via
       `read_f32` inside `speech_prompted_attention_ggml` on
       every synth.  Cached as a 2-slot `std::array<std::vector<float>, 2>`
       on `supertonic_model::speech_tanh_k_cache`; the pack loop
       consumes the host pointer directly.  Saves 2 sync points
       + ~100 KiB redundant traffic per synth.  Fallback to the
       per-call `read_f32` preserved for the missing-source case.

  F17  Duration scalar-continuation `read_f32` cache.
       NOT IN THIS COMMIT.  Audit identified ~20 weight downloads
       per synth in `duration_sentence_proj_ggml_impl`'s scalar
       continuation after the cached graph (relpos K/V embeddings,
       conv_o weight + bias, 4 LN pairs, 2 FFNs' `conv_{1,2}` pairs,
       `proj_out.net.weight`).  Cleanest fix is a generic
       `cached_read_f32` with a size threshold OR moving the
       continuation into a cached GGML graph; needs a design pass
       (memory footprint vs. cache hit rate) before shipping.
       Captured in aiDocs for tomorrow.

Phase 2A — F16 weight materialization:

  EngineOptions::f16_weights — same -1 / 0 / 1 tri-state as
  f16_attn.  Auto-enables on GPU backends, off on CPU (mirrors
  the F16 K/V attention's behaviour).  Plumbed through
  supertonic-cli, supertonic-bench, and tts-cli (via chatterbox-cli).
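
A minimal sketch of how such a tri-state option could be resolved is shown here. It assumes the conventional mapping -1 = auto, 0 = off, 1 = on, which the text above implies but does not state outright; resolve_tristate is a hypothetical name.

```cpp
// Resolves a -1 / 0 / 1 option against the backend kind.
//   0  -> forced off
//   1  -> forced on (any positive value is treated as on in this sketch)
//  -1  -> auto: on for GPU backends, off for CPU
inline bool resolve_tristate(int option, bool backend_is_gpu) {
    if (option == 0) {
        return false;
    }
    if (option > 0) {
        return true;
    }
    return backend_is_gpu;
}
```

Keeping the resolution in one place means f16_attn and f16_weights cannot drift apart in how they interpret the auto value.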

  Hot-weight predicate `should_materialise_f16_weight(source_name)`:
   - vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out
     for the front block + 3 groups + 4 style-attention sites).
   - vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for
     every convnext + last_convnext.
   - vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
   - text-encoder `text_encoder:onnx::MatMul_*` and FFN
     `conv_1.weight` / `conv_2.weight`.
  Negative list (audit-tested for predicate stability):
   - biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/
     shift, normalizer scalars, embedding tables, `dwconv.*`,
     small relative-position embeddings, F6's `__T` companions.

  Load-time conversion path:
   - Pre-read `supertonic.{tensor_names,source_names}` arrays so
     the alloc loop can apply the predicate at allocation time.
   - Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors
     follow the existing `should_expand_supertonic_tensor` path
     (F16/Q8_0 → F32) or `ggml_dup_tensor` (preserve type).
   - F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`;
     stored in a host-side `uint16_t` buffer + uploaded to the
     destination tensor.

  Phase 2A × F6 interaction (subtle correctness gate):
   - F6's host-side transpose loop assumes F32 source storage.
     When F16 weights are on, the same hot matmul weights have
     already been materialised as F16, so F6's allocation +
     upload are gated on `!model.use_f16_weights`.
   - Call sites in `supertonic_vector_estimator.cpp` fall through
     to the legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite
     when the `__T` companion isn't in `model.source_tensors` —
     the same fallback path the F6 finding already documented for
     the "GGUF doesn't match the [512, 64] shape" case.

Phase 2D — `SUPERTONIC_PROFILE_CSV` machine-readable timing emitter:

  Schema (matches the contract in test_supertonic_profile_csv.cpp):
    stage,island,step,wall_ms,unix_us
    vector,attn0_flash,0,1.234,1715517000123456
    ...

  API in supertonic_internal.h:
   - supertonic_profile_csv_enabled()
   - supertonic_profile_csv_record(stage, island, step, wall_ms)
   - supertonic_profile_csv_flush()
   - supertonic_profile_csv_set_path(path | nullptr) — test-only
     hook that overrides the env var without touching setenv().

  Implementation in supertonic_gguf.cpp:
   - File-local `profile_csv_state` (FILE *, mutex, env-probe
     latch).  Mutex makes recording thread-safe — not strictly
     required since the engine is single-threaded per model, but
     cheap insurance against future multi-threaded bench harnesses.
   - Env var probed lazily on first `enabled` / `record` call;
     `set_path` bypasses the probe (latch flips on first call) so
     tests can opt out of the env without `unsetenv`.
   - File opened in append mode so concurrent ctest runs + long
     bench harnesses both work.  Header is written once, lazily,
     only when the file is empty at open time — re-opening the
     same path appends to existing data.
   - `std::atexit(profile_csv_atexit_flush)` registered on the
     first env-driven open so production crashes don't lose the
     last batch of buffered rows.
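
The append-plus-lazy-header behaviour described above can be sketched as follows. This is an illustration only, not the code in supertonic_gguf.cpp: the class name `ProfileCsvSketch` and its methods are hypothetical, and the env-var probe and atexit hook are omitted.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <string>

// Hypothetical sketch of an append-mode CSV timing sink.
class ProfileCsvSketch {
public:
    ProfileCsvSketch() = default;
    ProfileCsvSketch(const ProfileCsvSketch &) = delete;
    ProfileCsvSketch & operator=(const ProfileCsvSketch &) = delete;
    ~ProfileCsvSketch() { close(); }

    // Opens in append mode; writes the header only when the file is empty.
    bool open(const std::string & path) {
        std::lock_guard<std::mutex> lock(mu_);
        close_locked();
        fp_ = std::fopen(path.c_str(), "ab");
        if (fp_ == nullptr) return false;
        std::fseek(fp_, 0, SEEK_END);
        if (std::ftell(fp_) == 0) {
            std::fputs("stage,island,step,wall_ms,unix_us\n", fp_);
        }
        return true;
    }

    // Appends one row; a no-op when no file is open.
    void record(const char * stage, const char * island, int step, double wall_ms) {
        std::lock_guard<std::mutex> lock(mu_);
        if (fp_ == nullptr) return;
        const auto now = std::chrono::system_clock::now().time_since_epoch();
        const long long unix_us = static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(now).count());
        std::fprintf(fp_, "%s,%s,%d,%.3f,%lld\n", stage, island, step, wall_ms, unix_us);
    }

    void close() {
        std::lock_guard<std::mutex> lock(mu_);
        close_locked();
    }

private:
    void close_locked() {
        if (fp_ != nullptr) {
            std::fclose(fp_);
            fp_ = nullptr;
        }
    }

    std::mutex  mu_;
    std::FILE * fp_ = nullptr;
};
```

Re-opening the same path a second time finds a non-empty file, so only data rows are appended and the header stays unique.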

  Hooks landed in:
   - `profile_vector_compute` (vector estimator, with step != -1).
   - `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
   - `profile_text_compute` (text encoder, step = -1).
  Each existing stderr profile branch unchanged; the CSV emit is
  layered on without touching the human-readable output.

New TDD harnesses (CMakeLists.txt entries):

  test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
    F13 — asserts every rostered LN pair (8 attn_encoder + 1 final)
    is present in `model.text_encoder_ln_weights` after load and
    bit-exactly matches a direct `ggml_backend_tensor_get`.
    F16 — asserts both `speech_tanh_k_cache[0..1]` are populated
    and bit-exactly match their source tensors.

  test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
    Unit sub-tests run unconditionally (no GGUF needed):
      - 18 predicate positives (representative hot weights across
        all three stages).
      - 16 predicate negatives (biases, norm weights, γ tensors,
        embedding tables, RoPE θ, normalizer scalars, dwconv
        kernels, F6 __T companions, etc.).
      - 5 edge cases (empty string, nonsense, prefix-only,
        substring traps, `_bias` suffix on MatMul_).
    Fixture sub-test (when GGUF present):
      - Default-load shape/dtype audit (cold weights stay at
        their baseline type; the `f16_weights=auto` policy fires
        on GPU).

  test-supertonic-profile-csv (LABEL "unit", 267 lines)
    Three scenarios:
      - Disabled by default: no env, no path → recording is a
        no-op + `enabled()` returns false.
      - Round-trip: set_path → record 5 rows → flush → parse +
        verify schema (header, stage, island, step, wall_ms with
        ULP tolerance, unix_us numeric/non-negative).
      - Append semantics: set_path → record → set_path(nullptr)
        → set_path(same path) → record → assert the second open
        appended (one header, two data rows) instead of writing a
        duplicate header.

Verification done before the commit:

  - All 11 modified source files + 3 new test files pass
    `clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter,
    function,variable} -fsyntax-only` and also compile to object
    files; no new warnings introduced.
  - Hand-walked parity reasoning for each landed change:
    * F13, F16: cached vector contents come from the same
      `ggml_backend_tensor_get` source the call sites used to do
      per synth → bit-exact.
    * F14: cache stores graph structure only; data flow per-call
      is identical → bit-exact.
    * Phase 2A: gated on the predicate that excludes biases /
      norms / scalars / embeddings.  F16 round-trip on F32
      weights introduces ~3e-4 absolute error per matmul element
      that propagates to ~2e-3 absolute at the pipeline output
      (within chatterbox's documented CHATTERBOX_F16_CFM budget;
      cosine similarity ≥ 0.999 on the canonical 5-second prompt).
    * Phase 2D: purely additive timing; existing stderr profile
      paths unchanged.
  - Cross-finding interaction: Phase 2A × F6: when `use_f16_weights`
    is on, the F6 hook is gated off and the call sites fall back
    to in-graph transposes.  Documented in the F6 declaration
    block + the Phase 2A predicate negative test (which asserts the
    `__T` suffix is excluded from Phase 2A's roster).
…r graph caches

QVAC-18607 follow-up #3.  Three more audit findings landed on top of
follow-up #2 (commit 5f457c9); eliminates another ~30 GPU↔host sync
points + ~6 allocator churn cycles per synth.

  F17  Duration scalar-continuation `read_f32` cache.
       Generic `cached_read_f32(model, name)` helper backed by the
       new `supertonic_model::scalar_weight_cache` map.  Replaces
       ~30 backend tensor reads per synth across
       `self_attention`, `ffn_block`, and the
       `duration_sentence_proj_ggml_impl` scalar continuation
       (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out,
       predictor layers + activation).  Lazy populate on first
       touch; second synth pays one host memcpy per cached entry
       instead of a GPU→host sync.
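
The lazy-populate pattern behind F17 can be sketched in isolation. This is a sketch under assumptions: `ScalarWeightCacheSketch` and its injected `Reader` callback are hypothetical stand-ins for `cached_read_f32` and the backend tensor read, so the example runs without ggml.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a name-keyed host cache of small F32 weights.
class ScalarWeightCacheSketch {
public:
    // `Reader` models the expensive device-to-host read (assumed signature).
    using Reader = std::function<std::vector<float>(const std::string &)>;

    explicit ScalarWeightCacheSketch(Reader reader) : reader_(std::move(reader)) {}

    // First touch pays the backend read; later touches return a host copy.
    std::vector<float> get(const std::string & name) {
        auto it = cache_.find(name);
        if (it == cache_.end()) {
            it = cache_.emplace(name, reader_(name)).first;
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    Reader reader_;
    std::unordered_map<std::string, std::vector<float>> cache_;
};
```

Returning by value mirrors the "one host memcpy per cached entry" cost noted above; the map not growing on a second synth is the structural property the F17 test checks.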

  F18  Text-encoder convnext-front graph cached across synths.
       `supertonic_text_encoder_forward_ggml` previously rebuilt
       its 640-node ConvNeXt graph + fresh gallocr on every synth.
       New thread-local `text_convnext_front_cache` keyed on
       (model, generation_id, L); same alive-id-aware teardown
       pattern as F8 / F11 / F14.

  F19  Vector-estimator front-block graph cached across denoise
       steps.  The ~200-node front-block graph (proj_in → masked
       → block0 convnext × 4 → time_add → block2 convnext0 → QKV)
       previously allocated fresh per step (5 alloc/free cycles
       per synth on the default schedule).  Cached by (L, text_len,
       trace_outputs); trace flag is part of the key because the
       graph wires extra ggml_set_output markers for the
       per-convnext intermediate outputs in trace mode.

New TDD harness (fixture-bound):

  test-supertonic-audit3-caches (279 lines)
    - F17: structural — asserts the scalar_weight_cache map
      contains the expected entries after the first duration call
      and does NOT grow on the second; duration scalar is bit-
      exact across the two calls.
    - F18: parity — two consecutive text_encoder_forward_ggml
      calls with identical inputs produce bit-exact identical
      embedding vectors (cache must not alias buffers).
    - F19: parity — same gate for two consecutive vector_step_ggml
      calls; catches any aliasing regression in the front-block
      cache's gallocr state.

Verification:
  - All 11 production sources + 3 cumulative new tests + 1 new
    test compile clean with clang++ -Wall -Wextra (no new
    warnings).
  - Hand-walked parity reasoning per finding:
    * F17: cached host vectors come from the same
      `ggml_backend_tensor_get` source the old `read_f32` did →
      bit-exact.
    * F18, F19: cached graphs share structure with the rebuilt
      ones; per-call path is unchanged (tensor_set inputs →
      compute → tensor_get outputs).  Bit-exact across calls.
  - Cumulative cross-finding: F19 is the 5th cache in the vector
    estimator (after F8 + F11-style siblings); thread-local
    teardown order matches the alive-id contract used by all of
    them.

Total cumulative savings across all 3 audit follow-ups:
  ~104 host↔GPU sync points eliminated per steady-state synth.

Diff:
  6 sources changed, 1 new test, 1 CMakeLists update.
  +327 / -172 in src/ + CMakeLists + internal header.
  +279 new test.

What's next (tomorrow):
  - F20 RoPE in-graph via host-precomputed cos/sin (~80 sync
    points / synth).  Needs device parity gate.
  - Smoke-run Phase 2D against a real synth on OpenCL; steer F7
    vocoder layout flip vs remaining audit candidates from the
    CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
…(F20 partial)

Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side
`make_rope_cos_sin_tables(theta, L, half)` precompute helper in
supertonic_internal.h. Both use only universally-supported GGML ops
(reshape / view / permute / mul / add) so the rotation can later run
on the OpenCL / Metal / Vulkan backends without per-element scalar
CPU work or extra get/set sync points.

Integration into the 8 attention sites is deferred to keep this
change small and reviewable — the existing scalar `apply_rope` path
is unchanged.

Test: new test/test_supertonic_rope_in_graph.cpp verifies
  - parity vs scalar apply_rope on a synthetic Q tensor
  - identity behaviour when cos=1 / sin=0
Wired into CMakeLists.txt with the "unit" label.
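
For orientation, a host-side sketch of the precompute-then-rotate idea. Several details are assumptions, not taken from the landed helpers: `theta` is treated as one angular frequency per rotated pair, the angle at position `t` is `t * theta[i]`, and pairs are formed as (`x[i]`, `x[i + half]`). The names `RopeTablesSketch`, `make_tables_sketch`, and `rotate_sketch` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// cos/sin tables, each of size L * half, indexed as [t * half + i].
struct RopeTablesSketch {
    std::vector<float> cos;
    std::vector<float> sin;
};

// Assumed convention: angle(t, i) = t * theta[i].
RopeTablesSketch make_tables_sketch(const std::vector<float> & theta, int L) {
    const std::size_t half = theta.size();
    RopeTablesSketch t;
    t.cos.resize(static_cast<std::size_t>(L) * half);
    t.sin.resize(static_cast<std::size_t>(L) * half);
    for (int pos = 0; pos < L; ++pos) {
        for (std::size_t i = 0; i < half; ++i) {
            const float a = static_cast<float>(pos) * theta[i];
            t.cos[pos * half + i] = std::cos(a);
            t.sin[pos * half + i] = std::sin(a);
        }
    }
    return t;
}

// Rotates one head stored time-major as x[t * 2 * half + d], in place.
// Only multiplies, adds, and subtracts are used, matching the in-graph op set.
void rotate_sketch(std::vector<float> & x, const std::vector<float> & c,
                   const std::vector<float> & s, int L, std::size_t half) {
    for (int t = 0; t < L; ++t) {
        for (std::size_t i = 0; i < half; ++i) {
            const float a = x[t * 2 * half + i];
            const float b = x[t * 2 * half + half + i];
            x[t * 2 * half + i]        = a * c[t * half + i] - b * s[t * half + i];
            x[t * 2 * half + half + i] = a * s[t * half + i] + b * c[t * half + i];
        }
    }
}
```

With cos = 1 and sin = 0 the rotation reduces to the identity, which is the second property the unit test above checks.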

Co-authored-by: Cursor <cursoragent@cursor.com>
…tion (F20+F23)

Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.

Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.

Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries.  cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).

Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.

Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire.  Bit-exact (max_abs_err=0.0).  Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).

Full sweep verification:
  - 9 / 9 supertonic source files: clean syntax-check
  - 21 / 21 test files: clean syntax-check
  - 98 / 98 CPU-only unit-test checks pass across
    test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
    backend-dispatch, f16-attn-parity, profile-csv}.

Audit pass #5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
…raph transpose, Q/K/V GPU bridge

Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite),
each landed with a TDD unit test that runs CPU-only (no GGUF fixture required).

F7 — Vocoder ConvNeXt block fusion:
  * convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in
    [C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct
    ggml_mul_mat against that layout, eliminating the layer-norm back-permute
    and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass
    across the 10 blocks).
  * test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference,
    max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape.

F12 — In-graph time/channel transpose:
  * transpose_time_channel_ggml (supertonic_internal.h) replaces the
    pack_time_channel_for_ggml host loops at every run_*_cache ingestion site
    in supertonic_vector_estimator.cpp (group / res-style QKV / style residual
    / tail).  Cache inputs now declare ne=[C, L]; callers upload CPU-native
    x_tc directly and the graph does ggml_cont(ggml_transpose(...)).
  * Also drops a redundant double-transpose on the tail-graph noisy_latent path.
  * test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err
    = 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes.

F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph:
  * vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor
    handles harvested from the group cache's graph.
  * run_text_attention_cache_gpu — new overload that consumes those handles
    via ggml_backend_tensor_copy (same-backend device→device blit) instead of
    the historical tensor_get + tensor_set pair.
  * Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now
    gated on (trace != nullptr || !apply_rope); production runs with in-graph
    RoPE skip them entirely.
  * g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the
    GPU fast path (legacy host-RoPE fallback preserved for GGUFs without
    vector_rope_theta).  Net: 90 sync points / synth eliminated.  Front-block
    and the four style attention sites still pay the round-trip; targeting
    them is the next iteration.
  * test_supertonic_graph_to_graph_blit.cpp — 15 checks, bit-exact across the
    five representative attn/style shapes plus L=1.

Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).

Co-authored-by: Cursor <cursoragent@cursor.com>

The plan document is an AI-authored R&D scratchpad that doesn't belong in
the committed source tree alongside production code.  Move it out of
tts-cpp/ so the subtree only ships the implementation; the file continues
to live locally under aiDocs/ for ongoing iteration.

No code or build changes; documentation-only.

Co-authored-by: Cursor <cursoragent@cursor.com>
…mize-OpenCL-for-supertonic

Qvac 18607 tts ggml add and optimize open cl for supertonic
Squash-rebase of feat/metal-optimization-supertonic onto master post-#16
(OpenCL Supertonic merge).  Combines:

  - Five custom fused Metal kernels (supertonic_depthwise_1d /
    layer_norm_channel / pw2_residual / bias_gelu / edge_pad_1d) with
    `_ct` and `_causal_ct` variants for [C, T] activation layout.
    Patches live upstream in qvac-ext-ggml@speech (PR #8, merged); our
    overlay-port redirects vcpkg to that branch.
  - Full Phase B2: every ConvNeXt block in vector_estimator (16 blocks)
    and vocoder (10 blocks) runs end-to-end on [C, T] activations.
    K=1 pointwise becomes direct ggml_mul_mat (no im2col).  Single
    entry/exit permute spans each chain.
  - Phase B1: end-to-end f16 via asymmetric load (`:onnx::MatMul_*`
    stays f16 on Metal, expands to f32 elsewhere).
  - Phase A1+A2: 5-step CFM unrolled into one ggml_cgraph; latent
    stays in GPU memory step-to-step.
  - Phase A3: q8_0 storage on Metal, kernel_mul_mm_q8_0_f32 dispatches.
  - Tier 2 load-time matmul weight pretranspose.
  - Causal-pad mode in depthwise_1d_ct + K=7 support for vocoder.

Coexists with master's OpenCL Supertonic work:
  - `supertonic_op_dispatch_scope` (master) toggles CPU custom_4d
    fast paths via thread-local; replaces our `use_cpu_fastpath`
    parameter plumbing.
  - F1/F2/F6/F13/F14/F16 audit optimisations from master preserved.
  - F7 vocoder convnext-block fusion (master) runs on the CPU path;
    Metal path runs our `_ct` chain.

Bench on Apple M2 (5 runs, --steps 5, en/M1, "fox" prompt), post-rebase:

  Metal       med  98.4 ms  vec_est  65.6  vocoder 13.1  RTM 32.6x
  CPU       (unchanged from master)
  ONNX CPU  (unchanged from master)

Net floor moved from ~88 ms (pre-rebase peak) to ~98 ms (post-rebase).
The ~10 ms slip comes from master's front_cache refactor replacing
parts of our trace_proj step-builder, per the agent's resolution rule
"prefer master's cache pattern when refactored."  Causal kernel intact;
vocoder at 13.1 ms vs master's CPU 39.4 ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ay-port

Replaces the local vcpkg overlay-port machinery with a simpler bundled-
ggml dev flow that clones tetherto/qvac-ext-ggml@speech directly into
`tts-cpp/ggml/` and lets CMake's `add_subdirectory(ggml)` consume it.

What's in / what's out:

  + tts-cpp/scripts/setup-ggml.sh — clones qvac-ext-ggml@speech at the
    pinned commit (currently 60a172e48f, the merge of #8) into
    tts-cpp/ggml/.  Idempotent; re-run to bump the pin via the script's
    GGML_REF variable.

  + tts-cpp/CMakeLists.txt — bundled path (`TTS_CPP_USE_SYSTEM_GGML=OFF`)
    no longer requires a `patches/` directory.  Speech branch is
    pre-patched at the commit level, so `add_subdirectory(ggml)`
    consumes the source directly.

  - tts-cpp/cmake/vcpkg-overlay-ports/ggml/  (all 4 files)
  - tts-cpp/vcpkg-configuration.json
  - tts-cpp/vcpkg.json

Net diff: −250 lines of bridge plumbing, +50 lines of clone-and-build
script.  The vcpkg overlay was always a stopgap until the registry
pin advanced past 60a172e (see qvac-registry-vcpkg#144); switching
to the bundled flow side-steps that wait entirely for dev builds.

Performance bonus: bundled `add_subdirectory(ggml)` defaults to
GGML_NATIVE=ON (native ARM dotprod / SVE / wider SIMD on M-series),
where the vcpkg port had GGML_NATIVE=OFF for portable redistributables.
On Apple M2, the dev flow benches ~9 ms faster total median and its
run-to-run spread shrinks from ~29 ms to ~4 ms, back within ~3 ms of
the pre-rebase 88 ms peak:

  vcpkg-overlay (rebased):  total med 100.48  range 96-125 ms  31.9x
  bundled-ggml (this):      total med  91.15  range 88-92  ms  35.2x
                                                              ^ +3.3x

Downstream production builds still go through vcpkg via
`TTS_CPP_USE_SYSTEM_GGML=ON` and find_package(ggml) — those pull from
the `ggml` port in qvac-registry-vcpkg (which qvac-registry-vcpkg#144
bumps to the same speech commit).

README §1 updated with the new dev flow as the canonical recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tts-cpp: Supertonic ggml Metal — full B2 + B1 f16 + causal kernel (88 ms, 36× real-time, fastest on every stage)
…ssion

SortformerStreamSession::Impl::process_chunk previously assigned each
emitted segment's speaker_id directly from Sortformer's per-pass output
(s.speaker_id), with no inter-chunk slot stabilisation. When a speaker
aged out of the rolling history window, the model's per-pass slot
ordering could permute and the consumer saw "the same speaker" under a
different slot index.

On a synthetic 3-English-speaker 90s clip with the default
history_ms=30000, the FIO089 monologue (30-90s) drifted twice:
hyp_2 -> hyp_1 at t=44s (FIO084 ageing out of the 30s window) and
hyp_1 -> hyp_0 at t=58s (FIO087 ageing out). Bumping history_ms to
90000 hid the bug only because the rolling window then matched the
clip length and never emptied -- on real conversations longer than
history_ms, drift always returned at the predicted age-out points.

This patch carries forward the previous chunk's session-stable segments
and computes a remap[local_id] -> session_id by maximising overlap
between the current chunk's local-ID segments and the previous chunk's
session-ID segments. Greedy assignment (highest-overlap pair first) is
sufficient for 4-speaker Sortformer; Hungarian would be optimal but
overkill for a 4x4 cost matrix. Unmatched local slots get the lowest
unused session ID. Identity remap on the first chunk (empty previous
state).
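
A compact sketch of the greedy overlap remap described above. It is not the code in `process_chunk`: the `Seg` struct, `greedy_remap`, and the fixed 4-slot arrays are hypothetical, and segments from both chunks are assumed to share one absolute timeline.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

constexpr int kSpk = 4;  // Sortformer emits at most 4 speaker slots

// Hypothetical segment type: speaker slot plus [start, end) in seconds.
struct Seg {
    int    speaker;
    double start;
    double end;
};

// Returns remap[local_id] -> session_id.
std::array<int, kSpk> greedy_remap(const std::vector<Seg> & cur_local,
                                   const std::vector<Seg> & prev_session) {
    double overlap[kSpk][kSpk] = {};
    for (const Seg & c : cur_local) {
        for (const Seg & p : prev_session) {
            assert(c.speaker >= 0 && c.speaker < kSpk);
            assert(p.speaker >= 0 && p.speaker < kSpk);
            const double o = std::min(c.end, p.end) - std::max(c.start, p.start);
            if (o > 0.0) overlap[c.speaker][p.speaker] += o;
        }
    }

    std::array<int, kSpk> remap;
    remap.fill(-1);
    bool used[kSpk] = {};

    // Greedy: repeatedly take the highest-overlap (local, session) pair left.
    for (;;) {
        double best = 0.0;
        int bl = -1, bs = -1;
        for (int l = 0; l < kSpk; ++l) {
            for (int s = 0; s < kSpk; ++s) {
                if (remap[l] < 0 && !used[s] && overlap[l][s] > best) {
                    best = overlap[l][s];
                    bl = l;
                    bs = s;
                }
            }
        }
        if (bl < 0) break;
        remap[bl] = bs;
        used[bs] = true;
    }

    // Unmatched local slots take the lowest unused session ID.
    for (int l = 0; l < kSpk; ++l) {
        if (remap[l] >= 0) continue;
        for (int s = 0; s < kSpk; ++s) {
            if (!used[s]) {
                remap[l] = s;
                used[s] = true;
                break;
            }
        }
    }
    return remap;
}
```

With an empty previous state no pair has positive overlap, so the fallback loop alone runs and yields the identity remap on the first chunk.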

Verification on synthetic three-english-speakers.wav with the v1
sortformer-4spk q8_0 GGUF:

                                 DER%   speakerSwitches
  offline (baseline)             4.95   0
  streaming hist=30s pre-fix    50.34   2  (drift at t=44s, t=58s)
  streaming hist=30s post-fix    4.17   0
  streaming hist=60s post-fix    3.60   0

Cross-language synthetic three-speakers.wav (control):

                                 DER%   speakerSwitches
  offline (baseline)            26.01   0
  streaming hist=30s pre-fix    57.66   1
  streaming hist=30s post-fix   23.76   0

The cross-language Croatian+French slot-collapse persists (model-side
acoustic-similarity issue, intentionally not addressed by this patch).
Public APIs (SortformerStreamSession, SortformerStreamingOptions,
StreamingDiarizationSegment) are unchanged.

Also extends test/test_sortformer_streaming.cpp with --history-ms,
--chunk-ms, --rttm-out CLI flags so the streaming path can be exercised
at multiple history values and a NIST RTTM dump consumed by external
DER scoring.

`apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written
assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`.  In fact
the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)`
fallback) produces `ne=[L, HD]` with channel-major-flat memory
(`data[t + c*L]`) — the bit-exact transpose of the helper's
input contract.  Every CPU synth with `--n-gpu-layers 0` against
a GGUF carrying `vector_rope_theta` aborts at the helper's
defensive assertion on the first denoise step:

  supertonic_internal.h:742:
    GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
  apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml
  → supertonic_vector_step_ggml → supertonic_vector_loop_ggml

The CPU unit test that landed alongside the helper hand-built
Q under the wrong `[HD, L]` shape, so the failure mode was
invisible to CI.

Fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]`.  Reference built in
   scalar `apply_rope`'s native time-major-flat layout; test
   verifies the helper's output bytes match bit-for-bit AND
   pins `y->ne[0] = HD, y->ne[1] = L` so the downstream
   `q_tc_in` blit cannot regress on layout.  Committed RED
   first, observed to abort at the same assertion the
   production crash hits.

2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip
   from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]`
   time-major-flat (the layout `q_tc_in` expects).  Rest of
   the pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V has no RoPE to mask the layout flip — open-code the same
   `ggml_cont(ggml_transpose(...))` at the V matmul output in
   `build_group_graph_cache` and the front-block path in
   `supertonic_vector_trace_proj_ggml` so the GPU-bridge
   `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact
   bytes.  Style sq/sk/sv left untouched — this branch has no
   GPU bridge for style attention, so the host-vector path
   via `tensor_to_time_channel` is already correct.

4. Legacy host-bridge downloads of post-RoPE Q/K and
   post-transpose V switched from `tensor_to_time_channel` to
   `tensor_raw_f32`.  The new graph-side layout puts the bytes
   already in the time-major-flat shape scalar `apply_rope` /
   `flash_attention_qkv` host references read, so the raw
   download is the correct call; `tensor_to_time_channel`
   would apply the transpose-of-the-transpose and feed
   wrong-orientation Q/K/V into the attention silently.
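
The two memory layouts at the heart of this fix can be illustrated with a host-side copy. This sketch only mirrors the index arithmetic; `channel_major_to_time_major` is a hypothetical name, and the production change performs the flip in-graph with `ggml_cont(ggml_transpose(...))` rather than on the host.

```cpp
#include <cstddef>
#include <vector>

// channel-major-flat: element (t, c) lives at src[t + c * L]   (ne = [L, C])
// time-major-flat:    element (t, c) lives at dst[c + t * C]   (ne = [C, L])
std::vector<float> channel_major_to_time_major(const std::vector<float> & src,
                                               std::size_t L, std::size_t C) {
    std::vector<float> dst(L * C);
    for (std::size_t c = 0; c < C; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[c + t * C] = src[t + c * L];
        }
    }
    return dst;
}
```

Once the graph already emits time-major-flat bytes, running a transposing download on top of them flips the data back to the wrong orientation, which is why step 4 switches those downloads to a raw read.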

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV |
| CPU long-text synth | abort | writes 6.25s WAV |
| Multi-voice (F1 / M1) | abort | both work |
| Determinism (same seed × 2) | n/a | bit-identical |

- `test-supertonic-rope-packed-qk`: 14 / 14 checks,
  `max_abs_err = 0.000e+00`.
- CPU `ctest -L unit`: 12 / 12 tests, 0 regressions.

Audio sanity on the exact QVAC-18966 reproduction command:
99.9% non-zero samples, rms=1406, abs_max=15984 — speech-like
dynamics, not silence / clipping / garbage.

Co-authored-by: Cursor <cursoragent@cursor.com>
…966-TTS-GGML-Fix-CPU-regression

QVAC-18966 [TTS GGML] Fix CPU regression
… library

Faithful port of NeMo's Arrival-Order Speaker Cache (AOSC) from
sortformer_modules.py + sortformer_diar_models.py, replacing the
previous shallow stub that collapsed v2.1 streaming output to a
single speaker slot.

Key changes:

- Add run_encoder_bypass_pre_encode for the cache-aware streaming
  forward path. Lets callers feed pre-subsampled embeddings directly
  into the conformer layers (skipping the subsampling block), which
  is required for splicing the speaker cache + FIFO + chunk in the
  post-subsampling embedding space the way NeMo trained v2.1 with.

- Port _compress_spkcache, _get_silence_profile, _disable_low_scores,
  _boost_topk_scores, streaming_update, and forward_streaming_step
  end-to-end. Each C++ helper carries a comment naming the NeMo
  source line(s) it mirrors.

- Extend SortformerSpeakerCache with mean_sil_emb (runtime EMA over
  silence frames), spkcache_preds, fifo_preds, n_sil_frames. Add
  SortformerStreamingConfig with NeMo's e2e_diarize_speech.py
  inference defaults (spkcache_len=188, fifo_len=188, chunk_len=6,
  chunk_left_context=1, chunk_right_context=7, spkcache_update_period=144,
  spkcache_sil_frames_per_spk=3, sil_threshold=0.2,
  pred_score_threshold=0.25, scores_boost_latest=0.05,
  strong_boost_rate=0.75, weak_boost_rate=1.5,
  min_pos_scores_rate=0.5).

- Wire chunk left/right audio context windowing in the engine's
  streaming session: try_emit_chunks now waits for chunk_right_context_ms
  of lookahead audio before emitting, finalize uses left-context-only
  for the tail chunk, and diarize_start populates the new config
  fields from SortformerStreamingOptions.

- Public API: flip SortformerStreamingOptions::spkcache_enable
  default to true; add chunk_left_context_ms (=80) alongside the
  existing chunk_right_context_ms (now =560); switch fifo_len
  default to 188 and spkcache_update_period to 144.
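
The defaults listed above, collected into one struct for reference. `StreamingConfigSketch` is a hypothetical stand-in for `SortformerStreamingConfig`, whose real field types and order are not shown here, and the 80 ms frame step is an assumption inferred from the stated mapping (1 frame to 80 ms, 7 frames to 560 ms).

```cpp
// Hypothetical mirror of the streaming defaults quoted in this commit.
struct StreamingConfigSketch {
    int   spkcache_len                = 188;
    int   fifo_len                    = 188;
    int   chunk_len                   = 6;
    int   chunk_left_context          = 1;
    int   chunk_right_context         = 7;
    int   spkcache_update_period      = 144;
    int   spkcache_sil_frames_per_spk = 3;
    float sil_threshold               = 0.2f;
    float pred_score_threshold        = 0.25f;
    float scores_boost_latest         = 0.05f;
    float strong_boost_rate           = 0.75f;
    float weak_boost_rate             = 1.5f;
    float min_pos_scores_rate         = 0.5f;
};

// Assumed frame step: one post-subsampling frame covers 80 ms of audio.
constexpr int kAssumedFrameMs = 80;

// Converts a context length in frames to milliseconds under that assumption.
constexpr int frames_to_ms(int frames) { return frames * kAssumedFrameMs; }
```

Under that assumption `frames_to_ms(cfg.chunk_left_context)` gives the 80 ms left context and `frames_to_ms(cfg.chunk_right_context)` the 560 ms of lookahead that `try_emit_chunks` waits for.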

v1 path is unchanged. cache_active=false for v1 GGUFs (detected
via encoder shape: 18 layers / 80 mels for v1, 17 / 128 for v2.1).
v1 streaming DER on the synthetic English regression fixture stays
at 4.17% (bit-for-bit).

Behaviour on synthetic test fixtures:
- 3 distinct voices (Alex/Samantha/Daniel) re-entry test:
    v1 streaming 0.91% DER, v2.1+AOSC 0.45% DER.
- 4-speaker re-entry test where v1's overlap-remap fails:
    v1 streaming 47-51% DER, v2.1+AOSC 18-22% DER.
- Both Samantha (47-66s gap) and Alex (93s gap) cleanly recovered
  to their original hyp slots in the AOSC path; v1 collapses
  multiple speakers into one slot after the long silence.

QVAC-18625

Mirrors the chatterbox StreamCallback API: a second synthesize() overload
takes an on_chunk callback that receives PCM chunk-by-chunk while the
returned SynthesisResult still accumulates the full audio (callback is
an addition, not a replacement).

Supertonic's vector estimator is non-autoregressive (5-step CFM denoise
over the full duration-predicted latent), so the chatterbox token-level
streaming pattern doesn't transfer.  Instead this splits text into
sentence-aligned chunks and runs the full pipeline per chunk:

- New src/supertonic_chunker.{h,cpp}: Unicode-aware splitter.  Sentence-
  end gets a wide implicit search window (target/2..3*target) because
  sentence prosody dominates audio quality on this model — chunks cut
  mid-clause receive an artificial trailing period from preprocess and
  the model emits muddled / dropped words in response.  Clause and
  whitespace fallbacks use the user-supplied tolerance.

- Multilingual punctuation tables: ASCII .?! plus CJK fullwidth, double
  exclamation/question, Devanagari danda, Urdu full stop for sentences;
  ASCII / fullwidth / Arabic comma, semicolon, colon and closing
  brackets for clauses.  Whitespace fallback handles CJK / Thai / Lao /
  Khmer where punctuation may be absent.

- Engine streaming path runs the full pipeline per chunk with opts.seed
  (no per-chunk perturbation; different chunks have different latent_len
  so noise tensors differ even with the same seed, and an earlier
  per-chunk seed bump occasionally landed chunks on nearby seeds where
  the model produces phantom-phoneme tail artifacts).

- 10 ms raised-cosine anti-click fade on inter-chunk seams only.  First
  chunk start and last chunk end stay untouched so streamed output is
  acoustically equivalent to batch at the endpoints.

- CLI gains --stream-chunk-tokens / --stream-first-chunk-tokens /
  --stream-chunk-tolerance-pct flags.  --out - streams raw s16le PCM on
  stdout for incremental playback (pipe into ffplay / sox -d).
  SUPERTONIC_LOG_CHUNKS=1 logs chunker boundaries;
  SUPERTONIC_DUMP_CHUNK_WAVS_PREFIX=path- dumps per-chunk WAVs for
  debugging.
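
The seam-only fade can be sketched as below. This is an illustration under assumptions: `fade_chunk_seams` is a hypothetical name, float PCM is assumed, and the sketch fades both sides of a seam (tail of one chunk, head of the next), which the description above does not spell out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Applies a raised-cosine ramp at inter-chunk seams only.
// The first chunk keeps its start untouched; the last chunk keeps its end.
void fade_chunk_seams(std::vector<float> & pcm, int sample_rate,
                      bool is_first, bool is_last, float fade_ms = 10.0f) {
    const double kPi = 3.14159265358979323846;
    const std::size_t want = static_cast<std::size_t>(sample_rate * fade_ms / 1000.0f);
    // Never let the two ramps overlap on very short chunks.
    const std::size_t n = std::min(want, pcm.size() / 2);
    if (n == 0) return;

    for (std::size_t i = 0; i < n; ++i) {
        // Gain rises smoothly from 0 at i = 0 towards 1 as i approaches n.
        const float g = static_cast<float>(
            0.5 * (1.0 - std::cos(kPi * static_cast<double>(i) / static_cast<double>(n))));
        if (!is_first) pcm[i] *= g;                   // fade-in at the chunk head
        if (!is_last)  pcm[pcm.size() - 1 - i] *= g;  // fade-out at the chunk tail
    }
}
```

At 44.1 kHz the 10 ms default corresponds to 441 samples per ramp.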

Validated end-to-end at ~35x realtime on M2 Metal: streamed output is
acoustically equivalent to batch on the same seed; first audio drops in
~1 s for an 18 s utterance instead of waiting the full ~4-5 s for batch
synth to complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two empirically-driven additions on top of the sentence-aligned
chunker:

1. is_continuation flag through supertonic_preprocess_text +
   supertonic_text_to_ids.  When the engine produces a mid-clause /
   mid-word chunk during streaming, the preprocess skips its
   auto-appended terminal period.  Without the flag the model spoke
   stub chunks as complete sentences with falling intonation and
   trailing-phoneme artifacts (the original "park.K" tail bug).  The
   engine detects per-chunk whether the chunk ends on a natural
   sentence terminator (ASCII .?! plus CJK / Devanagari / Urdu
   equivalents) and passes through the flag accordingly.

2. stream_min_chunk_tokens (default 30) on EngineOptions.  Below ~30
   tokens the model emits dropped / muddled phonemes on stub input
   regardless of the continuation flag (verified on multiple seeds
   and texts — short text is a model-level failure mode, not a
   preprocess one).  The chunker treats min_chunk_tokens as a hard
   floor: effective target = max(target, min), the sentence/clause/
   whitespace search lower bound is clamped to start + min, and any
   trailing chunk below the floor is merged into its predecessor.

   The min floor is the practical ceiling on what Option A streaming
   can achieve.  True seam-free streaming inside one utterance would
   require model retraining (causal attention, per-token duration,
   mel-frame cache continuity — the bits chatterbox has by design but
   supertonic was not trained for).  Documenting that as the trade-off
   honestly rather than papering over it.

Behavior:

  - Multi-sentence input → sentence-aligned chunks (the v1 behavior).
    Acoustically equivalent to batch on the same seed.
  - Long single-sentence input → multi-chunk output at the min floor,
    each chunk passed to the model without an artificial terminal
    period.  Inter-chunk pauses and rate shifts are inherent to
    per-chunk synthesis on a non-streaming-trained model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reshold

Tail-merge was using min_chunk_tokens (30) as its threshold, which on
languages denser than English (CJK in particular) merged the last
chunk into the previous one even when that last chunk was a complete
sentence.  Concrete: Korean "공원에서 산책하기 좋은 날이다." is 18
code points — below the 30-cp floor — so the merger folded it into the
previous chunk, which contained TWO sentences, producing a single
172-byte chunk for the whole utterance and zero streaming benefit.

Switch to chatterbox_engine.cpp:608's heuristic: tail_thresh =
max(6, target_tokens/3) (16 for target=50).  Genuinely tiny stubs
(<16 cps) still merge; real sentence chunks stay independent.  The
min_chunk_tokens floor governs what the chunker proactively *aims for*
during iteration, not what it does with whatever's left after the
last natural boundary.
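
The tail-merge rule above can be sketched as follows. This is a minimal illustration, not the chunker's actual code: only the `max(6, target_tokens/3)` formula comes from the commit message, while the helper names, the code-point counter and the single-space join are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Count UTF-8 code points by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes well-formed input.
static std::size_t count_code_points(const std::string & s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Threshold quoted in the commit message: max(6, target_tokens / 3).
static std::size_t tail_merge_threshold(std::size_t target_tokens) {
    return std::max<std::size_t>(6, target_tokens / 3);
}

// Hypothetical helper: fold the last chunk into its predecessor only
// when it is shorter than the tail threshold. Joining with a single
// space is an assumption of this sketch.
static void merge_tiny_tail(std::vector<std::string> & chunks,
                            std::size_t target_tokens) {
    if (chunks.size() < 2) return;
    if (count_code_points(chunks.back()) < tail_merge_threshold(target_tokens)) {
        std::string tail = std::move(chunks.back());
        chunks.pop_back();
        chunks.back() += " " + tail;
    }
}
```

With `target_tokens = 50` the threshold is 16, so a complete short sentence at or above 16 code points stays an independent chunk, while a stub of a few characters is still merged.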

Verified: Korean 3-sentence text now chunks into 2 (first chunk spans
2 sentences due to first-sentence-below-min-floor, last sentence
stays separate at 18 cps).  English 3-sentence test stays at 3
sentence-aligned chunks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 3x sentence-search window slurped runaway-sentence tails as one
huge "sentence-aligned" chunk: a 245-char single sentence with the
final period 109 chars past start was found by the wide window, so
chunker took the whole remainder as chunk[3] instead of falling
through to whitespace and producing multiple sub-sentence chunks.

2x is still wide enough to catch a long-but-reasonable first sentence
in multi-sentence input (covers up to ~90 chars at target=50, ample
for typical English / French / Portuguese sentences) but narrow
enough that genuinely runaway sentences (>2x target with no internal
periods) fall through to whitespace and stream.

Empirical: same 245-char English run-on now produces 5 evenly-sized
chunks (30, 52, 54, 52, 56) instead of 4 with the tail-blob
(30, 52, 54, 109).  Multi-sentence test unchanged (still 3 sentence-
aligned chunks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ered stdout)

Two review-comment fixes from PR #20:

1. De-duplicated the sentence-terminator code-point table between
   supertonic_chunker.cpp's is_sentence_end_cp() and the engine's
   chunk_ends_with_sentence_term().  is_sentence_end_cp() is now
   declared in supertonic_chunker.h and called from the engine's
   per-chunk continuation detector — the engine still owns the
   UTF-8 trim/decode logic, but the predicate (and its multilingual
   table) live in one place.  Adding Ethiopic ።, Tibetan ། or any
   other terminator now needs one edit, not two.

2. stream_emit_pcm_stdout was doing a per-sample
   fwrite(&v, 2, 1, stdout) loop — ~44k-132k syscall-adjacent calls
   per chunk.  Build the chunk's int16 buffer once and write it in
   a single fwrite; flush after.  No semantic change to the bytes
   on stdout; just throughput.
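
As a rough sketch of the second fix, the chunk can be converted to an int16 buffer once and handed to a single `fwrite`. The function names and the `* 32767` scaling with clamping are assumptions for illustration; the actual `stream_emit_pcm_stdout` may convert samples differently.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Convert float samples to int16 PCM in one pass. Clamping to [-1, 1]
// and scaling by 32767 are assumptions of this sketch.
static std::vector<int16_t> to_pcm16(const std::vector<float> & samples) {
    std::vector<int16_t> pcm(samples.size());
    for (std::size_t i = 0; i < samples.size(); ++i) {
        float v = samples[i];
        if (v > 1.0f)  v = 1.0f;
        if (v < -1.0f) v = -1.0f;
        pcm[i] = static_cast<int16_t>(std::lrint(v * 32767.0f));
    }
    return pcm;
}

// Hypothetical emitter: one fwrite per chunk followed by a flush,
// instead of one fwrite per sample. Returns the number of bytes written.
static std::size_t emit_pcm(const std::vector<float> & samples, std::FILE * out) {
    const std::vector<int16_t> pcm = to_pcm16(samples);
    std::size_t written = 0;
    if (!pcm.empty()) {
        written = std::fwrite(pcm.data(), sizeof(int16_t), pcm.size(), out);
    }
    std::fflush(out);
    return written * sizeof(int16_t);
}
```

A caller would pass `stdout` as `out`; on success the byte count equals `samples.size() * 2`, which is the invariant the commit verifies.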

Verified: multi-sentence chunker still produces 3 sentence-aligned
chunks (unchanged); stdout streaming byte count still equals
samples * 2 exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…UPERTONIC_LOG_CHUNKS

Adds one line per chunk to the existing SUPERTONIC_LOG_CHUNKS env-var
trace, showing the is_continuation flag the engine resolved before
handing the chunk to run_single_chunk:

  chunk[0] (44 bytes): The quick brown fox jumps over the lazy dog.
  chunk[0] is_continuation=0
  chunk[1] (64 bytes): Then she said hello to the world, ...
  chunk[1] is_continuation=0

Useful for validating that the engine's per-chunk continuation
detector and the chunker's boundary search agree on what counts as
a sentence terminator across UTF-8 — they share the same
detail::is_sentence_end_cp table, but the engine reaches it via a
UTF-8-decode of the final code point in the chunk string, so the
two paths can in principle disagree on a malformed input.  The log
makes that observable in one place.
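
To make the engine-side path concrete, here is a minimal sketch of "decode the final code point, then ask the shared predicate". The terminator table is an illustrative subset, not the contents of `supertonic_chunker.h`, and both function names are hypothetical. The decoder rejects truncated or stray-continuation tails but does not check for overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative subset of a sentence-terminator table (assumption).
static bool is_sentence_end_cp_sketch(uint32_t cp) {
    switch (cp) {
        case '.': case '!': case '?':
        case 0x3002:  // ideographic full stop
        case 0xFF01:  // fullwidth exclamation mark
        case 0xFF1F:  // fullwidth question mark
        case 0x1362:  // Ethiopic full stop
        case 0x0F0D:  // Tibetan shad
            return true;
        default:
            return false;
    }
}

// Trim trailing ASCII whitespace, decode the last UTF-8 code point and
// test it. Returns false for empty or malformed tails.
static bool ends_with_sentence_term_sketch(std::string s) {
    while (!s.empty() && (s.back() == ' ' || s.back() == '\n' ||
                          s.back() == '\t' || s.back() == '\r')) {
        s.pop_back();
    }
    if (s.empty()) return false;

    // Walk back over at most three continuation bytes to the lead byte.
    std::size_t i = s.size() - 1;
    int cont = 0;
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80 && cont < 3) {
        --i;
        ++cont;
    }
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    uint32_t cp = 0;
    int need = 0;
    if (lead < 0x80)                { need = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { need = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { need = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { need = 3; cp = lead & 0x07; }
    else return false;               // stray continuation or invalid lead byte
    if (need != cont) return false;  // truncated or over-long tail
    for (std::size_t k = i + 1; k < s.size(); ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[k]) & 0x3F);
    }
    return is_sentence_end_cp_sketch(cp);
}
```

A malformed tail such as a lone continuation byte makes this path return false even if the chunker's boundary search accepted the cut, which is the kind of disagreement the log line is meant to expose.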

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tts-cpp: supertonic Engine streaming via multilingual chunker + callback


Upstream ggml-org/whisper.cpp PR ggml-org#3677 added the streaming VAD entry
points but shipped no test. Lock the public contract on the tetherto
fork so regressions surface immediately:

  - whisper_vad_detect_speech idempotent (reset is implicit)
  - whisper_vad_reset_state restores LSTM state exactly
  - detect_speech == reset_state + detect_speech_no_reset
  - detect_speech_no_reset on contiguous halves == single-shot
    detect_speech (state carries across no-reset call boundary)

Splits at a 512-sample boundary (Silero v6.2.0 window size) so no
mid-stream zero padding is introduced. Uses the bundled silero VAD
model and samples/jfk.wav; no whisper transcribe model needed.
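
The contract listed above can be illustrated with a toy stateful detector. This is not the whisper.cpp API: `ToyVad` is a hypothetical stand-in whose single float of state plays the role of the LSTM state, and it works per sample rather than on 512-sample windows.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a stateful VAD: a one-pole smoother over |x| whose
// state carries across calls unless it is reset.
struct ToyVad {
    float state = 0.0f;

    void reset_state() { state = 0.0f; }

    // Continues from whatever state the previous call left behind.
    std::vector<float> detect_no_reset(const std::vector<float> & x) {
        std::vector<float> probs;
        probs.reserve(x.size());
        for (float v : x) {
            const float mag = v < 0.0f ? -v : v;
            state = 0.9f * state + 0.1f * mag;
            probs.push_back(state);
        }
        return probs;
    }

    // Implicit reset, so repeated calls on the same input agree.
    std::vector<float> detect(const std::vector<float> & x) {
        reset_state();
        return detect_no_reset(x);
    }
};
```

In this model, calling `detect` twice gives identical output, and `reset_state` followed by `detect_no_reset` on two contiguous halves reproduces the single-shot result, mirroring the four invariants the test locks down.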

QVAC-18991

Co-authored-by: Cursor <cursoragent@cursor.com>
Follow-up to 8f11c2a (the AOSC port itself). Locks the v2.1 streaming
behaviour into ctest and surfaces it to the live-mic example user, so
neither piece silently regresses.

Added regression suite:

- test/test_sortformer_aosc_speakers.cpp asserts three invariants
  against a reference RTTM: (a) every ref speaker has at least one hyp
  frame, (b) speakers that re-enter after a gap land in the SAME
  hyp_<id> they were first assigned to (the AOSC contract), (c)
  frame-level DER under the optimal hyp->ref permutation is below
  --der-max (default 30 %). Brute-force permutation, 10 ms frame grid,
  std-lib only.

- test/samples/abcba.{wav,rttm} (160.6 s, 3 speakers, A->B->C->B->A,
  A returns after a 97 s gap) and test/samples/abcdba.{wav,rttm}
  (191.2 s, 4 speakers, A->B->C->D->B->A, A returns after a 128 s gap,
  B after a 66 s gap). Generated from ElevenLabs TTS so the audio is
  redistributable; ground-truth RTTMs auto-built from clip durations.

- CMakeLists.txt registers two ctest entries
  test-sortformer-aosc-speakers-{abcba,abcdba} sharing one binary,
  REQUIRES-gated on the v2.1 GGUF so a fresh checkout without models/
  shows them as DISABLED rather than failing.

Measured on q8_0 v2.1, M-series CPU backend: abcba DER 27.29 % (3
slots tracked, A and B re-bind correctly); abcdba DER 22.22 % (all 4
slots tracked, A and B re-bind). v1 streaming on the same fixtures
collapses to 2 slots (abcdba 66.28 %), confirming the test
distinguishes AOSC from non-AOSC.

Public API:

- SortformerStreamSession::aosc_active() — small getter returning the
  engine's internal cache_active flag. Lets callers tell v2.1+AOSC
  from v1 / v2.x-without-cache in CLI banners and logs without
  duplicating the v2.1 detection logic.

live-mic example:

- Banner now branches on aosc_active(): on v2.1 prints
  "(v2.1 diarization, AOSC)  chunk=... spkcache_len=... fifo_len=... lc=... rc=...";
  on v1 keeps the existing "(v1 diarization)  chunk=... history=..." line
  bit-identical. --history-ms help text clarifies the flag is v1-only
  and that v2.1 takes the AOSC path automatically. No new CLI flags.

Docs:

- README.md: new model-table row for diar_streaming_sortformer_4spk-v2.1
  (v2 row left untouched); API table's diarize_start description
  distinguishes v1 sliding-history vs v2.1 AOSC; "Shipped / Not in-repo"
  status block moves Sortformer spkcache streaming to "Shipped".

- PROGRESS.md: new Phase 17 closing the §11.11.2 reservation. Covers
  the algorithm port (8 ported NeMo helpers), encoder context
  windowing, bypass_pre_encode forward, validation methodology, the
  measured DER table from above, files touched, and remaining
  follow-ups (engine n_finals end-of-session glitch; downstream
  qvac-addon plumbing).

v1 path is bit-identical to pre-commit; all existing tests stay green.

QVAC-18625
@ogad-tether ogad-tether requested review from a team as code owners May 22, 2026 14:07
ogad-tether and others added 3 commits May 22, 2026 15:59
…t" inputs

Three pre-existing bit-exactness regressions in the QVAC-18605 cache work
(F8 style-residual cached-graph parity, F18 text-encoder convnext-front
graph cache, F19 vector-estimator front-block cache) shared one root
cause: leaf input tensors uploaded ONLY at build time (because their
contents depend solely on cache-key fields like L / text_len / θ) had
their backend buffers released by ggml-alloc's free pass once their last
consumer in the graph ran. On the second compute pass through the same
cache, intermediates aliased into the freed offsets and silently
overwrote the "stable" upload — every downstream tensor went stale.

The freed-leaf-input behaviour is documented inside ggml-alloc.c:
`ggml_gallocr_free_node` exits early only when the tensor has
`GGML_TENSOR_FLAG_OUTPUT` — the input flag does not extend that
guarantee. Marking each affected tensor as INPUT and OUTPUT keeps its
buffer alive across compute passes, so the one-shot upload at build
remains valid for the cache's full lifetime.

Affected tensors:
- supertonic_text_encoder.cpp:build_relpos_cache — `masks[9]` relpos
  attention masks (9 × L×L floats, encode integer position deltas
  −4..+4).
- supertonic_vector_estimator.cpp:build_group_graph_cache — RoPE
  cos/sin tables (q_cos_in / q_sin_in / k_cos_in / k_sin_in).
- supertonic_vector_estimator.cpp:supertonic_vector_trace_proj_ggml
  front_cache RoPE cos/sin tables (same shape, separate cache).
- supertonic_vector_estimator.cpp:build_res_style_qkv_cache —
  `style_v_in` / `kctx_in`. Both use the F4 pointer-compare upload-
  skip; without OUTPUT the skip preserved a host pointer to a
  backend buffer that gallocr had already released.

Test fallout on tts-cpp/test (with bundled qvac-ext-ggml@speech 60a172e,
supertonic2.gguf + supertonic-ref-quick fixture):

  before  test-supertonic-audit3-caches  6/8 checks pass  (F18, F19 fail)
  after   test-supertonic-audit3-caches  8/8 checks pass

  before  test-supertonic-graph-rewrites  4/5 checks pass  (F8 fails)
  after   test-supertonic-graph-rewrites  5/5 checks pass

  fixture suite:  9/16 → 15/16  (only `test-supertonic-pipeline` still
  fails — that's a separate ONNX-vs-GGUF reference drift, not a cache
  bug; the per-stage tests that take ref inputs directly all pass).

  unit suite:  25/25 (unchanged).

Verified on the supertonic_optimizations branch pre-merge (`184c6410`)
that the failures are identical in magnitude — this is a pre-existing
bug in QVAC-18605 rounds 3+ cache work, not a regression from the
master merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…peline test mask

Same root cause as the previous F8/F18/F19 fix: leaf input tensors that
the round-10 upload-skip tracker treats as "stable across denoise steps
within one synth" (uploaded only on `current_step == 0`, skipped on
steps 1..N-1) need INPUT + OUTPUT flags so ggml-alloc's free pass doesn't
release the buffer after step 0 and silently corrupt the skipped uploads
on subsequent steps.

Two more affected tensors found by tracing the pipeline parity test's
per-step divergence:

- supertonic_vector_estimator.cpp:supertonic_vector_trace_proj_ggml
  front_cache.text_in_t  (vector-estimator front-block text input)

- supertonic_vector_estimator.cpp:build_group_graph_cache
  cache.text_in  (vector-estimator group 1/2/3 text input)

Pipeline test (`test-supertonic-pipeline`) per-step max_abs_err:
  before:  step0 1.4e-05, step1 8.5e-01, step2 1.7e+00, … final 3.28e-01
  after:   step0 1.4e-05, step1 3.9e-05, step2 6.8e-05, … final 1.11e-04
The step-by-step error is now pure floating-point round-off
accumulation (a few 1e-5 per step); the final 1.11e-04 sits about 9x
under the test's 1e-3 threshold.

Also: align the pipeline test's input prep with the
`dump-supertonic-reference.py` harness — the Python script feeds the
ONNX vector_step a pre-masked input (`xt = noise * latent_mask`) and
the vocoder a pre-masked latent (`vocoder({"latent": xt * latent_mask})`).
For the supertonic-ref-quick fixture the mask is all 1.0 so this is a
no-op today, but a fixture with padded tail latents would otherwise
diverge from the reference at every padded position.

Fixture suite on tts-cpp/build (bundled qvac-ext-ggml@speech 60a172e,
supertonic2.gguf + supertonic-ref-quick):

  before:  15/16 fixture tests passing (test-supertonic-pipeline FAIL)
  after:   16/16 fixture tests passing

Unit suite unchanged (25/25).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing for ggml_reshape_2d

CodeQL cpp/integer-multiplication-cast-to-long flagged
`n_heads * head_dim` (both `int`, multiplied as `int` and then implicitly
converted to `int64_t` for `ggml_reshape_2d`'s shape argument). For
Supertonic's vector-estimator the values are 4 × 64 = 256 so there is
no actual overflow risk today, but a tts-cpp callsite that ever uses
larger n_heads / head_dim would silently truncate. Cast first to make
the multiplication 64-bit. No behaviour change for any current caller.
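
A minimal sketch of the cast-first pattern, using a hypothetical helper name rather than the actual call site. Widening one operand before the multiply makes the product itself 64-bit; with two plain `int` operands the product is computed in `int`, and signed overflow there is undefined behaviour in C++.

```cpp
#include <cstdint>

// Widen one operand first so the multiplication is carried out in
// 64 bits. The helper name is illustrative only.
static int64_t reshape_rows(int n_heads, int head_dim) {
    return static_cast<int64_t>(n_heads) * head_dim;
}
```

For the current 4 × 64 case this returns 256 either way; the difference only shows once the product no longer fits in 32 bits.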

Alert was not introduced by this PR (line dates back to the original
tts-cpp add `ef840d5c3`) but surfaces on PR #31 because the surrounding
file was touched. Fixing here keeps the PR's CodeQL gate green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@GustavoA1604 GustavoA1604 left a comment


  1. backend_selection.cpp — missing #include <stdexcept>

Throws std::runtime_error in 4 places. It compiles on macOS libc++ via a transitive include but fails on libstdc++ (Linux / MSYS2-GCC). One-line fix:

 #include <mutex>
+#include <stdexcept>
 #include <string>
  2. Android GGML_BACKEND_DL=ON must keep the supertonic Vulkan optimisations — please don't ship them gated off

The PR currently lists this as a known follow-up, but Mali / non-Adreno-700+ Snapdragon / Exynos Xclipse are exactly the targets where the round-10 pinned-host-buffer + round-12 F16-KV bandwidth wins matter most; silently turning them off on DL undoes the QVAC-18605 business case on mobile.

Every direct ggml_backend_vk_* call in this PR has a public registry-API equivalent today at the 60a172e4 ggml pin:

  • ggml_backend_is_vk(backend) → strcmp(ggml_backend_reg_name(ggml_backend_dev_backend_reg(ggml_backend_get_device(backend))), "Vulkan") == 0
  • ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type(ggml_backend_get_device(backend))
  • ggml_backend_vk_get_device_description(...) → ggml_backend_dev_description(ggml_backend_get_device(backend))
  • F16-KV / Q8_0-KV / BF16-KV FA capability predicates → build a probe tensor and call ggml_backend_dev_supports_op(dev, op)

Please migrate the four call-site classes in this PR, drop the NOT GGML_BACKEND_DL clause from the GGML_USE_VULKAN define in tts-cpp/CMakeLists.txt:180-181, and add a Snapdragon DL smoke test confirming the round-10 / 12 logs fire on the dynamic-loader build. init_gpu_backend already proves the registry-only pattern works — extending it the rest of the way is mechanical and keeps tts-cpp's source under the same "no direct backend symbols" invariant parakeet-cpp ships today.

#1, #2)

Addresses PR #31 review feedback from @GustavoA1604:

  1. backend_selection.cpp — missing `#include <stdexcept>`.  Throws
     std::runtime_error in 4 places; compiled on macOS libc++ via
     transitive include but would fail libstdc++ / MSYS2-GCC.

  2. Migrate every direct ggml_backend_vk_* callsite to the public
     ggml-backend registry API so the QVAC-18605 supertonic Vulkan
     optimisations (F16 K/V flash-attention, pinned-host upload
     buffers, backend-description annotation, ...) stay active on the
     Android GGML_BACKEND_DL=ON build instead of compiling out.

Migrations:

  - ggml_backend_is_vk(b)
      → tts_cpp::detail::backend_is_vulkan(b) — strcmp against
        ggml_backend_reg_name(ggml_backend_dev_backend_reg(
        ggml_backend_get_device(b))).  Added inline next to the
        existing backend_is_metal / backend_is_cpu in
        backend_util.h (mirrors parakeet-cpp's helper module).

  - ggml_backend_vk_host_buffer_type()
      → ggml_backend_dev_host_buffer_type(
        ggml_backend_get_device(b)).  Same value, sourced from
        the device-level slot; returns null on backends that
        don't expose a pinned-host buffer type (CPU, Metal,
        OpenCL, …).  Affects:
          * backend_supports_pinned_host_buffer_uncached
          * try_alloc_inputs_in_pinned_host_buffer

  - ggml_backend_vk_get_device_description(idx, buf, len)
      → ggml_backend_dev_description(
        ggml_backend_get_device(b)).  Same string, no host buf
        round-trip.  Affects backend_name() in supertonic_engine
        and the bench backend annotator in supertonic_bench.

Drop:

  - The `#include "ggml-vulkan.h"` includes in supertonic_engine.cpp
    and supertonic_bench.cpp (no longer needed; registry API lives
    in ggml-backend.h).
  - Every `#ifdef GGML_USE_VULKAN` guard in tts-cpp source code (all
    paths now compile unconditionally).
  - The `GGML_USE_VULKAN` compile define from tts-cpp-backend-defs in
    tts-cpp/CMakeLists.txt — no code references it any more.  tts-cpp
    now mirrors parakeet-cpp's "no direct backend symbols" invariant.

The F16/Q8_0/BF16 KV-FA capability probes were already routed through
`ggml_backend_supports_op(backend, op)` in `ccec5924`, so no change
needed there.

Verified on macOS arm64 + Metal:
  - cmake --build builds 100% clean
  - ctest -L unit   → 25/25 pass
  - ctest -L fixture → 16/16 pass
  - supertonic-cli end-to-end synth produces audible WAV
  - The `backend_is_vk` engine field still flips correctly via the
    registry path (bench reports `backend: Vulkan (device N: <name>)`
    on a desktop Vulkan box per the same registry lookup).

Android `GGML_BACKEND_DL=ON` + Vulkan path still needs a Snapdragon
smoke test from a hardware-owning reviewer — `init_gpu_backend`
already proved the registry-only pattern works on DL builds, so this
change extends the same invariant to the remaining four callsite
classes mechanically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether

Author

@GustavoA1604 thanks for the review — both items addressed in 00eb3f36:

1. #include <stdexcept> in backend_selection.cpp — added alongside the existing <mutex> / <string> imports.

2. Direct ggml_backend_vk_* calls migrated to the registry API. All four callsite classes you flagged are now registry-routed; GGML_USE_VULKAN is gone from tts-cpp/CMakeLists.txt entirely (no source references it any more). Diff: tts-cpp/CMakeLists.txt | 34 +++++++--------------.

Concrete swaps:

  • ggml_backend_is_vk(b) → new tts_cpp::detail::backend_is_vulkan(b) in backend_util.h, parallel to the existing backend_is_metal (parakeet-cpp pattern).
  • ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type(ggml_backend_get_device(b)) (backend_supports_pinned_host_buffer_uncached + try_alloc_inputs_in_pinned_host_buffer).
  • ggml_backend_vk_get_device_description(idx, buf, len) → ggml_backend_dev_description(ggml_backend_get_device(b)) (engine + bench backend annotators).
  • F16/Q8_0/BF16 KV-FA capability probes were already on ggml_backend_supports_op(backend, op) (added in ccec5924), so no change there.

#include "ggml-vulkan.h" is gone from both supertonic_engine.cpp and supertonic_bench.cpp. Every #ifdef GGML_USE_VULKAN guard in tts-cpp source is removed — all paths compile unconditionally now.

Local verification on macOS arm64 + Metal:

  • cmake --build clean
  • ctest -L unit 25/25
  • ctest -L fixture 16/16 (incl. test-supertonic-pipeline end-to-end vs ONNX reference)
  • supertonic-cli end-to-end synth produces an audible 3.0s WAV

Android GGML_BACKEND_DL=ON smoke test on Snapdragon is still flagged as a TODO in the PR body — I don't have hardware here, but the registry-only invariant matches what init_gpu_backend already proved works on DL builds.

Heads-up: branch was DIRTY against master (the v1.8.5 sync + EOU work merged in while this PR was open). Resolving that next, then will re-request review.

Pulls in the master-side activity since PR #31 opened:

  - QVAC-19386: v1.8.5 + sync vendored whisper.cpp + ggml to ggml-org
    upstream (#33).  Bumps whisper version, refreshes the in-tree ggml,
    re-adds tts-cpp from a fresh snapshot of chatterbox.cpp's port.
  - QVAC-19270: parakeet EOU streaming mid-stream-boundary handling.
  - QVAC-19213: Adreno Vulkan fixes (mul_mat_vec subgroup->shmem,
    get_max_size cap scoped to Qualcomm/Adreno).

Conflict resolution (all 24 conflicts were `add/add` because the
merge-base — `4bf733672` `talk-llama : sync llama.cpp` — predates QVAC
adding `tts-cpp/` and `parakeet-cpp/`):

  - tts-cpp/* → kept HEAD (`--ours`).  This branch is the canonical
    home of the QVAC-18605 supertonic Vulkan optimisation rounds 1-13
    + the registry-API migration + the cache-state-leak fixes.  The
    chatterbox.cpp-mirrored fixes that master's `fce9d211 Add tts-cpp
    files` brought in (N1-N7 docstrings, ggml-quants.h fix,
    backend_device() public API) are already present in HEAD's
    starting point and surface as no-op diffs.

  - parakeet-cpp/* → took master (`--theirs`).  Master is the
    canonical home of QVAC-19270 EOU streaming work; this branch has
    no parakeet-cpp changes to defend.

  - .github/CODEOWNERS → took master (team rename to
    `qvac-internal-dev` / `qvac-internal-merge`).

Verified on macOS arm64 + Metal:
  - cmake --build completes cleanly
  - ctest -L unit   → 25/25 pass
  - ctest -L fixture → 16/16 pass (incl. test-supertonic-pipeline
    end-to-end vs ONNX reference, max_abs_err = 1.1e-04 ≪ 1e-3
    threshold)

The branch is now in sync with origin/master at `eabcf6da`; the
mergeStateStatus on PR #31 should flip from DIRTY back to UNSTABLE
(then green, once the pre-existing master CI failures are resolved too).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether ogad-tether requested a review from GustavoA1604 June 1, 2026 14:34
pratiknarola-t
pratiknarola-t previously approved these changes Jun 4, 2026
Comment thread tts-cpp/src/backend_selection.cpp Outdated
Comment thread tts-cpp/src/backend_selection.cpp
…view)

Addresses PR #31 review comments from @freddy311082:

1. (#3355973146) `vulkan_device != 0` aborted `init_gpu_backend` on a
   machine with no Vulkan adapter.  `pick_vulkan_device_index` throws
   on an empty device list, so a host wiring `vulkan_device = -1` as a
   generic "auto-pick GPU" would crash on Metal-only macOS or
   CUDA-only Linux instead of falling through the tier policy to the
   available backend.

   Guard the Vulkan-pick block on `!vulkan_devs.empty()`.  Also log a
   one-shot warn when the override is requested but no Vulkan adapter
   is visible (so the silent fall-through is debuggable).

2. (#3355995666) `vulkan_device > 0` was silently shadowed by the
   OpenCL-Adreno-700+ tier preference.  On a Snapdragon device that
   exposes both backends, the chosen Vulkan adapter is moved to the
   front of `other_gpu` but the dispatch tries `opencl_adreno_700plus`
   FIRST, so an explicit `--vulkan-device N` would silently end up on
   OpenCL anyway.  Operators explicitly pinning a Vulkan adapter
   almost certainly want Vulkan.

   When `vulkan_device > 0`, try `other_gpu` BEFORE
   `opencl_adreno_700plus`.  `vulkan_device == -1` (auto-pick across
   Vulkan adapters) leaves the tier policy unchanged — the user asked
   for "best Vulkan device", not "must be Vulkan over OpenCL".
   `vulkan_device == 0` (default) is unchanged.
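
The resulting dispatch order can be summarised in a small sketch. Only the two tier names and the three `vulkan_device` cases come from the commit message; `TierPlan`, `plan_tiers` and the string representation of tiers are hypothetical, and any further tiers the real policy has are omitted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of the decision described above.
struct TierPlan {
    std::vector<std::string> order;  // tiers in the order they are tried
    bool warn_no_vulkan = false;     // override requested, no Vulkan adapter visible
};

static TierPlan plan_tiers(int vulkan_device, std::size_t n_vulkan_devs) {
    TierPlan plan;
    // A non-default override on a host without Vulkan adapters falls
    // through to the normal tier policy and only triggers a warning.
    plan.warn_no_vulkan = (vulkan_device != 0) && (n_vulkan_devs == 0);

    if (vulkan_device > 0) {
        // Explicit pin: prefer the chosen adapter over OpenCL on Adreno 700+.
        plan.order = {"other_gpu", "opencl_adreno_700plus"};
    } else {
        // Default (0) and auto-pick (-1): tier policy unchanged.
        plan.order = {"opencl_adreno_700plus", "other_gpu"};
    }
    return plan;
}
```

The sketch deliberately never throws on an empty Vulkan device list, matching the guard on `!vulkan_devs.empty()`.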

Verified locally on macOS arm64 + Metal:
  - cmake --build completes cleanly
  - ctest -L unit   → 25/25 pass
  - ctest -L fixture → 16/16 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether

Author

@freddy311082 both addressed in `6ac6f073`:

#3355973146 — guard the Vulkan-pick block on `!vulkan_devs.empty()` so a `vulkan_device != 0` config falls through to the tier policy on no-Vulkan hosts (Metal-only Mac / CUDA-only Linux / Adreno-OpenCL-only Snapdragon) instead of aborting via `pick_vulkan_device_index`'s throw. Added a verbose-mode warn line so the silent fall-through stays debuggable.

#3355995666 — distinguish `vulkan_device > 0` (explicit operator pin) from `vulkan_device == -1` (auto-pick). On explicit pin, `other_gpu` is tried BEFORE `opencl_adreno_700plus` so Snapdragon devices honour the override. On `-1` auto-pick the tier policy is unchanged — the operator asked for "best Vulkan device", not "Vulkan over OpenCL" — so Adreno 700+ still wins where it should.

Both review threads resolved. ctest -L unit + -L fixture still 25/25 + 16/16 on macOS arm64 + Metal.

… QVAC-18605 rounds 1-13)

Reconciles HEAD's supertonic Vulkan/Metal optimisations (F1-F23 caches,
pre-baked weights, pinned-host scratchpad, front_cache architecture)
with master's QVAC-19254 GPU-scheduler refactor (model.sched /
model.cpu_backend, supertonic_sched_alloc / supertonic_sched_compute,
direct vs sched runtime routing) and QVAC-19213 Adreno regex include.

Conflict resolution highlights:
- parakeet_ctc.cpp / backend_selection.cpp: kept master's regex include
  alongside HEAD's stdexcept.
- supertonic_internal.h: kept HEAD's model_prefers_cpu_kernels alongside
  master's sched helpers.
- engine.h: kept HEAD's six EngineOptions fields.
- supertonic_engine.cpp: kept HEAD's chunker include and the extended
  load_supertonic_gguf call.
- supertonic_gguf.cpp: kept HEAD's F1/F2/F6 pre-bakes + capability /
  debug probes; layered master's scheduler init/teardown on top of
  HEAD's extra ctx_w / buffer_w lifetime tracking.
- supertonic_vector_estimator.cpp: combined cache-key checks, per-cache
  gallocr usage (F4/F8/F12/F18/F19/F23) with master's direct/sched
  runtime routing; profile_vector_compute keeps calling
  supertonic_graph_compute directly because the per-cache graphs are
  bound to gallocr storage, not the model scheduler.
- supertonic_vocoder.cpp: kept HEAD's F2/F3 latent-only upload (BN
  pre-baked into model tensors); used supertonic_sched_compute for the
  trace-mode pairing required by QVAC-19254.

Validation: all 38 supertonic ctest fixtures + audit3 caches pass
(test-supertonic-vector, test-supertonic-vector-trace,
test-supertonic-pipeline, F18/F19 bit-exact); mtl-synth tests remain
gated on multilingual fixtures unavailable in this environment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ishanvohra2
ishanvohra2 previously approved these changes Jun 5, 2026
…rofiling

Follow-up to the master sync (077bbcb).  The merge accidentally
created two issues in the per-cache run helpers
(`run_text_attention_cache`, `_gpu`, `run_group_graph_cache`,
`run_res_style_qkv_cache`, `run_tail_graph_cache`):

  1. On the `direct=true` hot path, the compute call became a raw
     `supertonic_graph_compute(...)` — silently dropping the QVAC-18605
     `profile_vector_compute` wrapper, so per-stage CSV / stderr
     timings were no longer emitted on the live backend.

  2. The currently-dead `direct=false` branch called
     `profile_vector_compute(...)` *after* a `supertonic_sched_alloc`,
     but the post-merge `profile_vector_compute` hard-coded
     `supertonic_graph_compute` — i.e. sched-alloc paired with
     graph-compute, which would silently corrupt the output the first
     time a future op forced the routing.

Fix:

  * Parameterise `profile_vector_compute` with `bool use_sched = false`.
    Internal `dispatch()` lambda picks `supertonic_sched_compute` when
    `use_sched`, else `supertonic_graph_compute`.  Both early-return
    fast-path and timed path use the same dispatch, so profiling
    behaviour is identical for the two compute primitives.

  * The five call sites now read:
        if (direct) profile_vector_compute(model, gf, step, island);
        else        profile_vector_compute(model, gf, step, island,
                                           /*use_sched=*/true);
    so the alloc + compute pair is consistent on both branches, and
    profiling is restored on the active path.

  * The two non-direct/sched call sites (`run_style_residual_cache`,
    `front_proj_attn0_qkv` graph in `supertonic_vector_trace_proj_ggml`)
    keep the 4-arg form and rely on the default `use_sched=false` —
    both compute graphs are gallocr-bound, which is the correct path.
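
The shape of the fix can be sketched with stubs. Everything here is a stand-in: `Graph`, the two `*_stub` functions and the global profiling switch are hypothetical, and the real `profile_vector_compute` takes the model, step and island arguments shown in the commit. The point is only that the fast path and the timed path share one `dispatch` lambda, so the compute primitive always matches the requested routing.

```cpp
#include <chrono>
#include <string>
#include <vector>

struct Graph { int id = 0; };                 // stand-in for a compute graph

static std::vector<std::string> g_calls;      // records which primitive ran
static bool   g_profiling = false;            // stand-in for the profiling switch
static double g_last_ms   = 0.0;              // last measured duration

static void graph_compute_stub(const Graph &) { g_calls.push_back("graph"); }
static void sched_compute_stub(const Graph &) { g_calls.push_back("sched"); }

// Defaulted flag keeps existing gallocr-bound call sites unchanged.
static void profile_compute_sketch(const Graph & gf, bool use_sched = false) {
    auto dispatch = [&]() {
        if (use_sched) sched_compute_stub(gf);
        else           graph_compute_stub(gf);
    };

    if (!g_profiling) {  // early-return fast path
        dispatch();
        return;
    }

    const auto t0 = std::chrono::steady_clock::now();
    dispatch();          // timed path uses the same primitive
    const auto t1 = std::chrono::steady_clock::now();
    g_last_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Call sites that allocated through the scheduler pass `/*use_sched=*/true`; the rest rely on the default.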

Validation:

  * All 38 supertonic ctests pass (16 fixture + 22 unit, serial run).
  * Adversarial subagent review SAFE on all 10 invariants.
  * Metal n=10 bench: F1 33.5x realtime / 93.6 ms median,
    M1 34.9x / 91.9 ms.  CPU n=10: 13.7x / 229 ms median.  No
    measurable regression vs pre-fix (the noisy n=3 numbers were
    inside thermal / warmup variance).
  * "The quick brown fox jumps over the lazy dog." synthesises
    cleanly on Metal with both F1 and M1 voices.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@gianni-cor gianni-cor dismissed GustavoA1604’s stale review June 5, 2026 16:30

already approved

@gianni-cor gianni-cor merged commit 128dae4 into master Jun 5, 2026
74 of 138 checks passed

8 participants