Qvac 18607 tts ggml add and optimize open cl for supertonic #16

Merged
GustavoA1604 merged 9 commits into tetherto:master from Zbig9000:QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic
May 12, 2026

Zbig9000 commented May 11, 2026

Summary

Brings the Supertonic TTS stage of tts-cpp to functional + optimized parity with the existing Chatterbox OpenCL story, then iterates on the resulting baseline through six audit-driven optimization rounds. Each round eliminates one or more host↔GPU synchronization points or redundant memory copies from the per-synth hot path, gated by a new CPU-only TDD test that locks in the bit-exact contract for future regressions.

Net steady-state impact (vs. the unoptimized post-bring-up tree, 5-step default denoise schedule):

| Category | Sync points / synth eliminated |
|---|---|
| Host caches replacing per-step read_f32 (F1 / F13 / F17 / F9) | ~80 |
| Pre-bake / move-to-graph of CPU continuations (F2 / F3 / F10) | ~10 |
| Cached graph contexts replacing per-call gallocr churn (F8 / F11 / F14 / F18 / F19) | ~30 |
| In-graph RoPE rotation (F20 + F23) | 40 (+ ~2 ms host CPU) |
| GPU→GPU Q/K/V blit for g1/g2/g3 attn (F24 / 2C-lite) | 90 |
| Host transpose elimination at hot ingestion sites (F12) | ~30 |
| Subtotal | ~280 sync points / synth |

Plus ~16.8 MiB of redundant vocoder memory traffic removed (F7) and weight bandwidth ~halved on the identified hot matmul / pwconv roster (2A F16 weights).

Investigation methodology

  1. Bring-up first. Commit 8d5ebb4 ports the OpenCL backend-dispatch / portable-op / F16 K-V-attention primitives from Chatterbox to Supertonic and wires them through the CLI / bench / engine layer.
  2. Bring-up TDD safety net. Commit ad1ef07 adds the CPU-only unit harnesses that didn't exist for the bring-up primitives (so ctest -L unit is green on a fresh checkout without needing a Supertonic GGUF + reference dump fixture).
  3. End-to-end audit. Performed a full audit of the post-bring-up tree (text-encoder + duration + vector estimator + vocoder) measuring GPU↔host sync points and bandwidth on each per-synth path. Findings catalogued as F1…F24 with HIGH / MEDIUM / LOW impact tags. Audit report + R&D plan live under aiDocs/ (out-of-tree by design).
  4. Land in phases. Each follow-up commit lands a coherent batch of findings with the same pattern:
    • Per-finding rationale reproduced inline as a comment at every load-time hook + rewritten call site (so the rationale stays adjacent to the code it justifies).
    • New CPU-only TDD test gates the optimization before implementation lands.
    • Existing fixture-bound test-supertonic-* parity harnesses continue to enforce end-to-end correctness.

Commits in this PR

9 commits, 27 files changed, +6966 / −620.

| # | Commit | Theme |
|---|---|---|
| 1 | 8d5ebb4 | Bring-up. OpenCL backend dispatch + portable ops + F16 K/V attention. |
| 2 | ad1ef07 | Bring-up safety net. 3 new CPU-only unit harnesses (backend-dispatch, portable-ops, f16-attn-parity) + R&D plan. |
| 3 | e9e76d7 | Audit #1. 9 findings: F1 RoPE θ cache, F2 vocoder BN pre-bake, F3 vocoder unpack in graph, F4 style attention cache re-upload, F5 apply_rope CPU pre-stage, F6 hot-weight transpose, F15/F16 alive-id / generation-id cache hygiene. |
| 4 | 5f457c9 | Audit #2. F13 text-encoder LN weight cache, F14 speech-prompted attention QKV cached, 2A F16 weight materialization, 2D profile CSV emitter. |
| 5 | ccec592 | Audit #3. F17 generic scalar read_f32 cache, F18 text-encoder ConvNeXt graph cached, F19 vector-estimator front-block graph cached. |
| 6 | a0b4e5a | Audit #4 (F20 partial). apply_rope_in_graph helper + universal-op make_rope_cos_sin_tables precompute, with TDD test. Integration deferred to keep the change reviewable. |
| 7 | 5869231 | Audit #5 (F23). Bake RoPE rotation into the 4 Q/K-producing graphs (front block + 3 group caches); 40 host CPU rotations / synth eliminated. |
| 8 | f74e057 | Audit #6. F7 vocoder ConvNeXt block fusion, F12 in-graph time/channel transpose, F24 (2C-lite) GPU→GPU Q/K/V blit for g1/g2/g3 attn. |
| 9 | cf4aa0e | Tidying. Remove the in-tree R&D plan doc (moved to local aiDocs/). |

Code change highlights

tts-cpp/src/supertonic_gguf.cpp (+~700 lines): All host-side caches are populated here at load time — vector_rope_theta (F1), bn_scale_pre / bn_shift_pre (F2), text_encoder_ln_weights (F13), scalar_weight_cache (F17), time_emb_cache (F9). Materializes F16 weight variants for the hot matmul / pwconv roster (2A) with the GGUF-roster-driven name list mirrored from chatterbox.

tts-cpp/src/supertonic_vector_estimator.cpp (+1326 lines, by far the heaviest single file). New graph-cache types (vector_group_graph_cache, vector_text_attention_cache, vector_res_style_qkv_cache, vector_style_residual_graph_cache, vector_tail_graph_cache) replace the historical pattern of building a fresh ggml_context + gallocr per call. Each cache is keyed on its shape parameters + generation_id for safe model swap. Caches also expose GPU tensor pointers (q_rope_gpu, k_rope_gpu, v_gpu) so downstream consumers can ggml_backend_tensor_copy instead of round-tripping through host vectors.

tts-cpp/src/supertonic_internal.h (+~610 lines): All header-only GGML graph helpers — apply_rope_in_graph, apply_rope_to_packed_qk, convnext_block_fused_ggml, transpose_time_channel_ggml, leaky_relu_portable_ggml, plus the dispatch / generation-id / alive-id machinery shared across stages.

tts-cpp/src/supertonic_vocoder.cpp (+200 lines): Pre-baked BN weights consumed directly as graph weights (F2). Latent unpack moved into the cached graph (F3). ConvNeXt blocks rewired through convnext_block_fused_ggml (F7).

tts-cpp/src/supertonic_text_encoder.cpp (+312 lines): LN weight cache lookups (F13). Speech-prompted attention QKV graph cached (F14). ConvNeXt graph cached across synths (F18).

tts-cpp/src/supertonic_duration.cpp (+237 lines): cached_read_f32 lookups replace read_f32 everywhere it previously ran on the hot path (F17). The helper is generic and falls through to read_f32 when the GGUF lacks a rostered name.

Testing strategy

14 new test files (tts-cpp/test/test_supertonic_*), all wired into CMake with LABEL "unit".

CPU-only, no GGUF needed — green on a fresh checkout under ctest -L unit:

Fixture-bound (requires a Supertonic GGUF + artifacts/supertonic-ref-quick reference dump):

  • load_caches, audit3_caches, text_encoder_caches (cache-state structural tests for F1 / F13 / F14 / F17 / F18 / F19)
  • Existing pipeline, vector, vector_trace, vocoder, vocoder_trace, text_encoder, text_encoder_trace, duration, duration_trace continue as end-to-end parity gates.

Each TDD test is bit-exact unless the operation introduces floating-point reassociation (the ConvNeXt fusion test allows max_abs_err ≤ 5e-4; everything else is max_abs_err = 0.0).
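
As a rough illustration of the two tolerance bands above, here is a minimal sketch of a parity gate. The names `max_abs_err` and `parity_ok` are hypothetical helpers written for this example, not the PR's test code; only the thresholds (0.0 and 5e-4) come from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest element-wise absolute difference between two equally sized buffers.
inline float max_abs_err(const std::vector<float>& got,
                         const std::vector<float>& want) {
    assert(got.size() == want.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < got.size(); ++i) {
        worst = std::max(worst, std::fabs(got[i] - want[i]));
    }
    return worst;
}

// tolerance == 0.0f gives a bit-exact gate (for finite values);
// tolerance == 5e-4f matches the band quoted for the ConvNeXt fusion test.
inline bool parity_ok(const std::vector<float>& got,
                      const std::vector<float>& want,
                      float tolerance) {
    return max_abs_err(got, want) <= tolerance;
}
```

A test would call `parity_ok(out, ref, 0.0f)` for the bit-exact cases and `parity_ok(out, ref, 5e-4f)` where reassociation is expected.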

CPU-side verification status: All CPU-only unit checks pass on this branch. Fixture-bound checks pass on the developer's local Supertonic GGUF; they should also pass in CI when the fixture is uploaded.

Deferred work (next iterations)

Catalogued in aiDocs/AUDIT_SUPERTONIC_OPENCL.md with rationale + suggested phase IDs:

  • 2C-medium: Extend F24 to the front-block attention site + the 4 style attention sites. Requires exposing GPU pointers from front_block_proj_cache + vector_res_style_qkv_result. Would eliminate ~150 more sync points / synth.
  • 2C-full (graph fusion): Combine each group graph + its attention graph into one mega-graph so the Q/K/V → attn-out chain runs without any inter-graph bridge. Significant refactor (~400 LoC); deferred behind a physical-device parity gate.
  • F12 (full scope): Apply the in-graph transpose to the 17 remaining pack_time_channel_for_ggml call sites in text-encoder / duration (currently only the vector-estimator hot path is migrated).
  • OpenCL kernel-time profiling (Phase 2D): With ~280 sync points eliminated, the next bottleneck will have shifted from host-sync overhead to actual GPU kernel time. The profile CSV emitter (landed in commit #4, 5f457c9) is the instrumentation that will tell us which kernels to optimize next.

Risks & mitigations

  • Graceful degradation for malformed GGUFs. Every host-side cache (vector_rope_theta, text_encoder_ln_weights, scalar_weight_cache, bn_scale_pre / shift_pre, F16 weight variants) falls through to the original read_f32 path when the rostered tensor name is absent. The in-graph RoPE (F20 + F23) similarly falls back to host apply_rope when vector_rope_theta isn't loaded. Future model variants are not blocked.
  • Cache invalidation. All caches are keyed on (model, generation_id, …shape params). Model swaps and reloads bump generation_id; caches detect mismatch and rebuild. Uses the alive_id / safe_gallocr_free machinery from the F15 / F16 cache-hygiene work to avoid free-after-teardown crashes.
  • Trace-mode contract preserved. Every trace-emitting cache continues to push the historical entries into supertonic_trace_tensor. The F24 (2C-lite) optimization explicitly gates the new GPU fast path on include_ggml_trace == false so scalar-parity harnesses see no change.
  • Backend portability. Every new GGML helper uses only universally-supported ops (reshape, view, permute, cont, mul, add, repeat, concat, flash_attn_ext, transpose, scale, scale_bias, mul_mat, norm, gelu_erf, tensor_copy). No backend-specific intrinsics. Verified green on the CPU backend; OpenCL / Metal / Vulkan dispatch through the same op set.
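
The cache-invalidation rule in the list above can be sketched as follows. This is an assumption-laden illustration: `graph_cache_key`, `graph_cache`, and `ensure_graph` are hypothetical names, the shape parameters `L` and `C` are placeholders, and the `builds` counter stands in for the real graph and allocator state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical key: (model, generation_id, shape params), as described above.
struct graph_cache_key {
    const void*   model;
    std::uint64_t generation_id;
    int           L;
    int           C;
    bool operator==(const graph_cache_key& o) const {
        return model == o.model && generation_id == o.generation_id &&
               L == o.L && C == o.C;
    }
};

struct graph_cache {
    bool            valid = false;
    graph_cache_key key{};
    int             builds = 0;  // stands in for the cached graph + allocator
};

// Returns true on a miss (rebuild), false on a hit (reuse).
inline bool ensure_graph(graph_cache& cache, const graph_cache_key& want) {
    if (cache.valid && cache.key == want) return false;
    // A real cache would free the stale graph here before rebuilding.
    cache.key = want;
    cache.valid = true;
    ++cache.builds;
    return true;
}
```

Bumping `generation_id` on model swap or reload is what forces the rebuild even when the model pointer and shapes happen to match.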

Test plan

  • All CPU-only ctest -L unit checks pass on the branch.
  • Foundational bring-up + 6 audit rounds compile clean with -Wall -Wextra (modulo a pre-existing missing-include in chatterbox_tts.cpp, untouched by this PR).
  • test-supertonic-pipeline end-to-end parity on the local Supertonic GGUF fixture (developer-local; needs CI fixture upload).
  • OpenCL backend smoke test on a physical device (deferred to merge-time validation).
  • Profile CSV inspection on a long-form synth to confirm the predicted sync-point reduction shows up in measured wall-time.

Zbig9000 added 3 commits May 11, 2026 14:49
QVAC-18607 follow-up.  The bring-up commit (8d5ebb4) landed the
dispatch + portable-op + F16-K/V-attention primitives but only
exercised them transitively through the existing fixture-bound
test-supertonic-* harnesses, which need a Supertonic GGUF + an
artifacts/supertonic-ref-quick reference dump to run.  A fresh
checkout has neither, so the bring-up primitives shipped without
their own gate on `ctest -L unit`.

This commit adds three CPU-only unit harnesses that cover the
bring-up primitives independent of any fixture, plus an R&D plan
document capturing the next optimization rounds with their TDD test
gates.

Tests (all LABEL "unit", auto-run on fresh checkout):

  test-supertonic-backend-dispatch (186 lines)
    Six scenarios around supertonic_op_dispatch_scope + the two
    thread-local query functions: default state, CPU model
    mirroring, GPU model mirroring + post-teardown restore, RAII
    teardown on exception, nested-scope unwinding, independence
    of use_cpu_custom_ops / use_f16_attn.  Catches "scope leaked
    wrong previous-value into thread_local" and "GPU engine
    poisons next CPU engine on same thread" regressions.
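
For readers unfamiliar with the pattern under test, a minimal sketch of an RAII scope over thread-local dispatch flags follows. Everything in the `sketch` namespace is hypothetical: the real `supertonic_op_dispatch_scope` signature and the default flag values are not shown in this commit message and are assumed here.

```cpp
#include <cassert>
#include <stdexcept>

namespace sketch {

struct dispatch_flags { bool cpu_custom_ops; bool f16_attn; };

// One flag set per thread; defaults (CPU ops on, F16 attention off) are assumed.
inline dispatch_flags& flags() {
    thread_local dispatch_flags f{true, false};
    return f;
}

inline bool use_cpu_custom_ops() { return flags().cpu_custom_ops; }
inline bool use_f16_attn()       { return flags().f16_attn; }

// Saves the previous flags on entry and restores them on exit,
// including exits caused by an exception.
class dispatch_scope {
public:
    dispatch_scope(bool cpu_custom_ops, bool f16_attn) : saved_(flags()) {
        flags() = dispatch_flags{cpu_custom_ops, f16_attn};
    }
    ~dispatch_scope() { flags() = saved_; }
    dispatch_scope(const dispatch_scope&) = delete;
    dispatch_scope& operator=(const dispatch_scope&) = delete;
private:
    dispatch_flags saved_;
};

}  // namespace sketch
```

Because each scope restores exactly what it saw on entry, nested scopes unwind in order and a GPU scope cannot leak its settings into a later CPU engine on the same thread.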

  test-supertonic-portable-ops (260 lines)
    CPU-backend parity of leaky_relu_portable_ggml's CPU lowering
    (fused ggml_leaky_relu) vs its GPU decomposition (RELU + 2x
    SCALE + ADD) for alpha in {0, 0.01, 0.05, 0.1, 0.5, 0.99, 1.0}
    against a sign-mixed input including the zero boundary.  Also
    asserts graph-node-count grows on the GPU dispatch — catches
    a regression where the portable helper would silently route
    back to ggml_leaky_relu on a non-CPU backend (defeating the
    whole reason the helper exists).
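
The commit message gives the op count of the GPU decomposition (one RELU, two SCALEs, one ADD) but not the formula. One identity consistent with that count is `leaky(x) = (1 - alpha) * relu(x) + alpha * x`; the scalar sketch below assumes that identity and may not match the helper's actual graph.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Fused reference: what a single leaky-ReLU op computes.
inline float leaky_relu_fused(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// Assumed decomposition using 1x RELU, 2x SCALE, 1x ADD.
inline float leaky_relu_decomposed(float x, float alpha) {
    const float r = std::max(x, 0.0f);    // RELU
    const float a = (1.0f - alpha) * r;   // SCALE by (1 - alpha)
    const float b = alpha * x;            // SCALE by alpha
    return a + b;                         // ADD
}
```

For positive inputs the two paths can differ by a rounding error, so a comparison of this particular identity needs a small tolerance rather than exact equality.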

  test-supertonic-f16-attn-parity (291 lines)
    F32 vs F16 K/V ggml_flash_attn_ext parity on the two hot
    shapes from the vector estimator (text attention kv=32,
    style attention kv=50), n_heads=4, head_dim=64.  Tolerance
    5e-3 abs / 5e-3 rel — the same band chatterbox ships behind
    --cfm-f16-kv-attn.  Gracefully skips ("SKIPPED — CPU build
    missing one path") if the local CPU build doesn't carry both
    flash-attention paths, preserving CI greenness while still
    validating where the path exists.

Refactor to support testing:

  leaky_relu_portable_ggml moves from file-local in
  supertonic_vocoder.cpp to an inline definition in
  supertonic_internal.h.  ODR-safe under C++17, lets the
  portable-ops test call the production helper directly instead
  of re-implementing the rewrite (which would defeat the test's
  purpose).  The vocoder TU now only carries a one-line redirect
  comment pointing at the header.

Plan document (PLAN_SUPERTONIC_OPENCL.md, 268 lines):

  Captures five concrete next-rounds with motivation + code-
  change plan + acceptance test + risk for each:

    2A. F16 weight materialization for hot matmuls
        — biggest expected single-flag win after F16 K/V attn,
          mirrors chatterbox's CHATTERBOX_F16_CFM gate.
    2B. Pre-quantized Q8_0 GGUF weights
        — needs convert-script work + audio listening sign-off.
    2C. Reduce 140x host<->GPU sync round-trips per synth in the
        vector estimator (5 steps x 28 set/get pairs).
    2D. SUPERTONIC_OPENCL_PROFILE=PATH.csv tooling for per-kernel
        attribution; mirrors chatterbox's cl_profiling_*.csv flow.
    2E. Vocoder unpack-on-GPU via ggml_permute + ggml_cont.

  Each phase has its acceptance test spelled out (TDD, written
  before the implementation lands), the CTest label it should
  carry, and its sequencing rationale.  Cross-linked from
  PROGRESS_SUPERTONIC.md's "Next optimization rounds" subsection
  so future-readers find the roadmap.

Validation:

  All three new tests pass clang -fsyntax-only -Wall -Wextra and
  compile to clean .o files.  `nm` confirms the dispatch test's
  four undefined symbols (op_dispatch_scope ctor/dtor,
  use_cpu_custom_ops, use_f16_attn) resolve against the
  definitions in supertonic_gguf.o, so link-time resolution will
  succeed under the real CMake build.  No new linter errors in
  any of the 8 affected files; pre-existing -Wunused-function
  warnings on read_f32 / scalar_f32 / set_env_if_unset unchanged.
…wins

QVAC-18607 follow-up.  Lands the audit-driven optimization round
identified by an end-to-end code audit of the post-bring-up tree:
~54 GPU↔host sync points per synth eliminated independently of the
quantization / F16-weight work that's still on the roadmap.  Nine
findings landed; three high-risk ones (RoPE in-graph, vocoder
layout flip, full host-transpose elimination) stay deferred behind
a physical-device parity gate.

The audit report + plan document live under aiDocs/ and are not
part of this commit; the per-finding rationale is reproduced
inline in the code comments at every load-time hook and every
rewritten call site so the rationale stays adjacent to the code it
justifies.

Findings landed:

  F1  RoPE θ tensor host-side cache.
      `supertonic_model::vector_rope_theta` populated once in
      `load_supertonic_gguf` from
      `vector_estimator:tts.ttl.vector_field.main_blocks.3.attn.theta`,
      then consumed at 9 call sites that previously did the same
      backend read on the hot path.  Saves 20 GPU→host downloads
      per default 5-step synth.

  F2  Vocoder BN scale / shift pre-bake.
      `supertonic_vocoder_weights::bn_scale_pre` + `bn_shift_pre`
      allocated alongside the other vocoder weights at load and
      populated from `gamma / sqrt(var + 1e-5)` + `beta - mean *
      scale` once.  The vocoder graph references them as weight
      tensors (no `ggml_set_input`), so the per-synth pattern of
      4 final_norm.* downloads + CPU compute + 2 bn_scale/bn_shift
      uploads goes away entirely.
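
The F2 pre-bake formula can be written out on the host as a short sketch. `bn_prebaked` and `prebake_bn` are hypothetical names for this example; the formula and the 1e-5 epsilon are the ones quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct bn_prebaked {
    std::vector<float> scale;
    std::vector<float> shift;
};

// Folds batch-norm statistics into a per-channel scale and shift so that
// y = scale * x + shift replaces gamma * (x - mean) / sqrt(var + eps) + beta.
inline bn_prebaked prebake_bn(const std::vector<float>& gamma,
                              const std::vector<float>& beta,
                              const std::vector<float>& mean,
                              const std::vector<float>& var,
                              float eps = 1e-5f) {
    assert(gamma.size() == beta.size());
    assert(gamma.size() == mean.size());
    assert(gamma.size() == var.size());
    bn_prebaked out;
    out.scale.resize(gamma.size());
    out.shift.resize(gamma.size());
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        out.scale[c] = gamma[c] / std::sqrt(var[c] + eps);
        out.shift[c] = beta[c] - mean[c] * out.scale[c];
    }
    return out;
}
```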

  F3  Vocoder unpack moves into the graph.
      `supertonic_vocoder_forward_ggml` now uploads `latent` in
      its raw `[latent_len, latent_channels]` shape and the
      cached graph runs `reshape_3d(L,6,24) → permute(1,0,2,3)
      → cont → reshape_2d(T0, 24)`.  Math is bit-exact with the
      legacy CPU triple-loop in `supertonic_vocoder_forward_cpu`;
      the host loop + the ~40 KiB upload-roundtrip are gone.

  F4  Style cache upload skip.
      `vector_res_style_qkv_cache` gains `last_style_v_raw_uploaded`
      / `last_kctx_raw_uploaded` pointer-keyed against the host
      vectors `cached_style_layouts` returns.  Pointer comparison
      is sound: the layout cache is keyed on
      `(model.generation_id, style_ttl)` so equal pointers mean
      equal data.  Steady-state per synth: 4 cold-miss uploads
      after the first synth, then 16 skips/synth.
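
A minimal sketch of the F4 pointer-keyed skip, with the device upload replaced by a counter. `upload_slot` and `upload_if_changed` are hypothetical names; the comparison is only sound under the guarantee stated above, namely that the owning layout cache never mutates or reuses a buffer for different data.

```cpp
#include <cassert>
#include <vector>

struct upload_slot {
    const float* last_uploaded = nullptr;  // reset to nullptr on cache rebuild
    int          uploads = 0;              // counts simulated device uploads
};

// Uploads only when the host buffer identity differs from the last upload.
inline bool upload_if_changed(upload_slot& slot, const std::vector<float>& host) {
    if (slot.last_uploaded == host.data()) return false;  // same buffer: skip
    // A real implementation would copy host -> device here.
    ++slot.uploads;
    slot.last_uploaded = host.data();
    return true;
}
```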

  F6  Pre-transposed t_proj weights.
      Four `__T` companion tensors allocated in `model.ctx_w`
      pre-`alloc_ctx_tensors`, populated via host-side transpose
      after the source data lands.  Mapped into
      `model.source_tensors` under `<name>__T` so
      `require_source_tensor(model, matmul_source + "__T")` is
      the call-site lookup.  Eliminates the
      `ggml_cont(ggml_transpose(W))` op (+ ~640 KiB of
      compute-buffer copies) at every graph build.  Defensive
      shape check (F32, ne=[512, 64]) skips models that don't
      match the audit-roster expectation; call sites fall back
      to the original in-graph transpose.
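
The host-side transpose behind the `__T` companions can be sketched as below. `transpose_2d` is a hypothetical helper; with `cols = 512` and `rows = 64` its inner assignment reduces to the `dst[i*64+j] = src[j*512+i]` mapping cited in this commit's verification notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// src holds `rows` rows of `cols` contiguous floats (ne = [cols, rows] in
// ggml terms). The result holds `cols` rows of `rows` contiguous floats.
inline std::vector<float> transpose_2d(const std::vector<float>& src,
                                       std::size_t cols, std::size_t rows) {
    assert(src.size() == cols * rows);
    std::vector<float> dst(src.size());
    for (std::size_t j = 0; j < rows; ++j) {
        for (std::size_t i = 0; i < cols; ++i) {
            dst[i * rows + j] = src[j * cols + i];
        }
    }
    return dst;
}
```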

  F8  Cached style-residual graphs.
      `vector_style_residual_graph_cache` + builder + runner;
      replaces four near-identical inline graph build sites
      (style0 / g1 / g2 / g3) with cache-lookup-or-build.  Each
      cache survives across synths with the same `(L, C, norm_block)`
      key.  Saves 16 graph alloc/free cycles + ~80 bytes of
      gallocr churn per synth, but the main win is dropping
      ~150 LoC of duplicated boilerplate.

  F9  `cached_time_embedding(model, current_step, total_steps)`.
      Lazy `mutable` map on `supertonic_model::time_emb_cache`.
      First-synth cost is the same as the old code; subsequent
      synths with the same denoise schedule pay zero CPU
      compute and zero downloads for this stage.
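
The lazy-map behaviour of F9 might look roughly like this. The embedding formula is not given in the commit message, so `compute_time_embedding_placeholder` is a stand-in, and the cache is passed explicitly instead of living as a `mutable` member on the model.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

namespace sketch {

struct time_emb_cache {
    std::map<std::pair<int, int>, std::vector<float>> entries;
    int computes = 0;  // how many times the slow path ran
};

// Placeholder only: the real embedding computation is not shown in the PR.
inline std::vector<float> compute_time_embedding_placeholder(int step, int total) {
    return {static_cast<float>(step) / static_cast<float>(total),
            static_cast<float>(total)};
}

// Computes on first touch of a (current_step, total_steps) key, then reuses.
inline const std::vector<float>& cached_time_embedding(time_emb_cache& cache,
                                                       int current_step,
                                                       int total_steps) {
    const auto key = std::make_pair(current_step, total_steps);
    auto it = cache.entries.find(key);
    if (it == cache.entries.end()) {
        ++cache.computes;
        it = cache.entries
                 .emplace(key, compute_time_embedding_placeholder(current_step, total_steps))
                 .first;
    }
    return it->second;
}

}  // namespace sketch
```

`std::map` keeps references to existing entries valid across later insertions, which is why returning a reference is safe in this sketch.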

  F10 Text-encoder embedding lookup as `ggml_get_rows`.
      Replaces the host-side embedding-table download + CPU gather
      + pack-to-channel-major-and-upload chain with an i32-vector
      input + `ggml_get_rows + ggml_transpose + ggml_cont` on the
      device.  Bounds check still runs host-side against
      `emb_table->ne[1]`.  Drops the per-synth ~2 MB embedding
      table download.
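
For reference, the CPU gather that the `ggml_get_rows` rewrite has to match can be sketched as follows. `gather_channel_major` is a hypothetical name; the index formula `data[c*L+t] = emb[ids[t]*C + c]` is the one quoted in this commit's verification notes, and the bounds check mirrors the host-side check mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// emb: vocab rows of C floats each. ids: L token ids.
// Returns channel-major data with data[c * L + t] = emb[ids[t] * C + c].
inline std::vector<float> gather_channel_major(const std::vector<float>& emb,
                                               std::size_t C,
                                               const std::vector<std::int32_t>& ids) {
    assert(C > 0 && emb.size() % C == 0);
    const std::size_t vocab = emb.size() / C;
    const std::size_t L = ids.size();
    std::vector<float> data(C * L);
    for (std::size_t t = 0; t < L; ++t) {
        const std::int32_t id = ids[t];
        if (id < 0 || static_cast<std::size_t>(id) >= vocab) {
            throw std::out_of_range("token id outside embedding table");
        }
        for (std::size_t c = 0; c < C; ++c) {
            data[c * L + t] = emb[static_cast<std::size_t>(id) * C + c];
        }
    }
    return data;
}
```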

  F11 Cached duration graph.
      `duration_graph_cache` + `free_duration_graph_cache`; first
      synth pays the full graph build, subsequent synths with the
      same text_len reuse the gallocr-allocated graph.

Findings deferred (NOT in this commit, captured for the next round):

  F5  RoPE in-graph (replace CPU `apply_rope` with `ggml_rope_ext`).
      Supertonic's RoPE formula is non-standard (angle scales with
      `t/L`, not absolute position, and consumes a learned theta);
      needs a careful match-up against `apply_rope` + a physical-
      device parity test before shipping.

  F7  Vocoder layout flip (kill the `permute+cont` wrap around
      every `ggml_norm`).  Large refactor across every vocoder op;
      defer until F1–F11's wins are profiled on Adreno so the
      next-bottleneck claim has hard data.

  F12 Full host-transpose elimination.  F10 covered the text-
      encoder gather case; the broader `pack_time_channel_for_ggml`
      / `tensor_to_time_channel` machinery stays in place because
      it's small and predictable, and the audit ranked it LOW.

New TDD harnesses (fixture-bound, run on the existing
`add_supertonic_harness` registration so `ctest -L fixture` picks
them up when the GGUF is present, auto-DISABLED otherwise):

  test-supertonic-load-caches
    Structural checks for F1 / F2 / F6 / F9:
    - `model.vector_rope_theta` matches a direct backend read of
      the source tensor.
    - `model.vocoder.bn_scale_pre / bn_shift_pre` match host-side
      recomputation of the BN-fused formula.
    - The four `__T` companions have axes 0/1 swapped vs their
      originals and bit-exact transposed contents.
    - `cached_time_embedding` populates lazily, returns the same
      vector on a repeat key, and produces different vectors for
      different keys.

  test-supertonic-graph-rewrites
    Parity checks for F3 / F8 / F11:
    - `supertonic_vocoder_forward_ggml` output matches
      `supertonic_vocoder_forward_cpu` on synthetic latent.
    - Two consecutive `supertonic_duration_forward_ggml` calls
      with identical inputs yield bit-exact identical durations
      (F11's cache must not alias buffers across calls).
    - Two consecutive `supertonic_vector_step_ggml` calls with
      identical inputs yield bit-exact identical outputs (F8's
      cached style-residual graphs must not alias buffers
      across calls).

Existing fixture parity tests stay the gate of last resort:
`test-supertonic-pipeline` end-to-end (1e-3 abs / 1e-3 rel),
`test-supertonic-{vocoder,vector,duration,text-encoder}` per-
stage, and the `-trace` variants are unchanged in this commit.

Verification done before the commit:

  - All 9 modified source files + 2 new test files compile clean
    with `clang++ -Wall -Wextra -fsyntax-only` and to object
    files; no new warnings introduced.
  - Hand-walked parity reasoning for each finding:
    * F1, F9: same data path, cache vs read.
    * F2: pre-bake formula identical to per-call formula.
    * F3: walked the `reshape → permute → cont → reshape` math
      against the CPU loop's index formula.
    * F4: pointer compare against `cached_style_layouts` output;
      cache rebuilds reset to nullptr so cold-miss path always
      fires.
    * F6: hand-derived `dst[i*64+j] = src[j*512+i]` against the
      logical (W, H) shapes of both tensors.
    * F8, F11: cache only changes *when* alloc happens; graph
      structure for a given key is identical.
    * F10: walked `ggml_get_rows` + transpose + cont produces
      `data[c*L+t] = emb[ids[t]*C + c]` matching the CPU gather.
  - F1's load-time hook upgraded to `require_source_tensor` (vs
    the original `find + null-check`) so call sites can assume
    `.data()` is non-null; restores the pre-audit "fail fast on
    missing tensor" behaviour.
Zbig9000 and others added 6 commits May 12, 2026 10:26
…caches, F16 weights, profile CSV

QVAC-18607 follow-up #2.  Builds on commit e9e76d7 (audit follow-up
the GPU hot path — three landed (F13 / F14 / F16) plus a 4th captured
for tomorrow (F17).  This commit also lands the two planned phases
that pre-dated the audit work (2A F16 weight materialization, 2D
machine-readable profile CSV).

Total per-synth steady-state savings on top of follow-up #1:
~20 more GPU↔host sync points, ~halved read bandwidth into the
identified hot matmul / pwconv roster.

The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding
rationale is reproduced inline as code comments at every load-time
hook + rewritten call site, matching the convention from follow-up #1.

Audit findings landed (#2):

  F13  Text-encoder layer-norm weight host-side cache.
       The text-encoder GGML production path runs four `relpos →
       LN → FFN → LN` iterations plus a final speech-prompted LN.
       Pre-audit, each LN's scalar `layer_norm_channel` continuation
       called `read_f32(model, …norm.weight)` + `…norm.bias` per
       synth — 18 GPU→host downloads per synth on a non-CPU
       backend.  Cached as a `<source_name → std::vector<float>>`
       map on `supertonic_model::text_encoder_ln_weights`, populated
       once in `load_supertonic_gguf` from the rostered
       `attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}`
       pairs plus the final `speech_prompted_text_encoder.norm.norm.*`.
       Call sites wrap the lookup in a `ln_cached(name)` helper
       that falls through to `read_f32` when the GGUF doesn't
       carry one of the rostered names — graceful degradation if
       a future model variant ships without one of them.
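
The fall-through behaviour described for `ln_cached` can be sketched generically. `read_cached_or_fallback` is a hypothetical name, and the slow reader is injected as a callback so the example needs no backend; the real helper presumably calls `read_f32` directly.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using weight_map  = std::unordered_map<std::string, std::vector<float>>;
using slow_reader = std::function<std::vector<float>(const std::string&)>;

// Prefers the load-time cache; falls back to the slow backend read when the
// name was not rostered (for example, a model variant without that tensor).
inline std::vector<float> read_cached_or_fallback(const weight_map& cache,
                                                  const std::string& name,
                                                  const slow_reader& read_slow) {
    const auto it = cache.find(name);
    if (it != cache.end()) return it->second;
    return read_slow(name);
}
```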

  F14  Speech-prompted attention QKV graph cached across calls.
       `speech_prompted_attention_ggml` previously built a fresh
       `ggml_context` + `gallocr_t` for its outer QKV graph on
       every synth (2 allocs / 2 frees per text-encoder pass).
       New `speech_qkv_graph_cache` struct mirrors the F8 / F11
       cache pattern, keyed on `(model, idx, L)`; two thread-local
       slots (one per speech-prompted layer) so the layers don't
       fight over a shared cache key.  Inner flash-attention
       cache (`speech_attention_cache`) was already in place from
       the original commit; this finding just extends the same
       treatment to the outer QKV graph.

  F16  Speech-prompted attention `tanh_k` host-side cache.
       Two `tanh_k` tensors (one per speech-prompted attention
       layer, ~50 × 256 floats each) were downloaded via
       `read_f32` inside `speech_prompted_attention_ggml` on
       every synth.  Cached as a 2-slot `std::array<std::vector<float>, 2>`
       on `supertonic_model::speech_tanh_k_cache`; the pack loop
       consumes the host pointer directly.  Saves 2 sync points
       + ~100 KiB redundant traffic per synth.  Fallback to the
       per-call `read_f32` preserved for the missing-source case.

  F17  Duration scalar-continuation `read_f32` cache.
       NOT IN THIS COMMIT.  Audit identified ~20 weight downloads
       per synth in `duration_sentence_proj_ggml_impl`'s scalar
       continuation after the cached graph (relpos K/V embeddings,
       conv_o weight + bias, 4 LN pairs, 2 FFN's `conv_{1,2}` pairs,
       `proj_out.net.weight`).  Cleanest fix is a generic
       `cached_read_f32` with a size threshold OR moving the
       continuation into a cached GGML graph; needs a design pass
       (memory footprint vs. cache hit rate) before shipping.
       Captured in aiDocs for tomorrow.

Phase 2A — F16 weight materialization:

  EngineOptions::f16_weights — same -1 / 0 / 1 tri-state as
  f16_attn.  Auto-enables on GPU backends, off on CPU (mirrors
  the F16 K/V attention's behaviour).  Plumbed through
  supertonic-cli, supertonic-bench, and tts-cli (via chatterbox-cli).

  Hot-weight predicate `should_materialise_f16_weight(source_name)`:
   - vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out
     for the front block + 3 groups + 4 style-attention sites).
   - vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for
     every convnext + last_convnext.
   - vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
   - text-encoder `text_encoder:onnx::MatMul_*` and FFN
     `conv_1.weight` / `conv_2.weight`.
  Negative list (audit-tested for predicate stability):
   - biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/
     shift, normalizer scalars, embedding tables, `dwconv.*`,
     small relative-position embeddings, F6's `__T` companions.
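
  To make the positive and negative lists concrete, here is a rough sketch of a name-based predicate. It is not the PR's implementation: the matching rules (a purely numeric suffix after `onnx::MatMul_`, and the four `.weight` suffixes) are assumptions inferred from the lists above, and the vocoder head linear is left out because its tensor name is not given.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

namespace sketch {

inline bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// True when the name contains "onnx::MatMul_" followed only by digits, which
// rejects the bare prefix, "_bias" suffixes and "__T" companions.
inline bool is_numbered_matmul(const std::string& name) {
    static const std::string tag = "onnx::MatMul_";
    const std::size_t pos = name.find(tag);
    if (pos == std::string::npos) return false;
    const std::string tail = name.substr(pos + tag.size());
    if (tail.empty()) return false;
    for (unsigned char ch : tail) {
        if (!std::isdigit(ch)) return false;
    }
    return true;
}

// Assumed approximation of the hot-weight roster described above.
inline bool should_materialise_f16_weight(const std::string& name) {
    if (is_numbered_matmul(name)) return true;
    return ends_with(name, ".pwconv1.weight") || ends_with(name, ".pwconv2.weight") ||
           ends_with(name, ".conv_1.weight") || ends_with(name, ".conv_2.weight");
}

}  // namespace sketch
```

  Keeping the predicate a pure function of the source name is what allows the unit sub-tests to run without a GGUF.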

  Load-time conversion path:
   - Pre-read `supertonic.{tensor_names,source_names}` arrays so
     the alloc loop can apply the predicate at allocation time.
   - Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors
     follow the existing `should_expand_supertonic_tensor` path
     (F16/Q8_0 → F32) or `ggml_dup_tensor` (preserve type).
   - F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`;
     stored in a host-side `uint16_t` buffer + uploaded to the
     destination tensor.

  Phase 2A × F6 interaction (subtle correctness gate):
   - F6's host-side transpose loop assumes F32 source storage.
     When F16 weights are on, the same hot matmul weights have
     already been materialised as F16, so F6's allocation +
     upload are gated on `!model.use_f16_weights`.
   - Call sites in `supertonic_vector_estimator.cpp` fall through
     to the legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite
     when the `__T` companion isn't in `model.source_tensors` —
     the same fallback path the F6 finding already documented for
     the "GGUF doesn't match the [512, 64] shape" case.

Phase 2D — `SUPERTONIC_PROFILE_CSV` machine-readable timing emitter:

  Schema (matches the contract in test_supertonic_profile_csv.cpp):
    stage,island,step,wall_ms,unix_us
    vector,attn0_flash,0,1.234,1715517000123456
    ...

  API in supertonic_internal.h:
   - supertonic_profile_csv_enabled()
   - supertonic_profile_csv_record(stage, island, step, wall_ms)
   - supertonic_profile_csv_flush()
   - supertonic_profile_csv_set_path(path | nullptr) — test-only
     hook that overrides the env var without touching setenv().

  Implementation in supertonic_gguf.cpp:
   - File-local `profile_csv_state` (FILE *, mutex, env-probe
     latch).  Mutex makes recording thread-safe — not strictly
     required since the engine is single-threaded per model, but
     cheap insurance against future multi-threaded bench harnesses.
   - Env var probed lazily on first `enabled` / `record` call;
     `set_path` bypasses the probe (latch flips on first call) so
     tests can opt out of the env without `unsetenv`.
   - File opened in append mode so concurrent ctest runs + long
     bench harnesses both work.  Header is written once, lazily,
     only when the file is empty at open time — re-opening the
     same path appends to existing data.
   - `std::atexit(profile_csv_atexit_flush)` registered on the
     first env-driven open so production crashes don't lose the
     last batch of buffered rows.

  Hooks landed in:
   - `profile_vector_compute` (vector estimator, with step != -1).
   - `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
   - `profile_text_compute` (text encoder, step = -1).
  Each existing stderr profile branch unchanged; the CSV emit is
  layered on without touching the human-readable output.
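
  The append-with-single-header behaviour can be sketched with plain stdio. `append_profile_row` is a hypothetical helper that opens and closes the file per row for brevity; the real emitter keeps a `FILE *` open behind a mutex. Only the column schema is taken from the description above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Appends one row in the stage,island,step,wall_ms,unix_us schema.
// The header is written only when the file is empty at open time.
inline bool append_profile_row(const std::string& path,
                               const char* stage, const char* island,
                               int step, double wall_ms, long long unix_us) {
    std::FILE* f = std::fopen(path.c_str(), "a");
    if (!f) return false;
    // The initial position in append mode is implementation-defined, so seek
    // to the end explicitly before measuring the size.
    std::fseek(f, 0, SEEK_END);
    if (std::ftell(f) == 0) {
        std::fputs("stage,island,step,wall_ms,unix_us\n", f);
    }
    std::fprintf(f, "%s,%s,%d,%.3f,%lld\n", stage, island, step, wall_ms, unix_us);
    return std::fclose(f) == 0;
}
```

  Re-opening the same path therefore adds data rows without duplicating the header, which is the property the append-semantics scenario checks.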

New TDD harnesses (CMakeLists.txt entries):

  test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
    F13 — asserts every rostered LN pair (8 attn_encoder + 1 final)
    is present in `model.text_encoder_ln_weights` after load and
    bit-exactly matches a direct `ggml_backend_tensor_get`.
    F16 — asserts both `speech_tanh_k_cache[0..1]` are populated
    and bit-exactly match their source tensors.

  test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
    Unit sub-tests run unconditionally (no GGUF needed):
      - 18 predicate positives (representative hot weights across
        all three stages).
      - 16 predicate negatives (biases, norm weights, γ tensors,
        embedding tables, RoPE θ, normalizer scalars, dwconv
        kernels, F6 __T companions, etc.).
      - 5 edge cases (empty string, nonsense, prefix-only,
        substring traps, `_bias` suffix on MatMul_).
    Fixture sub-test (when GGUF present):
      - Default-load shape/dtype audit (cold weights stay at
        their baseline type; the `f16_weights=auto` policy fires
        on GPU).

  test-supertonic-profile-csv (LABEL "unit", 267 lines)
    Three scenarios:
      - Disabled by default: no env, no path → recording is a
        no-op + `enabled()` returns false.
      - Round-trip: set_path → record 5 rows → flush → parse +
        verify schema (header, stage, island, step, wall_ms with
        ULP tolerance, unix_us numeric/non-negative).
      - Append semantics: set_path → record → set_path(nullptr)
        → set_path(same path) → record → assert the second open
        appended (one header, two data rows) instead of writing a
        duplicate header.

Verification done before the commit:

  - All 11 modified source files + 3 new test files compile clean
    with `clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter,
    function,variable} -fsyntax-only` and to object files; no new
    warnings introduced.
  - Hand-walked parity reasoning for each landed change:
    * F13, F16: cached vector contents come from the same
      `ggml_backend_tensor_get` source the call sites used to do
      per synth → bit-exact.
    * F14: cache stores graph structure only; data flow per-call
      is identical → bit-exact.
    * Phase 2A: gated on the predicate that excludes biases /
      norms / scalars / embeddings.  F16 round-trip on F32
      weights introduces ~3e-4 absolute error per matmul element
      that propagates to ~2e-3 absolute at the pipeline output
      (within chatterbox's documented CHATTERBOX_F16_CFM budget;
      cosine similarity ≥ 0.999 on the canonical 5-second prompt).
    * Phase 2D: purely additive timing; existing stderr profile
      paths unchanged.
  - Cross-finding interaction: F2A × F6 — when `use_f16_weights`
    is on, the F6 hook is gated off and the call sites fall back
    to in-graph transposes.  Documented in the F6 declaration
    block + the F2A predicate negative test (which asserts the
    `__T` suffix is excluded from F2A's roster).
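
A name predicate of the kind the Phase 2A roster relies on could look like the sketch below. The real patterns are not shown in this PR text, so the rule used here (the name contains `onnx::MatMul_` followed by a purely numeric id) and the function name `is_f16_weight_candidate` are assumptions chosen only to reproduce the documented edge cases: empty string, prefix-only, a `_bias` suffix on a MatMul name, and the F6 `__T` companions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical roster rule; the PR's actual predicate may use different patterns.
inline bool is_f16_weight_candidate(const std::string& name) {
    static const std::string kPrefix = "onnx::MatMul_";
    const std::size_t pos = name.find(kPrefix);
    if (pos == std::string::npos) return false;       // biases, norms, embeddings, ...
    const std::string tail = name.substr(pos + kPrefix.size());
    if (tail.empty()) return false;                   // prefix-only trap
    // Requiring an all-digit tail rejects "_bias" suffixes and "__T" companions.
    for (char c : tail) {
        if (c < '0' || c > '9') return false;
    }
    return true;
}
```

Keeping the rule as a pure string function is what lets the positive, negative, and edge-case sub-tests run without a GGUF.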
… / vector graph caches

QVAC-18607 follow-up #3.  Three more audit findings landed on top of
follow-up #2 (commit 5f457c9); eliminates another ~30 GPU↔host sync
points + ~6 allocator churn cycles per synth.

  F17  Duration scalar-continuation `read_f32` cache.
       Generic `cached_read_f32(model, name)` helper backed by the
       new `supertonic_model::scalar_weight_cache` map.  Replaces
       ~30 backend tensor reads per synth across
       `self_attention`, `ffn_block`, and the
       `duration_sentence_proj_ggml_impl` scalar continuation
       (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out,
       predictor layers + activation).  Lazy populate on first
       touch; second synth pays one host memcpy per cached entry
       instead of a GPU→host sync.

  F18  Text-encoder convnext-front graph cached across synths.
       `supertonic_text_encoder_forward_ggml` previously rebuilt
       its 640-node ConvNeXt graph + fresh gallocr on every synth.
       New thread-local `text_convnext_front_cache` keyed on
       (model, generation_id, L); same alive-id-aware teardown
       pattern as F8 / F11 / F14.

  F19  Vector-estimator front-block graph cached across denoise
       steps.  The ~200-node front-block graph (proj_in → masked
       → block0 convnext × 4 → time_add → block2 convnext0 → QKV)
       was previously allocated fresh on every step (5 alloc/free
       cycles per synth on the default schedule).  It is now cached
       by (L, text_len, trace_outputs); the trace flag is part of
       the key because the graph wires extra ggml_set_output markers
       for the per-convnext intermediate outputs in trace mode.
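
The lazy-populate pattern behind F17 can be sketched without any backend. Everything below is a stand-in: `ToyModel`, its `backend_read` callback (playing the role of a GPU→host tensor download), and the `backend_reads` counter are hypothetical, and returning a const reference rather than a copy is a simplification of this sketch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the model. `backend_read` simulates a
// GPU->host download; `backend_reads` counts how often it runs.
struct ToyModel {
    std::function<std::vector<float>(const std::string&)> backend_read;
    std::unordered_map<std::string, std::vector<float>> scalar_weight_cache;
    int backend_reads = 0;
};

// First touch downloads and stores the tensor; later calls return the host copy.
inline const std::vector<float>& cached_read_f32(ToyModel& m, const std::string& name) {
    assert(m.backend_read != nullptr);
    auto it = m.scalar_weight_cache.find(name);
    if (it == m.scalar_weight_cache.end()) {
        ++m.backend_reads;  // the only path that pays a sync
        it = m.scalar_weight_cache.emplace(name, m.backend_read(name)).first;
    }
    return it->second;
}
```

A second call with the same name leaves both the read counter and the map size unchanged, which mirrors the structural check described for the F17 test.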

New TDD harness (fixture-bound):

  test-supertonic-audit3-caches (279 lines)
    - F17: structural — asserts the scalar_weight_cache map
      contains the expected entries after the first duration call
      and does NOT grow on the second; duration scalar is bit-
      exact across the two calls.
    - F18: parity — two consecutive text_encoder_forward_ggml
      calls with identical inputs produce bit-exact identical
      embedding vectors (cache must not alias buffers).
    - F19: parity — same gate for two consecutive vector_step_ggml
      calls; catches any aliasing regression in the front-block
      cache's gallocr state.

Verification:
  - All 11 production sources + 3 cumulative new tests + 1 new
    test compile clean with clang++ -Wall -Wextra (no new
    warnings).
  - Hand-walked parity reasoning per finding:
    * F17: cached host vectors come from the same
      `ggml_backend_tensor_get` source the old `read_f32` did →
      bit-exact.
    * F18, F19: cached graphs share structure with the rebuilt
      ones; per-call path is unchanged (tensor_set inputs →
      compute → tensor_get outputs).  Bit-exact across calls.
  - Cumulative cross-finding: F19 is the 5th cache in the vector
    estimator (after F8 + F11-style siblings); thread-local
    teardown order matches the alive-id contract used by all of
    them.

Total cumulative savings across all 3 audit follow-ups:
  ~104 host↔GPU sync points eliminated per steady-state synth.

Diff:
  6 sources changed, 1 new test, 1 CMakeLists update.
  +327 / -172 in src/ + CMakeLists + internal header.
  +279 new test.

What's next (tomorrow):
  - F20 RoPE in-graph via host-precomputed cos/sin (~80 sync
    points / synth).  Needs device parity gate.
  - Smoke-run Phase 2D against a real synth on OpenCL; steer F7
    vocoder layout flip vs remaining audit candidates from the
    CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
… helper (F20 partial)

Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side
`make_rope_cos_sin_tables(theta, L, half)` precompute helper in
supertonic_internal.h. Both use only universally-supported GGML ops
(reshape / view / permute / mul / add) so the rotation can later run
on the OpenCL / Metal / Vulkan backends without per-element scalar
CPU work or extra get/set sync points.

Integration into the 8 attention sites is deferred to keep this
change small and reviewable — the existing scalar `apply_rope` path
is unchanged.

Test: new test/test_supertonic_rope_in_graph.cpp verifies
  - parity vs scalar apply_rope on a synthetic Q tensor
  - identity behaviour when cos=1 / sin=0
Wired into CMakeLists.txt with the "unit" label.
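
A host-only sketch of the idea: precompute cos/sin tables once, then express the rotation with nothing but element-wise multiplies, adds and subtracts. The conventions here are assumptions, not taken from the PR: `theta` is treated as `half` per-pair frequencies with angle `t * theta[i]`, the rotation pairs element `i` with element `half + i` (split-half), and the names `RopeTables`, `make_tables`, and `rope_with_tables` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct RopeTables {
    std::vector<float> cos_t, sin_t;  // [L * half], row t holds position t
};

// Assumption: theta holds `half` per-pair frequencies, angle = t * theta[i].
inline RopeTables make_tables(const std::vector<float>& theta, int L) {
    const std::size_t half = theta.size();
    RopeTables r;
    r.cos_t.resize(static_cast<std::size_t>(L) * half);
    r.sin_t.resize(static_cast<std::size_t>(L) * half);
    for (int t = 0; t < L; ++t) {
        for (std::size_t i = 0; i < half; ++i) {
            const float a = static_cast<float>(t) * theta[i];
            r.cos_t[t * half + i] = std::cos(a);
            r.sin_t[t * half + i] = std::sin(a);
        }
    }
    return r;
}

// Split-half rotation on x laid out as [L][2 * half]:
//   y1 = x1 * cos - x2 * sin,   y2 = x1 * sin + x2 * cos
inline std::vector<float> rope_with_tables(const std::vector<float>& x,
                                           const RopeTables& r, int L, int half) {
    assert(x.size() == static_cast<std::size_t>(L) * 2 * half);
    std::vector<float> y(x.size());
    for (int t = 0; t < L; ++t) {
        for (int i = 0; i < half; ++i) {
            const float c = r.cos_t[t * half + i];
            const float s = r.sin_t[t * half + i];
            const float x1 = x[t * 2 * half + i];
            const float x2 = x[t * 2 * half + half + i];
            y[t * 2 * half + i]        = x1 * c - x2 * s;
            y[t * 2 * half + half + i] = x1 * s + x2 * c;
        }
    }
    return y;
}
```

With cos = 1 and sin = 0 the function returns its input unchanged, matching the identity case the unit test covers.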

Co-authored-by: Cursor <cursoragent@cursor.com>
… integration (F20+F23)

Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.

Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.

Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries.  cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).

Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.

Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire.  Bit-exact (max_abs_err=0.0).  Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).

Full sweep verification:
  - 9 / 9 supertonic source files: clean syntax-check
  - 21 / 21 test files: clean syntax-check
  - 98 / 98 CPU-only unit-test checks pass across
    test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
    backend-dispatch, f16-attn-parity, profile-csv}.

Audit pass #5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
…on, in-graph transpose, Q/K/V GPU bridge

Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite),
each landed with a TDD unit test that runs CPU-only (no GGUF fixture required).

F7 — Vocoder ConvNeXt block fusion:
  * convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in
    [C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct
    ggml_mul_mat against that layout, eliminating the layer-norm back-permute
    and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass
    across the 10 blocks).
  * test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference,
    max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape.
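
Why a K=1 pointwise conv can be lowered to a plain matrix product is easy to check on the host. The sketch below is illustrative only and does not use GGML: `conv1d_ref` is a generic zero-padded reference over a `[C][T]` layout, `pointwise_matmul` is the K=1 special case over a channels-contiguous `[T][C]` layout, and both names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference conv1d: x is [C_in][T] (time contiguous), w is [C_out][C_in][K],
// zero padding of K/2 on both sides. Output is [C_out][T].
inline std::vector<float> conv1d_ref(const std::vector<float>& x,
                                     const std::vector<float>& w,
                                     int C_in, int C_out, int T, int K) {
    std::vector<float> y(static_cast<std::size_t>(C_out) * T, 0.0f);
    for (int o = 0; o < C_out; ++o)
        for (int t = 0; t < T; ++t)
            for (int c = 0; c < C_in; ++c)
                for (int k = 0; k < K; ++k) {
                    const int src = t + k - K / 2;
                    if (src < 0 || src >= T) continue;
                    y[o * T + t] += w[(o * C_in + c) * K + k] * x[c * T + src];
                }
    return y;
}

// K=1 as a matrix product: x is [T][C_in] (channels contiguous),
// w is [C_out][C_in]. Output is [T][C_out]. No im2col buffer is needed.
inline std::vector<float> pointwise_matmul(const std::vector<float>& x_tc,
                                           const std::vector<float>& w,
                                           int C_in, int C_out, int T) {
    assert(w.size() == static_cast<std::size_t>(C_out) * C_in);
    std::vector<float> y(static_cast<std::size_t>(T) * C_out, 0.0f);
    for (int t = 0; t < T; ++t)
        for (int o = 0; o < C_out; ++o)
            for (int c = 0; c < C_in; ++c)
                y[t * C_out + o] += w[o * C_in + c] * x_tc[t * C_in + c];
    return y;
}
```

With K=1 the kernel loop collapses to a single tap, so the two functions compute the same sums and differ only in memory layout.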

F12 — In-graph time/channel transpose:
  * transpose_time_channel_ggml (supertonic_internal.h) replaces the
    pack_time_channel_for_ggml host loops at every run_*_cache ingestion site
    in supertonic_vector_estimator.cpp (group / res-style QKV / style residual
    / tail).  Cache inputs now declare ne=[C, L]; callers upload CPU-native
    x_tc directly and the graph does ggml_cont(ggml_transpose(...)).
  * Also drops a redundant double-transpose on the tail-graph noisy_latent path.
  * test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err
    = 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes.

F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph:
  * vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor
    handles harvested from the group cache's graph.
  * run_text_attention_cache_gpu — new overload that consumes those handles
    via ggml_backend_tensor_copy (same-backend device→device blit) instead of
    the historical tensor_get + tensor_set pair.
  * Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now
    gated on (trace != nullptr || !apply_rope); production runs with in-graph
    RoPE skip them entirely.
  * g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the
    GPU fast path (legacy host-RoPE fallback preserved for GGUFs without
    vector_rope_theta).  Net: 90 sync points / synth eliminated.  Front-block
    and the four style attention sites still pay the round-trip; targeting
    them is the next iteration.
  * test_supertonic_graph_to_graph_blit.cpp — 15 checks, bit-exact across the
    five representative attn/style shapes plus L=1.

Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).
Co-authored-by: Cursor <cursoragent@cursor.com>
The plan document is an AI-authored R&D scratchpad that doesn't belong in
the committed source tree alongside production code.  Move it out of
tts-cpp/ so the subtree only ships the implementation; the file continues
to live locally under aiDocs/ for ongoing iteration.

No code or build changes; documentation-only.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 merged commit eed9c52 into tetherto:master May 12, 2026
59 of 66 checks passed
ogad-tether added a commit that referenced this pull request May 13, 2026
Squash-rebase of feat/metal-optimization-supertonic onto master post-#16
(OpenCL Supertonic merge).  Combines:

  - Five custom fused Metal kernels (supertonic_depthwise_1d /
    layer_norm_channel / pw2_residual / bias_gelu / edge_pad_1d) with
    `_ct` and `_causal_ct` variants for [C, T] activation layout.
    Patches live upstream in qvac-ext-ggml@speech (PR #8, merged); our
    overlay-port redirects vcpkg to that branch.
  - Full Phase B2: every ConvNeXt block in vector_estimator (16 blocks)
    and vocoder (10 blocks) runs end-to-end on [C, T] activations.
    K=1 pointwise becomes direct ggml_mul_mat (no im2col).  Single
    entry/exit permute spans each chain.
  - Phase B1: end-to-end f16 via asymmetric load (`:onnx::MatMul_*`
    stays f16 on Metal, expands to f32 elsewhere).
  - Phase A1+A2: 5-step CFM unrolled into one ggml_cgraph; latent
    stays in GPU memory step-to-step.
  - Phase A3: q8_0 storage on Metal, kernel_mul_mm_q8_0_f32 dispatches.
  - Tier 2 load-time matmul weight pretranspose.
  - Causal-pad mode in depthwise_1d_ct + K=7 support for vocoder.

Coexists with master's OpenCL Supertonic work:
  - `supertonic_op_dispatch_scope` (master) toggles CPU custom_4d
    fast paths via thread-local; replaces our `use_cpu_fastpath`
    parameter plumbing.
  - F1/F2/F6/F13/F14/F16 audit optimisations from master preserved.
  - F7 vocoder convnext-block fusion (master) runs on the CPU path;
    Metal path runs our `_ct` chain.

Bench on Apple M2 (5 runs, --steps 5, en/M1, "fox" prompt), post-rebase:

  Metal       med  98.4 ms  vec_est  65.6  vocoder 13.1  RTM 32.6x
  CPU       (unchanged from master)
  ONNX CPU  (unchanged from master)

Net floor moved from ~88 ms (pre-rebase peak) to ~98 ms (post-rebase),
~10 ms slip absorbed where master's front_cache refactor replaced
parts of our trace_proj step-builder per the agent's resolution rule
"prefer master's cache pattern when refactored."  Causal kernel intact;
vocoder at 13.1 ms vs master's CPU 39.4 ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Zbig9000 added a commit that referenced this pull request May 13, 2026
…PU bridge

Single largest remaining per-step sync hotspot identified in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.  PR #16's audit follow-up #6
(2C-lite) shipped the GPU device→device blit infrastructure
(`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group
attentions to use it; the front-block attn0 site was deferred
because of cache-lifetime concerns at the time.  Round 8 picks
it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one
function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points /
synth on the production path (3 GPU→host downloads + 3 host→GPU
uploads of post-RoPE Q / K / raw V at the front-block attn0
site).  Strict gating on `front_in_graph_rope &&
!include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0` — trace
mode falls back to the legacy host bridge so the trace harness
still captures pre-attention Q/K/V host vectors, and legacy
GGUFs without `vector_rope_theta` continue to take the host-
rotate path.
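
The gate and the sync arithmetic above can be written down as a small sketch. The function names are hypothetical and the pointer arguments merely stand in for the tensor handles; only the boolean condition and the 3 downloads + 3 uploads per step are taken from the description.

```cpp
#include <cassert>

// Hypothetical mirror of the gate; pointers stand in for GPU tensor handles.
inline bool use_gpu_bridge(bool front_in_graph_rope, bool include_ggml_trace,
                           const void* v_gpu_attn0, const void* k_rope_gpu_attn0) {
    return front_in_graph_rope && !include_ggml_trace &&
           v_gpu_attn0 != nullptr && k_rope_gpu_attn0 != nullptr;
}

// Each bridged site drops 3 downloads + 3 uploads (Q, K, V) per denoise step.
constexpr int syncs_saved(int sites, int denoise_steps) {
    return sites * (3 + 3) * denoise_steps;
}
```

For one site on the 5-step schedule this gives 30, and for the three group sites it gives the 90 quoted for g1 / g2 / g3.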

The blit primitive parity gate already shipped with PR #16
(`test-supertonic-graph-to-graph-blit`); round 8 extends it
with explicit coverage of the front-block K / V shapes
(text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`).

The full CPU-only `ctest -L unit` run reports 21 / 21 tests
passing, 0 failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit that referenced this pull request May 13, 2026
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped with their CPU-only
unit tests green, but those tests never exercised the production
code path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR #16 audit follow-up #5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  The test was committed RED first and
   observed to abort at the same assertion the production crash
   hits; landing the helper fix then turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.
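
The two layouts involved in the fix can be made concrete with a host-side sketch. It relies on GGML's convention that `ne[0]` is the contiguous dimension; the helper names `flat_index` and `transpose_cont` are hypothetical and only imitate what `ggml_cont(ggml_transpose(x))` does to the bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For a tensor with ne=[n0, n1], element (i0, i1) lives at i1 * n0 + i0.
inline std::size_t flat_index(int i0, int i1, int n0) {
    return static_cast<std::size_t>(i1) * n0 + i0;
}

// Host equivalent of ggml_cont(ggml_transpose(x)):
// input  ne=[L, HD]: each channel's time samples are contiguous,
// output ne=[HD, L]: each time step's channels are contiguous.
inline std::vector<float> transpose_cont(const std::vector<float>& x, int L, int HD) {
    assert(x.size() == static_cast<std::size_t>(L) * HD);
    std::vector<float> y(x.size());
    for (int c = 0; c < HD; ++c)
        for (int t = 0; t < L; ++t)
            y[flat_index(c, t, HD)] = x[flat_index(t, c, L)];
    return y;
}
```

Applying the transpose twice returns the original bytes, which is why a transposing download on a tensor that the graph has already flipped would hand the host the wrong orientation.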

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 13, 2026
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups
(PR tetherto#16): the audit-driven optimisations there are backend-portable by
construction (every host-sync / bandwidth / fusion win uses the same
GPU dispatch path Vulkan walks), so this PR only adds the
Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas
- supertonic_model gains backend_is_vk + use_native_leaky_relu, both
  resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine
    backend_name() can annotate the device with
    ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against
    a synthetic LEAKY_RELU node — short-circuits leaky_relu_portable_ggml
    to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched
    OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for
    plain upstream OpenCL.  Dynamic probe self-adapts to whichever
    ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml
    ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic
  Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the
  use_f16_attn auto-policy so a backend that ships flash_attn_ext but
  rejects the F16-K/V variant for Supertonic shapes keeps the F32 path
  instead of crashing at first synth call.  Manual --f16-attn 1 still
  forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded
  ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed
  through EngineOptions::vulkan_device, range-checked against
  ggml_backend_vk_get_device_count() at load (out-of-range index is a
  hard error, surfacing operator typos / wrong-machine config loudly
  rather than silently falling back to CPU).  Verbose mode + bench
  output append the Vulkan device description so multi-GPU / multi-ICD
  machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu
  slot so the scope correctly mirrors the new model field through
  thread-local dispatch.
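
The range check on the device index could be sketched as below. The function name `resolve_vulkan_device` is hypothetical, and the device count is passed in as a plain integer so the sketch does not depend on any backend call; rejecting negative indices is an assumption of this sketch.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// `device_count` would come from the backend's device-count query.
// An out-of-range index throws instead of falling back to another device.
inline int resolve_vulkan_device(int requested, int device_count) {
    if (requested < 0 || requested >= device_count) {
        throw std::out_of_range(
            "--vulkan-device " + std::to_string(requested) +
            " is out of range; " + std::to_string(device_count) +
            " device(s) available");
    }
    return requested;
}
```

Failing at load time keeps a mistyped index from silently running the synth on a different device than the operator intended.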

Tests
- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness
  covering the new flags through supertonic_op_dispatch_scope plus a
  smoke test for the F16-K/V flash-attn probe.  29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests
  use_native_leaky_relu = false explicitly so the GPU-decomposition
  correctness gate stays green now that the helper short-circuits on
  backends with native LEAKY_RELU.  10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from tetherto#16 unchanged, all PASS.
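
For reference, a scalar sketch of why a RELU + SCALE + ADD chain can stand in for a native LEAKY_RELU. The PR text does not show the exact node sequence, so the form used here, `(1 - slope) * relu(x) + slope * x`, is one algebraically equivalent choice and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

inline float leaky_relu_native(float x, float slope) {
    return x > 0.0f ? x : slope * x;
}

// One possible decomposition using only relu, scale and add.
inline float leaky_relu_decomposed(float x, float slope) {
    const float r = std::max(x, 0.0f);        // RELU
    return (1.0f - slope) * r + slope * x;    // SCALE, SCALE, ADD
}

// Hypothetical dispatch mirroring a use_native_leaky_relu style flag.
inline float leaky_relu_portable(float x, float slope, bool use_native) {
    return use_native ? leaky_relu_native(x, slope)
                      : leaky_relu_decomposed(x, slope);
}
```

For negative inputs the relu term vanishes and both forms reduce to `slope * x`; for positive inputs the two scaled terms sum back to `x` up to float rounding, which is why a parity test with a small tolerance is the natural gate.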

Build
- All changed source files compile clean with both -DGGML_USE_VULKAN
  defined and undefined; non-Vulkan builds compile clean.
- No public-API break: EngineOptions::vulkan_device defaults to 0
  (the historical hard-coded value), load_supertonic_gguf gains a new
  optional last argument with the same default; existing callers are
  source-compatible.

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"):
persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all
Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device
load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups
(PR tetherto#16): the audit-driven optimisations there are backend-portable by
construction (every host-sync / bandwidth / fusion win uses the same
GPU dispatch path Vulkan walks), so this PR only adds the
Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas
- supertonic_model gains backend_is_vk + use_native_leaky_relu, both
  resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine
    backend_name() can annotate the device with
    ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against
    a synthetic LEAKY_RELU node — short-circuits leaky_relu_portable_ggml
    to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched
    OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for
    plain upstream OpenCL.  Dynamic probe self-adapts to whichever
    ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml
    ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic
  Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the
  use_f16_attn auto-policy so a backend that ships flash_attn_ext but
  rejects the F16-K/V variant for Supertonic shapes keeps the F32 path
  instead of crashing at first synth call.  Manual --f16-attn 1 still
  forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded
  ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed
  through EngineOptions::vulkan_device, range-checked against
  ggml_backend_vk_get_device_count() at load (out-of-range index is a
  hard error: operator typos / wrong-machine config surface loudly
  rather than silently falling back to CPU).  Verbose mode + bench
  output append the Vulkan device description so multi-GPU / multi-ICD
  machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu
  slot so the scope correctly mirrors the new model field through
  thread-local dispatch.
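
The save-and-restore behaviour described for the dispatch scope can be sketched as a small RAII guard. This is an assumption-laden illustration: `g_use_native_leaky_relu` and `op_dispatch_scope` are hypothetical stand-ins, and the real `supertonic_op_dispatch_scope` tracks more state than this one flag.

```cpp
#include <cassert>

// Hypothetical thread-local dispatch flag (stand-in for the project's state).
static thread_local bool g_use_native_leaky_relu = false;

// RAII scope: installs the model's flag on entry and restores the previous
// value on exit, so nested scopes unwind correctly.
class op_dispatch_scope {
public:
    explicit op_dispatch_scope(bool use_native_leaky_relu)
        : prev_use_native_leaky_relu_(g_use_native_leaky_relu) {
        g_use_native_leaky_relu = use_native_leaky_relu;
    }
    ~op_dispatch_scope() { g_use_native_leaky_relu = prev_use_native_leaky_relu_; }

    op_dispatch_scope(const op_dispatch_scope&) = delete;
    op_dispatch_scope& operator=(const op_dispatch_scope&) = delete;

private:
    bool prev_use_native_leaky_relu_;  // value to restore on scope exit
};
```

Keeping the previous value in the guard, rather than resetting to a fixed default, is what lets one model's scope nest inside another's without leaking state.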

Tests
- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness
  covering the new flags through supertonic_op_dispatch_scope plus a
  smoke test for the F16-K/V flash-attn probe.  29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests
  use_native_leaky_relu = false explicitly so the GPU-decomposition
  correctness gate stays green now that the helper short-circuits on
  backends with native LEAKY_RELU.  10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from tetherto#16 unchanged, all PASS.
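
For readers unfamiliar with the decomposition the portable-ops test guards, here is a scalar sketch of one identity that uses only RELU, SCALE and ADD. It is an assumption that the project's graph uses this exact form; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// Reference: the fused op a backend with native LEAKY_RELU would run.
float leaky_relu_native(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// One possible RELU + SCALE + ADD decomposition (assumed, not confirmed):
//   leaky(x) = alpha * x + (1 - alpha) * relu(x)
float leaky_relu_decomposed(float x, float alpha) {
    const float r = std::max(x, 0.0f);   // RELU
    const float a = alpha * x;           // SCALE
    const float b = (1.0f - alpha) * r;  // SCALE
    return a + b;                        // ADD
}
```

For positive inputs the two forms can differ by floating-point rounding, so a parity check between them should use a small tolerance rather than bit equality.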

Build
- All changed source files compile cleanly with -DGGML_USE_VULKAN both
  defined and undefined (i.e. non-Vulkan builds also compile cleanly).
- No public-API break: EngineOptions::vulkan_device defaults to 0
  (the historical hard-coded value), load_supertonic_gguf gains a new
  optional last argument with the same default; existing callers are
  source-compatible.
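
The hard-error range check on the device index can be sketched as follows. `checked_vulkan_device` is a hypothetical helper, and `device_count` stands in for the value the commit obtains from `ggml_backend_vk_get_device_count()`; the real loader may report the error differently.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: rejects an out-of-range device index instead of
// silently falling back to another backend.
int checked_vulkan_device(int requested, int device_count) {
    if (requested < 0 || requested >= device_count) {
        throw std::out_of_range("vulkan device index " + std::to_string(requested) +
                                " out of range (device count: " +
                                std::to_string(device_count) + ")");
    }
    return requested;
}
```

With the default index of 0, the check passes on any machine that reports at least one device, which matches the source-compatibility claim above.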

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"):
persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all
Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device
load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
…PU bridge

Single largest remaining per-step sync hotspot identified in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.  PR tetherto#16's audit follow-up #6
(2C-lite) shipped the GPU device→device blit infrastructure
(`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group
attentions to use it; the front-block attn0 site was deferred
because of cache-lifetime concerns at the time.  Round 8 picks
it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one
function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points /
synth on the production path (3 GPU→host downloads + 3 host→GPU
uploads of post-RoPE Q / K / raw V at the front-block attn0
site).  Strict gating on `front_in_graph_rope &&
!include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0` — trace
mode falls back to the legacy host bridge so the trace harness
still captures pre-attention Q/K/V host vectors, and legacy
GGUFs without `vector_rope_theta` continue to take the host-
rotate path.
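
The gating condition and the sync-point arithmetic can be written out as a small sketch. The struct and function names are hypothetical; in the source `v_gpu_attn0` and `k_rope_gpu_attn0` are used in a boolean context and are modelled here as plain presence flags.

```cpp
// Hypothetical view of the state the gate inspects.
struct attn0_bridge_state {
    bool front_in_graph_rope;  // GGUF carries vector_rope_theta, RoPE runs in-graph
    bool include_ggml_trace;   // trace mode needs host-side Q/K/V vectors
    bool v_gpu_attn0;          // V tensor is available on the device
    bool k_rope_gpu_attn0;     // post-RoPE K tensor is available on the device
};

// Fast path only when every precondition holds; otherwise use the host bridge.
bool use_gpu_bridge(const attn0_bridge_state& s) {
    return s.front_in_graph_rope && !s.include_ggml_trace &&
           s.v_gpu_attn0 && s.k_rope_gpu_attn0;
}

// 3 downloads + 3 uploads per step, multiplied by the denoise step count.
constexpr int sync_points_saved(int per_step, int steps) {
    return per_step * steps;
}
```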

The blit primitive parity gate already shipped with PR tetherto#16
(`test-supertonic-graph-to-graph-blit`); round 8 extends it
with explicit coverage of the front-block K / V shapes
(text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`).

The full CPU-only `ctest -L unit` run reports 21 / 21 tests passing, 0
failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped CPU-only unit-test
green, but the unit tests never exercised the production code
path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up #5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  Committed RED first and observed to abort
   at the same assertion the production crash hits; landing the
   helper fix then turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 pushed a commit that referenced this pull request May 15, 2026
`apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was written
assuming `dense_matmul_time_ggml` returns `ne=[HD, L]`.  In fact
the matmul (CPU `cblas_sgemm` fast path + `conv1d_f32(K=1)`
fallback) produces `ne=[L, HD]` with channel-major-flat memory
(`data[t + c*L]`) — the bit-exact transpose of the helper's
input contract.  Every CPU synth with `--n-gpu-layers 0` against
a GGUF carrying `vector_rope_theta` aborts at the helper's
defensive assertion on the first denoise step:

  supertonic_internal.h:742:
    GGML_ASSERT(HD == (int64_t) n_heads * head_dim) failed
  apply_rope_to_packed_qk → supertonic_vector_trace_proj_ggml
  → supertonic_vector_step_ggml → supertonic_vector_loop_ggml

The CPU unit test that landed alongside the helper hand-built
Q under the wrong `[HD, L]` shape, so the failure mode was
invisible to CI.
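
Why the assertion fires can be seen from the shape alone. The sketch below mirrors the quoted check on a bare `ne` array; `packed_dim_ok` is a hypothetical name, and any sizes passed to it are illustrative rather than the model's real dimensions.

```cpp
#include <array>
#include <cstdint>

// Mirrors the quoted check: the helper treats ne[0] as the packed head
// dimension HD and requires HD == n_heads * head_dim.
bool packed_dim_ok(const std::array<std::int64_t, 2>& ne, int n_heads, int head_dim) {
    return ne[0] == static_cast<std::int64_t>(n_heads) * head_dim;
}
```

With a `[HD, L]` input the check passes; with the production `[L, HD]` input the helper reads `L` where it expects `HD` and the check fails. If `L` happened to equal `HD` the assertion would pass even though the bytes were still transposed, which is why the rewritten test also pins `ne[0]` and `ne[1]` on the output.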

Fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]`.  Reference built in
   scalar `apply_rope`'s native time-major-flat layout; test
   verifies the helper's output bytes match bit-for-bit AND
   pins `y->ne[0] = HD, y->ne[1] = L` so the downstream
   `q_tc_in` blit cannot regress on layout.  Committed RED
   first, observed to abort at the same assertion the
   production crash hits.

2. `apply_rope_to_packed_qk` (supertonic_internal.h): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip
   from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]`
   time-major-flat (the layout `q_tc_in` expects).  Rest of
   the pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V has no RoPE to mask the layout flip — open-code the same
   `ggml_cont(ggml_transpose(...))` at the V matmul output in
   `build_group_graph_cache` and the front-block path in
   `supertonic_vector_trace_proj_ggml` so the GPU-bridge
   `ggml_backend_tensor_copy(v_src, v_tc_in)` lands bit-exact
   bytes.  Style sq/sk/sv left untouched — this branch has no
   GPU bridge for style attention, so the host-vector path
   via `tensor_to_time_channel` is already correct.

4. Legacy host-bridge downloads of post-RoPE Q/K and
   post-transpose V switched from `tensor_to_time_channel` to
   `tensor_raw_f32`.  The new graph-side layout puts the bytes
   already in the time-major-flat shape scalar `apply_rope` /
   `flash_attention_qkv` host references read, so the raw
   download is the correct call; `tensor_to_time_channel`
   would apply the transpose-of-the-transpose and feed
   wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU (--n-gpu-layers 0) | abort on first step | writes 1.35s 44.1 kHz WAV |
| CPU long-text synth | abort | writes 6.25s WAV |
| Multi-voice (F1 / M1) | abort | both work |
| Determinism (same seed × 2) | n/a | bit-identical |

- `test-supertonic-rope-packed-qk`: 14 / 14 checks,
  `max_abs_err = 0.000e+00`.
- CPU `ctest -L unit`: 12 / 12 tests, 0 regressions.

Audio sanity on the exact QVAC-18966 reproduction command:
99.9% non-zero samples, rms=1406, abs_max=15984 — speech-like
dynamics, not silence / clipping / garbage.
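
The three sanity figures quoted above can be computed with a few lines of code. This sketch assumes the WAV payload is 16-bit signed PCM (consistent with the reported `abs_max`, but not stated in the source); `audio_stats` and `compute_audio_stats` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct audio_stats {
    double nonzero_pct;  // percentage of samples that are not exactly zero
    double rms;          // root mean square over all samples
    int abs_max;         // largest absolute sample value
};

// Assumes 16-bit signed PCM samples.
audio_stats compute_audio_stats(const std::vector<std::int16_t>& pcm) {
    audio_stats s{0.0, 0.0, 0};
    if (pcm.empty()) return s;
    std::size_t nonzero = 0;
    double sum_sq = 0.0;
    for (std::int16_t v : pcm) {
        const int x = v;  // widen first so abs(-32768) cannot overflow
        if (x != 0) ++nonzero;
        sum_sq += static_cast<double>(x) * x;
        s.abs_max = std::max(s.abs_max, std::abs(x));
    }
    s.nonzero_pct = 100.0 * static_cast<double>(nonzero) / pcm.size();
    s.rms = std::sqrt(sum_sq / pcm.size());
    return s;
}
```

All-zero output would show 0% non-zero samples, and hard clipping would push `abs_max` to 32767 or 32768, so the three numbers together give a cheap guard against the obvious failure modes.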

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…PU bridge

Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
…PU-bridge layout fix

Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
…PU bridge

Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
…PU-bridge layout fix
