Add CODEOWNERS file #1

Closed
chetasr wants to merge 1 commit into tetherto:master from chetasr:add-codeowners


chetasr commented Aug 28, 2025

This PR adds a CODEOWNERS file assigning ownership to @tetherto/ai-runtime-bk-models.

Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 12, 2026
…caches, F16 weights, profile CSV

QVAC-18607 follow-up #2.  Builds on commit e9e76d7 (audit follow-up #1)
with further audit findings on the GPU hot path: three landed (F13 /
F14 / F16) plus a 4th captured for tomorrow (F17).  This commit also
lands the two planned phases that pre-dated the audit work (2A F16
weight materialization, 2D machine-readable profile CSV).

Total per-synth steady-state savings on top of follow-up #1:
~20 more GPU↔host sync points, ~halved read bandwidth into the
identified hot matmul / pwconv roster.

The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding
rationale is reproduced inline as code comments at every load-time
hook + rewritten call site, matching the convention from follow-up #1.

Audit findings landed (#2):

  F13  Text-encoder layer-norm weight host-side cache.
       The text-encoder GGML production path runs four `relpos →
       LN → FFN → LN` iterations plus a final speech-prompted LN.
       Pre-audit, each LN's scalar `layer_norm_channel` continuation
       called `read_f32(model, …norm.weight)` + `…norm.bias` per
       synth — 18 GPU→host downloads per synth on a non-CPU
       backend.  Cached as a `<source_name → std::vector<float>>`
       map on `supertonic_model::text_encoder_ln_weights`, populated
       once in `load_supertonic_gguf` from the rostered
       `attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}`
       pairs plus the final `speech_prompted_text_encoder.norm.norm.*`.
       Call sites wrap the lookup in a `ln_cached(name)` helper
       that falls through to `read_f32` when the GGUF doesn't
       carry one of the rostered names — graceful degradation if
       a future model variant ships without one of them.
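
The F13 pattern (populate once at load, look up per synth, fall through to a download on a miss) can be sketched as below. This is a minimal illustration and not the code from the commit: `LnWeightCache`, its `Reader` callback, and the download counter are hypothetical stand-ins, with the callback playing the role of `read_f32`. The lookup returns a copy only to keep the sketch short.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in: `Reader` plays the role of read_f32 (a backend-to-host
// copy). Downloads are counted so the saving is observable.
class LnWeightCache {
public:
    using Reader = std::function<std::vector<float>(const std::string &)>;

    explicit LnWeightCache(Reader reader) : reader_(std::move(reader)) {}

    // Load-time hook: read each rostered tensor once and keep a host copy.
    void populate(const std::vector<std::string> & roster) {
        for (const auto & name : roster) {
            cache_[name] = download(name);
        }
    }

    // Per-synth lookup: a hit returns the host copy without a download; a miss
    // falls through to the reader, so an unrostered name still works.
    std::vector<float> ln_cached(const std::string & name) {
        auto it = cache_.find(name);
        if (it != cache_.end()) return it->second;
        return download(name);
    }

    int downloads() const { return downloads_; }

private:
    std::vector<float> download(const std::string & name) {
        ++downloads_;
        return reader_(name);
    }

    Reader reader_;
    std::unordered_map<std::string, std::vector<float>> cache_;
    int downloads_ = 0;
};
```

With the roster described above (9 LN pairs, weight plus bias), `populate` would perform the 18 downloads once at load, and steady-state synths would hit the map instead.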

  F14  Speech-prompted attention QKV graph cached across calls.
       `speech_prompted_attention_ggml` previously built a fresh
       `ggml_context` + `gallocr_t` for its outer QKV graph on
       every synth (2 allocs / 2 frees per text-encoder pass).
       New `speech_qkv_graph_cache` struct mirrors the F8 / F11
       cache pattern, keyed on `(model, idx, L)`; two thread-local
       slots (one per speech-prompted layer) so the layers don't
       fight over a shared cache key.  Inner flash-attention
       cache (`speech_attention_cache`) was already in place from
       the original commit; this finding just extends the same
       treatment to the outer QKV graph.
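
A keyed graph cache with one thread-local slot per layer, as F14 describes, might look roughly like this. The sketch assumes nothing about the real `speech_qkv_graph_cache` layout: `BuiltGraph`, `QkvGraphCache`, and `qkv_cache_slot` are hypothetical names, and the "graph" is a placeholder struct rather than a `ggml_context` plus allocator.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical placeholder for a built graph and its allocator.
struct BuiltGraph {
    int64_t seq_len;
};

struct QkvGraphCache {
    const void * model = nullptr;  // model the cached graph was built for
    int idx = -1;                  // layer index
    int64_t L = -1;                // sequence length
    std::unique_ptr<BuiltGraph> graph;
    int builds = 0;                // rebuild counter, for illustration only

    // Returns the cached graph when the (model, idx, L) key matches,
    // otherwise rebuilds it and updates the key.
    BuiltGraph & get(const void * m, int i, int64_t len) {
        const bool hit = graph && model == m && idx == i && L == len;
        if (!hit) {
            graph = std::make_unique<BuiltGraph>(BuiltGraph{len});
            model = m;
            idx = i;
            L = len;
            ++builds;
        }
        return *graph;
    }
};

// One slot per speech-prompted layer, per thread, so the two layers never
// evict each other's graph.
inline QkvGraphCache & qkv_cache_slot(int layer) {
    thread_local std::array<QkvGraphCache, 2> slots;
    return slots.at(static_cast<std::size_t>(layer));
}
```

Keeping a separate slot per layer means alternating calls for layer 0 and layer 1 both hit; a single shared slot would rebuild on every call.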

  F16  Speech-prompted attention `tanh_k` host-side cache.
       Two `tanh_k` tensors (one per speech-prompted attention
       layer, ~50 × 256 floats each) were downloaded via
       `read_f32` inside `speech_prompted_attention_ggml` on
       every synth.  Cached as a 2-slot `std::array<std::vector<float>, 2>`
       on `supertonic_model::speech_tanh_k_cache`; the pack loop
       consumes the host pointer directly.  Saves 2 sync points
       + ~100 KiB redundant traffic per synth.  Fallback to the
       per-call `read_f32` preserved for the missing-source case.

  F17  Duration scalar-continuation `read_f32` cache.
       NOT IN THIS COMMIT.  Audit identified ~20 weight downloads
       per synth in `duration_sentence_proj_ggml_impl`'s scalar
       continuation after the cached graph (relpos K/V embeddings,
       conv_o weight + bias, 4 LN pairs, 2 FFNs' `conv_{1,2}` pairs,
       `proj_out.net.weight`).  Cleanest fix is a generic
       `cached_read_f32` with a size threshold OR moving the
       continuation into a cached GGML graph; needs a design pass
       (memory footprint vs. cache hit rate) before shipping.
       Captured in aiDocs for tomorrow.

Phase 2A — F16 weight materialization:

  EngineOptions::f16_weights — same -1 / 0 / 1 tri-state as
  f16_attn.  Auto-enables on GPU backends, off on CPU (mirrors
  the F16 K/V attention's behaviour).  Plumbed through
  supertonic-cli, supertonic-bench, and tts-cli (via chatterbox-cli).
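
The -1 / 0 / 1 tri-state resolves to a plain boolean once the backend is known. A minimal sketch of that resolution, assuming a hypothetical `BackendKind` enum and `resolve_f16_weights` helper (neither name is from the commit):

```cpp
#include <cassert>

// Hypothetical backend classification for the sketch.
enum class BackendKind { Cpu, Gpu };

// option: -1 = auto, 0 = force off, 1 = force on.
// Auto enables F16 weights on GPU backends and leaves them off on CPU.
inline bool resolve_f16_weights(int option, BackendKind backend) {
    if (option < 0) {
        return backend == BackendKind::Gpu;
    }
    return option != 0;
}
```

An explicit 0 or 1 always wins over the backend, so a user can force F16 weights on CPU or disable them on GPU.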

  Hot-weight predicate `should_materialise_f16_weight(source_name)`:
   - vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out
     for the front block + 3 groups + 4 style-attention sites).
   - vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for
     every convnext + last_convnext.
   - vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
   - text-encoder `text_encoder:onnx::MatMul_*` and FFN
     `conv_1.weight` / `conv_2.weight`.
  Negative list (audit-tested for predicate stability):
   - biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/
     shift, normalizer scalars, embedding tables, `dwconv.*`,
     small relative-position embeddings, F6's `__T` companions.
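
A name-based predicate in the spirit of the positive and negative lists above could be sketched as follows. It is an assumption-laden illustration, not `should_materialise_f16_weight` itself: the helper is deliberately named `is_hot_weight_name`, the tensor names in the usage are made up, and the vocoder head linear is left out because its source name is not given here.

```cpp
#include <cassert>
#include <cctype>
#include <initializer_list>
#include <string>

inline bool ends_with(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical predicate: true for names that look like hot matmul / pwconv /
// FFN conv weights, false for everything else.
inline bool is_hot_weight_name(const std::string & name) {
    // Pointwise-conv and FFN conv weights are matched by exact suffix, so
    // biases and dwconv kernels do not match.
    for (const char * suffix : {".pwconv1.weight", ".pwconv2.weight",
                                ".conv_1.weight", ".conv_2.weight"}) {
        if (ends_with(name, suffix)) return true;
    }
    // ONNX matmul weights: "onnx::MatMul_" followed by digits only, which
    // rejects a "_bias" suffix and the "__T" transposed companions.
    const std::string tag = "onnx::MatMul_";
    const auto pos = name.find(tag);
    if (pos == std::string::npos) return false;
    const std::string tail = name.substr(pos + tag.size());
    if (tail.empty()) return false;
    for (unsigned char c : tail) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}
```

Requiring digits through to the end of the name is what keeps the substring traps (a `MatMul_` name with extra suffixes) out of the roster.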

  Load-time conversion path:
   - Pre-read `supertonic.{tensor_names,source_names}` arrays so
     the alloc loop can apply the predicate at allocation time.
   - Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors
     follow the existing `should_expand_supertonic_tensor` path
     (F16/Q8_0 → F32) or `ggml_dup_tensor` (preserve type).
   - F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`;
     stored in a host-side `uint16_t` buffer + uploaded to the
     destination tensor.

  Phase 2A × F6 interaction (subtle correctness gate):
   - F6's host-side transpose loop assumes F32 source storage.
     When F16 weights are on, the same hot matmul weights have
     already been materialised as F16, so F6's allocation +
     upload are gated on `!model.use_f16_weights`.
   - Call sites in `supertonic_vector_estimator.cpp` fall through
     to the legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite
     when the `__T` companion isn't in `model.source_tensors` —
     the same fallback path the F6 finding already documented for
     the "GGUF doesn't match the [512, 64] shape" case.

Phase 2D — `SUPERTONIC_PROFILE_CSV` machine-readable timing emitter:

  Schema (matches the contract in test_supertonic_profile_csv.cpp):
    stage,island,step,wall_ms,unix_us
    vector,attn0_flash,0,1.234,1715517000123456
    ...

  API in supertonic_internal.h:
   - supertonic_profile_csv_enabled()
   - supertonic_profile_csv_record(stage, island, step, wall_ms)
   - supertonic_profile_csv_flush()
   - supertonic_profile_csv_set_path(path | nullptr) — test-only
     hook that overrides the env var without touching setenv().

  Implementation in supertonic_gguf.cpp:
   - File-local `profile_csv_state` (FILE *, mutex, env-probe
     latch).  Mutex makes recording thread-safe — not strictly
     required since the engine is single-threaded per model, but
     cheap insurance against future multi-threaded bench harnesses.
   - Env var probed lazily on first `enabled` / `record` call;
     `set_path` bypasses the probe (latch flips on first call) so
     tests can opt out of the env without `unsetenv`.
   - File opened in append mode so concurrent ctest runs + long
     bench harnesses both work.  Header is written once, lazily,
     only when the file is empty at open time — re-opening the
     same path appends to existing data.
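   - `std::atexit(profile_csv_atexit_flush)` registered on the
     first env-driven open so a normal process exit doesn't lose
     the last batch of buffered rows.  (atexit handlers do not run
     on `std::abort` or a fatal signal, so a hard crash can still
     drop unflushed rows.)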
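
The append-mode open with a write-the-header-only-if-empty rule can be illustrated with a small standalone emitter. This is a sketch under stated assumptions: `ProfileCsv` is a hypothetical class, not the file-local `profile_csv_state`, and it omits the env-var probe, the latch, and the atexit hook.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <string>

// Hypothetical emitter: mutex-guarded FILE* opened in append mode.
class ProfileCsv {
public:
    // Opens `path` for appending. The header is written only when the file
    // is empty at open time, so re-opening never duplicates it.
    bool open(const std::string & path) {
        std::lock_guard<std::mutex> lock(mu_);
        close_locked();
        fp_ = std::fopen(path.c_str(), "a");
        if (!fp_) return false;
        std::fseek(fp_, 0, SEEK_END);  // make ftell report the current size
        if (std::ftell(fp_) == 0) {
            std::fputs("stage,island,step,wall_ms,unix_us\n", fp_);
        }
        return true;
    }

    // Appends one row; a no-op when no file is open.
    void record(const char * stage, const char * island, int step, double wall_ms) {
        std::lock_guard<std::mutex> lock(mu_);
        if (!fp_) return;
        const auto now = std::chrono::system_clock::now().time_since_epoch();
        const long long unix_us =
            std::chrono::duration_cast<std::chrono::microseconds>(now).count();
        std::fprintf(fp_, "%s,%s,%d,%.3f,%lld\n", stage, island, step, wall_ms, unix_us);
    }

    void close() {
        std::lock_guard<std::mutex> lock(mu_);
        close_locked();
    }

    ~ProfileCsv() { close(); }

private:
    void close_locked() {
        if (fp_) {
            std::fclose(fp_);
            fp_ = nullptr;
        }
    }

    std::mutex mu_;
    std::FILE * fp_ = nullptr;
};
```

Opening, recording, closing, and re-opening the same path should leave one header followed by two data rows, which is the append behaviour the third test scenario checks.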

  Hooks landed in:
   - `profile_vector_compute` (vector estimator, with step != -1).
   - `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
   - `profile_text_compute` (text encoder, step = -1).
  Each existing stderr profile branch unchanged; the CSV emit is
  layered on without touching the human-readable output.

New TDD harnesses (CMakeLists.txt entries):

  test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
    F13 — asserts every rostered LN pair (8 attn_encoder + 1 final)
    is present in `model.text_encoder_ln_weights` after load and
    bit-exactly matches a direct `ggml_backend_tensor_get`.
    F16 — asserts both `speech_tanh_k_cache[0..1]` are populated
    and bit-exactly match their source tensors.

  test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
    Unit sub-tests run unconditionally (no GGUF needed):
      - 18 predicate positives (representative hot weights across
        all three stages).
      - 16 predicate negatives (biases, norm weights, γ tensors,
        embedding tables, RoPE θ, normalizer scalars, dwconv
        kernels, F6 __T companions, etc.).
      - 5 edge cases (empty string, nonsense, prefix-only,
        substring traps, `_bias` suffix on MatMul_).
    Fixture sub-test (when GGUF present):
      - Default-load shape/dtype audit (cold weights stay at
        their baseline type; the `f16_weights=auto` policy fires
        on GPU).

  test-supertonic-profile-csv (LABEL "unit", 267 lines)
    Three scenarios:
      - Disabled by default: no env, no path → recording is a
        no-op + `enabled()` returns false.
      - Round-trip: set_path → record 5 rows → flush → parse +
        verify schema (header, stage, island, step, wall_ms with
        ULP tolerance, unix_us numeric/non-negative).
      - Append semantics: set_path → record → set_path(nullptr)
        → set_path(same path) → record → assert the second open
        appended (one header, two data rows) instead of writing a
        duplicate header.

Verification done before the commit:

  - All 11 modified source files + 3 new test files compile clean
    with `clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter,
    function,variable} -fsyntax-only` and to object files; no new
    warnings introduced.
  - Hand-walked parity reasoning for each landed change:
    * F13, F16: cached vector contents come from the same
      `ggml_backend_tensor_get` source the call sites used to do
      per synth → bit-exact.
    * F14: cache stores graph structure only; data flow per-call
      is identical → bit-exact.
    * Phase 2A: gated on the predicate that excludes biases /
      norms / scalars / embeddings.  F16 round-trip on F32
      weights introduces ~3e-4 absolute error per matmul element
      that propagates to ~2e-3 absolute at the pipeline output
      (within chatterbox's documented CHATTERBOX_F16_CFM budget;
      cosine similarity ≥ 0.999 on the canonical 5-second prompt).
    * Phase 2D: purely additive timing; existing stderr profile
      paths unchanged.
  - Cross-finding interaction: Phase 2A × F6.  When `use_f16_weights`
    is on, the F6 hook is gated off and the call sites fall back
    to in-graph transposes.  Documented in the F6 declaration
    block + the Phase 2A predicate negative test (which asserts the
    `__T` suffix is excluded from Phase 2A's roster).
GustavoA1604 added a commit that referenced this pull request May 13, 2026
Two interleaved chatterbox concurrency fixes for iOS, collapsed into
one history-preserving commit on top of the upstream merge so the
qvac-registry-vcpkg tts-cpp port still pins to a single GustavoA1604
SHA (no chained port-version bumps).

1) gguf_init_from_file race (the SIGABRT seen before this commit):
   bake_voice_conditioning() must run BEFORE we spawn the s3gen preload
   thread.  Both paths funnel into gguf_init_from_file() (voice_encoder
   opens the T3 GGUF, s3gen_preload opens the s3gen GGUF), and the
   ggml_init / gguf_init_from_file pair underneath is not safe to
   invoke concurrently from two threads against ggml's process-global
   state.  Empirically races on Apple Silicon with a fast SIGABRT
   inside ggml_abort coming from the preload thread's ggml_init while
   the main thread is still executing voice_encoder_load.

2) Metal shared-buffer-type init race (the SIGSEGV in
   ggml_metal_buffer_is_shared, surfaced once #1 was fixed): after the
   preload thread spawns we now block on wait_for_preload() before
   the constructor returns, so the SDK e2e bootstrap's
   load -> immediate unload pattern ("preLoadUnload") can no longer
   tear down the engine while s3gen_preload is still inside
   ggml_backend_metal_buffer_type_shared_alloc_buffer ->
   ggml_metal_buffer_is_shared.  Defeats the parallel-preload
   optimisation (s3gen_preload no longer overlaps with first T3
   inference inside synthesize()); revisit once ggml-metal's shared
   buffer-type init is safe to use from a preload thread concurrent
   with construction-time teardown.
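
The ordering the two fixes impose (serial step first, then spawn the preload thread, then block until it finishes before the constructor returns) can be sketched with `std::thread`. `EngineSketch` and its event log are hypothetical; the logged strings merely borrow the names used above, and no ggml call is made.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical engine that only records the order of its startup steps.
class EngineSketch {
public:
    EngineSketch() {
        // 1) Serial step runs before any other thread exists, so it cannot
        //    race with the preload thread on shared loader state.
        log("bake_voice_conditioning");
        // 2) Only then start the preload thread.
        preload_ = std::thread([this] { log("s3gen_preload"); });
        // 3) Block until preload finishes, so a caller that unloads right
        //    after construction cannot tear the object down mid-preload.
        wait_for_preload();
        log("constructor_done");
    }

    ~EngineSketch() { wait_for_preload(); }

    void wait_for_preload() {
        if (preload_.joinable()) preload_.join();
    }

    std::vector<std::string> events() {
        std::lock_guard<std::mutex> lock(mu_);
        return events_;
    }

private:
    void log(const std::string & what) {
        std::lock_guard<std::mutex> lock(mu_);
        events_.push_back(what);
    }

    std::mutex mu_;
    std::vector<std::string> events_;
    std::thread preload_;
};
```

Joining inside the constructor gives up the overlap between preload and the first inference, which is the trade-off noted in item 2.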

Together these two changes unblock chatterbox load on iPhone 16e
(iOS 18.6.2) through the QVAC SDK e2e test consumer with Metal
enabled — qvac/pull/1992.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
…caches, F16 weights, profile CSV

Zbig9000 pushed a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
Zbig9000 pushed a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 18, 2026
Zbig9000 pushed a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 19, 2026
gianni-cor pushed a commit that referenced this pull request May 28, 2026
ogad-tether added a commit that referenced this pull request Jun 1, 2026
#1, #2)

Addresses PR #31 review feedback from @GustavoA1604:

  1. backend_selection.cpp: missing `#include <stdexcept>`.  The file
     throws std::runtime_error in 4 places; it compiled on macOS libc++
     via a transitive include but would fail on libstdc++ / MSYS2-GCC.

  2. Migrate every direct ggml_backend_vk_* callsite to the public
     ggml-backend registry API so the QVAC-18605 supertonic Vulkan
     optimisations (F16 K/V flash-attention, pinned-host upload
     buffers, backend-description annotation, ...) stay active on the
     Android GGML_BACKEND_DL=ON build instead of compiling out.

Migrations:

  - ggml_backend_is_vk(b)
      → tts_cpp::detail::backend_is_vulkan(b) — strcmp against
        ggml_backend_reg_name(ggml_backend_dev_backend_reg(
        ggml_backend_get_device(b))).  Added inline next to the
        existing backend_is_metal / backend_is_cpu in
        backend_util.h (mirrors parakeet-cpp's helper module).

  - ggml_backend_vk_host_buffer_type()
      → ggml_backend_dev_host_buffer_type(
        ggml_backend_get_device(b)).  Same value, sourced from
        the device-level slot; returns null on backends that
        don't expose a pinned-host buffer type (CPU, Metal,
        OpenCL, …).  Affects:
          * backend_supports_pinned_host_buffer_uncached
          * try_alloc_inputs_in_pinned_host_buffer

  - ggml_backend_vk_get_device_description(idx, buf, len)
      → ggml_backend_dev_description(
        ggml_backend_get_device(b)).  Same string, no host buf
        round-trip.  Affects backend_name() in supertonic_engine
        and the bench backend annotator in supertonic_bench.
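
The registry-name check behind `backend_is_vulkan` follows a backend → device → registry → name chain. The sketch below mirrors that chain with stand-in types so it compiles without ggml: `FakeBackend`, `FakeDevice`, `FakeReg`, and the `fake_*` accessors are hypothetical, standing in for `ggml_backend_get_device`, `ggml_backend_dev_backend_reg`, and `ggml_backend_reg_name` from ggml-backend.h. The compared string "Vulkan" is an assumption; the text above does not state the exact registry name.

```cpp
#include <cassert>
#include <cstring>

// Stand-in handle types; the real ones are ggml's opaque backend handles.
struct FakeReg     { const char * name; };
struct FakeDevice  { FakeReg * reg; };
struct FakeBackend { FakeDevice * dev; };

// Stand-ins for the three registry accessors named above.
inline FakeDevice * fake_get_device(FakeBackend * b)     { return b ? b->dev : nullptr; }
inline FakeReg *    fake_dev_backend_reg(FakeDevice * d) { return d ? d->reg : nullptr; }
inline const char * fake_reg_name(FakeReg * r)           { return r ? r->name : nullptr; }

// Walks backend -> device -> registry and compares the registry name.
inline bool backend_reg_name_is(FakeBackend * b, const char * want) {
    FakeDevice * dev = fake_get_device(b);
    if (!dev) return false;
    FakeReg * reg = fake_dev_backend_reg(dev);
    if (!reg) return false;
    const char * name = fake_reg_name(reg);
    return name != nullptr && std::strcmp(name, want) == 0;
}

// Assumed registry name; adjust to whatever the Vulkan backend reports.
inline bool backend_is_vulkan(FakeBackend * b) {
    return backend_reg_name_is(b, "Vulkan");
}
```

Because the check depends only on a string from the registry, it needs no Vulkan-specific symbol at link time, which is what lets it work when backends are loaded dynamically.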

Drop:

  - The `#include "ggml-vulkan.h"` includes in supertonic_engine.cpp
    and supertonic_bench.cpp (no longer needed; registry API lives
    in ggml-backend.h).
  - Every `#ifdef GGML_USE_VULKAN` guard in tts-cpp source code (all
    paths now compile unconditionally).
  - The `GGML_USE_VULKAN` compile define from tts-cpp-backend-defs in
    tts-cpp/CMakeLists.txt — no code references it any more.  tts-cpp
    now mirrors parakeet-cpp's "no direct backend symbols" invariant.

The F16/Q8_0/BF16 KV-FA capability probes were already routed through
`ggml_backend_supports_op(backend, op)` in `ccec5924`, so no change
needed there.

Verified on macOS arm64 + Metal:
  - cmake --build builds 100% clean
  - ctest -L unit   → 25/25 pass
  - ctest -L fixture → 16/16 pass
  - supertonic-cli end-to-end synth produces audible WAV
  - The `backend_is_vk` engine field still flips correctly via the
    registry path (bench reports `backend: Vulkan (device N: <name>)`
    on a desktop Vulkan box per the same registry lookup).

Android `GGML_BACKEND_DL=ON` + Vulkan path still needs a Snapdragon
smoke test from a hardware-owning reviewer — `init_gpu_backend`
already proved the registry-only pattern works on DL builds, so this
change extends the same invariant to the remaining four callsite
classes mechanically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>