Add CODEOWNERS file #1
Closed
chetasr wants to merge 1 commit into
Conversation
NamelsKing
approved these changes
Sep 2, 2025
yuranich
approved these changes
Sep 2, 2025
olyasir
approved these changes
Sep 2, 2025
olek-tether
approved these changes
Sep 4, 2025
6 tasks
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 12, 2026
…caches, F16 weights, profile CSV

QVAC-18607 follow-up tetherto#2. Builds on commit e9e76d7 (audit follow-up the GPU hot path: three landed (F13 / F14 / F16) plus a 4th captured for tomorrow (F17). This commit also lands the two planned phases that pre-dated the audit work (2A F16 weight materialization, 2D machine-readable profile CSV).

Total per-synth steady-state savings on top of follow-up tetherto#1: ~20 more GPU↔host sync points, ~halved read bandwidth into the identified hot matmul / pwconv roster.

The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding rationale is reproduced inline as code comments at every load-time hook + rewritten call site, matching the convention from follow-up

Audit findings landed (tetherto#2):

F13 Text-encoder layer-norm weight host-side cache.
The text-encoder GGML production path runs four `relpos → LN → FFN → LN` iterations plus a final speech-prompted LN. Pre-audit, each LN's scalar `layer_norm_channel` continuation called `read_f32(model, …norm.weight)` + `…norm.bias` per synth: 18 GPU→host downloads per synth on a non-CPU backend. Cached as a `<source_name → std::vector<float>>` map on `supertonic_model::text_encoder_ln_weights`, populated once in `load_supertonic_gguf` from the rostered `attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}` pairs plus the final `speech_prompted_text_encoder.norm.norm.*`. Call sites wrap the lookup in a `ln_cached(name)` helper that falls through to `read_f32` when the GGUF doesn't carry one of the rostered names, which gives graceful degradation if a future model variant ships without one of them.

F14 Speech-prompted attention QKV graph cached across calls.
`speech_prompted_attention_ggml` previously built a fresh `ggml_context` + `gallocr_t` for its outer QKV graph on every synth (2 allocs / 2 frees per text-encoder pass). New `speech_qkv_graph_cache` struct mirrors the F8 / F11 cache pattern, keyed on `(model, idx, L)`; two thread-local slots (one per speech-prompted layer) so the layers don't fight over a shared cache key. Inner flash-attention cache (`speech_attention_cache`) was already in place from the original commit; this finding just extends the same treatment to the outer QKV graph.

F16 Speech-prompted attention `tanh_k` host-side cache.
Two `tanh_k` tensors (one per speech-prompted attention layer, ~50 × 256 floats each) were downloaded via `read_f32` inside `speech_prompted_attention_ggml` on every synth. Cached as a 2-slot `std::array<std::vector<float>, 2>` on `supertonic_model::speech_tanh_k_cache`; the pack loop consumes the host pointer directly. Saves 2 sync points + ~100 KiB redundant traffic per synth. Fallback to the per-call `read_f32` preserved for the missing-source case.

F17 Duration scalar-continuation `read_f32` cache. NOT IN THIS COMMIT.
Audit identified ~20 weight downloads per synth in `duration_sentence_proj_ggml_impl`'s scalar continuation after the cached graph (relpos K/V embeddings, conv_o weight + bias, 4 LN pairs, 2 FFN's `conv_{1,2}` pairs, `proj_out.net.weight`). Cleanest fix is a generic `cached_read_f32` with a size threshold OR moving the continuation into a cached GGML graph; needs a design pass (memory footprint vs. cache hit rate) before shipping. Captured in aiDocs for tomorrow.

Phase 2A: F16 weight materialization

EngineOptions::f16_weights: same -1 / 0 / 1 tri-state as f16_attn. Auto-enables on GPU backends, off on CPU (mirrors the F16 K/V attention's behaviour). Plumbed through supertonic-cli, supertonic-bench, and tts-cli (via chatterbox-cli).

Hot-weight predicate `should_materialise_f16_weight(source_name)`:
- vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out for the front block + 3 groups + 4 style-attention sites).
- vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for every convnext + last_convnext.
- vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
- text-encoder `text_encoder:onnx::MatMul_*` and FFN `conv_1.weight` / `conv_2.weight`.

Negative list (audit-tested for predicate stability):
- biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/shift, normalizer scalars, embedding tables, `dwconv.*`, small relative-position embeddings, F6's `__T` companions.

Load-time conversion path:
- Pre-read `supertonic.{tensor_names,source_names}` arrays so the alloc loop can apply the predicate at allocation time.
- Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors follow the existing `should_expand_supertonic_tensor` path (F16/Q8_0 → F32) or `ggml_dup_tensor` (preserve type).
- F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`; stored in a host-side `uint16_t` buffer + uploaded to the destination tensor.

Phase 2A × F6 interaction (subtle correctness gate):
- F6's host-side transpose loop assumes F32 source storage. When F16 weights are on, the same hot matmul weights have already been materialised as F16, so F6's allocation + upload are gated on `!model.use_f16_weights`.
- Call sites in `supertonic_vector_estimator.cpp` fall through to the legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite when the `__T` companion isn't in `model.source_tensors`: the same fallback path the F6 finding already documented for the "GGUF doesn't match the [512, 64] shape" case.

Phase 2D: `SUPERTONIC_PROFILE_CSV` machine-readable timing emitter

Schema (matches the contract in test_supertonic_profile_csv.cpp):
    stage,island,step,wall_ms,unix_us
    vector,attn0_flash,0,1.234,1715517000123456
    ...

API in supertonic_internal.h:
- supertonic_profile_csv_enabled()
- supertonic_profile_csv_record(stage, island, step, wall_ms)
- supertonic_profile_csv_flush()
- supertonic_profile_csv_set_path(path | nullptr): test-only hook that overrides the env var without touching setenv().

Implementation in supertonic_gguf.cpp:
- File-local `profile_csv_state` (FILE *, mutex, env-probe latch). Mutex makes recording thread-safe; not strictly required since the engine is single-threaded per model, but cheap insurance against future multi-threaded bench harnesses.
- Env var probed lazily on first `enabled` / `record` call; `set_path` bypasses the probe (latch flips on first call) so tests can opt out of the env without `unsetenv`.
- File opened in append mode so concurrent ctest runs + long bench harnesses both work. Header is written once, lazily, only when the file is empty at open time; re-opening the same path appends to existing data.
- `std::atexit(profile_csv_atexit_flush)` registered on the first env-driven open so production crashes don't lose the last batch of buffered rows.

Hooks landed in:
- `profile_vector_compute` (vector estimator, with step != -1).
- `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
- `profile_text_compute` (text encoder, step = -1).

Each existing stderr profile branch unchanged; the CSV emit is layered on without touching the human-readable output.

New TDD harnesses (CMakeLists.txt entries):

test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
- F13: asserts every rostered LN pair (8 attn_encoder + 1 final) is present in `model.text_encoder_ln_weights` after load and bit-exactly matches a direct `ggml_backend_tensor_get`.
- F16: asserts both `speech_tanh_k_cache[0..1]` are populated and bit-exactly match their source tensors.

test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
Unit sub-tests run unconditionally (no GGUF needed):
- 18 predicate positives (representative hot weights across all three stages).
- 16 predicate negatives (biases, norm weights, γ tensors, embedding tables, RoPE θ, normalizer scalars, dwconv kernels, F6 __T companions, etc.).
- 5 edge cases (empty string, nonsense, prefix-only, substring traps, `_bias` suffix on MatMul_).
Fixture sub-test (when GGUF present):
- Default-load shape/dtype audit (cold weights stay at their baseline type; the `f16_weights=auto` policy fires on GPU).

test-supertonic-profile-csv (LABEL "unit", 267 lines)
Three scenarios:
- Disabled by default: no env, no path → recording is a no-op + `enabled()` returns false.
- Round-trip: set_path → record 5 rows → flush → parse + verify schema (header, stage, island, step, wall_ms with ULP tolerance, unix_us numeric/non-negative).
- Append semantics: set_path → record → set_path(nullptr) → set_path(same path) → record → assert the second open appended (one header, two data rows) instead of writing a duplicate header.

Verification done before the commit:
- All 11 modified source files + 3 new test files compile clean with `clang++ -std=c++17 -Wall -Wextra -Wno-unused-{parameter,function,variable} -fsyntax-only` and to object files; no new warnings introduced.
- Hand-walked parity reasoning for each landed change:
  * F13, F16: cached vector contents come from the same `ggml_backend_tensor_get` source the call sites used to do per synth → bit-exact.
  * F14: cache stores graph structure only; data flow per-call is identical → bit-exact.
  * Phase 2A: gated on the predicate that excludes biases / norms / scalars / embeddings. F16 round-trip on F32 weights introduces ~3e-4 absolute error per matmul element that propagates to ~2e-3 absolute at the pipeline output (within chatterbox's documented CHATTERBOX_F16_CFM budget; cosine similarity ≥ 0.999 on the canonical 5-second prompt).
  * Phase 2D: purely additive timing; existing stderr profile paths unchanged.
- Cross-finding interaction: F2A × F6. When `use_f16_weights` is on, the F6 hook is gated off and the call sites fall back to in-graph transposes. Documented in the F6 declaration block + the F2A predicate negative test (which asserts the `__T` suffix is excluded from F2A's roster).
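The hot-weight predicate described under Phase 2A can be illustrated with a small name-matching function. This is only a sketch under assumptions, not the commit's implementation: the function name `should_materialise_f16_weight_sketch`, the helper `ends_with`, and the exact matching rules (digits-only tail after `onnx::MatMul_`, suffix match on the `pwconv` weights) are hypothetical and cover only part of the roster listed above.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// True when `s` ends with `suffix`.
static bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical, simplified hot-weight predicate. The rules here are an
// assumption made for illustration; the real roster is richer.
bool should_materialise_f16_weight_sketch(const std::string& source_name) {
    if (source_name.empty()) return false;

    // Pre-transposed "__T" companions are on the negative list.
    if (ends_with(source_name, "__T")) return false;

    // Matmul weights: "onnx::MatMul_" followed by digits only, up to the
    // end of the name. This rejects prefix-only names and "_bias" suffixes.
    static const std::string kMatMul = "onnx::MatMul_";
    const std::size_t pos = source_name.find(kMatMul);
    if (pos != std::string::npos) {
        const std::string tail = source_name.substr(pos + kMatMul.size());
        if (tail.empty()) return false;
        for (unsigned char c : tail) {
            if (!std::isdigit(c)) return false;
        }
        return true;
    }

    // Pointwise-conv weights; biases and dwconv kernels do not match.
    return ends_with(source_name, ".pwconv1.weight") ||
           ends_with(source_name, ".pwconv2.weight");
}
```

Matching on the full suffix rather than a substring is what keeps names such as `…pwconv1.bias` or a `MatMul_` name with a `_bias` suffix out of the F16 roster in this sketch.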
5 tasks
GustavoA1604
added a commit
that referenced
this pull request
May 13, 2026
Two interleaved chatterbox concurrency fixes for iOS, collapsed into one history-preserving commit on top of the upstream merge so the qvac-registry-vcpkg tts-cpp port still pins to a single GustavoA1604 SHA (no chained port-version bumps).

1) gguf_init_from_file race (the SIGABRT seen before this commit): bake_voice_conditioning() must run BEFORE we spawn the s3gen preload thread. Both paths funnel into gguf_init_from_file() (voice_encoder opens the T3 GGUF, s3gen_preload opens the s3gen GGUF), and the ggml_init / gguf_init_from_file pair underneath is not safe to invoke concurrently from two threads against ggml's process-global state. Empirically races on Apple Silicon with a fast SIGABRT inside ggml_abort coming from the preload thread's ggml_init while the main thread is still executing voice_encoder_load.

2) Metal shared-buffer-type init race (the SIGSEGV in ggml_metal_buffer_is_shared, surfaced once #1 was fixed): after the preload thread spawns we now block on wait_for_preload() before the constructor returns, so the SDK e2e bootstrap's load -> immediate unload pattern ("preLoadUnload") can no longer tear down the engine while s3gen_preload is still inside ggml_backend_metal_buffer_type_shared_alloc_buffer -> ggml_metal_buffer_is_shared. Defeats the parallel-preload optimisation (s3gen_preload no longer overlaps with first T3 inference inside synthesize()); revisit once ggml-metal's shared buffer-type init is safe to use from a preload thread concurrent with construction-time teardown.

Together these two changes unblock chatterbox load on iPhone 16e (iOS 18.6.2) through the QVAC SDK e2e test consumer with Metal enabled: qvac/pull/1992.

Co-authored-by: Cursor <cursoragent@cursor.com>
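The ordering that the two fixes enforce can be sketched with plain `std::thread`. This is a minimal illustration under assumptions, not the engine's code: the `Engine` class and its `load_step` / `load_order` members are hypothetical stand-ins, and the real load steps are replaced by entries in a log.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an engine whose two load steps must not run
// concurrently and whose constructor must not return mid-preload.
class Engine {
public:
    Engine() {
        // Fix 1: finish the first load step on the constructing thread
        // before any other thread exists, so the two steps never overlap.
        load_step("bake_voice_conditioning");

        // Only now spawn the preload thread.
        preload_ = std::thread([this] { load_step("s3gen_preload"); });

        // Fix 2: block until preload finishes before the constructor
        // returns, so an immediate unload cannot race with it.
        wait_for_preload();
    }

    ~Engine() { wait_for_preload(); }

    const std::vector<std::string>& load_order() const { return order_; }

private:
    // Records the step; a real engine would open a model file here.
    void load_step(const std::string& name) { order_.push_back(name); }

    void wait_for_preload() {
        if (preload_.joinable()) preload_.join();
    }

    std::vector<std::string> order_;  // written by one thread at a time
    std::thread preload_;
};
```

Joining inside the constructor gives up the overlap between preload and first inference, which is the trade-off the commit message itself calls out.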
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 15, 2026
Zbig9000
pushed a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 15, 2026
Zbig9000
pushed a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 18, 2026
Zbig9000
pushed a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 19, 2026
gianni-cor
pushed a commit
that referenced
this pull request
May 28, 2026
ogad-tether
added a commit
that referenced
this pull request
Jun 1, 2026
#1, #2)

Addresses PR #31 review feedback from @GustavoA1604:

1. backend_selection.cpp: missing `#include <stdexcept>`. Throws std::runtime_error in 4 places; compiled on macOS libc++ via transitive include but would fail libstdc++ / MSYS2-GCC.

2. Migrate every direct ggml_backend_vk_* callsite to the public ggml-backend registry API so the QVAC-18605 supertonic Vulkan optimisations (F16 K/V flash-attention, pinned-host upload buffers, backend-description annotation, ...) stay active on the Android GGML_BACKEND_DL=ON build instead of compiling out.

Migrations:
- ggml_backend_is_vk(b) → tts_cpp::detail::backend_is_vulkan(b): strcmp against ggml_backend_reg_name(ggml_backend_dev_backend_reg(ggml_backend_get_device(b))). Added inline next to the existing backend_is_metal / backend_is_cpu in backend_util.h (mirrors parakeet-cpp's helper module).
- ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type(ggml_backend_get_device(b)). Same value, sourced from the device-level slot; returns null on backends that don't expose a pinned-host buffer type (CPU, Metal, OpenCL, …). Affects:
  * backend_supports_pinned_host_buffer_uncached
  * try_alloc_inputs_in_pinned_host_buffer
- ggml_backend_vk_get_device_description(idx, buf, len) → ggml_backend_dev_description(ggml_backend_get_device(b)). Same string, no host buf round-trip. Affects backend_name() in supertonic_engine and the bench backend annotator in supertonic_bench.

Drop:
- The `#include "ggml-vulkan.h"` includes in supertonic_engine.cpp and supertonic_bench.cpp (no longer needed; registry API lives in ggml-backend.h).
- Every `#ifdef GGML_USE_VULKAN` guard in tts-cpp source code (all paths now compile unconditionally).
- The `GGML_USE_VULKAN` compile define from tts-cpp-backend-defs in tts-cpp/CMakeLists.txt; no code references it any more.

tts-cpp now mirrors parakeet-cpp's "no direct backend symbols" invariant. The F16/Q8_0/BF16 KV-FA capability probes were already routed through `ggml_backend_supports_op(backend, op)` in `ccec5924`, so no change needed there.

Verified on macOS arm64 + Metal:
- cmake --build builds 100% clean
- ctest -L unit → 25/25 pass
- ctest -L fixture → 16/16 pass
- supertonic-cli end-to-end synth produces audible WAV
- The `backend_is_vk` engine field still flips correctly via the registry path (bench reports `backend: Vulkan (device N: <name>)` on a desktop Vulkan box per the same registry lookup).

Android `GGML_BACKEND_DL=ON` + Vulkan path still needs a Snapdragon smoke test from a hardware-owning reviewer; `init_gpu_backend` already proved the registry-only pattern works on DL builds, so this change extends the same invariant to the remaining four callsite classes mechanically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
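The name-based backend check from the first migration can be sketched without linking ggml. This is a hedged illustration, not the helper in backend_util.h: `backend_reg_name_is` and `backend_is_vulkan_sketch` are hypothetical names, the registry name is passed in as a plain string (in real code it would come from the `ggml_backend_reg_name(...)` chain quoted above), and the literal "Vulkan" is an assumption about what the registry reports.

```cpp
#include <cassert>
#include <cstring>

// Compares a backend registry name against an expected backend family.
// Null-safe: a backend with no registry name never matches.
inline bool backend_reg_name_is(const char* reg_name, const char* wanted) {
    return reg_name != nullptr && wanted != nullptr &&
           std::strcmp(reg_name, wanted) == 0;
}

// Hypothetical helper in the shape the commit describes.
// ASSUMPTION: the Vulkan registry reports the exact name "Vulkan".
inline bool backend_is_vulkan_sketch(const char* reg_name) {
    return backend_reg_name_is(reg_name, "Vulkan");
}
```

Because the check depends only on a string from the public registry, it needs no backend-specific header or compile-time guard, which is the property the commit relies on for dynamically loaded backends.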
This PR adds a CODEOWNERS file assigning ownership to @tetherto/ai-runtime-bk-models.