Fixing PR Workflow #3

Closed
kapildev421 wants to merge 1 commit into tetherto:master from kapildev421:master

Conversation

@kapildev421


This PR fixes the workflow for tier-based approvals.

@kapildev421

Author

/review

Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 12, 2026
… / vector graph caches

QVAC-18607 follow-up tetherto#3.  Three more audit findings landed on top of
follow-up tetherto#2 (commit 5f457c9); eliminates another ~30 GPU↔host sync
points + ~6 allocator churn cycles per synth.

  F17  Duration scalar-continuation `read_f32` cache.
       Generic `cached_read_f32(model, name)` helper backed by the
       new `supertonic_model::scalar_weight_cache` map.  Replaces
       ~30 backend tensor reads per synth across
       `self_attention`, `ffn_block`, and the
       `duration_sentence_proj_ggml_impl` scalar continuation
       (relpos K/V, conv_o, 4 LN pairs, the 2 FFNs' conv_{1,2}, proj_out,
       predictor layers + activation).  Populated lazily on first
       touch; from the second synth on, each cached entry costs one
       host memcpy instead of a GPU→host sync.

  F18  Text-encoder convnext-front graph cached across synths.
       `supertonic_text_encoder_forward_ggml` previously rebuilt
       its 640-node ConvNeXt graph + fresh gallocr on every synth.
       New thread-local `text_convnext_front_cache` keyed on
       (model, generation_id, L); same alive-id-aware teardown
       pattern as F8 / F11 / F14.

  F19  Vector-estimator front-block graph cached across denoise
       steps.  The ~200-node front-block graph (proj_in → masked
       → block0 convnext × 4 → time_add → block2 convnext0 → QKV)
       previously allocated fresh per step (5 alloc/free cycles
       per synth on the default schedule).  Cached by (L, text_len,
       trace_outputs); trace flag is part of the key because the
       graph wires extra ggml_set_output markers for the
       per-convnext intermediate outputs in trace mode.
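
The lazy host-side cache described in F17 can be illustrated with a minimal sketch. Only the names `cached_read_f32` and `scalar_weight_cache` come from the commit message; `toy_model`, `backend_read`, `backend_reads`, and `make_toy_model` are hypothetical stand-ins, and the real helper presumably wraps `ggml_backend_tensor_get` rather than a `std::function`. The sketch is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the model; the real supertonic_model layout
// is not shown in the commit message.
struct toy_model {
    // Stand-in for a device read (assumption: the real code reads a backend tensor).
    std::function<std::vector<float>(const std::string &)> backend_read;
    // Host copies of small weights, filled lazily on first access.
    std::unordered_map<std::string, std::vector<float>> scalar_weight_cache;
    // Counts simulated device-to-host syncs so the saving is observable.
    std::size_t backend_reads = 0;
};

// Returns the host copy of `name`, touching the backend only on a cache miss.
// References into std::unordered_map stay valid across later insertions.
inline const std::vector<float> & cached_read_f32(toy_model & m, const std::string & name) {
    auto it = m.scalar_weight_cache.find(name);
    if (it == m.scalar_weight_cache.end()) {
        ++m.backend_reads;
        it = m.scalar_weight_cache.emplace(name, m.backend_read(name)).first;
    }
    return it->second;
}

// Builds a toy model whose "device" returns the same two floats for any name.
inline toy_model make_toy_model() {
    toy_model m;
    m.backend_read = [](const std::string &) { return std::vector<float>{1.0f, 2.0f}; };
    return m;
}
```

Calling `cached_read_f32` twice with the same name performs one simulated backend read and returns the same host vector both times, which is the property the F17 structural test is described as checking (the map does not grow on the second call).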

New TDD harness (fixture-bound):

  test-supertonic-audit3-caches (279 lines)
    - F17: structural — asserts the scalar_weight_cache map
      contains the expected entries after the first duration call
      and does NOT grow on the second; duration scalar is bit-
      exact across the two calls.
    - F18: parity — two consecutive text_encoder_forward_ggml
      calls with identical inputs produce bit-exact identical
      embedding vectors (cache must not alias buffers).
    - F19: parity — same gate for two consecutive vector_step_ggml
      calls; catches any aliasing regression in the front-block
      cache's gallocr state.
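
The parity gates for F18 and F19 can be pictured with a small sketch: a cache keyed on (L, text_len, trace_outputs), a step that reuses the cached entry, and a bit-exact comparison of two consecutive calls. Everything here (`toy_graph`, `graph_cache`, `toy_step`, `bit_exact`) is hypothetical; the real cache entries would own a ggml graph and its allocator, and the arithmetic below is only a placeholder for graph compute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical cached "graph"; stands in for a built graph plus its buffers.
struct toy_graph {
    int L;
    int text_len;
    bool trace_outputs;
    std::vector<float> scratch; // output buffer owned by the cache entry
};

using graph_key = std::tuple<int, int, bool>; // (L, text_len, trace_outputs)

struct graph_cache {
    std::map<graph_key, toy_graph> entries;
    int builds = 0; // number of times a graph had to be built

    // Returns the cached graph for this key, building it on a miss.
    toy_graph & get(int L, int text_len, bool trace_outputs) {
        const graph_key key{L, text_len, trace_outputs};
        auto it = entries.find(key);
        if (it == entries.end()) {
            ++builds;
            toy_graph g{L, text_len, trace_outputs,
                        std::vector<float>(static_cast<std::size_t>(L))};
            it = entries.emplace(key, std::move(g)).first;
        }
        return it->second;
    }
};

// Toy step: set inputs, "compute", copy outputs out of the cached buffer.
inline std::vector<float> toy_step(graph_cache & cache, const std::vector<float> & input, int text_len) {
    toy_graph & g = cache.get(static_cast<int>(input.size()), text_len, false);
    for (std::size_t i = 0; i < input.size(); ++i) {
        g.scratch[i] = input[i] * 0.5f + static_cast<float>(text_len);
    }
    return g.scratch; // returned by value, so callers never alias the cache's buffer
}

// Bit-exact equality: compares the raw bytes rather than using a tolerance.
inline bool bit_exact(const std::vector<float> & a, const std::vector<float> & b) {
    return a.size() == b.size() &&
           (a.empty() || std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0);
}
```

Two calls with identical inputs should build once and compare bit-exact. Changing `text_len` or the trace flag produces a different key and therefore a separate build, which mirrors the reason given for including the trace flag in the F19 key.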

Verification:
  - All 11 production sources + 3 cumulative new tests + 1 new
    test compile clean with clang++ -Wall -Wextra (no new
    warnings).
  - Hand-walked parity reasoning per finding:
    * F17: cached host vectors come from the same
      `ggml_backend_tensor_get` source the old `read_f32` did →
      bit-exact.
    * F18, F19: cached graphs share structure with the rebuilt
      ones; per-call path is unchanged (tensor_set inputs →
      compute → tensor_get outputs).  Bit-exact across calls.
  - Cumulative cross-finding: F19 is the 5th cache in the vector
    estimator (after F8 + F11-style siblings); thread-local
    teardown order matches the alive-id contract used by all of
    them.

Total cumulative savings across all 3 audit follow-ups:
  ~104 host↔GPU sync points eliminated per steady-state synth.

Diff:
  6 sources changed, 1 new test, 1 CMakeLists update.
  +327 / -172 in src/ + CMakeLists + internal header.
  +279 new test.

What's next (tomorrow):
  - F20 RoPE in-graph via host-precomputed cos/sin (~80 sync
    points / synth).  Needs device parity gate.
  - Smoke-run Phase 2D against a real synth on OpenCL; steer F7
    vocoder layout flip vs remaining audit candidates from the
    CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request May 15, 2026
… / vector graph caches

freddy311082 added a commit that referenced this pull request May 29, 2026
Resolves 37 add/add conflicts that accumulated since the last
master merge (May 7). Master moved 326 commits forward, mainly
landing parakeet-cpp (TDT/EOU/Sortformer/AOSC), the ggml-backend
registry refactor (`backend_selection.{h,cpp}`, registry-only
device walk replacing the per-backend `#ifdef GGML_USE_<X>`
cascades), Android `GGML_BACKEND_DL=ON` plumbing, and the
`backends_dir` / `opencl_cache_dir` Engine knobs.

Resolution strategy:

- parakeet-cpp/ (19 files): taken from master verbatim. The PR
  branch only carried the original port (commits d7ab516 /
  c6c3fd7 / 761eca0, all <= May 7); master has 13 newer
  commits including TDT/EOU/Sortformer v2.1 + AOSC and the
  word-start signal already integrated. Nothing of the PR was
  lost on this side.

- .github/CODEOWNERS: taken from master (team reorg to
  `qvac-internal-dev` / `qvac-internal-merge`).

- tts-cpp/ stale-from-initial-drop (7 files: voice_encoder,
  t3_mtl, s3tokenizer, mel_extract_stft, main, campplus,
  campplus_forward.inc): taken from master. Their only PR
  commit is the original `ef840d5c Add tts-cpp files` drop;
  master has since rewritten them for the registry refactor.

- tts-cpp/ mirror-only (4 files: supertonic/engine.h,
  supertonic_engine, supertonic_gguf, chatterbox_tts): taken
  from master. The only commits the PR authored on these files
  mirror pre-existing fixes from chatterbox.cpp that are
  already on master.

- tts-cpp/CMakeLists.txt: hybrid merge. Master's Android
  dynamic-backend stack, registry-only backend-defs interface
  (with `src/backend_selection.cpp` in the source list), and
  `target_compile_definitions(test-metal-ops PRIVATE
  GGML_USE_METAL)` retained. PR's `src/text_preprocess.cpp`
  source entry, MeCab/Cangjie find_library block (PRIVATE
  include per gianni-cor review), and 23-language multilingual
  test matrix retained.

- tts-cpp/include/tts-cpp/chatterbox/engine.h: master's
  updated `n_gpu_layers` doc (Adreno-tier policy) and new
  `backends_dir` / `opencl_cache_dir` fields retained. PR's
  `mecab_dict_path` / `cangjie_tsv_path` fields retained.

- tts-cpp/src/mtl_tokenizer.{cpp,h}: PR's `<mutex>` +
  `text_preprocess.h` includes, 23 supported_languages,
  preprocess_japanese / preprocess_chinese helpers with
  call_once-cached MeCab tagger + Cangjie table,
  apply_language_preprocessing dispatch, and
  `set_mecab_dict_path` / `set_cangjie_tsv_path` setters
  (with already-initialised warn) retained. Master's
  `// ---- Encode ----` divider kept.

- tts-cpp/src/chatterbox_engine.cpp: master's `#include
  "backend_selection.h"` and `backends_dir` /
  `opencl_cache_dir` wiring retained. PR's per-Engine
  `mtl_tokenizer::set_mecab_dict_path` /
  `set_cangjie_tsv_path` calls retained.

- tts-cpp/src/chatterbox_cli.cpp: master's removal of the
  per-backend `#include "ggml-{cuda,metal,vulkan}.h"`
  cascade (registry-only refactor) and the new
  voice-cloning backend comment retained. PR's
  `--mecab-dict` / `--cangjie-tsv` flags (declaration, help,
  parsing, and per-Engine setter call) retained. PR's RAII
  `thread_join_guard` on the s3gen preload thread retained
  (addresses GustavoA1604 review #3: std::terminate hazard
  during stack unwind). PR's 2-token MTL early-stop with
  `kMtlMinTokensBeforeCadence = 60` guard and
  `generated.resize(n - 1)` retained (addresses
  GustavoA1604 review #2: previous over-aggressive
  `resize(n - 2)` trimmed a legitimate token); the log line
  was updated to surface the repeated token id.
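
The `std::terminate` hazard mentioned for the s3gen preload thread arises because destroying a still-joinable `std::thread` terminates the program, which can happen when an exception unwinds past the thread object. A minimal sketch of an RAII join guard follows. Only the name `thread_join_guard` comes from the commit message; its actual shape in the PR is not shown, and `preload_then_fail` is a hypothetical demo.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <thread>

// Joins the referenced thread when the guard leaves scope, including during
// stack unwinding, so the std::thread is never destroyed while joinable.
class thread_join_guard {
public:
    explicit thread_join_guard(std::thread & t) : t_(t) {}
    ~thread_join_guard() {
        if (t_.joinable()) {
            t_.join();
        }
    }
    thread_join_guard(const thread_join_guard &) = delete;
    thread_join_guard & operator=(const thread_join_guard &) = delete;

private:
    std::thread & t_;
};

// Starts a worker, then throws while the worker may still be running.
// Returns whether the worker had finished by the time the exception was caught.
inline bool preload_then_fail() {
    std::atomic<bool> done{false};
    try {
        std::thread worker([&done] { done.store(true); });
        thread_join_guard guard(worker); // declared after worker, destroyed before it
        throw std::runtime_error("simulated failure after preload started");
    } catch (const std::runtime_error &) {
        // The guard joined the worker during unwinding, before worker was destroyed.
    }
    return done.load();
}
```

The guard must be declared after the thread it protects so that it is destroyed first. Without it, the throw above would destroy a joinable `std::thread` and terminate the process.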

PR-only files (no conflict): tts-cpp/src/text_preprocess.{h,cpp},
tts-cpp/scripts/build_mecab_dict.py,
tts-cpp/scripts/build_cangjie_tsv.py,
tts-cpp/test/test_multilingual_{synth,asr}.cpp are all
preserved as-is by the merge.
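
The `call_once`-cached tagger and the setters that warn when called after initialisation, mentioned in the mtl_tokenizer entry above, follow a common pattern that can be sketched as below. `toy_tagger` and `tagger_holder` are hypothetical; the real code loads a MeCab tagger and a Cangjie table, and neither its API nor its warning text is shown in the commit message.

```cpp
#include <cassert>
#include <iostream>
#include <mutex>
#include <string>

// Hypothetical stand-in for an expensive resource such as a dictionary-backed tagger.
struct toy_tagger {
    std::string dict_path;
};

class tagger_holder {
public:
    // Records the path to use on first initialisation. Once the tagger exists,
    // the call only warns and returns false, since the path can no longer apply.
    bool set_dict_path(const std::string & path) {
        std::lock_guard<std::mutex> lock(mu_);
        if (initialised_) {
            std::cerr << "warning: tagger already initialised; ignoring new dict path\n";
            return false;
        }
        dict_path_ = path;
        return true;
    }

    // Builds the tagger exactly once, even if called from several threads.
    const toy_tagger & get() {
        std::call_once(once_, [this] {
            std::lock_guard<std::mutex> lock(mu_);
            tagger_.dict_path = dict_path_;
            initialised_ = true;
            ++init_count_;
        });
        return tagger_;
    }

    int init_count() const {
        std::lock_guard<std::mutex> lock(mu_);
        return init_count_;
    }

private:
    std::once_flag once_;
    mutable std::mutex mu_;
    std::string dict_path_;
    toy_tagger tagger_;
    bool initialised_ = false;
    int init_count_ = 0;
};
```

Setting the path before the first `get()` takes effect; a later `set_dict_path` is rejected with a warning and the tagger keeps its original path.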

Co-authored-by: Cursor <cursoragent@cursor.com>

4 participants