Fixing PR Workflow #3
Closed
kapildev421 wants to merge 1 commit into
olyasir
approved these changes
Sep 16, 2025
kartiksain
approved these changes
Sep 17, 2025
Author
/review
9 tasks
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 12, 2026
… / vector graph caches

QVAC-18607 follow-up tetherto#3. Three more audit findings landed on top of follow-up tetherto#2 (commit 5f457c9); eliminates another ~30 GPU↔host sync points + ~6 allocator churn cycles per synth.

F17 Duration scalar-continuation `read_f32` cache.
Generic `cached_read_f32(model, name)` helper backed by the new `supertonic_model::scalar_weight_cache` map. Replaces ~30 backend tensor reads per synth across `self_attention`, `ffn_block`, and the `duration_sentence_proj_ggml_impl` scalar continuation (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor layers + activation). Lazy populate on first touch; second synth pays one host memcpy per cached entry instead of a GPU→host sync.

F18 Text-encoder convnext-front graph cached across synths.
`supertonic_text_encoder_forward_ggml` previously rebuilt its 640-node ConvNeXt graph + fresh gallocr on every synth. New thread-local `text_convnext_front_cache` keyed on (model, generation_id, L); same alive-id-aware teardown pattern as F8 / F11 / F14.

F19 Vector-estimator front-block graph cached across denoise steps.
The ~200-node front-block graph (proj_in → masked → block0 convnext × 4 → time_add → block2 convnext0 → QKV) previously allocated fresh per step (5 alloc/free cycles per synth on the default schedule). Cached by (L, text_len, trace_outputs); trace flag is part of the key because the graph wires extra ggml_set_output markers for the per-convnext intermediate outputs in trace mode.

New TDD harness (fixture-bound): test-supertonic-audit3-caches (279 lines)
- F17 (structural): asserts the scalar_weight_cache map contains the expected entries after the first duration call and does NOT grow on the second; duration scalar is bit-exact across the two calls.
- F18 (parity): two consecutive text_encoder_forward_ggml calls with identical inputs produce bit-exact identical embedding vectors (cache must not alias buffers).
- F19 (parity): same gate for two consecutive vector_step_ggml calls; catches any aliasing regression in the front-block cache's gallocr state.

Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
  * F17: cached host vectors come from the same `ggml_backend_tensor_get` source the old `read_f32` did → bit-exact.
  * F18, F19: cached graphs share structure with the rebuilt ones; per-call path is unchanged (tensor_set inputs → compute → tensor_get outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector estimator (after F8 + F11-style siblings); thread-local teardown order matches the alive-id contract used by all of them.

Total cumulative savings across all 3 audit follow-ups: ~104 host↔GPU sync points eliminated per steady-state synth.

Diff: 6 sources changed, 1 new test, 1 CMakeLists update. +327 / -172 in src/ + CMakeLists + internal header. +279 new test.

What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points / synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7 vocoder layout flip vs remaining audit candidates from the CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
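
The lazy-populate behaviour described for F17 can be illustrated with a minimal sketch. This is not the code from the commit: `scalar_cache_sketch`, `backend_reader`, and the `backend_reads` counter are hypothetical names introduced here, and the backend read is simulated by a plain callback rather than a real `ggml_backend_tensor_get` call.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the expensive GPU->host tensor read.
// The real helper reads from a ggml backend; this sketch only takes a callback.
using backend_reader = std::function<std::vector<float>(const std::string &)>;

// Assumed shape of a name-keyed scalar cache (illustrative only).
struct scalar_cache_sketch {
    std::unordered_map<std::string, std::vector<float>> entries;
    backend_reader read_backend;
    int backend_reads = 0;  // counts simulated sync points

    // Return the cached host copy; touch the backend only on first use of a name.
    const std::vector<float> & cached_read_f32(const std::string & name) {
        auto it = entries.find(name);
        if (it == entries.end()) {
            ++backend_reads;  // first touch: one simulated GPU->host sync
            it = entries.emplace(name, read_backend(name)).first;
        }
        return it->second;    // later calls: host memory only
    }
};
```

Under these assumptions, a second lookup of the same name leaves both `backend_reads` and the map size unchanged, which is the property the F17 structural test is described as checking.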
5 tasks
Zbig9000
added a commit
to Zbig9000/qvac-ext-lib-whisper.cpp
that referenced
this pull request
May 15, 2026
… / vector graph caches
freddy311082
added a commit
that referenced
this pull request
May 29, 2026
Resolves 37 add/add conflicts that accumulated since the last
master merge (May 7). Master moved 326 commits forward, mainly
landing parakeet-cpp (TDT/EOU/Sortformer/AOSC), the ggml-backend
registry refactor (`backend_selection.{h,cpp}`, registry-only
device walk replacing the per-backend `#ifdef GGML_USE_<X>`
cascades), Android `GGML_BACKEND_DL=ON` plumbing, and the
`backends_dir` / `opencl_cache_dir` Engine knobs.
Resolution strategy:
- parakeet-cpp/ (19 files): taken from master verbatim. The PR
branch only carried the original port (commits d7ab516 /
c6c3fd7 / 761eca0, all <= May 7); master has 13 newer
commits including TDT/EOU/Sortformer v2.1 + AOSC and the
word-start signal already integrated. Nothing of the PR was
lost on this side.
- .github/CODEOWNERS: taken from master (team reorg to
`qvac-internal-dev` / `qvac-internal-merge`).
- tts-cpp/ stale-from-initial-drop (7 files: voice_encoder,
t3_mtl, s3tokenizer, mel_extract_stft, main, campplus,
campplus_forward.inc): taken from master. Their only PR
commit is the original `ef840d5c Add tts-cpp files` drop;
master has since rewritten them for the registry refactor.
- tts-cpp/ mirror-only (4 files: supertonic/engine.h,
supertonic_engine, supertonic_gguf, chatterbox_tts): taken
from master. The PR's only authored commits on these mirror
pre-existing fixes from chatterbox.cpp that are already on
master.
- tts-cpp/CMakeLists.txt: hybrid merge. Master's Android
dynamic-backend stack, registry-only backend-defs interface
(with `src/backend_selection.cpp` in the source list), and
`target_compile_definitions(test-metal-ops PRIVATE
GGML_USE_METAL)` retained. PR's `src/text_preprocess.cpp`
source entry, MeCab/Cangjie find_library block (PRIVATE
include per gianni-cor review), and 23-language multilingual
test matrix retained.
- tts-cpp/include/tts-cpp/chatterbox/engine.h: master's
updated `n_gpu_layers` doc (Adreno-tier policy) and new
`backends_dir` / `opencl_cache_dir` fields retained. PR's
`mecab_dict_path` / `cangjie_tsv_path` fields retained.
- tts-cpp/src/mtl_tokenizer.{cpp,h}: PR's `<mutex>` +
`text_preprocess.h` includes, 23 supported_languages,
preprocess_japanese / preprocess_chinese helpers with
call_once-cached MeCab tagger + Cangjie table,
apply_language_preprocessing dispatch, and
`set_mecab_dict_path` / `set_cangjie_tsv_path` setters
(with already-initialised warn) retained. Master's
`// ---- Encode ----` divider kept.
- tts-cpp/src/chatterbox_engine.cpp: master's `#include
"backend_selection.h"` and `backends_dir` /
`opencl_cache_dir` wiring retained. PR's per-Engine
`mtl_tokenizer::set_mecab_dict_path` /
`set_cangjie_tsv_path` calls retained.
- tts-cpp/src/chatterbox_cli.cpp: master's removal of the
per-backend `#include "ggml-{cuda,metal,vulkan}.h"`
cascade (registry-only refactor) and the new
voice-cloning backend comment retained. PR's
`--mecab-dict` / `--cangjie-tsv` flags (declaration, help,
parsing, and per-Engine setter call) retained. PR's RAII
`thread_join_guard` on the s3gen preload thread retained
(addresses GustavoA1604 review #3: std::terminate hazard
during stack unwind). PR's 2-token MTL early-stop with
`kMtlMinTokensBeforeCadence = 60` guard and
`generated.resize(n - 1)` retained (addresses
GustavoA1604 review #2: previous over-aggressive
`resize(n - 2)` trimmed a legitimate token); the log line
was updated to surface the repeated token id.
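
The `std::terminate` hazard mentioned for the s3gen preload thread comes from destroying a still-joinable `std::thread` during stack unwinding. A minimal sketch of an RAII join guard follows, assuming nothing about the PR's actual `thread_join_guard`; the names `thread_join_guard_sketch` and `preload_then_fail` are hypothetical and exist only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <thread>

// Joins the referenced thread on scope exit, including exits caused by an exception.
class thread_join_guard_sketch {
public:
    explicit thread_join_guard_sketch(std::thread & t) : t_(t) {}
    ~thread_join_guard_sketch() {
        if (t_.joinable()) {
            t_.join();
        }
    }
    thread_join_guard_sketch(const thread_join_guard_sketch &) = delete;
    thread_join_guard_sketch & operator=(const thread_join_guard_sketch &) = delete;

private:
    std::thread & t_;
};

// Starts a worker, then fails. Without the guard, unwinding would destroy a
// joinable std::thread and call std::terminate. Returns whether the worker
// had finished by the time the exception was caught.
inline bool preload_then_fail(std::atomic<bool> & done) {
    try {
        std::thread worker([&done] { done = true; });
        thread_join_guard_sketch guard(worker);  // destroyed before `worker`
        throw std::runtime_error("simulated failure after preload started");
    } catch (const std::runtime_error &) {
        return done.load();
    }
}
```

The guard is declared after the thread so that it is destroyed first; by the time the thread object's own destructor runs, the thread has already been joined.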
PR-only files (no conflict): tts-cpp/src/text_preprocess.{h,cpp},
tts-cpp/scripts/build_mecab_dict.py,
tts-cpp/scripts/build_cangjie_tsv.py,
tts-cpp/test/test_multilingual_{synth,asr}.cpp are all
preserved as-is by the merge.
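
The `call_once`-cached initialisation with an "already initialised" warning on the path setters, as described in the mtl_tokenizer entry above, can be sketched roughly as below. This is an assumption-laden illustration: `lazy_dict_sketch` and its members are hypothetical, and a string stands in for the MeCab tagger or Cangjie table.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical holder for a lazily built resource configured by a path.
struct lazy_dict_sketch {
    std::once_flag once;
    std::mutex path_mutex;
    std::string dict_path;
    std::string loaded_from;   // stand-in for the built resource
    bool initialised = false;
    int load_count = 0;

    // Returns false when the resource already exists, so the caller can log a
    // warning instead of silently ignoring the new path.
    bool set_dict_path(const std::string & p) {
        std::lock_guard<std::mutex> lock(path_mutex);
        if (initialised) {
            return false;
        }
        dict_path = p;
        return true;
    }

    // Builds the resource exactly once, even under concurrent first calls.
    const std::string & get() {
        std::call_once(once, [this] {
            std::lock_guard<std::mutex> lock(path_mutex);
            loaded_from = dict_path;
            initialised = true;
            ++load_count;
        });
        return loaded_from;
    }
};
```

In this sketch a path set after the first `get()` is rejected rather than applied, which mirrors the warn-on-late-setter behaviour the commit message mentions.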
Co-authored-by: Cursor <cursoragent@cursor.com>
This PR fixes the workflow for tier-based approvals.