added pullreques:write permission to qvac-cli workflow#7
Merged
Conversation
3 tasks
GustavoA1604
added a commit
that referenced
this pull request
May 7, 2026
Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments. - #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`. - #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list). - #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again. - #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise. - #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but- not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values). - #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run. - #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false". - #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now. - #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable. Nits: - 'use strict' added to addonLogging.js (matches every other .js). - node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js. - ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead. Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean. Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604
added a commit
that referenced
this pull request
May 11, 2026
…#1983) * feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp) New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg). API-compatible with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream consumers can swap backends without touching orchestration code. ## Scope * First iteration. Supports Chatterbox **English** only. Chatterbox multilingual, LavaSR enhancer, Supertonic engine, and streaming are out of scope and remain in `@qvac/tts-onnx`. They'll land alongside the evolution of qvac-tts.cpp. * Native backend is the static `qvac-tts` library from the QVAC vcpkg registry (`ports/tts-cpp`, baseline `2026-04-21`). No ONNX Runtime dependency. ## JS surface * `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as `ONNXTTS`: `run` / `runStream` / `runStreaming` / `reload` / `unload` / `destroy`. * `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` / `files.s3genModel` override the defaults. * Options: `referenceAudio`, `voiceDir` (baked profile), `seed`, `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for the upcoming streaming flags (`streamChunkTokens`, `streamFirstChunkTokens`, `cfmSteps`). * Shared reusable lib code (`lib/textChunker.js`, `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim from `@qvac/tts-onnx`. * New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000** to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both packages are loaded in the same Bare process. ## Native addon * `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` — `IModel` + `IModelCancel` implementation. First-iteration strategy: assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output path, call it synchronously, then parse the resulting 16-bit mono PCM wav back into `std::vector<int16_t>` for the JS handler. Consequences: every job re-loads the model (~700 ms + inference time), no mid-synthesis cancellation, no streaming. The follow-up milestone replaces this with a persistent, struct-based API once qvac-tts.cpp exposes one. * `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` — JS-to-C++ config bridging (same string-map pattern as `@qvac/tts-onnx`) and the `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing `createInstance` / `runJob` / `reload` / `activate` / `cancel` / `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`. * `addon/src/addon/AddonJs.hpp` — JS-facing `createInstance` / `runJob` / `reload` wrappers that register a `JsAudioOutputHandler` emitting `{ outputArray: Int16Array, sampleRate: number }` to JS. ## Build / registry * `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)` and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape matches `@qvac/transcription-whispercpp`). * `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough) plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`. * `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg. NOTE: the baseline pin here is inherited from `@qvac/transcription-whispercpp` and **must be bumped** to a commit that contains the `tts-cpp` port once that registry PR lands. A follow-up commit will update it. ## Tests & examples * Integration + unit test files for Chatterbox English are copied verbatim from `@qvac/tts-onnx` with only mechanical renames (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`, `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`). Some paths in `test/integration/addon.test.js` still import Supertonic / LavaSR helpers that don't exist in this package — those test blocks will fail fast when the file loads, which is expected until those backends get their own ggml packages. * Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus shared `wav-helper.js` + `pcm-chunk-player.js`. ## What's not in this PR (known gaps) * No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes will land in a single documentation pass once the registry + fork commits have merged upstream. * `vcpkg-configuration.json` baseline needs to point at a qvac-registry-vcpkg commit that ships `tts-cpp` (pending the registry PR). * Actual `npm run build` requires the registry and fork commits to be on `main` of their respective upstream repos. * chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg at commit 1e2839680b6be8d8ffff889a9c29b966c176098c — the commit that adds the `tts-cpp` port. Paired with the `qvac-tts` library already pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp @ 0fe4a521618cc30358040b29d75d4261b31cbb60). Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry PR lands upstream. * chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper Second pass over @qvac/tts-ggml after the build started passing: prune everything that only made sense for the ONNX-era multi-engine scope and adapt the remaining Chatterbox-English bits to the GGUF + file-path reference-audio contract. Restores `test/mobile/` so the Android build has something to point at. ## C++ * `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment contained `**/` which closed the block comment early and broke the build. Rewrote as a `//` comment. ## Examples * `examples/chatterbox-tts.js` — rewrite for v0 contract: single `<text>` argv, `files: { modelDir }` pointing at the two GGUFs, `referenceAudio` is now a wav **path** (addon passes it to `--reference-audio`) instead of a Float32Array. Drops english/multilingual arg and the CHATTERBOX_VARIANT switch that picked which `.onnx` files to load. * Removed `examples/chatterbox-streaming-tts.js` + `examples/pcm-chunk-player.js`. The v0 addon re-loads the model per `run()` call — exposing streaming would mislead. Both come back alongside the persistent-engine milestone. * `package.json`: `npm run example` now passes a default text so it runs without extra args. ## Tests ### Kept as-is (engine-agnostic) * `test/unit/textChunker.test.js` * `test/mock/{MockedBinding,utils}.js` * `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js` * `test/reference-audio/jfk.wav`, `test/data/sentences-*.js` ### Mechanical fixes * `test/unit/tts.error.test.js` — fix error-code assertions to the tts-ggml range (`13001–14000`); was still checking the `@qvac/tts-onnx` range (`7001–7011`). * `test/unit/tts-ggml.lifecycle.test.js` — fix stale `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the non-existent `engine: 'chatterbox'` option. * `test/unit/tts-ggml.sentence-stream.test.js` — same GGUF/engine cleanup. ### Rewritten * `test/unit/chatterbox.inference.test.js` — drop tests that asserted the old ONNX file shape (`tokenizer / speechEncoder / embedTokens / conditionalDecoder / languageModel`), the removed `engine` detection and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`). New tests cover: `modelDir` derives the two GGUF paths; explicit `t3Model` / `s3genModel` override the defaults. The mocked-binding run/reload/cancel flow stays. * `test/integration/addon.test.js` — fresh, ~180 LoC, Chatterbox-English only. Ensures the GGUFs are present, runs the short sentence set through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and (on darwin only) runs a whisper-based WER check via the existing `runWhisper` util. Drops the Chatterbox-multilingual block + every Supertonic + LavaSR block that doesn't apply to this package. * `test/utils/runChatterboxTTS.js` — rewrite for the GGUF contract: `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a file path that falls back to `test/reference-audio/jfk.wav` (or the mobile test-asset when `global.assetPaths` is present). No more WAV decode / resample on the JS side. * `test/utils/downloadModel.js` — trim from 1007 LoC to 280. Drops the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie downloaders. Keeps the shared HTTP/curl infrastructure and `ensureWhisperModel` (still used by the integration WER check). `ensureChatterboxModels` is now **check-only**: it verifies `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally and, if missing, prints the exact commands for generating them from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts. Once the GGUFs land on a canonical HuggingFace repo we'll wire up download URLs here. ## Scripts * `scripts/ensure-chatterbox.js` — simplify to a single invocation against `./models/`. Drops the variant / language matrix that the ONNX downloader needed. * `scripts/ensure-models.js` — now a thin alias to `ensure-chatterbox.js`. Drops the Supertonic + LavaSR orchestration. ## Mobile * Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs, testAssets/jfk.wav}` so the Android build has a wrapper to point at. * `package.json`: re-added `test/mobile` to the `files` list. ## Gitignore * Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp` (produced by the top-level `configure_file(...)` calls) and `build_*/` dirs (bare-make convention). ## Verified locally * `npx standard "test/**/*.js" "*.js" "lib/*.js"` — clean. * `npm run test:unit` — 38/38 pass (105/105 asserts). * `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."` produces a 24 kHz wav as expected. * Add streaming support * Update ggml backend to use separate ggml repo * tts-ggml: consume renamed tts-cpp library (2026-04-24#1) Upstream chatterbox.cpp renamed the package + namespace + target from qvac-tts to tts-cpp and tightened the library boundary; pick up the new artefacts here: - find_package(qvac-tts-cpp CONFIG REQUIRED) -> find_package(tts-cpp CONFIG REQUIRED) - qvac-tts::qvac-tts -> tts-cpp::tts-cpp - qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions, SynthesisResult, forward-decls in ChatterboxModel.hpp) - #include <qvac-tts/chatterbox/engine.h> -> #include <tts-cpp/chatterbox/engine.h> - Doxygen / inline doc references to the old names refreshed alongside the code changes. vcpkg wiring: - vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg commit bc30b0b (ports/tts-cpp renamed and repointed at chatterbox.cpp@f8f9145). - vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that carries the rename + namespace + install(EXPORT) changes). Verified with a cold bare-make generate + bare-make build against the new port, and the addon's existing unit + integration test suites. Made-with: Cursor * tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline Picks up the round-3 review-fix wave landed on the tts-cpp port: e673182 scrub stale patches/ refs from README (N10) 8ba10a6 drop unreachable TTS_CPP_GGML_LIB_PREFIX block (N8) 4b5d2d7 mirror N1-N7 fixes from chatterbox.cpp source-of-truth - N1 supertonic alive-registry guard against freed-backend gallocr_free assert on hot-swap (Vulkan/Metal/CUDA) - N2 drop dead g_sink_* state, soften log_set docstring - N3 Turbo BPE try/catch (exception-safe Engine ctor) - N4 STFT cancel checkpoint + tighter Engine::cancel() doc - N5 document s3gen_preload/unload refcount semantics - N6 drop dead cached_text_lc Supertonic shim - N7 fix misleading "no copy" view-vs-copy log wording Plus the integrated-port-only round-2 fixes that landed earlier: fa0d490 close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML now defaults ON; bundled-without-patches hard-errors at configure time with a pointer at the ggml-speech vcpkg port. ae34c58 README rewritten for integrated/vcpkg context. a2f2dd6 top-level qvac-ext-lib-whisper.cpp README points at the tts-cpp/ subtree (alongside parakeet-cpp/). Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine / EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is backward-compatible: the new port adds Engine::backend_name(), MTL-variant fields on EngineOptions (language / cfg_weight / min_p / exaggeration), and a separate tts_cpp::supertonic::Engine class, but nothing this consumer was already calling has changed. Edits: packages/tts-ggml/vcpkg.json - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07. packages/tts-ggml/vcpkg-configuration.json - default-registry baseline: bc30b0b (April 2026 fork-only state) -> 16b91afdcfd59baea60e81f3da94f49311ef2a97. The new baseline pulls in the post-tetherto-merge state (parakeet-cpp port at 932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new tts-cpp port (16b91af) on the developer's GustavoA1604 registry fork. Smoke-test plan: after running `vcpkg install` against the new baseline, the tts-cpp port's vcpkg_from_github resolves at GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the upstream PR merges. ChatterboxModel should build and synthesize identically; expanding to Multilingual + Supertonic flows is the follow-up commit on the package side. Co-authored-by: Cursor <cursoragent@cursor.com> * Add chatterbox multilingual and supertonic * Add mobile integration tests * tts-ggml: drop clang-19 pin in linux-clang toolchain The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary names) since the package's first commit (0a2c978). Linux CI hadn't exercised this path before — the new on-pr-tts-ggml.yml -> integration matrix is the first time it does, and it fails on every linux runner (ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's "detect_compiler" step because none of the GH-hosted images ship a `clang-19` symlink: Detecting compiler hash for triplet x64-linux... error: while detecting compiler information: ... CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127 (message): Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE= .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ... Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/ toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so each runner picks up its image's default clang (clang-15 on ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship). The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake is honoured by every reasonable clang version. Co-authored-by: Cursor <cursoragent@cursor.com> * Add C++ tests and coverage; fix linux build * tts-ggml: address PR review feedback Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments. - #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`. - #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list). - #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again. - #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise. - #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but- not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values). - #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run. - #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false". - #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now. - #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable. Nits: - 'use strict' added to addonLogging.js (matches every other .js). - node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js. - ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead. Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean. Co-authored-by: Cursor <cursoragent@cursor.com> * tts-ggml: unblock CI integration tests on every desktop runner Four independent failures, one per platform: 1. linux-x64 / linux-arm64: addon load crashed at `libomp.so.5: cannot open shared object file`. tts-cpp's binary is built with clang under the linux-clang toolchain and links against libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being apt-installed. Add `libomp5` so libomp.so.5 is on the loader path. 2. darwin-arm64: convert-models.sh aborted at line 200 with `hf_args[@]: unbound variable`. macOS's system bash is 3.2 which treats `"${arr[@]}"` as nounset access when the array is empty under `set -u`; with HF_TOKEN unset we hit it on every fresh runner. Use the `${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six call sites and add a header comment so the next maintainer doesn't accidentally regress. 3. darwin-x64: pip install bombed building `llvmlite` from source because the macos-15-large runner has no LLVM 15 development install. Root cause: librosa pulls in numba 0.65+, which stopped shipping darwin-x86_64 wheels for Python 3.12. Pin Python to 3.11 in the Setup Python step; 3.11 has prebuilt wheels for the entire numba/llvmlite/librosa stack on darwin-x64 and is fine for every other converter dependency. 4. windows-2022: ChatterboxModel::load threw `vk::createInstance: ErrorIncompatibleDriver`. Root cause: the addon's index.js::_validateConfig defaults `useGPU = true` when neither useGPU nor nGpuLayers is specified, so the test ran with n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance -> ErrorIncompatibleDriver on the runner's no-Vulkan-driver image. runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'` (set on the no-GPU matrix entries) and forces useGPU=false on exactly those runners; the other test runners (chatterbox-mtl, gpu-smoke, multiple-runs) already had this guard. Also documents the `mesa-vulkan-drivers` apt package (already pulled in) as the software ICD that lets the Vulkan-built prebuild's runtime backend probe enumerate at least one device on linux runners. Co-authored-by: Cursor <cursoragent@cursor.com> * tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit) Mobile build failed at `:app:createBundleReleaseJsAndAssets` with: SyntaxError: assets/testAssets/chatterbox-s3gen.gguf: Cannot create a string longer than 0x1fffffe8 characters Root cause: Metro's bundler reads every asset under `test/mobile/testAssets/` via `Buffer.toString()`. V8's max string length is 0x1fffffe8 (~512 MiB). chatterbox-s3gen.gguf is ~1 GiB even with --quant q4_0 because the s3gen converter only quantizes attention weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight tensors quantized" in the converter log). Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the limit) on mobile. Mobile Chatterbox tests degrade cleanly to `t.pass('Skipped: Chatterbox GGUFs not available')` via the existing `ensureChatterboxModels` helper -- it already returns { success: false } when the GGUFs aren't on disk. Cache key bumped to v2 so existing v1 cache entries (which include the chatterbox files) are evicted on the next run. Bundling Chatterbox on mobile requires either: - adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the JS-string read is skipped (then the s3gen file can flow through the bundle as a raw asset), or - pushing the chatterbox GGUFs to the device via `adb push` outside the bundle and surfacing the path through downloadModel.js's existing ANDROID_CANDIDATE_DIRS fallback. Both are outside the scope of this PR; documented inline above the cache step for the next maintainer. Co-authored-by: Cursor <cursoragent@cursor.com> * Bump hash of vcpkg * Consume vcpkg from tetherto repository * Fix integration tests failures in all platforms * Further fix tests * fix: Make useGPU flag more meaningful (#1953) * fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts * add gpu smoke test * resolve comments --------- Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local> * Update dependencies after monorepo directory changes * Further drop qvac-lib- prefix * Add CHANGELOG.md --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com> Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
lauripiisang
added a commit
that referenced
this pull request
May 14, 2026
…ax_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience.
lauripiisang
added a commit
that referenced
this pull request
May 14, 2026
…ax_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience.
simon-iribarren
added a commit
to simon-iribarren/qvac
that referenced
this pull request
May 14, 2026
- transcribe.ts: route the two `Transcription Update` debug emits
through `requestLogger.debug` so they carry the per-request prefix,
matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
unused module-level `logger`. Collapse two `scope.defer(async () =>
{ await restorePrompt(...) })` wrappers to bare arrow callbacks
(review tetherto#5, tetherto#10).
- inference-handler-migrations.test.ts: add bareTest op-level cancel-
by-requestId cases for `transcribe (whisper)` (asserts loop exit +
addon.cancel called + reload-count == 2 to pin the
`applyPrompt + restorePrompt runs exactly once` invariant) and
`finetune` (asserts model.cancel called + scope unwind clears the
runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
by instrumenting the addon and asserting addon.cancel was NOT called
during a translate cancel (review tetherto#3, tetherto#7).
- request-lifecycle-primitives.mdc: reconcile the "polling
signal.aborted mid-handler" anti-pattern with the new "Per-iteration
cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
the check when the addon already honours signal directly; the M3b
pattern is *introducing* the check where the addon doesn't and the
loop is the only soft-cancel exit (review tetherto#4).
lauripiisang
added a commit
that referenced
this pull request
May 14, 2026
…ax_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience.
lauripiisang
added a commit
that referenced
this pull request
May 14, 2026
#2030) * QVAC-18733 feat[api]: add OpenAI Responses routes with in-memory store Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id}, GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL store, X-QVAC-Stub: responses-volatile header, and startup banner. * fix: align Responses streaming with finalized response and add usage stats - Approach (b): always include the assistant `message` item in `response.output[0]`, even when tool calls are present, so the streamed item tree matches `response.completed`. - Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and the finalized `output[]`, fixing client-side accumulation by `item_id`. - Use distinct `output_index` per tool call (1..n) and set `item_id` on `response.function_call_arguments.delta`/`.done` to the function-call item id (was the OpenAI `call_id`, causing collisions and wrong wiring). - Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can satisfy tool calls instead of hanging in `requires_action` with no payload. - Drop the duplicate `previous_response_id` lookup in `handlePostResponses`. - Drop `parallel_tool_calls` from the unsupported-params log: it is honored. - Recognise `function_call_output` (-> `tool` role) and `function_call` (-> synthesized assistant `<tool_call>` content) in `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained tool round-trips actually carry through `previous_response_id`. - Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids. - Surface real `usage.output_tokens` from `result.stats.generatedTokens` (Responses + chat.completions, blocking + streaming); fall back to word count when stats are missing. `input_tokens` stays 0 with an inline note that the SDK does not expose a prompt-token count today. - Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape. Tests: extend `responses.test.ts` and `translate.test.ts`; add `responses-streaming.test.ts` driving the new exported `writeStreamingResponse` / `writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`. * test[skiplog]: stabilize Responses chain e2e for tiny reasoning model Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M has room for both its <think> block and the actual answer. The test exercises previous_response_id chain wiring; it should not depend on sampling luck or the model's reasoning length. * fix: walk previous_response_id chain so multi-turn keeps grandparent history Each StoredResponse.inputItems only carries that turn's NEW input (`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of depth >= 3 silently lost the grandparent turn: resp_1 input "A" -> output "X" (stored: ["A"]) resp_2 prev=resp_1 input "B" history sent: [A, X, B] (stored: ["B"]) resp_3 prev=resp_2 input "C" history sent: [B, Y, C] -- A and X gone historyPrefixFromStoredResponse now walks the chain via responseObject.previous_response_id when given a resolver, prepending earlier turns oldest-first. Cap depth at 32 to bound work and protect against pathological cycles. Routes pass `(id) => store.get(id)` as the resolver. Legacy single-step callers still work unchanged when the resolver is omitted. Tests: - unit: depth-3 chain produces all six prefix entries in order; maxDepth cap honored. - e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the word and recovers it -- would silently fail before this fix. * fix: address Responses review nits (SSE sentinel, dup event, types, max_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience. * fix: address Simon review nits (stream error sentinel, input_items after cursor) Two surfaced post-rebase: 1. sendError gained an opt-in { sseSentinel: false } so callers inside an active stream can suppress the trailing `data: [DONE]\n\n` after the `response.error` SSE event. Responses streaming error path now passes it, closing the gap that the happy path already handled (response.completed already used endSSE({ sentinel: false })). 2. GET /v1/responses/:id/input_items now reads the `after` cursor from the query string in addition to `limit`. Spec-compliant pagination would have re-fetched page 1 forever; the store already implemented the cursor. Added a store-level pagination test that walks all pages by `last_id`.
simon-iribarren
added a commit
that referenced
this pull request
May 15, 2026
* QVAC-18183 feat[api]: inference-handler migrations
Migrate the four remaining inference handler kinds onto the
RequestRegistry primitives shipped in M3a (cancel-capability
declaration, per-kind concurrency policy, structured
`[request-lifecycle]` logging). Each handler now opens a
request-scoped `ManagedRequestContext`, threads the optional
`requestId` from the wire request (falling back to a server-minted
UUID), routes hard cancels to `addon.cancel()` at a single signal-
listener leaf, and replaces ad-hoc `try/finally` cleanup with
`scope.defer(...)` registrations so cleanup runs in LIFO order on
every exit path.
- `embed` (kind "embeddings", `{ scope: "model", hard: true }`):
`packages/sdk/server/bare/ops/embed.ts` opens the context, threads
`requestId` from `embedRequestSchema`, post-await `signal.aborted`
checks raise `InferenceCancelledError`.
- `transcribe` / `transcribeStream` (kind "transcribe",
`{ scope: "model", hard: true }`): collapsed
`try { ... } finally { restorePrompt(...) }` into
`scope.defer(restorePrompt)`, added per-iteration
`if (ctx.signal.aborted) break;` in the `response.iterate()` loop
(Option A from §4 of the M3b brief — explicit, visible at the call
site, no `takeWhileNotAborted` wrapper).
- `translate` (kind "translate"): two engine branches.
llamacpp-completion declares `{ scope: "model", hard: true }` and
wires `signal → addon.cancel()`; nmtcpp-translation keeps
`{ scope: "none" }` and soft-cancels inside both the streaming
iterate loop and the `runBatch` early-return path.
- `finetune` (kind "finetune"): flipped the llamacpp-completion
manifest declaration from `{ scope: "none" }` to
`{ scope: "model", hard: true }` (the addon already exposes
`model.cancel()`). `startFinetune` opens a registry context and
wires `signal → model.cancel()`; the two-level `try/finally`
collapses into `scope.defer` for `clearFinetuneRuntimeState` and
`handle.removeListener`. `cancelFinetune(modelId)` is now a thin
wrapper over `getRequestRegistry().cancel({ modelId, kind:
"finetune" })` — never invokes `model.cancel()` directly.
Per §4 of the brief: per-iteration cancel granularity uses
Option A (explicit `if (ctx.signal.aborted) break;` at the top of
each streaming loop body). No `takeWhileNotAborted` wrapper was
introduced.
Per §7 anti-patterns: M3b adds zero `oneAtATimePerModel` policies
(the four migrated kinds tolerate concurrent requests against the
same model), leaves the M1 compat-fallback in
`server/bare/ops/cancel.ts` untouched (M3d retires it), and does
not modify `cancelHandler.ts`.
Other changes:
- `embed`, `transcribe`, `transcribeStream`, `translate`,
`finetune` request schemas grow an optional `requestId` field
(`.string().min(1).optional()`); server-side ops fall back to
`generateServerRequestId()` when absent.
- Whisper / Parakeet / LLM / NMT plugin handlers thread
`request.requestId` into their bare ops.
- `plugin-cancel-capability.test.ts` truth-table flipped for the
`finetune` row.
- New `inference-handler-migrations.test.ts` covers schema-level
optional-`requestId` acceptance for all four kinds and pins the
`[request-lifecycle] begin/cancel/end` line shape for each kind.
The op-level cancel-by-requestId / cancel-by-modelId integration
tests are bare-runtime-gated (the migrated ops pull `bare-crypto`
/ `bare-fs` transitively and can't load under Bun, same reason as
`finetune-ops.test.disabled.ts`).
- `.cursor/rules/sdk/request-lifecycle-primitives.mdc` and
`.cursor/rules/sdk/docs/request-lifecycle-system.mdc` updated:
M3b row marked shipped, finetune truth-table row flipped,
canonical-handler-shape section refreshed to use `embed.ts` as the
cleanest reference and to document the Option A per-iteration
check.
Verification:
- `bun lint` (eslint + tsc --noEmit): green.
- `bun run typecheck`: green.
- `bun run test:unit`: every test file green except the
pre-existing `client/rpc/rpc-client.ts` `#rpc` package-resolution
failure on upstream/main (also reproducible without these
changes; unrelated to M3b).
* QVAC-18183 fix: address PR #2058 review feedback
- transcribe.ts: route the two `Transcription Update` debug emits
through `requestLogger.debug` so they carry the per-request prefix,
matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
unused module-level `logger`. Collapse two `scope.defer(async () =>
{ await restorePrompt(...) })` wrappers to bare arrow callbacks
(review #5, #10).
- inference-handler-migrations.test.ts: add bareTest op-level cancel-
by-requestId cases for `transcribe (whisper)` (asserts loop exit +
addon.cancel called + reload-count == 2 to pin the
`applyPrompt + restorePrompt runs exactly once` invariant) and
`finetune` (asserts model.cancel called + scope unwind clears the
runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
by instrumenting the addon and asserting addon.cancel was NOT called
during a translate cancel (review #3, #7).
- request-lifecycle-primitives.mdc: reconcile the "polling
signal.aborted mid-handler" anti-pattern with the new "Per-iteration
cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
the check when the addon already honours signal directly; the M3b
pattern is *introducing* the check where the addon doesn't and the
loop is the only soft-cancel exit (review #4).
* QVAC-18183 fix: drop unsafe `addon` re-narrowing in translate.ts onAbort
Addresses opaninakuffo's review comment on #2058:
`AnyModel.addon` is already typed as `AddonInterface | undefined`
(see `server/bare/registry/model-registry.ts:17-20`), so the
`as unknown as { addon?: { cancel?(jobId?: string): Promise<void> } }`
cast was unnecessary. Matches the simpler pattern used by `embed.ts`
and `transcribe.ts` for the same `onAbort` shape — keeps the four
M3b-migrated ops uniform.
* QVAC-18183 doc: trim internal milestone references from cursor rules + code comments
Removed the "Migration Roadmap" table, "M1/M2/M3a-d" milestone labels, planning-brief
decision references (Decision A/B.2, D1/D2), workspace-local paths
(`tasks/release-0.11.0-planning/...`, `pitch-3-decisions.md`), and "in review"
forward-references from the request-lifecycle cursor rules and the matching code
comments in the bare ops, finetune wrapper, and the inference-migration tests. The
canonical handler shape, anti-patterns, primitives reference, plugin cancel-capability
truth table, and concurrency-policy / structured-logging sections all stay — only the
internal milestone framing comes out.
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
…#1983) * feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp) New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg). API-compatible with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream consumers can swap backends without touching orchestration code. ## Scope * First iteration. Supports Chatterbox **English** only. Chatterbox multilingual, LavaSR enhancer, Supertonic engine, and streaming are out of scope and remain in `@qvac/tts-onnx`. They'll land alongside the evolution of qvac-tts.cpp. * Native backend is the static `qvac-tts` library from the QVAC vcpkg registry (`ports/tts-cpp`, baseline `2026-04-21`). No ONNX Runtime dependency. ## JS surface * `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as `ONNXTTS`: `run` / `runStream` / `runStreaming` / `reload` / `unload` / `destroy`. * `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` / `files.s3genModel` override the defaults. * Options: `referenceAudio`, `voiceDir` (baked profile), `seed`, `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for the upcoming streaming flags (`streamChunkTokens`, `streamFirstChunkTokens`, `cfmSteps`). * Shared reusable lib code (`lib/textChunker.js`, `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim from `@qvac/tts-onnx`. * New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000** to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both packages are loaded in the same Bare process. ## Native addon * `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` — `IModel` + `IModelCancel` implementation. First-iteration strategy: assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output path, call it synchronously, then parse the resulting 16-bit mono PCM wav back into `std::vector<int16_t>` for the JS handler. Consequences: every job re-loads the model (~700 ms + inference time), no mid-synthesis cancellation, no streaming. The follow-up milestone replaces this with a persistent, struct-based API once qvac-tts.cpp exposes one. * `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` — JS-to-C++ config bridging (same string-map pattern as `@qvac/tts-onnx`) and the `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing `createInstance` / `runJob` / `reload` / `activate` / `cancel` / `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`. * `addon/src/addon/AddonJs.hpp` — JS-facing `createInstance` / `runJob` / `reload` wrappers that register a `JsAudioOutputHandler` emitting `{ outputArray: Int16Array, sampleRate: number }` to JS. ## Build / registry * `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)` and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape matches `@qvac/transcription-whispercpp`). * `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough) plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`. * `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg. NOTE: the baseline pin here is inherited from `@qvac/transcription-whispercpp` and **must be bumped** to a commit that contains the `tts-cpp` port once that registry PR lands. A follow-up commit will update it. ## Tests & examples * Integration + unit test files for Chatterbox English are copied verbatim from `@qvac/tts-onnx` with only mechanical renames (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`, `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`). Some paths in `test/integration/addon.test.js` still import Supertonic / LavaSR helpers that don't exist in this package — those test blocks will fail fast when the file loads, which is expected until those backends get their own ggml packages. * Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus shared `wav-helper.js` + `pcm-chunk-player.js`. ## What's not in this PR (known gaps) * No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes will land in a single documentation pass once the registry + fork commits have merged upstream. * `vcpkg-configuration.json` baseline needs to point at a qvac-registry-vcpkg commit that ships `tts-cpp` (pending the registry PR). * Actual `npm run build` requires the registry and fork commits to be on `main` of their respective upstream repos. * chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg at commit 1e2839680b6be8d8ffff889a9c29b966c176098c — the commit that adds the `tts-cpp` port. Paired with the `qvac-tts` library already pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp @ 0fe4a521618cc30358040b29d75d4261b31cbb60). Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry PR lands upstream. * chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper Second pass over @qvac/tts-ggml after the build started passing: prune everything that only made sense for the ONNX-era multi-engine scope and adapt the remaining Chatterbox-English bits to the GGUF + file-path reference-audio contract. Restores `test/mobile/` so the Android build has something to point at. ## C++ * `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment contained `**/` which closed the block comment early and broke the build. Rewrote as a `//` comment. ## Examples * `examples/chatterbox-tts.js` — rewrite for v0 contract: single `<text>` argv, `files: { modelDir }` pointing at the two GGUFs, `referenceAudio` is now a wav **path** (addon passes it to `--reference-audio`) instead of a Float32Array. Drops english/multilingual arg and the CHATTERBOX_VARIANT switch that picked which `.onnx` files to load. * Removed `examples/chatterbox-streaming-tts.js` + `examples/pcm-chunk-player.js`. The v0 addon re-loads the model per `run()` call — exposing streaming would mislead. Both come back alongside the persistent-engine milestone. * `package.json`: `npm run example` now passes a default text so it runs without extra args. ## Tests ### Kept as-is (engine-agnostic) * `test/unit/textChunker.test.js` * `test/mock/{MockedBinding,utils}.js` * `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js` * `test/reference-audio/jfk.wav`, `test/data/sentences-*.js` ### Mechanical fixes * `test/unit/tts.error.test.js` — fix error-code assertions to the tts-ggml range (`13001–14000`); was still checking the `@qvac/tts-onnx` range (`7001–7011`). * `test/unit/tts-ggml.lifecycle.test.js` — fix stale `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the non-existent `engine: 'chatterbox'` option. * `test/unit/tts-ggml.sentence-stream.test.js` — same GGUF/engine cleanup. ### Rewritten * `test/unit/chatterbox.inference.test.js` — drop tests that asserted the old ONNX file shape (`tokenizer / speechEncoder / embedTokens / conditionalDecoder / languageModel`), the removed `engine` detection and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`). New tests cover: `modelDir` derives the two GGUF paths; explicit `t3Model` / `s3genModel` override the defaults. The mocked-binding run/reload/cancel flow stays. * `test/integration/addon.test.js` — fresh, ~180 LoC, Chatterbox-English only. Ensures the GGUFs are present, runs the short sentence set through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and (on darwin only) runs a whisper-based WER check via the existing `runWhisper` util. Drops the Chatterbox-multilingual block + every Supertonic + LavaSR block that doesn't apply to this package. * `test/utils/runChatterboxTTS.js` — rewrite for the GGUF contract: `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a file path that falls back to `test/reference-audio/jfk.wav` (or the mobile test-asset when `global.assetPaths` is present). No more WAV decode / resample on the JS side. * `test/utils/downloadModel.js` — trim from 1007 LoC to 280. Drops the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie downloaders. Keeps the shared HTTP/curl infrastructure and `ensureWhisperModel` (still used by the integration WER check). `ensureChatterboxModels` is now **check-only**: it verifies `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally and, if missing, prints the exact commands for generating them from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts. Once the GGUFs land on a canonical HuggingFace repo we'll wire up download URLs here. ## Scripts * `scripts/ensure-chatterbox.js` — simplify to a single invocation against `./models/`. Drops the variant / language matrix that the ONNX downloader needed. * `scripts/ensure-models.js` — now a thin alias to `ensure-chatterbox.js`. Drops the Supertonic + LavaSR orchestration. ## Mobile * Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs, testAssets/jfk.wav}` so the Android build has a wrapper to point at. * `package.json`: re-added `test/mobile` to the `files` list. ## Gitignore * Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp` (produced by the top-level `configure_file(...)` calls) and `build_*/` dirs (bare-make convention). ## Verified locally * `npx standard "test/**/*.js" "*.js" "lib/*.js"` — clean. * `npm run test:unit` — 38/38 pass (105/105 asserts). * `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."` produces a 24 kHz wav as expected. * Add streaming support * Update ggml backend to use separate ggml repo * tts-ggml: consume renamed tts-cpp library (2026-04-24#1) Upstream chatterbox.cpp renamed the package + namespace + target from qvac-tts to tts-cpp and tightened the library boundary; pick up the new artefacts here: - find_package(qvac-tts-cpp CONFIG REQUIRED) -> find_package(tts-cpp CONFIG REQUIRED) - qvac-tts::qvac-tts -> tts-cpp::tts-cpp - qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions, SynthesisResult, forward-decls in ChatterboxModel.hpp) - #include <qvac-tts/chatterbox/engine.h> -> #include <tts-cpp/chatterbox/engine.h> - Doxygen / inline doc references to the old names refreshed alongside the code changes. vcpkg wiring: - vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg commit bc30b0b (ports/tts-cpp renamed and repointed at chatterbox.cpp@f8f9145). - vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that carries the rename + namespace + install(EXPORT) changes). Verified with a cold bare-make generate + bare-make build against the new port, and the addon's existing unit + integration test suites. Made-with: Cursor * tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline Picks up the round-3 review-fix wave landed on the tts-cpp port: e673182 scrub stale patches/ refs from README (N10) 8ba10a6 drop unreachable TTS_CPP_GGML_LIB_PREFIX block (N8) 4b5d2d7 mirror N1-N7 fixes from chatterbox.cpp source-of-truth - N1 supertonic alive-registry guard against freed-backend gallocr_free assert on hot-swap (Vulkan/Metal/CUDA) - N2 drop dead g_sink_* state, soften log_set docstring - N3 Turbo BPE try/catch (exception-safe Engine ctor) - N4 STFT cancel checkpoint + tighter Engine::cancel() doc - N5 document s3gen_preload/unload refcount semantics - N6 drop dead cached_text_lc Supertonic shim - N7 fix misleading "no copy" view-vs-copy log wording Plus the integrated-port-only round-2 fixes that landed earlier: fa0d490 close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML now defaults ON; bundled-without-patches hard-errors at configure time with a pointer at the ggml-speech vcpkg port. ae34c58 README rewritten for integrated/vcpkg context. a2f2dd6 top-level qvac-ext-lib-whisper.cpp README points at the tts-cpp/ subtree (alongside parakeet-cpp/). Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine / EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is backward-compatible: the new port adds Engine::backend_name(), MTL-variant fields on EngineOptions (language / cfg_weight / min_p / exaggeration), and a separate tts_cpp::supertonic::Engine class, but nothing this consumer was already calling has changed. Edits: packages/tts-ggml/vcpkg.json - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07. packages/tts-ggml/vcpkg-configuration.json - default-registry baseline: bc30b0b (April 2026 fork-only state) -> 16b91afdcfd59baea60e81f3da94f49311ef2a97. The new baseline pulls in the post-tetherto-merge state (parakeet-cpp port at 932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new tts-cpp port (16b91af) on the developer's GustavoA1604 registry fork. Smoke-test plan: after running `vcpkg install` against the new baseline, the tts-cpp port's vcpkg_from_github resolves at GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the upstream PR merges. ChatterboxModel should build and synthesize identically; expanding to Multilingual + Supertonic flows is the follow-up commit on the package side. Co-authored-by: Cursor <cursoragent@cursor.com> * Add chatterbox multilingual and supertonic * Add mobile integration tests * tts-ggml: drop clang-19 pin in linux-clang toolchain The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary names) since the package's first commit (0a2c978). Linux CI hadn't exercised this path before — the new on-pr-tts-ggml.yml -> integration matrix is the first time it does, and it fails on every linux runner (ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's "detect_compiler" step because none of the GH-hosted images ship a `clang-19` symlink: Detecting compiler hash for triplet x64-linux... error: while detecting compiler information: ... CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127 (message): Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE= .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ... Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/ toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so each runner picks up its image's default clang (clang-15 on ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship). The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake is honoured by every reasonable clang version. Co-authored-by: Cursor <cursoragent@cursor.com> * Add C++ tests and coverage; fix linux build * tts-ggml: address PR review feedback Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments. - #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`. - #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list). - #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again. - #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise. - #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but- not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values). - #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run. - #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false". - #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now. - #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable. Nits: - 'use strict' added to addonLogging.js (matches every other .js). - node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js. - ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead. Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean. Co-authored-by: Cursor <cursoragent@cursor.com> * tts-ggml: unblock CI integration tests on every desktop runner Four independent failures, one per platform: 1. linux-x64 / linux-arm64: addon load crashed at `libomp.so.5: cannot open shared object file`. tts-cpp's binary is built with clang under the linux-clang toolchain and links against libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being apt-installed. Add `libomp5` so libomp.so.5 is on the loader path. 2. darwin-arm64: convert-models.sh aborted at line 200 with `hf_args[@]: unbound variable`. macOS's system bash is 3.2 which treats `"${arr[@]}"` as nounset access when the array is empty under `set -u`; with HF_TOKEN unset we hit it on every fresh runner. Use the `${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six call sites and add a header comment so the next maintainer doesn't accidentally regress. 3. darwin-x64: pip install bombed building `llvmlite` from source because the macos-15-large runner has no LLVM 15 development install. Root cause: librosa pulls in numba 0.65+, which stopped shipping darwin-x86_64 wheels for Python 3.12. Pin Python to 3.11 in the Setup Python step; 3.11 has prebuilt wheels for the entire numba/llvmlite/librosa stack on darwin-x64 and is fine for every other converter dependency. 4. windows-2022: ChatterboxModel::load threw `vk::createInstance: ErrorIncompatibleDriver`. Root cause: the addon's index.js::_validateConfig defaults `useGPU = true` when neither useGPU nor nGpuLayers is specified, so the test ran with n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance -> ErrorIncompatibleDriver on the runner's no-Vulkan-driver image. runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'` (set on the no-GPU matrix entries) and forces useGPU=false on exactly those runners; the other test runners (chatterbox-mtl, gpu-smoke, multiple-runs) already had this guard. Also documents the `mesa-vulkan-drivers` apt package (already pulled in) as the software ICD that lets the Vulkan-built prebuild's runtime backend probe enumerate at least one device on linux runners. Co-authored-by: Cursor <cursoragent@cursor.com> * tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit) Mobile build failed at `:app:createBundleReleaseJsAndAssets` with: SyntaxError: assets/testAssets/chatterbox-s3gen.gguf: Cannot create a string longer than 0x1fffffe8 characters Root cause: Metro's bundler reads every asset under `test/mobile/testAssets/` via `Buffer.toString()`. V8's max string length is 0x1fffffe8 (~512 MiB). chatterbox-s3gen.gguf is ~1 GiB even with --quant q4_0 because the s3gen converter only quantizes attention weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight tensors quantized" in the converter log). Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the limit) on mobile. Mobile Chatterbox tests degrade cleanly to `t.pass('Skipped: Chatterbox GGUFs not available')` via the existing `ensureChatterboxModels` helper -- it already returns { success: false } when the GGUFs aren't on disk. Cache key bumped to v2 so existing v1 cache entries (which include the chatterbox files) are evicted on the next run. Bundling Chatterbox on mobile requires either: - adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the JS-string read is skipped (then the s3gen file can flow through the bundle as a raw asset), or - pushing the chatterbox GGUFs to the device via `adb push` outside the bundle and surfacing the path through downloadModel.js's existing ANDROID_CANDIDATE_DIRS fallback. Both are outside the scope of this PR; documented inline above the cache step for the next maintainer. Co-authored-by: Cursor <cursoragent@cursor.com> * Bump hash of vcpkg * Consume vcpkg from tetherto repository * Fix integration tests failures in all platforms * Further fix tests * fix: Make useGPU flag more meaningful (#1953) * fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts * add gpu smoke test * resolve comments --------- Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local> * Update dependencies after monorepo directory changes * Further drop qvac-lib- prefix * Add CHANGELOG.md --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com> Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
#2030) * QVAC-18733 feat[api]: add OpenAI Responses routes with in-memory store Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id}, GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL store, X-QVAC-Stub: responses-volatile header, and startup banner. * fix: align Responses streaming with finalized response and add usage stats - Approach (b): always include the assistant `message` item in `response.output[0]`, even when tool calls are present, so the streamed item tree matches `response.completed`. - Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and the finalized `output[]`, fixing client-side accumulation by `item_id`. - Use distinct `output_index` per tool call (1..n) and set `item_id` on `response.function_call_arguments.delta`/`.done` to the function-call item id (was the OpenAI `call_id`, causing collisions and wrong wiring). - Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can satisfy tool calls instead of hanging in `requires_action` with no payload. - Drop the duplicate `previous_response_id` lookup in `handlePostResponses`. - Drop `parallel_tool_calls` from the unsupported-params log: it is honored. - Recognise `function_call_output` (-> `tool` role) and `function_call` (-> synthesized assistant `<tool_call>` content) in `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained tool round-trips actually carry through `previous_response_id`. - Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids. - Surface real `usage.output_tokens` from `result.stats.generatedTokens` (Responses + chat.completions, blocking + streaming); fall back to word count when stats are missing. `input_tokens` stays 0 with an inline note that the SDK does not expose a prompt-token count today. - Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape. Tests: extend `responses.test.ts` and `translate.test.ts`; add `responses-streaming.test.ts` driving the new exported `writeStreamingResponse` / `writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`. * test[skiplog]: stabilize Responses chain e2e for tiny reasoning model Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M has room for both its <think> block and the actual answer. The test exercises previous_response_id chain wiring; it should not depend on sampling luck or the model's reasoning length. * fix: walk previous_response_id chain so multi-turn keeps grandparent history Each StoredResponse.inputItems only carries that turn's NEW input (`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of depth >= 3 silently lost the grandparent turn: resp_1 input "A" -> output "X" (stored: ["A"]) resp_2 prev=resp_1 input "B" history sent: [A, X, B] (stored: ["B"]) resp_3 prev=resp_2 input "C" history sent: [B, Y, C] -- A and X gone historyPrefixFromStoredResponse now walks the chain via responseObject.previous_response_id when given a resolver, prepending earlier turns oldest-first. Cap depth at 32 to bound work and protect against pathological cycles. Routes pass `(id) => store.get(id)` as the resolver. Legacy single-step callers still work unchanged when the resolver is omitted. Tests: - unit: depth-3 chain produces all six prefix entries in order; maxDepth cap honored. - e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the word and recovers it -- would silently fail before this fix. * fix: address Responses review nits (SSE sentinel, dup event, types, max_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience. * fix: address Simon review nits (stream error sentinel, input_items after cursor) Two surfaced post-rebase: 1. sendError gained an opt-in { sseSentinel: false } so callers inside an active stream can suppress the trailing `data: [DONE]\n\n` after the `response.error` SSE event. Responses streaming error path now passes it, closing the gap that the happy path already handled (response.completed already used endSSE({ sentinel: false })). 2. GET /v1/responses/:id/input_items now reads the `after` cursor from the query string in addition to `limit`. Spec-compliant pagination would have re-fetched page 1 forever; the store already implemented the cursor. Added a store-level pagination test that walks all pages by `last_id`.
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
* QVAC-18183 feat[api]: inference-handler migrations
Migrate the four remaining inference handler kinds onto the
RequestRegistry primitives shipped in M3a (cancel-capability
declaration, per-kind concurrency policy, structured
`[request-lifecycle]` logging). Each handler now opens a
request-scoped `ManagedRequestContext`, threads the optional
`requestId` from the wire request (falling back to a server-minted
UUID), routes hard cancels to `addon.cancel()` at a single signal-
listener leaf, and replaces ad-hoc `try/finally` cleanup with
`scope.defer(...)` registrations so cleanup runs in LIFO order on
every exit path.
- `embed` (kind "embeddings", `{ scope: "model", hard: true }`):
`packages/sdk/server/bare/ops/embed.ts` opens the context, threads
`requestId` from `embedRequestSchema`, post-await `signal.aborted`
checks raise `InferenceCancelledError`.
- `transcribe` / `transcribeStream` (kind "transcribe",
`{ scope: "model", hard: true }`): collapsed
`try { ... } finally { restorePrompt(...) }` into
`scope.defer(restorePrompt)`, added per-iteration
`if (ctx.signal.aborted) break;` in the `response.iterate()` loop
(Option A from §4 of the M3b brief — explicit, visible at the call
site, no `takeWhileNotAborted` wrapper).
- `translate` (kind "translate"): two engine branches.
llamacpp-completion declares `{ scope: "model", hard: true }` and
wires `signal → addon.cancel()`; nmtcpp-translation keeps
`{ scope: "none" }` and soft-cancels inside both the streaming
iterate loop and the `runBatch` early-return path.
- `finetune` (kind "finetune"): flipped the llamacpp-completion
manifest declaration from `{ scope: "none" }` to
`{ scope: "model", hard: true }` (the addon already exposes
`model.cancel()`). `startFinetune` opens a registry context and
wires `signal → model.cancel()`; the two-level `try/finally`
collapses into `scope.defer` for `clearFinetuneRuntimeState` and
`handle.removeListener`. `cancelFinetune(modelId)` is now a thin
wrapper over `getRequestRegistry().cancel({ modelId, kind:
"finetune" })` — never invokes `model.cancel()` directly.
Per §4 of the brief: per-iteration cancel granularity uses
Option A (explicit `if (ctx.signal.aborted) break;` at the top of
each streaming loop body). No `takeWhileNotAborted` wrapper was
introduced.
Per §7 anti-patterns: M3b adds zero `oneAtATimePerModel` policies
(the four migrated kinds tolerate concurrent requests against the
same model), leaves the M1 compat-fallback in
`server/bare/ops/cancel.ts` untouched (M3d retires it), and does
not modify `cancelHandler.ts`.
Other changes:
- `embed`, `transcribe`, `transcribeStream`, `translate`,
`finetune` request schemas grow an optional `requestId` field
(`.string().min(1).optional()`); server-side ops fall back to
`generateServerRequestId()` when absent.
- Whisper / Parakeet / LLM / NMT plugin handlers thread
`request.requestId` into their bare ops.
- `plugin-cancel-capability.test.ts` truth-table flipped for the
`finetune` row.
- New `inference-handler-migrations.test.ts` covers schema-level
optional-`requestId` acceptance for all four kinds and pins the
`[request-lifecycle] begin/cancel/end` line shape for each kind.
The op-level cancel-by-requestId / cancel-by-modelId integration
tests are bare-runtime-gated (the migrated ops pull `bare-crypto`
/ `bare-fs` transitively and can't load under Bun, same reason as
`finetune-ops.test.disabled.ts`).
- `.cursor/rules/sdk/request-lifecycle-primitives.mdc` and
`.cursor/rules/sdk/docs/request-lifecycle-system.mdc` updated:
M3b row marked shipped, finetune truth-table row flipped,
canonical-handler-shape section refreshed to use `embed.ts` as the
cleanest reference and to document the Option A per-iteration
check.
Verification:
- `bun lint` (eslint + tsc --noEmit): green.
- `bun run typecheck`: green.
- `bun run test:unit`: every test file green except the
pre-existing `client/rpc/rpc-client.ts` `#rpc` package-resolution
failure on upstream/main (also reproducible without these
changes; unrelated to M3b).
* QVAC-18183 fix: address PR #2058 review feedback
- transcribe.ts: route the two `Transcription Update` debug emits
through `requestLogger.debug` so they carry the per-request prefix,
matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
unused module-level `logger`. Collapse two `scope.defer(async () =>
{ await restorePrompt(...) })` wrappers to bare arrow callbacks
(review #5, #10).
- inference-handler-migrations.test.ts: add bareTest op-level cancel-
by-requestId cases for `transcribe (whisper)` (asserts loop exit +
addon.cancel called + reload-count == 2 to pin the
`applyPrompt + restorePrompt runs exactly once` invariant) and
`finetune` (asserts model.cancel called + scope unwind clears the
runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
by instrumenting the addon and asserting addon.cancel was NOT called
during a translate cancel (review #3, #7).
- request-lifecycle-primitives.mdc: reconcile the "polling
signal.aborted mid-handler" anti-pattern with the new "Per-iteration
cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
the check when the addon already honours signal directly; the M3b
pattern is *introducing* the check where the addon doesn't and the
loop is the only soft-cancel exit (review #4).
* QVAC-18183 fix: drop unsafe `addon` re-narrowing in translate.ts onAbort
Addresses opaninakuffo's review comment on #2058:
`AnyModel.addon` is already typed as `AddonInterface | undefined`
(see `server/bare/registry/model-registry.ts:17-20`), so the
`as unknown as { addon?: { cancel?(jobId?: string): Promise<void> } }`
cast was unnecessary. Matches the simpler pattern used by `embed.ts`
and `transcribe.ts` for the same `onAbort` shape — keeps the four
M3b-migrated ops uniform.
* QVAC-18183 doc: trim internal milestone references from cursor rules + code comments
Removed the "Migration Roadmap" table, "M1/M2/M3a-d" milestone labels, planning-brief
decision references (Decision A/B.2, D1/D2), workspace-local paths
(`tasks/release-0.11.0-planning/...`, `pitch-3-decisions.md`), and "in review"
forward-references from the request-lifecycle cursor rules and the matching code
comments in the bare ops, finetune wrapper, and the inference-migration tests. The
canonical handler shape, anti-patterns, primitives reference, plugin cancel-capability
truth table, and concurrency-policy / structured-logging sections all stay — only the
internal milestone framing comes out.
simon-iribarren
added a commit
to simon-iribarren/qvac
that referenced
this pull request
Jun 8, 2026
Lifecycle correctness: - Spawn lock: steal only when the owner pid is dead (with an mtime fallback for an unreadable lock), so a legitimate multi-minute cold start no longer loses its lock after 30s and spawns a duplicate runner/serve (tetherto#1). - close(): the fetch path now bails out instead of re-resolving once closed, so a request racing close() can't silently re-add a consumer / spawn a runner (tetherto#3). - sweepServes: when an orphaned serve's pid is alive but its health check fails, keep the record instead of dropping it — dropping stranded a live serve with no registry trace. We only reap once it answers as ours, or drop once its pid dies (tetherto#4). - servePort: fold a pinned port into the fleet key so pinned-port callers don't reuse an auto-allocated serve on a different port, and distinct pins don't collide (tetherto#5). - Respawn: expose baseURL/port/pid as getters over live state, updated on every reconnect, so diagnostics/external clients see the real serve after recovery (tetherto#6). - retargetUrl now handles Request inputs (not just string/URL) so a respawn stays transparent if the SDK ever switches input shapes (tetherto#8). Docs: - README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend liveness; document the long-lived-sentinel/wrapper pattern and fix the misleading "the script doesn't have to stay running" note (tetherto#2). - Reconcile version wording: README/changelog now describe managed mode as unreleased (package is 0.1.0); docs-site integration page documents managed mode + the async overload (tetherto#7). Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the runner-dead + serve-alive + health-failing sweep case. Build + suite green (60 pass / 1 integration skip).
simon-iribarren
added a commit
that referenced
this pull request
Jun 10, 2026
* feat[api]: add managed mode to @qvac/ai-sdk-provider (QVAC-19900)
Add `mode: 'managed'` so the provider can synthesize an ephemeral
qvac.config.json from a model-constant list, spawn and supervise
`qvac serve` on a free port, and tear it down on host exit. External
mode is unchanged and stays synchronous; the managed supervisor is
lazily dynamic-imported so external-mode users pay no startup cost.
@qvac/cli becomes an optional peer dependency.
* fix: resolve @qvac/cli via main entry when its exports block package.json (QVAC-19900)
The published @qvac/cli ships a string `exports` field ("./dist/index.js"),
which makes the `./package.json` subpath non-resolvable
(ERR_PACKAGE_PATH_NOT_EXPORTED). Managed mode relied on resolving
`@qvac/cli/package.json` to locate the bin, so it would fail to find the CLI
on a clean install. Fall back to resolving the package main entry, which for
@qvac/cli is the same file as the `qvac` bin.
* doc: update ai-sdk provider agent setup after queue (QVAC-19900)
* QVAC-19900 feat[api]: per-model config for managed mode
Managed mode `models` now accepts spec objects ({ name, config, preload,
default }) alongside bare constant names, so callers can set per-model serve
options — notably `ctx_size` and `reasoning_budget` — that coding agents like
OpenCode require. The synthesized qvac.config.json carries the config block,
honors explicit `preload`/`default`, and validates names inside spec objects.
Exports the new `QvacManagedModel` type and documents per-model config plus a
managed-mode OpenCode example in the README.
* QVAC-19900 feat[api]: shared idle-reaped managed serve daemon
Rework managed mode from a per-provider supervisor into a shared,
self-cleaning serve daemon so it is robust standalone and usable by any
tool, not just a single session.
- Reuse via a fleet key (model set + per-model config + host) keyed in a
cross-process registry under ~/.qvac/managed-serves/; createQvac attaches
to a matching healthy serve instead of cold-starting a duplicate.
- A detached runner owns the qvac serve child and reaps it once no consumer
process has been alive for serveIdleTimeout (default 5m). Liveness, not
request traffic, is the signal, so it works for tools that hit baseURL
directly (OpenCode/Cline/Aider).
- close() now detaches (deregisters the consumer) instead of killing; a
shared serve survives until its last user is gone.
- Sweep only reaps dead/orphaned serves, never a healthy serve a live
process owns (fixes a second session SIGKILLing a downloading serve).
- Respawn-on-failure: fetch re-resolves and retries once on ECONNREFUSED.
- reuse:false (or a pinned servePort) yields a private serve reaped as soon
as its owner exits.
Refactor into serve-process.ts (spawn/health/stop), registry.ts,
fleet-key.ts, runner.ts; remove supervisor.ts and pid-tracker.ts. Add
reuse and serveIdleTimeout options. Rewrite tests and add reuse/idle-reap
end-to-end coverage; document the shared lifecycle in the README.
* QVAC-19900 fix: reject duplicate model names in managed mode
Each managed model maps to a single serve alias keyed by its name, so a
repeated name silently overwrote the earlier entry — and could drop its
`default: true`. Reject duplicates up front with DuplicateManagedModelError
instead of resolving them ambiguously. Addresses PR review feedback.
* QVAC-19900 fix[api]: address managed-mode self-review findings
- Per-instance consumer markers (<pid>.<rand>) so two providers in one
process sharing a fleet key don't deregister each other on close (A).
- Restrict respawn retry to ECONNREFUSED so an in-flight completion is
never blindly replayed on ECONNRESET/EPIPE (C).
- Health-check the recorded baseURL before SIGTERM-ing an orphaned serve,
guarding against killing a recycled pid (D).
- Use dirname() instead of a posix-only regex for ephemeral config cleanup (E).
- Fold serveBinPath into the fleet key so distinct local builds don't share
a serve (G).
- Export managed error classes + QvacManagedErrorCode for instanceof checks (H).
- Reject more than one explicit default: true (I).
- Deregister the consumer if resolveServe throws (F); drop dead
firstConsumerPid runner param (J).
Tests: per-instance markers, health-gated orphan sweep (kills serving
orphan, spares non-serving stranger pid), fleet-key serveBinPath sensitivity,
multiple-default rejection. README updated.
* QVAC-19900 fix[api]: address managed-mode lifecycle review (round 2)
Lifecycle correctness:
- Spawn lock: steal only when the owner pid is dead (with an mtime fallback for
an unreadable lock), so a legitimate multi-minute cold start no longer loses
its lock after 30s and spawns a duplicate runner/serve (#1).
- close(): the fetch path now bails out instead of re-resolving once closed, so
a request racing close() can't silently re-add a consumer / spawn a runner (#3).
- sweepServes: when an orphaned serve's pid is alive but its health check fails,
keep the record instead of dropping it — dropping stranded a live serve with
no registry trace. We only reap once it answers as ours, or drop once its pid
dies (#4).
- servePort: fold a pinned port into the fleet key so pinned-port callers don't
reuse an auto-allocated serve on a different port, and distinct pins don't
collide (#5).
- Respawn: expose baseURL/port/pid as getters over live state, updated on every
reconnect, so diagnostics/external clients see the real serve after recovery (#6).
- retargetUrl now handles Request inputs (not just string/URL) so a respawn stays
transparent if the SDK ever switches input shapes (#8).
Docs:
- README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend
liveness; document the long-lived-sentinel/wrapper pattern and fix the
misleading "the script doesn't have to stay running" note (#2).
- Reconcile version wording: README/changelog now describe managed mode as
unreleased (package is 0.1.0); docs-site integration page documents managed
mode + the async overload (#7).
Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the
runner-dead + serve-alive + health-failing sweep case. Build + suite green
(60 pass / 1 integration skip).
* docs: use canonical qvac.tether.io URL in ai-sdk-provider README
* QVAC-19900 feat[api]: public model catalog + catalog-id aliases in managed mode
Add `models.qvacCatalog`, a public models.dev-style catalog that maps
friendly ids (`qwen3.5-9b`) to the SDK constant the serve loads
(`QWEN3_5_9B_MULTIMODAL_Q4_K_M`), so the id a user picks from models.dev
resolves end-to-end with no translation layer in front of the serve.
Managed mode now accepts catalog ids as model names: the synthesized
serve config keys the alias by the friendly id while `model` resolves to
the underlying SDK constant, so the serve answers `qwen3.5-9b` directly.
Bare SDK constants keep working unchanged. A drift unit test fails CI if
any catalog constant disappears from the generated SDK catalog.
* QVAC-19900 feat[api]: process-group serve teardown + closeOnParentExit
Harden managed-mode lifecycle so a managed serve never leaks its `bare`
inference worker or outlives the process that owns it.
- Process-group teardown: spawn `qvac serve` detached (its own group) and,
when stopServe must escalate past the grace window, SIGKILL the whole
group. A plain SIGKILL of the serve pid never cascades to the grandchild
bare worker, so previously a wedged serve orphaned the worker. The
graceful SIGTERM is still sent to the serve process only, so a healthy
serve orchestrates its own shutdown and releases the global worker lock
(no stale lock left behind); the group SIGKILL is the wedged-path fallback.
- `closeOnParentExit` option: for a daemon-style host whose sole job is to
keep a managed serve alive for a parent process (e.g. an editor/agent
plugin). The provider watches its parent pid and, the moment the parent
exits (on POSIX we are reparented to init, ppid → 1), closes itself —
deregistering the consumer so the runner reaps the serve — and exits.
Without it a hard-killed parent would leave a reparented host alive,
keeping its consumer marker forever so the serve was never reaped.
Tests: a stubborn-grandchild fake serve proves group teardown reaps the
worker; `parentIsGone` unit-tests the parent-watch decision.
* QVAC-19900 fix: keep managed serve lifecycle correct under close() race and crash-respawn
- Undo the consumer re-registration when close() wins the race against an
in-flight fetch retry: resolveServe re-adds the marker after close() removed
it, which would keep the shared serve warm until the process exits.
- Preserve live consumer markers when sweepServes reaps a crashed/orphaned
serve, so a respawned runner inherits the still-alive sessions instead of
idle-reaping the fresh serve out from under them.
- docs: bump managed-mode ctx_size examples to 32768 for agent-sized prompts.
* QVAC-19900 fix: rename reresolve result to resolved for clarity in managed fetch
* QVAC-19900 mod: collapse redundant sync/async registry teardown helpers
removeConsumer/removeConsumerSync and removeRecord/removeRecordSync were a
confusing sync/async mirror: the async removeConsumer was only ever called right
after the sync one (a guaranteed no-op), and the removeRecord pair was really two
teardown semantics under near-identical names. Marker/record teardown is a single
unlink/rm, cheap enough to be synchronous everywhere — including process 'exit'
handlers where async can't run — so collapse each pair into one sync function.
No behaviour change; addresses review feedback on #2408.
* QVAC-19900 mod: trim verbose comments in managed registry
Tighten the sync-rationale comments on removeRecord/removeConsumer and drop a
stale, broken leftover comment above ensureDirSync. Keeps the non-obvious intent
(why sync, preserveConsumers semantics) without the narration.
* QVAC-19900 mod: drop unused DEFAULT_SERVE_BIN and ephemeralConfigName
Both were dead: DEFAULT_SERVE_BIN was never imported (serve-process spawns the
resolved CLI path verbatim) and ephemeralConfigName was an unused helper
(writeEphemeralConfig uses a fixed name inside an mkdtemp dir). Removing the
latter also drops the now-unused randomBytes import.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.