Qvac cli integration by Proletter · Pull Request #3 · tetherto/qvac

Proletter · 2026-01-08T13:38:43Z

No description provided.

Try #3. Limit device to only cpu or gpu.

* Try #1. Adding tokenizer proxy to provide vocab size. * Try #2. More fixes and logs. * Try #3. Limit device to only cpu or gpu. * Revert "Try #2. More fixes and logs." This reverts commit a461e69. * Revert "Try #1. Adding tokenizer proxy to provide vocab size." This reverts commit 9951195. * Fixing pipeline logging * Add more logs * Fixing bench logging * Add more error handling and logging * Improve error handling on the server. Added retry in case of context overflow. * Make retries self-adjustable * Adding some more checks and limiting the datasets temporarily * Test: trying to narrow down the error * Exclude failing datasets from embed benchmark * Clean up the code * Changing bench model for LLM * Try #1. Adding tokenizer proxy to provide vocab size. * Try #2. More fixes and logs. * Try #3. Limit device to only cpu or gpu. * Revert "Try #2. More fixes and logs." This reverts commit a461e69. * Revert "Try #1. Adding tokenizer proxy to provide vocab size." This reverts commit 9951195. * Fixing pipeline logging * Add more logs * Fixing bench logging * Add more error handling and logging * Improve error handling on the server. Added retry in case of context overflow. * Make retries self-adjustable * Adding some more checks and limiting the datasets temporarily * Test: trying to narrow down the error * Exclude failing datasets from embed benchmark * Clean up the code * Changing bench model for LLM * Minor fixes for clarity * Removing unused vars * Removing unused imports * Removing unused python deps --------- Co-authored-by: gianni <gianfranco.cordella@tether.io>

This reverts commit 0231828.

Polish the remaining review nits on the TTS client streaming surface. - #3 TtsMulticast.pump now rejects the `done` promise with the fatal error instead of resolving `false`. An internal `.catch(() => {})` silences unhandled-rejection warnings when the caller only iterates the buffer/chunk streams and never awaits `done`; re-awaits still see the rejection. - #6 TextToSpeechStreamSession[Symbol.asyncIterator] no longer throws synchronously on a second iteration; it returns an iterator whose first `.next()` rejects, so `for await` surfaces the error in the normal async control flow rather than the iterator protocol. - #9 plainTtsBufferStream / collectTtsBuffer wrap the RPC loop in try/catch/finally so `done` always settles: resolve(true) on the terminal frame, reject with the real error on exceptions, and resolve(false) on early consumer break. Previously `await done` could hang forever when the consumer bailed out early. - #11 Skip per-frame ttsResponseSchema.parse() in all three paths; rely on the discriminated-union narrowing at the RPC boundary. Drops the per-PCM-frame Zod validation cost for large sentences. Made-with: Cursor

…peech (#1590) * feat: Add runStream() which takes input as a stream * add integration tests * uncomment cb tests * chore: Add cb streaming example * feat: Add TTS streaming funcitonality and example * Update tts addon version * Remove chatterbox example * add new error code for tts streaming fail * Move common code to util * fix: Use z.infer to define TextToSpeechStreamClientParams * Move TextToSpeechStreamSession to schemas * Track subscriber current index and trim queue when all subscribers consumed past items * add missing unit tests * fix: drive done promise from multicast pump lifecycle * fix: Forward chunkIndex and sentenceChunk in sentence-stream mode to client * fix: Use correct error code for tts stream failure * chore: Add supertonic stream test in tts-tests.ts * fix: Make tts client more readable * Remove closures and inline async generators * fix: Subscribe eagerly in sentenceStreamTts to avoid late-subscriber data loss TtsMulticast.pump() starts in a microtask on construction, while the returned async generators only call subscribe() when first iterated. If the consumer iterated one generator before the other, the first subscriber could trim the queue before the second ever registered, silently dropping earlier frames. Subscribe synchronously for both bufferStream and chunkUpdates before returning, so both subscriber indexes are in place before pump pushes its first item. Made-with: Cursor * fix: Close TTS stream on server-sent done frame Remove the dead `null` sentinel from `processTextToSpeechStreamLine` and instead close `parseTextToSpeechStreamLines` after yielding the terminal `done: true` frame, so consumers don't rely on the server closing the socket to stop iteration. Made-with: Cursor * fix: Reject sentenceStream without stream in textToSpeech Previously `sentenceStream: true` combined with `stream: false` fell through to the collect path, silently dropping the sentence-stream parameters and returning no `chunkUpdates`. Fail fast at the dispatcher with a clear error so the contract mismatch surfaces to the caller instead of being swallowed. Made-with: Cursor * fix: Release TtsMulticast subscriber slot on early break Wire a try/finally into drain() so that when a consumer breaks out of the for-await (or the generator is .return()'d / throws), the slot is parked at +Infinity via unsubscribe(). This prevents a stale low min-index from permanently pinning trimConsumed, which otherwise leaked the queue for the entire RPC stream. Made-with: Cursor * fix: Guard TTS stream write after close and preserve UTF-8 boundaries Client: - Track a `closed` flag in `textToSpeechStream` duplex session, set by `end()` / `destroy()`. Subsequent `write()` calls now throw a typed `TextToSpeechStreamFailedError` instead of propagating a raw Bare/Node "write after end" stream error. - `end()` is idempotent so accidental double-close no longer errors. Server: - `buffersToUtf8Fragments` previously decoded each incoming Buffer via `toString("utf8")`, which corrupts any multi-byte codepoint whose bytes straddle a chunk boundary (common with CJK / emoji / accented scripts emitted as LLM token deltas). Added a small tail-buffer that finds the last complete UTF-8 codepoint end in the combined buffer and defers trailing incomplete bytes to the next chunk. Any dangling partial sequence is flushed on stream end. Made-with: Cursor * fix: Order TEXT_TO_SPEECH_STREAM_FAILED code and document it - Move TEXT_TO_SPEECH_STREAM_FAILED (52415) to the end of the 52400 Model Operations block so the ordering in SDK_SERVER_ERROR_CODES matches the numeric sequence (…52413, 52414, 52415). - Add the missing row for 52415 to the (latest) errors.mdx table, per the sdk/docs-freshness rule that the error table stay in sync whenever a new code is introduced. Made-with: Cursor * fix: Register operation metrics for textToSpeechStream Only `textToSpeech` was registered in `operation-metrics.ts`, so the duplex `textToSpeechStream` path silently skipped `modelExecutionTime`, `audioDuration`, and `totalSamples` gauges even though the server already collects the same `TtsStats` via `collectTtsStats()` on the final chunk. Mirror the non-streaming registration so the streaming path has parity observability. Made-with: Cursor * fix: Harden TTS client done-promise, iterator, and parse cost Polish the remaining review nits on the TTS client streaming surface. - #3 TtsMulticast.pump now rejects the `done` promise with the fatal error instead of resolving `false`. An internal `.catch(() => {})` silences unhandled-rejection warnings when the caller only iterates the buffer/chunk streams and never awaits `done`; re-awaits still see the rejection. - #6 TextToSpeechStreamSession[Symbol.asyncIterator] no longer throws synchronously on a second iteration; it returns an iterator whose first `.next()` rejects, so `for await` surfaces the error in the normal async control flow rather than the iterator protocol. - #9 plainTtsBufferStream / collectTtsBuffer wrap the RPC loop in try/catch/finally so `done` always settles: resolve(true) on the terminal frame, reject with the real error on exceptions, and resolve(false) on early consumer break. Previously `await done` could hang forever when the consumer bailed out early. - #11 Skip per-frame ttsResponseSchema.parse() in all three paths; rely on the discriminated-union narrowing at the RPC boundary. Drops the per-PCM-frame Zod validation cost for large sentences. Made-with: Cursor * fix: Tighten textToSpeechStream schema surface - Add .positive() to maxBufferScalars and flushAfterMs to match the existing constraint on sentenceStreamMaxChunkScalars. Previously a caller could pass negative values straight through to the addon. - Un-export textToSpeechStreamRequestBaseSchema — consumers only need the finalized textToSpeechStreamRequestSchema, and the base is an implementation detail of the shared object shape. The exported type alias TextToSpeechStreamClientParams continues to derive from the base via `typeof`, so nothing on the public type surface changes. Made-with: Cursor * fix: Cross-platform tmp path and safer PCM append in TTS examples - playPcmInt16Chunk now writes the intermediate WAV chunk under os.tmpdir() / path.join instead of a hard-coded /tmp/qvac-tts-chunk-… path. The previous code's Windows branch was unreachable in practice because the POSIX /tmp directory doesn't exist there; this uses %TEMP% on Windows automatically. - appendPcmSamples switches from `target.push(...chunk.slice(i, end))` to `Array.prototype.push.apply(target, chunk.slice(i, end))`. Same semantics, but avoids allocating the spread rest array per batch and is closer to a memcpy-style concat in V8. Made-with: Cursor * fix: Catch zero-chunk regressions in TTS sentence-stream test - TtsExecutor.makeSentenceStream now returns `{ passed: false, ... }` when the chunkUpdates iterator yields no chunks / no samples. The previous executor always returned a formatted string regardless of counts, so a regression that silently emitted zero chunks would still have looked like a pass. - ttsSupertonicSentenceStream's expectation upgraded from `{ validation: "type", expectedType: "string" }` to `{ validation: "contains-all", contains: ["sentence-streamed", "chunks", "samples"] }`. The executor's zero-case failure string lacks "sentence-streamed", so the contains-all match fails on regression. Made-with: Cursor * fix: Apply stream default locally and throw typed error on tts mismatch Previous guard only rejected the explicit `stream: false + sentenceStream: true` combination. A caller passing `{ modelId, text, sentenceStream: true }` with `stream` omitted silently fell through to `collectTts` while the server's Zod `.default(true)` still ran the sentence-stream branch and emitted chunk frames — which the client then discarded, dropping all chunk metadata. - Resolve the `stream` default locally (`params.stream ?? true`) so the client's dispatch routing matches the server's Zod-applied routing, and an omitted `stream` now correctly lands in `sentenceStreamTts` or `plainStreamTts`. - Only the explicit `sentenceStream: true + stream: false` combination is rejected, and it now throws `TextToSpeechStreamFailedError` (code 52415) instead of a bare `new Error(...)` so callers can discriminate by error code like everywhere else in the SDK. Made-with: Cursor * remove inline defaults for sentenceStream and stream * Use TtsMulticast in unit test instead of mock --------- Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments. - #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`. - #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list). - #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again. - #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise. - #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but- not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values). - #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run. - #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false". - #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now. - #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable. Nits: - 'use strict' added to addonLogging.js (matches every other .js). - node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js. - ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead. Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean. Co-authored-by: Cursor <cursoragent@cursor.com>

…ence-addon-cpp to 1.1.7#1 Per request from PR #1784 reviewers — pulls in the latest published versions of both core C++ dependencies. The new ports are breaking changes in two ways that require parallel source updates: 1. The `qvac-lib-inference-addon-cpp` port renamed its top-level include namespace from `qvac-lib-inference-addon-cpp/*.hpp` to `inference-addon-cpp/*.hpp`. No back-compat shim ships in 1.1.7#1, so all 12 `#include` lines across 4 files migrate: - addon/src/utils/LoggingMacros.hpp - addon/src/js-interface/binding.cpp - addon/src/addon/AddonJs.hpp (8 includes) - addon/src/addon/AddonCpp.hpp (2 includes) CMakeLists.txt's find_path probe also migrated to the new namespace. 2. The `qvac-lint-cpp` port (auto-bumped 1.4.4#2 -> #3 by the new baseline) relocated its config files from `share/qvac-lint-cpp/` to `share/lint-cpp/`. CMakeLists.txt's three `find_path` / `configure_file` paths migrated to the new location. The vcpkg port names themselves are unchanged in vcpkg.json (the registry entries are still `qvac-lib-inference-addon-cpp` and `qvac-lint-cpp`); only the installed file layout moved. Local validation (Win32 MSVC): bare-make generate OK (re-resolved deps; fabric compiled from source ~32 min — local binary cache had no entry for 8189.0.2 yet) bare-make build OK (1 pre-existing fopen deprecation warning, unrelated to this change) bare-make install OK npm run test:cpp 7/7 GoogleTest pass npm run test:integration 5/5 tests, 79/79 asserts pass Quality vs PyTorch (unchanged): fixed/auto: max|Δ|=0.0008 cos=1.0000 fixed/cpu: max|Δ|=0.0021 cos=1.0000 real/auto: max|Δ|=0.1096 cos=0.9998 real/cpu: max|Δ|=0.1083 cos=0.9998 Note: VLA is intentionally not in the `verify-qvac-fabric-lockstep` action's allowlist (which checks llamacpp-llm, llamacpp-embed, and nmtcpp — all still pinned to 7248.2.3). So bumping VLA alone does not break that gate; the LLM addons can be migrated separately when their owners are ready. Co-authored-by: Cursor <cursoragent@cursor.com>

…#1983) * feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp) New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg). API-compatible with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream consumers can swap backends without touching orchestration code. ## Scope * First iteration. Supports Chatterbox **English** only. Chatterbox multilingual, LavaSR enhancer, Supertonic engine, and streaming are out of scope and remain in `@qvac/tts-onnx`. They'll land alongside the evolution of qvac-tts.cpp. * Native backend is the static `qvac-tts` library from the QVAC vcpkg registry (`ports/tts-cpp`, baseline `2026-04-21`). No ONNX Runtime dependency. ## JS surface * `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as `ONNXTTS`: `run` / `runStream` / `runStreaming` / `reload` / `unload` / `destroy`. * `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` / `files.s3genModel` override the defaults. * Options: `referenceAudio`, `voiceDir` (baked profile), `seed`, `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for the upcoming streaming flags (`streamChunkTokens`, `streamFirstChunkTokens`, `cfmSteps`). * Shared reusable lib code (`lib/textChunker.js`, `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim from `@qvac/tts-onnx`. * New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000** to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both packages are loaded in the same Bare process. ## Native addon * `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` — `IModel` + `IModelCancel` implementation. First-iteration strategy: assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output path, call it synchronously, then parse the resulting 16-bit mono PCM wav back into `std::vector<int16_t>` for the JS handler. Consequences: every job re-loads the model (~700 ms + inference time), no mid-synthesis cancellation, no streaming. The follow-up milestone replaces this with a persistent, struct-based API once qvac-tts.cpp exposes one. * `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` — JS-to-C++ config bridging (same string-map pattern as `@qvac/tts-onnx`) and the `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing `createInstance` / `runJob` / `reload` / `activate` / `cancel` / `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`. * `addon/src/addon/AddonJs.hpp` — JS-facing `createInstance` / `runJob` / `reload` wrappers that register a `JsAudioOutputHandler` emitting `{ outputArray: Int16Array, sampleRate: number }` to JS. ## Build / registry * `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)` and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape matches `@qvac/transcription-whispercpp`). * `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough) plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`. * `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg. NOTE: the baseline pin here is inherited from `@qvac/transcription-whispercpp` and **must be bumped** to a commit that contains the `tts-cpp` port once that registry PR lands. A follow-up commit will update it. ## Tests & examples * Integration + unit test files for Chatterbox English are copied verbatim from `@qvac/tts-onnx` with only mechanical renames (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`, `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`). Some paths in `test/integration/addon.test.js` still import Supertonic / LavaSR helpers that don't exist in this package — those test blocks will fail fast when the file loads, which is expected until those backends get their own ggml packages. * Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus shared `wav-helper.js` + `pcm-chunk-player.js`. ## What's not in this PR (known gaps) * No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes will land in a single documentation pass once the registry + fork commits have merged upstream. * `vcpkg-configuration.json` baseline needs to point at a qvac-registry-vcpkg commit that ships `tts-cpp` (pending the registry PR). * Actual `npm run build` requires the registry and fork commits to be on `main` of their respective upstream repos. * chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg at commit 1e2839680b6be8d8ffff889a9c29b966c176098c — the commit that adds the `tts-cpp` port. Paired with the `qvac-tts` library already pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp @ 0fe4a521618cc30358040b29d75d4261b31cbb60). Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry PR lands upstream. * chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper Second pass over @qvac/tts-ggml after the build started passing: prune everything that only made sense for the ONNX-era multi-engine scope and adapt the remaining Chatterbox-English bits to the GGUF + file-path reference-audio contract. Restores `test/mobile/` so the Android build has something to point at. ## C++ * `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment contained `**/` which closed the block comment early and broke the build. Rewrote as a `//` comment. ## Examples * `examples/chatterbox-tts.js` — rewrite for v0 contract: single `<text>` argv, `files: { modelDir }` pointing at the two GGUFs, `referenceAudio` is now a wav **path** (addon passes it to `--reference-audio`) instead of a Float32Array. Drops english/multilingual arg and the CHATTERBOX_VARIANT switch that picked which `.onnx` files to load. * Removed `examples/chatterbox-streaming-tts.js` + `examples/pcm-chunk-player.js`. The v0 addon re-loads the model per `run()` call — exposing streaming would mislead. Both come back alongside the persistent-engine milestone. * `package.json`: `npm run example` now passes a default text so it runs without extra args. ## Tests ### Kept as-is (engine-agnostic) * `test/unit/textChunker.test.js` * `test/mock/{MockedBinding,utils}.js` * `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js` * `test/reference-audio/jfk.wav`, `test/data/sentences-*.js` ### Mechanical fixes * `test/unit/tts.error.test.js` — fix error-code assertions to the tts-ggml range (`13001–14000`); was still checking the `@qvac/tts-onnx` range (`7001–7011`). * `test/unit/tts-ggml.lifecycle.test.js` — fix stale `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the non-existent `engine: 'chatterbox'` option. * `test/unit/tts-ggml.sentence-stream.test.js` — same GGUF/engine cleanup. ### Rewritten * `test/unit/chatterbox.inference.test.js` — drop tests that asserted the old ONNX file shape (`tokenizer / speechEncoder / embedTokens / conditionalDecoder / languageModel`), the removed `engine` detection and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`). New tests cover: `modelDir` derives the two GGUF paths; explicit `t3Model` / `s3genModel` override the defaults. The mocked-binding run/reload/cancel flow stays. * `test/integration/addon.test.js` — fresh, ~180 LoC, Chatterbox-English only. Ensures the GGUFs are present, runs the short sentence set through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and (on darwin only) runs a whisper-based WER check via the existing `runWhisper` util. Drops the Chatterbox-multilingual block + every Supertonic + LavaSR block that doesn't apply to this package. * `test/utils/runChatterboxTTS.js` — rewrite for the GGUF contract: `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a file path that falls back to `test/reference-audio/jfk.wav` (or the mobile test-asset when `global.assetPaths` is present). No more WAV decode / resample on the JS side. * `test/utils/downloadModel.js` — trim from 1007 LoC to 280. Drops the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie downloaders. Keeps the shared HTTP/curl infrastructure and `ensureWhisperModel` (still used by the integration WER check). `ensureChatterboxModels` is now **check-only**: it verifies `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally and, if missing, prints the exact commands for generating them from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts. Once the GGUFs land on a canonical HuggingFace repo we'll wire up download URLs here. ## Scripts * `scripts/ensure-chatterbox.js` — simplify to a single invocation against `./models/`. Drops the variant / language matrix that the ONNX downloader needed. * `scripts/ensure-models.js` — now a thin alias to `ensure-chatterbox.js`. Drops the Supertonic + LavaSR orchestration. ## Mobile * Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs, testAssets/jfk.wav}` so the Android build has a wrapper to point at. * `package.json`: re-added `test/mobile` to the `files` list. ## Gitignore * Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp` (produced by the top-level `configure_file(...)` calls) and `build_*/` dirs (bare-make convention). ## Verified locally * `npx standard "test/**/*.js" "*.js" "lib/*.js"` — clean. * `npm run test:unit` — 38/38 pass (105/105 asserts). * `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."` produces a 24 kHz wav as expected. * Add streaming support * Update ggml backend to use separate ggml repo * tts-ggml: consume renamed tts-cpp library (2026-04-24#1) Upstream chatterbox.cpp renamed the package + namespace + target from qvac-tts to tts-cpp and tightened the library boundary; pick up the new artefacts here: - find_package(qvac-tts-cpp CONFIG REQUIRED) -> find_package(tts-cpp CONFIG REQUIRED) - qvac-tts::qvac-tts -> tts-cpp::tts-cpp - qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions, SynthesisResult, forward-decls in ChatterboxModel.hpp) - #include <qvac-tts/chatterbox/engine.h> -> #include <tts-cpp/chatterbox/engine.h> - Doxygen / inline doc references to the old names refreshed alongside the code changes. vcpkg wiring: - vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg commit bc30b0b (ports/tts-cpp renamed and repointed at chatterbox.cpp@f8f9145). - vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that carries the rename + namespace + install(EXPORT) changes). Verified with a cold bare-make generate + bare-make build against the new port, and the addon's existing unit + integration test suites. Made-with: Cursor * tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline Picks up the round-3 review-fix wave landed on the tts-cpp port: e673182 scrub stale patches/ refs from README (N10) 8ba10a6 drop unreachable TTS_CPP_GGML_LIB_PREFIX block (N8) 4b5d2d7 mirror N1-N7 fixes from chatterbox.cpp source-of-truth - N1 supertonic alive-registry guard against freed-backend gallocr_free assert on hot-swap (Vulkan/Metal/CUDA) - N2 drop dead g_sink_* state, soften log_set docstring - N3 Turbo BPE try/catch (exception-safe Engine ctor) - N4 STFT cancel checkpoint + tighter Engine::cancel() doc - N5 document s3gen_preload/unload refcount semantics - N6 drop dead cached_text_lc Supertonic shim - N7 fix misleading "no copy" view-vs-copy log wording Plus the integrated-port-only round-2 fixes that landed earlier: fa0d490 close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML now defaults ON; bundled-without-patches hard-errors at configure time with a pointer at the ggml-speech vcpkg port. ae34c58 README rewritten for integrated/vcpkg context. a2f2dd6 top-level qvac-ext-lib-whisper.cpp README points at the tts-cpp/ subtree (alongside parakeet-cpp/). Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine / EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is backward-compatible: the new port adds Engine::backend_name(), MTL-variant fields on EngineOptions (language / cfg_weight / min_p / exaggeration), and a separate tts_cpp::supertonic::Engine class, but nothing this consumer was already calling has changed. Edits: packages/tts-ggml/vcpkg.json - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07. packages/tts-ggml/vcpkg-configuration.json - default-registry baseline: bc30b0b (April 2026 fork-only state) -> 16b91afdcfd59baea60e81f3da94f49311ef2a97. The new baseline pulls in the post-tetherto-merge state (parakeet-cpp port at 932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new tts-cpp port (16b91af) on the developer's GustavoA1604 registry fork. Smoke-test plan: after running `vcpkg install` against the new baseline, the tts-cpp port's vcpkg_from_github resolves at GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the upstream PR merges. ChatterboxModel should build and synthesize identically; expanding to Multilingual + Supertonic flows is the follow-up commit on the package side. Co-authored-by: Cursor <cursoragent@cursor.com> * Add chatterbox multilingual and supertonic * Add mobile integration tests * tts-ggml: drop clang-19 pin in linux-clang toolchain The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary names) since the package's first commit (0a2c978). Linux CI hadn't exercised this path before — the new on-pr-tts-ggml.yml -> integration matrix is the first time it does, and it fails on every linux runner (ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's "detect_compiler" step because none of the GH-hosted images ship a `clang-19` symlink: Detecting compiler hash for triplet x64-linux... error: while detecting compiler information: ... CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127 (message): Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE= .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ... Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/ toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so each runner picks up its image's default clang (clang-15 on ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship). The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake is honoured by every reasonable clang version. Co-authored-by: Cursor <cursoragent@cursor.com> * Add C++ tests and coverage; fix linux build * tts-ggml: address PR review feedback Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments. - #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`. - #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list). - #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again. - #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise. - #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but- not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values). - #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run. - #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false". - #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now. - #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable. Nits: - 'use strict' added to addonLogging.js (matches every other .js). - node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js. - ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead. Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean. Co-authored-by: Cursor <cursoragent@cursor.com> * tts-ggml: unblock CI integration tests on every desktop runner Four independent failures, one per platform: 1. linux-x64 / linux-arm64: addon load crashed at `libomp.so.5: cannot open shared object file`. tts-cpp's binary is built with clang under the linux-clang toolchain and links against libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being apt-installed. Add `libomp5` so libomp.so.5 is on the loader path. 2. darwin-arm64: convert-models.sh aborted at line 200 with `hf_args[@]: unbound variable`. macOS's system bash is 3.2 which treats `"${arr[@]}"` as nounset access when the array is empty under `set -u`; with HF_TOKEN unset we hit it on every fresh runner. Use the `${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six call sites and add a header comment so the next maintainer doesn't accidentally regress. 3. darwin-x64: pip install bombed building `llvmlite` from source because the macos-15-large runner has no LLVM 15 development install. Root cause: librosa pulls in numba 0.65+, which stopped shipping darwin-x86_64 wheels for Python 3.12. Pin Python to 3.11 in the Setup Python step; 3.11 has prebuilt wheels for the entire numba/llvmlite/librosa stack on darwin-x64 and is fine for every other converter dependency. 4. windows-2022: ChatterboxModel::load threw `vk::createInstance: ErrorIncompatibleDriver`. Root cause: the addon's index.js::_validateConfig defaults `useGPU = true` when neither useGPU nor nGpuLayers is specified, so the test ran with n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance -> ErrorIncompatibleDriver on the runner's no-Vulkan-driver image. runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'` (set on the no-GPU matrix entries) and forces useGPU=false on exactly those runners; the other test runners (chatterbox-mtl, gpu-smoke, multiple-runs) already had this guard. Also documents the `mesa-vulkan-drivers` apt package (already pulled in) as the software ICD that lets the Vulkan-built prebuild's runtime backend probe enumerate at least one device on linux runners. Co-authored-by: Cursor <cursoragent@cursor.com> * tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit) Mobile build failed at `:app:createBundleReleaseJsAndAssets` with: SyntaxError: assets/testAssets/chatterbox-s3gen.gguf: Cannot create a string longer than 0x1fffffe8 characters Root cause: Metro's bundler reads every asset under `test/mobile/testAssets/` via `Buffer.toString()`. V8's max string length is 0x1fffffe8 (~512 MiB). chatterbox-s3gen.gguf is ~1 GiB even with --quant q4_0 because the s3gen converter only quantizes attention weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight tensors quantized" in the converter log). Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the limit) on mobile. Mobile Chatterbox tests degrade cleanly to `t.pass('Skipped: Chatterbox GGUFs not available')` via the existing `ensureChatterboxModels` helper -- it already returns { success: false } when the GGUFs aren't on disk. Cache key bumped to v2 so existing v1 cache entries (which include the chatterbox files) are evicted on the next run. Bundling Chatterbox on mobile requires either: - adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the JS-string read is skipped (then the s3gen file can flow through the bundle as a raw asset), or - pushing the chatterbox GGUFs to the device via `adb push` outside the bundle and surfacing the path through downloadModel.js's existing ANDROID_CANDIDATE_DIRS fallback. Both are outside the scope of this PR; documented inline above the cache step for the next maintainer. Co-authored-by: Cursor <cursoragent@cursor.com> * Bump hash of vcpkg * Consume vcpkg from tetherto repository * Fix integration tests failures in all platforms * Further fix tests * fix: Make useGPU flag more meaningful (#1953) * fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts * add gpu smoke test * resolve comments --------- Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local> * Update dependencies after monorepo directory changes * Further drop qvac-lib- prefix * Add CHANGELOG.md --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com> Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

…te concurrency Address non-blocking review nits on PR tetherto#2007: - aggregate-events: explain why a wire event carrying both error and cancelled signals resolves to error (closes brief open question tetherto#3). - kv-cache-session: doc-comment on deleteKvCacheState explaining the ordering guarantee under concurrent in-flight turns -- delete is wire-async, in-flight turns roll back idempotently when their commit probe finds the file gone (closes brief open question tetherto#4). Comments only; no behavior changes.

@opaninakuffo

…ache via KvCacheSession (#2007) * QVAC-18182 feat[api]: typed cancel outcomes on the wire + atomic KV-cache via KvCacheSession Builds on QVAC-18181's request lifecycle primitives (DisposableScope, RequestContext, RequestRegistry) to deliver the M2 milestone: - Typed cancel outcomes: `stopReason: "cancelled"` on `completionDone` events, and `InferenceCancelledError(requestId, partial)` thrown from CompletionRun promise-aggregates (`final` / `text` / `toolCalls` / `stats`). The wire stream still ends normally so iterating `run.events` is unaffected — the typed error lives on the aggregate promises that callers `await` for the final result. - KvCacheSession (`server/bare/plugins/llamacpp-completion/ops/ kv-cache-session.ts`) — single atomic owner of the three KV-cache layers (`cachedMessageCounts`, `initializedCaches`, on-disk `.bin` files). `beginTurn` / `commitTurn` / `rollback` collapse the three duplicated cleanup blocks in `completion-stream.ts` into one scope.defer hook. Cross-model administrative deletion lives at the module level as `deleteKvCacheState(...)`, called by the RPC `handleDeleteCache` handler. - Stop-button race close — `RequestRegistry` now keeps a bounded cancelled-before-begin map (128 entries, 30s TTL). A `cancel({ requestId })` that lands before the server's `begin(...)` ran is applied retroactively when begin lands, so same-tick stop clicks no longer disappear into the void. Internal-only — the wire surface for `cancel` is unchanged (Option A in the brief). Cursor rules updated in the same PR so the request-lifecycle and KV-cache topic docs stay in sync with the implementation. Tests: - unit: KvCacheSession (bareTest-gated, runs in the Bare consumer), RequestRegistry race + bounded-set eviction, completion-event schema cancelled cases. - e2e: cancellation-tests.ts adds three definitions — mid-stream cancel (events.stopReason === "cancelled", final rejects with InferenceCancelledError, partial.text matches concatenated contentDelta), cancel-before-begin (retroactive abort), and cancel-then-resume-kv-cache (rollback wiped the three layers, the next turn re-primes cleanly). * chore: drop planning labels (Mx/Dx) from QVAC-18182 comments Strips milestone (`M1`/`M2`/`M3a`...) and deliverable (`D2`/`D5`/`D7`) labels from comments and test titles introduced with the typed-cancel outcomes + KvCacheSession work. The substantive descriptions of the contracts (Stop-button race, cancelled-before-begin map, three-layer session ownership, etc.) are preserved; only the planning-doc references are removed so the code reads cleanly without the pitch context. Durable `QVAC-XXXXX` ticket references are kept. No behavior or API surface changes. * chore: drop Asana ticket references from QVAC-18182 code comments Strips QVAC-XXXXX inline ticket references from code/test comments introduced by the typed-cancel-outcomes work. Concept names (Stop-button race, cancelled-before-begin, etc.) and prose descriptions of the contracts are preserved; only the ticket-tag suffixes go. Also renames a test cache key from `qvac-18182-cancel-resume-kvcache` to `cancel-then-resume-kvcache` so the cache key reads as a stable identifier rather than a ticket reference. No behavior or API surface changes. * QVAC-18182 doc: clarify error>cancelled precedence + deleteKvCacheState concurrency Address non-blocking review nits on PR #2007: - aggregate-events: explain why a wire event carrying both error and cancelled signals resolves to error (closes brief open question #3). - kv-cache-session: doc-comment on deleteKvCacheState explaining the ordering guarantee under concurrent in-flight turns -- delete is wire-async, in-flight turns roll back idempotently when their commit probe finds the file gone (closes brief open question #4). Comments only; no behavior changes. * QVAC-18182 doc: demonstrate typed cancel outcomes in cancel example Enhance the existing cancel-by-request-id example to demonstrate the two M2 cancel-outcome channels: - run.events ends normally with completionDone carrying stopReason: "cancelled" -- show reading it inside the iteration loop. - run.text rejects with InferenceCancelledError(requestId, partial) on cancel -- show the instanceof check and consuming partial.text, partial.toolCalls, partial.stats. Also update the header to remove the now-stale "logged as a no-match" sentence (same-tick cancels are no longer dropped after M2's race close). Pure documentation enhancement; no API or behavior changes. * QVAC-18182 fix: address PR review — partial-prime cleanup + parent-aborted state Two follow-ups from Opanin's review on PR #2007: 1. KvCacheSession.beginTurn: if `primeIfMissing` throws after the addon has partially written a `.bin` to disk, the next `beginCustom` would `fsPromises.access(cachePath)` → true and trust the half-primed file as a valid cache (no rollback hook is registered yet — the handler hasn't seen the `TurnHandle`). Wrap both `beginCustom` and `beginAuto` prime calls in a shared `primeOrCleanup` helper that best-effort unlinks the partial file before re-throwing the original prime error. Adds a bare-only unit test asserting the on-disk file is removed and the init flag stays unset on the failed-prime path. 2. RequestRegistry.begin: when `parentSignal` was already aborted at begin time, line 271 aborts the controller but the `state` ternary still landed `"running"`, exactly the "momentarily-running with already-aborted signal" the preCancel branch was guarding against. Extend the ternary to cover both inputs and the existing `parentSignal already aborted` test now also asserts `ctx.state === "cancelling"`. No behavior change on the happy path. Lint + typecheck + 351-test unit suite green locally on the changed files. * QVAC-18182 fix: prime is atomic — addon writes to .prime.tmp + atomic rename Upgrade the previous reactive cleanup workaround (PR #2007 review by @opaninakuffo) into a proactive atomic-by-construction design: - The session steers `model.run({ saveSessionPath })` to a sibling `cachePath + ".prime.tmp"` path. - Only after the prime closure resolves successfully do we promote the temp file to the canonical `cachePath` via `fsPromises.rename` (atomic same-volume on every host we target). - The canonical cache path is therefore *never* observable in a partial state — a thrown prime is indistinguishable on disk from a never-attempted prime, so the next existence probe (in-process or cross-process worker restart) cannot trust corrupt bytes. Defensive details: - We unlink any leftover `.prime.tmp` *before* invoking the closure, so a deferred-write addon path can't accidentally promote stale-from-crash bytes left by a prior worker. - On prime success we probe the temp path before renaming. If the addon deferred its disk write (some llama.cpp paths flush lazily), the temp doesn't exist and we leave the canonical path absent — `verifySaveAndRecord` in `commitTurn` is the authoritative check. - On rename failure we unlink the temp and surface the rename error; rename atomicity guarantees the canonical path was untouched. Why this is better than the prior `primeOrCleanup`: - Best-effort `unlink` was load-bearing for correctness in the old design — a failed unlink left a half-primed canonical file the next `beginCustom` would trust. The new design moves the only possible "partial" file to a non-trusted name, so failed cleanup cannot corrupt the canonical name by construction. - The unit test no longer mocks the workaround surface; it asserts the actual invariant ("canonical path was never written") plus the positive rename and the leftover-sweep guarantees. Tests: 3 bare-only kv-cache-session unit tests (throw-leaves-canonical- untouched, success-promotes-via-rename, leftover-from-crash-is-swept). Lint + typecheck + 351-test unit suite green locally on the changed files. Long-term, the right fix is one layer down — the llama.cpp addon should write transactionally itself and surface save errors instead of swallowing them. When that lands, this helper collapses to a direct `prime(cachePath)` call and the `verifySaveAndRecord` access-probe fallback (TODO already documented) can be retired together. Filed as a separate follow-up; out of scope for this PR. * QVAC-18182 fix: replace prime-atomic helper with verifyPrimedFile post-prime probe Audit of the llama.cpp addon (`CacheManager::writeCacheFile` → `llama_state_save_file`, return value swallowed; `LlamaModel:: processPromptImpl` lines 575-599) shows the bug shape Opanin flagged on PR #2007 — "primeIfMissing throws after a partial save" — does not actually fire. The save call is the very last operation on the prefill path, the addon ignores its return value, and any earlier throw means no save was attempted. So: - `primeOrCleanup` (`ac8d2d74e`) and the upgrade to `primeAtomically` (`a7420f3e6`) defended against a code path that the addon does not produce. - The real corruption shape is silent partial writes (addon's `llama_state_save_file` returns false, addon ignores it, file is half-written or empty). Atomic temp+rename did NOT close this gap — on a "silent partial" the closure resolves successfully and the helper would happily promote the partial `.prime.tmp` to the canonical path. Replace both helpers with a small `verifyPrimedFile` that mirrors the existing `verifySaveAndRecord` access-probe pattern used at commit time, applied at prime time: - After a successful prime closure, `fsPromises.stat` the canonical path. If it doesn't exist (addon was interrupted before save) or has size 0 (addon save call produced an empty file), throw and best-effort unlink the empty leftover so the next existence probe doesn't trust it. - This catches the two failure modes Opanin's concern was a proxy for (cancelled-mid-prime; addon save quietly produced nothing) without claiming defense against partial-but-nonzero writes, which can only be closed at the addon layer. The `RequestRegistry` parent-aborted-state fix (`ctx.state` ternary covers `opts.parentSignal?.aborted`) from `ac8d2d74e` is preserved unchanged — it stands on its own as a correct response to Opanin's second comment. Long-term root cause stays the addon: have `CacheManager::writeCacheFile` check `llama_state_save_file`'s return value and throw on failure. When that lands, both `verifyPrimedFile` and `verifySaveAndRecord`'s access-probes can be retired together. Filed as a separate follow-up — out of scope for this PR. Tests: 3 prior bare-only prime-atomic tests removed; 2 new bare-only tests added (no-file and empty-file rejection paths). Lint + typecheck + 330-test unit suite green locally on the changed files (pre-existing sdcpp-generation lint errors unchanged). * QVAC-18182 doc: kv-cache rule documents addon non-transactional save + matched access-probes Extend the "Cache Initialization (primeIfMissing)" section in .cursor/rules/sdk/docs/kv-cache-system.mdc with the corrected addon-contract analysis: - The llama.cpp addon's CacheManager::writeCacheFile discards llama_state_save_file's bool return; maybeSaveCacheToDisk is the last call on the prefill path. So no closure-rejection path can coexist with a partial file on disk. - Document the four real outcomes as a table (interrupted / success / silent partial write / pre-eval throw) so future readers can see why the SDK takes the shape it does. - Pin both SDK-side defenses as a matched pair: verifyPrimedFile at prime time (added in this PR) and verifySaveAndRecord at commit time (existing). Both are honest about what they catch (missing / empty file) and what they don't (partial-but-nonzero, only addon fix can close that). - Reference the addon-layer follow-up (1214778658064488 / "throw on llama_state_save_file failure") so the next contributor knows both probes will be retired together when the addon throws on save failure. No code change — rule-only update.

- transcribe.ts: route the two `Transcription Update` debug emits through `requestLogger.debug` so they carry the per-request prefix, matching the rule's `grep "requestId=<id>"` invariant. Drop the now- unused module-level `logger`. Collapse two `scope.defer(async () => { await restorePrompt(...) })` wrappers to bare arrow callbacks (review tetherto#5, tetherto#10). - inference-handler-migrations.test.ts: add bareTest op-level cancel- by-requestId cases for `transcribe (whisper)` (asserts loop exit + addon.cancel called + reload-count == 2 to pin the `applyPrompt + restorePrompt runs exactly once` invariant) and `finetune` (asserts model.cancel called + scope unwind clears the runtime-state flag back to IDLE). Pin the NMT soft-cancel contract by instrumenting the addon and asserting addon.cancel was NOT called during a translate cancel (review tetherto#3, tetherto#7). - request-lifecycle-primitives.mdc: reconcile the "polling signal.aborted mid-handler" anti-pattern with the new "Per-iteration cancel check (M3b)" canonical pattern. The anti-pattern is *adding* the check when the addon already honours signal directly; the M3b pattern is *introducing* the check where the addon doesn't and the loop is the only soft-cancel exit (review tetherto#4).

* QVAC-18183 feat[api]: inference-handler migrations Migrate the four remaining inference handler kinds onto the RequestRegistry primitives shipped in M3a (cancel-capability declaration, per-kind concurrency policy, structured `[request-lifecycle]` logging). Each handler now opens a request-scoped `ManagedRequestContext`, threads the optional `requestId` from the wire request (falling back to a server-minted UUID), routes hard cancels to `addon.cancel()` at a single signal- listener leaf, and replaces ad-hoc `try/finally` cleanup with `scope.defer(...)` registrations so cleanup runs in LIFO order on every exit path. - `embed` (kind "embeddings", `{ scope: "model", hard: true }`): `packages/sdk/server/bare/ops/embed.ts` opens the context, threads `requestId` from `embedRequestSchema`, post-await `signal.aborted` checks raise `InferenceCancelledError`. - `transcribe` / `transcribeStream` (kind "transcribe", `{ scope: "model", hard: true }`): collapsed `try { ... } finally { restorePrompt(...) }` into `scope.defer(restorePrompt)`, added per-iteration `if (ctx.signal.aborted) break;` in the `response.iterate()` loop (Option A from §4 of the M3b brief — explicit, visible at the call site, no `takeWhileNotAborted` wrapper). - `translate` (kind "translate"): two engine branches. llamacpp-completion declares `{ scope: "model", hard: true }` and wires `signal → addon.cancel()`; nmtcpp-translation keeps `{ scope: "none" }` and soft-cancels inside both the streaming iterate loop and the `runBatch` early-return path. - `finetune` (kind "finetune"): flipped the llamacpp-completion manifest declaration from `{ scope: "none" }` to `{ scope: "model", hard: true }` (the addon already exposes `model.cancel()`). `startFinetune` opens a registry context and wires `signal → model.cancel()`; the two-level `try/finally` collapses into `scope.defer` for `clearFinetuneRuntimeState` and `handle.removeListener`. `cancelFinetune(modelId)` is now a thin wrapper over `getRequestRegistry().cancel({ modelId, kind: "finetune" })` — never invokes `model.cancel()` directly. Per §4 of the brief: per-iteration cancel granularity uses Option A (explicit `if (ctx.signal.aborted) break;` at the top of each streaming loop body). No `takeWhileNotAborted` wrapper was introduced. Per §7 anti-patterns: M3b adds zero `oneAtATimePerModel` policies (the four migrated kinds tolerate concurrent requests against the same model), leaves the M1 compat-fallback in `server/bare/ops/cancel.ts` untouched (M3d retires it), and does not modify `cancelHandler.ts`. Other changes: - `embed`, `transcribe`, `transcribeStream`, `translate`, `finetune` request schemas grow an optional `requestId` field (`.string().min(1).optional()`); server-side ops fall back to `generateServerRequestId()` when absent. - Whisper / Parakeet / LLM / NMT plugin handlers thread `request.requestId` into their bare ops. - `plugin-cancel-capability.test.ts` truth-table flipped for the `finetune` row. - New `inference-handler-migrations.test.ts` covers schema-level optional-`requestId` acceptance for all four kinds and pins the `[request-lifecycle] begin/cancel/end` line shape for each kind. The op-level cancel-by-requestId / cancel-by-modelId integration tests are bare-runtime-gated (the migrated ops pull `bare-crypto` / `bare-fs` transitively and can't load under Bun, same reason as `finetune-ops.test.disabled.ts`). - `.cursor/rules/sdk/request-lifecycle-primitives.mdc` and `.cursor/rules/sdk/docs/request-lifecycle-system.mdc` updated: M3b row marked shipped, finetune truth-table row flipped, canonical-handler-shape section refreshed to use `embed.ts` as the cleanest reference and to document the Option A per-iteration check. Verification: - `bun lint` (eslint + tsc --noEmit): green. - `bun run typecheck`: green. - `bun run test:unit`: every test file green except the pre-existing `client/rpc/rpc-client.ts` `#rpc` package-resolution failure on upstream/main (also reproducible without these changes; unrelated to M3b). * QVAC-18183 fix: address PR #2058 review feedback - transcribe.ts: route the two `Transcription Update` debug emits through `requestLogger.debug` so they carry the per-request prefix, matching the rule's `grep "requestId=<id>"` invariant. Drop the now- unused module-level `logger`. Collapse two `scope.defer(async () => { await restorePrompt(...) })` wrappers to bare arrow callbacks (review #5, #10). - inference-handler-migrations.test.ts: add bareTest op-level cancel- by-requestId cases for `transcribe (whisper)` (asserts loop exit + addon.cancel called + reload-count == 2 to pin the `applyPrompt + restorePrompt runs exactly once` invariant) and `finetune` (asserts model.cancel called + scope unwind clears the runtime-state flag back to IDLE). Pin the NMT soft-cancel contract by instrumenting the addon and asserting addon.cancel was NOT called during a translate cancel (review #3, #7). - request-lifecycle-primitives.mdc: reconcile the "polling signal.aborted mid-handler" anti-pattern with the new "Per-iteration cancel check (M3b)" canonical pattern. The anti-pattern is *adding* the check when the addon already honours signal directly; the M3b pattern is *introducing* the check where the addon doesn't and the loop is the only soft-cancel exit (review #4). * QVAC-18183 fix: drop unsafe `addon` re-narrowing in translate.ts onAbort Addresses opaninakuffo's review comment on #2058: `AnyModel.addon` is already typed as `AddonInterface | undefined` (see `server/bare/registry/model-registry.ts:17-20`), so the `as unknown as { addon?: { cancel?(jobId?: string): Promise<void> } }` cast was unnecessary. Matches the simpler pattern used by `embed.ts` and `transcribe.ts` for the same `onAbort` shape — keeps the four M3b-migrated ops uniform. * QVAC-18183 doc: trim internal milestone references from cursor rules + code comments Removed the "Migration Roadmap" table, "M1/M2/M3a-d" milestone labels, planning-brief decision references (Decision A/B.2, D1/D2), workspace-local paths (`tasks/release-0.11.0-planning/...`, `pitch-3-decisions.md`), and "in review" forward-references from the request-lifecycle cursor rules and the matching code comments in the bare ops, finetune wrapper, and the inference-migration tests. The canonical handler shape, anti-patterns, primitives reference, plugin cancel-capability truth table, and concurrency-policy / structured-logging sections all stay — only the internal milestone framing comes out.

* Try #1. Adding tokenizer proxy to provide vocab size. * Try #2. More fixes and logs. * Try #3. Limit device to only cpu or gpu. * Revert "Try #2. More fixes and logs." This reverts commit a461e69. * Revert "Try #1. Adding tokenizer proxy to provide vocab size." This reverts commit 9951195. * Fixing pipeline logging * Add more logs * Fixing bench logging * Add more error handling and logging * Improve error handling on the server. Added retry in case of context overflow. * Make retries self-adjustable * Adding some more checks and limiting the datasets temporarily * Test: trying to narrow down the error * Exclude failing datasets from embed benchmark * Clean up the code * Changing bench model for LLM * Try #1. Adding tokenizer proxy to provide vocab size. * Try #2. More fixes and logs. * Try #3. Limit device to only cpu or gpu. * Revert "Try #2. More fixes and logs." This reverts commit a461e69. * Revert "Try #1. Adding tokenizer proxy to provide vocab size." This reverts commit 9951195. * Fixing pipeline logging * Add more logs * Fixing bench logging * Add more error handling and logging * Improve error handling on the server. Added retry in case of context overflow. * Make retries self-adjustable * Adding some more checks and limiting the datasets temporarily * Test: trying to narrow down the error * Exclude failing datasets from embed benchmark * Clean up the code * Changing bench model for LLM * Minor fixes for clarity * Removing unused vars * Removing unused imports * Removing unused python deps --------- Co-authored-by: gianni <gianfranco.cordella@tether.io>

…peech (#1590) * feat: Add runStream() which takes input as a stream * add integration tests * uncomment cb tests * chore: Add cb streaming example * feat: Add TTS streaming funcitonality and example * Update tts addon version * Remove chatterbox example * add new error code for tts streaming fail * Move common code to util * fix: Use z.infer to define TextToSpeechStreamClientParams * Move TextToSpeechStreamSession to schemas * Track subscriber current index and trim queue when all subscribers consumed past items * add missing unit tests * fix: drive done promise from multicast pump lifecycle * fix: Forward chunkIndex and sentenceChunk in sentence-stream mode to client * fix: Use correct error code for tts stream failure * chore: Add supertonic stream test in tts-tests.ts * fix: Make tts client more readable * Remove closures and inline async generators * fix: Subscribe eagerly in sentenceStreamTts to avoid late-subscriber data loss TtsMulticast.pump() starts in a microtask on construction, while the returned async generators only call subscribe() when first iterated. If the consumer iterated one generator before the other, the first subscriber could trim the queue before the second ever registered, silently dropping earlier frames. Subscribe synchronously for both bufferStream and chunkUpdates before returning, so both subscriber indexes are in place before pump pushes its first item. Made-with: Cursor * fix: Close TTS stream on server-sent done frame Remove the dead `null` sentinel from `processTextToSpeechStreamLine` and instead close `parseTextToSpeechStreamLines` after yielding the terminal `done: true` frame, so consumers don't rely on the server closing the socket to stop iteration. Made-with: Cursor * fix: Reject sentenceStream without stream in textToSpeech Previously `sentenceStream: true` combined with `stream: false` fell through to the collect path, silently dropping the sentence-stream parameters and returning no `chunkUpdates`. Fail fast at the dispatcher with a clear error so the contract mismatch surfaces to the caller instead of being swallowed. Made-with: Cursor * fix: Release TtsMulticast subscriber slot on early break Wire a try/finally into drain() so that when a consumer breaks out of the for-await (or the generator is .return()'d / throws), the slot is parked at +Infinity via unsubscribe(). This prevents a stale low min-index from permanently pinning trimConsumed, which otherwise leaked the queue for the entire RPC stream. Made-with: Cursor * fix: Guard TTS stream write after close and preserve UTF-8 boundaries Client: - Track a `closed` flag in `textToSpeechStream` duplex session, set by `end()` / `destroy()`. Subsequent `write()` calls now throw a typed `TextToSpeechStreamFailedError` instead of propagating a raw Bare/Node "write after end" stream error. - `end()` is idempotent so accidental double-close no longer errors. Server: - `buffersToUtf8Fragments` previously decoded each incoming Buffer via `toString("utf8")`, which corrupts any multi-byte codepoint whose bytes straddle a chunk boundary (common with CJK / emoji / accented scripts emitted as LLM token deltas). Added a small tail-buffer that finds the last complete UTF-8 codepoint end in the combined buffer and defers trailing incomplete bytes to the next chunk. Any dangling partial sequence is flushed on stream end. Made-with: Cursor * fix: Order TEXT_TO_SPEECH_STREAM_FAILED code and document it - Move TEXT_TO_SPEECH_STREAM_FAILED (52415) to the end of the 52400 Model Operations block so the ordering in SDK_SERVER_ERROR_CODES matches the numeric sequence (…52413, 52414, 52415). - Add the missing row for 52415 to the (latest) errors.mdx table, per the sdk/docs-freshness rule that the error table stay in sync whenever a new code is introduced. Made-with: Cursor * fix: Register operation metrics for textToSpeechStream Only `textToSpeech` was registered in `operation-metrics.ts`, so the duplex `textToSpeechStream` path silently skipped `modelExecutionTime`, `audioDuration`, and `totalSamples` gauges even though the server already collects the same `TtsStats` via `collectTtsStats()` on the final chunk. Mirror the non-streaming registration so the streaming path has parity observability. Made-with: Cursor * fix: Harden TTS client done-promise, iterator, and parse cost Polish the remaining review nits on the TTS client streaming surface. - #3 TtsMulticast.pump now rejects the `done` promise with the fatal error instead of resolving `false`. An internal `.catch(() => {})` silences unhandled-rejection warnings when the caller only iterates the buffer/chunk streams and never awaits `done`; re-awaits still see the rejection. - #6 TextToSpeechStreamSession[Symbol.asyncIterator] no longer throws synchronously on a second iteration; it returns an iterator whose first `.next()` rejects, so `for await` surfaces the error in the normal async control flow rather than the iterator protocol. - #9 plainTtsBufferStream / collectTtsBuffer wrap the RPC loop in try/catch/finally so `done` always settles: resolve(true) on the terminal frame, reject with the real error on exceptions, and resolve(false) on early consumer break. Previously `await done` could hang forever when the consumer bailed out early. - #11 Skip per-frame ttsResponseSchema.parse() in all three paths; rely on the discriminated-union narrowing at the RPC boundary. Drops the per-PCM-frame Zod validation cost for large sentences. Made-with: Cursor * fix: Tighten textToSpeechStream schema surface - Add .positive() to maxBufferScalars and flushAfterMs to match the existing constraint on sentenceStreamMaxChunkScalars. Previously a caller could pass negative values straight through to the addon. - Un-export textToSpeechStreamRequestBaseSchema — consumers only need the finalized textToSpeechStreamRequestSchema, and the base is an implementation detail of the shared object shape. The exported type alias TextToSpeechStreamClientParams continues to derive from the base via `typeof`, so nothing on the public type surface changes. Made-with: Cursor * fix: Cross-platform tmp path and safer PCM append in TTS examples - playPcmInt16Chunk now writes the intermediate WAV chunk under os.tmpdir() / path.join instead of a hard-coded /tmp/qvac-tts-chunk-… path. The previous code's Windows branch was unreachable in practice because the POSIX /tmp directory doesn't exist there; this uses %TEMP% on Windows automatically. - appendPcmSamples switches from `target.push(...chunk.slice(i, end))` to `Array.prototype.push.apply(target, chunk.slice(i, end))`. Same semantics, but avoids allocating the spread rest array per batch and is closer to a memcpy-style concat in V8. Made-with: Cursor * fix: Catch zero-chunk regressions in TTS sentence-stream test - TtsExecutor.makeSentenceStream now returns `{ passed: false, ... }` when the chunkUpdates iterator yields no chunks / no samples. The previous executor always returned a formatted string regardless of counts, so a regression that silently emitted zero chunks would still have looked like a pass. - ttsSupertonicSentenceStream's expectation upgraded from `{ validation: "type", expectedType: "string" }` to `{ validation: "contains-all", contains: ["sentence-streamed", "chunks", "samples"] }`. The executor's zero-case failure string lacks "sentence-streamed", so the contains-all match fails on regression. Made-with: Cursor * fix: Apply stream default locally and throw typed error on tts mismatch Previous guard only rejected the explicit `stream: false + sentenceStream: true` combination. A caller passing `{ modelId, text, sentenceStream: true }` with `stream` omitted silently fell through to `collectTts` while the server's Zod `.default(true)` still ran the sentence-stream branch and emitted chunk frames — which the client then discarded, dropping all chunk metadata. - Resolve the `stream` default locally (`params.stream ?? true`) so the client's dispatch routing matches the server's Zod-applied routing, and an omitted `stream` now correctly lands in `sentenceStreamTts` or `plainStreamTts`. - Only the explicit `sentenceStream: true + stream: false` combination is rejected, and it now throws `TextToSpeechStreamFailedError` (code 52415) instead of a bare `new Error(...)` so callers can discriminate by error code like everywhere else in the SDK. Made-with: Cursor * remove inline defaults for sentenceStream and stream * Use TtsMulticast in unit test instead of mock --------- Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

…#1983) * feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp) New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg). API-compatible with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream consumers can swap backends without touching orchestration code. ## Scope * First iteration. Supports Chatterbox **English** only. Chatterbox multilingual, LavaSR enhancer, Supertonic engine, and streaming are out of scope and remain in `@qvac/tts-onnx`. They'll land alongside the evolution of qvac-tts.cpp. * Native backend is the static `qvac-tts` library from the QVAC vcpkg registry (`ports/tts-cpp`, baseline `2026-04-21`). No ONNX Runtime dependency. ## JS surface * `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as `ONNXTTS`: `run` / `runStream` / `runStreaming` / `reload` / `unload` / `destroy`. * `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` / `files.s3genModel` override the defaults. * Options: `referenceAudio`, `voiceDir` (baked profile), `seed`, `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for the upcoming streaming flags (`streamChunkTokens`, `streamFirstChunkTokens`, `cfmSteps`). * Shared reusable lib code (`lib/textChunker.js`, `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim from `@qvac/tts-onnx`. * New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000** to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both packages are loaded in the same Bare process. ## Native addon * `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` — `IModel` + `IModelCancel` implementation. First-iteration strategy: assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output path, call it synchronously, then parse the resulting 16-bit mono PCM wav back into `std::vector<int16_t>` for the JS handler. Consequences: every job re-loads the model (~700 ms + inference time), no mid-synthesis cancellation, no streaming. The follow-up milestone replaces this with a persistent, struct-based API once qvac-tts.cpp exposes one. * `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` — JS-to-C++ config bridging (same string-map pattern as `@qvac/tts-onnx`) and the `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing `createInstance` / `runJob` / `reload` / `activate` / `cancel` / `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`. * `addon/src/addon/AddonJs.hpp` — JS-facing `createInstance` / `runJob` / `reload` wrappers that register a `JsAudioOutputHandler` emitting `{ outputArray: Int16Array, sampleRate: number }` to JS. ## Build / registry * `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)` and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape matches `@qvac/transcription-whispercpp`). * `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough) plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`. * `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg. NOTE: the baseline pin here is inherited from `@qvac/transcription-whispercpp` and **must be bumped** to a commit that contains the `tts-cpp` port once that registry PR lands. A follow-up commit will update it. ## Tests & examples * Integration + unit test files for Chatterbox English are copied verbatim from `@qvac/tts-onnx` with only mechanical renames (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`, `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`). Some paths in `test/integration/addon.test.js` still import Supertonic / LavaSR helpers that don't exist in this package — those test blocks will fail fast when the file loads, which is expected until those backends get their own ggml packages. * Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus shared `wav-helper.js` + `pcm-chunk-player.js`. ## What's not in this PR (known gaps) * No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes will land in a single documentation pass once the registry + fork commits have merged upstream. * `vcpkg-configuration.json` baseline needs to point at a qvac-registry-vcpkg commit that ships `tts-cpp` (pending the registry PR). * Actual `npm run build` requires the registry and fork commits to be on `main` of their respective upstream repos. * chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg at commit 1e2839680b6be8d8ffff889a9c29b966c176098c — the commit that adds the `tts-cpp` port. Paired with the `qvac-tts` library already pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp @ 0fe4a521618cc30358040b29d75d4261b31cbb60). Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry PR lands upstream. * chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper Second pass over @qvac/tts-ggml after the build started passing: prune everything that only made sense for the ONNX-era multi-engine scope and adapt the remaining Chatterbox-English bits to the GGUF + file-path reference-audio contract. Restores `test/mobile/` so the Android build has something to point at. ## C++ * `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment contained `**/` which closed the block comment early and broke the build. Rewrote as a `//` comment. ## Examples * `examples/chatterbox-tts.js` — rewrite for v0 contract: single `<text>` argv, `files: { modelDir }` pointing at the two GGUFs, `referenceAudio` is now a wav **path** (addon passes it to `--reference-audio`) instead of a Float32Array. Drops english/multilingual arg and the CHATTERBOX_VARIANT switch that picked which `.onnx` files to load. * Removed `examples/chatterbox-streaming-tts.js` + `examples/pcm-chunk-player.js`. The v0 addon re-loads the model per `run()` call — exposing streaming would mislead. Both come back alongside the persistent-engine milestone. * `package.json`: `npm run example` now passes a default text so it runs without extra args. ## Tests ### Kept as-is (engine-agnostic) * `test/unit/textChunker.test.js` * `test/mock/{MockedBinding,utils}.js` * `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js` * `test/reference-audio/jfk.wav`, `test/data/sentences-*.js` ### Mechanical fixes * `test/unit/tts.error.test.js` — fix error-code assertions to the tts-ggml range (`13001–14000`); was still checking the `@qvac/tts-onnx` range (`7001–7011`). * `test/unit/tts-ggml.lifecycle.test.js` — fix stale `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the non-existent `engine: 'chatterbox'` option. * `test/unit/tts-ggml.sentence-stream.test.js` — same GGUF/engine cleanup. ### Rewritten * `test/unit/chatterbox.inference.test.js` — drop tests that asserted the old ONNX file shape (`tokenizer / speechEncoder / embedTokens / conditionalDecoder / languageModel`), the removed `engine` detection and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`). New tests cover: `modelDir` derives the two GGUF paths; explicit `t3Model` / `s3genModel` override the defaults. The mocked-binding run/reload/cancel flow stays. * `test/integration/addon.test.js` — fresh, ~180 LoC, Chatterbox-English only. Ensures the GGUFs are present, runs the short sentence set through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and (on darwin only) runs a whisper-based WER check via the existing `runWhisper` util. Drops the Chatterbox-multilingual block + every Supertonic + LavaSR block that doesn't apply to this package. * `test/utils/runChatterboxTTS.js` — rewrite for the GGUF contract: `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a file path that falls back to `test/reference-audio/jfk.wav` (or the mobile test-asset when `global.assetPaths` is present). No more WAV decode / resample on the JS side. * `test/utils/downloadModel.js` — trim from 1007 LoC to 280. Drops the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie downloaders. Keeps the shared HTTP/curl infrastructure and `ensureWhisperModel` (still used by the integration WER check). `ensureChatterboxModels` is now **check-only**: it verifies `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally and, if missing, prints the exact commands for generating them from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts. Once the GGUFs land on a canonical HuggingFace repo we'll wire up download URLs here. ## Scripts * `scripts/ensure-chatterbox.js` — simplify to a single invocation against `./models/`. Drops the variant / language matrix that the ONNX downloader needed. * `scripts/ensure-models.js` — now a thin alias to `ensure-chatterbox.js`. Drops the Supertonic + LavaSR orchestration. ## Mobile * Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs, testAssets/jfk.wav}` so the Android build has a wrapper to point at. * `package.json`: re-added `test/mobile` to the `files` list. ## Gitignore * Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp` (produced by the top-level `configure_file(...)` calls) and `build_*/` dirs (bare-make convention). ## Verified locally * `npx standard "test/**/*.js" "*.js" "lib/*.js"` — clean. * `npm run test:unit` — 38/38 pass (105/105 asserts). * `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."` produces a 24 kHz wav as expected. * Add streaming support * Update ggml backend to use separate ggml repo * tts-ggml: consume renamed tts-cpp library (2026-04-24#1) Upstream chatterbox.cpp renamed the package + namespace + target from qvac-tts to tts-cpp and tightened the library boundary; pick up the new artefacts here: - find_package(qvac-tts-cpp CONFIG REQUIRED) -> find_package(tts-cpp CONFIG REQUIRED) - qvac-tts::qvac-tts -> tts-cpp::tts-cpp - qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions, SynthesisResult, forward-decls in ChatterboxModel.hpp) - #include <qvac-tts/chatterbox/engine.h> -> #include <tts-cpp/chatterbox/engine.h> - Doxygen / inline doc references to the old names refreshed alongside the code changes. vcpkg wiring: - vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg commit bc30b0b (ports/tts-cpp renamed and repointed at chatterbox.cpp@f8f9145). - vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that carries the rename + namespace + install(EXPORT) changes). Verified with a cold bare-make generate + bare-make build against the new port, and the addon's existing unit + integration test suites. Made-with: Cursor * tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline Picks up the round-3 review-fix wave landed on the tts-cpp port: e673182 scrub stale patches/ refs from README (N10) 8ba10a6 drop unreachable TTS_CPP_GGML_LIB_PREFIX block (N8) 4b5d2d7 mirror N1-N7 fixes from chatterbox.cpp source-of-truth - N1 supertonic alive-registry guard against freed-backend gallocr_free assert on hot-swap (Vulkan/Metal/CUDA) - N2 drop dead g_sink_* state, soften log_set docstring - N3 Turbo BPE try/catch (exception-safe Engine ctor) - N4 STFT cancel checkpoint + tighter Engine::cancel() doc - N5 document s3gen_preload/unload refcount semantics - N6 drop dead cached_text_lc Supertonic shim - N7 fix misleading "no copy" view-vs-copy log wording Plus the integrated-port-only round-2 fixes that landed earlier: fa0d490 close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML now defaults ON; bundled-without-patches hard-errors at configure time with a pointer at the ggml-speech vcpkg port. ae34c58 README rewritten for integrated/vcpkg context. a2f2dd6 top-level qvac-ext-lib-whisper.cpp README points at the tts-cpp/ subtree (alongside parakeet-cpp/). Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine / EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is backward-compatible: the new port adds Engine::backend_name(), MTL-variant fields on EngineOptions (language / cfg_weight / min_p / exaggeration), and a separate tts_cpp::supertonic::Engine class, but nothing this consumer was already calling has changed. Edits: packages/tts-ggml/vcpkg.json - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07. packages/tts-ggml/vcpkg-configuration.json - default-registry baseline: bc30b0b (April 2026 fork-only state) -> 16b91afdcfd59baea60e81f3da94f49311ef2a97. The new baseline pulls in the post-tetherto-merge state (parakeet-cpp port at 932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new tts-cpp port (16b91af) on the developer's GustavoA1604 registry fork. Smoke-test plan: after running `vcpkg install` against the new baseline, the tts-cpp port's vcpkg_from_github resolves at GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the upstream PR merges. ChatterboxModel should build and synthesize identically; expanding to Multilingual + Supertonic flows is the follow-up commit on the package side. Co-authored-by: Cursor <cursoragent@cursor.com> * Add chatterbox multilingual and supertonic * Add mobile integration tests * tts-ggml: drop clang-19 pin in linux-clang toolchain The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary names) since the package's first commit (0a2c978). Linux CI hadn't exercised this path before — the new on-pr-tts-ggml.yml -> integration matrix is the first time it does, and it fails on every linux runner (ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's "detect_compiler" step because none of the GH-hosted images ship a `clang-19` symlink: Detecting compiler hash for triplet x64-linux... error: while detecting compiler information: ... CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127 (message): Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE= .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ... Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/ toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so each runner picks up its image's default clang (clang-15 on ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship). The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake is honoured by every reasonable clang version. Co-authored-by: Cursor <cursoragent@cursor.com> * Add C++ tests and coverage; fix linux build * tts-ggml: address PR review feedback Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments. - #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`. - #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list). - #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again. - #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise. - #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but- not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values). - #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run. - #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false". - #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now. - #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable. Nits: - 'use strict' added to addonLogging.js (matches every other .js). - node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js. - ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead. Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean. Co-authored-by: Cursor <cursoragent@cursor.com> * tts-ggml: unblock CI integration tests on every desktop runner Four independent failures, one per platform: 1. linux-x64 / linux-arm64: addon load crashed at `libomp.so.5: cannot open shared object file`. tts-cpp's binary is built with clang under the linux-clang toolchain and links against libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being apt-installed. Add `libomp5` so libomp.so.5 is on the loader path. 2. darwin-arm64: convert-models.sh aborted at line 200 with `hf_args[@]: unbound variable`. macOS's system bash is 3.2 which treats `"${arr[@]}"` as nounset access when the array is empty under `set -u`; with HF_TOKEN unset we hit it on every fresh runner. Use the `${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six call sites and add a header comment so the next maintainer doesn't accidentally regress. 3. darwin-x64: pip install bombed building `llvmlite` from source because the macos-15-large runner has no LLVM 15 development install. Root cause: librosa pulls in numba 0.65+, which stopped shipping darwin-x86_64 wheels for Python 3.12. Pin Python to 3.11 in the Setup Python step; 3.11 has prebuilt wheels for the entire numba/llvmlite/librosa stack on darwin-x64 and is fine for every other converter dependency. 4. windows-2022: ChatterboxModel::load threw `vk::createInstance: ErrorIncompatibleDriver`. Root cause: the addon's index.js::_validateConfig defaults `useGPU = true` when neither useGPU nor nGpuLayers is specified, so the test ran with n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance -> ErrorIncompatibleDriver on the runner's no-Vulkan-driver image. runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'` (set on the no-GPU matrix entries) and forces useGPU=false on exactly those runners; the other test runners (chatterbox-mtl, gpu-smoke, multiple-runs) already had this guard. Also documents the `mesa-vulkan-drivers` apt package (already pulled in) as the software ICD that lets the Vulkan-built prebuild's runtime backend probe enumerate at least one device on linux runners. Co-authored-by: Cursor <cursoragent@cursor.com> * tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit) Mobile build failed at `:app:createBundleReleaseJsAndAssets` with: SyntaxError: assets/testAssets/chatterbox-s3gen.gguf: Cannot create a string longer than 0x1fffffe8 characters Root cause: Metro's bundler reads every asset under `test/mobile/testAssets/` via `Buffer.toString()`. V8's max string length is 0x1fffffe8 (~512 MiB). chatterbox-s3gen.gguf is ~1 GiB even with --quant q4_0 because the s3gen converter only quantizes attention weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight tensors quantized" in the converter log). Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the limit) on mobile. Mobile Chatterbox tests degrade cleanly to `t.pass('Skipped: Chatterbox GGUFs not available')` via the existing `ensureChatterboxModels` helper -- it already returns { success: false } when the GGUFs aren't on disk. Cache key bumped to v2 so existing v1 cache entries (which include the chatterbox files) are evicted on the next run. Bundling Chatterbox on mobile requires either: - adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the JS-string read is skipped (then the s3gen file can flow through the bundle as a raw asset), or - pushing the chatterbox GGUFs to the device via `adb push` outside the bundle and surfacing the path through downloadModel.js's existing ANDROID_CANDIDATE_DIRS fallback. Both are outside the scope of this PR; documented inline above the cache step for the next maintainer. Co-authored-by: Cursor <cursoragent@cursor.com> * Bump hash of vcpkg * Consume vcpkg from tetherto repository * Fix integration tests failures in all platforms * Further fix tests * fix: Make useGPU flag more meaningful (#1953) * fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts * add gpu smoke test * resolve comments --------- Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local> * Update dependencies after monorepo directory changes * Further drop qvac-lib- prefix * Add CHANGELOG.md --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com> Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

@opaninakuffo

…ache via KvCacheSession (#2007) * QVAC-18182 feat[api]: typed cancel outcomes on the wire + atomic KV-cache via KvCacheSession Builds on QVAC-18181's request lifecycle primitives (DisposableScope, RequestContext, RequestRegistry) to deliver the M2 milestone: - Typed cancel outcomes: `stopReason: "cancelled"` on `completionDone` events, and `InferenceCancelledError(requestId, partial)` thrown from CompletionRun promise-aggregates (`final` / `text` / `toolCalls` / `stats`). The wire stream still ends normally so iterating `run.events` is unaffected — the typed error lives on the aggregate promises that callers `await` for the final result. - KvCacheSession (`server/bare/plugins/llamacpp-completion/ops/ kv-cache-session.ts`) — single atomic owner of the three KV-cache layers (`cachedMessageCounts`, `initializedCaches`, on-disk `.bin` files). `beginTurn` / `commitTurn` / `rollback` collapse the three duplicated cleanup blocks in `completion-stream.ts` into one scope.defer hook. Cross-model administrative deletion lives at the module level as `deleteKvCacheState(...)`, called by the RPC `handleDeleteCache` handler. - Stop-button race close — `RequestRegistry` now keeps a bounded cancelled-before-begin map (128 entries, 30s TTL). A `cancel({ requestId })` that lands before the server's `begin(...)` ran is applied retroactively when begin lands, so same-tick stop clicks no longer disappear into the void. Internal-only — the wire surface for `cancel` is unchanged (Option A in the brief). Cursor rules updated in the same PR so the request-lifecycle and KV-cache topic docs stay in sync with the implementation. Tests: - unit: KvCacheSession (bareTest-gated, runs in the Bare consumer), RequestRegistry race + bounded-set eviction, completion-event schema cancelled cases. - e2e: cancellation-tests.ts adds three definitions — mid-stream cancel (events.stopReason === "cancelled", final rejects with InferenceCancelledError, partial.text matches concatenated contentDelta), cancel-before-begin (retroactive abort), and cancel-then-resume-kv-cache (rollback wiped the three layers, the next turn re-primes cleanly). * chore: drop planning labels (Mx/Dx) from QVAC-18182 comments Strips milestone (`M1`/`M2`/`M3a`...) and deliverable (`D2`/`D5`/`D7`) labels from comments and test titles introduced with the typed-cancel outcomes + KvCacheSession work. The substantive descriptions of the contracts (Stop-button race, cancelled-before-begin map, three-layer session ownership, etc.) are preserved; only the planning-doc references are removed so the code reads cleanly without the pitch context. Durable `QVAC-XXXXX` ticket references are kept. No behavior or API surface changes. * chore: drop Asana ticket references from QVAC-18182 code comments Strips QVAC-XXXXX inline ticket references from code/test comments introduced by the typed-cancel-outcomes work. Concept names (Stop-button race, cancelled-before-begin, etc.) and prose descriptions of the contracts are preserved; only the ticket-tag suffixes go. Also renames a test cache key from `qvac-18182-cancel-resume-kvcache` to `cancel-then-resume-kvcache` so the cache key reads as a stable identifier rather than a ticket reference. No behavior or API surface changes. * QVAC-18182 doc: clarify error>cancelled precedence + deleteKvCacheState concurrency Address non-blocking review nits on PR #2007: - aggregate-events: explain why a wire event carrying both error and cancelled signals resolves to error (closes brief open question #3). - kv-cache-session: doc-comment on deleteKvCacheState explaining the ordering guarantee under concurrent in-flight turns -- delete is wire-async, in-flight turns roll back idempotently when their commit probe finds the file gone (closes brief open question #4). Comments only; no behavior changes. * QVAC-18182 doc: demonstrate typed cancel outcomes in cancel example Enhance the existing cancel-by-request-id example to demonstrate the two M2 cancel-outcome channels: - run.events ends normally with completionDone carrying stopReason: "cancelled" -- show reading it inside the iteration loop. - run.text rejects with InferenceCancelledError(requestId, partial) on cancel -- show the instanceof check and consuming partial.text, partial.toolCalls, partial.stats. Also update the header to remove the now-stale "logged as a no-match" sentence (same-tick cancels are no longer dropped after M2's race close). Pure documentation enhancement; no API or behavior changes. * QVAC-18182 fix: address PR review — partial-prime cleanup + parent-aborted state Two follow-ups from Opanin's review on PR #2007: 1. KvCacheSession.beginTurn: if `primeIfMissing` throws after the addon has partially written a `.bin` to disk, the next `beginCustom` would `fsPromises.access(cachePath)` → true and trust the half-primed file as a valid cache (no rollback hook is registered yet — the handler hasn't seen the `TurnHandle`). Wrap both `beginCustom` and `beginAuto` prime calls in a shared `primeOrCleanup` helper that best-effort unlinks the partial file before re-throwing the original prime error. Adds a bare-only unit test asserting the on-disk file is removed and the init flag stays unset on the failed-prime path. 2. RequestRegistry.begin: when `parentSignal` was already aborted at begin time, line 271 aborts the controller but the `state` ternary still landed `"running"`, exactly the "momentarily-running with already-aborted signal" the preCancel branch was guarding against. Extend the ternary to cover both inputs and the existing `parentSignal already aborted` test now also asserts `ctx.state === "cancelling"`. No behavior change on the happy path. Lint + typecheck + 351-test unit suite green locally on the changed files. * QVAC-18182 fix: prime is atomic — addon writes to .prime.tmp + atomic rename Upgrade the previous reactive cleanup workaround (PR #2007 review by @opaninakuffo) into a proactive atomic-by-construction design: - The session steers `model.run({ saveSessionPath })` to a sibling `cachePath + ".prime.tmp"` path. - Only after the prime closure resolves successfully do we promote the temp file to the canonical `cachePath` via `fsPromises.rename` (atomic same-volume on every host we target). - The canonical cache path is therefore *never* observable in a partial state — a thrown prime is indistinguishable on disk from a never-attempted prime, so the next existence probe (in-process or cross-process worker restart) cannot trust corrupt bytes. Defensive details: - We unlink any leftover `.prime.tmp` *before* invoking the closure, so a deferred-write addon path can't accidentally promote stale-from-crash bytes left by a prior worker. - On prime success we probe the temp path before renaming. If the addon deferred its disk write (some llama.cpp paths flush lazily), the temp doesn't exist and we leave the canonical path absent — `verifySaveAndRecord` in `commitTurn` is the authoritative check. - On rename failure we unlink the temp and surface the rename error; rename atomicity guarantees the canonical path was untouched. Why this is better than the prior `primeOrCleanup`: - Best-effort `unlink` was load-bearing for correctness in the old design — a failed unlink left a half-primed canonical file the next `beginCustom` would trust. The new design moves the only possible "partial" file to a non-trusted name, so failed cleanup cannot corrupt the canonical name by construction. - The unit test no longer mocks the workaround surface; it asserts the actual invariant ("canonical path was never written") plus the positive rename and the leftover-sweep guarantees. Tests: 3 bare-only kv-cache-session unit tests (throw-leaves-canonical- untouched, success-promotes-via-rename, leftover-from-crash-is-swept). Lint + typecheck + 351-test unit suite green locally on the changed files. Long-term, the right fix is one layer down — the llama.cpp addon should write transactionally itself and surface save errors instead of swallowing them. When that lands, this helper collapses to a direct `prime(cachePath)` call and the `verifySaveAndRecord` access-probe fallback (TODO already documented) can be retired together. Filed as a separate follow-up; out of scope for this PR. * QVAC-18182 fix: replace prime-atomic helper with verifyPrimedFile post-prime probe Audit of the llama.cpp addon (`CacheManager::writeCacheFile` → `llama_state_save_file`, return value swallowed; `LlamaModel:: processPromptImpl` lines 575-599) shows the bug shape Opanin flagged on PR #2007 — "primeIfMissing throws after a partial save" — does not actually fire. The save call is the very last operation on the prefill path, the addon ignores its return value, and any earlier throw means no save was attempted. So: - `primeOrCleanup` (`ac8d2d74e`) and the upgrade to `primeAtomically` (`a7420f3e6`) defended against a code path that the addon does not produce. - The real corruption shape is silent partial writes (addon's `llama_state_save_file` returns false, addon ignores it, file is half-written or empty). Atomic temp+rename did NOT close this gap — on a "silent partial" the closure resolves successfully and the helper would happily promote the partial `.prime.tmp` to the canonical path. Replace both helpers with a small `verifyPrimedFile` that mirrors the existing `verifySaveAndRecord` access-probe pattern used at commit time, applied at prime time: - After a successful prime closure, `fsPromises.stat` the canonical path. If it doesn't exist (addon was interrupted before save) or has size 0 (addon save call produced an empty file), throw and best-effort unlink the empty leftover so the next existence probe doesn't trust it. - This catches the two failure modes Opanin's concern was a proxy for (cancelled-mid-prime; addon save quietly produced nothing) without claiming defense against partial-but-nonzero writes, which can only be closed at the addon layer. The `RequestRegistry` parent-aborted-state fix (`ctx.state` ternary covers `opts.parentSignal?.aborted`) from `ac8d2d74e` is preserved unchanged — it stands on its own as a correct response to Opanin's second comment. Long-term root cause stays the addon: have `CacheManager::writeCacheFile` check `llama_state_save_file`'s return value and throw on failure. When that lands, both `verifyPrimedFile` and `verifySaveAndRecord`'s access-probes can be retired together. Filed as a separate follow-up — out of scope for this PR. Tests: 3 prior bare-only prime-atomic tests removed; 2 new bare-only tests added (no-file and empty-file rejection paths). Lint + typecheck + 330-test unit suite green locally on the changed files (pre-existing sdcpp-generation lint errors unchanged). * QVAC-18182 doc: kv-cache rule documents addon non-transactional save + matched access-probes Extend the "Cache Initialization (primeIfMissing)" section in .cursor/rules/sdk/docs/kv-cache-system.mdc with the corrected addon-contract analysis: - The llama.cpp addon's CacheManager::writeCacheFile discards llama_state_save_file's bool return; maybeSaveCacheToDisk is the last call on the prefill path. So no closure-rejection path can coexist with a partial file on disk. - Document the four real outcomes as a table (interrupted / success / silent partial write / pre-eval throw) so future readers can see why the SDK takes the shape it does. - Pin both SDK-side defenses as a matched pair: verifyPrimedFile at prime time (added in this PR) and verifySaveAndRecord at commit time (existing). Both are honest about what they catch (missing / empty file) and what they don't (partial-but-nonzero, only addon fix can close that). - Reference the addon-layer follow-up (1214778658064488 / "throw on llama_state_save_file failure") so the next contributor knows both probes will be retired together when the addon throws on save failure. No code change — rule-only update.

* QVAC-18183 feat[api]: inference-handler migrations Migrate the four remaining inference handler kinds onto the RequestRegistry primitives shipped in M3a (cancel-capability declaration, per-kind concurrency policy, structured `[request-lifecycle]` logging). Each handler now opens a request-scoped `ManagedRequestContext`, threads the optional `requestId` from the wire request (falling back to a server-minted UUID), routes hard cancels to `addon.cancel()` at a single signal- listener leaf, and replaces ad-hoc `try/finally` cleanup with `scope.defer(...)` registrations so cleanup runs in LIFO order on every exit path. - `embed` (kind "embeddings", `{ scope: "model", hard: true }`): `packages/sdk/server/bare/ops/embed.ts` opens the context, threads `requestId` from `embedRequestSchema`, post-await `signal.aborted` checks raise `InferenceCancelledError`. - `transcribe` / `transcribeStream` (kind "transcribe", `{ scope: "model", hard: true }`): collapsed `try { ... } finally { restorePrompt(...) }` into `scope.defer(restorePrompt)`, added per-iteration `if (ctx.signal.aborted) break;` in the `response.iterate()` loop (Option A from §4 of the M3b brief — explicit, visible at the call site, no `takeWhileNotAborted` wrapper). - `translate` (kind "translate"): two engine branches. llamacpp-completion declares `{ scope: "model", hard: true }` and wires `signal → addon.cancel()`; nmtcpp-translation keeps `{ scope: "none" }` and soft-cancels inside both the streaming iterate loop and the `runBatch` early-return path. - `finetune` (kind "finetune"): flipped the llamacpp-completion manifest declaration from `{ scope: "none" }` to `{ scope: "model", hard: true }` (the addon already exposes `model.cancel()`). `startFinetune` opens a registry context and wires `signal → model.cancel()`; the two-level `try/finally` collapses into `scope.defer` for `clearFinetuneRuntimeState` and `handle.removeListener`. `cancelFinetune(modelId)` is now a thin wrapper over `getRequestRegistry().cancel({ modelId, kind: "finetune" })` — never invokes `model.cancel()` directly. Per §4 of the brief: per-iteration cancel granularity uses Option A (explicit `if (ctx.signal.aborted) break;` at the top of each streaming loop body). No `takeWhileNotAborted` wrapper was introduced. Per §7 anti-patterns: M3b adds zero `oneAtATimePerModel` policies (the four migrated kinds tolerate concurrent requests against the same model), leaves the M1 compat-fallback in `server/bare/ops/cancel.ts` untouched (M3d retires it), and does not modify `cancelHandler.ts`. Other changes: - `embed`, `transcribe`, `transcribeStream`, `translate`, `finetune` request schemas grow an optional `requestId` field (`.string().min(1).optional()`); server-side ops fall back to `generateServerRequestId()` when absent. - Whisper / Parakeet / LLM / NMT plugin handlers thread `request.requestId` into their bare ops. - `plugin-cancel-capability.test.ts` truth-table flipped for the `finetune` row. - New `inference-handler-migrations.test.ts` covers schema-level optional-`requestId` acceptance for all four kinds and pins the `[request-lifecycle] begin/cancel/end` line shape for each kind. The op-level cancel-by-requestId / cancel-by-modelId integration tests are bare-runtime-gated (the migrated ops pull `bare-crypto` / `bare-fs` transitively and can't load under Bun, same reason as `finetune-ops.test.disabled.ts`). - `.cursor/rules/sdk/request-lifecycle-primitives.mdc` and `.cursor/rules/sdk/docs/request-lifecycle-system.mdc` updated: M3b row marked shipped, finetune truth-table row flipped, canonical-handler-shape section refreshed to use `embed.ts` as the cleanest reference and to document the Option A per-iteration check. Verification: - `bun lint` (eslint + tsc --noEmit): green. - `bun run typecheck`: green. - `bun run test:unit`: every test file green except the pre-existing `client/rpc/rpc-client.ts` `#rpc` package-resolution failure on upstream/main (also reproducible without these changes; unrelated to M3b). * QVAC-18183 fix: address PR #2058 review feedback - transcribe.ts: route the two `Transcription Update` debug emits through `requestLogger.debug` so they carry the per-request prefix, matching the rule's `grep "requestId=<id>"` invariant. Drop the now- unused module-level `logger`. Collapse two `scope.defer(async () => { await restorePrompt(...) })` wrappers to bare arrow callbacks (review #5, #10). - inference-handler-migrations.test.ts: add bareTest op-level cancel- by-requestId cases for `transcribe (whisper)` (asserts loop exit + addon.cancel called + reload-count == 2 to pin the `applyPrompt + restorePrompt runs exactly once` invariant) and `finetune` (asserts model.cancel called + scope unwind clears the runtime-state flag back to IDLE). Pin the NMT soft-cancel contract by instrumenting the addon and asserting addon.cancel was NOT called during a translate cancel (review #3, #7). - request-lifecycle-primitives.mdc: reconcile the "polling signal.aborted mid-handler" anti-pattern with the new "Per-iteration cancel check (M3b)" canonical pattern. The anti-pattern is *adding* the check when the addon already honours signal directly; the M3b pattern is *introducing* the check where the addon doesn't and the loop is the only soft-cancel exit (review #4). * QVAC-18183 fix: drop unsafe `addon` re-narrowing in translate.ts onAbort Addresses opaninakuffo's review comment on #2058: `AnyModel.addon` is already typed as `AddonInterface | undefined` (see `server/bare/registry/model-registry.ts:17-20`), so the `as unknown as { addon?: { cancel?(jobId?: string): Promise<void> } }` cast was unnecessary. Matches the simpler pattern used by `embed.ts` and `transcribe.ts` for the same `onAbort` shape — keeps the four M3b-migrated ops uniform. * QVAC-18183 doc: trim internal milestone references from cursor rules + code comments Removed the "Migration Roadmap" table, "M1/M2/M3a-d" milestone labels, planning-brief decision references (Decision A/B.2, D1/D2), workspace-local paths (`tasks/release-0.11.0-planning/...`, `pitch-3-decisions.md`), and "in review" forward-references from the request-lifecycle cursor rules and the matching code comments in the bare ops, finetune wrapper, and the inference-migration tests. The canonical handler shape, anti-patterns, primitives reference, plugin cancel-capability truth table, and concurrency-policy / structured-logging sections all stay — only the internal milestone framing comes out.

Lifecycle correctness: - Spawn lock: steal only when the owner pid is dead (with an mtime fallback for an unreadable lock), so a legitimate multi-minute cold start no longer loses its lock after 30s and spawns a duplicate runner/serve (tetherto#1). - close(): the fetch path now bails out instead of re-resolving once closed, so a request racing close() can't silently re-add a consumer / spawn a runner (tetherto#3). - sweepServes: when an orphaned serve's pid is alive but its health check fails, keep the record instead of dropping it — dropping stranded a live serve with no registry trace. We only reap once it answers as ours, or drop once its pid dies (tetherto#4). - servePort: fold a pinned port into the fleet key so pinned-port callers don't reuse an auto-allocated serve on a different port, and distinct pins don't collide (tetherto#5). - Respawn: expose baseURL/port/pid as getters over live state, updated on every reconnect, so diagnostics/external clients see the real serve after recovery (tetherto#6). - retargetUrl now handles Request inputs (not just string/URL) so a respawn stays transparent if the SDK ever switches input shapes (tetherto#8). Docs: - README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend liveness; document the long-lived-sentinel/wrapper pattern and fix the misleading "the script doesn't have to stay running" note (tetherto#2). - Reconcile version wording: README/changelog now describe managed mode as unreleased (package is 0.1.0); docs-site integration page documents managed mode + the async overload (tetherto#7). Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the runner-dead + serve-alive + health-failing sweep case. Build + suite green (60 pass / 1 integration skip).

* feat[api]: add managed mode to @qvac/ai-sdk-provider (QVAC-19900) Add `mode: 'managed'` so the provider can synthesize an ephemeral qvac.config.json from a model-constant list, spawn and supervise `qvac serve` on a free port, and tear it down on host exit. External mode is unchanged and stays synchronous; the managed supervisor is lazily dynamic-imported so external-mode users pay no startup cost. @qvac/cli becomes an optional peer dependency. * fix: resolve @qvac/cli via main entry when its exports block package.json (QVAC-19900) The published @qvac/cli ships a string `exports` field ("./dist/index.js"), which makes the `./package.json` subpath non-resolvable (ERR_PACKAGE_PATH_NOT_EXPORTED). Managed mode relied on resolving `@qvac/cli/package.json` to locate the bin, so it would fail to find the CLI on a clean install. Fall back to resolving the package main entry, which for @qvac/cli is the same file as the `qvac` bin. * doc: update ai-sdk provider agent setup after queue (QVAC-19900) * QVAC-19900 feat[api]: per-model config for managed mode Managed mode `models` now accepts spec objects ({ name, config, preload, default }) alongside bare constant names, so callers can set per-model serve options — notably `ctx_size` and `reasoning_budget` — that coding agents like OpenCode require. The synthesized qvac.config.json carries the config block, honors explicit `preload`/`default`, and validates names inside spec objects. Exports the new `QvacManagedModel` type and documents per-model config plus a managed-mode OpenCode example in the README. * QVAC-19900 feat[api]: shared idle-reaped managed serve daemon Rework managed mode from a per-provider supervisor into a shared, self-cleaning serve daemon so it is robust standalone and usable by any tool, not just a single session. - Reuse via a fleet key (model set + per-model config + host) keyed in a cross-process registry under ~/.qvac/managed-serves/; createQvac attaches to a matching healthy serve instead of cold-starting a duplicate. - A detached runner owns the qvac serve child and reaps it once no consumer process has been alive for serveIdleTimeout (default 5m). Liveness, not request traffic, is the signal, so it works for tools that hit baseURL directly (OpenCode/Cline/Aider). - close() now detaches (deregisters the consumer) instead of killing; a shared serve survives until its last user is gone. - Sweep only reaps dead/orphaned serves, never a healthy serve a live process owns (fixes a second session SIGKILLing a downloading serve). - Respawn-on-failure: fetch re-resolves and retries once on ECONNREFUSED. - reuse:false (or a pinned servePort) yields a private serve reaped as soon as its owner exits. Refactor into serve-process.ts (spawn/health/stop), registry.ts, fleet-key.ts, runner.ts; remove supervisor.ts and pid-tracker.ts. Add reuse and serveIdleTimeout options. Rewrite tests and add reuse/idle-reap end-to-end coverage; document the shared lifecycle in the README. * QVAC-19900 fix: reject duplicate model names in managed mode Each managed model maps to a single serve alias keyed by its name, so a repeated name silently overwrote the earlier entry — and could drop its `default: true`. Reject duplicates up front with DuplicateManagedModelError instead of resolving them ambiguously. Addresses PR review feedback. * QVAC-19900 fix[api]: address managed-mode self-review findings - Per-instance consumer markers (<pid>.<rand>) so two providers in one process sharing a fleet key don't deregister each other on close (A). - Restrict respawn retry to ECONNREFUSED so an in-flight completion is never blindly replayed on ECONNRESET/EPIPE (C). - Health-check the recorded baseURL before SIGTERM-ing an orphaned serve, guarding against killing a recycled pid (D). - Use dirname() instead of a posix-only regex for ephemeral config cleanup (E). - Fold serveBinPath into the fleet key so distinct local builds don't share a serve (G). - Export managed error classes + QvacManagedErrorCode for instanceof checks (H). - Reject more than one explicit default: true (I). - Deregister the consumer if resolveServe throws (F); drop dead firstConsumerPid runner param (J). Tests: per-instance markers, health-gated orphan sweep (kills serving orphan, spares non-serving stranger pid), fleet-key serveBinPath sensitivity, multiple-default rejection. README updated. * QVAC-19900 fix[api]: address managed-mode lifecycle review (round 2) Lifecycle correctness: - Spawn lock: steal only when the owner pid is dead (with an mtime fallback for an unreadable lock), so a legitimate multi-minute cold start no longer loses its lock after 30s and spawns a duplicate runner/serve (#1). - close(): the fetch path now bails out instead of re-resolving once closed, so a request racing close() can't silently re-add a consumer / spawn a runner (#3). - sweepServes: when an orphaned serve's pid is alive but its health check fails, keep the record instead of dropping it — dropping stranded a live serve with no registry trace. We only reap once it answers as ours, or drop once its pid dies (#4). - servePort: fold a pinned port into the fleet key so pinned-port callers don't reuse an auto-allocated serve on a different port, and distinct pins don't collide (#5). - Respawn: expose baseURL/port/pid as getters over live state, updated on every reconnect, so diagnostics/external clients see the real serve after recovery (#6). - retargetUrl now handles Request inputs (not just string/URL) so a respawn stays transparent if the SDK ever switches input shapes (#8). Docs: - README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend liveness; document the long-lived-sentinel/wrapper pattern and fix the misleading "the script doesn't have to stay running" note (#2). - Reconcile version wording: README/changelog now describe managed mode as unreleased (package is 0.1.0); docs-site integration page documents managed mode + the async overload (#7). Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the runner-dead + serve-alive + health-failing sweep case. Build + suite green (60 pass / 1 integration skip). * docs: use canonical qvac.tether.io URL in ai-sdk-provider README * QVAC-19900 feat[api]: public model catalog + catalog-id aliases in managed mode Add `models.qvacCatalog`, a public models.dev-style catalog that maps friendly ids (`qwen3.5-9b`) to the SDK constant the serve loads (`QWEN3_5_9B_MULTIMODAL_Q4_K_M`), so the id a user picks from models.dev resolves end-to-end with no translation layer in front of the serve. Managed mode now accepts catalog ids as model names: the synthesized serve config keys the alias by the friendly id while `model` resolves to the underlying SDK constant, so the serve answers `qwen3.5-9b` directly. Bare SDK constants keep working unchanged. A drift unit test fails CI if any catalog constant disappears from the generated SDK catalog. * QVAC-19900 feat[api]: process-group serve teardown + closeOnParentExit Harden managed-mode lifecycle so a managed serve never leaks its `bare` inference worker or outlives the process that owns it. - Process-group teardown: spawn `qvac serve` detached (its own group) and, when stopServe must escalate past the grace window, SIGKILL the whole group. A plain SIGKILL of the serve pid never cascades to the grandchild bare worker, so previously a wedged serve orphaned the worker. The graceful SIGTERM is still sent to the serve process only, so a healthy serve orchestrates its own shutdown and releases the global worker lock (no stale lock left behind); the group SIGKILL is the wedged-path fallback. - `closeOnParentExit` option: for a daemon-style host whose sole job is to keep a managed serve alive for a parent process (e.g. an editor/agent plugin). The provider watches its parent pid and, the moment the parent exits (on POSIX we are reparented to init, ppid → 1), closes itself — deregistering the consumer so the runner reaps the serve — and exits. Without it a hard-killed parent would leave a reparented host alive, keeping its consumer marker forever so the serve was never reaped. Tests: a stubborn-grandchild fake serve proves group teardown reaps the worker; `parentIsGone` unit-tests the parent-watch decision. * QVAC-19900 fix: keep managed serve lifecycle correct under close() race and crash-respawn - Undo the consumer re-registration when close() wins the race against an in-flight fetch retry: resolveServe re-adds the marker after close() removed it, which would keep the shared serve warm until the process exits. - Preserve live consumer markers when sweepServes reaps a crashed/orphaned serve, so a respawned runner inherits the still-alive sessions instead of idle-reaping the fresh serve out from under them. - docs: bump managed-mode ctx_size examples to 32768 for agent-sized prompts. * QVAC-19900 fix: rename reresolve result to resolved for clarity in managed fetch * QVAC-19900 mod: collapse redundant sync/async registry teardown helpers removeConsumer/removeConsumerSync and removeRecord/removeRecordSync were a confusing sync/async mirror: the async removeConsumer was only ever called right after the sync one (a guaranteed no-op), and the removeRecord pair was really two teardown semantics under near-identical names. Marker/record teardown is a single unlink/rm, cheap enough to be synchronous everywhere — including process 'exit' handlers where async can't run — so collapse each pair into one sync function. No behaviour change; addresses review feedback on #2408. * QVAC-19900 mod: trim verbose comments in managed registry Tighten the sync-rationale comments on removeRecord/removeConsumer and drop a stale, broken leftover comment above ensureDirSync. Keeps the non-obvious intent (why sync, preserveConsumers semantics) without the narration. * QVAC-19900 mod: drop unused DEFAULT_SERVE_BIN and ephemeralConfigName Both were dead: DEFAULT_SERVE_BIN was never imported (serve-process spawns the resolved CLI path verbatim) and ephemeralConfigName was an unused helper (writeEphemeralConfig uses a fixed name inside an mkdtemp dir). Removing the latter also drops the now-unused randomBytes import.

Proletter added 2 commits January 8, 2026 07:33

add public reusable for qvac-cli

d5ac02c

commented out npm publish composite action

d7a7e44

Proletter merged commit d00672a into main Jan 8, 2026

Proletter deleted the qvac-cli-integration branch January 8, 2026 15:44

maxim-smotrov added a commit that referenced this pull request Feb 28, 2026

Merge pull request #614 from maxim-smotrov/fix/benchmark-pipeline-errors

3807b27

Try #3. Limit device to only cpu or gpu.

sharmaraju352 mentioned this pull request Mar 23, 2026

fix[notask]: harden benchmark servers against RCE, path traversal, injection, DoS, and CSRF #1086

Merged

maxim-smotrov added a commit that referenced this pull request Apr 10, 2026

Changing generated images dir on iOS #3

0231828

maxim-smotrov added a commit that referenced this pull request Apr 10, 2026

Revert "Changing generated images dir on iOS #3"

fd80e02

This reverts commit 0231828.

gianni-cor mentioned this pull request May 5, 2026

QVAC-17990 Add standalone ESRGAN upscaler API #1901

Merged

simon-iribarren mentioned this pull request May 13, 2026

QVAC-18182 feat[api]: typed cancel outcomes on the wire + atomic KV-cache via KvCacheSession #2007

Merged

simon-iribarren mentioned this pull request May 14, 2026

QVAC-18183 feat[api]: cancel capability + per-handler cancel scope + structured logging #2036

Merged

Victor-Rodzko mentioned this pull request May 14, 2026

QVAC-18935 test[skiplog]: retry on RequestRejectedByPolicyError in logging tests #2065

Merged

ogad-tether mentioned this pull request Jun 1, 2026

QVAC-18417: multilingual (MeCab/Cangjie) preprocessing wiring for tts-ggml Chatterbox #2347

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qvac cli integration#3

Qvac cli integration#3
Proletter merged 2 commits into
mainfrom
qvac-cli-integration

Proletter commented Jan 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Proletter commented Jan 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant