Skip to content

Qvac cli integration#3

Merged
Proletter merged 2 commits into
mainfrom
qvac-cli-integration
Jan 8, 2026
Merged

Qvac cli integration#3
Proletter merged 2 commits into
mainfrom
qvac-cli-integration

Conversation

@Proletter

Copy link
Copy Markdown
Collaborator

No description provided.

@Proletter Proletter merged commit d00672a into main Jan 8, 2026
@Proletter Proletter deleted the qvac-cli-integration branch January 8, 2026 15:44
maxim-smotrov added a commit that referenced this pull request Feb 28, 2026
gianni-cor added a commit that referenced this pull request Mar 5, 2026
* Try #1. Adding tokenizer proxy to provide vocab size.

* Try #2. More fixes and logs.

* Try #3. Limit device to only cpu or gpu.

* Revert "Try #2. More fixes and logs."

This reverts commit a461e69.

* Revert "Try #1. Adding tokenizer proxy to provide vocab size."

This reverts commit 9951195.

* Fixing pipeline logging

* Add more logs

* Fixing bench logging

* Add more error handling and logging

* Improve error handling on the server. Added retry in case of context overflow.

* Make retries self-adjustable

* Adding some more checks and limiting the datasets temporarily

* Test: trying to narrow down the error

* Exclude failing datasets from embed benchmark

* Clean up the code

* Changing bench model for LLM

* Try #1. Adding tokenizer proxy to provide vocab size.

* Try #2. More fixes and logs.

* Try #3. Limit device to only cpu or gpu.

* Revert "Try #2. More fixes and logs."

This reverts commit a461e69.

* Revert "Try #1. Adding tokenizer proxy to provide vocab size."

This reverts commit 9951195.

* Fixing pipeline logging

* Add more logs

* Fixing bench logging

* Add more error handling and logging

* Improve error handling on the server. Added retry in case of context overflow.

* Make retries self-adjustable

* Adding some more checks and limiting the datasets temporarily

* Test: trying to narrow down the error

* Exclude failing datasets from embed benchmark

* Clean up the code

* Changing bench model for LLM

* Minor fixes for clarity

* Removing unused vars

* Removing unused imports

* Removing unused python deps

---------

Co-authored-by: gianni <gianfranco.cordella@tether.io>
maxim-smotrov added a commit that referenced this pull request Apr 10, 2026
maxim-smotrov added a commit that referenced this pull request Apr 10, 2026
ishanvohra2 pushed a commit that referenced this pull request Apr 24, 2026
Polish the remaining review nits on the TTS client streaming surface.

- #3 TtsMulticast.pump now rejects the `done` promise with the fatal
  error instead of resolving `false`. An internal `.catch(() => {})`
  silences unhandled-rejection warnings when the caller only iterates
  the buffer/chunk streams and never awaits `done`; re-awaits still
  see the rejection.
- #6 TextToSpeechStreamSession[Symbol.asyncIterator] no longer throws
  synchronously on a second iteration; it returns an iterator whose
  first `.next()` rejects, so `for await` surfaces the error in the
  normal async control flow rather than the iterator protocol.
- #9 plainTtsBufferStream / collectTtsBuffer wrap the RPC loop in
  try/catch/finally so `done` always settles: resolve(true) on the
  terminal frame, reject with the real error on exceptions, and
  resolve(false) on early consumer break. Previously `await done`
  could hang forever when the consumer bailed out early.
- #11 Skip per-frame ttsResponseSchema.parse() in all three paths;
  rely on the discriminated-union narrowing at the RPC boundary.
  Drops the per-PCM-frame Zod validation cost for large sentences.

Made-with: Cursor
ishanvohra2 added a commit that referenced this pull request Apr 27, 2026
…peech (#1590)

* feat: Add runStream() which takes input as a stream

* add integration tests

* uncomment cb tests

* chore: Add cb streaming example

* feat: Add TTS streaming funcitonality and example

* Update tts addon version

* Remove chatterbox example

* add new error code for tts streaming fail

* Move common code to util

* fix: Use z.infer to define TextToSpeechStreamClientParams

* Move TextToSpeechStreamSession to schemas

* Track subscriber current index and trim queue when all subscribers consumed past items

* add missing unit tests

* fix: drive done promise from multicast pump lifecycle

* fix: Forward chunkIndex and sentenceChunk in sentence-stream mode to client

* fix: Use correct error code for tts stream failure

* chore: Add supertonic stream test in tts-tests.ts

* fix: Make tts client more readable

* Remove closures and inline async generators

* fix: Subscribe eagerly in sentenceStreamTts to avoid late-subscriber data loss

TtsMulticast.pump() starts in a microtask on construction, while the
returned async generators only call subscribe() when first iterated. If
the consumer iterated one generator before the other, the first
subscriber could trim the queue before the second ever registered,
silently dropping earlier frames.

Subscribe synchronously for both bufferStream and chunkUpdates before
returning, so both subscriber indexes are in place before pump pushes
its first item.

Made-with: Cursor

* fix: Close TTS stream on server-sent done frame

Remove the dead `null` sentinel from `processTextToSpeechStreamLine`
and instead close `parseTextToSpeechStreamLines` after yielding the
terminal `done: true` frame, so consumers don't rely on the server
closing the socket to stop iteration.

Made-with: Cursor

* fix: Reject sentenceStream without stream in textToSpeech

Previously `sentenceStream: true` combined with `stream: false` fell
through to the collect path, silently dropping the sentence-stream
parameters and returning no `chunkUpdates`. Fail fast at the
dispatcher with a clear error so the contract mismatch surfaces to
the caller instead of being swallowed.

Made-with: Cursor

* fix: Release TtsMulticast subscriber slot on early break

Wire a try/finally into drain() so that when a consumer breaks out of
the for-await (or the generator is .return()'d / throws), the slot is
parked at +Infinity via unsubscribe(). This prevents a stale low
min-index from permanently pinning trimConsumed, which otherwise leaked
the queue for the entire RPC stream.

Made-with: Cursor

* fix: Guard TTS stream write after close and preserve UTF-8 boundaries

Client:
- Track a `closed` flag in `textToSpeechStream` duplex session, set by
  `end()` / `destroy()`. Subsequent `write()` calls now throw a typed
  `TextToSpeechStreamFailedError` instead of propagating a raw Bare/Node
  "write after end" stream error.
- `end()` is idempotent so accidental double-close no longer errors.

Server:
- `buffersToUtf8Fragments` previously decoded each incoming Buffer via
  `toString("utf8")`, which corrupts any multi-byte codepoint whose bytes
  straddle a chunk boundary (common with CJK / emoji / accented scripts
  emitted as LLM token deltas). Added a small tail-buffer that finds the
  last complete UTF-8 codepoint end in the combined buffer and defers
  trailing incomplete bytes to the next chunk. Any dangling partial
  sequence is flushed on stream end.

Made-with: Cursor

* fix: Order TEXT_TO_SPEECH_STREAM_FAILED code and document it

- Move TEXT_TO_SPEECH_STREAM_FAILED (52415) to the end of the 52400
  Model Operations block so the ordering in SDK_SERVER_ERROR_CODES
  matches the numeric sequence (…52413, 52414, 52415).
- Add the missing row for 52415 to the (latest) errors.mdx table, per
  the sdk/docs-freshness rule that the error table stay in sync
  whenever a new code is introduced.

Made-with: Cursor

* fix: Register operation metrics for textToSpeechStream

Only `textToSpeech` was registered in `operation-metrics.ts`, so the
duplex `textToSpeechStream` path silently skipped `modelExecutionTime`,
`audioDuration`, and `totalSamples` gauges even though the server
already collects the same `TtsStats` via `collectTtsStats()` on the
final chunk. Mirror the non-streaming registration so the streaming
path has parity observability.

Made-with: Cursor

* fix: Harden TTS client done-promise, iterator, and parse cost

Polish the remaining review nits on the TTS client streaming surface.

- #3 TtsMulticast.pump now rejects the `done` promise with the fatal
  error instead of resolving `false`. An internal `.catch(() => {})`
  silences unhandled-rejection warnings when the caller only iterates
  the buffer/chunk streams and never awaits `done`; re-awaits still
  see the rejection.
- #6 TextToSpeechStreamSession[Symbol.asyncIterator] no longer throws
  synchronously on a second iteration; it returns an iterator whose
  first `.next()` rejects, so `for await` surfaces the error in the
  normal async control flow rather than the iterator protocol.
- #9 plainTtsBufferStream / collectTtsBuffer wrap the RPC loop in
  try/catch/finally so `done` always settles: resolve(true) on the
  terminal frame, reject with the real error on exceptions, and
  resolve(false) on early consumer break. Previously `await done`
  could hang forever when the consumer bailed out early.
- #11 Skip per-frame ttsResponseSchema.parse() in all three paths;
  rely on the discriminated-union narrowing at the RPC boundary.
  Drops the per-PCM-frame Zod validation cost for large sentences.

Made-with: Cursor

* fix: Tighten textToSpeechStream schema surface

- Add .positive() to maxBufferScalars and flushAfterMs to match the
  existing constraint on sentenceStreamMaxChunkScalars. Previously a
  caller could pass negative values straight through to the addon.
- Un-export textToSpeechStreamRequestBaseSchema — consumers only need
  the finalized textToSpeechStreamRequestSchema, and the base is an
  implementation detail of the shared object shape. The exported type
  alias TextToSpeechStreamClientParams continues to derive from the
  base via `typeof`, so nothing on the public type surface changes.

Made-with: Cursor

* fix: Cross-platform tmp path and safer PCM append in TTS examples

- playPcmInt16Chunk now writes the intermediate WAV chunk under
  os.tmpdir() / path.join instead of a hard-coded /tmp/qvac-tts-chunk-…
  path. The previous code's Windows branch was unreachable in practice
  because the POSIX /tmp directory doesn't exist there; this uses
  %TEMP% on Windows automatically.
- appendPcmSamples switches from `target.push(...chunk.slice(i, end))`
  to `Array.prototype.push.apply(target, chunk.slice(i, end))`. Same
  semantics, but avoids allocating the spread rest array per batch
  and is closer to a memcpy-style concat in V8.

Made-with: Cursor

* fix: Catch zero-chunk regressions in TTS sentence-stream test

- TtsExecutor.makeSentenceStream now returns `{ passed: false, ... }`
  when the chunkUpdates iterator yields no chunks / no samples. The
  previous executor always returned a formatted string regardless of
  counts, so a regression that silently emitted zero chunks would
  still have looked like a pass.
- ttsSupertonicSentenceStream's expectation upgraded from
  `{ validation: "type", expectedType: "string" }` to
  `{ validation: "contains-all", contains: ["sentence-streamed",
  "chunks", "samples"] }`. The executor's zero-case failure string
  lacks "sentence-streamed", so the contains-all match fails on
  regression.

Made-with: Cursor

* fix: Apply stream default locally and throw typed error on tts mismatch

Previous guard only rejected the explicit `stream: false + sentenceStream:
true` combination. A caller passing `{ modelId, text, sentenceStream: true }`
with `stream` omitted silently fell through to `collectTts` while the
server's Zod `.default(true)` still ran the sentence-stream branch and
emitted chunk frames — which the client then discarded, dropping all
chunk metadata.

- Resolve the `stream` default locally (`params.stream ?? true`) so the
  client's dispatch routing matches the server's Zod-applied routing,
  and an omitted `stream` now correctly lands in `sentenceStreamTts` or
  `plainStreamTts`.
- Only the explicit `sentenceStream: true + stream: false` combination
  is rejected, and it now throws `TextToSpeechStreamFailedError` (code
  52415) instead of a bare `new Error(...)` so callers can discriminate
  by error code like everywhere else in the SDK.

Made-with: Cursor

* remove inline defaults for sentenceStream and stream

* Use TtsMulticast in unit test instead of mock

---------

Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
GustavoA1604 added a commit that referenced this pull request May 7, 2026
Bundle of correctness, hygiene, and CI-doc fixes from the recent code
review.  Each item below has its own paragraph in the diff comments.

- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js
  to package.json so consumers running the integration tests from the
  npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to
  devDependencies (it's only referenced from examples/, which aren't in
  the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming
  detection used to read engine_->options() outside engineMu_, racing
  with reload().  synthesize() now returns SynthesizeResult { pcm,
  wasStreaming } where wasStreaming is captured under the engine lock
  against the local shared_ptr so process() doesn't have to touch
  engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors
  used to call load() eagerly, so JsInterface::createInstance() (sync
  on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop.
  Both models now implement IModelAsyncLoad: constructors validate +
  return; the actual load is deferred to waitForLoadInitialization(),
  which the new addon_js::activate wraps inside JsAsyncTask::run so the
  parse runs on a worker thread.  binding.cpp registers
  addon_js::activate in place of JsInterface::activate; tts.js now
  awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj
  read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE /
  FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-
  not-thrown so future maintainers don't delete them blindly (the unit
  suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_
  reset pattern: cancel() sets it, synthesize() fast-fails on it,
  process() resets it per call so a stale cancel doesn't poison the
  next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that
  the JS layer is the source of truth for useGPU and nGpuLayers wins
  downstream; left a pointer to std::optional<bool> if a future caller
  ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no
  longer point at GustavoA1604/chatterbox.cpp; both reference the
  upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment
  on the build-and-test job documenting that continue-on-error is the
  early-days landing posture (merge-guard treats success || skipped as
  pass), with a pointer to tighten once Device Farm provisioning is
  stable.

Nits:
- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on
  scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer JSON.stringify's the full PCM
  Int16Array on every chunk-streaming event; emits a tiny summary
  ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen})
  instead.

Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load
contract); 4 skipped real-GGUF tests behind the existing
QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF /
QVAC_TEST_SUPERTONIC_GGUF env-var gates.  Lint clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
DmitryMalishev added a commit that referenced this pull request May 11, 2026
…ence-addon-cpp to 1.1.7#1

Per request from PR #1784 reviewers — pulls in the latest published
versions of both core C++ dependencies. The new ports are breaking
changes in two ways that require parallel source updates:

  1. The `qvac-lib-inference-addon-cpp` port renamed its top-level
     include namespace from `qvac-lib-inference-addon-cpp/*.hpp` to
     `inference-addon-cpp/*.hpp`. No back-compat shim ships in 1.1.7#1,
     so all 12 `#include` lines across 4 files migrate:
       - addon/src/utils/LoggingMacros.hpp
       - addon/src/js-interface/binding.cpp
       - addon/src/addon/AddonJs.hpp     (8 includes)
       - addon/src/addon/AddonCpp.hpp    (2 includes)
     CMakeLists.txt's find_path probe also migrated to the new namespace.

  2. The `qvac-lint-cpp` port (auto-bumped 1.4.4#2 -> #3 by the new
     baseline) relocated its config files from `share/qvac-lint-cpp/`
     to `share/lint-cpp/`. CMakeLists.txt's three `find_path` /
     `configure_file` paths migrated to the new location.

The vcpkg port names themselves are unchanged in vcpkg.json (the
registry entries are still `qvac-lib-inference-addon-cpp` and
`qvac-lint-cpp`); only the installed file layout moved.

Local validation (Win32 MSVC):
  bare-make generate               OK (re-resolved deps; fabric compiled
                                   from source ~32 min — local binary
                                   cache had no entry for 8189.0.2 yet)
  bare-make build                  OK (1 pre-existing fopen deprecation
                                   warning, unrelated to this change)
  bare-make install                OK
  npm run test:cpp                 7/7 GoogleTest pass
  npm run test:integration         5/5 tests, 79/79 asserts pass
  Quality vs PyTorch (unchanged):
    fixed/auto: max|Δ|=0.0008 cos=1.0000
    fixed/cpu:  max|Δ|=0.0021 cos=1.0000
    real/auto:  max|Δ|=0.1096 cos=0.9998
    real/cpu:   max|Δ|=0.1083 cos=0.9998

Note: VLA is intentionally not in the `verify-qvac-fabric-lockstep`
action's allowlist (which checks llamacpp-llm, llamacpp-embed, and
nmtcpp — all still pinned to 7248.2.3). So bumping VLA alone does not
break that gate; the LLM addons can be migrated separately when
their owners are ready.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 added a commit that referenced this pull request May 11, 2026
…#1983)

* feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp)

New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed
by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg).  API-compatible
with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream
consumers can swap backends without touching orchestration code.

## Scope

* First iteration.  Supports Chatterbox **English** only.  Chatterbox
  multilingual, LavaSR enhancer, Supertonic engine, and streaming are
  out of scope and remain in `@qvac/tts-onnx`.  They'll land alongside
  the evolution of qvac-tts.cpp.
* Native backend is the static `qvac-tts` library from the QVAC vcpkg
  registry (`ports/tts-cpp`, baseline `2026-04-21`).  No ONNX Runtime
  dependency.

## JS surface

* `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as
  `ONNXTTS`:  `run` / `runStream` / `runStreaming` / `reload` /
  `unload` / `destroy`.
* `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` +
  `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` /
  `files.s3genModel` override the defaults.
* Options: `referenceAudio`, `voiceDir` (baked profile), `seed`,
  `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for
  the upcoming streaming flags (`streamChunkTokens`,
  `streamFirstChunkTokens`, `cfmSteps`).
* Shared reusable lib code (`lib/textChunker.js`,
  `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim
  from `@qvac/tts-onnx`.
* New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000**
  to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both
  packages are loaded in the same Bare process.

## Native addon

* `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` —
  `IModel` + `IModelCancel` implementation.  First-iteration strategy:
  assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output
  path, call it synchronously, then parse the resulting 16-bit mono
  PCM wav back into `std::vector<int16_t>` for the JS handler.
  Consequences: every job re-loads the model (~700 ms + inference
  time), no mid-synthesis cancellation, no streaming.  The follow-up
  milestone replaces this with a persistent, struct-based API once
  qvac-tts.cpp exposes one.
* `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` — JS-to-C++
  config bridging (same string-map pattern as `@qvac/tts-onnx`) and the
  `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing
  `createInstance` / `runJob` / `reload` / `activate` / `cancel` /
  `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`.
* `addon/src/addon/AddonJs.hpp` — JS-facing `createInstance` / `runJob`
  / `reload` wrappers that register a `JsAudioOutputHandler` emitting
  `{ outputArray: Int16Array, sampleRate: number }` to JS.

## Build / registry

* `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)`
  and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape
  matches `@qvac/transcription-whispercpp`).
* `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough)
  plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`.
* `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg.
  NOTE: the baseline pin here is inherited from
  `@qvac/transcription-whispercpp` and **must be bumped** to a commit
  that contains the `tts-cpp` port once that registry PR lands.  A
  follow-up commit will update it.

## Tests & examples

* Integration + unit test files for Chatterbox English are copied
  verbatim from `@qvac/tts-onnx` with only mechanical renames
  (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`,
  `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`).  Some
  paths in `test/integration/addon.test.js` still import Supertonic /
  LavaSR helpers that don't exist in this package — those test blocks
  will fail fast when the file loads, which is expected until those
  backends get their own ggml packages.
* Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus
  shared `wav-helper.js` + `pcm-chunk-player.js`.

## What's not in this PR (known gaps)

* No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes
  will land in a single documentation pass once the registry + fork
  commits have merged upstream.
* `vcpkg-configuration.json` baseline needs to point at a
  qvac-registry-vcpkg commit that ships `tts-cpp` (pending the
  registry PR).
* Actual `npm run build` requires the registry and fork commits to be
  on `main` of their respective upstream repos.

* chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit

Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg
at commit 1e2839680b6be8d8ffff889a9c29b966c176098c — the commit that
adds the `tts-cpp` port.  Paired with the `qvac-tts` library already
pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp
@ 0fe4a521618cc30358040b29d75d4261b31cbb60).

Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry
PR lands upstream.

* chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper

Second pass over @qvac/tts-ggml after the build started passing: prune
everything that only made sense for the ONNX-era multi-engine scope and
adapt the remaining Chatterbox-English bits to the GGUF + file-path
reference-audio contract.  Restores `test/mobile/` so the Android build
has something to point at.

## C++

* `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment
  contained `**/` which closed the block comment early and broke the
  build.  Rewrote as a `//` comment.

## Examples

* `examples/chatterbox-tts.js` — rewrite for v0 contract: single
  `<text>` argv, `files: { modelDir }` pointing at the two GGUFs,
  `referenceAudio` is now a wav **path** (addon passes it to
  `--reference-audio`) instead of a Float32Array.  Drops
  english/multilingual arg and the CHATTERBOX_VARIANT switch that
  picked which `.onnx` files to load.
* Removed `examples/chatterbox-streaming-tts.js` +
  `examples/pcm-chunk-player.js`.  The v0 addon re-loads the model
  per `run()` call — exposing streaming would mislead.  Both come
  back alongside the persistent-engine milestone.
* `package.json`: `npm run example` now passes a default text so it
  runs without extra args.

## Tests

### Kept as-is (engine-agnostic)

* `test/unit/textChunker.test.js`
* `test/mock/{MockedBinding,utils}.js`
* `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js`
* `test/reference-audio/jfk.wav`, `test/data/sentences-*.js`

### Mechanical fixes

* `test/unit/tts.error.test.js` — fix error-code assertions to the
  tts-ggml range (`13001–14000`); was still checking the
  `@qvac/tts-onnx` range (`7001–7011`).
* `test/unit/tts-ggml.lifecycle.test.js` — fix stale
  `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the
  stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the
  non-existent `engine: 'chatterbox'` option.
* `test/unit/tts-ggml.sentence-stream.test.js` — same GGUF/engine
  cleanup.

### Rewritten

* `test/unit/chatterbox.inference.test.js` — drop tests that asserted
  the old ONNX file shape (`tokenizer / speechEncoder / embedTokens /
  conditionalDecoder / languageModel`), the removed `engine` detection
  and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`).
  New tests cover: `modelDir` derives the two GGUF paths; explicit
  `t3Model` / `s3genModel` override the defaults.  The mocked-binding
  run/reload/cancel flow stays.
* `test/integration/addon.test.js` — fresh, ~180 LoC, Chatterbox-English
  only.  Ensures the GGUFs are present, runs the short sentence set
  through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and
  (on darwin only) runs a whisper-based WER check via the existing
  `runWhisper` util.  Drops the Chatterbox-multilingual block + every
  Supertonic + LavaSR block that doesn't apply to this package.
* `test/utils/runChatterboxTTS.js` — rewrite for the GGUF contract:
  `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a
  file path that falls back to `test/reference-audio/jfk.wav` (or the
  mobile test-asset when `global.assetPaths` is present).  No more
  WAV decode / resample on the JS side.
* `test/utils/downloadModel.js` — trim from 1007 LoC to 280.  Drops
  the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie
  downloaders.  Keeps the shared HTTP/curl infrastructure and
  `ensureWhisperModel` (still used by the integration WER check).
  `ensureChatterboxModels` is now **check-only**: it verifies
  `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally
  and, if missing, prints the exact commands for generating them
  from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts.
  Once the GGUFs land on a canonical HuggingFace repo we'll wire up
  download URLs here.

## Scripts

* `scripts/ensure-chatterbox.js` — simplify to a single invocation
  against `./models/`.  Drops the variant / language matrix that the
  ONNX downloader needed.
* `scripts/ensure-models.js` — now a thin alias to
  `ensure-chatterbox.js`.  Drops the Supertonic + LavaSR orchestration.

## Mobile

* Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs,
  testAssets/jfk.wav}` so the Android build has a wrapper to point at.
* `package.json`: re-added `test/mobile` to the `files` list.

## Gitignore

* Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp`
  (produced by the top-level `configure_file(...)` calls) and
  `build_*/` dirs (bare-make convention).

## Verified locally

* `npx standard "test/**/*.js" "*.js" "lib/*.js"` — clean.
* `npm run test:unit` — 38/38 pass (105/105 asserts).
* `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."`
  produces a 24 kHz wav as expected.

* Add streaming support

* Update ggml backend to use separate ggml repo

* tts-ggml: consume renamed tts-cpp library (2026-04-24#1)

Upstream chatterbox.cpp renamed the package + namespace + target from
qvac-tts to tts-cpp and tightened the library boundary; pick up the
new artefacts here:

- find_package(qvac-tts-cpp CONFIG REQUIRED)
    -> find_package(tts-cpp CONFIG REQUIRED)
- qvac-tts::qvac-tts  -> tts-cpp::tts-cpp
- qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions,
  SynthesisResult, forward-decls in ChatterboxModel.hpp)
- #include <qvac-tts/chatterbox/engine.h>
    -> #include <tts-cpp/chatterbox/engine.h>
- Doxygen / inline doc references to the old names refreshed alongside
  the code changes.

vcpkg wiring:
- vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg
  commit bc30b0b (ports/tts-cpp renamed and repointed at
  chatterbox.cpp@f8f9145).
- vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that
  carries the rename + namespace + install(EXPORT) changes).

Verified with a cold bare-make generate + bare-make build against the
new port, and the addon's existing unit + integration test suites.

Made-with: Cursor

* tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline

Picks up the round-3 review-fix wave landed on the tts-cpp port:

  e673182  scrub stale patches/ refs from README                (N10)
  8ba10a6  drop unreachable TTS_CPP_GGML_LIB_PREFIX block        (N8)
  4b5d2d7  mirror N1-N7 fixes from chatterbox.cpp source-of-truth
            - N1 supertonic alive-registry guard against freed-backend
              gallocr_free assert on hot-swap (Vulkan/Metal/CUDA)
            - N2 drop dead g_sink_* state, soften log_set docstring
            - N3 Turbo BPE try/catch (exception-safe Engine ctor)
            - N4 STFT cancel checkpoint + tighter Engine::cancel() doc
            - N5 document s3gen_preload/unload refcount semantics
            - N6 drop dead cached_text_lc Supertonic shim
            - N7 fix misleading "no copy" view-vs-copy log wording

Plus the integrated-port-only round-2 fixes that landed earlier:

  fa0d490  close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML
            now defaults ON; bundled-without-patches hard-errors at
            configure time with a pointer at the ggml-speech vcpkg
            port.
  ae34c58  README rewritten for integrated/vcpkg context.
  a2f2dd6  top-level qvac-ext-lib-whisper.cpp README points at the
            tts-cpp/ subtree (alongside parakeet-cpp/).

Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine /
EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is
backward-compatible: the new port adds Engine::backend_name(),
MTL-variant fields on EngineOptions (language / cfg_weight / min_p /
exaggeration), and a separate tts_cpp::supertonic::Engine class, but
nothing this consumer was already calling has changed.

Edits:

  packages/tts-ggml/vcpkg.json
    - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07.

  packages/tts-ggml/vcpkg-configuration.json
    - default-registry baseline: bc30b0b (April 2026 fork-only state)
      -> 16b91afdcfd59baea60e81f3da94f49311ef2a97.  The new baseline
      pulls in the post-tetherto-merge state (parakeet-cpp port at
      932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new
      tts-cpp port (16b91af) on the developer's GustavoA1604
      registry fork.

Smoke-test plan: after running `vcpkg install` against the new
baseline, the tts-cpp port's vcpkg_from_github resolves at
GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the
upstream PR merges.  ChatterboxModel should build and synthesize
identically; expanding to Multilingual + Supertonic flows is the
follow-up commit on the package side.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add chatterbox multilingual and supertonic

* Add mobile integration tests

* tts-ggml: drop clang-19 pin in linux-clang toolchain

The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary
names) since the package's first commit (0a2c978).  Linux CI hadn't
exercised this path before — the new on-pr-tts-ggml.yml -> integration
matrix is the first time it does, and it fails on every linux runner
(ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's
"detect_compiler" step because none of the GH-hosted images ship a
`clang-19` symlink:

  Detecting compiler hash for triplet x64-linux...
  error: while detecting compiler information:
  ...
  CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127
  (message): Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE=
  .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ...

Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/
toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so
each runner picks up its image's default clang (clang-15 on
ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship).
The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake
is honoured by every reasonable clang version.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add C++ tests and coverage; fix linux build

* tts-ggml: address PR review feedback

Bundle of correctness, hygiene, and CI-doc fixes from the recent code
review.  Each item below has its own paragraph in the diff comments.

- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js
  to package.json so consumers running the integration tests from the
  npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to
  devDependencies (it's only referenced from examples/, which aren't in
  the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming
  detection used to read engine_->options() outside engineMu_, racing
  with reload().  synthesize() now returns SynthesizeResult { pcm,
  wasStreaming } where wasStreaming is captured under the engine lock
  against the local shared_ptr so process() doesn't have to touch
  engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors
  used to call load() eagerly, so JsInterface::createInstance() (sync
  on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop.
  Both models now implement IModelAsyncLoad: constructors validate +
  return; the actual load is deferred to waitForLoadInitialization(),
  which the new addon_js::activate wraps inside JsAsyncTask::run so the
  parse runs on a worker thread.  binding.cpp registers
  addon_js::activate in place of JsInterface::activate; tts.js now
  awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj
  read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE /
  FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-
  not-thrown so future maintainers don't delete them blindly (the unit
  suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_
  reset pattern: cancel() sets it, synthesize() fast-fails on it,
  process() resets it per call so a stale cancel doesn't poison the
  next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that
  the JS layer is the source of truth for useGPU and nGpuLayers wins
  downstream; left a pointer to std::optional<bool> if a future caller
  ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no
  longer point at GustavoA1604/chatterbox.cpp; both reference the
  upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment
  on the build-and-test job documenting that continue-on-error is the
  early-days landing posture (merge-guard treats success || skipped as
  pass), with a pointer to tighten once Device Farm provisioning is
  stable.

Nits:
- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on
  scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer JSON.stringify's the full PCM
  Int16Array on every chunk-streaming event; emits a tiny summary
  ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen})
  instead.

Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load
contract); 4 skipped real-GGUF tests behind the existing
QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF /
QVAC_TEST_SUPERTONIC_GGUF env-var gates.  Lint clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: unblock CI integration tests on every desktop runner

Four independent failures, one per platform:

1. linux-x64 / linux-arm64: addon load crashed at
   `libomp.so.5: cannot open shared object file`.  tts-cpp's binary is
   built with clang under the linux-clang toolchain and links against
   libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being
   apt-installed.  Add `libomp5` so libomp.so.5 is on the loader path.

2. darwin-arm64: convert-models.sh aborted at line 200 with
   `hf_args[@]: unbound variable`.  macOS's system bash is 3.2 which
   treats `"${arr[@]}"` as nounset access when the array is empty under
   `set -u`; with HF_TOKEN unset we hit it on every fresh runner.  Use
   the `${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six
   call sites and add a header comment so the next maintainer doesn't
   accidentally regress.

3. darwin-x64: pip install bombed building `llvmlite` from source
   because the macos-15-large runner has no LLVM 15 development
   install.  Root cause: librosa pulls in numba 0.65+, which stopped
   shipping darwin-x86_64 wheels for Python 3.12.  Pin Python to 3.11
   in the Setup Python step; 3.11 has prebuilt wheels for the entire
   numba/llvmlite/librosa stack on darwin-x64 and is fine for every
   other converter dependency.

4. windows-2022: ChatterboxModel::load threw
   `vk::createInstance: ErrorIncompatibleDriver`.  Root cause: the
   addon's index.js::_validateConfig defaults `useGPU = true` when
   neither useGPU nor nGpuLayers is specified, so the test ran with
   n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance ->
   ErrorIncompatibleDriver on the runner's no-Vulkan-driver image.
   runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'`
   (set on the no-GPU matrix entries) and forces useGPU=false on
   exactly those runners; the other test runners (chatterbox-mtl,
   gpu-smoke, multiple-runs) already had this guard.

Also documents the `mesa-vulkan-drivers` apt package (already pulled
in) as the software ICD that lets the Vulkan-built prebuild's runtime
backend probe enumerate at least one device on linux runners.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit)

Mobile build failed at `:app:createBundleReleaseJsAndAssets` with:

  SyntaxError: assets/testAssets/chatterbox-s3gen.gguf:
    Cannot create a string longer than 0x1fffffe8 characters

Root cause: Metro's bundler reads every asset under
`test/mobile/testAssets/` via `Buffer.toString()`.  V8's max string
length is 0x1fffffe8 (~512 MiB).  chatterbox-s3gen.gguf is ~1 GiB even
with --quant q4_0 because the s3gen converter only quantizes attention
weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight
tensors quantized" in the converter log).

Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the
limit) on mobile.  Mobile Chatterbox tests degrade cleanly to
`t.pass('Skipped: Chatterbox GGUFs not available')` via the existing
`ensureChatterboxModels` helper -- it already returns
{ success: false } when the GGUFs aren't on disk.

Cache key bumped to v2 so existing v1 cache entries (which include
the chatterbox files) are evicted on the next run.

Bundling Chatterbox on mobile requires either:
  - adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the
    JS-string read is skipped (then the s3gen file can flow through the
    bundle as a raw asset), or
  - pushing the chatterbox GGUFs to the device via `adb push` outside
    the bundle and surfacing the path through downloadModel.js's
    existing ANDROID_CANDIDATE_DIRS fallback.

Both are outside the scope of this PR; documented inline above the
cache step for the next maintainer.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Bump hash of vcpkg

* Consume vcpkg from tetherto repository

* Fix integration tests failures in all platforms

* Further fix tests

* fix: Make useGPU flag more meaningful (#1953)

* fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts

* add gpu smoke test

* resolve comments

---------

Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

* Update dependencies after monorepo directory changes

* Further drop qvac-lib- prefix

* Add CHANGELOG.md

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com>
Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
simon-iribarren added a commit to simon-iribarren/qvac that referenced this pull request May 13, 2026
…te concurrency

Address non-blocking review nits on PR tetherto#2007:

- aggregate-events: explain why a wire event carrying both error and
  cancelled signals resolves to error (closes brief open question tetherto#3).
- kv-cache-session: doc-comment on deleteKvCacheState explaining the
  ordering guarantee under concurrent in-flight turns -- delete is
  wire-async, in-flight turns roll back idempotently when their commit
  probe finds the file gone (closes brief open question tetherto#4).

Comments only; no behavior changes.
simon-iribarren added a commit that referenced this pull request May 13, 2026
…ache via KvCacheSession (#2007)

* QVAC-18182 feat[api]: typed cancel outcomes on the wire + atomic KV-cache via KvCacheSession

Builds on QVAC-18181's request lifecycle primitives (DisposableScope,
RequestContext, RequestRegistry) to deliver the M2 milestone:

- Typed cancel outcomes: `stopReason: "cancelled"` on `completionDone`
  events, and `InferenceCancelledError(requestId, partial)` thrown from
  CompletionRun promise-aggregates (`final` / `text` / `toolCalls` /
  `stats`). The wire stream still ends normally so iterating
  `run.events` is unaffected — the typed error lives on the aggregate
  promises that callers `await` for the final result.

- KvCacheSession (`server/bare/plugins/llamacpp-completion/ops/
  kv-cache-session.ts`) — single atomic owner of the three KV-cache
  layers (`cachedMessageCounts`, `initializedCaches`, on-disk `.bin`
  files). `beginTurn` / `commitTurn` / `rollback` collapse the three
  duplicated cleanup blocks in `completion-stream.ts` into one
  scope.defer hook. Cross-model administrative deletion lives at the
  module level as `deleteKvCacheState(...)`, called by the RPC
  `handleDeleteCache` handler.

- Stop-button race close — `RequestRegistry` now keeps a bounded
  cancelled-before-begin map (128 entries, 30s TTL). A `cancel({
  requestId })` that lands before the server's `begin(...)` ran is
  applied retroactively when begin lands, so same-tick stop clicks no
  longer disappear into the void. Internal-only — the wire surface for
  `cancel` is unchanged (Option A in the brief).

Cursor rules updated in the same PR so the request-lifecycle and
KV-cache topic docs stay in sync with the implementation.

Tests:
- unit: KvCacheSession (bareTest-gated, runs in the Bare consumer),
  RequestRegistry race + bounded-set eviction, completion-event schema
  cancelled cases.
- e2e: cancellation-tests.ts adds three definitions — mid-stream cancel
  (events.stopReason === "cancelled", final rejects with
  InferenceCancelledError, partial.text matches concatenated
  contentDelta), cancel-before-begin (retroactive abort), and
  cancel-then-resume-kv-cache (rollback wiped the three layers, the
  next turn re-primes cleanly).

* chore: drop planning labels (Mx/Dx) from QVAC-18182 comments

Strips milestone (`M1`/`M2`/`M3a`...) and deliverable (`D2`/`D5`/`D7`)
labels from comments and test titles introduced with the typed-cancel
outcomes + KvCacheSession work. The substantive descriptions of the
contracts (Stop-button race, cancelled-before-begin map, three-layer
session ownership, etc.) are preserved; only the planning-doc
references are removed so the code reads cleanly without the pitch
context. Durable `QVAC-XXXXX` ticket references are kept.

No behavior or API surface changes.

* chore: drop Asana ticket references from QVAC-18182 code comments

Strips QVAC-XXXXX inline ticket references from code/test comments
introduced by the typed-cancel-outcomes work. Concept names
(Stop-button race, cancelled-before-begin, etc.) and prose
descriptions of the contracts are preserved; only the ticket-tag
suffixes go. Also renames a test cache key from
`qvac-18182-cancel-resume-kvcache` to `cancel-then-resume-kvcache` so
the cache key reads as a stable identifier rather than a ticket
reference.

No behavior or API surface changes.

* QVAC-18182 doc: clarify error>cancelled precedence + deleteKvCacheState concurrency

Address non-blocking review nits on PR #2007:

- aggregate-events: explain why a wire event carrying both error and
  cancelled signals resolves to error (closes brief open question #3).
- kv-cache-session: doc-comment on deleteKvCacheState explaining the
  ordering guarantee under concurrent in-flight turns -- delete is
  wire-async, in-flight turns roll back idempotently when their commit
  probe finds the file gone (closes brief open question #4).

Comments only; no behavior changes.

* QVAC-18182 doc: demonstrate typed cancel outcomes in cancel example

Enhance the existing cancel-by-request-id example to demonstrate the
two M2 cancel-outcome channels:

- run.events ends normally with completionDone carrying
  stopReason: "cancelled" -- show reading it inside the iteration loop.
- run.text rejects with InferenceCancelledError(requestId, partial) on
  cancel -- show the instanceof check and consuming partial.text,
  partial.toolCalls, partial.stats.

Also update the header to remove the now-stale "logged as a no-match"
sentence (same-tick cancels are no longer dropped after M2's race
close).

Pure documentation enhancement; no API or behavior changes.

* QVAC-18182 fix: address PR review — partial-prime cleanup + parent-aborted state

Two follow-ups from Opanin's review on PR #2007:

1. KvCacheSession.beginTurn: if `primeIfMissing` throws after the
   addon has partially written a `.bin` to disk, the next
   `beginCustom` would `fsPromises.access(cachePath)` → true and
   trust the half-primed file as a valid cache (no rollback hook is
   registered yet — the handler hasn't seen the `TurnHandle`). Wrap
   both `beginCustom` and `beginAuto` prime calls in a shared
   `primeOrCleanup` helper that best-effort unlinks the partial file
   before re-throwing the original prime error. Adds a bare-only unit
   test asserting the on-disk file is removed and the init flag stays
   unset on the failed-prime path.

2. RequestRegistry.begin: when `parentSignal` was already aborted at
   begin time, line 271 aborts the controller but the `state` ternary
   still landed `"running"`, exactly the "momentarily-running with
   already-aborted signal" the preCancel branch was guarding against.
   Extend the ternary to cover both inputs and the existing
   `parentSignal already aborted` test now also asserts
   `ctx.state === "cancelling"`.

No behavior change on the happy path. Lint + typecheck + 351-test
unit suite green locally on the changed files.

* QVAC-18182 fix: prime is atomic — addon writes to .prime.tmp + atomic rename

Upgrade the previous reactive cleanup workaround (PR #2007 review by
@opaninakuffo) into a proactive atomic-by-construction design:

  - The session steers `model.run({ saveSessionPath })` to a sibling
    `cachePath + ".prime.tmp"` path.
  - Only after the prime closure resolves successfully do we promote
    the temp file to the canonical `cachePath` via `fsPromises.rename`
    (atomic same-volume on every host we target).
  - The canonical cache path is therefore *never* observable in a
    partial state — a thrown prime is indistinguishable on disk from
    a never-attempted prime, so the next existence probe (in-process
    or cross-process worker restart) cannot trust corrupt bytes.

Defensive details:
  - We unlink any leftover `.prime.tmp` *before* invoking the closure,
    so a deferred-write addon path can't accidentally promote
    stale-from-crash bytes left by a prior worker.
  - On prime success we probe the temp path before renaming. If the
    addon deferred its disk write (some llama.cpp paths flush lazily),
    the temp doesn't exist and we leave the canonical path absent —
    `verifySaveAndRecord` in `commitTurn` is the authoritative check.
  - On rename failure we unlink the temp and surface the rename error;
    rename atomicity guarantees the canonical path was untouched.

Why this is better than the prior `primeOrCleanup`:
  - Best-effort `unlink` was load-bearing for correctness in the old
    design — a failed unlink left a half-primed canonical file the
    next `beginCustom` would trust. The new design moves the only
    possible "partial" file to a non-trusted name, so failed cleanup
    cannot corrupt the canonical name by construction.
  - The unit test no longer mocks the workaround surface; it asserts
    the actual invariant ("canonical path was never written") plus
    the positive rename and the leftover-sweep guarantees.

Tests: 3 bare-only kv-cache-session unit tests (throw-leaves-canonical-
untouched, success-promotes-via-rename, leftover-from-crash-is-swept).
Lint + typecheck + 351-test unit suite green locally on the changed files.

Long-term, the right fix is one layer down — the llama.cpp addon should
write transactionally itself and surface save errors instead of
swallowing them. When that lands, this helper collapses to a direct
`prime(cachePath)` call and the `verifySaveAndRecord` access-probe
fallback (TODO already documented) can be retired together. Filed as
a separate follow-up; out of scope for this PR.

* QVAC-18182 fix: replace prime-atomic helper with verifyPrimedFile post-prime probe

Audit of the llama.cpp addon (`CacheManager::writeCacheFile` →
`llama_state_save_file`, return value swallowed; `LlamaModel::
processPromptImpl` lines 575-599) shows the bug shape Opanin flagged
on PR #2007 — "primeIfMissing throws after a partial save" — does not
actually fire. The save call is the very last operation on the
prefill path, the addon ignores its return value, and any earlier
throw means no save was attempted. So:

  - `primeOrCleanup` (`ac8d2d74e`) and the upgrade to
    `primeAtomically` (`a7420f3e6`) defended against a code path that
    the addon does not produce.
  - The real corruption shape is silent partial writes (addon's
    `llama_state_save_file` returns false, addon ignores it, file is
    half-written or empty). Atomic temp+rename did NOT close this
    gap — on a "silent partial" the closure resolves successfully and
    the helper would happily promote the partial `.prime.tmp` to the
    canonical path.

Replace both helpers with a small `verifyPrimedFile` that mirrors the
existing `verifySaveAndRecord` access-probe pattern used at commit
time, applied at prime time:

  - After a successful prime closure, `fsPromises.stat` the canonical
    path. If it doesn't exist (addon was interrupted before save) or
    has size 0 (addon save call produced an empty file), throw and
    best-effort unlink the empty leftover so the next existence probe
    doesn't trust it.
  - This catches the two failure modes Opanin's concern was a proxy
    for (cancelled-mid-prime; addon save quietly produced nothing)
    without claiming defense against partial-but-nonzero writes,
    which can only be closed at the addon layer.

The `RequestRegistry` parent-aborted-state fix (`ctx.state` ternary
covers `opts.parentSignal?.aborted`) from `ac8d2d74e` is preserved
unchanged — it stands on its own as a correct response to Opanin's
second comment.

Long-term root cause stays the addon: have
`CacheManager::writeCacheFile` check `llama_state_save_file`'s return
value and throw on failure. When that lands, both `verifyPrimedFile`
and `verifySaveAndRecord`'s access-probes can be retired together.
Filed as a separate follow-up — out of scope for this PR.

Tests: 3 prior bare-only prime-atomic tests removed; 2 new bare-only
tests added (no-file and empty-file rejection paths). Lint +
typecheck + 330-test unit suite green locally on the changed files
(pre-existing sdcpp-generation lint errors unchanged).

* QVAC-18182 doc: kv-cache rule documents addon non-transactional save + matched access-probes

Extend the "Cache Initialization (primeIfMissing)" section in
.cursor/rules/sdk/docs/kv-cache-system.mdc with the corrected
addon-contract analysis:

  - The llama.cpp addon's CacheManager::writeCacheFile discards
    llama_state_save_file's bool return; maybeSaveCacheToDisk is the
    last call on the prefill path. So no closure-rejection path can
    coexist with a partial file on disk.
  - Document the four real outcomes as a table (interrupted /
    success / silent partial write / pre-eval throw) so future
    readers can see why the SDK takes the shape it does.
  - Pin both SDK-side defenses as a matched pair: verifyPrimedFile
    at prime time (added in this PR) and verifySaveAndRecord at
    commit time (existing). Both are honest about what they catch
    (missing / empty file) and what they don't (partial-but-nonzero,
    only addon fix can close that).
  - Reference the addon-layer follow-up
    (1214778658064488 / "throw on llama_state_save_file failure")
    so the next contributor knows both probes will be retired
    together when the addon throws on save failure.

No code change — rule-only update.
simon-iribarren added a commit to simon-iribarren/qvac that referenced this pull request May 14, 2026
- transcribe.ts: route the two `Transcription Update` debug emits
  through `requestLogger.debug` so they carry the per-request prefix,
  matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
  unused module-level `logger`. Collapse two `scope.defer(async () =>
  { await restorePrompt(...) })` wrappers to bare arrow callbacks
  (review tetherto#5, tetherto#10).

- inference-handler-migrations.test.ts: add bareTest op-level cancel-
  by-requestId cases for `transcribe (whisper)` (asserts loop exit +
  addon.cancel called + reload-count == 2 to pin the
  `applyPrompt + restorePrompt runs exactly once` invariant) and
  `finetune` (asserts model.cancel called + scope unwind clears the
  runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
  by instrumenting the addon and asserting addon.cancel was NOT called
  during a translate cancel (review tetherto#3, tetherto#7).

- request-lifecycle-primitives.mdc: reconcile the "polling
  signal.aborted mid-handler" anti-pattern with the new "Per-iteration
  cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
  the check when the addon already honours signal directly; the M3b
  pattern is *introducing* the check where the addon doesn't and the
  loop is the only soft-cancel exit (review tetherto#4).
simon-iribarren added a commit that referenced this pull request May 15, 2026
* QVAC-18183 feat[api]: inference-handler migrations

Migrate the four remaining inference handler kinds onto the
RequestRegistry primitives shipped in M3a (cancel-capability
declaration, per-kind concurrency policy, structured
`[request-lifecycle]` logging). Each handler now opens a
request-scoped `ManagedRequestContext`, threads the optional
`requestId` from the wire request (falling back to a server-minted
UUID), routes hard cancels to `addon.cancel()` at a single signal-
listener leaf, and replaces ad-hoc `try/finally` cleanup with
`scope.defer(...)` registrations so cleanup runs in LIFO order on
every exit path.

- `embed` (kind "embeddings", `{ scope: "model", hard: true }`):
  `packages/sdk/server/bare/ops/embed.ts` opens the context, threads
  `requestId` from `embedRequestSchema`, post-await `signal.aborted`
  checks raise `InferenceCancelledError`.
- `transcribe` / `transcribeStream` (kind "transcribe",
  `{ scope: "model", hard: true }`): collapsed
  `try { ... } finally { restorePrompt(...) }` into
  `scope.defer(restorePrompt)`, added per-iteration
  `if (ctx.signal.aborted) break;` in the `response.iterate()` loop
  (Option A from §4 of the M3b brief — explicit, visible at the call
  site, no `takeWhileNotAborted` wrapper).
- `translate` (kind "translate"): two engine branches.
  llamacpp-completion declares `{ scope: "model", hard: true }` and
  wires `signal → addon.cancel()`; nmtcpp-translation keeps
  `{ scope: "none" }` and soft-cancels inside both the streaming
  iterate loop and the `runBatch` early-return path.
- `finetune` (kind "finetune"): flipped the llamacpp-completion
  manifest declaration from `{ scope: "none" }` to
  `{ scope: "model", hard: true }` (the addon already exposes
  `model.cancel()`). `startFinetune` opens a registry context and
  wires `signal → model.cancel()`; the two-level `try/finally`
  collapses into `scope.defer` for `clearFinetuneRuntimeState` and
  `handle.removeListener`. `cancelFinetune(modelId)` is now a thin
  wrapper over `getRequestRegistry().cancel({ modelId, kind:
  "finetune" })` — never invokes `model.cancel()` directly.

Per §4 of the brief: per-iteration cancel granularity uses
Option A (explicit `if (ctx.signal.aborted) break;` at the top of
each streaming loop body). No `takeWhileNotAborted` wrapper was
introduced.

Per §7 anti-patterns: M3b adds zero `oneAtATimePerModel` policies
(the four migrated kinds tolerate concurrent requests against the
same model), leaves the M1 compat-fallback in
`server/bare/ops/cancel.ts` untouched (M3d retires it), and does
not modify `cancelHandler.ts`.

Other changes:
- `embed`, `transcribe`, `transcribeStream`, `translate`,
  `finetune` request schemas grow an optional `requestId` field
  (`.string().min(1).optional()`); server-side ops fall back to
  `generateServerRequestId()` when absent.
- Whisper / Parakeet / LLM / NMT plugin handlers thread
  `request.requestId` into their bare ops.
- `plugin-cancel-capability.test.ts` truth-table flipped for the
  `finetune` row.
- New `inference-handler-migrations.test.ts` covers schema-level
  optional-`requestId` acceptance for all four kinds and pins the
  `[request-lifecycle] begin/cancel/end` line shape for each kind.
  The op-level cancel-by-requestId / cancel-by-modelId integration
  tests are bare-runtime-gated (the migrated ops pull `bare-crypto`
  / `bare-fs` transitively and can't load under Bun, same reason as
  `finetune-ops.test.disabled.ts`).
- `.cursor/rules/sdk/request-lifecycle-primitives.mdc` and
  `.cursor/rules/sdk/docs/request-lifecycle-system.mdc` updated:
  M3b row marked shipped, finetune truth-table row flipped,
  canonical-handler-shape section refreshed to use `embed.ts` as the
  cleanest reference and to document the Option A per-iteration
  check.

Verification:
- `bun lint` (eslint + tsc --noEmit): green.
- `bun run typecheck`: green.
- `bun run test:unit`: every test file green except the
  pre-existing `client/rpc/rpc-client.ts` `#rpc` package-resolution
  failure on upstream/main (also reproducible without these
  changes; unrelated to M3b).

* QVAC-18183 fix: address PR #2058 review feedback

- transcribe.ts: route the two `Transcription Update` debug emits
  through `requestLogger.debug` so they carry the per-request prefix,
  matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
  unused module-level `logger`. Collapse two `scope.defer(async () =>
  { await restorePrompt(...) })` wrappers to bare arrow callbacks
  (review #5, #10).

- inference-handler-migrations.test.ts: add bareTest op-level cancel-
  by-requestId cases for `transcribe (whisper)` (asserts loop exit +
  addon.cancel called + reload-count == 2 to pin the
  `applyPrompt + restorePrompt runs exactly once` invariant) and
  `finetune` (asserts model.cancel called + scope unwind clears the
  runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
  by instrumenting the addon and asserting addon.cancel was NOT called
  during a translate cancel (review #3, #7).

- request-lifecycle-primitives.mdc: reconcile the "polling
  signal.aborted mid-handler" anti-pattern with the new "Per-iteration
  cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
  the check when the addon already honours signal directly; the M3b
  pattern is *introducing* the check where the addon doesn't and the
  loop is the only soft-cancel exit (review #4).

* QVAC-18183 fix: drop unsafe `addon` re-narrowing in translate.ts onAbort

Addresses opaninakuffo's review comment on #2058:
`AnyModel.addon` is already typed as `AddonInterface | undefined`
(see `server/bare/registry/model-registry.ts:17-20`), so the
`as unknown as { addon?: { cancel?(jobId?: string): Promise<void> } }`
cast was unnecessary. Matches the simpler pattern used by `embed.ts`
and `transcribe.ts` for the same `onAbort` shape — keeps the four
M3b-migrated ops uniform.

* QVAC-18183 doc: trim internal milestone references from cursor rules + code comments

Removed the "Migration Roadmap" table, "M1/M2/M3a-d" milestone labels, planning-brief
decision references (Decision A/B.2, D1/D2), workspace-local paths
(`tasks/release-0.11.0-planning/...`, `pitch-3-decisions.md`), and "in review"
forward-references from the request-lifecycle cursor rules and the matching code
comments in the bare ops, finetune wrapper, and the inference-migration tests. The
canonical handler shape, anti-patterns, primitives reference, plugin cancel-capability
truth table, and concurrency-policy / structured-logging sections all stay — only the
internal milestone framing comes out.
Proletter pushed a commit that referenced this pull request May 24, 2026
* Try #1. Adding tokenizer proxy to provide vocab size.

* Try #2. More fixes and logs.

* Try #3. Limit device to only cpu or gpu.

* Revert "Try #2. More fixes and logs."

This reverts commit a461e69.

* Revert "Try #1. Adding tokenizer proxy to provide vocab size."

This reverts commit 9951195.

* Fixing pipeline logging

* Add more logs

* Fixing bench logging

* Add more error handling and logging

* Improve error handling on the server. Added retry in case of context overflow.

* Make retries self-adjustable

* Adding some more checks and limiting the datasets temporarily

* Test: trying to narrow down the error

* Exclude failing datasets from embed benchmark

* Clean up the code

* Changing bench model for LLM

* Try #1. Adding tokenizer proxy to provide vocab size.

* Try #2. More fixes and logs.

* Try #3. Limit device to only cpu or gpu.

* Revert "Try #2. More fixes and logs."

This reverts commit a461e69.

* Revert "Try #1. Adding tokenizer proxy to provide vocab size."

This reverts commit 9951195.

* Fixing pipeline logging

* Add more logs

* Fixing bench logging

* Add more error handling and logging

* Improve error handling on the server. Added retry in case of context overflow.

* Make retries self-adjustable

* Adding some more checks and limiting the datasets temporarily

* Test: trying to narrow down the error

* Exclude failing datasets from embed benchmark

* Clean up the code

* Changing bench model for LLM

* Minor fixes for clarity

* Removing unused vars

* Removing unused imports

* Removing unused python deps

---------

Co-authored-by: gianni <gianfranco.cordella@tether.io>
Proletter pushed a commit that referenced this pull request May 24, 2026
…peech (#1590)

* feat: Add runStream() which takes input as a stream

* add integration tests

* uncomment cb tests

* chore: Add cb streaming example

* feat: Add TTS streaming funcitonality and example

* Update tts addon version

* Remove chatterbox example

* add new error code for tts streaming fail

* Move common code to util

* fix: Use z.infer to define TextToSpeechStreamClientParams

* Move TextToSpeechStreamSession to schemas

* Track subscriber current index and trim queue when all subscribers consumed past items

* add missing unit tests

* fix: drive done promise from multicast pump lifecycle

* fix: Forward chunkIndex and sentenceChunk in sentence-stream mode to client

* fix: Use correct error code for tts stream failure

* chore: Add supertonic stream test in tts-tests.ts

* fix: Make tts client more readable

* Remove closures and inline async generators

* fix: Subscribe eagerly in sentenceStreamTts to avoid late-subscriber data loss

TtsMulticast.pump() starts in a microtask on construction, while the
returned async generators only call subscribe() when first iterated. If
the consumer iterated one generator before the other, the first
subscriber could trim the queue before the second ever registered,
silently dropping earlier frames.

Subscribe synchronously for both bufferStream and chunkUpdates before
returning, so both subscriber indexes are in place before pump pushes
its first item.

Made-with: Cursor

* fix: Close TTS stream on server-sent done frame

Remove the dead `null` sentinel from `processTextToSpeechStreamLine`
and instead close `parseTextToSpeechStreamLines` after yielding the
terminal `done: true` frame, so consumers don't rely on the server
closing the socket to stop iteration.

Made-with: Cursor

* fix: Reject sentenceStream without stream in textToSpeech

Previously `sentenceStream: true` combined with `stream: false` fell
through to the collect path, silently dropping the sentence-stream
parameters and returning no `chunkUpdates`. Fail fast at the
dispatcher with a clear error so the contract mismatch surfaces to
the caller instead of being swallowed.

Made-with: Cursor

* fix: Release TtsMulticast subscriber slot on early break

Wire a try/finally into drain() so that when a consumer breaks out of
the for-await (or the generator is .return()'d / throws), the slot is
parked at +Infinity via unsubscribe(). This prevents a stale low
min-index from permanently pinning trimConsumed, which otherwise leaked
the queue for the entire RPC stream.

Made-with: Cursor

* fix: Guard TTS stream write after close and preserve UTF-8 boundaries

Client:
- Track a `closed` flag in `textToSpeechStream` duplex session, set by
  `end()` / `destroy()`. Subsequent `write()` calls now throw a typed
  `TextToSpeechStreamFailedError` instead of propagating a raw Bare/Node
  "write after end" stream error.
- `end()` is idempotent so accidental double-close no longer errors.

Server:
- `buffersToUtf8Fragments` previously decoded each incoming Buffer via
  `toString("utf8")`, which corrupts any multi-byte codepoint whose bytes
  straddle a chunk boundary (common with CJK / emoji / accented scripts
  emitted as LLM token deltas). Added a small tail-buffer that finds the
  last complete UTF-8 codepoint end in the combined buffer and defers
  trailing incomplete bytes to the next chunk. Any dangling partial
  sequence is flushed on stream end.

Made-with: Cursor

* fix: Order TEXT_TO_SPEECH_STREAM_FAILED code and document it

- Move TEXT_TO_SPEECH_STREAM_FAILED (52415) to the end of the 52400
  Model Operations block so the ordering in SDK_SERVER_ERROR_CODES
  matches the numeric sequence (…52413, 52414, 52415).
- Add the missing row for 52415 to the (latest) errors.mdx table, per
  the sdk/docs-freshness rule that the error table stay in sync
  whenever a new code is introduced.

Made-with: Cursor

* fix: Register operation metrics for textToSpeechStream

Only `textToSpeech` was registered in `operation-metrics.ts`, so the
duplex `textToSpeechStream` path silently skipped `modelExecutionTime`,
`audioDuration`, and `totalSamples` gauges even though the server
already collects the same `TtsStats` via `collectTtsStats()` on the
final chunk. Mirror the non-streaming registration so the streaming
path has parity observability.

Made-with: Cursor

* fix: Harden TTS client done-promise, iterator, and parse cost

Polish the remaining review nits on the TTS client streaming surface.

- #3 TtsMulticast.pump now rejects the `done` promise with the fatal
  error instead of resolving `false`. An internal `.catch(() => {})`
  silences unhandled-rejection warnings when the caller only iterates
  the buffer/chunk streams and never awaits `done`; re-awaits still
  see the rejection.
- #6 TextToSpeechStreamSession[Symbol.asyncIterator] no longer throws
  synchronously on a second iteration; it returns an iterator whose
  first `.next()` rejects, so `for await` surfaces the error in the
  normal async control flow rather than the iterator protocol.
- #9 plainTtsBufferStream / collectTtsBuffer wrap the RPC loop in
  try/catch/finally so `done` always settles: resolve(true) on the
  terminal frame, reject with the real error on exceptions, and
  resolve(false) on early consumer break. Previously `await done`
  could hang forever when the consumer bailed out early.
- #11 Skip per-frame ttsResponseSchema.parse() in all three paths;
  rely on the discriminated-union narrowing at the RPC boundary.
  Drops the per-PCM-frame Zod validation cost for large sentences.

Made-with: Cursor

* fix: Tighten textToSpeechStream schema surface

- Add .positive() to maxBufferScalars and flushAfterMs to match the
  existing constraint on sentenceStreamMaxChunkScalars. Previously a
  caller could pass negative values straight through to the addon.
- Un-export textToSpeechStreamRequestBaseSchema — consumers only need
  the finalized textToSpeechStreamRequestSchema, and the base is an
  implementation detail of the shared object shape. The exported type
  alias TextToSpeechStreamClientParams continues to derive from the
  base via `typeof`, so nothing on the public type surface changes.

Made-with: Cursor

* fix: Cross-platform tmp path and safer PCM append in TTS examples

- playPcmInt16Chunk now writes the intermediate WAV chunk under
  os.tmpdir() / path.join instead of a hard-coded /tmp/qvac-tts-chunk-…
  path. The previous code's Windows branch was unreachable in practice
  because the POSIX /tmp directory doesn't exist there; this uses
  %TEMP% on Windows automatically.
- appendPcmSamples switches from `target.push(...chunk.slice(i, end))`
  to `Array.prototype.push.apply(target, chunk.slice(i, end))`. Same
  semantics, but avoids allocating the spread rest array per batch
  and is closer to a memcpy-style concat in V8.

Made-with: Cursor

* fix: Catch zero-chunk regressions in TTS sentence-stream test

- TtsExecutor.makeSentenceStream now returns `{ passed: false, ... }`
  when the chunkUpdates iterator yields no chunks / no samples. The
  previous executor always returned a formatted string regardless of
  counts, so a regression that silently emitted zero chunks would
  still have looked like a pass.
- ttsSupertonicSentenceStream's expectation upgraded from
  `{ validation: "type", expectedType: "string" }` to
  `{ validation: "contains-all", contains: ["sentence-streamed",
  "chunks", "samples"] }`. The executor's zero-case failure string
  lacks "sentence-streamed", so the contains-all match fails on
  regression.

Made-with: Cursor

* fix: Apply stream default locally and throw typed error on tts mismatch

Previous guard only rejected the explicit `stream: false + sentenceStream:
true` combination. A caller passing `{ modelId, text, sentenceStream: true }`
with `stream` omitted silently fell through to `collectTts` while the
server's Zod `.default(true)` still ran the sentence-stream branch and
emitted chunk frames — which the client then discarded, dropping all
chunk metadata.

- Resolve the `stream` default locally (`params.stream ?? true`) so the
  client's dispatch routing matches the server's Zod-applied routing,
  and an omitted `stream` now correctly lands in `sentenceStreamTts` or
  `plainStreamTts`.
- Only the explicit `sentenceStream: true + stream: false` combination
  is rejected, and it now throws `TextToSpeechStreamFailedError` (code
  52415) instead of a bare `new Error(...)` so callers can discriminate
  by error code like everywhere else in the SDK.

Made-with: Cursor

* remove inline defaults for sentenceStream and stream

* Use TtsMulticast in unit test instead of mock

---------

Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
Proletter pushed a commit that referenced this pull request May 24, 2026
…#1983)

* feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp)

New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed
by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg).  API-compatible
with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream
consumers can swap backends without touching orchestration code.

## Scope

* First iteration.  Supports Chatterbox **English** only.  Chatterbox
  multilingual, LavaSR enhancer, Supertonic engine, and streaming are
  out of scope and remain in `@qvac/tts-onnx`.  They'll land alongside
  the evolution of qvac-tts.cpp.
* Native backend is the static `qvac-tts` library from the QVAC vcpkg
  registry (`ports/tts-cpp`, baseline `2026-04-21`).  No ONNX Runtime
  dependency.

## JS surface

* `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as
  `ONNXTTS`:  `run` / `runStream` / `runStreaming` / `reload` /
  `unload` / `destroy`.
* `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` +
  `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` /
  `files.s3genModel` override the defaults.
* Options: `referenceAudio`, `voiceDir` (baked profile), `seed`,
  `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for
  the upcoming streaming flags (`streamChunkTokens`,
  `streamFirstChunkTokens`, `cfmSteps`).
* Shared reusable lib code (`lib/textChunker.js`,
  `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim
  from `@qvac/tts-onnx`.
* New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000**
  to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both
  packages are loaded in the same Bare process.

## Native addon

* `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` —
  `IModel` + `IModelCancel` implementation.  First-iteration strategy:
  assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output
  path, call it synchronously, then parse the resulting 16-bit mono
  PCM wav back into `std::vector<int16_t>` for the JS handler.
  Consequences: every job re-loads the model (~700 ms + inference
  time), no mid-synthesis cancellation, no streaming.  The follow-up
  milestone replaces this with a persistent, struct-based API once
  qvac-tts.cpp exposes one.
* `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` — JS-to-C++
  config bridging (same string-map pattern as `@qvac/tts-onnx`) and the
  `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing
  `createInstance` / `runJob` / `reload` / `activate` / `cancel` /
  `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`.
* `addon/src/addon/AddonJs.hpp` — JS-facing `createInstance` / `runJob`
  / `reload` wrappers that register a `JsAudioOutputHandler` emitting
  `{ outputArray: Int16Array, sampleRate: number }` to JS.

## Build / registry

* `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)`
  and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape
  matches `@qvac/transcription-whispercpp`).
* `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough)
  plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`.
* `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg.
  NOTE: the baseline pin here is inherited from
  `@qvac/transcription-whispercpp` and **must be bumped** to a commit
  that contains the `tts-cpp` port once that registry PR lands.  A
  follow-up commit will update it.

## Tests & examples

* Integration + unit test files for Chatterbox English are copied
  verbatim from `@qvac/tts-onnx` with only mechanical renames
  (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`,
  `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`).  Some
  paths in `test/integration/addon.test.js` still import Supertonic /
  LavaSR helpers that don't exist in this package — those test blocks
  will fail fast when the file loads, which is expected until those
  backends get their own ggml packages.
* Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus
  shared `wav-helper.js` + `pcm-chunk-player.js`.

## What's not in this PR (known gaps)

* No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes
  will land in a single documentation pass once the registry + fork
  commits have merged upstream.
* `vcpkg-configuration.json` baseline needs to point at a
  qvac-registry-vcpkg commit that ships `tts-cpp` (pending the
  registry PR).
* Actual `npm run build` requires the registry and fork commits to be
  on `main` of their respective upstream repos.

* chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit

Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg
at commit 1e2839680b6be8d8ffff889a9c29b966c176098c — the commit that
adds the `tts-cpp` port.  Paired with the `qvac-tts` library already
pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp
@ 0fe4a521618cc30358040b29d75d4261b31cbb60).

Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry
PR lands upstream.

* chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper

Second pass over @qvac/tts-ggml after the build started passing: prune
everything that only made sense for the ONNX-era multi-engine scope and
adapt the remaining Chatterbox-English bits to the GGUF + file-path
reference-audio contract.  Restores `test/mobile/` so the Android build
has something to point at.

## C++

* `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment
  contained `**/` which closed the block comment early and broke the
  build.  Rewrote as a `//` comment.

## Examples

* `examples/chatterbox-tts.js` — rewrite for v0 contract: single
  `<text>` argv, `files: { modelDir }` pointing at the two GGUFs,
  `referenceAudio` is now a wav **path** (addon passes it to
  `--reference-audio`) instead of a Float32Array.  Drops
  english/multilingual arg and the CHATTERBOX_VARIANT switch that
  picked which `.onnx` files to load.
* Removed `examples/chatterbox-streaming-tts.js` +
  `examples/pcm-chunk-player.js`.  The v0 addon re-loads the model
  per `run()` call — exposing streaming would mislead.  Both come
  back alongside the persistent-engine milestone.
* `package.json`: `npm run example` now passes a default text so it
  runs without extra args.

## Tests

### Kept as-is (engine-agnostic)

* `test/unit/textChunker.test.js`
* `test/mock/{MockedBinding,utils}.js`
* `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js`
* `test/reference-audio/jfk.wav`, `test/data/sentences-*.js`

### Mechanical fixes

* `test/unit/tts.error.test.js` — fix error-code assertions to the
  tts-ggml range (`13001–14000`); was still checking the
  `@qvac/tts-onnx` range (`7001–7011`).
* `test/unit/tts-ggml.lifecycle.test.js` — fix stale
  `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the
  stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the
  non-existent `engine: 'chatterbox'` option.
* `test/unit/tts-ggml.sentence-stream.test.js` — same GGUF/engine
  cleanup.

### Rewritten

* `test/unit/chatterbox.inference.test.js` — drop tests that asserted
  the old ONNX file shape (`tokenizer / speechEncoder / embedTokens /
  conditionalDecoder / languageModel`), the removed `engine` detection
  and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`).
  New tests cover: `modelDir` derives the two GGUF paths; explicit
  `t3Model` / `s3genModel` override the defaults.  The mocked-binding
  run/reload/cancel flow stays.
* `test/integration/addon.test.js` — fresh, ~180 LoC, Chatterbox-English
  only.  Ensures the GGUFs are present, runs the short sentence set
  through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and
  (on darwin only) runs a whisper-based WER check via the existing
  `runWhisper` util.  Drops the Chatterbox-multilingual block + every
  Supertonic + LavaSR block that doesn't apply to this package.
* `test/utils/runChatterboxTTS.js` — rewrite for the GGUF contract:
  `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a
  file path that falls back to `test/reference-audio/jfk.wav` (or the
  mobile test-asset when `global.assetPaths` is present).  No more
  WAV decode / resample on the JS side.
* `test/utils/downloadModel.js` — trim from 1007 LoC to 280.  Drops
  the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie
  downloaders.  Keeps the shared HTTP/curl infrastructure and
  `ensureWhisperModel` (still used by the integration WER check).
  `ensureChatterboxModels` is now **check-only**: it verifies
  `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally
  and, if missing, prints the exact commands for generating them
  from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts.
  Once the GGUFs land on a canonical HuggingFace repo we'll wire up
  download URLs here.

## Scripts

* `scripts/ensure-chatterbox.js` — simplify to a single invocation
  against `./models/`.  Drops the variant / language matrix that the
  ONNX downloader needed.
* `scripts/ensure-models.js` — now a thin alias to
  `ensure-chatterbox.js`.  Drops the Supertonic + LavaSR orchestration.

## Mobile

* Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs,
  testAssets/jfk.wav}` so the Android build has a wrapper to point at.
* `package.json`: re-added `test/mobile` to the `files` list.

## Gitignore

* Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp`
  (produced by the top-level `configure_file(...)` calls) and
  `build_*/` dirs (bare-make convention).

## Verified locally

* `npx standard "test/**/*.js" "*.js" "lib/*.js"` — clean.
* `npm run test:unit` — 38/38 pass (105/105 asserts).
* `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."`
  produces a 24 kHz wav as expected.

* Add streaming support

* Update ggml backend to use separate ggml repo

* tts-ggml: consume renamed tts-cpp library (2026-04-24#1)

Upstream chatterbox.cpp renamed the package + namespace + target from
qvac-tts to tts-cpp and tightened the library boundary; pick up the
new artefacts here:

- find_package(qvac-tts-cpp CONFIG REQUIRED)
    -> find_package(tts-cpp CONFIG REQUIRED)
- qvac-tts::qvac-tts  -> tts-cpp::tts-cpp
- qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions,
  SynthesisResult, forward-decls in ChatterboxModel.hpp)
- #include <qvac-tts/chatterbox/engine.h>
    -> #include <tts-cpp/chatterbox/engine.h>
- Doxygen / inline doc references to the old names refreshed alongside
  the code changes.

vcpkg wiring:
- vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg
  commit bc30b0b (ports/tts-cpp renamed and repointed at
  chatterbox.cpp@f8f9145).
- vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that
  carries the rename + namespace + install(EXPORT) changes).

Verified with a cold bare-make generate + bare-make build against the
new port, and the addon's existing unit + integration test suites.

Made-with: Cursor

* tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline

Picks up the round-3 review-fix wave landed on the tts-cpp port:

  e673182  scrub stale patches/ refs from README                (N10)
  8ba10a6  drop unreachable TTS_CPP_GGML_LIB_PREFIX block        (N8)
  4b5d2d7  mirror N1-N7 fixes from chatterbox.cpp source-of-truth
            - N1 supertonic alive-registry guard against freed-backend
              gallocr_free assert on hot-swap (Vulkan/Metal/CUDA)
            - N2 drop dead g_sink_* state, soften log_set docstring
            - N3 Turbo BPE try/catch (exception-safe Engine ctor)
            - N4 STFT cancel checkpoint + tighter Engine::cancel() doc
            - N5 document s3gen_preload/unload refcount semantics
            - N6 drop dead cached_text_lc Supertonic shim
            - N7 fix misleading "no copy" view-vs-copy log wording

Plus the integrated-port-only round-2 fixes that landed earlier:

  fa0d490  close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML
            now defaults ON; bundled-without-patches hard-errors at
            configure time with a pointer at the ggml-speech vcpkg
            port.
  ae34c58  README rewritten for integrated/vcpkg context.
  a2f2dd6  top-level qvac-ext-lib-whisper.cpp README points at the
            tts-cpp/ subtree (alongside parakeet-cpp/).

Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine /
EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is
backward-compatible: the new port adds Engine::backend_name(),
MTL-variant fields on EngineOptions (language / cfg_weight / min_p /
exaggeration), and a separate tts_cpp::supertonic::Engine class, but
nothing this consumer was already calling has changed.

Edits:

  packages/tts-ggml/vcpkg.json
    - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07.

  packages/tts-ggml/vcpkg-configuration.json
    - default-registry baseline: bc30b0b (April 2026 fork-only state)
      -> 16b91afdcfd59baea60e81f3da94f49311ef2a97.  The new baseline
      pulls in the post-tetherto-merge state (parakeet-cpp port at
      932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new
      tts-cpp port (16b91af) on the developer's GustavoA1604
      registry fork.

Smoke-test plan: after running `vcpkg install` against the new
baseline, the tts-cpp port's vcpkg_from_github resolves at
GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the
upstream PR merges.  ChatterboxModel should build and synthesize
identically; expanding to Multilingual + Supertonic flows is the
follow-up commit on the package side.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add chatterbox multilingual and supertonic

* Add mobile integration tests

* tts-ggml: drop clang-19 pin in linux-clang toolchain

The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary
names) since the package's first commit (0a2c978).  Linux CI hadn't
exercised this path before — the new on-pr-tts-ggml.yml -> integration
matrix is the first time it does, and it fails on every linux runner
(ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's
"detect_compiler" step because none of the GH-hosted images ship a
`clang-19` symlink:

  Detecting compiler hash for triplet x64-linux...
  error: while detecting compiler information:
  ...
  CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127
  (message): Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE=
  .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ...

Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/
toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so
each runner picks up its image's default clang (clang-15 on
ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship).
The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake
is honoured by every reasonable clang version.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add C++ tests and coverage; fix linux build

* tts-ggml: address PR review feedback

Bundle of correctness, hygiene, and CI-doc fixes from the recent code
review.  Each item below has its own paragraph in the diff comments.

- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js
  to package.json so consumers running the integration tests from the
  npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to
  devDependencies (it's only referenced from examples/, which aren't in
  the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming
  detection used to read engine_->options() outside engineMu_, racing
  with reload().  synthesize() now returns SynthesizeResult { pcm,
  wasStreaming } where wasStreaming is captured under the engine lock
  against the local shared_ptr so process() doesn't have to touch
  engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors
  used to call load() eagerly, so JsInterface::createInstance() (sync
  on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop.
  Both models now implement IModelAsyncLoad: constructors validate +
  return; the actual load is deferred to waitForLoadInitialization(),
  which the new addon_js::activate wraps inside JsAsyncTask::run so the
  parse runs on a worker thread.  binding.cpp registers
  addon_js::activate in place of JsInterface::activate; tts.js now
  awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj
  read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE /
  FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-
  not-thrown so future maintainers don't delete them blindly (the unit
  suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_
  reset pattern: cancel() sets it, synthesize() fast-fails on it,
  process() resets it per call so a stale cancel doesn't poison the
  next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that
  the JS layer is the source of truth for useGPU and nGpuLayers wins
  downstream; left a pointer to std::optional<bool> if a future caller
  ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no
  longer point at GustavoA1604/chatterbox.cpp; both reference the
  upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment
  on the build-and-test job documenting that continue-on-error is the
  early-days landing posture (merge-guard treats success || skipped as
  pass), with a pointer to tighten once Device Farm provisioning is
  stable.

Nits:
- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on
  scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer JSON.stringify's the full PCM
  Int16Array on every chunk-streaming event; emits a tiny summary
  ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen})
  instead.

Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load
contract); 4 skipped real-GGUF tests behind the existing
QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF /
QVAC_TEST_SUPERTONIC_GGUF env-var gates.  Lint clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: unblock CI integration tests on every desktop runner

Four independent failures, one per platform:

1. linux-x64 / linux-arm64: addon load crashed at
   `libomp.so.5: cannot open shared object file`.  tts-cpp's binary is
   built with clang under the linux-clang toolchain and links against
   libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being
   apt-installed.  Add `libomp5` so libomp.so.5 is on the loader path.

2. darwin-arm64: convert-models.sh aborted at line 200 with
   `hf_args[@]: unbound variable`.  macOS's system bash is 3.2 which
   treats `"${arr[@]}"` as nounset access when the array is empty under
   `set -u`; with HF_TOKEN unset we hit it on every fresh runner.  Use
   the `${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six
   call sites and add a header comment so the next maintainer doesn't
   accidentally regress.

3. darwin-x64: pip install bombed building `llvmlite` from source
   because the macos-15-large runner has no LLVM 15 development
   install.  Root cause: librosa pulls in numba 0.65+, which stopped
   shipping darwin-x86_64 wheels for Python 3.12.  Pin Python to 3.11
   in the Setup Python step; 3.11 has prebuilt wheels for the entire
   numba/llvmlite/librosa stack on darwin-x64 and is fine for every
   other converter dependency.

4. windows-2022: ChatterboxModel::load threw
   `vk::createInstance: ErrorIncompatibleDriver`.  Root cause: the
   addon's index.js::_validateConfig defaults `useGPU = true` when
   neither useGPU nor nGpuLayers is specified, so the test ran with
   n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance ->
   ErrorIncompatibleDriver on the runner's no-Vulkan-driver image.
   runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'`
   (set on the no-GPU matrix entries) and forces useGPU=false on
   exactly those runners; the other test runners (chatterbox-mtl,
   gpu-smoke, multiple-runs) already had this guard.

Also documents the `mesa-vulkan-drivers` apt package (already pulled
in) as the software ICD that lets the Vulkan-built prebuild's runtime
backend probe enumerate at least one device on linux runners.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit)

Mobile build failed at `:app:createBundleReleaseJsAndAssets` with:

  SyntaxError: assets/testAssets/chatterbox-s3gen.gguf:
    Cannot create a string longer than 0x1fffffe8 characters

Root cause: Metro's bundler reads every asset under
`test/mobile/testAssets/` via `Buffer.toString()`.  V8's max string
length is 0x1fffffe8 (~512 MiB).  chatterbox-s3gen.gguf is ~1 GiB even
with --quant q4_0 because the s3gen converter only quantizes attention
weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight
tensors quantized" in the converter log).

Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the
limit) on mobile.  Mobile Chatterbox tests degrade cleanly to
`t.pass('Skipped: Chatterbox GGUFs not available')` via the existing
`ensureChatterboxModels` helper -- it already returns
{ success: false } when the GGUFs aren't on disk.

Cache key bumped to v2 so existing v1 cache entries (which include
the chatterbox files) are evicted on the next run.

Bundling Chatterbox on mobile requires either:
  - adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the
    JS-string read is skipped (then the s3gen file can flow through the
    bundle as a raw asset), or
  - pushing the chatterbox GGUFs to the device via `adb push` outside
    the bundle and surfacing the path through downloadModel.js's
    existing ANDROID_CANDIDATE_DIRS fallback.

Both are outside the scope of this PR; documented inline above the
cache step for the next maintainer.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Bump hash of vcpkg

* Consume vcpkg from tetherto repository

* Fix integration tests failures in all platforms

* Further fix tests

* fix: Make useGPU flag more meaningful (#1953)

* fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts

* add gpu smoke test

* resolve comments

---------

Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

* Update dependencies after monorepo directory changes

* Further drop qvac-lib- prefix

* Add CHANGELOG.md

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com>
Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
Proletter pushed a commit that referenced this pull request May 24, 2026
…ache via KvCacheSession (#2007)

* QVAC-18182 feat[api]: typed cancel outcomes on the wire + atomic KV-cache via KvCacheSession

Builds on QVAC-18181's request lifecycle primitives (DisposableScope,
RequestContext, RequestRegistry) to deliver the M2 milestone:

- Typed cancel outcomes: `stopReason: "cancelled"` on `completionDone`
  events, and `InferenceCancelledError(requestId, partial)` thrown from
  CompletionRun promise-aggregates (`final` / `text` / `toolCalls` /
  `stats`). The wire stream still ends normally so iterating
  `run.events` is unaffected — the typed error lives on the aggregate
  promises that callers `await` for the final result.

- KvCacheSession (`server/bare/plugins/llamacpp-completion/ops/
  kv-cache-session.ts`) — single atomic owner of the three KV-cache
  layers (`cachedMessageCounts`, `initializedCaches`, on-disk `.bin`
  files). `beginTurn` / `commitTurn` / `rollback` collapse the three
  duplicated cleanup blocks in `completion-stream.ts` into one
  scope.defer hook. Cross-model administrative deletion lives at the
  module level as `deleteKvCacheState(...)`, called by the RPC
  `handleDeleteCache` handler.

- Stop-button race close — `RequestRegistry` now keeps a bounded
  cancelled-before-begin map (128 entries, 30s TTL). A `cancel({
  requestId })` that lands before the server's `begin(...)` ran is
  applied retroactively when begin lands, so same-tick stop clicks no
  longer disappear into the void. Internal-only — the wire surface for
  `cancel` is unchanged (Option A in the brief).

Cursor rules updated in the same PR so the request-lifecycle and
KV-cache topic docs stay in sync with the implementation.

Tests:
- unit: KvCacheSession (bareTest-gated, runs in the Bare consumer),
  RequestRegistry race + bounded-set eviction, completion-event schema
  cancelled cases.
- e2e: cancellation-tests.ts adds three definitions — mid-stream cancel
  (events.stopReason === "cancelled", final rejects with
  InferenceCancelledError, partial.text matches concatenated
  contentDelta), cancel-before-begin (retroactive abort), and
  cancel-then-resume-kv-cache (rollback wiped the three layers, the
  next turn re-primes cleanly).

* chore: drop planning labels (Mx/Dx) from QVAC-18182 comments

Strips milestone (`M1`/`M2`/`M3a`...) and deliverable (`D2`/`D5`/`D7`)
labels from comments and test titles introduced with the typed-cancel
outcomes + KvCacheSession work. The substantive descriptions of the
contracts (Stop-button race, cancelled-before-begin map, three-layer
session ownership, etc.) are preserved; only the planning-doc
references are removed so the code reads cleanly without the pitch
context. Durable `QVAC-XXXXX` ticket references are kept.

No behavior or API surface changes.

* chore: drop Asana ticket references from QVAC-18182 code comments

Strips QVAC-XXXXX inline ticket references from code/test comments
introduced by the typed-cancel-outcomes work. Concept names
(Stop-button race, cancelled-before-begin, etc.) and prose
descriptions of the contracts are preserved; only the ticket-tag
suffixes go. Also renames a test cache key from
`qvac-18182-cancel-resume-kvcache` to `cancel-then-resume-kvcache` so
the cache key reads as a stable identifier rather than a ticket
reference.

No behavior or API surface changes.

* QVAC-18182 doc: clarify error>cancelled precedence + deleteKvCacheState concurrency

Address non-blocking review nits on PR #2007:

- aggregate-events: explain why a wire event carrying both error and
  cancelled signals resolves to error (closes brief open question #3).
- kv-cache-session: doc-comment on deleteKvCacheState explaining the
  ordering guarantee under concurrent in-flight turns -- delete is
  wire-async, in-flight turns roll back idempotently when their commit
  probe finds the file gone (closes brief open question #4).

Comments only; no behavior changes.

* QVAC-18182 doc: demonstrate typed cancel outcomes in cancel example

Enhance the existing cancel-by-request-id example to demonstrate the
two M2 cancel-outcome channels:

- run.events ends normally with completionDone carrying
  stopReason: "cancelled" -- show reading it inside the iteration loop.
- run.text rejects with InferenceCancelledError(requestId, partial) on
  cancel -- show the instanceof check and consuming partial.text,
  partial.toolCalls, partial.stats.

Also update the header to remove the now-stale "logged as a no-match"
sentence (same-tick cancels are no longer dropped after M2's race
close).

Pure documentation enhancement; no API or behavior changes.

* QVAC-18182 fix: address PR review — partial-prime cleanup + parent-aborted state

Two follow-ups from Opanin's review on PR #2007:

1. KvCacheSession.beginTurn: if `primeIfMissing` throws after the
   addon has partially written a `.bin` to disk, the next
   `beginCustom` would `fsPromises.access(cachePath)` → true and
   trust the half-primed file as a valid cache (no rollback hook is
   registered yet — the handler hasn't seen the `TurnHandle`). Wrap
   both `beginCustom` and `beginAuto` prime calls in a shared
   `primeOrCleanup` helper that best-effort unlinks the partial file
   before re-throwing the original prime error. Adds a bare-only unit
   test asserting the on-disk file is removed and the init flag stays
   unset on the failed-prime path.

2. RequestRegistry.begin: when `parentSignal` was already aborted at
   begin time, line 271 aborts the controller but the `state` ternary
   still landed `"running"`, exactly the "momentarily-running with
   already-aborted signal" the preCancel branch was guarding against.
   Extend the ternary to cover both inputs and the existing
   `parentSignal already aborted` test now also asserts
   `ctx.state === "cancelling"`.

No behavior change on the happy path. Lint + typecheck + 351-test
unit suite green locally on the changed files.

* QVAC-18182 fix: prime is atomic — addon writes to .prime.tmp + atomic rename

Upgrade the previous reactive cleanup workaround (PR #2007 review by
@opaninakuffo) into a proactive atomic-by-construction design:

  - The session steers `model.run({ saveSessionPath })` to a sibling
    `cachePath + ".prime.tmp"` path.
  - Only after the prime closure resolves successfully do we promote
    the temp file to the canonical `cachePath` via `fsPromises.rename`
    (atomic same-volume on every host we target).
  - The canonical cache path is therefore *never* observable in a
    partial state — a thrown prime is indistinguishable on disk from
    a never-attempted prime, so the next existence probe (in-process
    or cross-process worker restart) cannot trust corrupt bytes.

Defensive details:
  - We unlink any leftover `.prime.tmp` *before* invoking the closure,
    so a deferred-write addon path can't accidentally promote
    stale-from-crash bytes left by a prior worker.
  - On prime success we probe the temp path before renaming. If the
    addon deferred its disk write (some llama.cpp paths flush lazily),
    the temp doesn't exist and we leave the canonical path absent —
    `verifySaveAndRecord` in `commitTurn` is the authoritative check.
  - On rename failure we unlink the temp and surface the rename error;
    rename atomicity guarantees the canonical path was untouched.

Why this is better than the prior `primeOrCleanup`:
  - Best-effort `unlink` was load-bearing for correctness in the old
    design — a failed unlink left a half-primed canonical file the
    next `beginCustom` would trust. The new design moves the only
    possible "partial" file to a non-trusted name, so failed cleanup
    cannot corrupt the canonical name by construction.
  - The unit test no longer mocks the workaround surface; it asserts
    the actual invariant ("canonical path was never written") plus
    the positive rename and the leftover-sweep guarantees.

Tests: 3 bare-only kv-cache-session unit tests (throw-leaves-canonical-
untouched, success-promotes-via-rename, leftover-from-crash-is-swept).
Lint + typecheck + 351-test unit suite green locally on the changed files.

Long-term, the right fix is one layer down — the llama.cpp addon should
write transactionally itself and surface save errors instead of
swallowing them. When that lands, this helper collapses to a direct
`prime(cachePath)` call and the `verifySaveAndRecord` access-probe
fallback (TODO already documented) can be retired together. Filed as
a separate follow-up; out of scope for this PR.

* QVAC-18182 fix: replace prime-atomic helper with verifyPrimedFile post-prime probe

Audit of the llama.cpp addon (`CacheManager::writeCacheFile` →
`llama_state_save_file`, return value swallowed; `LlamaModel::
processPromptImpl` lines 575-599) shows the bug shape Opanin flagged
on PR #2007 — "primeIfMissing throws after a partial save" — does not
actually fire. The save call is the very last operation on the
prefill path, the addon ignores its return value, and any earlier
throw means no save was attempted. So:

  - `primeOrCleanup` (`ac8d2d74e`) and the upgrade to
    `primeAtomically` (`a7420f3e6`) defended against a code path that
    the addon does not produce.
  - The real corruption shape is silent partial writes (addon's
    `llama_state_save_file` returns false, addon ignores it, file is
    half-written or empty). Atomic temp+rename did NOT close this
    gap — on a "silent partial" the closure resolves successfully and
    the helper would happily promote the partial `.prime.tmp` to the
    canonical path.

Replace both helpers with a small `verifyPrimedFile` that mirrors the
existing `verifySaveAndRecord` access-probe pattern used at commit
time, applied at prime time:

  - After a successful prime closure, `fsPromises.stat` the canonical
    path. If it doesn't exist (addon was interrupted before save) or
    has size 0 (addon save call produced an empty file), throw and
    best-effort unlink the empty leftover so the next existence probe
    doesn't trust it.
  - This catches the two failure modes Opanin's concern was a proxy
    for (cancelled-mid-prime; addon save quietly produced nothing)
    without claiming defense against partial-but-nonzero writes,
    which can only be closed at the addon layer.

The `RequestRegistry` parent-aborted-state fix (`ctx.state` ternary
covers `opts.parentSignal?.aborted`) from `ac8d2d74e` is preserved
unchanged — it stands on its own as a correct response to Opanin's
second comment.

Long-term root cause stays the addon: have
`CacheManager::writeCacheFile` check `llama_state_save_file`'s return
value and throw on failure. When that lands, both `verifyPrimedFile`
and `verifySaveAndRecord`'s access-probes can be retired together.
Filed as a separate follow-up — out of scope for this PR.

Tests: 3 prior bare-only prime-atomic tests removed; 2 new bare-only
tests added (no-file and empty-file rejection paths). Lint +
typecheck + 330-test unit suite green locally on the changed files
(pre-existing sdcpp-generation lint errors unchanged).

* QVAC-18182 doc: kv-cache rule documents addon non-transactional save + matched access-probes

Extend the "Cache Initialization (primeIfMissing)" section in
.cursor/rules/sdk/docs/kv-cache-system.mdc with the corrected
addon-contract analysis:

  - The llama.cpp addon's CacheManager::writeCacheFile discards
    llama_state_save_file's bool return; maybeSaveCacheToDisk is the
    last call on the prefill path. So no closure-rejection path can
    coexist with a partial file on disk.
  - Document the four real outcomes as a table (interrupted /
    success / silent partial write / pre-eval throw) so future
    readers can see why the SDK takes the shape it does.
  - Pin both SDK-side defenses as a matched pair: verifyPrimedFile
    at prime time (added in this PR) and verifySaveAndRecord at
    commit time (existing). Both are honest about what they catch
    (missing / empty file) and what they don't (partial-but-nonzero,
    only addon fix can close that).
  - Reference the addon-layer follow-up
    (1214778658064488 / "throw on llama_state_save_file failure")
    so the next contributor knows both probes will be retired
    together when the addon throws on save failure.

No code change — rule-only update.
Proletter pushed a commit that referenced this pull request May 24, 2026
* QVAC-18183 feat[api]: inference-handler migrations

Migrate the four remaining inference handler kinds onto the
RequestRegistry primitives shipped in M3a (cancel-capability
declaration, per-kind concurrency policy, structured
`[request-lifecycle]` logging). Each handler now opens a
request-scoped `ManagedRequestContext`, threads the optional
`requestId` from the wire request (falling back to a server-minted
UUID), routes hard cancels to `addon.cancel()` at a single signal-
listener leaf, and replaces ad-hoc `try/finally` cleanup with
`scope.defer(...)` registrations so cleanup runs in LIFO order on
every exit path.

- `embed` (kind "embeddings", `{ scope: "model", hard: true }`):
  `packages/sdk/server/bare/ops/embed.ts` opens the context, threads
  `requestId` from `embedRequestSchema`, post-await `signal.aborted`
  checks raise `InferenceCancelledError`.
- `transcribe` / `transcribeStream` (kind "transcribe",
  `{ scope: "model", hard: true }`): collapsed
  `try { ... } finally { restorePrompt(...) }` into
  `scope.defer(restorePrompt)`, added per-iteration
  `if (ctx.signal.aborted) break;` in the `response.iterate()` loop
  (Option A from §4 of the M3b brief — explicit, visible at the call
  site, no `takeWhileNotAborted` wrapper).
- `translate` (kind "translate"): two engine branches.
  llamacpp-completion declares `{ scope: "model", hard: true }` and
  wires `signal → addon.cancel()`; nmtcpp-translation keeps
  `{ scope: "none" }` and soft-cancels inside both the streaming
  iterate loop and the `runBatch` early-return path.
- `finetune` (kind "finetune"): flipped the llamacpp-completion
  manifest declaration from `{ scope: "none" }` to
  `{ scope: "model", hard: true }` (the addon already exposes
  `model.cancel()`). `startFinetune` opens a registry context and
  wires `signal → model.cancel()`; the two-level `try/finally`
  collapses into `scope.defer` for `clearFinetuneRuntimeState` and
  `handle.removeListener`. `cancelFinetune(modelId)` is now a thin
  wrapper over `getRequestRegistry().cancel({ modelId, kind:
  "finetune" })` — never invokes `model.cancel()` directly.

Per §4 of the brief: per-iteration cancel granularity uses
Option A (explicit `if (ctx.signal.aborted) break;` at the top of
each streaming loop body). No `takeWhileNotAborted` wrapper was
introduced.

Per §7 anti-patterns: M3b adds zero `oneAtATimePerModel` policies
(the four migrated kinds tolerate concurrent requests against the
same model), leaves the M1 compat-fallback in
`server/bare/ops/cancel.ts` untouched (M3d retires it), and does
not modify `cancelHandler.ts`.

Other changes:
- `embed`, `transcribe`, `transcribeStream`, `translate`,
  `finetune` request schemas grow an optional `requestId` field
  (`.string().min(1).optional()`); server-side ops fall back to
  `generateServerRequestId()` when absent.
- Whisper / Parakeet / LLM / NMT plugin handlers thread
  `request.requestId` into their bare ops.
- `plugin-cancel-capability.test.ts` truth-table flipped for the
  `finetune` row.
- New `inference-handler-migrations.test.ts` covers schema-level
  optional-`requestId` acceptance for all four kinds and pins the
  `[request-lifecycle] begin/cancel/end` line shape for each kind.
  The op-level cancel-by-requestId / cancel-by-modelId integration
  tests are bare-runtime-gated (the migrated ops pull `bare-crypto`
  / `bare-fs` transitively and can't load under Bun, same reason as
  `finetune-ops.test.disabled.ts`).
- `.cursor/rules/sdk/request-lifecycle-primitives.mdc` and
  `.cursor/rules/sdk/docs/request-lifecycle-system.mdc` updated:
  M3b row marked shipped, finetune truth-table row flipped,
  canonical-handler-shape section refreshed to use `embed.ts` as the
  cleanest reference and to document the Option A per-iteration
  check.

Verification:
- `bun lint` (eslint + tsc --noEmit): green.
- `bun run typecheck`: green.
- `bun run test:unit`: every test file green except the
  pre-existing `client/rpc/rpc-client.ts` `#rpc` package-resolution
  failure on upstream/main (also reproducible without these
  changes; unrelated to M3b).

* QVAC-18183 fix: address PR #2058 review feedback

- transcribe.ts: route the two `Transcription Update` debug emits
  through `requestLogger.debug` so they carry the per-request prefix,
  matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
  unused module-level `logger`. Collapse two `scope.defer(async () =>
  { await restorePrompt(...) })` wrappers to bare arrow callbacks
  (review #5, #10).

- inference-handler-migrations.test.ts: add bareTest op-level cancel-
  by-requestId cases for `transcribe (whisper)` (asserts loop exit +
  addon.cancel called + reload-count == 2 to pin the
  `applyPrompt + restorePrompt runs exactly once` invariant) and
  `finetune` (asserts model.cancel called + scope unwind clears the
  runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
  by instrumenting the addon and asserting addon.cancel was NOT called
  during a translate cancel (review #3, #7).

- request-lifecycle-primitives.mdc: reconcile the "polling
  signal.aborted mid-handler" anti-pattern with the new "Per-iteration
  cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
  the check when the addon already honours signal directly; the M3b
  pattern is *introducing* the check where the addon doesn't and the
  loop is the only soft-cancel exit (review #4).

* QVAC-18183 fix: drop unsafe `addon` re-narrowing in translate.ts onAbort

Addresses opaninakuffo's review comment on #2058:
`AnyModel.addon` is already typed as `AddonInterface | undefined`
(see `server/bare/registry/model-registry.ts:17-20`), so the
`as unknown as { addon?: { cancel?(jobId?: string): Promise<void> } }`
cast was unnecessary. Matches the simpler pattern used by `embed.ts`
and `transcribe.ts` for the same `onAbort` shape — keeps the four
M3b-migrated ops uniform.

* QVAC-18183 doc: trim internal milestone references from cursor rules + code comments

Removed the "Migration Roadmap" table, "M1/M2/M3a-d" milestone labels, planning-brief
decision references (Decision A/B.2, D1/D2), workspace-local paths
(`tasks/release-0.11.0-planning/...`, `pitch-3-decisions.md`), and "in review"
forward-references from the request-lifecycle cursor rules and the matching code
comments in the bare ops, finetune wrapper, and the inference-migration tests. The
canonical handler shape, anti-patterns, primitives reference, plugin cancel-capability
truth table, and concurrency-policy / structured-logging sections all stay — only the
internal milestone framing comes out.
simon-iribarren added a commit to simon-iribarren/qvac that referenced this pull request Jun 8, 2026
Lifecycle correctness:
- Spawn lock: steal only when the owner pid is dead (with an mtime fallback for
  an unreadable lock), so a legitimate multi-minute cold start no longer loses
  its lock after 30s and spawns a duplicate runner/serve (tetherto#1).
- close(): the fetch path now bails out instead of re-resolving once closed, so
  a request racing close() can't silently re-add a consumer / spawn a runner (tetherto#3).
- sweepServes: when an orphaned serve's pid is alive but its health check fails,
  keep the record instead of dropping it — dropping stranded a live serve with
  no registry trace. We only reap once it answers as ours, or drop once its pid
  dies (tetherto#4).
- servePort: fold a pinned port into the fleet key so pinned-port callers don't
  reuse an auto-allocated serve on a different port, and distinct pins don't
  collide (tetherto#5).
- Respawn: expose baseURL/port/pid as getters over live state, updated on every
  reconnect, so diagnostics/external clients see the real serve after recovery (tetherto#6).
- retargetUrl now handles Request inputs (not just string/URL) so a respawn stays
  transparent if the SDK ever switches input shapes (tetherto#8).

Docs:
- README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend
  liveness; document the long-lived-sentinel/wrapper pattern and fix the
  misleading "the script doesn't have to stay running" note (tetherto#2).
- Reconcile version wording: README/changelog now describe managed mode as
  unreleased (package is 0.1.0); docs-site integration page documents managed
  mode + the async overload (tetherto#7).

Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the
runner-dead + serve-alive + health-failing sweep case. Build + suite green
(60 pass / 1 integration skip).
simon-iribarren added a commit that referenced this pull request Jun 10, 2026
* feat[api]: add managed mode to @qvac/ai-sdk-provider (QVAC-19900)

Add `mode: 'managed'` so the provider can synthesize an ephemeral
qvac.config.json from a model-constant list, spawn and supervise
`qvac serve` on a free port, and tear it down on host exit. External
mode is unchanged and stays synchronous; the managed supervisor is
lazily dynamic-imported so external-mode users pay no startup cost.

@qvac/cli becomes an optional peer dependency.

* fix: resolve @qvac/cli via main entry when its exports block package.json (QVAC-19900)

The published @qvac/cli ships a string `exports` field ("./dist/index.js"),
which makes the `./package.json` subpath non-resolvable
(ERR_PACKAGE_PATH_NOT_EXPORTED). Managed mode relied on resolving
`@qvac/cli/package.json` to locate the bin, so it would fail to find the CLI
on a clean install. Fall back to resolving the package main entry, which for
@qvac/cli is the same file as the `qvac` bin.

* doc: update ai-sdk provider agent setup after queue (QVAC-19900)

* QVAC-19900 feat[api]: per-model config for managed mode

Managed mode `models` now accepts spec objects ({ name, config, preload,
default }) alongside bare constant names, so callers can set per-model serve
options — notably `ctx_size` and `reasoning_budget` — that coding agents like
OpenCode require. The synthesized qvac.config.json carries the config block,
honors explicit `preload`/`default`, and validates names inside spec objects.

Exports the new `QvacManagedModel` type and documents per-model config plus a
managed-mode OpenCode example in the README.

* QVAC-19900 feat[api]: shared idle-reaped managed serve daemon

Rework managed mode from a per-provider supervisor into a shared,
self-cleaning serve daemon so it is robust standalone and usable by any
tool, not just a single session.

- Reuse via a fleet key (model set + per-model config + host) keyed in a
  cross-process registry under ~/.qvac/managed-serves/; createQvac attaches
  to a matching healthy serve instead of cold-starting a duplicate.
- A detached runner owns the qvac serve child and reaps it once no consumer
  process has been alive for serveIdleTimeout (default 5m). Liveness, not
  request traffic, is the signal, so it works for tools that hit baseURL
  directly (OpenCode/Cline/Aider).
- close() now detaches (deregisters the consumer) instead of killing; a
  shared serve survives until its last user is gone.
- Sweep only reaps dead/orphaned serves, never a healthy serve a live
  process owns (fixes a second session SIGKILLing a downloading serve).
- Respawn-on-failure: fetch re-resolves and retries once on ECONNREFUSED.
- reuse:false (or a pinned servePort) yields a private serve reaped as soon
  as its owner exits.

Refactor into serve-process.ts (spawn/health/stop), registry.ts,
fleet-key.ts, runner.ts; remove supervisor.ts and pid-tracker.ts. Add
reuse and serveIdleTimeout options. Rewrite tests and add reuse/idle-reap
end-to-end coverage; document the shared lifecycle in the README.

* QVAC-19900 fix: reject duplicate model names in managed mode

Each managed model maps to a single serve alias keyed by its name, so a
repeated name silently overwrote the earlier entry — and could drop its
`default: true`. Reject duplicates up front with DuplicateManagedModelError
instead of resolving them ambiguously. Addresses PR review feedback.

* QVAC-19900 fix[api]: address managed-mode self-review findings

- Per-instance consumer markers (<pid>.<rand>) so two providers in one
  process sharing a fleet key don't deregister each other on close (A).
- Restrict respawn retry to ECONNREFUSED so an in-flight completion is
  never blindly replayed on ECONNRESET/EPIPE (C).
- Health-check the recorded baseURL before SIGTERM-ing an orphaned serve,
  guarding against killing a recycled pid (D).
- Use dirname() instead of a posix-only regex for ephemeral config cleanup (E).
- Fold serveBinPath into the fleet key so distinct local builds don't share
  a serve (G).
- Export managed error classes + QvacManagedErrorCode for instanceof checks (H).
- Reject more than one explicit default: true (I).
- Deregister the consumer if resolveServe throws (F); drop dead
  firstConsumerPid runner param (J).

Tests: per-instance markers, health-gated orphan sweep (kills serving
orphan, spares non-serving stranger pid), fleet-key serveBinPath sensitivity,
multiple-default rejection. README updated.

* QVAC-19900 fix[api]: address managed-mode lifecycle review (round 2)

Lifecycle correctness:
- Spawn lock: steal only when the owner pid is dead (with an mtime fallback for
  an unreadable lock), so a legitimate multi-minute cold start no longer loses
  its lock after 30s and spawns a duplicate runner/serve (#1).
- close(): the fetch path now bails out instead of re-resolving once closed, so
  a request racing close() can't silently re-add a consumer / spawn a runner (#3).
- sweepServes: when an orphaned serve's pid is alive but its health check fails,
  keep the record instead of dropping it — dropping stranded a live serve with
  no registry trace. We only reap once it answers as ours, or drop once its pid
  dies (#4).
- servePort: fold a pinned port into the fleet key so pinned-port callers don't
  reuse an auto-allocated serve on a different port, and distinct pins don't
  collide (#5).
- Respawn: expose baseURL/port/pid as getters over live state, updated on every
  reconnect, so diagnostics/external clients see the real serve after recovery (#6).
- retargetUrl now handles Request inputs (not just string/URL) so a respawn stays
  transparent if the SDK ever switches input shapes (#8).

Docs:
- README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend
  liveness; document the long-lived-sentinel/wrapper pattern and fix the
  misleading "the script doesn't have to stay running" note (#2).
- Reconcile version wording: README/changelog now describe managed mode as
  unreleased (package is 0.1.0); docs-site integration page documents managed
  mode + the async overload (#7).

Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the
runner-dead + serve-alive + health-failing sweep case. Build + suite green
(60 pass / 1 integration skip).

* docs: use canonical qvac.tether.io URL in ai-sdk-provider README

* QVAC-19900 feat[api]: public model catalog + catalog-id aliases in managed mode

Add `models.qvacCatalog`, a public models.dev-style catalog that maps
friendly ids (`qwen3.5-9b`) to the SDK constant the serve loads
(`QWEN3_5_9B_MULTIMODAL_Q4_K_M`), so the id a user picks from models.dev
resolves end-to-end with no translation layer in front of the serve.

Managed mode now accepts catalog ids as model names: the synthesized
serve config keys the alias by the friendly id while `model` resolves to
the underlying SDK constant, so the serve answers `qwen3.5-9b` directly.
Bare SDK constants keep working unchanged. A drift unit test fails CI if
any catalog constant disappears from the generated SDK catalog.

* QVAC-19900 feat[api]: process-group serve teardown + closeOnParentExit

Harden managed-mode lifecycle so a managed serve never leaks its `bare`
inference worker or outlives the process that owns it.

- Process-group teardown: spawn `qvac serve` detached (its own group) and,
  when stopServe must escalate past the grace window, SIGKILL the whole
  group. A plain SIGKILL of the serve pid never cascades to the grandchild
  bare worker, so previously a wedged serve orphaned the worker. The
  graceful SIGTERM is still sent to the serve process only, so a healthy
  serve orchestrates its own shutdown and releases the global worker lock
  (no stale lock left behind); the group SIGKILL is the wedged-path fallback.

- `closeOnParentExit` option: for a daemon-style host whose sole job is to
  keep a managed serve alive for a parent process (e.g. an editor/agent
  plugin). The provider watches its parent pid and, the moment the parent
  exits (on POSIX we are reparented to init, ppid → 1), closes itself —
  deregistering the consumer so the runner reaps the serve — and exits.
  Without it a hard-killed parent would leave a reparented host alive,
  keeping its consumer marker forever so the serve was never reaped.

Tests: a stubborn-grandchild fake serve proves group teardown reaps the
worker; `parentIsGone` unit-tests the parent-watch decision.

* QVAC-19900 fix: keep managed serve lifecycle correct under close() race and crash-respawn

- Undo the consumer re-registration when close() wins the race against an
  in-flight fetch retry: resolveServe re-adds the marker after close() removed
  it, which would keep the shared serve warm until the process exits.
- Preserve live consumer markers when sweepServes reaps a crashed/orphaned
  serve, so a respawned runner inherits the still-alive sessions instead of
  idle-reaping the fresh serve out from under them.
- docs: bump managed-mode ctx_size examples to 32768 for agent-sized prompts.

* QVAC-19900 fix: rename reresolve result to resolved for clarity in managed fetch

* QVAC-19900 mod: collapse redundant sync/async registry teardown helpers

removeConsumer/removeConsumerSync and removeRecord/removeRecordSync were a
confusing sync/async mirror: the async removeConsumer was only ever called right
after the sync one (a guaranteed no-op), and the removeRecord pair was really two
teardown semantics under near-identical names. Marker/record teardown is a single
unlink/rm, cheap enough to be synchronous everywhere — including process 'exit'
handlers where async can't run — so collapse each pair into one sync function.
No behaviour change; addresses review feedback on #2408.

* QVAC-19900 mod: trim verbose comments in managed registry

Tighten the sync-rationale comments on removeRecord/removeConsumer and drop a
stale, broken leftover comment above ensureDirSync. Keeps the non-obvious intent
(why sync, preserveConsumers semantics) without the narration.

* QVAC-19900 mod: drop unused DEFAULT_SERVE_BIN and ephemeralConfigName

Both were dead: DEFAULT_SERVE_BIN was never imported (serve-process spawns the
resolved CLI path verbatim) and ephemeralConfigName was an unused helper
(writeEphemeralConfig uses a fixed name inside an mkdtemp dir). Removing the
latter also drops the now-unused randomBytes import.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant