testing qvac-cli workflow#4
Merged
olyasir added a commit that referenced this pull request Apr 28, 2026
Earlier perf #4 dropped the per-step ggml_backend_tensor_set for the KV cache inputs on the assumption that ggml_set_input + the sched allocator preserves input slots between ggml_backend_sched_graph_compute calls. That holds for sched-managed multi-backend setups (where Tesla T4 + Vulkan still produces cos_sim=0.99999 / max|Δ|=0.020 vs the PyTorch reference), but it breaks two paths that actually run in CI:

- CPU-only (alloc_staged_simple → ggml_gallocr → graph_compute) reuses input slots across compute calls, so steps 1–9 read garbage KV.
- Adreno Vulkan on the Samsung S25 Ultra device farm slot has the same effective semantics (Adreno Vulkan driver) and crashed the addon test with the same divergence pattern.

Symptom on linux-x64 / linux-arm64 GitHub-hosted runners (CPU backend): cos_sim = 0.3135 (threshold > 0.9), max|Δ| = 1.65 (threshold < 0.25).

Restoring the per-step upload unconditionally trades ~80 MB of H2D traffic per inference on Vulkan-sched setups for correctness on every backend. A conditional restore (skip on sched paths) would recover that perf, but the branch isn't worth the correctness risk in this PR.
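For readers unfamiliar with the two figures quoted above, here is a minimal sketch of how a cos_sim / max|Δ| gate against a reference output could be computed. The function names and the sample vectors are hypothetical; only the two thresholds (> 0.9 and < 0.25) come from the commit message, and the real comparison in the test suite may differ.

```javascript
'use strict'

// Cosine similarity between two equal-length numeric arrays.
function cosineSimilarity (a, b) {
  if (a.length !== b.length) throw new Error('length mismatch')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Largest element-wise absolute difference (the max|Δ| figure).
function maxAbsDelta (a, b) {
  if (a.length !== b.length) throw new Error('length mismatch')
  let max = 0
  for (let i = 0; i < a.length; i++) {
    max = Math.max(max, Math.abs(a[i] - b[i]))
  }
  return max
}

// Thresholds quoted in the commit message: cos_sim > 0.9, max|Δ| < 0.25.
function matchesReference (output, reference) {
  const cosSim = cosineSimilarity(output, reference)
  const maxDelta = maxAbsDelta(output, reference)
  return { cosSim, maxDelta, pass: cosSim > 0.9 && maxDelta < 0.25 }
}

// Hypothetical sample data, not taken from the PR.
const reference = [0.5, -0.25, 0.75, 0.1]
const close = [0.51, -0.24, 0.74, 0.11]
const diverged = [-0.9, 0.8, 0.2, -1.4]

console.log(matchesReference(close, reference).pass) // true
console.log(matchesReference(diverged, reference).pass) // false
```

Both conditions are checked together because a high cosine similarity alone can hide a few large outliers, which is what the max|Δ| bound is meant to catch.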
This was referenced May 5, 2026
GustavoA1604 added a commit that referenced this pull request May 7, 2026
Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments.

- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable.

Nits:

- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead.

Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
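To illustrate the ttsOutputDebugString nit, the sketch below shows one way a chunk event could be reduced to the five summary fields listed above rather than serialising the whole PCM buffer. The helper name `summarizeChunkEvent` and the input event shape are assumptions made for illustration; only the summary field names come from the commit message.

```javascript
'use strict'

// Hypothetical helper: builds the small summary described above instead of
// serialising the full PCM buffer. The input event shape is an assumption.
function summarizeChunkEvent (event) {
  return {
    sampleRate: event.sampleRate,
    chunkIndex: event.chunkIndex,
    isLast: event.isLast,
    sentenceChunk: event.sentenceChunk,
    outputArrayLen: event.outputArray ? event.outputArray.length : 0
  }
}

// Hypothetical chunk event: one second of 24 kHz mono PCM.
const event = {
  outputArray: new Int16Array(24000),
  sampleRate: 24000,
  chunkIndex: 0,
  isLast: false,
  sentenceChunk: 'Hello from qvac tts ggml.'
}

const summary = JSON.stringify(summarizeChunkEvent(event))
const full = JSON.stringify(event)

console.log(summary)
console.log(summary.length < full.length) // true
```

JSON.stringify emits one key/value pair per sample for a typed array, so the full form grows linearly with chunk length while the summary stays constant in size.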
GustavoA1604 added a commit that referenced this pull request May 11, 2026
…#1983)

* feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp)

New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg). API-compatible with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream consumers can swap backends without touching orchestration code.

## Scope

* First iteration. Supports Chatterbox **English** only. Chatterbox multilingual, LavaSR enhancer, Supertonic engine, and streaming are out of scope and remain in `@qvac/tts-onnx`. They'll land alongside the evolution of qvac-tts.cpp.
* Native backend is the static `qvac-tts` library from the QVAC vcpkg registry (`ports/tts-cpp`, baseline `2026-04-21`). No ONNX Runtime dependency.

## JS surface

* `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as `ONNXTTS`: `run` / `runStream` / `runStreaming` / `reload` / `unload` / `destroy`.
* `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` / `files.s3genModel` override the defaults.
* Options: `referenceAudio`, `voiceDir` (baked profile), `seed`, `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for the upcoming streaming flags (`streamChunkTokens`, `streamFirstChunkTokens`, `cfmSteps`).
* Shared reusable lib code (`lib/textChunker.js`, `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim from `@qvac/tts-onnx`.
* New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000** to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both packages are loaded in the same Bare process.

## Native addon

* `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}` -- `IModel` + `IModelCancel` implementation. First-iteration strategy: assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output path, call it synchronously, then parse the resulting 16-bit mono PCM wav back into `std::vector<int16_t>` for the JS handler. Consequences: every job re-loads the model (~700 ms + inference time), no mid-synthesis cancellation, no streaming. The follow-up milestone replaces this with a persistent, struct-based API once qvac-tts.cpp exposes one.
* `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}` -- JS-to-C++ config bridging (same string-map pattern as `@qvac/tts-onnx`) and the `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing `createInstance` / `runJob` / `reload` / `activate` / `cancel` / `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`.
* `addon/src/addon/AddonJs.hpp` -- JS-facing `createInstance` / `runJob` / `reload` wrappers that register a `JsAudioOutputHandler` emitting `{ outputArray: Int16Array, sampleRate: number }` to JS.

## Build / registry

* `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)` and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape matches `@qvac/transcription-whispercpp`).
* `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough) plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`.
* `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg. NOTE: the baseline pin here is inherited from `@qvac/transcription-whispercpp` and **must be bumped** to a commit that contains the `tts-cpp` port once that registry PR lands. A follow-up commit will update it.

## Tests & examples

* Integration + unit test files for Chatterbox English are copied verbatim from `@qvac/tts-onnx` with only mechanical renames (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`, `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`). Some paths in `test/integration/addon.test.js` still import Supertonic / LavaSR helpers that don't exist in this package -- those test blocks will fail fast when the file loads, which is expected until those backends get their own ggml packages.
* Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus shared `wav-helper.js` + `pcm-chunk-player.js`.

## What's not in this PR (known gaps)

* No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes will land in a single documentation pass once the registry + fork commits have merged upstream.
* `vcpkg-configuration.json` baseline needs to point at a qvac-registry-vcpkg commit that ships `tts-cpp` (pending the registry PR).
* Actual `npm run build` requires the registry and fork commits to be on `main` of their respective upstream repos.

* chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit

Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg at commit 1e2839680b6be8d8ffff889a9c29b966c176098c -- the commit that adds the `tts-cpp` port. Paired with the `qvac-tts` library already pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp @ 0fe4a521618cc30358040b29d75d4261b31cbb60). Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry PR lands upstream.

* chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper

Second pass over @qvac/tts-ggml after the build started passing: prune everything that only made sense for the ONNX-era multi-engine scope and adapt the remaining Chatterbox-English bits to the GGUF + file-path reference-audio contract. Restores `test/mobile/` so the Android build has something to point at.

## C++

* `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment contained `**/` which closed the block comment early and broke the build. Rewrote as a `//` comment.

## Examples

* `examples/chatterbox-tts.js` -- rewrite for v0 contract: single `<text>` argv, `files: { modelDir }` pointing at the two GGUFs, `referenceAudio` is now a wav **path** (addon passes it to `--reference-audio`) instead of a Float32Array. Drops english/multilingual arg and the CHATTERBOX_VARIANT switch that picked which `.onnx` files to load.
* Removed `examples/chatterbox-streaming-tts.js` + `examples/pcm-chunk-player.js`. The v0 addon re-loads the model per `run()` call -- exposing streaming would mislead. Both come back alongside the persistent-engine milestone.
* `package.json`: `npm run example` now passes a default text so it runs without extra args.

## Tests

### Kept as-is (engine-agnostic)

* `test/unit/textChunker.test.js`
* `test/mock/{MockedBinding,utils}.js`
* `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js`
* `test/reference-audio/jfk.wav`, `test/data/sentences-*.js`

### Mechanical fixes

* `test/unit/tts.error.test.js` -- fix error-code assertions to the tts-ggml range (`13001–14000`); was still checking the `@qvac/tts-onnx` range (`7001–7011`).
* `test/unit/tts-ggml.lifecycle.test.js` -- fix stale `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the non-existent `engine: 'chatterbox'` option.
* `test/unit/tts-ggml.sentence-stream.test.js` -- same GGUF/engine cleanup.

### Rewritten

* `test/unit/chatterbox.inference.test.js` -- drop tests that asserted the old ONNX file shape (`tokenizer / speechEncoder / embedTokens / conditionalDecoder / languageModel`), the removed `engine` detection and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`). New tests cover: `modelDir` derives the two GGUF paths; explicit `t3Model` / `s3genModel` override the defaults. The mocked-binding run/reload/cancel flow stays.
* `test/integration/addon.test.js` -- fresh, ~180 LoC, Chatterbox-English only. Ensures the GGUFs are present, runs the short sentence set through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and (on darwin only) runs a whisper-based WER check via the existing `runWhisper` util. Drops the Chatterbox-multilingual block + every Supertonic + LavaSR block that doesn't apply to this package.
* `test/utils/runChatterboxTTS.js` -- rewrite for the GGUF contract: `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a file path that falls back to `test/reference-audio/jfk.wav` (or the mobile test-asset when `global.assetPaths` is present). No more WAV decode / resample on the JS side.
* `test/utils/downloadModel.js` -- trim from 1007 LoC to 280. Drops the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie downloaders. Keeps the shared HTTP/curl infrastructure and `ensureWhisperModel` (still used by the integration WER check). `ensureChatterboxModels` is now **check-only**: it verifies `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally and, if missing, prints the exact commands for generating them from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts. Once the GGUFs land on a canonical HuggingFace repo we'll wire up download URLs here.

## Scripts

* `scripts/ensure-chatterbox.js` -- simplify to a single invocation against `./models/`. Drops the variant / language matrix that the ONNX downloader needed.
* `scripts/ensure-models.js` -- now a thin alias to `ensure-chatterbox.js`. Drops the Supertonic + LavaSR orchestration.

## Mobile

* Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs, testAssets/jfk.wav}` so the Android build has a wrapper to point at.
* `package.json`: re-added `test/mobile` to the `files` list.

## Gitignore

* Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp` (produced by the top-level `configure_file(...)` calls) and `build_*/` dirs (bare-make convention).

## Verified locally

* `npx standard "test/**/*.js" "*.js" "lib/*.js"` -- clean.
* `npm run test:unit` -- 38/38 pass (105/105 asserts).
* `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."` produces a 24 kHz wav as expected.
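The `files: { modelDir }` contract described in this commit (derive the two GGUF paths from `modelDir`, let explicit `t3Model` / `s3genModel` win) can be sketched roughly as follows. The function name `resolveChatterboxFiles` and the error message are hypothetical and not taken from the package; only the two default file names and the override rule come from the text above.

```javascript
'use strict'

const path = require('path')

// Default file names quoted in the commit message.
const DEFAULT_T3 = 'chatterbox-t3-turbo.gguf'
const DEFAULT_S3GEN = 'chatterbox-s3gen.gguf'

// Hypothetical resolver: explicit paths win, otherwise derive from modelDir.
function resolveChatterboxFiles (files = {}) {
  const { modelDir, t3Model, s3genModel } = files
  if (!modelDir && !(t3Model && s3genModel)) {
    throw new Error('Provide files.modelDir or both files.t3Model and files.s3genModel')
  }
  return {
    t3Model: t3Model || path.join(modelDir, DEFAULT_T3),
    s3genModel: s3genModel || path.join(modelDir, DEFAULT_S3GEN)
  }
}

// Both paths derived from the directory.
console.log(resolveChatterboxFiles({ modelDir: 'models' }))

// One explicit override; the other path is still derived.
console.log(resolveChatterboxFiles({
  modelDir: 'models',
  s3genModel: path.join('custom', 'my-s3gen.gguf')
}))
```

Rejecting a configuration that has neither a directory nor both explicit paths is a design assumption of this sketch; the actual package may validate differently.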
* Add streaming support

* Update ggml backend to use separate ggml repo

* tts-ggml: consume renamed tts-cpp library (2026-04-24#1)

Upstream chatterbox.cpp renamed the package + namespace + target from qvac-tts to tts-cpp and tightened the library boundary; pick up the new artefacts here:

- find_package(qvac-tts-cpp CONFIG REQUIRED) -> find_package(tts-cpp CONFIG REQUIRED)
- qvac-tts::qvac-tts -> tts-cpp::tts-cpp
- qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions, SynthesisResult, forward-decls in ChatterboxModel.hpp)
- #include <qvac-tts/chatterbox/engine.h> -> #include <tts-cpp/chatterbox/engine.h>
- Doxygen / inline doc references to the old names refreshed alongside the code changes.

vcpkg wiring:

- vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg commit bc30b0b (ports/tts-cpp renamed and repointed at chatterbox.cpp@f8f9145).
- vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that carries the rename + namespace + install(EXPORT) changes).

Verified with a cold bare-make generate + bare-make build against the new port, and the addon's existing unit + integration test suites.

Made-with: Cursor

* tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline

Picks up the round-3 review-fix wave landed on the tts-cpp port:

    e673182 scrub stale patches/ refs from README (N10)
    8ba10a6 drop unreachable TTS_CPP_GGML_LIB_PREFIX block (N8)
    4b5d2d7 mirror N1-N7 fixes from chatterbox.cpp source-of-truth
      - N1 supertonic alive-registry guard against freed-backend gallocr_free assert on hot-swap (Vulkan/Metal/CUDA)
      - N2 drop dead g_sink_* state, soften log_set docstring
      - N3 Turbo BPE try/catch (exception-safe Engine ctor)
      - N4 STFT cancel checkpoint + tighter Engine::cancel() doc
      - N5 document s3gen_preload/unload refcount semantics
      - N6 drop dead cached_text_lc Supertonic shim
      - N7 fix misleading "no copy" view-vs-copy log wording

Plus the integrated-port-only round-2 fixes that landed earlier:

    fa0d490 close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML now defaults ON; bundled-without-patches hard-errors at configure time with a pointer at the ggml-speech vcpkg port.
    ae34c58 README rewritten for integrated/vcpkg context.
    a2f2dd6 top-level qvac-ext-lib-whisper.cpp README points at the tts-cpp/ subtree (alongside parakeet-cpp/).

Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine / EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is backward-compatible: the new port adds Engine::backend_name(), MTL-variant fields on EngineOptions (language / cfg_weight / min_p / exaggeration), and a separate tts_cpp::supertonic::Engine class, but nothing this consumer was already calling has changed.

Edits:

    packages/tts-ggml/vcpkg.json
      - tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07.
    packages/tts-ggml/vcpkg-configuration.json
      - default-registry baseline: bc30b0b (April 2026 fork-only state) -> 16b91afdcfd59baea60e81f3da94f49311ef2a97.

The new baseline pulls in the post-tetherto-merge state (parakeet-cpp port at 932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new tts-cpp port (16b91af) on the developer's GustavoA1604 registry fork.

Smoke-test plan: after running `vcpkg install` against the new baseline, the tts-cpp port's vcpkg_from_github resolves at GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the upstream PR merges. ChatterboxModel should build and synthesize identically; expanding to Multilingual + Supertonic flows is the follow-up commit on the package side.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add chatterbox multilingual and supertonic

* Add mobile integration tests

* tts-ggml: drop clang-19 pin in linux-clang toolchain

The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary names) since the package's first commit (0a2c978). Linux CI hadn't exercised this path before -- the new on-pr-tts-ggml.yml -> integration matrix is the first time it does, and it fails on every linux runner (ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's "detect_compiler" step because none of the GH-hosted images ship a `clang-19` symlink:

    Detecting compiler hash for triplet x64-linux...
    error: while detecting compiler information:
    ...
    CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127 (message):
      Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE=
      .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ...

Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so each runner picks up its image's default clang (clang-15 on ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship). The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake is honoured by every reasonable clang version.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add C++ tests and coverage; fix linux build

* tts-ggml: address PR review feedback

Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments.

- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable.

Nits:

- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead.

Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: unblock CI integration tests on every desktop runner

Four independent failures, one per platform:

1. linux-x64 / linux-arm64: addon load crashed at `libomp.so.5: cannot open shared object file`. tts-cpp's binary is built with clang under the linux-clang toolchain and links against libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being apt-installed. Add `libomp5` so libomp.so.5 is on the loader path.
2. darwin-arm64: convert-models.sh aborted at line 200 with `hf_args[@]: unbound variable`. macOS's system bash is 3.2 which treats `"${arr[@]}"` as nounset access when the array is empty under `set -u`; with HF_TOKEN unset we hit it on every fresh runner. Use the `${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six call sites and add a header comment so the next maintainer doesn't accidentally regress.
3. darwin-x64: pip install bombed building `llvmlite` from source because the macos-15-large runner has no LLVM 15 development install. Root cause: librosa pulls in numba 0.65+, which stopped shipping darwin-x86_64 wheels for Python 3.12. Pin Python to 3.11 in the Setup Python step; 3.11 has prebuilt wheels for the entire numba/llvmlite/librosa stack on darwin-x64 and is fine for every other converter dependency.
4. windows-2022: ChatterboxModel::load threw `vk::createInstance: ErrorIncompatibleDriver`. Root cause: the addon's index.js::_validateConfig defaults `useGPU = true` when neither useGPU nor nGpuLayers is specified, so the test ran with n_gpu_layers=99 -> ggml_backend_vk_init -> vk::createInstance -> ErrorIncompatibleDriver on the runner's no-Vulkan-driver image. runChatterboxTTS.js now honours `process.env.NO_GPU === 'true'` (set on the no-GPU matrix entries) and forces useGPU=false on exactly those runners; the other test runners (chatterbox-mtl, gpu-smoke, multiple-runs) already had this guard. Also documents the `mesa-vulkan-drivers` apt package (already pulled in) as the software ICD that lets the Vulkan-built prebuild's runtime backend probe enumerate at least one device on linux runners.
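As a rough illustration of the windows-2022 fix in item 4, the guard could look something like the sketch below. `resolveUseGPU` is a hypothetical name, the handling of `nGpuLayers` is deliberately left out, and the real logic in runChatterboxTTS.js and `_validateConfig` may be structured differently; only the `NO_GPU === 'true'` check and the default of `useGPU = true` come from the commit message.

```javascript
'use strict'

// Hypothetical helper modelled on the behaviour described above.
// Simplification: nGpuLayers is not modelled here.
function resolveUseGPU (params = {}, env = process.env) {
  // No-GPU matrix entries set NO_GPU=true; force the CPU path there.
  if (env.NO_GPU === 'true') return false
  // Otherwise respect an explicit caller choice.
  if (typeof params.useGPU === 'boolean') return params.useGPU
  // Default described in the commit message: GPU on when unspecified.
  return true
}

console.log(resolveUseGPU({}, { NO_GPU: 'true' })) // false
console.log(resolveUseGPU({ useGPU: true }, { NO_GPU: 'true' })) // false
console.log(resolveUseGPU({}, {})) // true
console.log(resolveUseGPU({ useGPU: false }, {})) // false
```

Passing the environment as a parameter keeps the helper testable without mutating `process.env`; the strict string comparison mirrors the `=== 'true'` check quoted in the commit message, so values such as `'1'` do not trigger the override.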
Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit)

Mobile build failed at `:app:createBundleReleaseJsAndAssets` with:

    SyntaxError: assets/testAssets/chatterbox-s3gen.gguf: Cannot create a string longer than 0x1fffffe8 characters

Root cause: Metro's bundler reads every asset under `test/mobile/testAssets/` via `Buffer.toString()`. V8's max string length is 0x1fffffe8 (~512 MiB). chatterbox-s3gen.gguf is ~1 GiB even with --quant q4_0 because the s3gen converter only quantizes attention weights and leaves the bulk of the s3gen graph in fp16 ("0/291 weight tensors quantized" in the converter log).

Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the limit) on mobile. Mobile Chatterbox tests degrade cleanly to `t.pass('Skipped: Chatterbox GGUFs not available')` via the existing `ensureChatterboxModels` helper -- it already returns { success: false } when the GGUFs aren't on disk. Cache key bumped to v2 so existing v1 cache entries (which include the chatterbox files) are evicted on the next run.

Bundling Chatterbox on mobile requires either:

- adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the JS-string read is skipped (then the s3gen file can flow through the bundle as a raw asset), or
- pushing the chatterbox GGUFs to the device via `adb push` outside the bundle and surfacing the path through downloadModel.js's existing ANDROID_CANDIDATE_DIRS fallback.

Both are outside the scope of this PR; documented inline above the cache step for the next maintainer.
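The size arithmetic behind this failure can be checked with a short sketch. It assumes, conservatively, that an asset's byte length is a fair proxy for the length of the string the bundler would create; the helper name and the asset sizes are illustrative approximations of the figures quoted above, not measured values.

```javascript
'use strict'

// V8 max string length quoted in the error message above.
const V8_MAX_STRING_LENGTH = 0x1fffffe8
const MIB = 1024 * 1024

// Hypothetical check: treat byte length as a proxy for string length.
function canBundleAsString (byteLength) {
  return byteLength <= V8_MAX_STRING_LENGTH
}

// Approximate sizes from the commit message (exact values are assumed).
const assets = [
  { name: 'supertonic.gguf', bytes: 125 * MIB },
  { name: 'chatterbox-s3gen.gguf', bytes: 1024 * MIB }
]

// The limit sits 24 characters below 512 MiB.
console.log(512 * MIB - V8_MAX_STRING_LENGTH) // 24

for (const asset of assets) {
  console.log(asset.name, canBundleAsString(asset.bytes) ? 'bundle' : 'skip')
}
// supertonic.gguf bundle
// chatterbox-s3gen.gguf skip
```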
Co-authored-by: Cursor <cursoragent@cursor.com>

* Bump hash of vcpkg

* Consume vcpkg from tetherto repository

* Fix integration tests failures in all platforms

* Further fix tests

* fix: Make useGPU flag more meaningful (#1953)

* fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts

* add gpu smoke test

* resolve comments

---------

Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>

* Update dependencies after monorepo directory changes

* Further drop qvac-lib- prefix

* Add CHANGELOG.md

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com>
Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
simon-iribarren added a commit to simon-iribarren/qvac that referenced this pull request May 13, 2026
…te concurrency

Address non-blocking review nits on PR tetherto#2007:

- aggregate-events: explain why a wire event carrying both error and cancelled signals resolves to error (closes brief open question tetherto#3).
- kv-cache-session: doc-comment on deleteKvCacheState explaining the ordering guarantee under concurrent in-flight turns -- delete is wire-async, in-flight turns roll back idempotently when their commit probe finds the file gone (closes brief open question tetherto#4).

Comments only; no behavior changes.
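The error-over-cancelled precedence mentioned in the aggregate-events item can be sketched as below. Everything here is an assumption made for illustration: the terminal event shape, the `settleAggregate` helper, and the local `InferenceCancelledError` stand-in (the real class in the SDK may carry different fields). Only the rule itself, that an event carrying both signals resolves to error, comes from the commit message.

```javascript
'use strict'

// Hypothetical stand-in for a typed cancel error; not the SDK's real class.
class InferenceCancelledError extends Error {
  constructor (requestId, partial) {
    super(`request ${requestId} was cancelled`)
    this.name = 'InferenceCancelledError'
    this.requestId = requestId
    this.partial = partial
  }
}

// Assumed terminal-event shape: { requestId, error?, stopReason?, text }.
function settleAggregate (event) {
  // Error wins when both signals are present.
  if (event.error) throw new Error(event.error)
  if (event.stopReason === 'cancelled') {
    throw new InferenceCancelledError(event.requestId, { text: event.text })
  }
  return event.text
}

// Helper for the demo: report which outcome a terminal event produces.
function outcomeOf (event) {
  try {
    settleAggregate(event)
    return 'completed'
  } catch (err) {
    return err instanceof InferenceCancelledError ? 'cancelled' : 'error'
  }
}

console.log(outcomeOf({ requestId: 'r1', text: 'done' })) // completed
console.log(outcomeOf({ requestId: 'r2', stopReason: 'cancelled', text: 'par' })) // cancelled
console.log(outcomeOf({ requestId: 'r3', stopReason: 'cancelled', error: 'boom' })) // error
```

Checking the error field first means a caller never mistakes a failed request for a clean user-initiated stop, which is one plausible reading of why the precedence was chosen.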
This was referenced May 13, 2026
Merged
simon-iribarren added a commit that referenced this pull request May 13, 2026
…ache via KvCacheSession (#2007)

* QVAC-18182 feat[api]: typed cancel outcomes on the wire + atomic KV-cache via KvCacheSession

Builds on QVAC-18181's request lifecycle primitives (DisposableScope, RequestContext, RequestRegistry) to deliver the M2 milestone:

- Typed cancel outcomes: `stopReason: "cancelled"` on `completionDone` events, and `InferenceCancelledError(requestId, partial)` thrown from CompletionRun promise-aggregates (`final` / `text` / `toolCalls` / `stats`). The wire stream still ends normally so iterating `run.events` is unaffected -- the typed error lives on the aggregate promises that callers `await` for the final result.
- KvCacheSession (`server/bare/plugins/llamacpp-completion/ops/kv-cache-session.ts`) -- single atomic owner of the three KV-cache layers (`cachedMessageCounts`, `initializedCaches`, on-disk `.bin` files). `beginTurn` / `commitTurn` / `rollback` collapse the three duplicated cleanup blocks in `completion-stream.ts` into one scope.defer hook. Cross-model administrative deletion lives at the module level as `deleteKvCacheState(...)`, called by the RPC `handleDeleteCache` handler.
- Stop-button race close -- `RequestRegistry` now keeps a bounded cancelled-before-begin map (128 entries, 30s TTL). A `cancel({ requestId })` that lands before the server's `begin(...)` ran is applied retroactively when begin lands, so same-tick stop clicks no longer disappear into the void. Internal-only -- the wire surface for `cancel` is unchanged (Option A in the brief).

Cursor rules updated in the same PR so the request-lifecycle and KV-cache topic docs stay in sync with the implementation.

Tests:

- unit: KvCacheSession (bareTest-gated, runs in the Bare consumer), RequestRegistry race + bounded-set eviction, completion-event schema cancelled cases.
- e2e: cancellation-tests.ts adds three definitions -- mid-stream cancel (events.stopReason === "cancelled", final rejects with InferenceCancelledError, partial.text matches concatenated contentDelta), cancel-before-begin (retroactive abort), and cancel-then-resume-kv-cache (rollback wiped the three layers, the next turn re-primes cleanly).

* chore: drop planning labels (Mx/Dx) from QVAC-18182 comments

Strips milestone (`M1`/`M2`/`M3a`...) and deliverable (`D2`/`D5`/`D7`) labels from comments and test titles introduced with the typed-cancel outcomes + KvCacheSession work. The substantive descriptions of the contracts (Stop-button race, cancelled-before-begin map, three-layer session ownership, etc.) are preserved; only the planning-doc references are removed so the code reads cleanly without the pitch context. Durable `QVAC-XXXXX` ticket references are kept. No behavior or API surface changes.

* chore: drop Asana ticket references from QVAC-18182 code comments

Strips QVAC-XXXXX inline ticket references from code/test comments introduced by the typed-cancel-outcomes work. Concept names (Stop-button race, cancelled-before-begin, etc.) and prose descriptions of the contracts are preserved; only the ticket-tag suffixes go. Also renames a test cache key from `qvac-18182-cancel-resume-kvcache` to `cancel-then-resume-kvcache` so the cache key reads as a stable identifier rather than a ticket reference. No behavior or API surface changes.

* QVAC-18182 doc: clarify error>cancelled precedence + deleteKvCacheState concurrency

Address non-blocking review nits on PR #2007:

- aggregate-events: explain why a wire event carrying both error and cancelled signals resolves to error (closes brief open question #3).
- kv-cache-session: doc-comment on deleteKvCacheState explaining the ordering guarantee under concurrent in-flight turns -- delete is wire-async, in-flight turns roll back idempotently when their commit probe finds the file gone (closes brief open question #4).

Comments only; no behavior changes.

* QVAC-18182 doc: demonstrate typed cancel outcomes in cancel example

Enhance the existing cancel-by-request-id example to demonstrate the two M2 cancel-outcome channels:

- run.events ends normally with completionDone carrying stopReason: "cancelled" -- show reading it inside the iteration loop.
- run.text rejects with InferenceCancelledError(requestId, partial) on cancel -- show the instanceof check and consuming partial.text, partial.toolCalls, partial.stats.

Also update the header to remove the now-stale "logged as a no-match" sentence (same-tick cancels are no longer dropped after M2's race close). Pure documentation enhancement; no API or behavior changes.

* QVAC-18182 fix: address PR review -- partial-prime cleanup + parent-aborted state

Two follow-ups from Opanin's review on PR #2007:

1. KvCacheSession.beginTurn: if `primeIfMissing` throws after the addon has partially written a `.bin` to disk, the next `beginCustom` would `fsPromises.access(cachePath)` → true and trust the half-primed file as a valid cache (no rollback hook is registered yet -- the handler hasn't seen the `TurnHandle`). Wrap both `beginCustom` and `beginAuto` prime calls in a shared `primeOrCleanup` helper that best-effort unlinks the partial file before re-throwing the original prime error. Adds a bare-only unit test asserting the on-disk file is removed and the init flag stays unset on the failed-prime path.
2. RequestRegistry.begin: when `parentSignal` was already aborted at begin time, line 271 aborts the controller but the `state` ternary still landed `"running"`, exactly the "momentarily-running with already-aborted signal" the preCancel branch was guarding against. Extend the ternary to cover both inputs and the existing `parentSignal already aborted` test now also asserts `ctx.state === "cancelling"`. No behavior change on the happy path.

Lint + typecheck + 351-test unit suite green locally on the changed files.
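The bounded cancelled-before-begin map (128 entries, 30s TTL) and the "cancelling" initial state discussed in this thread might be modelled roughly as follows. This is a sketch under stated assumptions, not the `RequestRegistry` implementation: the class name `PreCancelMap`, its methods, the oldest-first eviction policy, and the `initialState` helper are all hypothetical.

```javascript
'use strict'

// Hypothetical sketch of a bounded cancelled-before-begin map.
class PreCancelMap {
  constructor ({ maxEntries = 128, ttlMs = 30000, now = Date.now } = {}) {
    this.maxEntries = maxEntries
    this.ttlMs = ttlMs
    this.now = now
    this.entries = new Map() // requestId -> timestamp, in insertion order
  }

  // Record a cancel that arrived before begin().
  add (requestId) {
    this.sweep()
    this.entries.delete(requestId) // re-insert so the entry moves to the end
    this.entries.set(requestId, this.now())
    while (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      this.entries.delete(this.entries.keys().next().value)
    }
  }

  // Called from begin(): returns true once if a fresh pre-cancel exists.
  consume (requestId) {
    this.sweep()
    return this.entries.delete(requestId)
  }

  // Drop entries whose age has reached the TTL (assumes a monotonic clock).
  sweep () {
    const cutoff = this.now() - this.ttlMs
    for (const [id, at] of this.entries) {
      if (at > cutoff) break
      this.entries.delete(id)
    }
  }
}

// Assumed mapping from the review fix: a request that is already cancelled
// or whose parent signal is already aborted starts in "cancelling".
function initialState (preCancelled, parentAborted) {
  return preCancelled || parentAborted ? 'cancelling' : 'running'
}

let clock = 0
const map = new PreCancelMap({ now: () => clock })
map.add('req-1')
console.log(initialState(map.consume('req-1'), false)) // cancelling
console.log(initialState(map.consume('req-1'), false)) // running
map.add('req-2')
clock = 30001
console.log(map.consume('req-2')) // false
```

Injecting the clock keeps the TTL behaviour deterministic in tests, and bounding the map prevents cancels for requests that never begin from accumulating indefinitely.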
* QVAC-18182 fix: prime is atomic -- addon writes to .prime.tmp + atomic rename
Upgrade the previous reactive cleanup workaround (PR #2007 review by
@opaninakuffo) into a proactive atomic-by-construction design:
- The session steers `model.run({ saveSessionPath })` to a sibling
`cachePath + ".prime.tmp"` path.
- Only after the prime closure resolves successfully do we promote the
temp file to the canonical `cachePath` via `fsPromises.rename` (atomic
same-volume on every host we target).
- The canonical cache path is therefore *never* observable in a partial
state -- a thrown prime is indistinguishable on disk from a
never-attempted prime, so the next existence probe (in-process or
cross-process worker restart) cannot trust corrupt bytes.
Defensive details:
- We unlink any leftover `.prime.tmp` *before* invoking the closure, so a
deferred-write addon path can't accidentally promote stale-from-crash
bytes left by a prior worker.
- On prime success we probe the temp path before renaming. If the addon
deferred its disk write (some llama.cpp paths flush lazily), the temp
doesn't exist and we leave the canonical path absent --
`verifySaveAndRecord` in `commitTurn` is the authoritative check.
- On rename failure we unlink the temp and surface the rename error;
rename atomicity guarantees the canonical path was untouched.
Why this is better than the prior `primeOrCleanup`:
- Best-effort `unlink` was load-bearing for correctness in the old
design -- a failed unlink left a half-primed canonical file the next
`beginCustom` would trust. The new design moves the only possible
"partial" file to a non-trusted name, so failed cleanup cannot corrupt
the canonical name by construction.
- The unit test no longer mocks the workaround surface; it asserts the
actual invariant ("canonical path was never written") plus the positive
rename and the leftover-sweep guarantees.
Tests: 3 bare-only kv-cache-session unit tests
(throw-leaves-canonical-untouched, success-promotes-via-rename,
leftover-from-crash-is-swept). Lint + typecheck + 351-test unit suite
green locally on the changed files.
Long-term, the right fix is one layer down -- the llama.cpp addon should
write transactionally itself and surface save errors instead of
swallowing them. When that lands, this helper collapses to a direct
`prime(cachePath)` call and the `verifySaveAndRecord` access-probe
fallback (TODO already documented) can be retired together. Filed as a
separate follow-up; out of scope for this PR.
* QVAC-18182 fix: replace prime-atomic helper with verifyPrimedFile post-prime probe
Audit of the llama.cpp addon (`CacheManager::writeCacheFile` →
`llama_state_save_file`, return value swallowed;
`LlamaModel::processPromptImpl` lines 575-599) shows the bug shape Opanin
flagged on PR #2007 -- "primeIfMissing throws after a partial save" --
does not actually fire. The save call is the very last operation on the
prefill path, the addon ignores its return value, and any earlier throw
means no save was attempted. So:
- `primeOrCleanup` (`ac8d2d74e`) and the upgrade to `primeAtomically`
(`a7420f3e6`) defended against a code path that the addon does not
produce.
- The real corruption shape is silent partial writes (addon's
`llama_state_save_file` returns false, addon ignores it, file is
half-written or empty). Atomic temp+rename did NOT close this gap -- on a
"silent partial" the closure resolves successfully and the helper would
happily promote the partial `.prime.tmp` to the canonical path.
Replace both helpers with a small `verifyPrimedFile` that mirrors the
existing `verifySaveAndRecord` access-probe pattern used at commit time,
applied at prime time:
- After a successful prime closure, `fsPromises.stat` the canonical path.
If it doesn't exist (addon was interrupted before save) or has size 0
(addon save call produced an empty file), throw and best-effort unlink
the empty leftover so the next existence probe doesn't trust it.
- This catches the two failure modes Opanin's concern was a proxy for
(cancelled-mid-prime; addon save quietly produced nothing) without
claiming defense against partial-but-nonzero writes, which can only be
closed at the addon layer.
The `RequestRegistry` parent-aborted-state fix (`ctx.state` ternary
covers `opts.parentSignal?.aborted`) from `ac8d2d74e` is preserved
unchanged -- it stands on its own as a correct response to Opanin's
second comment.
Long-term root cause stays the addon: have
`CacheManager::writeCacheFile` check `llama_state_save_file`'s return
value and throw on failure. When that lands, both `verifyPrimedFile` and
`verifySaveAndRecord`'s access-probes can be retired together. Filed as a
separate follow-up -- out of scope for this PR.
Tests: 3 prior bare-only prime-atomic tests removed; 2 new bare-only
tests added (no-file and empty-file rejection paths). Lint + typecheck +
330-test unit suite green locally on the changed files (pre-existing
sdcpp-generation lint errors unchanged).
* QVAC-18182 doc: kv-cache rule documents addon non-transactional save + matched access-probes
Extend the "Cache Initialization (primeIfMissing)" section in
.cursor/rules/sdk/docs/kv-cache-system.mdc with the corrected
addon-contract analysis:
- The llama.cpp addon's CacheManager::writeCacheFile discards
llama_state_save_file's bool return; maybeSaveCacheToDisk is the last
call on the prefill path. So no closure-rejection path can coexist with a
partial file on disk.
- Document the four real outcomes as a table (interrupted / success /
silent partial write / pre-eval throw) so future readers can see why the
SDK takes the shape it does.
- Pin both SDK-side defenses as a matched pair: verifyPrimedFile at prime
time (added in this PR) and verifySaveAndRecord at commit time
(existing). Both are honest about what they catch (missing / empty file)
and what they don't (partial-but-nonzero, only addon fix can close that).
- Reference the addon-layer follow-up (1214778658064488 / "throw on
llama_state_save_file failure") so the next contributor knows both probes
will be retired together when the addon throws on save failure.
No code change -- rule-only update.
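The post-prime probe described above (stat the canonical path, reject a missing or zero-byte file, best-effort unlink the empty leftover) might look roughly like the following. This is a sketch under stated assumptions, not the SDK's code: the `FsLike` interface, `classifyPrimedFile`, and the error message are hypothetical, and the file-system surface is injected so the decision logic can be read in isolation.

```typescript
// Hypothetical sketch of a verifyPrimedFile-style probe. The fs surface is
// injected (assumption) instead of importing a concrete fs module.
interface FsLike {
  stat(path: string): Promise<{ size: number }>;
  unlink(path: string): Promise<void>;
}

type PrimeVerdict = "ok" | "missing" | "empty";

// Pure classification of the probe result. A partial-but-nonzero file is
// reported as "ok": this probe cannot detect that case.
function classifyPrimedFile(stat: { size: number } | null): PrimeVerdict {
  if (stat === null) return "missing";
  if (stat.size === 0) return "empty";
  return "ok";
}

async function verifyPrimedFile(fs: FsLike, cachePath: string): Promise<void> {
  // Simplification: any stat failure is treated as "file is missing".
  const stat = await fs.stat(cachePath).catch(() => null);
  const verdict = classifyPrimedFile(stat);
  if (verdict === "ok") return;
  if (verdict === "empty") {
    // Best-effort cleanup; a failed unlink must not mask the real error.
    await fs.unlink(cachePath).catch(() => undefined);
  }
  throw new Error(`prime left a ${verdict} cache file at ${cachePath}`);
}
```

Keeping the classification pure makes the limits of the probe explicit: only the missing and empty cases are rejected, which matches the scope the commit message claims.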
simon-iribarren added a commit to simon-iribarren/qvac that referenced this pull request May 14, 2026
- transcribe.ts: route the two `Transcription Update` debug emits
through `requestLogger.debug` so they carry the per-request prefix,
matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
unused module-level `logger`. Collapse two `scope.defer(async () =>
{ await restorePrompt(...) })` wrappers to bare arrow callbacks
(review tetherto#5, tetherto#10).
- inference-handler-migrations.test.ts: add bareTest op-level cancel-
by-requestId cases for `transcribe (whisper)` (asserts loop exit +
addon.cancel called + reload-count == 2 to pin the
`applyPrompt + restorePrompt runs exactly once` invariant) and
`finetune` (asserts model.cancel called + scope unwind clears the
runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
by instrumenting the addon and asserting addon.cancel was NOT called
during a translate cancel (review tetherto#3, tetherto#7).
- request-lifecycle-primitives.mdc: reconcile the "polling
signal.aborted mid-handler" anti-pattern with the new "Per-iteration
cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
the check when the addon already honours signal directly; the M3b
pattern is *introducing* the check where the addon doesn't and the
loop is the only soft-cancel exit (review tetherto#4).
simon-iribarren added a commit that referenced this pull request May 15, 2026
* QVAC-18183 feat[api]: inference-handler migrations
Migrate the four remaining inference handler kinds onto the
RequestRegistry primitives shipped in M3a (cancel-capability
declaration, per-kind concurrency policy, structured
`[request-lifecycle]` logging). Each handler now opens a
request-scoped `ManagedRequestContext`, threads the optional
`requestId` from the wire request (falling back to a server-minted
UUID), routes hard cancels to `addon.cancel()` at a single signal-
listener leaf, and replaces ad-hoc `try/finally` cleanup with
`scope.defer(...)` registrations so cleanup runs in LIFO order on
every exit path.
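To illustrate the LIFO cleanup that replaces ad-hoc `try/finally`, here is a minimal synchronous sketch. The `DeferScope` class and `withScope` helper are hypothetical stand-ins; the SDK's real `scope.defer` is likely async-aware and may differ in shape.

```typescript
// Hypothetical minimal scope: cleanups run in reverse registration order
// on every exit path, including when the body throws.
class DeferScope {
  private readonly cleanups: Array<() => void> = [];

  defer(fn: () => void): void {
    this.cleanups.push(fn);
  }

  unwind(): void {
    while (this.cleanups.length > 0) {
      const fn = this.cleanups.pop()!;
      try {
        fn();
      } catch {
        // Keep unwinding so one failing cleanup does not skip the rest.
      }
    }
  }
}

function withScope<T>(body: (scope: DeferScope) => T): T {
  const scope = new DeferScope();
  try {
    return body(scope);
  } finally {
    scope.unwind();
  }
}

// Usage: the listener registered last is removed first.
const order: string[] = [];
withScope((scope) => {
  scope.defer(() => order.push("clearRuntimeState"));
  scope.defer(() => order.push("removeListener"));
});
```

Registering each cleanup next to the resource it releases keeps acquisition and release together, which is the readability gain over nested `try/finally` blocks.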
- `embed` (kind "embeddings", `{ scope: "model", hard: true }`):
`packages/sdk/server/bare/ops/embed.ts` opens the context, threads
`requestId` from `embedRequestSchema`, post-await `signal.aborted`
checks raise `InferenceCancelledError`.
- `transcribe` / `transcribeStream` (kind "transcribe",
`{ scope: "model", hard: true }`): collapsed
`try { ... } finally { restorePrompt(...) }` into
`scope.defer(restorePrompt)`, added per-iteration
`if (ctx.signal.aborted) break;` in the `response.iterate()` loop
(Option A from §4 of the M3b brief — explicit, visible at the call
site, no `takeWhileNotAborted` wrapper).
- `translate` (kind "translate"): two engine branches.
llamacpp-completion declares `{ scope: "model", hard: true }` and
wires `signal → addon.cancel()`; nmtcpp-translation keeps
`{ scope: "none" }` and soft-cancels inside both the streaming
iterate loop and the `runBatch` early-return path.
- `finetune` (kind "finetune"): flipped the llamacpp-completion
manifest declaration from `{ scope: "none" }` to
`{ scope: "model", hard: true }` (the addon already exposes
`model.cancel()`). `startFinetune` opens a registry context and
wires `signal → model.cancel()`; the two-level `try/finally`
collapses into `scope.defer` for `clearFinetuneRuntimeState` and
`handle.removeListener`. `cancelFinetune(modelId)` is now a thin
wrapper over `getRequestRegistry().cancel({ modelId, kind:
"finetune" })` — never invokes `model.cancel()` directly.
Per §4 of the brief: per-iteration cancel granularity uses
Option A (explicit `if (ctx.signal.aborted) break;` at the top of
each streaming loop body). No `takeWhileNotAborted` wrapper was
introduced.
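The Option A check can be sketched as follows. This is a simplified, synchronous illustration: `SignalLike` stands in for `ctx.signal`, a plain array stands in for `response.iterate()` (which is presumably an async iterator in the real ops), and `collectUntilAborted` is a hypothetical name.

```typescript
// Hypothetical stand-in for ctx.signal.
interface SignalLike {
  aborted: boolean;
}

// Consume chunks until the signal is aborted. The check sits at the top of
// the loop body, so a chunk that arrives after the abort is never processed.
function collectUntilAborted(
  signal: SignalLike,
  chunks: readonly string[],
  onChunk?: (chunk: string) => void
): string[] {
  const out: string[] = [];
  for (const chunk of chunks) {
    if (signal.aborted) break; // Option A: explicit, visible at the call site
    out.push(chunk);
    onChunk?.(chunk);
  }
  return out;
}

// Usage: abort after the second chunk has been handled.
const signal: SignalLike = { aborted: false };
const received = collectUntilAborted(signal, ["a", "b", "c", "d"], (chunk) => {
  if (chunk === "b") signal.aborted = true;
});
```

The trade-off against a wrapper such as `takeWhileNotAborted` is a line of repetition per loop in exchange for the exit condition being visible where the loop is written.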
Per §7 anti-patterns: M3b adds zero `oneAtATimePerModel` policies
(the four migrated kinds tolerate concurrent requests against the
same model), leaves the M1 compat-fallback in
`server/bare/ops/cancel.ts` untouched (M3d retires it), and does
not modify `cancelHandler.ts`.
Other changes:
- `embed`, `transcribe`, `transcribeStream`, `translate`,
`finetune` request schemas grow an optional `requestId` field
(`.string().min(1).optional()`); server-side ops fall back to
`generateServerRequestId()` when absent.
- Whisper / Parakeet / LLM / NMT plugin handlers thread
`request.requestId` into their bare ops.
- `plugin-cancel-capability.test.ts` truth-table flipped for the
`finetune` row.
- New `inference-handler-migrations.test.ts` covers schema-level
optional-`requestId` acceptance for all four kinds and pins the
`[request-lifecycle] begin/cancel/end` line shape for each kind.
The op-level cancel-by-requestId / cancel-by-modelId integration
tests are bare-runtime-gated (the migrated ops pull `bare-crypto`
/ `bare-fs` transitively and can't load under Bun, same reason as
`finetune-ops.test.disabled.ts`).
- `.cursor/rules/sdk/request-lifecycle-primitives.mdc` and
`.cursor/rules/sdk/docs/request-lifecycle-system.mdc` updated:
M3b row marked shipped, finetune truth-table row flipped,
canonical-handler-shape section refreshed to use `embed.ts` as the
cleanest reference and to document the Option A per-iteration
check.
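The optional-`requestId` contract with a server-side fallback can be sketched without a schema library. The validator below mirrors the `.string().min(1).optional()` shape by hand; `resolveRequestId` and the injected `mint` callback (a stand-in for `generateServerRequestId()`) are hypothetical names used only for illustration.

```typescript
// Hand-written equivalent of "optional non-empty string".
function isValidRequestId(value: unknown): value is string | undefined {
  return value === undefined || (typeof value === "string" && value.length >= 1);
}

// Use the wire-provided id when present, otherwise mint one server-side.
// `mint` is injected here (assumption) to keep the sketch deterministic.
function resolveRequestId(
  wire: { requestId?: unknown },
  mint: () => string
): string {
  const id = wire.requestId;
  if (!isValidRequestId(id)) {
    throw new TypeError("requestId must be a non-empty string when present");
  }
  return id ?? mint();
}
```

A caller that supplies its own id can later cancel by that id; a caller that omits it still gets a request that shows up in the `[request-lifecycle]` log lines under the minted id.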
Verification:
- `bun lint` (eslint + tsc --noEmit): green.
- `bun run typecheck`: green.
- `bun run test:unit`: every test file green except the
pre-existing `client/rpc/rpc-client.ts` `#rpc` package-resolution
failure on upstream/main (also reproducible without these
changes; unrelated to M3b).
* QVAC-18183 fix: address PR #2058 review feedback
- transcribe.ts: route the two `Transcription Update` debug emits
through `requestLogger.debug` so they carry the per-request prefix,
matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
unused module-level `logger`. Collapse two `scope.defer(async () =>
{ await restorePrompt(...) })` wrappers to bare arrow callbacks
(review #5, #10).
- inference-handler-migrations.test.ts: add bareTest op-level cancel-
by-requestId cases for `transcribe (whisper)` (asserts loop exit +
addon.cancel called + reload-count == 2 to pin the
`applyPrompt + restorePrompt runs exactly once` invariant) and
`finetune` (asserts model.cancel called + scope unwind clears the
runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
by instrumenting the addon and asserting addon.cancel was NOT called
during a translate cancel (review #3, #7).
- request-lifecycle-primitives.mdc: reconcile the "polling
signal.aborted mid-handler" anti-pattern with the new "Per-iteration
cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
the check when the addon already honours signal directly; the M3b
pattern is *introducing* the check where the addon doesn't and the
loop is the only soft-cancel exit (review #4).
* QVAC-18183 fix: drop unsafe `addon` re-narrowing in translate.ts onAbort
Addresses opaninakuffo's review comment on #2058:
`AnyModel.addon` is already typed as `AddonInterface | undefined`
(see `server/bare/registry/model-registry.ts:17-20`), so the
`as unknown as { addon?: { cancel?(jobId?: string): Promise<void> } }`
cast was unnecessary. Matches the simpler pattern used by `embed.ts`
and `transcribe.ts` for the same `onAbort` shape — keeps the four
M3b-migrated ops uniform.
* QVAC-18183 doc: trim internal milestone references from cursor rules + code comments
Removed the "Migration Roadmap" table, "M1/M2/M3a-d" milestone labels, planning-brief
decision references (Decision A/B.2, D1/D2), workspace-local paths
(`tasks/release-0.11.0-planning/...`, `pitch-3-decisions.md`), and "in review"
forward-references from the request-lifecycle cursor rules and the matching code
comments in the bare ops, finetune wrapper, and the inference-migration tests. The
canonical handler shape, anti-patterns, primitives reference, plugin cancel-capability
truth table, and concurrency-policy / structured-logging sections all stay — only the
internal milestone framing comes out.
gianni-cor pushed a commit that referenced this pull request May 18, 2026
* feat: add qvac-lib-infer-vla hello-world addon scaffold
- New addon package at packages/qvac-lib-infer-vla with ggml backend.
- CI workflows for on-pr, on-merge, prebuilds, integration + mobile tests, cpp-tests.
- Temporarily renames on-pr-qvac-lib-infer-vla.yml to on-pr-ocr-onnx.yml
so the existing workflow name triggers CI while verifying hello-world scaffold.
* fix[notask]: pure-JS helper pattern for hello-world addon unit tests
- Extract `normalizeName()` into a pure-JS `addon.js` helper in the vla
scaffold so `npm run test:unit` no longer loads the native `.bare` addon.
- Mirror the pattern used by qvac-lib-infer-llamacpp-embed, which lets CI's
ts-checks job (which runs `test:unit --if-present` without a build) pass.
- Propagate the same pattern to the `new-addon` skill templates and document
the rule in SKILL.md so future scaffolds inherit it.
* fix[notask]: fix Windows build for hello-world scaffold
Add Windows compile defines (`NOMINMAX`, `WIN32_LEAN_AND_MEAN`, `NOGDI`)
and link `msvcrt.lib`, mirroring qvac-lib-infer-llamacpp-embed. Without
these, the Windows SDK macros `ERROR` (wingdi.h) and `min` (minwindef.h)
collide with `Priority::ERROR` and `std::min` in the
`qvac-lib-inference-addon-cpp` headers.
Propagate the same fix to the `new-addon` skill template so future
scaffolds inherit it.
* fix: use versionless filename for pinned Vulkan SDK download
LunarG rotated out the versioned `vulkansdk-linux-x86_64-${VERSION}.tar.xz`
download URL and now only serves `vulkan_sdk.tar.xz` under each pinned
version path. Prebuild workflows using the pinned version (currently
1.4.341.1) fail with `wget` exit code 8 (HTTP 404) on every fresh runner.
Align the pinned-version URL with the `latest` URL pattern, which already
uses `vulkan_sdk.tar.xz` and continues to return 200 for pinned versions.
Verified:
- https://sdk.lunarg.com/sdk/download/1.4.341.1/linux/vulkan_sdk.tar.xz -> 200
- https://sdk.lunarg.com/sdk/download/1.4.341.1/linux/vulkansdk-linux-x86_64-1.4.341.1.tar.xz -> 404
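The URL change amounts to dropping the version from the filename while keeping it in the path. A small sketch of that construction, using only the URL shape quoted above; the function name `vulkanSdkUrl` is hypothetical, and the real fix lives in a workflow action, not in TypeScript.

```typescript
// Build the Linux download URL using the versionless filename.
// `version` is a pinned version string such as "1.4.341.1".
function vulkanSdkUrl(version: string): string {
  return `https://sdk.lunarg.com/sdk/download/${version}/linux/vulkan_sdk.tar.xz`;
}
```

Because the filename no longer embeds the version, the pinned and `latest` code paths can share one template.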
* chore[notask]: bump setup-vulkan-sdk action pin on tmp-vla
Point the vla prebuild workflow at the cherry-picked Vulkan URL fix
so CI on this branch actually picks it up. The previous pin still
resolved to the pre-fix action, so Linux/Android prebuilds kept
hitting wget exit 8 (HTTP 404) even after the fix commit landed on
tmp-vla.
* feat[bc]: port SmolVLA ggml inference into qvac-lib-infer-vla
Replace hello-world scaffold with real SmolVLA inference engine (739-tensor
vision+text+expert model, 10-step flow-matching ODE). JS surface exposes
VlaModel, preprocessImage, padState. Integration test downloads the LIBERO
checkpoint from S3 via GitHub OIDC so CI can exercise end-to-end inference.
* infra: add on-pr CI workflow for qvac-lib-infer-vla
The VLA package was missing an on-pr workflow, so nothing ran sanity checks,
cpp-lint/tests, ts-checks, prebuilds, or integration tests against a PR. This
adds one mirroring the Embed template so integration tests (which pull the
SmolVLA LIBERO GGUF from S3) gate the PR.
* doc: harden new-addon skill with explicit 7-workflow check
Add Step 4a validation gate that lists every expected workflow filename and
fails loudly if any is missing. The prior VLA scaffold shipped with only 6/7
workflows (on-pr-*.yml silently dropped), which left PRs against the new
package without sanity checks, cpp-lint/tests, ts-checks, prebuilds, or
integration tests. Also make Step 6 list each generated filename by name so
miscounts are caught at report time.
* fix: use std::numbers::pi_v<float> to unbreak Windows (MSVC) build
MSVC's `<cmath>` does not define `M_PI` unless `_USE_MATH_DEFINES` is set
before the include, so the x64-windows prebuild job failed to compile
smolvla.cpp. Switch to the C++20 `std::numbers::pi_v<float>` constant,
which works on every toolchain we build with.
* feat: enable full GPU backend set (Vulkan + Metal + OpenCL) in qvac-lib-infer-vla
Drop default-features:false on the qvac-fabric dep so the port's platform-
auto-selected backends get built: Metal on iOS/macOS, Vulkan on Linux/Android/
Windows, plus the CPU fallback everywhere. Declare the OpenCL dep on Android
so qvac-fabric's Android GPU backend can pick it up alongside Vulkan, mirroring
the LLM addon's setup.
The addon already calls ggml_backend_load_all_from_path(BACKENDS_SUBDIR) and
ships each GGML_AVAILABLE_BACKEND as a shared/static lib via CMakeLists, so no
C++ changes are needed — the extra backends get discovered at runtime.
* chore[notask]: rename vla workflow display names for easier triggering
Use `on-merge-vla` for the merge workflow and `vla` for the PR workflow so
`gh workflow run vla` uniquely resolves to the on-pr trigger without ambiguity
against all the other `(Vla)`-suffixed package workflows.
* chore[notask]: mask vla on-pr workflow as on-pr-ocr-onnx.yml on tmp-vla
Temporarily rename the VLA on-pr workflow to the OCR filename so
`gh workflow run on-pr-ocr-onnx.yml --ref tmp-vla` resolves the workflow
ID via main's registration and then dispatches against our file content
on tmp-vla. Scoped to tmp-vla only — does not affect main's OCR workflow.
* fix: satisfy standardjs no-new in vla integration tests
Capture the VlaModel constructor return and destroy it so standardjs
stops flagging the error-path probes with `no-new`. These paths throw
synchronously before the native handle is fully built, so the destroy
is cheap and safe.
* fix: replace brittle t.exception() in vla unit tests to unblock bare run
Brittle's t.exception() runs the probed function inside a promise chain; on
the bare runtime the assertion helper rethrows into an uncaught rejection
which aborts the process with SIGABRT (exit 134). This made the ts-checks
job fail on CI even though every assertion passed.
Switch both rejection probes (preprocessImage and padState) to the same
try/catch + t.ok pattern already used in the integration tests.
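The try/catch + `t.ok` pattern can be sketched as a small helper that turns "did this throw?" into a plain boolean. Everything here is illustrative: `probeTarget` is a hypothetical stand-in (the real `preprocessImage` / `padState` signatures are not shown in this log), and the sketch assumes the probed function throws synchronously.

```typescript
// Hypothetical function under test that throws on bad input.
function probeTarget(input: readonly number[]): number[] {
  if (input.length === 0) throw new TypeError("input must not be empty");
  return input.slice();
}

// Run fn inside a plain try/catch and report the outcome, so the test
// asserts on a boolean and no rejection can escape the assertion helper.
function didThrow(fn: () => unknown): { threw: boolean; error: unknown } {
  try {
    fn();
    return { threw: false, error: undefined };
  } catch (error) {
    return { threw: true, error };
  }
}

const bad = didThrow(() => probeTarget([]));
// In a brittle test this would read: t.ok(bad.threw, "rejects empty input")
```

The error object stays available on the result, so a test can still assert on its type or message after the boolean check.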
* style: apply clang-format-19 to qvac-lib-infer-vla sources
Satisfies cpp-lint 'Check C++ files format' step (run from CI):
git-clang-format-19 --extensions c,cc,cpp,cxx,h,hh,hpp,hxx -- packages/qvac-lib-infer-vla
* test[notask]: fix ci failures from tmp-vla PR-style dispatch
- mobile: add test/mobile/ scaffold (integration-runtime.cjs + auto.cjs)
and matching generate/validate scripts. Mobile workflow requires
test/mobile/*.cjs; before this commit the dir didn't exist.
- integration (linux-x64): install aws CLI v2 on linux runners
(idempotent). Needed for ai-run-linux-gpu self-hosted runner that
lacks a pre-baked aws CLI.
- integration (darwin-x64): skip S3 download + QVAC_VLA_MODEL on the
macos-15-large Intel runner. Its Apple Paravirtual GPU exposes only
~1 GB working set — too small for the 4 GB SmolVLA model, which
triggers GGML_ASSERT(buf_src) mid-inference on Metal. Darwin-arm64
still runs the full end-to-end test.
* ci[notask]: skip cpp-lint on workflow_dispatch in vla on-pr
cpp-lint passes `github.event.pull_request.base.sha` as the diff base;
on workflow_dispatch that's empty, and the called workflow then runs
`git-clang-format-19 --diff ""` which fails with "'' is not a commit".
Gate the job on `github.event_name == 'pull_request_target'` so
dispatch-style runs (we use these to test tmp-vla) don't fail it.
Real PRs still run the format check normally. merge-guard is
if-always, so the skipped job doesn't block it.
* fix: ship ggml core libs on Android and add AWS CLI to PATH on self-hosted linux
Two independent CI fixes for the VLA addon:
1. Android mobile integration tests were failing because the prebuild
shipped only backend shared libs (libqvac-ggml-vulkan.so,
libqvac-ggml-cpu-*.so, libqvac-ggml-opencl.so) and the addon .bare
itself. qvac-fabric builds ggml with GGML_BACKEND_DL=ON on Android,
which makes ggml::ggml and ggml::ggml-base shared libraries too, so
without them the addon's dlopen fails with unresolved ggml_* symbols.
Install them alongside the backend libs when GGML_BACKEND_DL is set.
2. linux-x64 integration tests were failing on the self-hosted
ai-run-linux-gpu runner because AWS CLI v2 installs to
/usr/local/bin/aws but that directory is not on PATH for subsequent
steps. Append it to $GITHUB_PATH so later steps (aws s3 sync, etc.)
can resolve the binary. Also simplified the install block to early-
exit when aws is already present.
* fix[notask]: VLA Android ggml backend-DL compat + linux AWS CLI perms
Two fixes for remaining tmp-vla CI failures:
1. Android addon failed to dlopen the .bare because qvac-fabric builds
ggml with GGML_BACKEND_DL=ON, which keeps the core ggml_backend_*
registry symbols in the addon but puts `ggml_backend_cpu_init` in the
separately-loaded CPU backend .so. Switch to the device-registry API
(`ggml_backend_dev_by_type` + `ggml_backend_dev_init`) so the CPU
backend is obtained from whichever backend was loaded at runtime via
`ggml_backend_load_all_from_path`. Also revert the CMakeLists hack
that shipped ggml::ggml / ggml::ggml-base alongside the addon — those
ship as static .a under this vcpkg triplet and are useless at dlopen.
2. linux-x64 integration jobs were hitting `aws: Permission denied` on
the self-hosted `ai-run-linux-gpu` runner because a leftover install
at /usr/local/bin/aws had mode bits the runner user couldn't execute.
Add an `[ -x /usr/local/bin/aws ]` early-return path so we reuse a
good existing install, and `chmod -R a+rX` after any fresh install to
harden against the same footgun next time.
* fix[notask]: tolerate Vulkan teardown SIGSEGV on ai-run-linux-gpu
The Linux x64 integration matrix runs on two Ubuntu runners: a plain
ubuntu-22.04 (CPU only) and a self-hosted ai-run-linux-gpu (Tesla T4
Vulkan). Tests all pass cleanly on both, but the GPU runner's bare
process exits with SIGSEGV (exit 139) ~0.5s after the final test
completes — inside ggml-vulkan's static-destructor chain interacting
with the NVIDIA Vulkan ICD.
Fixing that upstream is out of scope for this branch, but we still want
GPU coverage in CI. Wrap the `npm run test:integration` invocation so
that exit 139 is tolerated IFF the captured TAP output shows all tests
passed (the `# ok` end marker and the `# tests = N/N pass` summary).
Any other non-zero exit, and any missing TAP pass marker, still fails
the job.
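The tolerance rule above is a small decision function. The real wrapper is a shell step in the workflow; the TypeScript below only models its logic, and the name `shouldPass` and the exact regular expressions are assumptions based on the two TAP markers quoted in the commit message.

```typescript
// Decide whether a test run counts as passed.
// Exit 139 (SIGSEGV) is tolerated only when the TAP output proves that
// every test passed before the crash during teardown.
function shouldPass(exitCode: number, tap: string): boolean {
  if (exitCode === 0) return true;
  if (exitCode !== 139) return false;

  const hasOkMarker = /^# ok\s*$/m.test(tap);
  const summary = /^# tests = (\d+)\/(\d+) pass\s*$/m.exec(tap);
  const allPassed = summary !== null && summary[1] === summary[2];

  return hasOkMarker && allPassed;
}
```

Requiring both markers keeps the exemption narrow: a crash before the summary is printed, or a summary with fewer passes than tests, still fails the job.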
* feat[api]: expose per-stage timings and PyTorch reference assertion in VLA
- VlaModel.run() now returns { actions, stats } where stats carries
vision_ms, smollm2_compute_ms, smollm2_total_ms, ode_ms, total_ms
captured during inference. C ABI of smolvla_inference is preserved;
C++ callers use new smolvla_inference_with_timing.
- Integration test: tolerance-based comparison against a committed
PyTorch reference (test/integration/assets/pt_actions_libero_fixed.json,
generated by scripts/generate_reference.py), plus wiring of the shared
performance reporter (vla addon type). Uploads perf-report.json as
a per-platform artifact in the integration-test workflow.
* test: regenerate VLA PyTorch reference at action_dim=7
The committed reference was generated at action_dim=6 but the current
smolvla-libero-f32-fixed.gguf reports action_dim=7, so the tolerance
asserts were skipped in CI with "shape mismatch (ref=50x6, actual=50x7)".
Regenerated with `generate_reference.py --action-dim 7`; local run now
exercises both new asserts with max|Δ|=0.0009, cos=1.0000.
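The two figures reported here, max|Δ| and cosine similarity, can be computed as below. This is a sketch of the metrics only: it assumes both action tensors are flattened to equal-length arrays (for example 50x7 values), and the function names are hypothetical, not the test's actual helpers.

```typescript
// Largest element-wise absolute difference between two equal-length vectors.
function maxAbsDiff(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new RangeError(`shape mismatch (ref=${a.length}, actual=${b.length})`);
  }
  let max = 0;
  for (let i = 0; i < a.length; i++) {
    max = Math.max(max, Math.abs(a[i] - b[i]));
  }
  return max;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Assumes neither vector is all zeros.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new RangeError(`shape mismatch (ref=${a.length}, actual=${b.length})`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}
```

The two metrics are complementary: max|Δ| catches a single badly wrong element, while cosine similarity catches a change in overall direction. Throwing on a length mismatch, instead of skipping, would have surfaced the 50x6 vs 50x7 reference problem as a failure.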
* feat: bundle SmolVLA GGUF on mobile via presigned S3 URL
Ports the presigned-URL-on-mobile pattern used by qvac-lib-infer-nmtcpp so
the VLA end-to-end test actually runs on AWS Device Farm. Without a GGUF
on device the mobile test skipped, leaving the Step Summary empty.
- scripts/generate-smolvla-presigned-url.sh: resolve the latest date dir
under s3://MODEL_S3_BUCKET/qvac_models_compiled/vla/smolvla-libero/,
presign smolvla-libero-f32-fixed.gguf for 6h, export to GITHUB_ENV.
- integration-mobile-test-qvac-lib-infer-vla.yml: OIDC auth to
eu-central-1, run the presign script, and bundle the URL into
test/mobile/testAssets/smolvla-urls.json before the addon is packed.
- test/integration/addon.test.js: on mobile, load the URL from
global.assetPaths, download into global.testDir/vla-models/ (with
retry/redirect handling and a ≥100MB cache-hit shortcut) and use that
as the modelPath instead of relying on QVAC_VLA_MODEL.
- package.json: add bare-fetch devDep, same version range as nmtcpp.
* fix: stream SmolVLA GGUF download on mobile via bare-https
The mobile end-to-end test was crashing the Bare runtime at
after-test:runAddonTest with State=1 on both iOS and Android. Root cause
was the _downloadFile helper loading the entire 2.1 GiB GGUF into memory
via bare-fetch + response.arrayBuffer() + Buffer.from(buffer), which
peaked at ~4.5 GB and got OOM-killed by the mobile kernel.
Replace the buffered download with a bare-https streaming pipe:
https.get + fs.createWriteStream + res.on('data', chunk => write(chunk)).
Same pattern Parakeet, TTS/Chatterbox, and Diffusion use for their
multi-GB Device Farm models. Preserves redirect handling (301/302/
307/308), retry+backoff, and adds progress logs every 50 MB. Failed
attempts unlink the partial file before retrying.
Drop bare-fetch from devDependencies — bare-https is a Bare runtime
module, so no new dep is needed.
* ci: align darwin-arm64 integration runner with prebuild SDK
Prebuilds for darwin-arm64 are built on macos-14 (macOS 14 SDK), but the
integration test job was running on macos-15-xlarge. The .bare binary —
including its linked Metal/MPSGraph frameworks — was compiled against the
macOS 14 SDK then loaded on a macOS 15 host. That cross-SDK mismatch is a
plausible cause of the Metal correctness divergence we are seeing on CI
(max|Δ|=1.9789 on CI darwin-arm64 vs max|Δ|=0.0006 on a macos-15.5 M3
Max running the same GGUF locally). Match the runner OS to the prebuild
runner (macos-14-xlarge) so the binary executes on the SDK it was built
against.
Also tighten the end-to-end mobile test: remove the t.comment + t.pass()
graceful-skip branches that silently masked iOS CI failures. On mobile
the presigned S3 URL is bundled at build time, so a fetch/load/inference
failure is now a hard t.fail(), and we assert the downloaded GGUF exists
and is at least 100 MB before proceeding.
* ci: run darwin-arm64 VLA integration on self-hosted mac-mini-m4
GitHub's hosted macos-*-xlarge runners are Apple Virtualization VMs —
their Metal driver reports "Apple Paravirtual device" with
`simdgroup reduction = false` and `simdgroup matrix mul. = false`. ggml
falls back to a scalar Metal path that is ~40x slower and produces
different f32 accumulation, which is what caused the darwin-arm64
correctness failure (max|Δ|=1.97, cos=0.15) and a ~12s vs ~0.3s
inference time versus the same GGUF on a real M3 Max.
macos-14-xlarge has the same paravirt signature (confirmed in
run 24887526194: max|Δ|=1.07 on SDK-aligned runner), so the earlier
fix didn't help.
Switch darwin-arm64 integration to the self-hosted mac-mini-m4 runner
(label: mac-mini-m4-gpu), the same setup the diffusion addon uses for
Metal-backed correctness tests.
* ci: install AWS CLI on darwin-arm64 self-hosted runner
The mac-mini-m4 self-hosted runner doesn't ship with aws CLI preinstalled,
so the "Download SmolVLA model from S3" step fails with
`aws: command not found` (run 24888672009, job 72877826352). GHA's Linux
matrix entry had an idempotent aws install; darwin had none. Add the
equivalent macOS step that checks PATH, then /usr/local/bin/aws, then
installs via the official AWSCLIV2.pkg installer. Scoped to darwin-arm64
since darwin-x64 runs on a GHA-hosted Intel Mac that already has aws.
* ci: install AWS CLI user-local on mac-mini-m4 (no sudo)
The self-hosted mac-mini-m4-gpu runner doesn't have passwordless sudo,
so `sudo installer -pkg AWSCLIV2.pkg -target /` fails with
`sudo: a terminal is required to read the password` (run 24889823710,
job 72880523559).
Pivot to a user-local install: `pkgutil --expand-full` unpacks the
official pkg without sudo, and the payload at
`aws-cli.pkg/Payload/aws-cli/aws` is a real Mach-O universal binary
(verified: aws-cli/2.34.36 runs standalone from that path). Move it
to `$HOME/.local/aws-cli` and add that dir to `$GITHUB_PATH`.
Also widen the preflight check to pick up `/opt/homebrew/bin/aws` and
the user-local path, so the step is a no-op on subsequent runs.
* test: fix mobile model download — bare-https has no .get()
Mobile Device Farm runs were failing at test 4 (`end-to-end inference
runs (needs GGUF)`) with `[vla-model] download failed after 3 attempts:
https.get is not a function` on iPhone 16 Pro / 16e / 17 and Pixel 9 Pro /
Galaxy S25 Ultra (run 24891028803).
Root cause: `bare-https` only exports `.request()` — there is no
Node-compatible `.get()`. Switch to the same pattern
`qvac-lib-infer-llamacpp-embed/test/integration/utils.js` uses:
`https.request(url, cb)` followed by an explicit `req.end()`, since
`.request()` returns a writable that must be closed before the request
is actually sent.
t.fail() hardening surfaced this correctly — desktop remains green
(real M4 Metal: max|Δ|=0.0006, cos=1.0000).
* test: fix mobile VLA download crash — use response.pipe(file)
Mobile Device Farm runs were still failing after the https.get→request fix.
Android (Pixel 9 Pro) crashed at 50MB / 2.4% of the 2.2GB download with
SIGABRT on the mqt_v_js thread inside libbare-kit.so; iOS exhibited the
same APP CRASHED pattern (run 24899187856, job 72913667435).
Root cause: the download was using `res.on('data', chunk =>
writeStream.write(chunk))` with no backpressure — V8 + file stream
queue grew until the JS bridge aborted. `qvac-lib-infer-llamacpp-embed`
downloads with `response.pipe(file)`, which applies backpressure
automatically. Switch to the same pattern, plus the full safeResolve/
safeReject error hygiene (destroy file + unlink on error, follow
redirects cleanly).
Progress logging is preserved (`res.on('data')` is kept for byte
counting only; the pipe does the actual writing).
Desktop remained green through both prior fix attempts (real M4 Metal:
max|Δ|=0.0006, cos=1.0000) — this only affects the mobile fetch path.
* test: raise mobile GGUF e2e test timeout to 20 min
The backpressure fix (6021b43b, res.pipe(file)) successfully resolved the
50MB SIGABRT on Android — download now progresses past 50MB cleanly
(logcat: [vla-model] progress: 50MB (2.4%) at 18:07:10 then keeps going
with no crash in libbare-kit.so).
New failure mode surfaced: brittle's default 30-second per-test timeout
fires before a 2.2GB mobile download + model load + inference can
complete. On Pixel 9 Pro and Galaxy S25 Ultra the test timed out at
30s → Uncaught (in promise) Error: Test timed out after 30000 ms →
SIGABRT on mqt_v_js as the unhandled rejection propagates through the
bare bridge.
Only the end-to-end inference test needs the long budget — the other
three tests (module exports, empty path rejection, missing GGUF
rejection) stay at 30s. 20 min is conservative for:
- 2.2GB HTTPS download over mobile carrier (5-10 min)
- SmolVLA model load (vision 12L + text 32L + expert 32L, ~1 min)
- Vision x2 + SmolLM2 prefix + 10-step ODE (~15s on CPU/Vulkan)
- Headroom for Device Farm variability
Desktop is unaffected: it uses QVAC_VLA_MODEL from a pre-staged path
and finishes in ~15 sec (max|Δ|=0.0006 on M4 Metal, cos=1.0000).
* fix: mmap+host_ptr GGUF load to fix iOS Metal alloc crash
Mobile run 24905749242 (commit 8bdc077e) confirmed all download/timeout
fixes worked: Pixel 9 Pro reaches `runAddonTest passed (4/4)`. Two new
unrelated bugs surfaced; this fixes the iOS one.
iOS root cause
On iPhone 16 Pro / 16e / 17, every load attempt crashed at model load
with EXC_BAD_ACCESS in `ggml_metal_buffer_is_shared` at NULL+0x10. The
faulting stack:
ggml_metal_buffer_is_shared
ggml_backend_metal_buffer_type_shared_alloc_buffer
alloc_tensor_range
ggml_backend_alloc_ctx_tensors_from_buft
smolvla_load_model+51156
`smolvla_load_model` was hand-rolling a load path that did:
1. gguf_init_from_file(no_alloc=false) — heap-allocate full 2.2 GB on CPU
2. ggml_init(no_alloc=true) — duplicate context for GPU
3. ggml_backend_alloc_ctx_tensors() — single 2.2 GB Metal shared-mode
allocation, which iOS Metal cannot service. The internal
allocator returned NULL, which was then dereferenced.
Why the LLM and diffusion addons don't hit this on iOS
Both delegate model loading to a library (llama_load_model_from_file in
qvac-fabric, new_sd_ctx in stable-diffusion-cpp) that uses the
ggml_backend_dev_buffer_from_host_ptr() path on devices reporting
`caps.buffer_from_host_ptr=true` (Apple Metal, CPU). That path wraps an
mmap'd region in a backend buffer and the Metal backend internally
slices it into per-tensor sub-buffers each ≤ max_tensor_size — no
giant single shared-mode allocation.
Fix — mirror llama-model.cpp:6648 create_backend_buffers
- gguf_init_from_file(no_alloc=true): metadata only (~few MB), no 2.2 GB
heap copy.
- Probe device caps (buffer_from_host_ptr, is_default_buft).
- FAST PATH (Apple Metal, CPU): mmap the GGUF file with PROT_READ |
MAP_PRIVATE; call ggml_backend_dev_buffer_from_host_ptr() with
ggml_get_max_tensor_size(ctx) as the slicer hint; wire each tensor
to its mmap-relative position via ggml_backend_tensor_alloc().
Zero-copy: process memory stays around tensor metadata + lazily-paged
mmap, no second allocation.
- FALLBACK (Vulkan / Android, Windows, no-host-ptr device): allocate
via ggml_backend_alloc_ctx_tensors_from_buft() then read from disk
with fseek/fread and upload via ggml_backend_tensor_set(). Same path
as before but without the duplicate-context dance, and emits a clear
failure message if the alloc returns NULL.
- Replace single `buf_w` with `std::vector<ggml_backend_buffer_t>
bufs_w` (Metal will create multiple sub-buffers; CPU/Vulkan keep one).
- Track mmap_addr/mmap_size on the model and munmap in
smolvla_free_model AFTER backend buffers are released.
- Mirror diffusion's CMake: define GGML_BACKEND_DL on Android so the
addon's TUs see the same flag the qvac-fabric ggml port was built
with.
The previous duplicate-context-+-remap-pointers code is removed
entirely. Tensors stay in the single ctx_data, and either the mmap or
alloc+copy path populates their data pointers in place.
Validation
Linux desktop (Vulkan device probed but CPU path engaged):
- 4/4 integration tests pass, 23/23 asserts pass
- alloc+copy fallback exercised: total weights 2127.2 MB, 739 tensors
- Quality vs PyTorch HuggingFaceVLA/smolvla_libero:
max|Δ|=0.0009, mean|Δ|=0.00003, cos=1.0000 (350 values)
matches the prior baseline (max|Δ|=0.0006 on M4 Metal).
- 2/2 C++ unit tests pass.
The mmap path needs Device Farm iOS to validate end-to-end; the
fallback is exercised on every desktop run today.
* fix: use 64-bit fseek for >2GB GGUF read on Windows + 32-bit POSIX
Win32 integration test in run 24980777510 (commit 46c55b30) failed at:
smolvla_load_model: failed to read tensor 'v.enc.blk.7.ffn_down.bias'
at offset 2149428256
Root cause: the fallback alloc+copy path used fseek() with a (long)
cast on the offset. On Windows long is 32-bit (LLP64), so any offset
above 2^31-1 bytes (2 GiB, ≈2.15 GB) silently truncates. The smolvla GGUF
is ~2.13 GB of weight data, so tensors past the 2 GiB mark cannot be
reached with fseek(). The same trap exists on 32-bit POSIX targets where
off_t defaults to 32-bit unless _FILE_OFFSET_BITS=64.
Fix:
- Define _FILE_OFFSET_BITS=64 at the top of smolvla.cpp before any
system header so off_t / fseeko / ftello are 64-bit on POSIX.
- In the fallback path use _fseeki64() on Windows and fseeko() on
POSIX (both 64-bit-clean).
- Add explicit <cstdio>/<cstdint> includes since we now reference
the 64-bit variants directly.
The mmap fast path (Apple Metal, CPU-with-host-ptr) is unaffected —
it never calls fseek; mmap addresses are pointer-sized.
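To make the truncation concrete, here is a small JavaScript analogue. It is only an illustration: `| 0` performs a 32-bit signed conversion, which is assumed to mirror what the `(long)` cast does on LLP64 Windows (the C++ result of an out-of-range conversion is implementation-defined). The offset is the one from the failing log line above.

```javascript
// Offset reported in the failing Win32 run.
const failingOffset = 2149428256
const INT32_MAX = 2 ** 31 - 1 // 2147483647

// Assumed analogue of a 32-bit signed cast: wraps modulo 2^32.
const asInt32 = (n) => n | 0

console.log(failingOffset > INT32_MAX) // true
console.log(asInt32(failingOffset)) // -2145539040
```

A negative seek target is why the read for that tensor fails rather than returning wrong data.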
Validation
- Linux desktop alloc+copy fallback path still passes:
- 4/4 integration tests, 23/23 asserts
- 739 tensors, total 2127.2 MB loaded, all tensors past the
2 GB boundary read correctly
- Quality vs PyTorch HuggingFaceVLA/smolvla_libero unchanged:
max|Δ|=0.0009, mean|Δ|=0.00003, cos=1.0000 (350 values)
Win32 needs a CI roundtrip to confirm the fix end-to-end.
* refactor[bc]: align qvac-lib-infer-vla with canonical addon shape
- index.js: replace synchronous VlaModel(ggufPath) with the canonical
constructor ({ files, config, logger, opts }) and add load / run / unload /
pause / cancel / getState built on @qvac/infer-base's createJobHandler +
exclusiveRunQueue and @qvac/logging. run() returns a QvacResponse and the
underlying synchronous binding is driven through job.start/output/end.
- index.d.ts: update typings to match the new async API.
- package.json: declare @qvac/logging, @qvac/infer-base, bare-fs, bare-path
runtime deps; add top-level test, coverage:cpp* scripts; rewire
test:integration to generate test/integration/all.js (and chain
test:mobile:generate); replace scaffold description with the real one;
pin cmake-bare to 1.7.5 and bump brittle to ^3.16.5.
- CMakeLists.txt: add ENABLE_COVERAGE / VK_PROFILING options and replace the
ENV-probe ANDROID_STL block with the canonical option().
- on-merge workflow: rename display name to "On Merge Trigger (Vla)".
- integration tests: switch to the new constructor + await load/run/unload
flow.
* feat[notask]: scaffold new addons in canonical shape
Update the new-addon skill so a freshly scaffolded addon ships with the
canonical shape used across the monorepo, removing the consistency-fix
round-trip that qvac-lib-infer-vla just had to absorb.
- templates/index.js: replace the synchronous sayHello() wrapper with a
canonical class. Constructor `({ files, config, logger, opts })` validates
`files.model` like every other addon; lifecycle is `load` / `run` / `unload`
/ `pause` / `cancel` / `getState`; `run()` returns a `QvacResponse` driven
through `createJobHandler` + `exclusiveRunQueue` from `@qvac/infer-base`,
with logging via `@qvac/logging`. The hello-world `binding.sayHello()` call
is driven inline so synchronous backends still flow through the standard
job interface.
- templates/index.d.ts: typings updated to match the new async surface.
- templates/package.json: declare the canonical runtime deps
(`@qvac/infer-base`, `@qvac/logging`, `bare-fs`, `bare-path`); add
top-level `npm test`, `coverage:cpp:*` scripts; rewire `test:integration`
through `test:integration:generate` (which also chains
`test:mobile:generate`); pin `cmake-bare` to exact `1.7.5` and bump
`brittle` to `^3.16.5` to match `qvac-lib-infer-llamacpp-llm`. The
backend-specific deps placeholder is renamed `BACKEND_NPM_DEPS` and is
appended inside the canonical dependencies block (with a leading comma).
- templates/CMakeLists.txt: add `option(ANDROID_STL ...)`,
`option(ENABLE_COVERAGE ...)`, `option(VK_PROFILING ...)` so the
prebuild workflow's `vk-profiling` input and the `coverage:cpp` scripts
actually reach CMake.
- templates/test/integration/addon.test.js: switch to the new constructor
+ await load/run/unload flow; add a constructor-validation test.
- SKILL.md: document the canonical class shape contract, update the
substitution table for `BACKEND_NPM_DEPS`, expand the verification step
to include `npm test`, and update the next-step hint so the developer
preserves the constructor signature and lifecycle when filling in the
real model logic.
* Revert "feat[notask]: scaffold new addons in canonical shape"
This reverts commit 8f84f1c1a56dd0c731ee4142b5253b66b3f44a55.
* fix: address VLA review feedback — JS/CI consistency, correctness, perf
Consistency
- package.json: add `build:pack` and `mobile:copy-prebuilds` scripts so the
mobile workflow stops falling back to its inline `npm pack` and warning
about missing prebuild fan-out.
- integration-mobile-test-qvac-lib-infer-vla.yml: rename the Device Farm log
artifact from `devicefarm-logs-llamacpp-embed-` to `devicefarm-logs-vla-`
and pin `actions/upload-artifact` to the canonical SHA used elsewhere in
the repo. Document that the `_LLAMACPP_EMBED` Device Farm secrets are
intentionally shared (no dedicated `_VLA` secrets are provisioned yet).
Correctness
- index.js: clear `_hasActiveResponse` synchronously on both the success
and failure paths. Previously the catch re-threw before the trailing
`.finally(...)` cleanup was wired up, so a native-side inference error left
the model permanently `RUN_BUSY` until `unload()`. The success path's
cleanup ran one microtask late, leaving a window where chained `run()`
calls could observe the stale flag.
- index.js: `pickPrimaryGgufPath` now matches `-0*1-of-N.gguf` instead of
any shard index, so multi-shard models always pick shard 1 regardless of
the input array order.
- test/integration/addon.test.js: drain the redirect / non-2xx response
body via `res.resume()` so `bare-https` releases the underlying socket
before we follow the redirect or fail.
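The shard-1 match described for `pickPrimaryGgufPath` above can be sketched as follows. The function name `pickFirstShard`, the exact regular expression, and the fall-back to the first path when nothing matches are assumptions for illustration, not the addon's actual code.

```javascript
// Matches "-1-of-N.gguf" with any amount of zero padding on the shard index.
const FIRST_SHARD = /-0*1-of-\d+\.gguf$/

// Hypothetical helper: prefer shard 1 regardless of array order.
function pickFirstShard (paths) {
  const first = paths.find((p) => FIRST_SHARD.test(p))
  // Assumed fallback: a single-file model has no shard suffix at all.
  return first !== undefined ? first : (paths[0] ?? null)
}
```

Anchoring on the literal `1` (rather than `\d+`) is what stops shard 2 or shard 11 from being picked when it happens to come first in the input.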
Performance
- addon.js: rewrite `preprocessImage` to do bilinear resize, letterbox-pad
and the [0,1]→[-1,1] shift in a single pass over the output buffer. Drops
the `src` and `resized` intermediates (3 × 3 MB allocations → 1) and
hoists the per-output-pixel coordinates out of the channel loop so all
three channels share one set of weights. Adds an optional `opts.scale`
override so callers that already know the pixel range skip the
256-element scan in `detectScale`.
- test/integration/addon.test.js: replace the per-chunk float division +
`toFixed` percentage compare in `_streamDownload`'s `'data'` handler
with a byte-threshold check; the 2.2 GB GGUF download no longer pays
per-chunk floating-point overhead just to gate a log every 50 MB.
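A rough sketch of the byte-threshold gating mentioned in the last bullet. `createProgressLogger` and its parameters are hypothetical names; the point is only that the per-chunk work is one integer add and one compare, with the percentage computed solely when a line is actually logged.

```javascript
// Hypothetical helper: returns a 'data' handler that logs once per `stepBytes`.
function createProgressLogger (totalBytes, stepBytes, log) {
  let received = 0
  let nextMark = stepBytes
  return function onChunk (chunkLength) {
    received += chunkLength
    if (received < nextMark) return // hot path: add + compare only
    while (nextMark <= received) nextMark += stepBytes
    const pct = (received / totalBytes * 100).toFixed(1)
    log(`progress: ${received} bytes (${pct}%)`)
  }
}
```

With `res.pipe(file)` doing the writing, a handler like this would be attached for counting only, for example `res.on('data', (chunk) => onChunk(chunk.length))`.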
* fix: address VLA review feedback — C++ correctness + perf
Correctness
- AddonJs.hpp: introduce a `VlaHandle` indirection wrapper so an explicit
`destroyVlaModel` can null out the inner `VlaModel*` while the GC
finalizer still owns the heap-allocated wrapper. Previously the eager
`delete` in `destroyVlaModel` left a dangling pointer in the JS external
slot that the GC finalizer would then re-`delete` (use-after-free /
double-free). `unwrap` now throws when the model has been destroyed
rather than dereferencing a freed pointer.
- smolvla.cpp (mmap fast path): reject the host-ptr buffer path when
`data_offset >= file_size` (would underflow `tensor_data_size` to a
huge `size_t`) or when `st.st_size > SIZE_MAX` (would truncate the
mapping length on 32-bit targets where the GGUF won't fit anyway).
Falls through to the alloc+copy path with a clearer diagnostic.
Performance
- AddonJs.hpp / AddonCpp.hpp: switch the `runVlaModel` JS→C++ boundary to
zero-copy. `typedArrayPtr<T>()` returns the underlying ArrayBuffer
pointer + length via `js_get_typedarray_info` directly; `VlaModel::run`
now takes raw `const T*` + lengths instead of `std::vector` copies.
Drops one `std::vector<float>` copy per image (~3 MB each at
3×512×512 f32) plus state/tokens/noise copies on every inference call.
The mask still copies into a small `bool` buffer because the inference
signature requires `const bool*`; the copy is 48 bytes so it's not
worth restructuring smolvla_inference_with_timing's ABI.
- smolvla.cpp (ODE loop): hoist the per-step `te_single` allocation out
of the loop and replace the 50-iteration `memcpy` broadcast with a
doubling pattern (~7 memcpy calls instead of 50). Drop the redundant
per-step KV cache re-upload — the KV inputs are uploaded once before
the loop via `ggml_set_input`, and `ggml_backend_sched` preserves
input-tagged tensors between `ggml_backend_sched_graph_compute` calls
while the scheduler is not reset.
Not addressed in this commit
- The post-sg2 KV mini-graph re-extraction (16 separate per-layer
graphs after the main SmolLM2 forward). Eliminating this requires
pinning the K/V output tensors to a host-allocated CPU buffer so
gallocr cannot overwrite them between compute calls — a deeper
graph-allocator restructure that needs end-to-end validation against
the PyTorch reference assertion. Tracking as a follow-up; the perf
win there is large (roughly 2× SmolLM2 stage cost).
* fix: guard te_single broadcast against chunk_size=0
The doubling-pattern memcpy in the ODE loop unconditionally copied one
row of te_single before checking chunk_size. With chunk_size == 0 the
te_expanded buffer is empty and that initial memcpy would overflow.
The pre-existing per-step loop didn't have this hazard because the
for-loop simply didn't run.
In production chunk_size is always 50, but adding the guard keeps the
fast path defensive.
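The doubling broadcast and its zero-size guard can be sketched in JavaScript with typed arrays standing in for the C++ `memcpy` calls. `broadcastRow` is a hypothetical name and the copy counter exists only to make the call count visible; this is not the code in smolvla.cpp.

```javascript
// Fill `rows` copies of `row` into one buffer using O(log rows) block copies.
function broadcastRow (row, rows) {
  const width = row.length
  const out = new Float32Array(rows * width)
  let copies = 0
  // Guard first: with zero rows the buffer is empty and the seed copy would overflow.
  if (rows === 0 || width === 0) return { out, copies }
  out.set(row, 0) // seed copy of a single row
  copies++
  let filled = 1
  while (filled < rows) {
    // Copy everything filled so far, clamped to the space that remains.
    const n = Math.min(filled, rows - filled)
    out.copyWithin(filled * width, 0, n * width)
    copies++
    filled += n
  }
  return { out, copies }
}
```

For 50 rows this performs 7 copies (1 seed, then 1→2→4→8→16→32→50), which matches the "~7 memcpy calls instead of 50" figure quoted earlier.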
* feat: gate VLA GPU backend selection on Adreno < 800
Mirrors lib-infer-diffusion / qvac-lib-infer-llamacpp-llm: when the loaded
ggml plugins expose an Adreno GPU below the 800 series, fall back to the
CPU backend instead of `ggml_backend_dev_init`-ing it. The Qualcomm
OpenCL ICD on Adreno < 800 has incomplete OpenCL 3.0 support, broken
kernel compilation for several ggml ops, and shared-memory OOMs;
the Vulkan drivers on those generations also misbehave on
some ggml ops. Older Snapdragon devices that get added to the Device
Farm pool will now run on CPU rather than crashing on `init`.
Adds:
- `addon/src/utils/BackendSelection.{hpp,cpp}` with
`parseAdrenoModel(description)` and `pickBestGpuDevice()`. Pure logic,
testable without the JS bridge.
- `test/unit/test_backend_selection.cpp` exercising the Adreno parser
on the description shapes ggml emits ("Adreno (TM) 830", "Adreno 740",
case variations, non-Adreno).
- `smolvla_load_model` now uses `pickBestGpuDevice()` instead of
`ggml_backend_dev_by_type(GPU)`, so Adreno < 800 falls through to
the CPU init below.
Tests: 7/7 C++ unit (was 2), 6/6 JS unit, 4/4 integration; lint clean.
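For readers who want the gist of the parser without opening the C++ sources, here is a JavaScript sketch of the same idea. The regular expression, the `null` return for non-Adreno descriptions, and the `shouldUseGpu` helper are assumptions; the real `parseAdrenoModel` lives in `BackendSelection.cpp` and may differ.

```javascript
// Extract the numeric model from descriptions such as "Adreno (TM) 830".
function parseAdrenoModel (description) {
  const m = /adreno\s*(?:\(tm\)\s*)?(\d+)/i.exec(description)
  return m ? Number(m[1]) : null
}

// Hypothetical gate: non-Adreno devices pass, Adreno needs the 800 series or newer.
function shouldUseGpu (description) {
  const model = parseAdrenoModel(description)
  return model === null || model >= 800
}
```

Under these assumptions `shouldUseGpu('Adreno 740')` is `false`, so the caller would fall through to CPU init.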
* feat: tag VLA perf-report rows with execution provider and ship a
dedicated mobile perf artifact
Without these, the Adreno < 800 gate that just landed has no observable
signature in CI: a Samsung S22/S23 falling from Vulkan to CPU shows up
only as a 5–20× total_ms increase in the perf-report tables, with no
column saying *why*. You'd have to scrape stderr to attribute the
regression. This change closes both gaps.
(a) Backend-name plumbing
- `AddonCpp.hpp::VlaModel::backendName()` returns the ggml backend name
("CPU", "Vulkan", "OpenCL", "Metal", …) via `ggml_backend_name(...)`,
with fallbacks for the unloaded / nameless cases.
- `AddonJs.hpp::getVlaBackendName(handle)` exposes it as a JS string
binding; `binding.cpp` registers it.
- `index.js`: `_load()` reads `binding.getVlaBackendName(this._handle)`
and stashes it in `this._backendName`; `get backendName()` exposes it;
`unload()` clears it.
- `index.d.ts`: documented as `readonly backendName: string | null`.
- `test/integration/addon.test.js`: passes the value as
`execution_provider` to `_perfReporter.record(...)`. Step Summary
tables (and the JSON artifact) now show one of `CPU`/`Vulkan`/
`OpenCL`/`Metal`/`unknown` per row, so a Vulkan→CPU regression is
immediately visible.
(b) Dedicated mobile perf artifact
`integration-mobile-test-qvac-lib-infer-vla.yml` already uploaded
`devicefarm-logs-vla-…` containing everything Device Farm produced, but
the perf-report was buried in there as either a file in
customer-artifacts or a `[PERF_REPORT_*]` marker run on stdout. Added a
post-download step that:
- Walks the downloaded `devicefarm-logs/<platform>` tree.
- First tries to find `perf-report.json` shipped directly as a Device
Farm file artifact (the test writes it to writable paths on Android
/ iOS, which Device Farm packs into customer-artifacts).
- Falls back to single-block `[PERF_REPORT_START]…[PERF_REPORT_END]`
marker scraping.
- Falls back to chunked `[PERF_CHUNK:id:i:n]…` reassembly (sorts by
index, validates the resulting JSON parses).
- Writes `mobile-perf/perf-report-<platform>.json` and uploads it as
artifact `vla-perf-mobile-<platform>` (mirrors the desktop workflow's
`vla-perf-<platform>-<arch>-<os>` naming for symmetry).
- Emits `::warning::` rather than failing the job when no perf data is
found, so this never breaks an otherwise-green CI run.
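The chunked fallback in the list above can be sketched like this. Only the `[PERF_CHUNK:id:i:n]` marker shape comes from the commit text; the assumption that the payload follows the marker on the same log line, and the name `reassemblePerfChunks`, are illustrative.

```javascript
// Marker followed by the payload fragment; any log prefix before it is ignored.
const CHUNK = /\[PERF_CHUNK:([^:\]]+):(\d+):(\d+)\](.*)$/

function reassemblePerfChunks (lines) {
  const groups = new Map()
  for (const line of lines) {
    const m = CHUNK.exec(line)
    if (!m) continue
    const [, id, index, total, payload] = m
    if (!groups.has(id)) groups.set(id, { total: Number(total), parts: [] })
    groups.get(id).parts.push({ index: Number(index), payload })
  }
  const reports = []
  for (const { total, parts } of groups.values()) {
    if (parts.length !== total) continue // incomplete (or duplicated) report
    parts.sort((a, b) => a.index - b.index)
    try {
      reports.push(JSON.parse(parts.map((p) => p.payload).join('')))
    } catch {
      // Reassembled text is not valid JSON: skip it rather than fail the job.
    }
  }
  return reports
}
```

Sorting by index before joining matters because log lines from a device are not guaranteed to arrive in emission order.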
Verified: lint clean, 6/6 JS unit, 4/4 JS integration, 7/7 C++ unit;
workflow YAML parses.
* fix: restore per-step KV cache upload in VLA ODE loop
Earlier perf #4 dropped the per-step ggml_backend_tensor_set for the
KV cache inputs on the assumption that ggml_set_input + the sched
allocator preserves input slots between ggml_backend_sched_graph_compute
calls. That holds for sched-managed multi-backend setups (where Tesla
T4 + Vulkan still produces cos_sim=0.99999 / max|Δ|=0.020 vs the
PyTorch reference), but it breaks two paths that actually run in CI:
- CPU-only (alloc_staged_simple → ggml_gallocr → graph_compute)
reuses input slots across compute calls, so steps 1–9 read garbage
KV.
- Adreno Vulkan on the Samsung S25 Ultra device farm slot has the
same effective semantics (Adreno Vulkan driver) and crashed the
addon test with the same divergence pattern.
Symptom on linux-x64 / linux-arm64 GitHub-hosted runners (CPU backend):
cos_sim = 0.3135 (threshold > 0.9), max|Δ| = 1.65 (threshold < 0.25).
Restoring the per-step upload unconditionally trades ~80 MB of H2D
traffic per inference on Vulkan-sched setups for correctness on every
backend. A conditional restore (skip on sched paths) would recover
that perf, but the branch isn't worth the correctness risk in this
PR.
* test: pin bare-tls/bare-https to 2.x for VLA mobile tests
bare-tls@3.0.0 (published 2026-04-28) flips on default certificate
verification with the commit "Load default trust store and reject
untrusted certificates by default", and bare-https@3.0.0 (same day)
widens its dep from bare-tls@^2.0.0 to ^3.0.0. With no populated
trust store inside the Bare Android/iOS runtime, every TLS handshake
to the SmolVLA presigned S3 URL fails:
[vla-model] downloading: https://tether-ai-dev.s3.eu-central-1...
[vla-model] retry 1/2 after 500ms (last: CERTIFICATE_VERIFY_FAILED: Handshake failed)
not ok 1 - mobile model fetch failed
runAddonTest: FAIL (3/4 passed)
Confirmed across both Pixel 9 Pro and Samsung Galaxy S25 Ultra on
runs 25066695862 and 25074966624. Same root cause would hit any
addon whose mobile suite installs after 2026-04-28; NMTCPP and
Parakeet's last green runs predate the publish.
Pin both packages to the highest published 2.x (2.2.3 / 2.1.3) via
npm overrides until upstream ships a CA-bundle-aware bare-tls. If
the npm install layer is what bare-pack resolves at app-build time,
this restores the previous (non-validating) behavior and unblocks
mobile CI; if BareKit's baked-in bare-tls wins instead, we'll see
the same handshake error and need a runtime-level fix.
* Revert "test: pin bare-tls/bare-https to 2.x for VLA mobile tests"
The override block placed in this addon's package.json had no effect
on the failing mobile run (25092791397 logcat shows the same
CERTIFICATE_VERIFY_FAILED). The reason is that bare-link / bare-pack
both run from tetherto/qvac-test-addon-mobile's node_modules at
app-build time, and npm's `overrides` only apply in the root project
of `npm install` — when this addon is installed transitively from
that repo, the overrides are silently dropped.
The fix lives in tetherto/qvac-test-addon-mobile#38 instead. Reverting
here to keep dead config out of the addon.
* refactor: rename packages/qvac-lib-infer-vla -> packages/vla
Match the directory name to the npm package name (`@qvac/vla`),
mirroring the diffusion-cpp rename done in #1786. The previous
`packages/qvac-lib-infer-vla` carried over from the lib-infer-*
naming era and no longer matched what gets published.
Renamed:
- packages/qvac-lib-infer-vla/ -> packages/vla/
- .github/workflows/on-pr-ocr-onnx.yml -> on-pr-vla.yml
- .github/workflows/integration-mobile-test-...vla.yml -> integration-mobile-test-vla.yml
- .github/workflows/integration-test-...vla.yml -> integration-test-vla.yml
- .github/workflows/on-merge-...vla.yml -> on-merge-vla.yml
- .github/workflows/on-pr-close-...vla.yml -> on-pr-close-vla.yml
- .github/workflows/prebuilds-...vla.yml -> prebuilds-vla.yml
`on-pr-ocr-onnx.yml` was the source of yesterday's pull_request_target
mix-up — its content is the VLA workflow but the filename meant
GitHub kept resolving the OCR workflow from main on PR events.
Renaming it to `on-pr-vla.yml` fixes that.
Updated path/slug references inside workflows + package metadata:
- `packages/qvac-lib-infer-vla` -> `packages/vla`
- artifact prefix `qvac-lib-infer-vla-` -> `vla-`
- `package-slug: qvac-lib-infer-vla` -> `vla`
- `package.json` `repository.directory` + `homepage`
- `vcpkg.json` top-level `name`
- perf reporter addon name in `test/integration/addon.test.js`
- SKILL.md references in `packages/ocr-onnx/.agent/`
Kept (mirroring diffusion-cpp's rename):
- C++ internal symbols (`BARE_MODULE("qvac-lib-infer-vla", ...)`,
`add_bare_module(qvac-lib-infer-vla ...)` in CMakeLists). These
are stable native-binding identifiers, not paths.
* refactor: keep on-pr-ocr-onnx.yml filename until tmp-vla merges to main
Reverting just the `on-pr-ocr-onnx.yml` -> `on-pr-vla.yml` rename
from the previous commit. Reason: GitHub Actions requires
`workflow_dispatch` workflow files to exist on the default branch
to be registered; until tmp-vla lands in main, the new
`on-pr-vla.yml` is unknown to the API and `gh workflow run` 404s.
Keeping the file at the historical `on-pr-ocr-onnx.yml` path on
tmp-vla means:
- `gh workflow run on-pr-ocr-onnx.yml --ref tmp-vla` continues to
work (it was the dispatch target throughout this branch).
- The file's *content* is still the VLA workflow as before; only
the filename is preserved for dispatch compatibility.
The proper rename to `on-pr-vla.yml` should be a follow-up PR opened
after tmp-vla is merged into main, mirroring the timing diffusion-cpp
used in #1786 (the rename happened on main, where its workflows were
already registered). Other workflow renames in this branch
(integration-test-vla, on-merge-vla, prebuilds-vla, etc.) are kept
because they're consumed via `uses:` from the dispatch workflow, not
dispatched directly — file existence on the default branch isn't
required for those.
* feat: run VLA integration tests on CPU and GPU side-by-side
Add a `backend` matrix dimension to integration-test-vla and
integration-mobile-test-vla so every GPU-equipped runner is
exercised twice — once with the runner's preferred accelerator
(Metal / Vulkan) and once forced onto CPU. Result: a clean
per-platform "GPU vs CPU" delta in the perf-report artifact set
for the same hardware, the same model, the same test vector.
Plumbing:
- smolvla.cpp: read VLA_FORCE_CPU env var (any non-empty,
non-"0" value) before vla_backend_selection::pickBestGpuDevice.
When set, skip GPU pick and fall through to the existing CPU
init path. One getenv + one if-guard.
- integration-test-vla.yml: dual rows for ai-run-linux-gpu /
mac-mini-m4 / ai-run-windows11-gpu (the runners with a real
GPU). Linux arm64 + Linux x64 hosted + macOS x64 hosted have
no GPU prebuild; one row each (auto == cpu effectively).
`VLA_FORCE_CPU` plumbed via env: matrix.backend == 'cpu'.
perf-report artifact name now includes the backend so both
rows of the same os land separate files.
- integration-mobile-test-vla.yml: 4 rows total (Android+iOS
× auto+cpu). The bundled smolvla-urls.json now carries a
`forceCpu` flag derived from matrix.backend, since env vars
don't propagate to BareKit's child process the way they do
on desktop. devicefarm-logs and vla-perf-mobile artifact
names include the backend.
- addon.test.js: when running on mobile, read forceCpu from the
bundled config and set process.env.VLA_FORCE_CPU before
VlaModel.load(). The C++ side reads the env identically on
every platform.
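The "any non-empty, non-"0" value" rule for `VLA_FORCE_CPU` is small enough to state as a predicate. This JavaScript version is only a sketch of the rule as worded above; the actual check is a `getenv` in smolvla.cpp, and `isForceCpu` is a hypothetical name.

```javascript
// True when the variable is set to anything other than '' or '0'.
function isForceCpu (value) {
  return typeof value === 'string' && value !== '' && value !== '0'
}
```

On the mobile path described above, the test would set `process.env.VLA_FORCE_CPU` from the bundled `forceCpu` flag so that a check of this shape sees the same input on every platform.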
Cost:
- +5 desktop matrix rows (-> 10 total). Three new GPU runners
× ~5 min each = ~15 extra runner-minutes per CI cycle.
- +2 mobile matrix rows (-> 4 total). Doubles Device Farm spend
for VLA mobile, but VLA mobile only ran one config before so
this is the first time we'll see CPU vs GPU on phone.
Notable: Pixel 9 Pro's Adreno 730 already falls through to CPU
under `auto` (gated by Adreno < 800 in BackendSelection.cpp), so
its `cpu` row is redundant in practice. Kept for matrix symmetry
and uniform artifact set; can be pruned later if Device Farm
spend matters.
* refactor: run VLA CPU/GPU comparison in one process per runner
Replace the workflow-level `backend: [auto, cpu]` matrix with an
explicit `backend` argument on `VlaModel.load()`. The integration
test now loads + runs the model twice in a single Bare process —
once on the runner's preferred backend (Metal/Vulkan/Adreno/…) and
once forced onto CPU — so each CI runner produces one perf-report
artifact carrying both rows. Halves CI runner-minutes, drops the
duplicated model download/install, and gives a single artifact per
host with a clean side-by-side comparison.
JS surface:
- `VlaModel.load({ backend: 'auto' | 'cpu' })`. Default `'auto'`.
- Plumbed into `binding.createVlaModel(ggufPath, backend)` →
`VlaModel(ggufPath, forceCpu)` → `smolvla_load_model(..., force_cpu)`.
C++:
- `smolvla_load_model` gains an explicit `bool force_cpu` parameter;
`pickBestGpuDevice` is skipped when set. The `VLA_FORCE_CPU` env-var
fallback is removed — the param is the only knob now.
Test:
- addon.test.js loops `['auto', 'cpu']` inside the same e2e test.
Each iteration owns its own VlaModel and `unload()`s before the
next one starts, so memory-constrained mobile devices don't hold
two copies of the weights at once. Two perf-report rows per
artifact, distinguished by both `test` name and `execution_provider`.
CI:
- integration-test-vla.yml drops the `backend` matrix dimension —
7 rows total (3 GPU runners + 4 CPU-only) instead of 10 (3 GPU runners × 2 + 4 CPU-only × 1).
- integration-mobile-test-vla.yml drops the dual-row mobile matrix
(4 → 2). The `forceCpu` field in `smolvla-urls.json` is gone since
the bundled config no longer needs to communicate the backend choice.
- Artifact names lose the `-${backend}` suffix.
Verified locally on linux-x64 (Vulkan): auto=2.55s, cpu=10.4s; both
rows quality-clean (cos sim ≈ 1.0 vs PyTorch reference).
* fix: surface VLA mobile perf-report (mirror OCR's working path)
Two pre-existing breakages converged to give us empty
`vla-perf-mobile-*` artifacts on every prior run:
1. addon.test.js's mobile inline reporter only flushed via
`process.on('exit')`. On Device Farm the BareKit-hosted process is
torn down before that handler fires, so the
`[PERF_REPORT_START]…[PERF_REPORT_END]` markers never reach
logcat / iOS console — and the perf-report.json file is never
written to the device.
2. The workflow's inline Node extractor only handled clean text. It
didn't strip the Android logcat line prefix
(`MM-DD HH:MM:SS.mmm PID TID …:`) or the BareKit ReactNativeJS
bridge wrapper (`'[Bare]', '...'`), so even when chunked markers
*did* land in a log they failed to parse.
Replicate OCR's canonical mobile perf-report path:
- addon.test.js: after each `_perfReporter.record(...)` on mobile,
call `writeReport()` + `writeToConsole()` immediately, mirroring
packages/ocr-onnx/test/integration/utils.js. The exit-handler
flush stays for desktop. Each call is idempotent — overwriting
the file with N records is fine since the report is cumulative.
- integration-mobile-test-vla.yml: replace the inline Node
extractor with a call to `scripts/perf-report/extract-from-log.js`
(the same script OCR mobile uses). It already handles logcat
prefix stripping, ReactNativeJS bridge unwrapping, JS-string
`\'` escapes, chunk reassembly, and `schema_version` validation.
Verified locally (linux-x64) that the test still emits the
two-backend perf-report with both rows; quality unchanged.
* fix: render VLA quality Step Summary table correctly
Two bugs in the quality table emitted to GITHUB_STEP_SUMMARY:
1. The `Max |Δ|` and `Mean |Δ|` column labels contain literal pipe
characters that markdown parses as column separators, so the
3-column quality table was rendered as if it had 5 columns. Escape
the pipes (`\|`) so they render as text.
2. Cosine similarity was rendered with `(v * 100).toFixed(1) + '%'`,
which collapses any value at or above ~0.9995 to "100.0%" and loses
the precision that makes the metric useful for spotting regressions.
Add a `cos-sim` column unit that prints raw `toFixed(8)`
(e.g. `0.99999999`) so identical-looking near-perfect runs stay
distinguishable.
Applies to both the desktop reporter (writeStepSummary) and the
mobile render-step-summary script.
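Both fixes are easy to see in isolation. The helper names below (`escapePipes`, `renderRow`, `asPercent`, `asCosSim`) are hypothetical and not taken from the reporter scripts; the formatting expressions are the ones quoted in this message.

```javascript
// Escape literal pipes so markdown does not treat them as column separators.
const escapePipes = (s) => s.replace(/\|/g, '\\|')
const renderRow = (cells) => '| ' + cells.map(escapePipes).join(' | ') + ' |'

// Old rendering: one decimal of a percentage.
const asPercent = (v) => (v * 100).toFixed(1) + '%'
// New rendering: raw value with eight decimals.
const asCosSim = (v) => v.toFixed(8)

console.log(renderRow(['Max |Δ|', 'Mean |Δ|', 'cos-sim']))
// | Max \|Δ\| | Mean \|Δ\| | cos-sim |
console.log(asPercent(0.9996), asPercent(0.99999999)) // 100.0% 100.0%
console.log(asCosSim(0.9996), asCosSim(0.99999999)) // 0.99960000 0.99999999
```

The two cosine values are indistinguishable as percentages but clearly separated once printed with `toFixed(8)`.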
* feat: render mobile VLA perf-report into GitHub Step Summary
The mobile job uploaded `vla-perf-mobile-Android` for the first time
on commit f41a0f3c, but nothing was rendering it into the Actions
Step Summary tab — so the per-device CPU-vs-GPU table only showed
up for desktop runners. Wire `scripts/perf-report/render-step-summary.js`
into the mobile workflow so each device's report (Pixel 9 Pro,
Galaxy S25 Ultra, …) emits the same compact markdown table the
desktop reporter writes.
`extract-from-log.js` writes per-device subdirs when Device Farm
runs more than one phone in the pool, so the new step loops over
every `performance-report.json` under `mobile-perf/` and appends a
fresh table per device, matching OCR's mobile pattern.
* feat: optimize VLA inference with op fusion and KV-projection hoist
Three measurable graph-level changes in `build_transformer_layer` and
`build_denoise_step_graph`, validated against the existing PyTorch
reference (`pt_actions_libero_fixed.json`, 350 values):
- **Hoist cross-attn K/V projections out of the ODE loop.** The action
expert's `k_proj`/`v_proj` against the VLM KV cache only depend on
inputs that are invariant across the 10 ODE denoise steps. Project
once after SmolLM2 forward and overwrite `kv_keys_data[i]` /
`kv_vals_data[i]` for cross-attn layers in place — eliminates 16
layers x 9 redundant steps = 144 matmul-pairs per inference.
- **Replace `scale -> +mask -> soft_max` triples with `ggml_soft_max_ext`**
at the 4 live attention sites. Bit-for-bit equivalent, fewer graph
nodes, helps backends with non-trivial kernel-launch overhead.
- **Replace `silu(gate) * up` with `ggml_swiglu_split`** at the 2 live
SwiGLU MLP sites.
Final cumulative speed (warm bench, median of iter 2-5, vs baseline tip):
| Backend | total baseline | total final | Delta |
|---|---:|---:|---:|
| auto (Vulkan / Intel Iris Xe) | 2345 ms | 2247 ms | -4.2% |
| cpu | 10084 ms | 9921 ms | -1.6% |
ODE inner loop specifically: -6.9% auto, -2.6% cpu - that's where the
cross-attn KV hoist lands. Accuracy unchanged: max|delta|=0.0032 auto /
0.0009 cpu, cos=1.00000.
Also adds:
- `test/bench.js`: warm-bench harness (loads model once, runs N
inferences, reports per-stage min/med/max). Single-run integration
timings showed up to 2x variance from system load on this dev box,
which makes them unsuitable for A/B comparison.
- `test/unit/test_flash_attn.cpp`: gtest comparing `ggml_flash_attn_ext`
against the unfused reference on synthetic Q/K/V at the SmolLM2
prefill shapes. Documents the **F16-mask + `GGML_PREC_F32` recipe**
required to call flash-attn correctly (F32 mask is silently accepted
but produces structured-but-shifted output, cos~0.28). The recipe
works correctness-wise; it's currently 3x slower than the unfused
matmul on Intel Iris Xe Vulkan (no matrix cores) but plausibly faster
on Adreno/Metal. To be re-evaluated on the mobile device farm before
enabling, ideally gated on `has_matrix_cores`.
- `opt.md`: per-optimization log with implementation, accuracy, speed,
and the failed/skipped attempts (drop-GQA-repeat broke CPU mul_mat
broadcast; time-MLP split linears regress on strided weight matmul;
flash-attn-ext requires F16 mask, see above).
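The warm-bench idea described for `test/bench.js` (discard the cold first iteration, then summarize the rest) might look roughly like the sketch below. It is not the actual harness; `warmBench`, `summarize`, and the use of `Date.now()` for timing are assumptions made for illustration.

```javascript
// Median of a numeric array (does not mutate the input).
function median (values) {
  const sorted = [...values].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

// Drop the first `warmup` timings and report min / med / max of the rest.
function summarize (timings, warmup = 1) {
  const warm = timings.slice(warmup)
  return { min: Math.min(...warm), med: median(warm), max: Math.max(...warm) }
}

// Run `runOnce` several times against an already-loaded model and
// summarize iterations 2..N (iteration 1 is treated as warm-up).
async function warmBench (runOnce, iterations = 5) {
  const timings = []
  for (let i = 0; i < iterations; i++) {
    const start = Date.now()
    await runOnce()
    timings.push(Date.now() - start)
  }
  return summarize(timings)
}

// Made-up timings in milliseconds; the first entry is the cold run.
console.log(summarize([900, 100, 110, 105, 120]))
```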
* fix[ci]: address HIGH security findings in vla CI workflows
- prebuilds-vla.yml: drop unconditional `printenv` step that dumped
AWS_OIDC_ROLE_ARN, NPM_TOKEN, PAT_TOKEN, and other resolved env-var
secrets to public CI logs.
- integration-test-vla.yml: drop `npm config list` from the run-state
diagnostics; it printed the just-written .npmrc, leaking the npm and
GPR _authToken values. Replaced `npm list` with `npm list --depth=0`
to keep dependency visibility without the dump.
- integration-test-vla.yml, cpp-tests-vla.yml: route ${{ github.token }}
through a `GH_TOKEN` env var instead of inline shell interpolation in
`git config` invocations, so it gets standard secret masking and
doesn't end up in the runner process listing.
* chore: drop opt.md, untrack vla performance-report.json
- opt.md was a 497-line scratch log of the VLA op-fusion / KV-projection
optimization work. The summary belongs in the PR description, not in
the repo tree.
- packages/vla/test/results/performance-report.json is regenerated by
every CI run and uploaded as a workflow artifact; it has no business
living in source control. Gitignore the directory and stop tracking
the file (file kept on disk for any local working sessions).
* fix: address review quick-wins for vla addon
Correctness:
- action_dim default is now 7 across the C++ hparams struct, the GGUF
fallback, and generate_reference.py. The integration test now hard-fails
on a (chunk_size, action_dim) shape mismatch instead of skipping the
PyTorch quality gate with a comment, so a regression in either side
shows up as a failed assertion. Added an explicit hparams unit-test
assertion for action_dim.
- mmap loader bails out cleanly when ggml_backend_tensor_alloc fails for
any tensor: it frees the buffer, munmaps the file, and falls through
to the alloc+copy path instead of leaving partially-wired tensors with
invalid pointers and pretending success.
- smolvla_inference_with_timing rejects out-of-range n_images, lang_len,
and state_dim before they feed into n_visual_tokens / prefix_len /
tensor sizing, where bad values would underflow int math and cause
out-of-bounds writes during graph build.
Security:
- mmap loader validates every per-tensor (offset, nbytes) against the
mapped region before wiring, so a crafted GGUF cannot point a tensor
past the end of the mapping.
- Mobile workflow builds smolvla-urls.json with `jq` so the presigned
URL cannot break out of its JSON string, and replaces the partial
`head -c 120` echo (which leaked the bucket host and X-Amz-Credential
prefix) with a byte-count confirmation.
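The per-tensor `(offset, nbytes)` check in the mmap loader is C++, but the shape of an overflow-safe bounds test can be sketched in JavaScript. The function name and the use of BigInt are assumptions for illustration; the point is comparing `nbytes` against the remaining bytes rather than computing `offset + nbytes`.

```javascript
// Returns true only when [offset, offset + nbytes) lies inside a mapping
// of `mappedSize` bytes. BigInt keeps 64-bit file offsets exact.
function tensorRangeIsValid (offset, nbytes, mappedSize) {
  const off = BigInt(offset)
  const len = BigInt(nbytes)
  const size = BigInt(mappedSize)
  if (off < 0n || len < 0n) return false
  if (off > size) return false
  // Compare against the bytes that remain after `off` instead of adding
  // off + len, which is the form that cannot wrap in fixed-width code.
  return len <= size - off
}

console.log(tensorRangeIsValid(0, 10, 10)) // range exactly fills the mapping
console.log(tensorRangeIsValid(5, 6, 10)) // range runs one byte past the end
```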
Performance:
- Precompute the sinusoidal time-embedding period table at load time.
The per-ODE-step embedding now does 360 multiply / sinf / cosf
operations against the cached table instead of also paying for 360 powf
evaluations per step (~3,600 powf calls eliminated per inference across
the 10 ODE steps). Hint the kernel with MADV_WILLNEED on the
zero-copy mmap path so first inference doesn't demand-page through
the 2+ GB GGUF.
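A minimal sketch of the period-table precompute, in JavaScript rather than the addon's C++. The embedding formula, `MIN_PERIOD`, and `MAX_PERIOD` are assumptions chosen for illustration; only the count of 360 entries comes from the text above.

```javascript
const HALF_DIM = 360 // matches the 360 evaluations per step cited above
const MIN_PERIOD = 4e-3 // assumed value, for illustration only
const MAX_PERIOD = 4.0 // assumed value, for illustration only

// Load time: pay for the pow calls once and cache 2*pi / period.
function buildScaleTable (halfDim, minPeriod, maxPeriod) {
  const table = new Float64Array(halfDim)
  for (let i = 0; i < halfDim; i++) {
    const fraction = halfDim > 1 ? i / (halfDim - 1) : 0
    const period = minPeriod * Math.pow(maxPeriod / minPeriod, fraction)
    table[i] = (2 * Math.PI) / period
  }
  return table
}

// Per ODE step: only a multiply, a sin, and a cos per table entry.
function timeEmbedding (t, table) {
  const out = new Float64Array(table.length * 2)
  for (let i = 0; i < table.length; i++) {
    const angle = t * table[i]
    out[i] = Math.sin(angle)
    out[table.length + i] = Math.cos(angle)
  }
  return out
}

const scaleTable = buildScaleTable(HALF_DIM, MIN_PERIOD, MAX_PERIOD)
const embedding = timeEmbedding(0.5, scaleTable)
console.log(embedding.length, 'values;', HALF_DIM * 10, 'pow calls avoided over 10 steps')
```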
Dead code:
- Drop the unused smolvla_rope helper (whose comment claimed RoPE mode 0
while the body called NEOX), the unused to_bf16_precision helper, and
the leaky run_graph stub in test_flash_attn.cpp.
* refactor: adopt QvacErrorBase / ERR_CODES pattern in vla addon
Every other inference addon (parakeet, whispercpp, nmtcpp, ocr-onnx,
onnx-tts, llamacpp-llm, …) ships a lib/error.js with a package-specific
QvacErrorBase subclass and a frozen ERR_CODES map registered with
@qvac/error. VLA was the only one still throwing bare Error / TypeError /
RangeError, which prevents callers from branching on err.code and
breaks the localized message registry.
Adds packages/vla/lib/error.js with QvacErrorAddonVla and 9 codes in
the previously-unused 30001..31000 range:
FAILED_TO_LOAD_WEIGHTS, FAILED_TO_DESTROY, MODEL_NOT_FOUND,
INVALID_CONFIG, MISSING_REQUIRED_PARAMETER, INVALID_INPUT,
JOB_ALREADY_RUNNING, INSTANCE_NOT_INITIALIZED, MODEL_UNLOADED.
index.js threads structured errors through the public surface: input
validation in validateRunInput now throws INVALID_INPUT; constructor
files.model checks raise MISSING_REQUIRED_PARAMETER / INVALID_CONFIG;
load() backend validation raises INVALID_CONFIG; binding load failures
are wrapped as FAILED_TO_LOAD_WEIGHTS with `cause` preserving the
underlying error; binding.destroyVlaModel failures during unload now
raise FAILED_TO_DESTROY instead of being swallowed; run-before-load and
run-while-busy raise INSTANCE_NOT_INITIALIZED and JOB_ALREADY_RUNNING;
in-flight jobs cancelled by unload see MODEL_UNLOADED on the failure
side. ERR_CODES and QvacErrorAddonVla are exported alongside VlaModel,
matching the OCR / parakeet pattern.
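To show what branching on `err.code` buys a caller, here is a self-contained sketch of the pattern. It deliberately does not import `@qvac/error`; `QvacErrorBaseSketch`, `QvacErrorAddonVlaSketch`, and the numeric code values are hypothetical stand-ins, since the real base-class API and code assignments are not shown in this message.

```javascript
// Hypothetical stand-in for the @qvac/error base class; the real API may differ.
class QvacErrorBaseSketch extends Error {
  constructor (code, message, options) {
    super(message, options)
    this.code = code
  }
}

// Illustrative values inside the 30001..31000 range named above; the
// actual numbers in lib/error.js are an assumption here.
const ERR_CODES = Object.freeze({
  FAILED_TO_LOAD_WEIGHTS: 30001,
  INVALID_INPUT: 30006
})

class QvacErrorAddonVlaSketch extends QvacErrorBaseSketch {}

// Wrap a binding failure and keep the underlying error as `cause`.
function loadWeights (bindingLoad) {
  try {
    return bindingLoad()
  } catch (err) {
    throw new QvacErrorAddonVlaSketch(
      ERR_CODES.FAILED_TO_LOAD_WEIGHTS,
      'failed to load weights',
      { cause: err }
    )
  }
}

// Caller side: branch on the structured code instead of the message text.
try {
  loadWeights(() => { throw new Error('gguf parse failed') })
} catch (err) {
  if (err.code === ERR_CODES.FAILED_TO_LOAD_WEIGHTS) {
    console.log('load failed:', err.cause.message)
  }
}
```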
index.d.ts gains the QvacErrorAddonVla class and ERR_CODES literal-type
map. package.json declares @qvac/error ^0.1.0 as a dependency and adds
lib/ to the published files list.
Existing test assertions on /non-empty array/ and /absolute path/
continue to match the new structured messages — verified by running
test:unit (6/6 pass), test:integration sans GGUF (4/4 pass), and
test:dts.
* test: switch vla integration fixture to vision-Q8-quantized GGUF
Bumps the integration-test model from smolvla-libero-f32-fixed.gguf
(2026-04-21) to smolvla-libero-vision-q8.gguf (2026-04-30) — same
LIBERO checkpoint with Q8_0 quantization on the vision-encoder linear
weights. Cuts vision-stage time roughly in half on Vulkan and by ~4× on
CPU (see test/results/perf reports).
Q8 on the vision encoder occasionally flips the gripper dim (action[6],
near-binary in [-1, 1]) at decision boundaries on the synthetic gray
fixture — measured max |Δ| ~0.6 on Vulkan, ~1.2 on CPU. Position /
rotation dims stay tight (mean |Δ| ≈ 0.01). LIBERO closed-loop eval
shows comparable task success (70% for the Q8-vision GGUF vs 60% for the
F32 GGUF across 30 episodes, within statistical noise). Tolerances loosen
to max |Δ| < 1.5 to absorb gripper sign flips, with cosine > 0.95 kept as
the structural sanity check.
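The two-part quality gate (max |Δ| plus cosine similarity) can be sketched as below. The thresholds 1.5 and 0.95 come from the text above; the function names and the strict-inequality choice are assumptions, and this is not the integration test's actual code.

```javascript
// Cosine similarity between two equal-length numeric arrays.
function cosineSimilarity (a, b) {
  let dot = 0
  let na = 0
  let nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / Math.sqrt(na * nb)
}

// Largest elementwise absolute difference.
function maxAbsDelta (a, b) {
  let max = 0
  for (let i = 0; i < a.length; i++) {
    max = Math.max(max, Math.abs(a[i] - b[i]))
  }
  return max
}

// Hard-fail on a shape mismatch, then apply both thresholds.
function passesQualityGate (actual, reference, { maxDelta = 1.5, minCos = 0.95 } = {}) {
  if (actual.length !== reference.length) {
    throw new RangeError(`shape mismatch: ${actual.length} vs ${reference.length}`)
  }
  return maxAbsDelta(actual, reference) < maxDelta &&
    cosineSimilarity(actual, reference) > minCos
}

console.log(passesQualityGate([0.5, 0.4, 1.0], [0.5, 0.4, 1.0]))
```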
Updates the S3 path in integration-test-vla.yml and the mobile presign
script to match.
* fix[ci]: prevent artifact poisoning in vla integration workflows
CodeQL (rule "Artifact poisoning") flagged 19 alerts on the VLA
workflows: actions/download-artifact was writing directly into the
workspace path (packages/vla/prebuilds, addon/packages/vla/prebuilds),
and subsequent steps (npm install, npm run bundle, npm run build:pack,
xcodebuild, npm run test:integration, …) execute code from that same
workspace. Combined with workflow_dispatch.inputs being user-controlled,
that's a path for a poisoned artifact to land code that then runs with
the workflow's secrets.
Fix mirrors the pattern PR #1728 applied to OCR / parakeet / nmtcpp /
diffusion / etc.: download into a runner.temp staging directory, then
add an explicit copy step to move the contents into the workspace.
CodeQL recognises the explicit cp as a maintainer-controlled boundary
and stops the dataflow trace.
Touches three download-artifact sites:
- integration-test-vla.yml: prebuilds → workspace
- integration-mobile-test-vla.yml: Android prebuilds → workspace
- integration-mobile-test-vla.yml: iOS prebuilds → workspace
* feat: add LIBERO sim eval driver + QVAC HTTP bridge under packages/vla/sim
Drops in a self-contained eval pipeline that scores SmolVLA on LIBERO
through either the QVAC GGUF addon (over HTTP) or the original PyTorch
policy, so the two are directly comparable on the same env seeds and
noise sequence.
Files:
packages/vla/sim/eval_libero_sim.py Python entry, --backend {qvac,pytorch}
packages/vla/sim/qvac_http_policy.py lerobot SmolVLAPolicy subclass that
routes the forward pass over HTTP
packages/vla/sim/smolvla_http.py binary-protocol HTTP client
packages/vla/sim/server/server.js Bare HTTP host for @qvac/vla
packages/vla/sim/server/package.json server runtime deps
packages/vla/sim/requirements.txt pinned Python deps (lerobot, libero,
robosuite, mujoco, etc.)
packages/vla/sim/README.md setup + run + compare runbook
Verified end-to-end on libero_spatial (10 tasks x 3 episodes = 30):
QVAC F32 GGUF (Vulkan): 18/30 = 60.0%
QVAC Q8 vision (Vulkan): 21/30 = 70.0%
PyTorch (CUDA): 21/30 = 70.0%
All within the n=30 noise band; Q8-vision matches PyTorch task-for-task on
9/10. lerobot itself is unmodified — the bridge works through its
public make_policy extension point + a Python class swap.
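One back-of-envelope way to see why 60% and 70% at n=30 sit inside the same noise band is the binomial standard error. The sketch below is an illustration added here, not part of the eval driver, and it assumes independent episodes.

```javascript
// Binomial standard error of a success rate p measured over n episodes.
function standardError (p, n) {
  return Math.sqrt(p * (1 - p) / n)
}

const episodes = 30
const f32Rate = 18 / episodes // 60.0%
const q8Rate = 21 / episodes // 70.0%

const seF32 = standardError(f32Rate, episodes)
const seQ8 = standardError(q8Rate, episodes)
// Standard error of the difference between two independent rates.
const seDiff = Math.sqrt(seF32 * seF32 + seQ8 * seQ8)

console.log('difference:', (q8Rate - f32Rate).toFixed(3))
console.log('standard error of difference:', seDiff.toFixed(3))
```

Under that assumption the 10-point gap is smaller than one standard error of the difference, so 30 episodes cannot separate the two backends.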
* chore: drop new-addon skill from vla branch
The new-addon skill scaffolding (added in earlier tmp-vla commits) is
unrelated to the SmolVLA addon work in PR #1784 and was being carried
along by accident. Removing it from this branch so the PR diff focuses
on the vla addon and the LIBERO sim eval driver only.
The skill itself can be re-introduced on its own branch / PR if still
wanted.
* chore: drop test_flash_attn.cpp + tighten the comment that referenced it
The attention path uses unfused mul_mat → soft_max_ext → mul_mat. The
flash-attn alternative was ~3× slower per layer on Intel Iris Xe Vulkan
when measured, so we never wired it into the production path. The test
existed only to keep a "side-by-side correctness vs the unfused path"
harness around in case we wanted to re-evaluate flash-attn on Adreno or
Mali later.
Removing 389 lines of test code that exercises a dead path; the pointer
in smolvla.cpp's attention block is rewritten so it captures the
"measured 3× slower on Iris Xe" finding without referring to the
deleted file.
* fix: address security + correctness findings from code review
Security (4):
* sim/server/server.js: cap request bodies at 32 MB (prevents heap-exhaust DoS
via unbounded POST). Reject early in the data-event handler with
req.destroy() instead of buffering until OOM.
* sim/server/server.js: validate every header field that flows into a typed
array length (state_dim, n_images, img_w, img_h, n_tokens). Without bounds,
a crafted client could ask for state_dim=2**30 and allocate gigabytes
before the C++ side even saw the request. Also bound the JSON header_len
itself to 64 KB and add a body-truncation check after the per-section reads.
* sim/server/server.js: drop model_path from /info response — it leaked the
on-disk GGUF location to anything that could reach the port.
* sim/server/server.js: adopt the published @qvac/vla async API
(`new VlaModel({ files: { model: [...] } })` + `await model.load()` +
`await model.run(...)`). The previous code used an older sync signature
that happened to match the version installed on the dev server but does
not match the API this PR ships, so /predict would 500 on every request
against a fresh install. Server now boots inside an async IIFE that awaits
load() before listen() begins accepting connections.
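The body cap and header-field bounds from the first two items can be sketched without a running server. `makeBodyCollector`, `checkedInt`, and the example bound of 64 are hypothetical; `destroy` stands in for `req.destroy()`, and only the 32 MB figure comes from the text above.

```javascript
const MAX_BODY_BYTES = 32 * 1024 * 1024 // 32 MB cap described above

// Collects 'data' chunks and aborts as soon as the running total
// exceeds `limit`, instead of buffering the whole body first.
function makeBodyCollector (limit, destroy) {
  const chunks = []
  let received = 0
  let aborted = false
  return {
    onData (chunk) {
      if (aborted) return
      received += chunk.length
      if (received > limit) {
        aborted = true
        chunks.length = 0 // release what was buffered so far
        destroy(new Error(`request body exceeds ${limit} bytes`))
        return
      }
      chunks.push(chunk)
    },
    get aborted () { return aborted },
    get chunkCount () { return chunks.length }
  }
}

// Bounds check for a header field that later sizes a typed array.
function checkedInt (name, value, min, max) {
  if (!Number.isInteger(value) || value < min || value > max) {
    throw new RangeError(`${name} out of range: ${value}`)
  }
  return value
}

const collector = makeBodyCollector(MAX_BODY_BYTES, (err) => console.log(err.message))
collector.onData(new Uint8Array(1024))
console.log(collector.aborted, checkedInt('state_dim', 32, 1, 64))
```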
Correctness (3):
* smolvla.cpp: smolvla_create now calls smolvla_free_model() before delete on
load failure. The struct has no destructor, so the previous `delete model`
leaked any backend buffers / mmap regions / ggml contexts / backend handles
that smolvla_load_model had already initialised before failing.
* smolvla.cpp: replace the inline ODE-loop dispatch
(`sg3.sched ? sched_compute : graph_compute(backend_cpu, ...)`) with the
shared compute_staged helper. Avoids the foot-gun of hardcoding backend_cpu
on the fallback branch — if alloc_staged_sched ever returned with
sched==nullptr on a GPU build, the inline form would silently fire CPU
compute on GPU-allocated tensors.
* sim/qvac_http_policy.py: surface a clear RuntimeError when the batch has
no camera images, instead of crashing on `images_chw[-1]` while filling
dummy frames for empty cameras.
Verified:
* C++ rebuild + integration test: 4/4 tests pass, 41/41 asserts. Quality
numbers unchanged (Vulkan max|Δ|=0.588 cos=0.997; CPU max|Δ|=1.131
cos=0.989).
Two reviewer findings were verified as non-issues and intentionally not
fixed: the pos_ids = -1 bug doesn't trigger because n_images>=1 is enforced
upstream (so n_visual_tokens >= 64, so pos >= 64 before the lang loop), and
the GGUF mmap data_offset overflow is already caught by the existing strict
`<` check against st.st_size.
* fix: server.js — use response.await() pattern + opts.stats:true
Two issues introduced by the previous review-fix commit (43f1f875):
1. `model.run()` returns a QvacResponse, not `{ actions, stats }`. The
destructure was awaiting the call once and pulling `actions`/`stats`
directly off the response object, but those fields don't exist on
QvacResponse — they live behind `response.await()`. Result: every POST
/predict crashed encodeResponse with `Cannot read properties of
undefined (reading 'buffer')`. Switching to the canonical two-step
p…
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
Earlier perf #4 dropped the per-step ggml_backend_tensor_set for the KV cache inputs on the assumption that ggml_set_input + the sched allocator preserves input slots between ggml_backend_sched_graph_compute calls. That holds for sched-managed multi-backend setups (where Tesla T4 + Vulkan still produces cos_sim=0.99999 / max|Δ|=0.020 vs the PyTorch reference), but it breaks two paths that actually run in CI:
- CPU-only (alloc_staged_simple → ggml_gallocr → graph_compute) reuses input slots across compute calls, so steps 1–9 read garbage KV.
- Adreno Vulkan on the Samsung S25 Ultra device farm slot has the same effective semantics (Adreno Vulkan driver) and crashed the addon test with the same divergence pattern.
Symptom on linux-x64 / linux-arm64 GitHub-hosted runners (CPU backend): cos_sim = 0.3135 (threshold > 0.9), max|Δ| = 1.65 (threshold < 0.25).
Restoring the per-step upload unconditionally trades ~80 MB of H2D traffic per inference on Vulkan-sched setups for correctness on every backend. A conditional restore (skip on sched paths) would recover that perf, but the branch isn't worth the correctness risk in this PR.
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
…#1983)
* feat: add @qvac/tts-ggml package (Chatterbox English on qvac-tts.cpp)
New Bare addon wrapping the `qvac-tts::qvac-tts` static library (backed by the `tts-cpp` port added in tetherto/qvac-registry-vcpkg). API-compatible with the Chatterbox engine exposed by `@qvac/tts-onnx` so downstream consumers can swap backends without touching orchestration code.
## Scope
* First iteration. Supports Chatterbox **English** only. Chatterbox multilingual, LavaSR enhancer, Supertonic engine, and streaming are out of scope and remain in `@qvac/tts-onnx`. They'll land alongside the evolution of qvac-tts.cpp.
* Native backend is the static `qvac-tts` library from the QVAC vcpkg registry (`ports/tts-cpp`, baseline `2026-04-21`). No ONNX Runtime dependency.
## JS surface
* `@qvac/tts-ggml` exports `TTSGgml` with the same method shape as `ONNXTTS`: `run` / `runStream` / `runStreaming` / `reload` / `unload` / `destroy`.
* `files: { modelDir }` looks for `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` side-by-side; `files.t3Model` / `files.s3genModel` override the defaults.
* Options: `referenceAudio`, `voiceDir` (baked profile), `seed`, `nGpuLayers`, `threads`, `outputSampleRate`, plus placeholders for the upcoming streaming flags (`streamChunkTokens`, `streamFirstChunkTokens`, `cfmSteps`).
* Shared reusable lib code (`lib/textChunker.js`, `lib/textStreamAccumulator.js`, `addonLogging.*`) is copied verbatim from `@qvac/tts-onnx`.
* New error class `QvacErrorAddonTTSGgml` uses codes **13001–14000** to avoid collisions with `@qvac/tts-onnx` (7001–7011) when both packages are loaded in the same Bare process.
## Native addon
* `addon/src/model-interface/chatterbox/ChatterboxModel.{hpp,cpp}`: `IModel` + `IModelCancel` implementation. First-iteration strategy: assemble argv for `qvac_tts_cli_main` with a scratch `.wav` output path, call it synchronously, then parse the resulting 16-bit mono PCM wav back into `std::vector<int16_t>` for the JS handler.
Consequences: every job re-loads the model (~700 ms + inference time), no mid-synthesis cancellation, no streaming. The follow-up milestone replaces this with a persistent, struct-based API once qvac-tts.cpp exposes one.
* `addon/src/js-interface/{JSAdapter.{hpp,cpp}, binding.cpp}`: JS-to-C++ config bridging (same string-map pattern as `@qvac/tts-onnx`) and the `BARE_MODULE(qvac_tts_ggml, ...)` registration exposing `createInstance` / `runJob` / `reload` / `activate` / `cancel` / `destroyInstance` / `loadWeights` / `setLogger` / `releaseLogger`.
* `addon/src/addon/AddonJs.hpp`: JS-facing `createInstance` / `runJob` / `reload` wrappers that register a `JsAudioOutputHandler` emitting `{ outputArray: Int16Array, sampleRate: number }` to JS.
## Build / registry
* `CMakeLists.txt` uses `find_package(qvac-tts-cpp CONFIG REQUIRED)` and the standard `cmake-bare` + `cmake-vcpkg` scaffolding (shape matches `@qvac/transcription-whispercpp`).
* `vcpkg.json` depends on `tts-cpp` (with a `vulkan` feature passthrough) plus `qvac-lib-inference-addon-cpp`, `qvac-lint-cpp`, and `gtest`.
* `vcpkg-configuration.json` points at tetherto/qvac-registry-vcpkg. NOTE: the baseline pin here is inherited from `@qvac/transcription-whispercpp` and **must be bumped** to a commit that contains the `tts-cpp` port once that registry PR lands. A follow-up commit will update it.
## Tests & examples
* Integration + unit test files for Chatterbox English are copied verbatim from `@qvac/tts-onnx` with only mechanical renames (`ONNXTTS` -> `TTSGgml`, `QvacErrorAddonTTS` -> `QvacErrorAddonTTSGgml`, `@qvac/tts-onnx/text-chunker` -> `../../lib/textChunker.js`). Some paths in `test/integration/addon.test.js` still import Supertonic / LavaSR helpers that don't exist in this package. Those test blocks will fail fast when the file loads, which is expected until those backends get their own ggml packages.
* Examples: `chatterbox-tts.js`, `chatterbox-streaming-tts.js`, plus shared `wav-helper.js` + `pcm-chunk-player.js`.
## What's not in this PR (known gaps)
* No docs: README, NOTICE, CHANGELOG, PULL_REQUEST_TEMPLATE changes will land in a single documentation pass once the registry + fork commits have merged upstream.
* `vcpkg-configuration.json` baseline needs to point at a qvac-registry-vcpkg commit that ships `tts-cpp` (pending the registry PR).
* Actual `npm run build` requires the registry and fork commits to be on `main` of their respective upstream repos.
* chore: point tts-ggml vcpkg baseline at the tts-cpp-bearing registry commit
Bumps `vcpkg-configuration.json` to GustavoA1604/qvac-registry-vcpkg at commit 1e2839680b6be8d8ffff889a9c29b966c176098c, the commit that adds the `tts-cpp` port. Paired with the `qvac-tts` library already pinned in the port's `portfile.cmake` (GustavoA1604/chatterbox.cpp @ 0fe4a521618cc30358040b29d75d4261b31cbb60).
Will be re-pointed at tetherto/qvac-registry-vcpkg once the registry PR lands upstream.
* chore: tts-ggml: trim tests + examples to Chatterbox English, restore mobile wrapper
Second pass over @qvac/tts-ggml after the build started passing: prune everything that only made sense for the ONNX-era multi-engine scope and adapt the remaining Chatterbox-English bits to the GGUF + file-path reference-audio contract. Restores `test/mobile/` so the Android build has something to point at.
## C++
* `ChatterboxModel.cpp`: the `ArgvBuilder::buildArgv` doc comment contained `**/` which closed the block comment early and broke the build. Rewrote as a `//` comment.
## Examples
* `examples/chatterbox-tts.js`: rewrite for v0 contract: single `<text>` argv, `files: { modelDir }` pointing at the two GGUFs, `referenceAudio` is now a wav **path** (addon passes it to `--reference-audio`) instead of a Float32Array. Drops english/multilingual arg and the CHATTERBOX_VARIANT switch that picked which `.onnx` files to load.
* Removed `examples/chatterbox-streaming-tts.js` + `examples/pcm-chunk-player.js`. The v0 addon re-loads the model per `run()` call, so exposing streaming would mislead. Both come back alongside the persistent-engine milestone.
* `package.json`: `npm run example` now passes a default text so it runs without extra args.
## Tests
### Kept as-is (engine-agnostic)
* `test/unit/textChunker.test.js`
* `test/mock/{MockedBinding,utils}.js`
* `test/utils/{wav-helper,pcmConcatenator,loader.fake,runWhisper,runTTS}.js`
* `test/reference-audio/jfk.wav`, `test/data/sentences-*.js`
### Mechanical fixes
* `test/unit/tts.error.test.js`: fix error-code assertions to the tts-ggml range (`13001–14000`); was still checking the `@qvac/tts-onnx` range (`7001–7011`).
* `test/unit/tts-ggml.lifecycle.test.js`: fix stale `QvacErrorAddonTTS` import to `QvacErrorAddonTTSGgml`; switch the stubbed model to `{ t3Model, s3genModel }` GGUFs and drop the non-existent `engine: 'chatterbox'` option.
* `test/unit/tts-ggml.sentence-stream.test.js`: same GGUF/engine cleanup.
### Rewritten
* `test/unit/chatterbox.inference.test.js`: drop tests that asserted the old ONNX file shape (`tokenizer / speechEncoder / embedTokens / conditionalDecoder / languageModel`), the removed `engine` detection and the wrong `getModelKey` return value (`'onnx-tts'` -> `'tts-ggml'`). New tests cover: `modelDir` derives the two GGUF paths; explicit `t3Model` / `s3genModel` override the defaults. The mocked-binding run/reload/cancel flow stays.
* `test/integration/addon.test.js`: fresh, ~180 LoC, Chatterbox-English only. Ensures the GGUFs are present, runs the short sentence set through `loadChatterboxTTS` + `runChatterboxTTS[WithSplit]`, and (on darwin only) runs a whisper-based WER check via the existing `runWhisper` util. Drops the Chatterbox-multilingual block + every Supertonic + LavaSR block that doesn't apply to this package.
* `test/utils/runChatterboxTTS.js`: rewrite for the GGUF contract: `files: { modelDir, t3Model, s3genModel }`, `referenceAudio` as a file path that falls back to `test/reference-audio/jfk.wav` (or the mobile test-asset when `global.assetPaths` is present). No more WAV decode / resample on the JS side.
* `test/utils/downloadModel.js`: trim from 1007 LoC to 280. Drops the Supertonic + LavaSR + Chatterbox-multilingual + Cangjie downloaders. Keeps the shared HTTP/curl infrastructure and `ensureWhisperModel` (still used by the integration WER check). `ensureChatterboxModels` is now **check-only**: it verifies `chatterbox-t3-turbo.gguf` + `chatterbox-s3gen.gguf` exist locally and, if missing, prints the exact commands for generating them from the qvac-tts.cpp (née chatterbox.cpp) conversion scripts. Once the GGUFs land on a canonical HuggingFace repo we'll wire up download URLs here.
## Scripts
* `scripts/ensure-chatterbox.js`: simplify to a single invocation against `./models/`. Drops the variant / language matrix that the ONNX downloader needed.
* `scripts/ensure-models.js`: now a thin alias to `ensure-chatterbox.js`. Drops the Supertonic + LavaSR orchestration.
## Mobile
* Restored `test/mobile/{integration.auto.cjs, integration-runtime.cjs, testAssets/jfk.wav}` so the Android build has a wrapper to point at.
* `package.json`: re-added `test/mobile` to the `files` list.
## Gitignore
* Ignore generated `.clang-format` / `.clang-tidy` / `.valgrind.supp` (produced by the top-level `configure_file(...)` calls) and `build_*/` dirs (bare-make convention).
## Verified locally
* `npx standard "test/**/*.js" "*.js" "lib/*.js"`: clean.
* `npm run test:unit`: 38/38 pass (105/105 asserts).
* `npm run build && bare examples/chatterbox-tts.js "Hello from qvac tts ggml."` produces a 24 kHz wav as expected.
* Add streaming support
* Update ggml backend to use separate ggml repo
* tts-ggml: consume renamed tts-cpp library (2026-04-24#1)
Upstream chatterbox.cpp renamed the package + namespace + target from qvac-tts to tts-cpp and tightened the library boundary; pick up the new artefacts here:
- find_package(qvac-tts-cpp CONFIG REQUIRED) -> find_package(tts-cpp CONFIG REQUIRED)
- qvac-tts::qvac-tts -> tts-cpp::tts-cpp
- qvac_tts::chatterbox -> tts_cpp::chatterbox (engine ptrs, EngineOptions, SynthesisResult, forward-decls in ChatterboxModel.hpp)
- #include <qvac-tts/chatterbox/engine.h> -> #include <tts-cpp/chatterbox/engine.h>
- Doxygen / inline doc references to the old names refreshed alongside the code changes.
vcpkg wiring:
- vcpkg-configuration.json baseline bumped to qvac-registry-vcpkg commit bc30b0b (ports/tts-cpp renamed and repointed at chatterbox.cpp@f8f9145).
- vcpkg.json tts-cpp constraint bumped to 2026-04-24#1 (the port that carries the rename + namespace + install(EXPORT) changes).
Verified with a cold bare-make generate + bare-make build against the new port, and the addon's existing unit + integration test suites.
Made-with: Cursor
* tts-ggml: bump tts-cpp port to 2026-05-07 + registry baseline
Picks up the round-3 review-fix wave landed on the tts-cpp port:
e673182 scrub stale patches/ refs from README (N10)
8ba10a6 drop unreachable TTS_CPP_GGML_LIB_PREFIX block (N8)
4b5d2d7 mirror N1-N7 fixes from chatterbox.cpp source-of-truth
- N1 supertonic alive-registry guard against freed-backend gallocr_free assert on hot-swap (Vulkan/Metal/CUDA)
- N2 drop dead g_sink_* state, soften log_set docstring
- N3 Turbo BPE try/catch (exception-safe Engine ctor)
- N4 STFT cancel checkpoint + tighter Engine::cancel() doc
- N5 document s3gen_preload/unload refcount semantics
- N6 drop dead cached_text_lc Supertonic shim
- N7 fix misleading "no copy" view-vs-copy log wording
Plus the integrated-port-only round-2 fixes that landed earlier:
fa0d490 close patches/-deleted regression: TTS_CPP_USE_SYSTEM_GGML now defaults ON; bundled-without-patches hard-errors at configure time with a pointer at the ggml-speech vcpkg port.
ae34c58 README rewritten for integrated/vcpkg context.
a2f2dd6 top-level qvac-ext-lib-whisper.cpp README points at the tts-cpp/ subtree (alongside parakeet-cpp/).
Public API used by ChatterboxModel (tts_cpp::chatterbox::Engine / EngineOptions / SynthesisResult / s3gen_preload / s3gen_unload) is backward-compatible: the new port adds Engine::backend_name(), MTL-variant fields on EngineOptions (language / cfg_weight / min_p / exaggeration), and a separate tts_cpp::supertonic::Engine class, but nothing this consumer was already calling has changed.
Edits:
packages/tts-ggml/vcpkg.json
- tts-cpp dep: version>=2026-04-24#1 -> version>=2026-05-07.
packages/tts-ggml/vcpkg-configuration.json
- default-registry baseline: bc30b0b (April 2026 fork-only state) -> 16b91afdcfd59baea60e81f3da94f49311ef2a97.
The new baseline pulls in the post-tetherto-merge state (parakeet-cpp port at 932d5d9, ggml-speech port-version 1 at f07bdd0) plus the new tts-cpp port (16b91af) on the developer's GustavoA1604 registry fork.
Smoke-test plan: after running `vcpkg install` against the new baseline, the tts-cpp port's vcpkg_from_github resolves at GustavoA1604/qvac-ext-lib-whisper.cpp@e673182 (tts-cpp branch) until the upstream PR merges. ChatterboxModel should build and synthesize identically; expanding to Multilingual + Supertonic flows is the follow-up commit on the package side.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Add chatterbox multilingual and supertonic
* Add mobile integration tests
* tts-ggml: drop clang-19 pin in linux-clang toolchain
The toolchain hardcoded `clang-19` / `clang++-19` (versioned binary names) since the package's first commit (0a2c978). Linux CI hadn't exercised this path before. The new on-pr-tts-ggml.yml -> integration matrix is the first time it does, and it fails on every linux runner (ai-run-ubuntu-22.04, ai-run-linux-gpu, ubuntu-24.04-arm) at vcpkg's "detect_compiler" step because none of the GH-hosted images ship a `clang-19` symlink:
Detecting compiler hash for triplet x64-linux...
error: while detecting compiler information: ...
CMake Error at scripts/cmake/vcpkg_execute_required_process.cmake:127 (message):
Command failed: ... -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE= .../tts-ggml/vcpkg/triplets/../toolchains/linux-clang.cmake ...
Match parakeet's working pattern (qvac-lib-infer-parakeet/vcpkg/toolchains/linux-clang.cmake): use unversioned `clang` / `clang++` so each runner picks up its image's default clang (clang-15 on ubuntu-22.04, clang-18 on ubuntu-24.04, whatever the AI runners ship). The `-stdlib=libc++` flag added by x64-linux.cmake / arm64-linux.cmake is honoured by every reasonable clang version.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Add C++ tests and coverage; fix linux build
* tts-ggml: address PR review feedback
Bundle of correctness, hygiene, and CI-doc fixes from the recent code review. Each item below has its own paragraph in the diff comments.
- #1 files-array: add test/utils/runSupertonicTTS.js + test/data/sentences-{medium,long}.js to package.json so consumers running the integration tests from the npm tarball don't crash with `Cannot find module ../utils/runSupertonicTTS`.
- #2 deps: move @qvac/langdetect-text from runtime dependencies to devDependencies (it's only referenced from examples/, which aren't in the published files list).
- #3 race-fix: ChatterboxModel::process()'s post-synthesize streaming detection used to read engine_->options() outside engineMu_, racing with reload(). synthesize() now returns SynthesizeResult { pcm, wasStreaming } where wasStreaming is captured under the engine lock against the local shared_ptr so process() doesn't have to touch engine_ again.
- #4 deferred-load: ChatterboxModel + SupertonicModel constructors used to call load() eagerly, so JsInterface::createInstance() (sync on the JS thread) was parsing ~370 MB of GGUF on the Bare event loop. Both models now implement IModelAsyncLoad: constructors validate + return; the actual load is deferred to waitForLoadInitialization(), which the new addon_js::activate wraps inside JsAsyncTask::run so the parse runs on a worker thread. binding.cpp registers addon_js::activate in place of JsInterface::activate; tts.js now awaits the resulting promise.
- #5 dead code: drop _resolvePath (unused), drop the (void)inputObj read in AddonJs.hpp::runJob, document FAILED_TO_PAUSE / FAILED_TO_STOP / JOB_ALREADY_RUNNING in lib/error.js as reserved-but-not-thrown so future maintainers don't delete them blindly (the unit suite asserts the values).
- #6 cancel-reset: SupertonicModel grew Chatterbox's cancelRequested_ reset pattern: cancel() sets it, synthesize() fast-fails on it, process() resets it per call so a stale cancel doesn't poison the next run.
- #7 useGPU comment: explain in JSAdapter::buildChatterboxConfig that the JS layer is the source of truth for useGPU and nGpuLayers wins downstream; left a pointer to std::optional<bool> if a future caller ever needs to distinguish "absent" from "explicit false".
- #10 fork pointers: README.md and test/utils/downloadModel.js no longer point at GustavoA1604/chatterbox.cpp; both reference the upstream tetherto/qvac-ext-lib-whisper.cpp/tts-cpp tree now.
- #9 doc: integration-mobile-test-tts-ggml.yml gained a header comment on the build-and-test job documenting that continue-on-error is the early-days landing posture (merge-guard treats success || skipped as pass), with a pointer to tighten once Device Farm provisioning is stable.
Nits:
- 'use strict' added to addonLogging.js (matches every other .js).
- node-vs-bare runtime banners on scripts/{generate,validate}-mobile-integration-tests.js.
- ttsOutputDebugString no longer JSON.stringify's the full PCM Int16Array on every chunk-streaming event; emits a tiny summary ({sampleRate, chunkIndex, isLast, sentenceChunk, outputArrayLen}) instead.
Tests: 35 passing (33 -> 35; two new assertions cover the deferred-load contract); 4 skipped real-GGUF tests behind the existing QVAC_TEST_CHATTERBOX_T3_GGUF / QVAC_TEST_CHATTERBOX_S3GEN_GGUF / QVAC_TEST_SUPERTONIC_GGUF env-var gates. Lint clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
* tts-ggml: unblock CI integration tests on every desktop runner
Four independent failures, one per platform:
1. linux-x64 / linux-arm64: addon load crashed at `libomp.so.5: cannot open shared object file`. tts-cpp's binary is built with clang under the linux-clang toolchain and links against libomp (LLVM OpenMP runtime); only `libgomp1` (GNU OpenMP) was being apt-installed.
Add `libomp5` so libomp.so.5 is on the loader path.
2. darwin-arm64: convert-models.sh aborted at line 200 with
`hf_args[@]: unbound variable`. macOS's system bash is 3.2 which treats
`"${arr[@]}"` as nounset access when the array is empty under `set -u`;
with HF_TOKEN unset we hit it on every fresh runner. Use the
`${arr[@]+"${arr[@]}"}` idiom (defined-or-nothing) at all six call sites
and add a header comment so the next maintainer doesn't accidentally
regress.
3. darwin-x64: pip install bombed building `llvmlite` from source because
the macos-15-large runner has no LLVM 15 development install. Root cause:
librosa pulls in numba 0.65+, which stopped shipping darwin-x86_64 wheels
for Python 3.12. Pin Python to 3.11 in the Setup Python step; 3.11 has
prebuilt wheels for the entire numba/llvmlite/librosa stack on darwin-x64
and is fine for every other converter dependency.
4. windows-2022: ChatterboxModel::load threw
`vk::createInstance: ErrorIncompatibleDriver`. Root cause: the addon's
index.js::_validateConfig defaults `useGPU = true` when neither useGPU nor
nGpuLayers is specified, so the test ran with n_gpu_layers=99 ->
ggml_backend_vk_init -> vk::createInstance -> ErrorIncompatibleDriver on
the runner's no-Vulkan-driver image. runChatterboxTTS.js now honours
`process.env.NO_GPU === 'true'` (set on the no-GPU matrix entries) and
forces useGPU=false on exactly those runners; the other test runners
(chatterbox-mtl, gpu-smoke, multiple-runs) already had this guard.
Also documents the `mesa-vulkan-drivers` apt package (already pulled in) as
the software ICD that lets the Vulkan-built prebuild's runtime backend probe
enumerate at least one device on linux runners.
Co-authored-by: Cursor <cursoragent@cursor.com>
* tts-ggml: drop Chatterbox from mobile bundle (Metro V8 string limit)
Mobile build failed at `:app:createBundleReleaseJsAndAssets` with:
SyntaxError: assets/testAssets/chatterbox-s3gen.gguf: Cannot create a
string longer than 0x1fffffe8 characters
Root cause: Metro's bundler reads every asset under
`test/mobile/testAssets/` via `Buffer.toString()`. V8's max string length is
0x1fffffe8 (~512 MiB). chatterbox-s3gen.gguf is ~1 GiB even with
--quant q4_0 because the s3gen converter only quantizes attention weights
and leaves the bulk of the s3gen graph in fp16 ("0/291 weight tensors
quantized" in the converter log).
Fix: bundle ONLY supertonic.gguf (~125 MiB, comfortably under the limit) on
mobile. Mobile Chatterbox tests degrade cleanly to
`t.pass('Skipped: Chatterbox GGUFs not available')` via the existing
`ensureChatterboxModels` helper -- it already returns { success: false }
when the GGUFs aren't on disk. Cache key bumped to v2 so existing v1 cache
entries (which include the chatterbox files) are evicted on the next run.
Bundling Chatterbox on mobile requires either:
- adding `gguf` to qvac-test-addon-mobile's metro `assetExts` so the
JS-string read is skipped (then the s3gen file can flow through the bundle
as a raw asset), or
- pushing the chatterbox GGUFs to the device via `adb push` outside the
bundle and surfacing the path through downloadModel.js's existing
ANDROID_CANDIDATE_DIRS fallback.
Both are outside the scope of this PR; documented inline above the cache
step for the next maintainer.
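The size constraint behind the Metro failure above can be sketched as a pre-bundle check. This is a minimal illustration, not code from the PR: `AssetInfo` and `splitByStringLimit` are hypothetical names, and the sketch assumes the file's byte size is a safe upper bound on the length of the string the bundler would create.

```typescript
// V8's maximum string length, as quoted in the build error above.
const V8_MAX_STRING_LENGTH = 0x1fffffe8;

// Hypothetical description of an asset that is a candidate for bundling.
interface AssetInfo {
  name: string;
  sizeBytes: number;
}

// Splits assets into those small enough to be read into one JS string
// and those that must be delivered some other way.
// Assumption: byte size is used as an upper bound on string length.
function splitByStringLimit(assets: AssetInfo[]): { bundle: string[]; skip: string[] } {
  const bundle: string[] = [];
  const skip: string[] = [];
  for (const asset of assets) {
    if (asset.sizeBytes <= V8_MAX_STRING_LENGTH) {
      bundle.push(asset.name);
    } else {
      skip.push(asset.name);
    }
  }
  return { bundle, skip };
}
```

With the approximate sizes from the commit message (~125 MiB and ~1 GiB), only `supertonic.gguf` lands in `bundle`.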
Co-authored-by: Cursor <cursoragent@cursor.com>
* Bump hash of vcpkg
* Consume vcpkg from tetherto repository
* Fix integration tests failures in all platforms
* Further fix tests
* fix: Make useGPU flag more meaningful (#1953)
* fix[api]: make useGPU flag actually force CPU/GPU and reject useGPU/nGpuLayers conflicts
* add gpu smoke test
* resolve comments
---------
Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
* Update dependencies after monorepo directory changes
* Further drop qvac-lib- prefix
* Add CHANGELOG.md
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Ishan Vohra <ishanvohra2@gmail.com>
Co-authored-by: Ishan Vohra <ishanvohra@Ishans-MacBook-Air.local>
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
…ache via KvCacheSession (#2007)
* QVAC-18182 feat[api]: typed cancel outcomes on the wire + atomic KV-cache via KvCacheSession
Builds on QVAC-18181's request lifecycle primitives (DisposableScope,
RequestContext, RequestRegistry) to deliver the M2 milestone:
- Typed cancel outcomes: `stopReason: "cancelled"` on `completionDone`
events, and `InferenceCancelledError(requestId, partial)` thrown from
CompletionRun promise-aggregates (`final` / `text` / `toolCalls` /
`stats`). The wire stream still ends normally so iterating `run.events` is
unaffected -- the typed error lives on the aggregate promises that callers
`await` for the final result.
- KvCacheSession
(`server/bare/plugins/llamacpp-completion/ops/kv-cache-session.ts`) --
single atomic owner of the three KV-cache layers (`cachedMessageCounts`,
`initializedCaches`, on-disk `.bin` files). `beginTurn` / `commitTurn` /
`rollback` collapse the three duplicated cleanup blocks in
`completion-stream.ts` into one scope.defer hook. Cross-model
administrative deletion lives at the module level as
`deleteKvCacheState(...)`, called by the RPC `handleDeleteCache` handler.
- Stop-button race close -- `RequestRegistry` now keeps a bounded
cancelled-before-begin map (128 entries, 30s TTL). A
`cancel({ requestId })` that lands before the server's `begin(...)` ran is
applied retroactively when begin lands, so same-tick stop clicks no longer
disappear into the void. Internal-only -- the wire surface for `cancel` is
unchanged (Option A in the brief).
Cursor rules updated in the same PR so the request-lifecycle and KV-cache
topic docs stay in sync with the implementation.
Tests:
- unit: KvCacheSession (bareTest-gated, runs in the Bare consumer),
RequestRegistry race + bounded-set eviction, completion-event schema
cancelled cases.
- e2e: cancellation-tests.ts adds three definitions -- mid-stream cancel
(events.stopReason === "cancelled", final rejects with
InferenceCancelledError, partial.text matches concatenated contentDelta),
cancel-before-begin (retroactive abort), and cancel-then-resume-kv-cache
(rollback wiped the three layers, the next turn re-primes cleanly).
* chore: drop planning labels (Mx/Dx) from QVAC-18182 comments
Strips milestone (`M1`/`M2`/`M3a`...) and deliverable (`D2`/`D5`/`D7`)
labels from comments and test titles introduced with the typed-cancel
outcomes + KvCacheSession work. The substantive descriptions of the
contracts (Stop-button race, cancelled-before-begin map, three-layer session
ownership, etc.) are preserved; only the planning-doc references are removed
so the code reads cleanly without the pitch context. Durable `QVAC-XXXXX`
ticket references are kept. No behavior or API surface changes.
* chore: drop Asana ticket references from QVAC-18182 code comments
Strips QVAC-XXXXX inline ticket references from code/test comments
introduced by the typed-cancel-outcomes work. Concept names (Stop-button
race, cancelled-before-begin, etc.) and prose descriptions of the contracts
are preserved; only the ticket-tag suffixes go. Also renames a test cache
key from `qvac-18182-cancel-resume-kvcache` to `cancel-then-resume-kvcache`
so the cache key reads as a stable identifier rather than a ticket
reference. No behavior or API surface changes.
* QVAC-18182 doc: clarify error>cancelled precedence + deleteKvCacheState concurrency
Address non-blocking review nits on PR #2007:
- aggregate-events: explain why a wire event carrying both error and
cancelled signals resolves to error (closes brief open question #3).
- kv-cache-session: doc-comment on deleteKvCacheState explaining the
ordering guarantee under concurrent in-flight turns -- delete is wire-async,
in-flight turns roll back idempotently when their commit probe finds the
file gone (closes brief open question #4).
Comments only; no behavior changes.
* QVAC-18182 doc: demonstrate typed cancel outcomes in cancel example
Enhance the existing cancel-by-request-id example to demonstrate the two M2
cancel-outcome channels:
- run.events ends normally with completionDone carrying
stopReason: "cancelled" -- show reading it inside the iteration loop.
- run.text rejects with InferenceCancelledError(requestId, partial) on
cancel -- show the instanceof check and consuming partial.text,
partial.toolCalls, partial.stats.
Also update the header to remove the now-stale "logged as a no-match"
sentence (same-tick cancels are no longer dropped after M2's race close).
Pure documentation enhancement; no API or behavior changes.
* QVAC-18182 fix: address PR review -- partial-prime cleanup + parent-aborted state
Two follow-ups from Opanin's review on PR #2007:
1. KvCacheSession.beginTurn: if `primeIfMissing` throws after the addon has
partially written a `.bin` to disk, the next `beginCustom` would
`fsPromises.access(cachePath)` → true and trust the half-primed file as a
valid cache (no rollback hook is registered yet -- the handler hasn't seen
the `TurnHandle`). Wrap both `beginCustom` and `beginAuto` prime calls in a
shared `primeOrCleanup` helper that best-effort unlinks the partial file
before re-throwing the original prime error. Adds a bare-only unit test
asserting the on-disk file is removed and the init flag stays unset on the
failed-prime path.
2. RequestRegistry.begin: when `parentSignal` was already aborted at begin
time, line 271 aborts the controller but the `state` ternary still landed
`"running"`, exactly the "momentarily-running with already-aborted signal"
the preCancel branch was guarding against. Extend the ternary to cover both
inputs and the existing `parentSignal already aborted` test now also asserts
`ctx.state === "cancelling"`. No behavior change on the happy path.
Lint + typecheck + 351-test unit suite green locally on the changed files.
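The two `RequestRegistry` behaviours described in this PR (the bounded cancelled-before-begin map and the initial-state decision covering both a pre-cancel and an already-aborted parent) can be sketched together. This is a simplified illustration under stated assumptions: `SketchRegistry`, `SketchContext`, and `AbortLike` are hypothetical names, eviction is oldest-first, and the real registry's API and internals are not shown in the source.

```typescript
type RequestState = "running" | "cancelling";

// Structural stand-in for an abort signal (assumption for this sketch).
interface AbortLike {
  readonly aborted: boolean;
}

interface SketchContext {
  requestId: string;
  state: RequestState;
  aborted: boolean;
}

class SketchRegistry {
  private readonly active = new Map<string, SketchContext>();
  // requestId -> expiry timestamp (ms); Map keeps insertion order.
  private readonly cancelledBeforeBegin = new Map<string, number>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries = 128, ttlMs = 30000, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  cancel(requestId: string): void {
    const ctx = this.active.get(requestId);
    if (ctx) {
      ctx.aborted = true;
      ctx.state = "cancelling";
      return;
    }
    // Cancel arrived before begin: remember it, bounded by size and TTL.
    this.sweepExpired();
    this.cancelledBeforeBegin.delete(requestId);
    if (this.cancelledBeforeBegin.size >= this.maxEntries) {
      const oldest = this.cancelledBeforeBegin.keys().next().value;
      if (oldest !== undefined) this.cancelledBeforeBegin.delete(oldest);
    }
    this.cancelledBeforeBegin.set(requestId, this.now() + this.ttlMs);
  }

  begin(requestId: string, parentSignal?: AbortLike): SketchContext {
    this.sweepExpired();
    const preCancelled = this.cancelledBeforeBegin.delete(requestId);
    // Both inputs decide the initial state, so a context never starts
    // as "running" with an already-aborted signal.
    const startAborted = preCancelled || parentSignal?.aborted === true;
    const ctx: SketchContext = {
      requestId,
      aborted: startAborted,
      state: startAborted ? "cancelling" : "running",
    };
    this.active.set(requestId, ctx);
    return ctx;
  }

  private sweepExpired(): void {
    const t = this.now();
    for (const [id, expiresAt] of this.cancelledBeforeBegin) {
      if (expiresAt <= t) this.cancelledBeforeBegin.delete(id);
    }
  }
}
```

Injecting the clock keeps TTL behaviour testable without real timers; that is a design choice of this sketch, not something the commit describes.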
* QVAC-18182 fix: prime is atomic -- addon writes to .prime.tmp + atomic rename
Upgrade the previous reactive cleanup workaround (PR #2007 review by
@opaninakuffo) into a proactive atomic-by-construction design:
- The session steers `model.run({ saveSessionPath })` to a sibling
`cachePath + ".prime.tmp"` path.
- Only after the prime closure resolves successfully do we promote the temp
file to the canonical `cachePath` via `fsPromises.rename` (atomic
same-volume on every host we target).
- The canonical cache path is therefore *never* observable in a partial
state -- a thrown prime is indistinguishable on disk from a never-attempted
prime, so the next existence probe (in-process or cross-process worker
restart) cannot trust corrupt bytes.
Defensive details:
- We unlink any leftover `.prime.tmp` *before* invoking the closure, so a
deferred-write addon path can't accidentally promote stale-from-crash bytes
left by a prior worker.
- On prime success we probe the temp path before renaming. If the addon
deferred its disk write (some llama.cpp paths flush lazily), the temp
doesn't exist and we leave the canonical path absent --
`verifySaveAndRecord` in `commitTurn` is the authoritative check.
- On rename failure we unlink the temp and surface the rename error; rename
atomicity guarantees the canonical path was untouched.
Why this is better than the prior `primeOrCleanup`:
- Best-effort `unlink` was load-bearing for correctness in the old design --
a failed unlink left a half-primed canonical file the next `beginCustom`
would trust. The new design moves the only possible "partial" file to a
non-trusted name, so failed cleanup cannot corrupt the canonical name by
construction.
- The unit test no longer mocks the workaround surface; it asserts the
actual invariant ("canonical path was never written") plus the positive
rename and the leftover-sweep guarantees.
Tests: 3 bare-only kv-cache-session unit tests
(throw-leaves-canonical-untouched, success-promotes-via-rename,
leftover-from-crash-is-swept). Lint + typecheck + 351-test unit suite green
locally on the changed files.
Long-term, the right fix is one layer down -- the llama.cpp addon should
write transactionally itself and surface save errors instead of swallowing
them. When that lands, this helper collapses to a direct `prime(cachePath)`
call and the `verifySaveAndRecord` access-probe fallback (TODO already
documented) can be retired together. Filed as a separate follow-up; out of
scope for this PR.
* QVAC-18182 fix: replace prime-atomic helper with verifyPrimedFile post-prime probe
Audit of the llama.cpp addon (`CacheManager::writeCacheFile` →
`llama_state_save_file`, return value swallowed;
`LlamaModel::processPromptImpl` lines 575-599) shows the bug shape Opanin
flagged on PR #2007 -- "primeIfMissing throws after a partial save" -- does
not actually fire. The save call is the very last operation on the prefill
path, the addon ignores its return value, and any earlier throw means no
save was attempted. So:
- `primeOrCleanup` (`ac8d2d74e`) and the upgrade to `primeAtomically`
(`a7420f3e6`) defended against a code path that the addon does not produce.
- The real corruption shape is silent partial writes (addon's
`llama_state_save_file` returns false, addon ignores it, file is
half-written or empty). Atomic temp+rename did NOT close this gap -- on a
"silent partial" the closure resolves successfully and the helper would
happily promote the partial `.prime.tmp` to the canonical path.
Replace both helpers with a small `verifyPrimedFile` that mirrors the
existing `verifySaveAndRecord` access-probe pattern used at commit time,
applied at prime time:
- After a successful prime closure, `fsPromises.stat` the canonical path.
If it doesn't exist (addon was interrupted before save) or has size 0 (addon
save call produced an empty file), throw and best-effort unlink the empty
leftover so the next existence probe doesn't trust it.
- This catches the two failure modes Opanin's concern was a proxy for
(cancelled-mid-prime; addon save quietly produced nothing) without claiming
defense against partial-but-nonzero writes, which can only be closed at the
addon layer.
The `RequestRegistry` parent-aborted-state fix (`ctx.state` ternary covers
`opts.parentSignal?.aborted`) from `ac8d2d74e` is preserved unchanged -- it
stands on its own as a correct response to Opanin's second comment.
Long-term root cause stays the addon: have `CacheManager::writeCacheFile`
check `llama_state_save_file`'s return value and throw on failure. When that
lands, both `verifyPrimedFile` and `verifySaveAndRecord`'s access-probes can
be retired together. Filed as a separate follow-up -- out of scope for this
PR.
Tests: 3 prior bare-only prime-atomic tests removed; 2 new bare-only tests
added (no-file and empty-file rejection paths). Lint + typecheck + 330-test
unit suite green locally on the changed files (pre-existing sdcpp-generation
lint errors unchanged).
* QVAC-18182 doc: kv-cache rule documents addon non-transactional save + matched access-probes
Extend the "Cache Initialization (primeIfMissing)" section in
.cursor/rules/sdk/docs/kv-cache-system.mdc with the corrected addon-contract
analysis:
- The llama.cpp addon's CacheManager::writeCacheFile discards
llama_state_save_file's bool return; maybeSaveCacheToDisk is the last call
on the prefill path. So no closure-rejection path can coexist with a partial
file on disk.
- Document the four real outcomes as a table (interrupted / success / silent
partial write / pre-eval throw) so future readers can see why the SDK takes
the shape it does.
- Pin both SDK-side defenses as a matched pair: verifyPrimedFile at prime
time (added in this PR) and verifySaveAndRecord at commit time (existing).
Both are honest about what they catch (missing / empty file) and what they
don't (partial-but-nonzero, only addon fix can close that).
- Reference the addon-layer follow-up (1214778658064488 / "throw on
llama_state_save_file failure") so the next contributor knows both probes
will be retired together when the addon throws on save failure.
No code change -- rule-only update.
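The post-prime probe described above (reject a missing or zero-byte cache file, best-effort unlink the empty leftover) might look roughly like the sketch below. It is not the PR's `verifyPrimedFile`: the `FsLike` interface and the function name are assumptions made so the example runs without a real filesystem, and treating every `stat` failure as "file missing" is a simplification.

```typescript
// Minimal filesystem surface this sketch needs (hypothetical interface;
// Node's fs promises API has compatible `stat` and `unlink` methods).
interface FsLike {
  stat(path: string): Promise<{ size: number }>;
  unlink(path: string): Promise<void>;
}

// Throws if the primed cache file is missing or empty.
// Does NOT detect partial-but-nonzero writes, matching the limits
// stated in the commit message.
async function verifyPrimedFileSketch(fs: FsLike, cachePath: string): Promise<void> {
  let size: number;
  try {
    size = (await fs.stat(cachePath)).size;
  } catch {
    // Simplification: any stat failure is treated as "no file".
    throw new Error(`prime produced no cache file at ${cachePath}`);
  }
  if (size === 0) {
    try {
      await fs.unlink(cachePath); // best-effort removal of the empty leftover
    } catch {
      // Ignore: the thrown error below is what the caller acts on.
    }
    throw new Error(`prime produced an empty cache file at ${cachePath}`);
  }
}
```

Passing the filesystem in as a parameter is only a convenience for this sketch; it lets the no-file and empty-file paths be exercised with an in-memory fake.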
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
* QVAC-18183 feat[api]: inference-handler migrations
Migrate the four remaining inference handler kinds onto the
RequestRegistry primitives shipped in M3a (cancel-capability
declaration, per-kind concurrency policy, structured
`[request-lifecycle]` logging). Each handler now opens a
request-scoped `ManagedRequestContext`, threads the optional
`requestId` from the wire request (falling back to a server-minted
UUID), routes hard cancels to `addon.cancel()` at a single signal-
listener leaf, and replaces ad-hoc `try/finally` cleanup with
`scope.defer(...)` registrations so cleanup runs in LIFO order on
every exit path.
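The `scope.defer(...)` cleanup model mentioned above (registrations unwinding in LIFO order on every exit path) can be illustrated with a small stand-in. `SketchScope` and `withScope` are hypothetical names, not the SDK's `DisposableScope`; swallowing cleanup errors so the unwind continues is an assumption of this sketch.

```typescript
type Cleanup = () => void | Promise<void>;

class SketchScope {
  private readonly cleanups: Cleanup[] = [];

  // Register a cleanup; later registrations run first.
  defer(fn: Cleanup): void {
    this.cleanups.push(fn);
  }

  // Run all registered cleanups in LIFO order.
  async dispose(): Promise<void> {
    for (;;) {
      const fn = this.cleanups.pop();
      if (!fn) break;
      try {
        await fn();
      } catch {
        // Assumption: one failing cleanup must not block the rest.
      }
    }
  }
}

// Runs `body` with a fresh scope and always unwinds it afterwards,
// whether `body` returns or throws.
async function withScope<T>(body: (scope: SketchScope) => Promise<T>): Promise<T> {
  const scope = new SketchScope();
  try {
    return await body(scope);
  } finally {
    await scope.dispose();
  }
}
```

Compared with nested `try/finally` blocks, each resource registers its own cleanup next to where it is acquired, and the unwind order follows from registration order.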
- `embed` (kind "embeddings", `{ scope: "model", hard: true }`):
`packages/sdk/server/bare/ops/embed.ts` opens the context, threads
`requestId` from `embedRequestSchema`, post-await `signal.aborted`
checks raise `InferenceCancelledError`.
- `transcribe` / `transcribeStream` (kind "transcribe",
`{ scope: "model", hard: true }`): collapsed
`try { ... } finally { restorePrompt(...) }` into
`scope.defer(restorePrompt)`, added per-iteration
`if (ctx.signal.aborted) break;` in the `response.iterate()` loop
(Option A from §4 of the M3b brief — explicit, visible at the call
site, no `takeWhileNotAborted` wrapper).
- `translate` (kind "translate"): two engine branches.
llamacpp-completion declares `{ scope: "model", hard: true }` and
wires `signal → addon.cancel()`; nmtcpp-translation keeps
`{ scope: "none" }` and soft-cancels inside both the streaming
iterate loop and the `runBatch` early-return path.
- `finetune` (kind "finetune"): flipped the llamacpp-completion
manifest declaration from `{ scope: "none" }` to
`{ scope: "model", hard: true }` (the addon already exposes
`model.cancel()`). `startFinetune` opens a registry context and
wires `signal → model.cancel()`; the two-level `try/finally`
collapses into `scope.defer` for `clearFinetuneRuntimeState` and
`handle.removeListener`. `cancelFinetune(modelId)` is now a thin
wrapper over `getRequestRegistry().cancel({ modelId, kind:
"finetune" })` — never invokes `model.cancel()` directly.
Per §4 of the brief: per-iteration cancel granularity uses
Option A (explicit `if (ctx.signal.aborted) break;` at the top of
each streaming loop body). No `takeWhileNotAborted` wrapper was
introduced.
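As an illustration of the Option A check described above, the sketch below places an explicit `aborted` test at the top of a streaming loop body. `collectUntilAborted` and `AbortLike` are hypothetical names; the real handlers iterate `response.iterate()` and read `ctx.signal`, whose exact types are not shown in the source.

```typescript
// Structural stand-in for an abort signal (assumption for this sketch).
interface AbortLike {
  readonly aborted: boolean;
}

// Consumes a stream of text chunks, stopping at the first iteration
// that observes an aborted signal.
async function collectUntilAborted(
  chunks: AsyncIterable<string>,
  signal: AbortLike,
): Promise<{ text: string; cancelled: boolean }> {
  let text = "";
  for await (const chunk of chunks) {
    // Explicit per-iteration check, visible at the call site.
    if (signal.aborted) break;
    text += chunk;
  }
  return { text, cancelled: signal.aborted };
}
```

The check is a soft cancel: the chunk delivered after the abort is discarded and the loop exits, but nothing here interrupts work already in progress inside the producer.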
Per §7 anti-patterns: M3b adds zero `oneAtATimePerModel` policies
(the four migrated kinds tolerate concurrent requests against the
same model), leaves the M1 compat-fallback in
`server/bare/ops/cancel.ts` untouched (M3d retires it), and does
not modify `cancelHandler.ts`.
Other changes:
- `embed`, `transcribe`, `transcribeStream`, `translate`,
`finetune` request schemas grow an optional `requestId` field
(`.string().min(1).optional()`); server-side ops fall back to
`generateServerRequestId()` when absent.
- Whisper / Parakeet / LLM / NMT plugin handlers thread
`request.requestId` into their bare ops.
- `plugin-cancel-capability.test.ts` truth-table flipped for the
`finetune` row.
- New `inference-handler-migrations.test.ts` covers schema-level
optional-`requestId` acceptance for all four kinds and pins the
`[request-lifecycle] begin/cancel/end` line shape for each kind.
The op-level cancel-by-requestId / cancel-by-modelId integration
tests are bare-runtime-gated (the migrated ops pull `bare-crypto`
/ `bare-fs` transitively and can't load under Bun, same reason as
`finetune-ops.test.disabled.ts`).
- `.cursor/rules/sdk/request-lifecycle-primitives.mdc` and
`.cursor/rules/sdk/docs/request-lifecycle-system.mdc` updated:
M3b row marked shipped, finetune truth-table row flipped,
canonical-handler-shape section refreshed to use `embed.ts` as the
cleanest reference and to document the Option A per-iteration
check.
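The optional `requestId` field and its server-side fallback, listed under "Other changes" above, can be pictured as follows. This is a hand-rolled sketch of the same acceptance rule as `.string().min(1).optional()`; `resolveRequestId` and the injected `mint` callback are hypothetical, standing in for the schema validation and `generateServerRequestId()` respectively.

```typescript
// Returns the caller-supplied requestId when present and valid,
// otherwise a freshly minted one.
function resolveRequestId(raw: unknown, mint: () => string): string {
  if (raw === undefined) {
    return mint(); // field omitted: fall back to a server-minted id
  }
  if (typeof raw !== "string" || raw.length < 1) {
    throw new TypeError("requestId must be a non-empty string when provided");
  }
  return raw;
}
```

Note that an empty string is rejected rather than replaced, which mirrors what a `min(1)` constraint would do.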
Verification:
- `bun lint` (eslint + tsc --noEmit): green.
- `bun run typecheck`: green.
- `bun run test:unit`: every test file green except the
pre-existing `client/rpc/rpc-client.ts` `#rpc` package-resolution
failure on upstream/main (also reproducible without these
changes; unrelated to M3b).
* QVAC-18183 fix: address PR #2058 review feedback
- transcribe.ts: route the two `Transcription Update` debug emits
through `requestLogger.debug` so they carry the per-request prefix,
matching the rule's `grep "requestId=<id>"` invariant. Drop the now-
unused module-level `logger`. Collapse two `scope.defer(async () =>
{ await restorePrompt(...) })` wrappers to bare arrow callbacks
(review #5, #10).
- inference-handler-migrations.test.ts: add bareTest op-level cancel-
by-requestId cases for `transcribe (whisper)` (asserts loop exit +
addon.cancel called + reload-count == 2 to pin the
`applyPrompt + restorePrompt runs exactly once` invariant) and
`finetune` (asserts model.cancel called + scope unwind clears the
runtime-state flag back to IDLE). Pin the NMT soft-cancel contract
by instrumenting the addon and asserting addon.cancel was NOT called
during a translate cancel (review #3, #7).
- request-lifecycle-primitives.mdc: reconcile the "polling
signal.aborted mid-handler" anti-pattern with the new "Per-iteration
cancel check (M3b)" canonical pattern. The anti-pattern is *adding*
the check when the addon already honours signal directly; the M3b
pattern is *introducing* the check where the addon doesn't and the
loop is the only soft-cancel exit (review #4).
* QVAC-18183 fix: drop unsafe `addon` re-narrowing in translate.ts onAbort
Addresses opaninakuffo's review comment on #2058:
`AnyModel.addon` is already typed as `AddonInterface | undefined`
(see `server/bare/registry/model-registry.ts:17-20`), so the
`as unknown as { addon?: { cancel?(jobId?: string): Promise<void> } }`
cast was unnecessary. Matches the simpler pattern used by `embed.ts`
and `transcribe.ts` for the same `onAbort` shape — keeps the four
M3b-migrated ops uniform.
* QVAC-18183 doc: trim internal milestone references from cursor rules + code comments
Removed the "Migration Roadmap" table, "M1/M2/M3a-d" milestone labels, planning-brief
decision references (Decision A/B.2, D1/D2), workspace-local paths
(`tasks/release-0.11.0-planning/...`, `pitch-3-decisions.md`), and "in review"
forward-references from the request-lifecycle cursor rules and the matching code
comments in the bare ops, finetune wrapper, and the inference-migration tests. The
canonical handler shape, anti-patterns, primitives reference, plugin cancel-capability
truth table, and concurrency-policy / structured-logging sections all stay — only the
internal milestone framing comes out.
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
* feat: add qvac-lib-infer-vla hello-world addon scaffold
- New addon package at packages/qvac-lib-infer-vla with ggml backend.
- CI workflows for on-pr, on-merge, prebuilds, integration + mobile tests, cpp-tests.
- Temporarily renames on-pr-qvac-lib-infer-vla.yml to on-pr-ocr-onnx.yml
so the existing workflow name triggers CI while verifying hello-world scaffold.
* fix[notask]: pure-JS helper pattern for hello-world addon unit tests
- Extract `normalizeName()` into a pure-JS `addon.js` helper in the vla
scaffold so `npm run test:unit` no longer loads the native `.bare` addon.
- Mirror the pattern used by qvac-lib-infer-llamacpp-embed, which lets CI's
ts-checks job (which runs `test:unit --if-present` without a build) pass.
- Propagate the same pattern to the `new-addon` skill templates and document
the rule in SKILL.md so future scaffolds inherit it.
* fix[notask]: fix Windows build for hello-world scaffold
Add Windows compile defines (`NOMINMAX`, `WIN32_LEAN_AND_MEAN`, `NOGDI`)
and link `msvcrt.lib`, mirroring qvac-lib-infer-llamacpp-embed. Without
these, the Windows SDK macros `ERROR` (wingdi.h) and `min` (minwindef.h)
collide with `Priority::ERROR` and `std::min` in the
`qvac-lib-inference-addon-cpp` headers.
Propagate the same fix to the `new-addon` skill template so future
scaffolds inherit it.
* fix: use versionless filename for pinned Vulkan SDK download
LunarG rotated out the versioned `vulkansdk-linux-x86_64-${VERSION}.tar.xz`
download URL and now only serves `vulkan_sdk.tar.xz` under each pinned
version path. Prebuild workflows using the pinned version (currently
1.4.341.1) fail with `wget` exit code 8 (HTTP 404) on every fresh runner.
Align the pinned-version URL with the `latest` URL pattern, which already
uses `vulkan_sdk.tar.xz` and continues to return 200 for pinned versions.
Verified:
- https://sdk.lunarg.com/sdk/download/1.4.341.1/linux/vulkan_sdk.tar.xz -> 200
- https://sdk.lunarg.com/sdk/download/1.4.341.1/linux/vulkansdk-linux-x86_64-1.4.341.1.tar.xz -> 404
* chore[notask]: bump setup-vulkan-sdk action pin on tmp-vla
Point the vla prebuild workflow at the cherry-picked Vulkan URL fix
so CI on this branch actually picks it up. The previous pin still
resolved to the pre-fix action, so Linux/Android prebuilds kept
hitting wget exit 8 (HTTP 404) even after the fix commit landed on
tmp-vla.
* feat[bc]: port SmolVLA ggml inference into qvac-lib-infer-vla
Replace hello-world scaffold with real SmolVLA inference engine (739-tensor
vision+text+expert model, 10-step flow-matching ODE). JS surface exposes
VlaModel, preprocessImage, padState. Integration test downloads the LIBERO
checkpoint from S3 via GitHub OIDC so CI can exercise end-to-end inference.
* infra: add on-pr CI workflow for qvac-lib-infer-vla
The VLA package was missing an on-pr workflow, so nothing ran sanity checks,
cpp-lint/tests, ts-checks, prebuilds, or integration tests against a PR. This
adds one mirroring the Embed template so integration tests (which pull the
SmolVLA LIBERO GGUF from S3) gate the PR.
* doc: harden new-addon skill with explicit 7-workflow check
Add Step 4a validation gate that lists every expected workflow filename and
fails loudly if any is missing. The prior VLA scaffold shipped with only 6/7
workflows (on-pr-*.yml silently dropped), which left PRs against the new
package without sanity checks, cpp-lint/tests, ts-checks, prebuilds, or
integration tests. Also make Step 6 list each generated filename by name so
miscounts are caught at report time.
* fix: use std::numbers::pi_v<float> to unbreak Windows (MSVC) build
MSVC's `<cmath>` does not define `M_PI` unless `_USE_MATH_DEFINES` is set
before the include, so the x64-windows prebuild job failed to compile
smolvla.cpp. Switch to the C++20 `std::numbers::pi_v<float>` constant,
which works on every toolchain we build with.
* feat: enable full GPU backend set (Vulkan + Metal + OpenCL) in qvac-lib-infer-vla
Drop default-features:false on the qvac-fabric dep so the port's platform-
auto-selected backends get built: Metal on iOS/macOS, Vulkan on Linux/Android/
Windows, plus the CPU fallback everywhere. Declare the OpenCL dep on Android
so qvac-fabric's Android GPU backend can pick it up alongside Vulkan, mirroring
the LLM addon's setup.
The addon already calls ggml_backend_load_all_from_path(BACKENDS_SUBDIR) and
ships each GGML_AVAILABLE_BACKEND as a shared/static lib via CMakeLists, so no
C++ changes are needed — the extra backends get discovered at runtime.
* chore[notask]: rename vla workflow display names for easier triggering
Use `on-merge-vla` for the merge workflow and `vla` for the PR workflow so
`gh workflow run vla` uniquely resolves to the on-pr trigger without ambiguity
against all the other `(Vla)`-suffixed package workflows.
* chore[notask]: mask vla on-pr workflow as on-pr-ocr-onnx.yml on tmp-vla
Temporarily rename the VLA on-pr workflow to the OCR filename so
`gh workflow run on-pr-ocr-onnx.yml --ref tmp-vla` resolves the workflow
ID via main's registration and then dispatches against our file content
on tmp-vla. Scoped to tmp-vla only — does not affect main's OCR workflow.
* fix: satisfy standardjs no-new in vla integration tests
Capture the VlaModel constructor return and destroy it so standardjs
stops flagging the error-path probes with `no-new`. These paths throw
synchronously before the native handle is fully built, so the destroy
is cheap and safe.
* fix: replace brittle t.exception() in vla unit tests to unblock bare run
Brittle's t.exception() runs the probed function inside a promise chain; on
the bare runtime the assertion helper rethrows into an uncaught rejection
which aborts the process with SIGABRT (exit 134). This made the ts-checks
job fail on CI even though every assertion passed.
Switch both rejection probes (preprocessImage and padState) to the same
try/catch + t.ok pattern already used in the integration tests.
* style: apply clang-format-19 to qvac-lib-infer-vla sources
Satisfies cpp-lint 'Check C++ files format' step (run from CI):
git-clang-format-19 --extensions c,cc,cpp,cxx,h,hh,hpp,hxx -- packages/qvac-lib-infer-vla
* test[notask]: fix ci failures from tmp-vla PR-style dispatch
- mobile: add test/mobile/ scaffold (integration-runtime.cjs + auto.cjs)
and matching generate/validate scripts. Mobile workflow requires
test/mobile/*.cjs; before this commit the dir didn't exist.
- integration (linux-x64): install aws CLI v2 on linux runners
(idempotent). Needed for ai-run-linux-gpu self-hosted runner that
lacks a pre-baked aws CLI.
- integration (darwin-x64): skip S3 download + QVAC_VLA_MODEL on the
macos-15-large Intel runner. Its Apple Paravirtual GPU exposes only
~1 GB working set — too small for the 4 GB SmolVLA model, which
triggers GGML_ASSERT(buf_src) mid-inference on Metal. Darwin-arm64
still runs the full end-to-end test.
* ci[notask]: skip cpp-lint on workflow_dispatch in vla on-pr
cpp-lint passes `github.event.pull_request.base.sha` as the diff base;
on workflow_dispatch that's empty, and the called workflow then runs
`git-clang-format-19 --diff ""` which fails with "'' is not a commit".
Gate the job on `github.event_name == 'pull_request_target'` so
dispatch-style runs (we use these to test tmp-vla) don't fail it.
Real PRs still run the format check normally. merge-guard is
if-always, so the skipped job doesn't block it.
* fix: ship ggml core libs on Android and add AWS CLI to PATH on self-hosted linux
Two independent CI fixes for the VLA addon:
1. Android mobile integration tests were failing because the prebuild
shipped only backend shared libs (libqvac-ggml-vulkan.so,
libqvac-ggml-cpu-*.so, libqvac-ggml-opencl.so) and the addon .bare
itself. qvac-fabric builds ggml with GGML_BACKEND_DL=ON on Android,
which makes ggml::ggml and ggml::ggml-base shared libraries too, so
without them the addon's dlopen fails with unresolved ggml_* symbols.
Install them alongside the backend libs when GGML_BACKEND_DL is set.
2. linux-x64 integration tests were failing on the self-hosted
ai-run-linux-gpu runner because AWS CLI v2 installs to
/usr/local/bin/aws but that directory is not on PATH for subsequent
steps. Append it to $GITHUB_PATH so later steps (aws s3 sync, etc.)
can resolve the binary. Also simplified the install block to early-
exit when aws is already present.
* fix[notask]: VLA Android ggml backend-DL compat + linux AWS CLI perms
Two fixes for remaining tmp-vla CI failures:
1. Android addon failed to dlopen the .bare because qvac-fabric builds
ggml with GGML_BACKEND_DL=ON, which keeps the core ggml_backend_*
registry symbols in the addon but puts `ggml_backend_cpu_init` in the
separately-loaded CPU backend .so. Switch to the device-registry API
(`ggml_backend_dev_by_type` + `ggml_backend_dev_init`) so the CPU
backend is obtained from whichever backend was loaded at runtime via
`ggml_backend_load_all_from_path`. Also revert the CMakeLists hack
that shipped ggml::ggml / ggml::ggml-base alongside the addon — those
ship as static .a under this vcpkg triplet and are useless at dlopen.
2. linux-x64 integration jobs were hitting `aws: Permission denied` on
the self-hosted `ai-run-linux-gpu` runner because a leftover install
at /usr/local/bin/aws had mode bits the runner user couldn't execute.
Add an `[ -x /usr/local/bin/aws ]` early-return path so we reuse a
good existing install, and `chmod -R a+rX` after any fresh install to
harden against the same footgun next time.
* fix[notask]: tolerate Vulkan teardown SIGSEGV on ai-run-linux-gpu
The Linux x64 integration matrix runs on two Ubuntu runners: a plain
ubuntu-22.04 (CPU only) and a self-hosted ai-run-linux-gpu (Tesla T4
Vulkan). Tests all pass cleanly on both, but the GPU runner's bare
process exits with SIGSEGV (exit 139) ~0.5s after the final test
completes — inside ggml-vulkan's static-destructor chain interacting
with the NVIDIA Vulkan ICD.
Fixing that upstream is out of scope for this branch, but we still want
GPU coverage in CI. Wrap the `npm run test:integration` invocation so
that exit 139 is tolerated IFF the captured TAP output shows all tests
passed (the `# ok` end marker and the `# tests = N/N pass` summary).
Any other non-zero exit, and any missing TAP pass marker, still fails
the job.
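The tolerance rule above is implemented in the workflow's shell wrapper, which is not shown here. As a rough illustration of the decision logic only, here is a minimal C++ sketch. The helper names `tapSummaryAllPassed` and `runCountsAsPassed` are hypothetical, and treating exit 0 as a pass without inspecting TAP is an assumption.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true when the TAP summary reads "# tests = N/N pass".
static bool tapSummaryAllPassed(const std::string& tap) {
    const std::string key = "# tests = ";
    std::size_t pos = tap.find(key);
    if (pos == std::string::npos) return false;
    pos += key.size();
    const std::size_t slash = tap.find('/', pos);
    if (slash == std::string::npos) return false;
    const std::size_t end = tap.find(" pass", slash);
    if (end == std::string::npos) return false;
    const std::string passed = tap.substr(pos, slash - pos);
    const std::string total = tap.substr(slash + 1, end - slash - 1);
    if (passed.empty() || total.empty()) return false;
    for (unsigned char c : passed + total) {
        if (!std::isdigit(c)) return false;  // rejects anything but plain counts
    }
    return passed == total;
}

// Hypothetical helper: exit 139 (SIGSEGV at teardown) is tolerated only when
// both TAP pass markers are present. Assumption: exit 0 passes unconditionally.
bool runCountsAsPassed(int exit_code, const std::string& tap) {
    if (exit_code == 0) return true;
    if (exit_code != 139) return false;
    const bool has_ok_marker = tap.find("# ok") != std::string::npos;
    return has_ok_marker && tapSummaryAllPassed(tap);
}
```

Requiring both markers means a crash that happens before the summary is printed still fails the job.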
* feat[api]: expose per-stage timings and PyTorch reference assertion in VLA
- VlaModel.run() now returns { actions, stats } where stats carries
vision_ms, smollm2_compute_ms, smollm2_total_ms, ode_ms, total_ms
captured during inference. The C ABI of smolvla_inference is preserved;
C++ callers use the new smolvla_inference_with_timing.
- Integration test: tolerance-based comparison against a committed
PyTorch reference (test/integration/assets/pt_actions_libero_fixed.json,
generated by scripts/generate_reference.py), plus wiring of the shared
performance reporter (vla addon type). Uploads perf-report.json as
a per-platform artifact in the integration-test workflow.
* test: regenerate VLA PyTorch reference at action_dim=7
The committed reference was generated at action_dim=6 but the current
smolvla-libero-f32-fixed.gguf reports action_dim=7, so the tolerance
asserts were skipped in CI with "shape mismatch (ref=50x6, actual=50x7)".
Regenerated with `generate_reference.py --action-dim 7`; local run now
exercises both new asserts with max|Δ|=0.0009, cos=1.0000.
* feat: bundle SmolVLA GGUF on mobile via presigned S3 URL
Ports the presigned-URL-on-mobile pattern used by qvac-lib-infer-nmtcpp so
the VLA end-to-end test actually runs on AWS Device Farm. Without a GGUF
on the device the mobile test was skipped, leaving the Step Summary empty.
- scripts/generate-smolvla-presigned-url.sh: resolve the latest date dir
under s3://MODEL_S3_BUCKET/qvac_models_compiled/vla/smolvla-libero/,
presign smolvla-libero-f32-fixed.gguf for 6h, export to GITHUB_ENV.
- integration-mobile-test-qvac-lib-infer-vla.yml: OIDC auth to
eu-central-1, run the presign script, and bundle the URL into
test/mobile/testAssets/smolvla-urls.json before the addon is packed.
- test/integration/addon.test.js: on mobile, load the URL from
global.assetPaths, download into global.testDir/vla-models/ (with
retry/redirect handling and a ≥100MB cache-hit shortcut) and use that
as the modelPath instead of relying on QVAC_VLA_MODEL.
- package.json: add bare-fetch devDep, same version range as nmtcpp.
* fix: stream SmolVLA GGUF download on mobile via bare-https
The mobile end-to-end test was crashing the Bare runtime at
after-test:runAddonTest with State=1 on both iOS and Android. Root cause
was the _downloadFile helper loading the entire 2.1 GiB GGUF into memory
via bare-fetch + response.arrayBuffer() + Buffer.from(buffer), which
peaked at ~4.5 GB and got OOM-killed by the mobile kernel.
Replace the buffered download with a bare-https streaming pipe:
https.get + fs.createWriteStream + res.on('data', chunk => write(chunk)).
Same pattern Parakeet, TTS/Chatterbox, and Diffusion use for their
multi-GB Device Farm models. Preserves redirect handling (301/302/
307/308), retry+backoff, and adds progress logs every 50 MB. Failed
attempts unlink the partial file before retrying.
Drop bare-fetch from devDependencies — bare-https is a Bare runtime
module, so no new dep is needed.
* ci: align darwin-arm64 integration runner with prebuild SDK
Prebuilds for darwin-arm64 are built on macos-14 (macOS 14 SDK), but the
integration test job was running on macos-15-xlarge. The .bare binary —
including its linked Metal/MPSGraph frameworks — was compiled against the
macOS 14 SDK then loaded on a macOS 15 host. That cross-SDK mismatch is a
plausible cause of the Metal correctness divergence we are seeing on CI
(max|Δ|=1.9789 on CI darwin-arm64 vs max|Δ|=0.0006 on a macos-15.5 M3
Max running the same GGUF locally). Match the runner OS to the prebuild
runner (macos-14-xlarge) so the binary executes on the SDK it was built
against.
Also tighten the end-to-end mobile test: remove the t.comment + t.pass()
graceful-skip branches that silently masked iOS CI failures. On mobile
the presigned S3 URL is bundled at build time, so a fetch/load/inference
failure is now a hard t.fail(), and we assert the downloaded GGUF exists
and is at least 100 MB before proceeding.
* ci: run darwin-arm64 VLA integration on self-hosted mac-mini-m4
GitHub's hosted macos-*-xlarge runners are Apple Virtualization VMs —
their Metal driver reports "Apple Paravirtual device" with
`simdgroup reduction = false` and `simdgroup matrix mul. = false`. ggml
falls back to a scalar Metal path that is ~40x slower and produces
different f32 accumulation, which is what caused the darwin-arm64
correctness failure (max|Δ|=1.97, cos=0.15) and the ~12s inference time
on CI, versus ~0.3s for the same GGUF on a real M3 Max.
macos-14-xlarge has the same paravirt signature (confirmed in
run 24887526194: max|Δ|=1.07 on SDK-aligned runner), so the earlier
fix didn't help.
Switch darwin-arm64 integration to the self-hosted mac-mini-m4 runner
(label: mac-mini-m4-gpu), the same setup the diffusion addon uses for
Metal-backed correctness tests.
* ci: install AWS CLI on darwin-arm64 self-hosted runner
The mac-mini-m4 self-hosted runner doesn't ship with aws CLI preinstalled,
so the "Download SmolVLA model from S3" step fails with
`aws: command not found` (run 24888672009, job 72877826352). GHA's Linux
matrix entry had an idempotent aws install; darwin had none. Add the
equivalent macOS step that checks PATH, then /usr/local/bin/aws, then
installs via the official AWSCLIV2.pkg installer. Scoped to darwin-arm64
since darwin-x64 runs on a GHA-hosted Intel Mac that already has aws.
* ci: install AWS CLI user-local on mac-mini-m4 (no sudo)
The self-hosted mac-mini-m4-gpu runner doesn't have passwordless sudo,
so `sudo installer -pkg AWSCLIV2.pkg -target /` fails with
`sudo: a terminal is required to read the password` (run 24889823710,
job 72880523559).
Pivot to a user-local install: `pkgutil --expand-full` unpacks the
official pkg without sudo, and the payload at
`aws-cli.pkg/Payload/aws-cli/aws` is a real Mach-O universal binary
(verified: aws-cli/2.34.36 runs standalone from that path). Move it
to `$HOME/.local/aws-cli` and add that dir to `$GITHUB_PATH`.
Also widen the preflight check to pick up `/opt/homebrew/bin/aws` and
the user-local path, so the step is a no-op on subsequent runs.
* test: fix mobile model download — bare-https has no .get()
Mobile Device Farm runs were failing at test 4 (`end-to-end inference
runs (needs GGUF)`) with `[vla-model] download failed after 3 attempts:
https.get is not a function` on iPhone 16 Pro / 16e / 17 and Pixel 9 Pro /
Galaxy S25 Ultra (run 24891028803).
Root cause: `bare-https` only exports `.request()` — there is no
Node-compatible `.get()`. Switch to the same pattern
`qvac-lib-infer-llamacpp-embed/test/integration/utils.js` uses:
`https.request(url, cb)` followed by an explicit `req.end()`, since
`.request()` returns a writable that must be closed before the request
is actually sent.
t.fail() hardening surfaced this correctly — desktop remains green
(real M4 Metal: max|Δ|=0.0006, cos=1.0000).
* test: fix mobile VLA download crash — use response.pipe(file)
Mobile Device Farm runs were still failing after the https.get→request fix.
Android (Pixel 9 Pro) crashed at 50MB / 2.4% of the 2.2GB download with
SIGABRT on the mqt_v_js thread inside libbare-kit.so; iOS exhibited the
same APP CRASHED pattern (run 24899187856, job 72913667435).
Root cause: the download was using `res.on('data', chunk =>
writeStream.write(chunk))` with no backpressure — V8 + file stream
queue grew until the JS bridge aborted. `qvac-lib-infer-llamacpp-embed`
downloads with `response.pipe(file)`, which applies backpressure
automatically. Switch to the same pattern, plus the full safeResolve/
safeReject error hygiene (destroy file + unlink on error, follow
redirects cleanly).
Progress logging is preserved (`res.on('data')` is kept for byte
counting only; the pipe does the actual writing).
Desktop remained green through both prior fix attempts (real M4 Metal:
max|Δ|=0.0006, cos=1.0000) — this only affects the mobile fetch path.
* test: raise mobile GGUF e2e test timeout to 20 min
The backpressure fix (6021b43b, res.pipe(file)) successfully resolved the
50MB SIGABRT on Android — download now progresses past 50MB cleanly
(logcat: [vla-model] progress: 50MB (2.4%) at 18:07:10 then keeps going
with no crash in libbare-kit.so).
New failure mode surfaced: brittle's default 30-second per-test timeout
fires before a 2.2GB mobile download + model load + inference can
complete. On Pixel 9 Pro and Galaxy S25 Ultra the test timed out at
30s → Uncaught (in promise) Error: Test timed out after 30000 ms →
SIGABRT on mqt_v_js as the unhandled rejection propagates through the
bare bridge.
Only the end-to-end inference test needs the long budget — the other
three tests (module exports, empty path rejection, missing GGUF
rejection) stay at 30s. 20 min is conservative for:
- 2.2GB HTTPS download over mobile carrier (5-10 min)
- SmolVLA model load (vision 12L + text 32L + expert 32L, ~1 min)
- Vision x2 + SmolLM2 prefix + 10-step ODE (~15s on CPU/Vulkan)
- Headroom for Device Farm variability
Desktop is unaffected: it uses QVAC_VLA_MODEL from a pre-staged path
and finishes in ~15 sec (max|Δ|=0.0006 on M4 Metal, cos=1.0000).
* fix: mmap+host_ptr GGUF load to fix iOS Metal alloc crash
Mobile run 24905749242 (commit 8bdc077e) confirmed all download/timeout
fixes worked: Pixel 9 Pro reaches `runAddonTest passed (4/4)`. Two new
unrelated bugs surfaced; this fixes the iOS one.
iOS root cause
On iPhone 16 Pro / 16e / 17, every load attempt crashed at model load
with EXC_BAD_ACCESS in `ggml_metal_buffer_is_shared` at NULL+0x10. The
faulting stack:
ggml_metal_buffer_is_shared
ggml_backend_metal_buffer_type_shared_alloc_buffer
alloc_tensor_range
ggml_backend_alloc_ctx_tensors_from_buft
smolvla_load_model+51156
`smolvla_load_model` was hand-rolling a load path that did:
1. gguf_init_from_file(no_alloc=false) — heap-allocate full 2.2 GB on CPU
2. ggml_init(no_alloc=true) — duplicate context for GPU
3. ggml_backend_alloc_ctx_tensors() — single 2.2 GB Metal shared-mode
allocation, which iOS Metal cannot service. The internal
allocator returned NULL, which was then dereferenced.
Why the LLM and diffusion addons don't hit this on iOS
Both delegate model loading to a library (llama_load_model_from_file in
qvac-fabric, new_sd_ctx in stable-diffusion-cpp) that uses the
ggml_backend_dev_buffer_from_host_ptr() path on devices reporting
`caps.buffer_from_host_ptr=true` (Apple Metal, CPU). That path wraps an
mmap'd region in a backend buffer and the Metal backend internally
slices it into per-tensor sub-buffers each ≤ max_tensor_size — no
giant single shared-mode allocation.
Fix — mirror llama-model.cpp:6648 create_backend_buffers
- gguf_init_from_file(no_alloc=true): metadata only (~few MB), no 2.2 GB
heap copy.
- Probe device caps (buffer_from_host_ptr, is_default_buft).
- FAST PATH (Apple Metal, CPU): mmap the GGUF file with PROT_READ |
MAP_PRIVATE; call ggml_backend_dev_buffer_from_host_ptr() with
ggml_get_max_tensor_size(ctx) as the slicer hint; wire each tensor
to its mmap-relative position via ggml_backend_tensor_alloc().
Zero-copy: process memory stays around tensor metadata + lazily-paged
mmap, no second allocation.
- FALLBACK (Vulkan / Android, Windows, no-host-ptr device): allocate
via ggml_backend_alloc_ctx_tensors_from_buft() then read from disk
with fseek/fread and upload via ggml_backend_tensor_set(). Same path
as before but without the duplicate-context dance, and emits a clear
failure message if the alloc returns NULL.
- Replace single `buf_w` with `std::vector<ggml_backend_buffer_t>
bufs_w` (Metal will create multiple sub-buffers; CPU/Vulkan keep one).
- Track mmap_addr/mmap_size on the model and munmap in
smolvla_free_model AFTER backend buffers are released.
- Mirror diffusion's CMake: define GGML_BACKEND_DL on Android so the
addon's TUs see the same flag the qvac-fabric ggml port was built
with.
The previous duplicate-context and pointer-remap code is removed
entirely. Tensors stay in the single ctx_data, and either the mmap or
alloc+copy path populates their data pointers in place.
Validation
Linux desktop (Vulkan device probed but CPU path engaged):
- 4/4 integration tests pass, 23/23 asserts pass
- alloc+copy fallback exercised: total weights 2127.2 MB, 739 tensors
- Quality vs PyTorch HuggingFaceVLA/smolvla_libero:
max|Δ|=0.0009, mean|Δ|=0.00003, cos=1.0000 (350 values)
matches the prior baseline (max|Δ|=0.0006 on M4 Metal).
- 2/2 C++ unit tests pass.
The mmap path needs Device Farm iOS to validate end-to-end; the
fallback is exercised on every desktop run today.
* fix: use 64-bit fseek for >2GB GGUF read on Windows + 32-bit POSIX
Win32 integration test in run 24980777510 (commit dc46a306) failed at:
smolvla_load_model: failed to read tensor 'v.enc.blk.7.ffn_down.bias'
at offset 2149428256
Root cause: the fallback alloc+copy path called fseek() with the offset
cast to (long). On Windows long is 32-bit (LLP64), so any offset above
2^31-1 (2,147,483,647 bytes, ≈2.15 GB) is silently truncated. The smolvla
GGUF is ~2.13 GB of weight data and the failing offset (2149428256) is
already past that limit, so tensors beyond the 2^31 mark cannot be
reached. The same trap exists on 32-bit POSIX targets, where off_t
defaults to 32-bit unless _FILE_OFFSET_BITS=64 is defined.
Fix:
- Define _FILE_OFFSET_BITS=64 at the top of smolvla.cpp before any
system header so off_t / fseeko / ftello are 64-bit on POSIX.
- In the fallback path use _fseeki64() on Windows and fseeko() on
POSIX (both 64-bit-clean).
- Add explicit <cstdio>/<cstdint> includes since we now reference
the 64-bit variants directly.
The mmap fast path (Apple Metal, CPU-with-host-ptr) is unaffected —
it never calls fseek; mmap addresses are pointer-sized.
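For illustration, a minimal sketch of a 64-bit-clean seek along the lines of the fix above. The helper names `fitsIn32BitLong` and `seekAbsolute64` are hypothetical, not the names used in smolvla.cpp. The `_fseeki64` and `fseeko` calls are the standard Windows CRT and POSIX functions.

```cpp
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64  // must precede system headers to widen off_t
#endif
#include <cstdint>
#include <cstdio>
#include <limits>
#if !defined(_WIN32)
#include <sys/types.h>  // off_t
#endif

// True when the offset survives a cast to a 32-bit long (the LLP64 case).
inline bool fitsIn32BitLong(std::uint64_t offset) {
    return offset <=
           static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Seek to an absolute file offset through a 64-bit API.
// Returns true on success.
inline bool seekAbsolute64(std::FILE* f, std::uint64_t offset) {
#if defined(_WIN32)
    return _fseeki64(f, static_cast<__int64>(offset), SEEK_SET) == 0;
#else
    return fseeko(f, static_cast<off_t>(offset), SEEK_SET) == 0;
#endif
}
```

The offset from the failing log line, 2149428256, is just above the 32-bit limit of 2147483647, so `fitsIn32BitLong` returns false for it.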
Validation
- Linux desktop alloc+copy fallback path still passes:
- 4/4 integration tests, 23/23 asserts
- 739 tensors, total 2127.2 MB loaded, all tensors past the
2 GB boundary read correctly
- Quality vs PyTorch HuggingFaceVLA/smolvla_libero unchanged:
max|Δ|=0.0009, mean|Δ|=0.00003, cos=1.0000 (350 values)
Win32 needs a CI roundtrip to confirm the fix end-to-end.
* refactor[bc]: align qvac-lib-infer-vla with canonical addon shape
- index.js: replace synchronous VlaModel(ggufPath) with the canonical
constructor ({ files, config, logger, opts }) and add load / run / unload /
pause / cancel / getState built on @qvac/infer-base's createJobHandler +
exclusiveRunQueue and @qvac/logging. run() returns a QvacResponse and the
underlying synchronous binding is driven through job.start/output/end.
- index.d.ts: update typings to match the new async API.
- package.json: declare @qvac/logging, @qvac/infer-base, bare-fs, bare-path
runtime deps; add top-level test, coverage:cpp* scripts; rewire
test:integration to generate test/integration/all.js (and chain
test:mobile:generate); replace scaffold description with the real one;
pin cmake-bare to 1.7.5 and bump brittle to ^3.16.5.
- CMakeLists.txt: add ENABLE_COVERAGE / VK_PROFILING options and replace the
ENV-probe ANDROID_STL block with the canonical option().
- on-merge workflow: rename display name to "On Merge Trigger (Vla)".
- integration tests: switch to the new constructor + await load/run/unload
flow.
* feat[notask]: scaffold new addons in canonical shape
Update the new-addon skill so a freshly scaffolded addon ships with the
canonical shape used across the monorepo, removing the consistency-fix
round-trip that qvac-lib-infer-vla just had to absorb.
- templates/index.js: replace the synchronous sayHello() wrapper with a
canonical class. Constructor `({ files, config, logger, opts })` validates
`files.model` like every other addon; lifecycle is `load` / `run` / `unload`
/ `pause` / `cancel` / `getState`; `run()` returns a `QvacResponse` driven
through `createJobHandler` + `exclusiveRunQueue` from `@qvac/infer-base`,
with logging via `@qvac/logging`. The hello-world `binding.sayHello()` call
is driven inline so synchronous backends still flow through the standard
job interface.
- templates/index.d.ts: typings updated to match the new async surface.
- templates/package.json: declare the canonical runtime deps
(`@qvac/infer-base`, `@qvac/logging`, `bare-fs`, `bare-path`); add
top-level `npm test`, `coverage:cpp:*` scripts; rewire `test:integration`
through `test:integration:generate` (which also chains
`test:mobile:generate`); pin `cmake-bare` to exact `1.7.5` and bump
`brittle` to `^3.16.5` to match `qvac-lib-infer-llamacpp-llm`. The
backend-specific deps placeholder is renamed `BACKEND_NPM_DEPS` and is
appended inside the canonical dependencies block (with a leading comma).
- templates/CMakeLists.txt: add `option(ANDROID_STL ...)`,
`option(ENABLE_COVERAGE ...)`, `option(VK_PROFILING ...)` so the
prebuild workflow's `vk-profiling` input and the `coverage:cpp` scripts
actually reach CMake.
- templates/test/integration/addon.test.js: switch to the new constructor
+ await load/run/unload flow; add a constructor-validation test.
- SKILL.md: document the canonical class shape contract, update the
substitution table for `BACKEND_NPM_DEPS`, expand the verification step
to include `npm test`, and update the next-step hint so the developer
preserves the constructor signature and lifecycle when filling in the
real model logic.
* Revert "feat[notask]: scaffold new addons in canonical shape"
This reverts commit 1abbc96bf40a975499bdb2ba2a6950003a43407b.
* fix: address VLA review feedback — JS/CI consistency, correctness, perf
Consistency
- package.json: add `build:pack` and `mobile:copy-prebuilds` scripts so the
mobile workflow stops falling back to its inline `npm pack` and warning
about missing prebuild fan-out.
- integration-mobile-test-qvac-lib-infer-vla.yml: rename the Device Farm log
artifact from `devicefarm-logs-llamacpp-embed-` to `devicefarm-logs-vla-`
and pin `actions/upload-artifact` to the canonical SHA used elsewhere in
the repo. Document that the `_LLAMACPP_EMBED` Device Farm secrets are
intentionally shared (no dedicated `_VLA` secrets are provisioned yet).
Correctness
- index.js: clear `_hasActiveResponse` synchronously on both the success
and failure paths. Previously the catch re-threw before the trailing
`.finally(...)` cleanup wired up, so a native-side inference error left
the model permanently `RUN_BUSY` until `unload()`. The success path's
cleanup ran one microtask late, leaving a window where chained `run()`
calls could observe the stale flag.
- index.js: `pickPrimaryGgufPath` now matches `-0*1-of-N.gguf` instead of
any shard index, so multi-shard models always pick shard 1 regardless of
the input array order.
- test/integration/addon.test.js: drain the redirect / non-2xx response
body via `res.resume()` so `bare-https` releases the underlying socket
before we follow the redirect or fail.
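The `-0*1-of-N.gguf` rule from the `pickPrimaryGgufPath` item lives in index.js. The C++ sketch below only illustrates the pattern. The function name `pickPrimaryShard` is hypothetical, and falling back to the first path when no shard-1 name is found is an assumption, not confirmed behaviour.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical illustration: return the path whose name ends in
// "-<zeros>1-of-<N>.gguf", regardless of its position in the input.
std::string pickPrimaryShard(const std::vector<std::string>& paths) {
    static const std::regex first_shard("-0*1-of-[0-9]+\\.gguf$");
    for (const std::string& p : paths) {
        if (std::regex_search(p, first_shard)) return p;
    }
    // Assumption: single-file models simply use the first entry.
    return paths.empty() ? std::string() : paths.front();
}
```

The leading `-` before `0*1` keeps names such as `-00011-of-00020.gguf` from matching, since shard 11 would otherwise end in the same `1-of-` suffix.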
Performance
- addon.js: rewrite `preprocessImage` to do bilinear resize, letterbox-pad
and the [0,1]→[-1,1] shift in a single pass over the output buffer. Drops
the `src` and `resized` intermediates (3 × 3 MB allocations → 1) and
hoists the per-output-pixel coordinates out of the channel loop so all
three channels share one set of weights. Adds an optional `opts.scale`
override so callers that already know the pixel range skip the
256-element scan in `detectScale`.
- test/integration/addon.test.js: replace the per-chunk float division +
`toFixed` percentage compare in `_streamDownload`'s `'data'` handler
with a byte-threshold check; the 2.2 GB GGUF download no longer pays
per-chunk floating-point overhead just to gate a log every 50 MB.
* fix: address VLA review feedback — C++ correctness + perf
Correctness
- AddonJs.hpp: introduce a `VlaHandle` indirection wrapper so an explicit
`destroyVlaModel` can null out the inner `VlaModel*` while the GC
finalizer still owns the heap-allocated wrapper. Previously the eager
`delete` in `destroyVlaModel` left a dangling pointer in the JS external
slot that the GC finalizer would then re-`delete` (use-after-free /
double-free). `unwrap` now throws when the model has been destroyed
rather than dereferencing a freed pointer.
- smolvla.cpp (mmap fast path): reject the host-ptr buffer path when
`data_offset >= file_size` (would underflow `tensor_data_size` to a
huge `size_t`) or when `st.st_size > SIZE_MAX` (would truncate the
mapping length on 32-bit targets where the GGUF won't fit anyway).
Falls through to the alloc+copy path with a clearer diagnostic.
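A minimal sketch of the two mmap-path guards described above, assuming the mapping length is computed as `file_size - data_offset`. The function name `tensorDataMapSize` and its signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: compute the length of the tensor-data region to map.
// Returns false when the host-ptr path must be rejected.
bool tensorDataMapSize(std::uint64_t file_size, std::uint64_t data_offset,
                       std::size_t* out_size) {
    // The file length must be representable as size_t; otherwise the
    // mapping length would be truncated (relevant on 32-bit targets).
    if (file_size > std::numeric_limits<std::size_t>::max()) return false;
    // An offset at or past EOF would make the subtraction below wrap
    // around to a huge unsigned value.
    if (data_offset >= file_size) return false;
    *out_size = static_cast<std::size_t>(file_size - data_offset);
    return true;
}
```

A caller that gets `false` would skip the mmap path and take the alloc+copy fallback, as the commit describes.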
Performance
- AddonJs.hpp / AddonCpp.hpp: switch the `runVlaModel` JS→C++ boundary to
zero-copy. `typedArrayPtr<T>()` returns the underlying ArrayBuffer
pointer + length via `js_get_typedarray_info` directly; `VlaModel::run`
now takes raw `const T*` + lengths instead of `std::vector` copies.
Drops one `std::vector<float>` copy per image (~3 MB each at
3×512×512 f32) plus state/tokens/noise copies on every inference call.
The mask still copies into a small `bool` buffer because the inference
signature requires `const bool*`; the copy is 48 bytes so it's not
worth restructuring smolvla_inference_with_timing's ABI.
- smolvla.cpp (ODE loop): hoist the per-step `te_single` allocation out
of the loop and replace the 50-iteration `memcpy` broadcast with a
doubling pattern (~7 memcpy calls instead of 50). Drop the redundant
per-step KV cache re-upload — the KV inputs are uploaded once before
the loop via `ggml_set_input`, and `ggml_backend_sched` preserves
input-tagged tensors between `ggml_backend_sched_graph_compute` calls
while the scheduler is not reset.
Not addressed in this commit
- The post-sg2 KV mini-graph re-extraction (16 separate per-layer
graphs after the main SmolLM2 forward). Eliminating this requires
pinning the K/V output tensors to a host-allocated CPU buffer so
gallocr cannot overwrite them between compute calls — a deeper
graph-allocator restructure that needs end-to-end validation against
the PyTorch reference assertion. Tracking as a follow-up; the perf
win there is large (roughly 2× SmolLM2 stage cost).
* fix: guard te_single broadcast against chunk_size=0
The doubling-pattern memcpy in the ODE loop unconditionally copied one
row of te_single before checking chunk_size. With chunk_size == 0 the
te_expanded buffer is empty and that initial memcpy would overflow.
The pre-existing per-step loop didn't have this hazard because the
for-loop simply didn't run.
In production chunk_size is always 50, but adding the guard keeps the
fast path defensive.
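The guarded doubling broadcast can be sketched as follows. This is an illustration under assumed names (`broadcastRowDoubling`, `row_len`, `n_rows`), not the code in smolvla.cpp. It returns the number of memcpy calls so the "~7 instead of 50" figure can be checked.

```cpp
#include <cstddef>
#include <cstring>

// Fill dst with n_rows copies of row (row_len floats each) by repeatedly
// doubling the already-filled prefix. Returns the number of memcpy calls.
std::size_t broadcastRowDoubling(const float* row, std::size_t row_len,
                                 std::size_t n_rows, float* dst) {
    // Guard: with zero rows the destination is empty, so even the seed
    // copy below would write out of bounds.
    if (n_rows == 0 || row_len == 0) return 0;

    std::memcpy(dst, row, row_len * sizeof(float));  // seed row
    std::size_t calls = 1;
    std::size_t filled = 1;
    while (filled < n_rows) {
        // Copy at most the filled prefix, so source and target never overlap.
        const std::size_t remaining = n_rows - filled;
        const std::size_t n = filled < remaining ? filled : remaining;
        std::memcpy(dst + filled * row_len, dst, n * row_len * sizeof(float));
        filled += n;
        ++calls;
    }
    return calls;
}
```

For 50 rows the filled prefix grows 1, 2, 4, 8, 16, 32, 50, which is one seed copy plus six doubling copies.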
* feat: gate VLA GPU backend selection on Adreno < 800
Mirrors lib-infer-diffusion / qvac-lib-infer-llamacpp-llm: when the loaded
ggml plugins expose an Adreno GPU below the 800 series, fall back to the
CPU backend instead of `ggml_backend_dev_init`-ing it. The Qualcomm
OpenCL ICD on Adreno < 800 has incomplete OpenCL 3.0 support, broken
kernel compilation for several ggml ops, and shared-memory OOMs;
the Vulkan driver on those generations also misbehaves on some ggml
ops. Older Snapdragon devices that get added to the Device
Farm pool will now run on CPU rather than crashing on `init`.
Adds:
- `addon/src/utils/BackendSelection.{hpp,cpp}` with
`parseAdrenoModel(description)` and `pickBestGpuDevice()`. Pure logic,
testable without the JS bridge.
- `test/unit/test_backend_selection.cpp` exercising the Adreno parser
on the description shapes ggml emits ("Adreno (TM) 830", "Adreno 740",
case variations, non-Adreno).
- `smolvla_load_model` now uses `pickBestGpuDevice()` instead of
`ggml_backend_dev_by_type(GPU)`, so Adreno < 800 falls through to
the CPU init below.
Tests: 7/7 C++ unit (was 2), 6/6 JS unit, 4/4 integration; lint clean.
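As an illustration of the gate, here is a minimal sketch of an Adreno description parser. It is not the code in BackendSelection.cpp. The names `parseAdrenoModelSketch` and `shouldFallBackToCpu` are hypothetical, and returning -1 for "not an Adreno" is an assumption about the interface.

```cpp
#include <cctype>
#include <string>

// Hypothetical parser: returns the Adreno model number found in a device
// description, or -1 when the description is not an Adreno or has no number.
int parseAdrenoModelSketch(const std::string& description) {
    std::string lower;
    for (unsigned char c : description) {
        lower.push_back(static_cast<char>(std::tolower(c)));
    }
    std::size_t pos = lower.find("adreno");
    if (pos == std::string::npos) return -1;
    pos += 6;  // length of "adreno"
    // Skip separators such as " (TM) " until the first digit.
    while (pos < lower.size() &&
           !std::isdigit(static_cast<unsigned char>(lower[pos]))) {
        ++pos;
    }
    if (pos == lower.size()) return -1;
    int model = 0;
    while (pos < lower.size() &&
           std::isdigit(static_cast<unsigned char>(lower[pos])) &&
           model < 100000) {  // bound the value to avoid overflow
        model = model * 10 + (lower[pos] - '0');
        ++pos;
    }
    return model;
}

// The gate described above: only Adreno GPUs below the 800 series fall back.
bool shouldFallBackToCpu(const std::string& description) {
    const int model = parseAdrenoModelSketch(description);
    return model >= 0 && model < 800;
}
```

Non-Adreno devices return -1 and are never sent to CPU by this gate.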
* feat: tag VLA perf-report rows with execution provider and ship a
dedicated mobile perf artifact
Without these, the Adreno < 800 gate that just landed has no observable
signature in CI: a Samsung S22/S23 falling from Vulkan to CPU shows up
only as a 5–20× total_ms increase in the perf-report tables, with no
column saying *why*. You'd have to scrape stderr to attribute the
regression. This change closes both gaps.
(a) Backend-name plumbing
- `AddonCpp.hpp::VlaModel::backendName()` returns the ggml backend name
("CPU", "Vulkan", "OpenCL", "Metal", …) via `ggml_backend_name(...)`,
with fallbacks for the unloaded / nameless cases.
- `AddonJs.hpp::getVlaBackendName(handle)` exposes it as a JS string
binding; `binding.cpp` registers it.
- `index.js`: `_load()` reads `binding.getVlaBackendName(this._handle)`
and stashes it in `this._backendName`; `get backendName()` exposes it;
`unload()` clears it.
- `index.d.ts`: documented as `readonly backendName: string | null`.
- `test/integration/addon.test.js`: passes the value as
`execution_provider` to `_perfReporter.record(...)`. Step Summary
tables (and the JSON artifact) now show one of `CPU`/`Vulkan`/
`OpenCL`/`Metal`/`unknown` per row, so a Vulkan→CPU regression is
immediately visible.
(b) Dedicated mobile perf artifact
`integration-mobile-test-qvac-lib-infer-vla.yml` already uploaded
`devicefarm-logs-vla-…` containing everything Device Farm produced, but
the perf-report was buried in there as either a file in
customer-artifacts or a `[PERF_REPORT_*]` marker run on stdout. Added a
post-download step that:
- Walks the downloaded `devicefarm-logs/<platform>` tree.
- First tries to find `perf-report.json` shipped directly as a Device
Farm file artifact (the test writes it to writable paths on Android
/ iOS, which Device Farm packs into customer-artifacts).
- Falls back to single-block `[PERF_REPORT_START]…[PERF_REPORT_END]`
marker scraping.
- Falls back to chunked `[PERF_CHUNK:id:i:n]…` reassembly (sorts by
index, validates the resulting JSON parses).
- Writes `mobile-perf/perf-report-<platform>.json` and uploads it as
artifact `vla-perf-mobile-<platform>` (mirrors the desktop workflow's
`vla-perf-<platform>-<arch>-<os>` naming for symmetry).
- Emits `::warning::` rather than failing the job when no perf data is
found, so this never breaks an otherwise-green CI run.
Verified: lint clean, 6/6 JS unit, 4/4 JS integration, 7/7 C++ unit;
workflow YAML parses.
* fix: restore per-step KV cache upload in VLA ODE loop
Earlier perf #4 dropped the per-step ggml_backend_tensor_set for the
KV cache inputs on the assumption that ggml_set_input + the sched
allocator preserves input slots between ggml_backend_sched_graph_compute
calls. That holds for sched-managed multi-backend setups (where Tesla
T4 + Vulkan still produces cos_sim=0.99999 / max|Δ|=0.020 vs the
PyTorch reference), but it breaks two paths that actually run in CI:
- CPU-only (alloc_staged_simple → ggml_gallocr → graph_compute)
reuses input slots across compute calls, so steps 1–9 read garbage
KV.
- The Adreno Vulkan driver on the Samsung S25 Ultra Device Farm slot
has the same effective semantics and crashed the addon test with the
same divergence pattern.
Symptom on linux-x64 / linux-arm64 GitHub-hosted runners (CPU backend):
cos_sim = 0.3135 (threshold > 0.9), max|Δ| = 1.65 (threshold < 0.25).
Restoring the per-step upload unconditionally trades ~80 MB of H2D
traffic per inference on Vulkan-sched setups for correctness on every
backend. A conditional restore (skip on sched paths) would recover
that perf, but the branch isn't worth the correctness risk in this
PR.
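For reference, the two quoted metrics can be computed as in the sketch below. The real assertion lives in the JS integration test. The struct and function names here are hypothetical, and only the thresholds (cos_sim > 0.9, max|Δ| < 0.25) come from the text above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct ToleranceReport {
    double cos_sim;
    double max_abs_diff;
    bool pass;
};

// Hypothetical helper: compare predicted actions against a reference using
// cosine similarity and the maximum absolute element-wise difference.
ToleranceReport compareToReference(const std::vector<float>& actual,
                                   const std::vector<float>& ref,
                                   double min_cos = 0.9,
                                   double max_delta = 0.25) {
    ToleranceReport r{0.0, 0.0, false};
    // A shape mismatch or empty input is reported as a failure.
    if (actual.size() != ref.size() || actual.empty()) return r;

    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const double a = actual[i];
        const double b = ref[i];
        dot += a * b;
        norm_a += a * a;
        norm_b += b * b;
        r.max_abs_diff = std::max(r.max_abs_diff, std::fabs(a - b));
    }
    if (norm_a > 0.0 && norm_b > 0.0) {
        r.cos_sim = dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }
    r.pass = r.cos_sim > min_cos && r.max_abs_diff < max_delta;
    return r;
}
```

With the CPU-backend values reported above (cos_sim = 0.3135, max|Δ| = 1.65) both conditions fail.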
* test: pin bare-tls/bare-https to 2.x for VLA mobile tests
bare-tls@3.0.0 (published 2026-04-28) flips on default certificate
verification with the commit "Load default trust store and reject
untrusted certificates by default", and bare-https@3.0.0 (same day)
widens its dep from bare-tls@^2.0.0 to ^3.0.0. With no populated
trust store inside the Bare Android/iOS runtime, every TLS handshake
to the SmolVLA presigned S3 URL fails:
[vla-model] downloading: https://REMOVED-S3-BUCKET.s3.eu-central-1...
[vla-model] retry 1/2 after 500ms (last: CERTIFICATE_VERIFY_FAILED: Handshake failed)
not ok 1 - mobile model fetch failed
runAddonTest: FAIL (3/4 passed)
Confirmed across both Pixel 9 Pro and Samsung Galaxy S25 Ultra on
runs 25066695862 and 25074966624. Same root cause would hit any
addon whose mobile suite installs after 2026-04-28; NMTCPP and
Parakeet's last green runs predate the publish.
Pin both packages to the highest published 2.x (2.2.3 / 2.1.3) via
npm overrides until upstream ships a CA-bundle-aware bare-tls. If
the npm install layer is what bare-pack resolves at app-build time,
this restores the previous (non-validating) behavior and unblocks
mobile CI; if BareKit's baked-in bare-tls wins instead, we'll see
the same handshake error and need a runtime-level fix.
* Revert "test: pin bare-tls/bare-https to 2.x for VLA mobile tests"
The override block placed in this addon's package.json had no effect
on the failing mobile run (25092791397 logcat shows the same
CERTIFICATE_VERIFY_FAILED). The reason is that bare-link / bare-pack
both run from tetherto/qvac-test-addon-mobile's node_modules at
app-build time, and npm's `overrides` only apply in the root project
of `npm install` — when this addon is installed transitively from
that repo, the overrides are silently dropped.
The fix lives in tetherto/qvac-test-addon-mobile#38 instead. Reverting
here to keep dead config out of the addon.
* refactor: rename packages/qvac-lib-infer-vla -> packages/vla
Match the directory name to the npm package name (`@qvac/vla`),
mirroring the diffusion-cpp rename done in #1786. The previous
`packages/qvac-lib-infer-vla` carried over from the lib-infer-*
naming era and no longer matched what gets published.
Renamed:
- packages/qvac-lib-infer-vla/ -> packages/vla/
- .github/workflows/on-pr-ocr-onnx.yml -> on-pr-vla.yml
- .github/workflows/integration-mobile-test-...vla.yml -> integration-mobile-test-vla.yml
- .github/workflows/integration-test-...vla.yml -> integration-test-vla.yml
- .github/workflows/on-merge-...vla.yml -> on-merge-vla.yml
- .github/workflows/on-pr-close-...vla.yml -> on-pr-close-vla.yml
- .github/workflows/prebuilds-...vla.yml -> prebuilds-vla.yml
`on-pr-ocr-onnx.yml` was the source of yesterday's pull_request_target
mix-up — its content is the VLA workflow but the filename meant
GitHub kept resolving the OCR workflow from main on PR events.
Renaming it to `on-pr-vla.yml` fixes that.
Updated path/slug references inside workflows + package metadata:
- `packages/qvac-lib-infer-vla` -> `packages/vla`
- artifact prefix `qvac-lib-infer-vla-` -> `vla-`
- `package-slug: qvac-lib-infer-vla` -> `vla`
- `package.json` `repository.directory` + `homepage`
- `vcpkg.json` top-level `name`
- perf reporter addon name in `test/integration/addon.test.js`
- SKILL.md references in `packages/ocr-onnx/.agent/`
Kept (mirroring diffusion-cpp's rename):
- C++ internal symbols (`BARE_MODULE("qvac-lib-infer-vla", ...)`,
`add_bare_module(qvac-lib-infer-vla ...)` in CMakeLists). These
are stable native-binding identifiers, not paths.
* refactor: keep on-pr-ocr-onnx.yml filename until tmp-vla merges to main
Reverting just the `on-pr-ocr-onnx.yml` -> `on-pr-vla.yml` rename
from the previous commit. Reason: GitHub Actions requires
`workflow_dispatch` workflow files to exist on the default branch
to be registered; until tmp-vla lands in main, the new
`on-pr-vla.yml` is unknown to the API and `gh workflow run` 404s.
Keeping the file at the historical `on-pr-ocr-onnx.yml` path on
tmp-vla means:
- `gh workflow run on-pr-ocr-onnx.yml --ref tmp-vla` continues to
work (it was the dispatch target throughout this branch).
- The file's *content* is still the VLA workflow as before; only
the filename is preserved for dispatch compatibility.
The proper rename to `on-pr-vla.yml` should be a follow-up PR opened
after tmp-vla is merged into main, mirroring the timing diffusion-cpp
used in #1786 (the rename happened on main, where its workflows were
already registered). Other workflow renames in this branch
(integration-test-vla, on-merge-vla, prebuilds-vla, etc.) are kept
because they're consumed via `uses:` from the dispatch workflow, not
dispatched directly — file existence on the default branch isn't
required for those.
* feat: run VLA integration tests on CPU and GPU side-by-side
Add a `backend` matrix dimension to integration-test-vla and
integration-mobile-test-vla so every GPU-equipped runner is
exercised twice — once with the runner's preferred accelerator
(Metal / Vulkan) and once forced onto CPU. Result: a clean
per-platform "GPU vs CPU" delta in the perf-report artifact set
for the same hardware, the same model, the same test vector.
Plumbing:
- smolvla.cpp: read VLA_FORCE_CPU env var (any non-empty,
non-"0" value) before vla_backend_selection::pickBestGpuDevice.
When set, skip GPU pick and fall through to the existing CPU
init path. One getenv + one if-guard.
- integration-test-vla.yml: dual rows for ai-run-linux-gpu /
mac-mini-m4 / ai-run-windows11-gpu (the runners with a real
GPU). Linux arm64 + Linux x64 hosted + macOS x64 hosted have
no GPU prebuild; one row each (auto == cpu effectively).
`VLA_FORCE_CPU` plumbed via env: matrix.backend == 'cpu'.
perf-report artifact name now includes the backend so both
rows of the same OS land in separate files.
- integration-mobile-test-vla.yml: 4 rows total (Android+iOS
× auto+cpu). The bundled smolvla-urls.json now carries a
`forceCpu` flag derived from matrix.backend, since env vars
don't propagate to BareKit's child process the way they do
on desktop. devicefarm-logs and vla-perf-mobile artifact
names include the backend.
- addon.test.js: when running on mobile, read forceCpu from the
bundled config and set process.env.VLA_FORCE_CPU before
VlaModel.load(). The C++ side reads the env identically on
every platform.
Cost:
- +3 desktop matrix rows (-> 10 total). Three new GPU runners
× ~5 min each = ~15 extra runner-minutes per CI cycle.
- +2 mobile matrix rows (-> 4 total). Doubles Device Farm spend
for VLA mobile, but VLA mobile only ran one config before so
this is the first time we'll see CPU vs GPU on phone.
Notable: Pixel 9 Pro's Adreno 730 already falls through to CPU
under `auto` (gated by Adreno < 800 in BackendSelection.cpp), so
its `cpu` row is redundant in practice. Kept for matrix symmetry
and uniform artifact set; can be pruned later if Device Farm
spend matters.
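For reference, the "any non-empty, non-"0" value" rule for `VLA_FORCE_CPU` can be sketched as below. This is written in JavaScript purely for illustration; the real check is a getenv in smolvla.cpp, and `isForceCpu` is a hypothetical name, not something in the addon.

```javascript
'use strict';

// Hypothetical helper mirroring the rule described above:
// unset, empty, or "0" means "do not force CPU"; anything else forces CPU.
function isForceCpu(env) {
  const value = env.VLA_FORCE_CPU;
  return typeof value === 'string' && value !== '' && value !== '0';
}

console.log(isForceCpu({ VLA_FORCE_CPU: '1' }));
```

Treating "0" as off keeps `VLA_FORCE_CPU=0` usable as an explicit opt-out in a workflow matrix.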
* refactor: run VLA CPU/GPU comparison in one process per runner
Replace the workflow-level `backend: [auto, cpu]` matrix with an
explicit `backend` argument on `VlaModel.load()`. The integration
test now loads + runs the model twice in a single Bare process —
once on the runner's preferred backend (Metal/Vulkan/Adreno/…) and
once forced onto CPU — so each CI runner produces one perf-report
artifact carrying both rows. Halves CI runner-minutes, drops the
duplicated model download/install, and gives a single artifact per
host with a clean side-by-side comparison.
JS surface:
- `VlaModel.load({ backend: 'auto' | 'cpu' })`. Default `'auto'`.
- Plumbed into `binding.createVlaModel(ggufPath, backend)` →
`VlaModel(ggufPath, forceCpu)` → `smolvla_load_model(..., force_cpu)`.
C++:
- `smolvla_load_model` gains an explicit `bool force_cpu` parameter;
`pickBestGpuDevice` is skipped when set. The `VLA_FORCE_CPU` env-var
fallback is removed — the param is the only knob now.
Test:
- addon.test.js loops `['auto', 'cpu']` inside the same e2e test.
Each iteration owns its own VlaModel and `unload()`s before the
next one starts, so memory-constrained mobile devices don't hold
two copies of the weights at once. Two perf-report rows per
artifact, distinguished by both `test` name and `execution_provider`.
CI:
- integration-test-vla.yml drops the `backend` matrix dimension —
7 rows total instead of 10 (3 GPU runners × 2 + 4 CPU-only × 1).
- integration-mobile-test-vla.yml drops the dual-row mobile matrix
(4 → 2). The `forceCpu` field in `smolvla-urls.json` is gone since
the bundled config no longer needs to communicate the backend choice.
- Artifact names lose the `-${backend}` suffix.
Verified locally on linux-x64 (Vulkan): auto=2.55s, cpu=10.4s; both
rows quality-clean (cos sim ≈ 1.0 vs PyTorch reference).
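A minimal sketch of the per-backend loop described above, assuming only the load / run / unload shape of the model. `FakeVlaModel` and `runBothBackends` are hypothetical stand-ins so the snippet runs without the native addon; the real test in addon.test.js may be structured differently.

```javascript
'use strict';

// Hypothetical stand-in for VlaModel: only the call shape matters here.
class FakeVlaModel {
  constructor() {
    this.loaded = false;
    this.backend = null;
  }

  async load({ backend = 'auto' } = {}) {
    if (backend !== 'auto' && backend !== 'cpu') {
      throw new RangeError(`unknown backend: ${backend}`);
    }
    this.backend = backend;
    this.loaded = true;
  }

  async run() {
    if (!this.loaded) throw new Error('run() called before load()');
    return { backend: this.backend };
  }

  async unload() {
    this.loaded = false;
  }
}

// Run each backend in turn. The model is unloaded before the next one is
// created, so only one copy of the weights is resident at a time.
async function runBothBackends(createModel) {
  const rows = [];
  for (const backend of ['auto', 'cpu']) {
    const model = createModel();
    await model.load({ backend });
    try {
      rows.push(await model.run());
    } finally {
      await model.unload();
    }
  }
  return rows;
}
```

The `finally` block is a design choice of this sketch: it releases the model even when `run()` throws, so a failed `auto` iteration does not leave weights resident while `cpu` loads.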
* fix: surface VLA mobile perf-report (mirror OCR's working path)
Two pre-existing breakages converged to give us empty
`vla-perf-mobile-*` artifacts on every prior run:
1. addon.test.js's mobile inline reporter only flushed via
`process.on('exit')`. On Device Farm the BareKit-hosted process is
torn down before that handler fires, so the
`[PERF_REPORT_START]…[PERF_REPORT_END]` markers never reach
logcat / iOS console — and the perf-report.json file is never
written to the device.
2. The workflow's inline Node extractor only handled clean text. It
didn't strip the Android logcat line prefix
(`MM-DD HH:MM:SS.mmm PID TID …:`) or the BareKit ReactNativeJS
bridge wrapper (`'[Bare]', '...'`), so even when chunked markers
*did* land in a log they failed to parse.
Replicate OCR's canonical mobile perf-report path:
- addon.test.js: after each `_perfReporter.record(...)` on mobile,
call `writeReport()` + `writeToConsole()` immediately, mirroring
packages/ocr-onnx/test/integration/utils.js. The exit-handler
flush stays for desktop. Each call is idempotent — overwriting
the file with N records is fine since the report is cumulative.
- integration-mobile-test-vla.yml: replace the inline Node
extractor with a call to `scripts/perf-report/extract-from-log.js`
(the same script OCR mobile uses). It already handles logcat
prefix stripping, ReactNativeJS bridge unwrapping, JS-string
`\'` escapes, chunk reassembly, and `schema_version` validation.
Verified locally (linux-x64) that the test still emits the
two-backend perf-report with both rows; quality unchanged.
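The log clean-up steps listed above (logcat prefix stripping, bridge unwrapping, `\'` unescaping, chunk reassembly) can be sketched as follows. This is not the code of `scripts/perf-report/extract-from-log.js`; the regular expressions and the assumption that chunks arrive on consecutive lines are guesses made for illustration.

```javascript
'use strict';

const START = '[PERF_REPORT_START]';
const END = '[PERF_REPORT_END]';

// Assumed logcat shape: "MM-DD HH:MM:SS.mmm PID TID LEVEL TAG: message".
const LOGCAT_PREFIX = /^\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}\s+\d+\s+\d+\s+\S+\s+[^:]*:\s?/;
// Assumed bridge wrapper shape: '[Bare]', '<payload>'.
const BARE_WRAPPER = /^'\[Bare\]', '(.*)'$/;

// Strip the logcat prefix and, if present, the bridge wrapper and \' escapes.
function unwrapLine(line) {
  let text = line.replace(LOGCAT_PREFIX, '');
  const match = BARE_WRAPPER.exec(text);
  if (match) text = match[1].replace(/\\'/g, "'");
  return text;
}

// Reassemble chunks (assumed contiguous) and parse the JSON between markers.
// Returns null when no complete report is present.
function extractPerfReport(logText) {
  const joined = logText.split(/\r?\n/).map(unwrapLine).join('');
  const start = joined.indexOf(START);
  if (start === -1) return null;
  const end = joined.indexOf(END, start + START.length);
  if (end === -1) return null;
  return JSON.parse(joined.slice(start + START.length, end));
}
```

A real extractor also has to cope with unrelated log lines interleaved between chunks and validate `schema_version`; both are left out here.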
* fix: render VLA quality Step Summary table correctly
Two bugs in the quality table emitted to GITHUB_STEP_SUMMARY:
1. The `Max |Δ|` and `Mean |Δ|` column labels contain literal pipe
characters that markdown parses as column separators, so the
3-column quality table was rendered as if it had 5 columns. Escape
the pipes (`\|`) so they render as text.
2. Cosine similarity was rendered with `(v * 100).toFixed(1) + '%'`,
which collapses any value at or above ~0.9995 to "100.0%", losing
the precision that makes the metric useful for spotting regressions.
Add a `cos-sim` column unit that prints raw `toFixed(8)`
(e.g. `0.99999999`) so identical-looking near-perfect runs stay
distinguishable.
Applies to both the desktop reporter (writeStepSummary) and the
mobile render-step-summary script.
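Both fixes are small enough to show in isolation. The helper names below are hypothetical, not the reporter's actual function names; only the `\|` escape and the `toFixed(8)` formatting come from the description above.

```javascript
'use strict';

// Escape literal pipes so markdown does not read them as column separators.
function escapeCell(text) {
  return String(text).replace(/\|/g, '\\|');
}

// Old rendering: one decimal of a percentage, which saturates near 1.0.
function formatPercent(v) {
  return (v * 100).toFixed(1) + '%';
}

// New rendering: raw value with eight decimals.
function formatCosSim(v) {
  return v.toFixed(8);
}

// Build one markdown table row from already-formatted cells.
function renderRow(cells) {
  return '| ' + cells.map(escapeCell).join(' | ') + ' |';
}

console.log(renderRow(['cos-sim', 'Max |Δ|', 'Mean |Δ|']));
console.log(formatPercent(0.99999), formatCosSim(0.99999));
```

With these helpers a cosine of 0.99999 prints as `0.99999000` instead of `100.0%`, and the header row keeps its three columns.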
* feat: render mobile VLA perf-report into GitHub Step Summary
The mobile job uploaded `vla-perf-mobile-Android` for the first time
on commit 1d605a2d, but nothing was rendering it into the Actions
Step Summary tab — so the per-device CPU-vs-GPU table only showed
up for desktop runners. Wire `scripts/perf-report/render-step-summary.js`
into the mobile workflow so each device's report (Pixel 9 Pro,
Galaxy S25 Ultra, …) emits the same compact markdown table the
desktop reporter writes.
`extract-from-log.js` writes per-device subdirs when Device Farm
runs more than one phone in the pool, so the new step loops over
every `performance-report.json` under `mobile-perf/` and appends a
fresh table per device, matching OCR's mobile pattern.
* feat: optimize VLA inference with op fusion and KV-projection hoist
Three measurable graph-level changes in `build_transformer_layer` and
`build_denoise_step_graph`, validated against the existing PyTorch
reference (`pt_actions_libero_fixed.json`, 350 values):
- **Hoist cross-attn K/V projections out of the ODE loop.** The action
expert's `k_proj`/`v_proj` against the VLM KV cache only depend on
inputs that are invariant across the 10 ODE denoise steps. Project
once after SmolLM2 forward and overwrite `kv_keys_data[i]` /
`kv_vals_data[i]` for cross-attn layers in place — eliminates 16
layers x 9 redundant steps = 144 matmul-pairs per inference.
- **Replace `scale -> +mask -> soft_max` triples with `ggml_soft_max_ext`**
at the 4 live attention sites. Bit-for-bit equivalent, fewer graph
nodes, helps backends with non-trivial kernel-launch overhead.
- **Replace `silu(gate) * up` with `ggml_swiglu_split`** at the 2 live
SwiGLU MLP sites.
Final cumulative speed (warm bench, median of iter 2-5, vs baseline tip):
| Backend | total baseline | total final | Delta |
|---|---:|---:|---:|
| auto (Vulkan / Intel Iris Xe) | 2345 ms | 2247 ms | -4.2% |
| cpu | 10084 ms | 9921 ms | -1.6% |
ODE inner loop specifically: -6.9% auto, -2.6% cpu - that's where the
cross-attn KV hoist lands. Accuracy unchanged: max|delta|=0.0032 auto /
0.0009 cpu, cos=1.00000.
Also adds:
- `test/bench.js`: warm-bench harness (loads model once, runs N
inferences, reports per-stage min/med/max). Single-run integration
timings showed up to 2x variance from system load on this dev box,
unsuitable for A/B comparison.
- `test/unit/test_flash_attn.cpp`: gtest comparing `ggml_flash_attn_ext`
against the unfused reference on synthetic Q/K/V at the SmolLM2
prefill shapes. Documents the **F16-mask + `GGML_PREC_F32` recipe**
required to call flash-attn correctly (F32 mask is silently accepted
but produces structured-but-shifted output, cos~0.28). The recipe
works correctness-wise; it's currently 3x slower than the unfused
matmul on Intel Iris Xe Vulkan (no matrix cores) but plausibly faster
on Adreno/Metal. To be re-evaluated on the mobile device farm before
enabling, ideally gated on `has_matrix_cores`.
- `opt.md`: per-optimization log with implementation, accuracy, speed,
and the failed/skipped attempts (drop-GQA-repeat broke CPU mul_mat
broadcast; time-MLP split linears regress on strided weight matmul;
flash-attn-ext requires F16 mask, see above).
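The "median of iter 2-5" statistic and the Delta column in the table above can be reproduced with a few lines. This is a sketch of the arithmetic only, not the contents of `test/bench.js`; `summarize` and `percentDelta` are hypothetical names.

```javascript
'use strict';

// Median of a list of numbers (mean of the two middle values when even).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Drop the first `warmup` iterations, then report min / median / max.
function summarize(samplesMs, warmup = 1) {
  const warm = samplesMs.slice(warmup);
  if (warm.length === 0) throw new RangeError('no warm samples left');
  return { min: Math.min(...warm), med: median(warm), max: Math.max(...warm) };
}

// Relative change of `current` against `baseline`, in percent.
function percentDelta(baseline, current) {
  return ((current - baseline) / baseline) * 100;
}

console.log(percentDelta(2345, 2247).toFixed(1) + '%');
console.log(percentDelta(10084, 9921).toFixed(1) + '%');
```

Discarding the first iteration keeps one-off costs such as shader compilation and page faults out of the A/B comparison.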
* fix[ci]: address HIGH security findings in vla CI workflows
- prebuilds-vla.yml: drop unconditional `printenv` step that dumped
AWS_OIDC_ROLE_ARN, NPM_TOKEN, PAT_TOKEN, and other resolved env-var
secrets to public CI logs.
- integration-test-vla.yml: drop `npm config list` from the run-state
diagnostics; it printed the just-written .npmrc, leaking the npm and
GPR _authToken values. Replaced `npm list` with `npm list --depth=0`
to keep dependency visibility without the dump.
- integration-test-vla.yml, cpp-tests-vla.yml: route ${{ github.token }}
through a `GH_TOKEN` env var instead of inline shell interpolation in
`git config` invocations, so it gets standard secret masking and
doesn't end up in the runner process listing.
* chore: drop opt.md, untrack vla performance-report.json
- opt.md was a 497-line scratch log of the VLA op-fusion / KV-projection
optimization work. The summary belongs in the PR description, not in
the repo tree.
- packages/vla/test/results/performance-report.json is regenerated by
every CI run and uploaded as a workflow artifact; it has no business
living in source control. Gitignore the directory and stop tracking
the file (file kept on disk for any local working sessions).
* fix: address review quick-wins for vla addon
Correctness:
- action_dim default is now 7 across the C++ hparams struct, the GGUF
fallback, and generate_reference.py. The integration test now hard-fails
on a (chunk_size, action_dim) shape mismatch instead of skipping the
PyTorch quality gate with a comment, so a regression in either side
shows up as a failed assertion. Added an explicit hparams unit-test
assertion for action_dim.
- mmap loader bails out cleanly when ggml_backend_tensor_alloc fails for
any tensor: it frees the buffer, munmaps the file, and falls through
to the alloc+copy path instead of leaving partially-wired tensors with
invalid pointers and pretending success.
- smolvla_inference_with_timing rejects out-of-range n_images, lang_len,
and state_dim before they feed into n_visual_tokens / prefix_len /
tensor sizing, where bad values would underflow int math and cause
out-of-bounds writes during graph build.
Security:
- mmap loader validates every per-tensor (offset, nbytes) against the
mapped region before wiring, so a crafted GGUF cannot point a tensor
past the end of the mapping.
- Mobile workflow builds smolvla-urls.json with `jq` so the presigned
URL cannot break out of its JSON string, and replaces the partial
`head -c 120` echo (which leaked the bucket host and X-Amz-Credential
prefix) with a byte-count confirmation.
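The per-tensor (offset, nbytes) validation amounts to an overflow-safe range check. A sketch of that check, in JavaScript for illustration (the loader itself is C++, and `tensorFitsMapping` is a hypothetical name):

```javascript
'use strict';

// Returns true only if [offset, offset + nbytes) lies inside a mapping of
// `mappedSize` bytes. Arguments must be integers (Number or BigInt).
// The check is written as `nbytes <= size - offset` rather than
// `offset + nbytes <= size`, so it stays correct in fixed-width arithmetic
// where the sum could wrap around.
function tensorFitsMapping(offset, nbytes, mappedSize) {
  const off = BigInt(offset);
  const len = BigInt(nbytes);
  const size = BigInt(mappedSize);
  if (off < 0n || len < 0n || size < 0n) return false;
  if (off > size) return false;
  return len <= size - off;
}

console.log(tensorFitsMapping(8, 4, 10));
```

The subtraction form is the part worth carrying over to C++: with `uint64_t` operands, `offset + nbytes` can wrap and pass a naive comparison.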
Performance:
- Precompute the sinusoidal time-embedding period table at load time.
The per-ODE-step embedding now does 360 multiply / sinf / cosf calls
instead of paying for 360 powf evaluations per step (~3,600 powf calls
per inference eliminated). Hint the kernel with MADV_WILLNEED on the
zero-copy mmap path so first inference doesn't demand-page through
the 2+ GB GGUF.
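The time-embedding change can be sketched as below: the exponentiation moves into a table built once, and each ODE step is left with one multiply, one sin and one cos per entry. The geometric period schedule and the `minPeriod` / `maxPeriod` parameters are assumptions for illustration; the source only states the 360-entry count and the powf removal.

```javascript
'use strict';

// Built once at load time. Assumed schedule: periods spaced geometrically
// between minPeriod and maxPeriod (hypothetical parameters).
function buildScaleTable(halfDim, minPeriod, maxPeriod) {
  const table = new Float64Array(halfDim);
  for (let i = 0; i < halfDim; i++) {
    const fraction = halfDim > 1 ? i / (halfDim - 1) : 0;
    const period = minPeriod * Math.pow(maxPeriod / minPeriod, fraction);
    table[i] = (2 * Math.PI) / period; // the only pow call, paid once
  }
  return table;
}

// Per-step work: one multiply, one sin, one cos per table entry.
// Output layout: all sines first, then all cosines.
function timeEmbedding(t, table) {
  const n = table.length;
  const out = new Float64Array(2 * n);
  for (let i = 0; i < n; i++) {
    const angle = t * table[i];
    out[i] = Math.sin(angle);
    out[n + i] = Math.cos(angle);
  }
  return out;
}

const table = buildScaleTable(360, 0.004, 4.0);
console.log(timeEmbedding(0.5, table).length);
```

With 360 entries and 10 ODE steps, the table removes 3,600 exponentiations per inference, matching the count quoted above.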
Dead code:
- Drop the unused smolvla_rope helper (whose comment claimed RoPE mode 0
while the body called NEOX), the unused to_bf16_precision helper, and
the leaky run_graph stub in test_flash_attn.cpp.
* refactor: adopt QvacErrorBase / ERR_CODES pattern in vla addon
Every other inference addon (parakeet, whispercpp, nmtcpp, ocr-onnx,
onnx-tts, llamacpp-llm, …) ships a lib/error.js with a package-specific
QvacErrorBase subclass and a frozen ERR_CODES map registered with
@qvac/error. VLA was the only one still throwing bare Error / TypeError /
RangeError, which prevents callers from branching on err.code and
breaks the localized message registry.
Adds packages/vla/lib/error.js with QvacErrorAddonVla and 9 codes in
the previously-unused 30001..31000 range:
FAILED_TO_LOAD_WEIGHTS, FAILED_TO_DESTROY, MODEL_NOT_FOUND,
INVALID_CONFIG, MISSING_REQUIRED_PARAMETER, INVALID_INPUT,
JOB_ALREADY_RUNNING, INSTANCE_NOT_INITIALIZED, MODEL_UNLOADED.
index.js threads structured errors through the public surface: input
validation in validateRunInput now throws INVALID_INPUT; constructor
files.model checks raise MISSING_REQUIRED_PARAMETER / INVALID_CONFIG;
load() backend validation raises INVALID_CONFIG; binding load failures
are wrapped as FAILED_TO_LOAD_WEIGHTS with `cause` preserving the
underlying error; binding.destroyVlaModel failures during unload now
raise FAILED_TO_DESTROY instead of being swallowed; run-before-load and
run-while-busy raise INSTANCE_NOT_INITIALIZED and JOB_ALREADY_RUNNING;
in-flight jobs cancelled by unload see MODEL_UNLOADED on the failure
side. ERR_CODES and QvacErrorAddonVla are exported alongside VlaModel,
matching the OCR / parakeet pattern.
index.d.ts gains the QvacErrorAddonVla class and ERR_CODES literal-type
map. package.json declares @qvac/error ^0.1.0 as a dependency and adds
lib/ to the published files list.
Existing test assertions on /non-empty array/ and /absolute path/
continue to match the new structured messages — verified by running
test:unit (6/6 pass), test:integration sans GGUF (4/4 pass), and
test:dts.
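A rough sketch of the shape this gives callers, using a plain `Error` subclass instead of the real QvacErrorBase so it runs standalone. The numeric values and the `SketchVlaError` / `wrapLoadFailure` names are hypothetical; the actual assignments live in packages/vla/lib/error.js.

```javascript
'use strict';

// Hypothetical values inside the 30001..31000 range mentioned above.
const ERR_CODES = Object.freeze({
  FAILED_TO_LOAD_WEIGHTS: 30001,
  INVALID_INPUT: 30002
});

// Plain Error subclass standing in for the QvacErrorBase subclass.
class SketchVlaError extends Error {
  constructor(code, message, options) {
    super(message, options); // options.cause is stored by Error itself
    this.name = 'SketchVlaError';
    this.code = code;
  }
}

// Wrap a binding failure so callers can branch on err.code while the
// underlying error stays reachable through err.cause.
function wrapLoadFailure(underlying) {
  return new SketchVlaError(
    ERR_CODES.FAILED_TO_LOAD_WEIGHTS,
    'failed to load weights',
    { cause: underlying }
  );
}

try {
  throw wrapLoadFailure(new Error('mmap failed'));
} catch (err) {
  if (err.code === ERR_CODES.FAILED_TO_LOAD_WEIGHTS) {
    console.log(err.message, '<-', err.cause.message);
  }
}
```

Freezing the map means a consumer cannot accidentally reassign a code at runtime, which is what makes `err.code === ERR_CODES.X` a stable branch condition.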
* test: switch vla integration fixture to vision-Q8-quantized GGUF
Bumps the integration-test model from smolvla-libero-f32-fixed.gguf
(2026-04-21) to smolvla-libero-vision-q8.gguf (2026-04-30) — same
LIBERO checkpoint with Q8_0 quantization on the vision-encoder linear
weights. Cuts vision-stage time roughly in half on Vulkan and ~4× on
CPU (see test/results/perf reports).
Q8 on the vision encoder occasionally flips the gripper dim (action[6],
near-binary in [-1, 1]) at decision boundaries on the synthetic gray
fixture — measured max |Δ| ~0.6 on Vulkan, ~1.2 on CPU. Position /
rotation dims stay tight (mean |Δ| ≈ 0.01). LIBERO closed-loop eval
shows equivalent task success vs the F32 GGUF (70% for Q8 vs 60% for F32
across 30 episodes, within statistical noise). Tolerances loosen to max |Δ| 1.5
to absorb gripper sign flips and cosine >0.95 as the structural sanity
check.
Updates the S3 path in integration-test-vla.yml and the mobile presign
script to match.
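The quality gate implied by these tolerances can be sketched as follows. The thresholds (max |Δ| 1.5, cosine 0.95) come from the text above; the function names and the strictness of the comparisons are assumptions, not the test's actual code.

```javascript
'use strict';

// Both vectors must have the same length; a shape mismatch is an error,
// not a reason to skip the check.
function assertSameLength(a, b) {
  if (a.length !== b.length) {
    throw new RangeError(`shape mismatch: ${a.length} vs ${b.length}`);
  }
}

function cosineSimilarity(a, b) {
  assertSameLength(a, b);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB)); // NaN for a zero vector
}

function maxAbsDelta(a, b) {
  assertSameLength(a, b);
  let max = 0;
  for (let i = 0; i < a.length; i++) {
    max = Math.max(max, Math.abs(a[i] - b[i]));
  }
  return max;
}

// Assumed comparisons: max |Δ| strictly below the bound, cosine strictly above.
function passesQualityGate(actual, reference, { maxDelta = 1.5, minCos = 0.95 } = {}) {
  return maxAbsDelta(actual, reference) < maxDelta &&
    cosineSimilarity(actual, reference) > minCos;
}

console.log(passesQualityGate([1, 2, 3], [1, 2, 3]));
```

A single gripper sign flip moves max |Δ| a lot but barely changes the cosine over a 350-value action chunk, which is why the two thresholds are loosened by different amounts.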
* fix[ci]: prevent artifact poisoning in vla integration workflows
CodeQL (rule "Artifact poisoning") flagged 19 alerts on the VLA
workflows: actions/download-artifact was writing directly into the
workspace path (packages/vla/prebuilds, addon/packages/vla/prebuilds),
and subsequent steps (npm install, npm run bundle, npm run build:pack,
xcodebuild, npm run test:integration, …) execute code from that same
workspace. Combined with workflow_dispatch.inputs being user-controlled,
that's a path for a poisoned artifact to land code that then runs with
the workflow's secrets.
Fix mirrors the pattern PR #1728 applied to OCR / parakeet / nmtcpp /
diffusion / etc.: download into a runner.temp staging directory, then
add an explicit copy step to move the contents into the workspace.
CodeQL recognises the explicit cp as a maintainer-controlled boundary
and stops the dataflow trace.
Touches three download-artifact sites:
- integration-test-vla.yml: prebuilds → workspace
- integration-mobile-test-vla.yml: Android prebuilds → workspace
- integration-mobile-test-vla.yml: iOS prebuilds → workspace
* feat: add LIBERO sim eval driver + QVAC HTTP bridge under packages/vla/sim
Drops in a self-contained eval pipeline that scores SmolVLA on LIBERO
through either the QVAC GGUF addon (over HTTP) or the original PyTorch
policy, so the two are directly comparable on the same env seeds and
noise sequence.
Files:
packages/vla/sim/eval_libero_sim.py Python entry, --backend {qvac,pytorch}
packages/vla/sim/qvac_http_policy.py lerobot SmolVLAPolicy subclass that
routes the forward pass over HTTP
packages/vla/sim/smolvla_http.py binary-protocol HTTP client
packages/vla/sim/server/server.js Bare HTTP host for @qvac/vla
packages/vla/sim/server/package.json server runtime deps
packages/vla/sim/requirements.txt pinned Python deps (lerobot, libero,
robosuite, mujoco, etc.)
packages/vla/sim/README.md setup + run + compare runbook
Verified end-to-end on libero_spatial (10 tasks x 3 episodes = 30):
QVAC F32 GGUF (Vulkan): 18/30 = 60.0%
QVAC Q8 vision (Vulkan): 21/30 = 70.0%
PyTorch (CUDA): 21/30 = 70.0%
All within the n=30 noise band; Q8-vision matches PyTorch task-for-task on
9/10. lerobot itself is unmodified — the bridge works through its
public make_policy extension point + a Python class swap.
* chore: drop new-addon skill from vla branch
The new-addon skill scaffolding (added in earlier tmp-vla commits) is
unrelated to the SmolVLA addon work in PR #1784 and was being carried
along by accident. Removing it from this branch so the PR diff focuses
on the vla addon and the LIBERO sim eval driver only.
The skill itself can be re-introduced on its own branch / PR if still
wanted.
* chore: drop test_flash_attn.cpp + tighten the comment that referenced it
The attention path uses unfused mul_mat → soft_max_ext → mul_mat. The
flash-attn alternative was ~3× slower per layer on Intel Iris Xe Vulkan
when measured, so we never wired it into the production path. The test
existed only to keep a "side-by-side correctness vs the unfused path"
harness around in case we wanted to re-evaluate flash-attn on Adreno or
Mali later.
Removing 389 lines of test code that exercises a dead path; the pointer
in smolvla.cpp's attention block is rewritten so it captures the
"measured 3× slower on Iris Xe" finding without referring to the
deleted file.
* fix: address security + correctness findings from code review
Security (4):
* sim/server/server.js: cap request bodies at 32 MB (prevents heap-exhaust DoS
via unbounded POST). Reject early in the data-event handler with
req.destroy() instead of buffering until OOM.
* sim/server/server.js: validate every header field that flows into a typed
array length (state_dim, n_images, img_w, img_h, n_tokens). Without bounds,
a crafted client could ask for state_dim=2**30 and allocate gigabytes
before the C++ side even saw the request. Also bound the JSON header_len
itself to 64 KB and add a body-truncation check after the per-section reads.
* sim/server/server.js: drop model_path from /info response — it leaked the
on-disk GGUF location to anything that could reach the port.
* sim/server/server.js: adopt the published @qvac/vla async API
(`new VlaModel({ files: { model: [...] } })` + `await model.load()` +
`await model.run(...)`). The previous code used an older sync signature
that happened to match the version installed on the dev server but does
not match the API this PR ships, so /predict would 500 on every request
against a fresh install. Server now boots inside an async IIFE that awaits
load() before listen() begins accepting connections.
Correctness (3):
* smolvla.cpp: smolvla_create now calls smolvla_free_model() before delete on
load failure. The struct has no destructor, so the previous `delete model`
leaked any backend buffers / mmap regions / ggml contexts / backend handles
that smolvla_load_model had already initialised before failing.
* smolvla.cpp: replace the inline ODE-loop dispatch
(`sg3.sched ? sched_compute : graph_compute(backend_cpu, ...)`) with the
shared compute_staged helper. Avoids the foot-gun of hardcoding backend_cpu
on the fallback branch — if alloc_staged_sched ever returned with
sched==nullptr on a GPU build, the inline form would silently fire CPU
compute on GPU-allocated tensors.
* sim/qvac_http_policy.py: surface a clear RuntimeError when the batch has
no camera images, instead of crashing on `images_chw[-1]` while filling
dummy frames for empty cameras.
Verified:
* C++ rebuild + integration test: 4/4 tests pass, 41/41 asserts. Quality
numbers unchanged (Vulkan max|Δ|=0.588 cos=0.997; CPU max|Δ|=1.131
cos=0.989).
Two reviewer findings were verified as non-issues and intentionally not
fixed: the pos_ids = -1 bug doesn't trigger because n_images>=1 is enforced
upstream (so n_visual_tokens >= 64, so pos >= 64 before the lang loop), and
the GGUF mmap data_offset overflow is already caught by the existing strict
`<` check against st.st_size.
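The header-field validation from the Security list can be sketched as a table of bounds checked before any typed array is sized. The field names come from the commit text; the limits themselves and the `validateHeader` name are hypothetical, since the source does not state the actual bounds.

```javascript
'use strict';

// Hypothetical inclusive bounds; the commit names the fields but not the limits.
const LIMITS = Object.freeze({
  state_dim: { min: 1, max: 1024 },
  n_images: { min: 1, max: 8 },
  img_w: { min: 1, max: 4096 },
  img_h: { min: 1, max: 4096 },
  n_tokens: { min: 1, max: 4096 }
});

// Throws RangeError unless every field is an integer within its bounds.
// Runs before any typed array is allocated from these values.
function validateHeader(header) {
  for (const [field, { min, max }] of Object.entries(LIMITS)) {
    const value = header[field];
    if (!Number.isInteger(value) || value < min || value > max) {
      throw new RangeError(`${field} out of range: ${value}`);
    }
  }
  return header;
}

try {
  validateHeader({ state_dim: 2 ** 30, n_images: 1, img_w: 512, img_h: 512, n_tokens: 48 });
} catch (err) {
  console.log(err.message);
}
```

`Number.isInteger` also rejects missing fields, strings, NaN and fractional values, so a malformed header fails the same way as an oversized one.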
* fix: server.js — use response.await() pattern + opts.stats:true
Two issues introduced by the previous review-fix commit (f9d0f4d3):
1. `model.run()` returns a QvacResponse, not `{ actions, stats }`. The
destructure was awaiting the call once and pulling `actions`/`stats`
directly off the response object, but those fields don't exist on
QvacResponse — they live behind `response.await()`. Result: every POST
/predict crashed encodeResponse with `Cannot read properties of
undefined (reading 'buffer')`. Switching to the canonical two-step
…
aegioscy
added a commit
that referenced
this pull request
May 26, 2026
…h priority fixes
**Critical Issues (C1–C7):**
- C1: Thread-local callbacks already implemented (tl_progressCtx, tl_abortModel)
- C2: Gate unused preview_mode config (parsed but never wired)
- C3: Fix memory leak on generate_image() exception paths using RAII wrappers
- C4: Null-check generate_image/video returns, throw StatusError on failure
- C5: Implement applyFluxImg2ImgDimDefaults() for FLUX img2img dimension defaults
- C6: Harden VideoStableDiffusion (LoRA rejection; end_image/flf2vid deferred)
- C7: Harden mapAddonEvent with explicit Uint8Array checks and documentation
**High Priority (H1–H12) - Previously completed:**
- Shared integer parsing (requireInt, requirePositiveInt, etc.) with overflow guards
- Standardized cancellation errors via makeCancelledError()
- JS input validation (dimensions, prompts, image coercion)
- Overflow checks in image resizing & AVI encoding
- Cooperative cancellation in video post-generation
- TypeScript .d.ts synchronization
**Infrastructure:**
- Scaffold local vcpkg overlay port for Wan I2V VAE-tiling patch
- Restore portfile.cmake + supporting config files
- Pin to stable-diffusion-cpp@00cd2a09 (registry #4) for SD_BACKEND_PREF_AUTO
**Files Changed:**
C++ handlers, model interface, utilities: integer parsing, error handling, memory safety
JavaScript: input validation, FLUX dimension defaults, video params, event mapping
TypeScript: type definitions for new exports and corrected runtime behavior
vcpkg: local overlay + patch machinery for I2V fix
Closes #HIGH-PRIORITY, fixes i2v model loading via patched VAE tiling.
Co-authored-by: Cursor <cursoragent@cursor.com>
aegioscy
added a commit
that referenced
this pull request
May 26, 2026
- overlay portfile: bump stable-diffusion-cpp pin from 00cd2a09 (#4) to 747a1801 (#5) so EsrganUpscaler.cpp's sd_upscaler_device_t and new_upscaler_ctx_with_device resolve; patch still applies cleanly
- SdModel.cpp processVideo: revert init_image / control_frames dimension mismatch from resize to throw, matching C++ unit test expectations
- test_wan_video.cpp: remove all flf2vid and endImageBytes tests (flf2vid was removed from the C++ layer); update ValidationThrowClearsThreadLocalState to use img2vid instead
Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000
added a commit
to Zbig9000/qvac
that referenced
this pull request
May 26, 2026
…l registry PR lands
Drops the previous shortcut of pointing the addon's vcpkg `default-registry` baseline at my personal fork. Instead, the vcpkg port files being added in the companion qvac-registry-vcpkg PR tetherto#169 are vendored into the addon as an overlay port so CI can validate the addon-side migration end-to-end against the WIP port without depending on the fork staying alive.
Layout:
vcpkg-overlays/whisper-cpp/: verbatim copy of the qvac-registry-vcpkg PR tetherto#169 port tree (portfile.cmake + vcpkg.json + patches/0001-move-gnuinstalldirs-before-add-subdirectory-src.patch).
vcpkg-configuration.json: default-registry is restored to tetherto/qvac-registry-vcpkg at HEAD (6df36b4f), and a new top-level "overlay-ports" entry points at the vendored copy.
Process this unblocks (per Gustavo's merge protocol):
1. THIS commit: addon validates against WIP port via overlay (no fork dependency).
2. CI greens on the addon PR, which proves the migration is safe.
3. Merge order is now flexible: registry PR tetherto#169 (and any follow-up registry PRs) can be merged independently.
4. After registry merges, the next commit on the addon branch removes vcpkg-overlays/whisper-cpp/, bumps the default-registry baseline to the new tetherto/main SHA, and re-runs CI to prove the addon still resolves the port from the merged registry.
5. Then the addon PR is merged.
Verified locally on x64-linux:
- npx bare-make generate resolves whisper-cpp[core,vulkan]@1.8.5 from the overlay path and ggml-speech[core,vulkan]@2026-04-09#4 from tetherto/main (logged as "whisper-cpp[core,vulkan]:x64-linux@1.8.5 -- /home/.../vcpkg-overlays/whisper-cpp" and "ggml-speech[core,vulkan]:x64-linux@2026-04-09#4 -- git+https://github.com/tetherto/qvac-registry-vcpkg.git@b9dab610").
- bare-make build + install: clean. Final prebuild stages libqvac-speech-ggml-{cpu,vulkan}.a (speech-prefixed, which confirms ggml-speech consumption, not bundled).
- npm run test:cpp: 106 / 107 pass (1 pre-existing skip; 0 failures, 0 regressions). Backend identity capture verified from the test log: "Active GPU backend: id=2 name='Vulkan' device='NVIDIA GeForce RTX 5090' mem_total_mb=32607 mem_free_mb=31149".
Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000
added a commit
to Zbig9000/qvac
that referenced
this pull request
May 26, 2026
…l registry PR lands

Drops the previous shortcut of pointing the addon's vcpkg `default-registry` baseline at my personal fork. Instead, the vcpkg port files being added in the companion qvac-registry-vcpkg PR tetherto#169 are vendored into the addon as an overlay port so CI can validate the addon-side migration end-to-end against the WIP port without depending on the fork staying alive.

Layout:
- vcpkg-overlays/whisper-cpp/: verbatim copy of the qvac-registry-vcpkg PR tetherto#169 port tree (portfile.cmake + vcpkg.json + patches/0001-move-gnuinstalldirs-before-add-subdirectory-src.patch).
- vcpkg-configuration.json: default-registry is restored to tetherto/qvac-registry-vcpkg at HEAD (6df36b4f), and a new top-level "overlay-ports" entry points at the vendored copy.

Process this unblocks (per Gustavo's merge protocol):
1. THIS commit: addon validates against WIP port via overlay (no fork dependency).
2. CI greens on the addon PR, which proves the migration is safe.
3. Merge order is now flexible: registry PR tetherto#169 (and any follow-up registry PRs) can be merged independently.
4. After registry merges, the next commit on the addon branch removes vcpkg-overlays/whisper-cpp/, bumps the default-registry baseline to the new tetherto/main SHA, and re-runs CI to prove the addon still resolves the port from the merged registry.
5. Then the addon PR is merged.

Verified locally on x64-linux:
- npx bare-make generate resolves whisper-cpp[core,vulkan]@1.8.5 from the overlay path and ggml-speech[core,vulkan]@2026-04-09#4 from tetherto/main (logged as "whisper-cpp[core,vulkan]:x64-linux@1.8.5 -- /home/.../vcpkg-overlays/whisper-cpp" and "ggml-speech[core,vulkan]:x64-linux@2026-04-09#4 -- git+https://github.com/tetherto/qvac-registry-vcpkg.git@b9dab610").
- bare-make build + install: clean. Final prebuild stages libqvac-speech-ggml-{cpu,vulkan}.a (speech-prefixed, which confirms ggml-speech consumption, not bundled).
- npm run test:cpp: 106 / 107 pass (1 pre-existing skip; 0 failures, 0 regressions). Backend identity capture verified from the test log: "Active GPU backend: id=2 name='Vulkan' device='NVIDIA GeForce RTX 5090' mem_total_mb=32607 mem_free_mb=31149".

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000
added a commit
to Zbig9000/qvac
that referenced
this pull request
May 26, 2026
…ggml PR tetherto#13 HEAD

Wires Zbig9000/qvac-ext-ggml@QVAC-18992-merge-ggml-from-whisper-cpp@d39c0d29 (qvac-ext-ggml PR tetherto#13) into the addon's vcpkg-configuration.json as an overlay port, alongside the existing whisper-cpp overlay (registry PR tetherto#169). This lets the addon's full CI matrix exercise BOTH:
- whisper-cpp 1.8.5 from registry PR tetherto#169 (already present)
- ggml-speech 2026-05-26 from qvac-ext-ggml PR tetherto#13 (new)
before either underlying PR is merged to its respective registry/branch.

Overlay diff vs registry's ggml-speech@2026-04-09 #4:
- REF/SHA512 → PR tetherto#13 HEAD (d39c0d29)
- new vulkan dep on spirv-headers
- new patch 0001-ggml-vulkan-find-spirv-headers.patch wiring SPIRV-Headers into ggml-vulkan (PR tetherto#13's v0.10.2 sync adds #include <spirv/unified1/spirv.hpp> but upstream ggml-vulkan CMakeLists.txt never finds SPIRV-Headers; the same fix should be pushed upstream later and the patch dropped)
- version-date / port-version bumped so vcpkg picks overlay over registry

Local validation with both overlays active:
- vcpkg dep graph: ggml-speech resolves from vcpkg-overlays/ggml-speech, whisper-cpp from vcpkg-overlays/whisper-cpp, spirv-headers from microsoft/vcpkg
- cryptographic confirmation: buildtree src/ggml-vulkan/ggml-vulkan.cpp sha256 IDENTICAL to qvac-ext-ggml@d39c0d29:src/ggml-vulkan/ggml-vulkan.cpp, GGML_VERSION = 0.10.2 (PR tetherto#13's upstream sync)
- linux-x64 cpp tests: 107/107 pass
- js suite: test:dts + lint + unit (30/30) + integration (10/10) + multiple + accuracy (Japanese WER 0%) + chunking (10-min audio) + live-stream-simulation + model-file-validation (5/5)
- cpp-lint: clang-format clean, clang-tidy-19 0 user-code errors

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000
added a commit
that referenced
this pull request
May 28, 2026
… feature + GPU backend identity (QVAC-19236, QVAC-18992, QVAC-18993) (#2270)

* transcription-whispercpp 0.9.0: ggml-speech migration + metal feature + GPU backend identity in runtime stats

Three ticket deliverables combined into a single coordinated 0.9.0 release of the addon (paired with the whisper-cpp 1.8.5 + metal-feature port rewrite landing in qvac-registry-vcpkg companion PR):

QVAC-18992 - Migrate to use ggml speech branch
----------------------------------------------
Addon now consumes `whisper-cpp 1.8.5#0` which links the system-installed `ggml-speech` (port-version 4) via WHISPER_USE_SYSTEM_GGML=ON. Whisper + parakeet + tts all share the same libqvac-speech-ggml-* binary set on every triplet (was: whisper-cpp brought a separate libqvac-ggml-* set).

CMakeLists.txt: rewritten to mirror transcription-parakeet exactly, with a two-branch BACKEND_DL_LIBS / BACKEND_DL_LOOSE_SOS collection so the per-arch CPU IMPORTED targets and the MODULE Vulkan/OpenCL .so files (which ggml-config deliberately omits from GGML_AVAILABLE_BACKENDS) both get staged into prebuilds/<bare_target>/<module_name>/ for the runtime ggml_backend_load_all_from_path() scan. The old whisper-specific find_library fallback (created SHARED IMPORTED targets from raw .so paths to work around bundled-ggml's MODULE-target export gap) is removed: ggml-speech port surfaces what it can, BACKEND_DL_LOOSE_SOS catches the rest.

vcpkg-configuration.json default-registry baseline pinned to my fork for CI; will be re-pinned to tetherto/qvac-registry-vcpkg HEAD after the companion vcpkg-registry PR merges. vcpkg.json override bumped to whisper-cpp 1.8.5#0.

QVAC-19236 - Expose backend selection as features
-------------------------------------------------
Addon's vcpkg.json now selects whisper-cpp[metal] for osx (was unconditionally on via the portfile; now declarative). iOS dep entry stays without the [metal] feature until the separate iOS Metal/MTLCompiler XPC crash is investigated. iOS continues to ship on the CPU backend by simply not asking for [metal].

QVAC-18993 - Android dynamic-backend + per-device GPU assertion
---------------------------------------------------------------
Added a one-shot device introspection step at model load time: `WhisperModel::captureActiveBackendInfo()` enumerates the ggml backend registry (after ensureBackendsLoadedAndroid() loads the dynamic .so modules on Android) and records the first GPU/IGPU device's identity + memory snapshot. Result is surfaced through the existing runtimeStats() pipeline as three new keys (the RuntimeStats variant only takes double|int64_t, so backend identity is encoded as a stable numeric enum):

  gpuBackendId   0=CPU, 1=Metal, 2=Vulkan, 3=OpenCL, 4=CUDA, 99=other
  gpuMemTotalMb  -1 when the device does not expose memory accounting
  gpuMemFreeMb   -1 when the device does not expose memory accounting

The selected backend's full name + device description are also logged once via QLOG(INFO) so they're recoverable from the Android Device-Farm logcat capture for the human-readable assertion side (S25 -> "OpenCL" / "Adreno (TM) …", Pixel 9 -> "Vulkan" / "Mali-…").

Mobile-perf-runner.js now asserts the new keys are present and, on Android with use_gpu=true, that gpuBackendId resolves to either Vulkan (2) or OpenCL (3). The union covers both Device-Farm device families without needing a per-device branch from inside the bare spec (the device capabilities split lives in the wdio config, not here).

index.d.ts: extended RuntimeStats with the three new keys + the enum documentation.

CHANGELOG.md: consolidated 0.9.0 entry covering all three tickets.
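The numeric encoding described above can be sketched roughly as follows. This is not the addon's code: the substring matching, the `memBytesToMb` helper, the "zero total means no accounting" rule and the 1024 * 1024 divisor are all assumptions made for illustration; only the enum values and the -1 sentinel come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Numeric encoding exactly as listed in the commit message above.
enum class GpuBackendId : std::int64_t {
  Cpu = 0, Metal = 1, Vulkan = 2, OpenCl = 3, Cuda = 4, Other = 99
};

// Assumption: identity is derived by a case-insensitive substring match on
// the backend's reported name; the real helper may match differently.
inline GpuBackendId gpuBackendIdFromName(std::string name) {
  std::transform(name.begin(), name.end(), name.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  if (name.find("metal") != std::string::npos) return GpuBackendId::Metal;
  if (name.find("vulkan") != std::string::npos) return GpuBackendId::Vulkan;
  if (name.find("opencl") != std::string::npos) return GpuBackendId::OpenCl;
  if (name.find("cuda") != std::string::npos) return GpuBackendId::Cuda;
  if (name.find("cpu") != std::string::npos) return GpuBackendId::Cpu;
  return GpuBackendId::Other;
}

// Assumption: 1 MB is 1024 * 1024 bytes here.
constexpr std::int64_t kBytesPerMb = 1024LL * 1024LL;

// Hypothetical helper: converts a byte count to whole megabytes, returning the
// -1 sentinel when the device reports no memory accounting (modelled here as a
// zero total, which is an assumption).
inline std::int64_t memBytesToMb(std::uint64_t bytes, std::uint64_t totalBytes) {
  if (totalBytes == 0) return -1;
  assert(bytes <= totalBytes);  // free memory can never exceed the total
  return static_cast<std::int64_t>(bytes) / kBytesPerMb;
}
```

Encoding the identity as an integer keeps it inside a stats container that only carries double or int64_t values, while the human-readable name stays in the log.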
Verified locally on linux-x64: - npx bare-make generate succeeds (whisper-cpp 1.8.5 + ggml-speech 2026-04-09#4 resolve cleanly via my fork baseline) - npx bare-make build succeeds (.bare module + libqvac-speech- ggml-cpu.a + libqvac-speech-ggml-vulkan.a linked into prebuilds) - test:cpp passes: 106 / 107 (1 streaming case skipped, pre- existing; 0 failures, 0 regressions). Backend capture verified from the test log: `Active GPU backend: id=2 name='Vulkan' device='NVIDIA GeForce RTX 5090' mem_total_mb=32607 mem_free_mb=31342`. Co-authored-by: Cursor <cursoragent@cursor.com> * transcription-whispercpp: pin whisper-cpp WIP port as an overlay until registry PR lands Drops the previous shortcut of pointing the addon's vcpkg `default-registry` baseline at my personal fork. Instead, the vcpkg port files being added in the companion qvac-registry-vcpkg PR #169 are vendored into the addon as an overlay port so CI can validate the addon-side migration end-to-end against the WIP port without depending on the fork staying alive. Layout: vcpkg-overlays/whisper-cpp/ — verbatim copy of the qvac-registry-vcpkg PR #169 port tree (portfile.cmake + vcpkg.json + patches/0001-move-gnuinstalldirs-before- add-subdirectory-src.patch). vcpkg-configuration.json: default-registry is restored to tetherto/qvac-registry-vcpkg at HEAD (6df36b4f), and a new top-level "overlay-ports" entry points at the vendored copy. Process this unblocks (per Gustavo's merge protocol): 1. THIS commit — addon validates against WIP port via overlay (no fork dependency). 2. CI greens on the addon PR — proves the migration is safe. 3. Merge order is now flexible: registry PR #169 (and any follow-up registry PRs) can be merged independently. 4. After registry merges, the next commit on the addon branch removes vcpkg-overlays/whisper-cpp/, bumps the default-registry baseline to the new tetherto/main SHA, and re-runs CI to prove the addon still resolves the port from the merged registry. 5. Then the addon PR is merged. 
Verified locally on x64-linux: - npx bare-make generate resolves whisper-cpp[core,vulkan]@1.8.5 from the overlay path and ggml-speech[core,vulkan]@2026-04-09#4 from tetherto/main (logged as "whisper-cpp[core,vulkan]:x64-linux@1.8.5 -- /home/.../vcpkg-overlays/whisper-cpp" and "ggml-speech[core,vulkan]:x64-linux@2026-04-09#4 -- git+https://github.com/tetherto/qvac-registry-vcpkg.git@b9dab610"). - bare-make build + install: clean. Final prebuild stages libqvac-speech-ggml-{cpu,vulkan}.a (speech-prefixed — confirms ggml-speech consumption, not bundled). - npm run test:cpp: 106 / 107 pass (1 pre-existing skip; 0 failures, 0 regressions). Backend identity capture verified from the test log: "Active GPU backend: id=2 name='Vulkan' device='NVIDIA GeForce RTX 5090' mem_total_mb=32607 mem_free_mb=31149". Co-authored-by: Cursor <cursoragent@cursor.com> * transcription-whispercpp: clang-format + clang-tidy fixes on captureActiveBackendInfo() Caught locally by running the exact CI cpp-lint commands against this branch: git-clang-format --binary clang-format --extensions c,cc,cpp,... --diff "$(git merge-base HEAD upstream/main)" -- packages/transcription-whispercpp clang-tidy-19 -p build addon/src/model-interface/whisper.cpp/WhisperModel.cpp --header-filter='^.../packages/transcription-whispercpp/addon/...' --warnings-as-errors='*' Two findings, both in code added by the previous commit fab6888: 1. clang-format (8 hunks): include ordering (now grouped alphabetically per the project's IncludeBlocks rule), allman-style brace wrapping around the single-statement `if` bodies in gpuBackendIdFromName() and on the `dev == nullptr` early-continue in captureActiveBackendInfo(), and the column-limit-driven multi-line spread on the std::transform() call and the two gpu_mem_{total,free}_mb_ ternary assignments. 2. 
clang-tidy readability-identifier-naming on the new `K_BYTES_PER_MB` local constexpr: project convention enforced by .clang-tidy is `kBytesPerMb` (lower-camel with a `k` prefix) for function-scope constants, not SCREAMING_SNAKE. Renamed to kBytesPerMb at all three use sites. Re-validated after the fix: - clang-format --diff: no remaining diffs - clang-tidy-19 --warnings-as-errors='*': 0 user-code errors (4137 warnings, all suppressed as non-user-code per the header-filter regex) - npx bare-make generate + build + install: clean - npm run test:cpp: 107 / 107 pass (kBytesPerMb rename is a pure identifier change; behaviour is byte-for-byte identical and the Vulkan backend identity log still reports `Active GPU backend: id=2 name='Vulkan' device='NVIDIA GeForce RTX 5090' mem_total_mb=32607 mem_free_mb=31178`). - npm run test:dts: clean - npm run lint (standardJS): clean - npm run test:unit / test:integration / test:integration:multiple / test:integration:chunking / test:integration:accuracy (multi-lang incl. Japanese WER 0.00%) / test:integration:live-stream-simultion / test:unit:reload:esraw / test:integration:model-file-validation / test:integration:corrupted-model — all pass with the new formatted source. Confirms the new captureActiveBackendInfo() introduced in fab6888 would have been caught by CI on the first push; fixing locally before re-trigger avoids one CI cycle. Co-authored-by: Cursor <cursoragent@cursor.com> * transcription-whispercpp: add ggml-speech overlay pinned to qvac-ext-ggml PR #13 HEAD Wires Zbig9000/qvac-ext-ggml@QVAC-18992-merge-ggml-from-whisper-cpp@d39c0d29 (qvac-ext-ggml PR #13) into the addon's vcpkg-configuration.json as an overlay port, alongside the existing whisper-cpp overlay (registry PR #169). This lets the addon's full CI matrix exercise BOTH: - whisper-cpp 1.8.5 from registry PR #169 (already present) - ggml-speech 2026-05-26 from qvac-ext-ggml PR #13 (new) before either underlying PR is merged to its respective registry/branch. 
Overlay diff vs registry's ggml-speech@2026-04-09 #4: - REF/SHA512 → PR #13 HEAD (d39c0d29) - new vulkan dep on spirv-headers - new patch 0001-ggml-vulkan-find-spirv-headers.patch wiring SPIRV-Headers into ggml-vulkan (PR #13's v0.10.2 sync adds #include <spirv/unified1/spirv.hpp> but upstream ggml-vulkan CMakeLists.txt never finds SPIRV-Headers; the same fix should be pushed upstream later and the patch dropped) - version-date / port-version bumped so vcpkg picks overlay over registry Local validation with both overlays active: - vcpkg dep graph: ggml-speech resolves from vcpkg-overlays/ggml-speech, whisper-cpp from vcpkg-overlays/whisper-cpp, spirv-headers from microsoft/vcpkg - cryptographic confirmation: buildtree src/ggml-vulkan/ggml-vulkan.cpp sha256 IDENTICAL to qvac-ext-ggml@d39c0d29:src/ggml-vulkan/ggml-vulkan.cpp, GGML_VERSION = 0.10.2 (PR #13's upstream sync) - linux-x64 cpp tests: 107/107 pass - js suite: test:dts + lint + unit (30/30) + integration (10/10) + multiple + accuracy (Japanese WER 0%) + chunking (10-min audio) + live-stream-simulation + model-file-validation (5/5) - cpp-lint: clang-format clean, clang-tidy-19 0 user-code errors Co-authored-by: Cursor <cursoragent@cursor.com> * transcription-whispercpp: bump ggml-speech overlay to PR #13 HEAD e31785e4 Picks up the Apple-Metal build fix pushed to qvac-ext-ggml PR #13 (restores the lost 'typedef struct {' before ggml_metal_kargs_supertonic_depthwise_1d in src/ggml-metal/ggml-metal-impl.h). Without this bump the Apple-Metal prebuild matrix (darwin-arm64, ios-arm64, ios-arm64-simulator, ios-x64-simulator) fails to compile against PR #13's source. Local linux-x64 re-validation: vcpkg downloads the new tarball (e31785e4), applies the spirv-headers patch, builds clean, 107/107 C++ tests pass. 
Co-authored-by: Cursor <cursoragent@cursor.com> * vcpkg-overlays: sync ggml-speech overlay to registry post-merge state; bump version>=ggml-speech in whisper-cpp overlay Two related overlay corrections so the overlay tree is a verbatim mirror of what qvac-registry-vcpkg PR #169 will publish: 1. vcpkg-overlays/ggml-speech/ was still pinned to the pre-merge fork (Zbig9000/qvac-ext-ggml@QVAC-18992-merge-ggml-from-whisper-cpp@e31785e4, version-date 2026-05-26#0) from the days before tetherto/qvac-ext-ggml PR #13 merged. Synced wholesale to qvac-registry-vcpkg/ports/ggml-speech: REF e31785e4 -> c9126afc (merge commit of PR #13 on @speech) SHA512 <fork SHA> -> <tetherto SHA> HEAD_REF QVAC-18992-merge-ggml-from-whisper-cpp -> speech version-date 2026-05-26#0 -> 2026-05-27#0 description updated to drop "LOCAL OVERLAY" language Source-wise this is a no-op (c9126afc on @speech contains e31785e4 as its single PR-side parent, so the tree is identical), but the overlay must declare the exact REF/version that will land in the registry so the build is provably what gets published. 2. vcpkg-overlays/whisper-cpp/vcpkg.json: version>=ggml-speech bumped 2026-04-09#4 -> 2026-05-27. whisper-cpp@1.8.5 only works against the new ggml-speech (v0.10.2 vendored sources, new symbol set, spirv-headers Vulkan wiring), so the constraint must reflect that minimum. In practice the resolver always picked 2026-05-27 from the addon's own override, so this is metadata-only and not a behavior change. 
Local validation on x64-linux (vulkan feature) with synced overlays: - bare-make generate resolves ggml-speech[core,vulkan]@2026-05-27 (was 2026-05-26 with the stale overlay) + whisper-cpp[core,vulkan]@1.8.5 + spirv-headers (transitive from ggml-speech vulkan dep) - build links clean - npm run test:cpp -> 107/107 pass - npm run test:unit -> 30/30 pass - npm run test:dts -> clean Co-authored-by: Cursor <cursoragent@cursor.com> * vcpkg-overlays/ggml-speech: pin spirv-headers vulkan dep to version>=1.4.341.0 Mirrors the same fix in qvac-registry-vcpkg PR #169 so the overlay stays a verbatim copy of what the registry will publish. Without a version>= constraint, the resolved spirv-headers version depends entirely on the consumer's microsoft/vcpkg baseline; 1.4.341.0 is the version already used by qvac-fabric. Local validation on x64-linux: vcpkg upgrades spirv-headers from the addon's baseline 1.4.304.1 to the required 1.4.341.0, addon builds clean, 107/107 cpp tests + 30/30 unit tests pass. Co-authored-by: Cursor <cursoragent@cursor.com> * transcription-whispercpp: drop vcpkg overlays now that qvac-registry-vcpkg#169 is merged Step E of the cross-repo merge protocol: now that the registry PR has landed on tetherto/qvac-registry-vcpkg@main as b54eb17 ("whisper-cpp 1.8.5 + ggml-speech 2026-05-27 + tts-cpp/parakeet-cpp re-validation"), the addon no longer needs the WIP overlay ports. vcpkg-configuration.json: - default-registry.baseline 6df36b4f -> b54eb17 (the merge SHA of qvac-registry-vcpkg#169) - drop overlay-ports block (vcpkg-overlays/{whisper-cpp,ggml-speech}/) vcpkg-overlays/whisper-cpp/ -> removed vcpkg-overlays/ggml-speech/ -> removed The whisper-cpp version pin in vcpkg.json overrides is unchanged (still 1.8.5 / port-version 0), which now resolves straight from the registry. ggml-speech is pulled in transitively at 2026-05-27#0 (the new baseline). 
spirv-headers is pulled in transitively from microsoft/vcpkg at the 1.4.341.0 floor declared in the new ggml-speech port. Local validation on x64-linux (vulkan feature) against the merged registry, with no overlays: - bare-make generate resolves ggml-speech[core,vulkan]:x64-linux@2026-05-27 -> tetherto/qvac-registry-vcpkg git-tree c201f77 (identical to the overlay-phase tree -- proves the source code is the same as what CI ran the last 28/28 green matrix on) whisper-cpp[core,vulkan]:x64-linux@1.8.5 -> tetherto/qvac-registry-vcpkg git-tree d18888f (also identical to the overlay-phase tree) spirv-headers:x64-linux@1.4.341.0 -> microsoft/vcpkg (transitive via ggml-speech[vulkan]) The ggml-speech and whisper-cpp package-ABI hashes are byte-identical to the last overlay-phase run, confirming the registry resolution and the overlay resolution install the exact same content. - build links clean - npm run test:cpp -> 107/107 pass - npm run test:unit -> 30/30 pass - npm run test:dts -> clean Co-authored-by: Cursor <cursoragent@cursor.com> * transcription-whispercpp: revert default-registry baseline bump Address @jpgaribotti review on #2270: "Don't update the baseline." The whisper-cpp@1.8.5#0 override in vcpkg.json + the version>=ggml-speech and version>=spirv-headers constraints declared inside the new whisper-cpp and ggml-speech ports are enough to pull the new ports out of the registry's git history without bumping the baseline past a9d7e924 -- vcpkg's overrides walk the registry's versions/ database across history, they are not gated on the baseline tree. 
Local re-validation on x64-linux (vulkan), with baseline kept at a9d7e924 (the value already on tetherto/qvac@main):

bare-make generate resolves:
  ggml-speech[core,vulkan]:x64-linux@2026-05-27 -> git-tree c201f77
  whisper-cpp[core,vulkan]:x64-linux@1.8.5 -> git-tree d18888f
  spirv-headers:x64-linux@1.4.341.0 -> microsoft/vcpkg

All three resolved git-trees and package-ABI hashes match the previous baseline-bumped run byte-for-byte, confirming the dropped baseline change is purely a no-op for what gets installed. Build links clean, npm run test:cpp 107/107 pass, test:unit 30/30 pass, test:dts clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* transcription-whispercpp: address jpgaribotti review on backend identity API

Four review items on PR #2270:

1. Align BackendId numeric values with transcription-parakeet's BackendId enum (CPU=0, Metal=1, CUDA=2, Vulkan=3, OpenCL=4, Other=99). Whisper previously used (Metal=1, Vulkan=2, OpenCL=3, CUDA=4) which silently broke cross-addon device-farm comparison. While we're at it, rename gpuBackendId -> backendId and add a companion backendDevice (0=CPU, 1=GPU) so the RuntimeStats shape mirrors parakeet's. Public-API change but 0.9.0 hasn't shipped yet so no migration cost.

2. Replicate whisper.cpp's exact GPU selection in captureActiveBackendInfo() so the reported backend matches what whisper actually initialised against:
   - read use_gpu / gpu_device out of WhisperConfig (was: always enumerate, even for use_gpu=false)
   - pick GGML_BACKEND_DEVICE_TYPE_GPU only (was: GPU or IGPU -- whisper rejects IGPU, so reporting one would lie)
   - honour gpu_device index when set (was: ignored)
   Was: first-match enumeration across all GPU/IGPU devices, could disagree with whisper's pick on Android where Vulkan and OpenCL both register and ggml_backend_dev_get() order differs from whisper's preference.

3. Emit a WARNING through the addon logger when use_gpu=true was requested but no GPU device is registered (silent CPU fallback case). Mirrors ParakeetModel::loadModel()'s WARNING so the iOS/desktop mobile-perf paths stop hiding silent CPU fallback behind a "backendId !== null" assertion.

4. CHANGELOG.md: drop the "Re-pinned the default-registry baseline..." paragraph -- we're keeping the baseline conservative per the same review.

Files updated to keep everything in sync:
- addon/src/model-interface/whisper.cpp/WhisperModel.hpp: rename gpu_backend_id_ -> backend_id_, add backend_device_, rename gpu_backend_name_ -> backend_name_, update doc comment numbers.
- addon/src/model-interface/whisper.cpp/WhisperModel.cpp: rewrite backendIdFromName() -> backendIdFromRegName() with parakeet's numbering and the Metal/MTL alias parakeet uses; rewrite captureActiveBackendInfo() per items 2-3; switch runtimeStats() to emit backendDevice + backendId (was: gpuBackendId only).
- index.d.ts: rename gpuBackendId -> backendId, add backendDevice, introduce BackendId enum (re-exported from the namespace) with the same docstring shape parakeet uses; emphasise the cross-addon contract.
- test/integration/mobile-perf-runner.js: switch to backendDevice + backendId; flip the Android-GPU assertion union from "Vulkan=2 || OpenCL=3" to "Vulkan=3 || OpenCL=4"; also assert backendDevice is reported.
- CHANGELOG.md: rewrite the 0.9.0 "Added" runtime-stats bullet to describe the new field shape + numbering + BackendId enum, drop the baseline-bump paragraph.

Local validation on x64-linux (vulkan feature) with the conservative baseline (a9d7e924, no change):
- bare-make generate / build / install: clean
- npm run test:cpp -> 107/107 pass
- npm run test:unit -> 30/30 pass
- npm run test:dts -> clean (BackendId enum + new fields type-check)
- npm run test:integration -> 10/10 pass
- npm run test:integration:accuracy -> 8/8 pass
- npm run test:integration:chunking -> 1/1 pass
- git-clang-format --diff vs upstream/main: clean
- clang-tidy-19 -p build WhisperModel.cpp: 0 user-code warnings

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
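A minimal sketch of the selection rules from review items 1-3, written against a plain in-memory device list rather than the ggml registry. `Device`, `Selection`, `DeviceType` and `selectBackend` are hypothetical names for illustration, and falling back to the first GPU when `gpuDevice` is out of range is an assumption about whisper.cpp's behaviour, not something the commit message states.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ggml's device types.
enum class DeviceType { Cpu, Gpu, IGpu };

// Numbering from review item 1 (shared with transcription-parakeet).
enum class BackendId : std::int64_t {
  Cpu = 0, Metal = 1, Cuda = 2, Vulkan = 3, OpenCl = 4, Other = 99
};

struct Device {
  DeviceType type;
  BackendId backend;
};

struct Selection {
  std::int64_t backendDevice;  // 0=CPU, 1=GPU
  BackendId backendId;
  bool silentCpuFallback;      // true when a GPU was requested but none exists
};

inline Selection selectBackend(const std::vector<Device>& devices,
                               bool useGpu, int gpuDevice) {
  assert(gpuDevice >= 0);
  if (!useGpu) return Selection{0, BackendId::Cpu, false};

  const Device* chosen = nullptr;
  int gpuIndex = 0;
  for (const Device& dev : devices) {
    if (dev.type != DeviceType::Gpu) continue;  // integrated GPUs are skipped
    // Remember the first GPU, then prefer the one at the requested index.
    if (chosen == nullptr || gpuIndex == gpuDevice) chosen = &dev;
    if (++gpuIndex > gpuDevice) break;
  }
  // The caller would log a WARNING when silentCpuFallback is set.
  if (chosen == nullptr) return Selection{0, BackendId::Cpu, true};
  return Selection{1, chosen->backend, false};
}
```

With both an OpenCL and a Vulkan GPU registered, the index decides which one is reported, which is the Android case the review item calls out.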
jpgaribotti
pushed a commit
that referenced
this pull request
Jun 2, 2026
…encoder (#2237) * feat(diffusion-cpp): add Wan 2.1 I2V model download, FLF2V helpers, and VAE tiling patch Adds tooling and assets to support image-to-video (img2vid) and frame-to-frame interpolation (FLF2V) generation with the Wan 2.1 I2V 14B model in GGUF format. Additions: - scripts/download-model-wan-i2v.sh: downloads city96/Wan2.1-I2V-14B-480P-gguf Q4_K_M (~11 GB) plus VAE, T5-XXL, and CLIP ViT-H/14 vision encoder - examples/generate-shannon-flux.js: FLUX2-klein img2img helper to generate an end-frame at matching resolution (FLF2V requires both frames to share dims) - examples/generate-flf-end-frame.js: alternative img2vid-based frame generator - addon/examples/img2vid-wan-example.cpp + CMakeLists.txt: native C++ usage example - vcpkg/ports/patches/wan-i2v-encode-video-bypass-tiling.patch: patches stable-diffusion.cpp to skip 2D VAE tiling for 4D video tensors (avoids GGML_ASSERT failure during VAE encode in img2vid/flf2vid) - assets/claude-shannon-resized.jpg, assets/maks-original.jpg: example assets Note: This PR adds only NEW files; the corresponding C++ wiring for clipVision in addon/src/* and JS bindings in addon.js/video.js/index.js is tracked separately in feature/itv (b0e32e0) and will be ported in a follow-up PR once compatible with the post-history-rewrite addon refactor. 
Co-authored-by: Cursor <cursoragent@cursor.com> * feat(diffusion-cpp): port Wan 2.1 I2V C++ wiring and JS bindings from feature/itv - Port full addon/src C++ implementation: clipVisionPath support in SdCtxHandlers, AddonJs, and SdModel; FLF2V (first-last-frame-to-video) handlers in SdVidGenHandlers; updated AviWriter and SdVideoFrames for video generation - Add clipVisionPath to video.js and index.js configurationParams so the native addon receives the CLIP vision encoder path for I2V/FLF2V modes - Update img2vid-wan.js to default to the dedicated Wan 2.1 I2V 14B GGUF checkpoint with CLIP vision, replacing the T2V 1.3B placeholder - Update flf2vid-wan.js with production-ready FLF2V defaults, crossfade prompt, and releaseLogger() in finally block to prevent process hang - Update img2img-flux2.js and img2img-flux2-f16.js with clipVisionPath passthrough fix Co-authored-by: Cursor <cursoragent@cursor.com> * feat(diffusion-cpp): remove FLF2V interpolation, deliver I2V only Remove first-last-frame-to-video (flf2vid) mode from the public API: - Delete examples/flf2vid-wan.js and examples/generate-flf-end-frame.js - Remove 'flf2vid' from VIDEO_MODES and all end_image validation in video.js - Remove VideoMode 'flf2vid' and end_image field from video.d.ts Co-authored-by: Cursor <cursoragent@cursor.com> * feat(diffusion-cpp): remove flf2vid from C++ addon entirely Remove first-last-frame-to-video from the native layer: - SdModel.cpp: remove flf2vid mode branch, end_image decode/resize path, vidParams.end_image assignment, and endImg/endData locals - SdModel.hpp: remove endImageBytes field from GenerationJob - SdVidGenHandlers.cpp/.hpp: remove flf2vid from valid mode set and comments - AddonJs.hpp: remove endImageBuffer parsing - SdCtxHandlers.hpp: remove FLF2V references from clipVisionPath comment Supported video modes are now strictly txt2vid and img2vid. 
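As an illustration of what "strictly txt2vid and img2vid" could look like in the native layer, here is a small sketch. The function names and the error text are hypothetical; only the two mode strings and the removal of flf2vid come from the commit messages.

```cpp
#include <array>
#include <cassert>
#include <stdexcept>
#include <string>

// Returns true only for the two video modes that remain after flf2vid removal.
inline bool isSupportedVideoMode(const std::string& mode) {
  static const std::array<const char*, 2> kVideoModes = {"txt2vid", "img2vid"};
  for (const char* candidate : kVideoModes) {
    if (mode == candidate) return true;
  }
  return false;
}

// Hypothetical guard: rejects any other mode before generation starts.
inline void requireSupportedVideoMode(const std::string& mode) {
  if (!isSupportedVideoMode(mode)) {
    throw std::invalid_argument("unsupported video mode: " + mode);
  }
}
```

Keeping the allowed set in one place means the JS validation, the C++ handler and the type definitions only have one list to stay in sync with.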
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): Address all critical C1–C7 issues + implement High priority fixes **Critical Issues (C1–C7):** - C1: Thread-local callbacks already implemented (tl_progressCtx, tl_abortModel) - C2: Gate unused preview_mode config (parsed but never wired) - C3: Fix memory leak on generate_image() exception paths using RAII wrappers - C4: Null-check generate_image/video returns, throw StatusError on failure - C5: Implement applyFluxImg2ImgDimDefaults() for FLUX img2img dimension defaults - C6: Harden VideoStableDiffusion (LoRA rejection; end_image/flf2vid deferred) - C7: Harden mapAddonEvent with explicit Uint8Array checks and documentation **High Priority (H1–H12) - Previously completed:** - Shared integer parsing (requireInt, requirePositiveInt, etc.) with overflow guards - Standardized cancellation errors via makeCancelledError() - JS input validation (dimensions, prompts, image coercion) - Overflow checks in image resizing & AVI encoding - Cooperative cancellation in video post-generation - TypeScript .d.ts synchronization **Infrastructure:** - Scaffold local vcpkg overlay port for Wan I2V VAE-tiling patch - Restore portfile.cmake + supporting config files - Pin to stable-diffusion-cpp@00cd2a09 (registry #4) for SD_BACKEND_PREF_AUTO **Files Changed:** C++ handlers, model interface, utilities: integer parsing, error handling, memory safety JavaScript: input validation, FLUX dimension defaults, video params, event mapping TypeScript: type definitions for new exports and corrected runtime behavior vcpkg: local overlay + patch machinery for I2V fix Closes #HIGH-PRIORITY, fixes i2v model loading via patched VAE tiling. 
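The shared integer parsing with overflow guards mentioned above might look something like the sketch below. The names `requireInt` and `requirePositiveInt` appear in the commit message, but the signatures, the choice of a 32-bit target and the exception types are assumptions; the idea shown is that a JS number arrives as a double and must be checked for finiteness, integrality and range before it is narrowed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Assumed signature: validates a JS-style double and narrows it to int32.
inline std::int32_t requireInt(double value, const std::string& name) {
  // Reject NaN, infinities and fractional values before any cast.
  if (!std::isfinite(value) || std::trunc(value) != value) {
    throw std::invalid_argument(name + " must be an integer");
  }
  // Range check in double space, so the cast below cannot overflow.
  if (value < static_cast<double>(std::numeric_limits<std::int32_t>::min()) ||
      value > static_cast<double>(std::numeric_limits<std::int32_t>::max())) {
    throw std::out_of_range(name + " does not fit in a 32-bit integer");
  }
  return static_cast<std::int32_t>(value);
}

// Assumed signature: same checks, plus a strictly-positive requirement.
inline std::int32_t requirePositiveInt(double value, const std::string& name) {
  const std::int32_t parsed = requireInt(value, name);
  if (parsed <= 0) {
    throw std::invalid_argument(name + " must be positive");
  }
  return parsed;
}
```

Doing the range check before the cast matters because converting an out-of-range double to an integer type is undefined behaviour in C++.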
Co-authored-by: Cursor <cursoragent@cursor.com> * Merge origin/main with C1-C7 critical fixes (excluding flf2vid) Co-authored-by: Cursor <cursoragent@cursor.com> * style(diffusion-cpp): clang-format C++ files changed vs main Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): fix unit test failures after flf2vid removal - video.js: add peekImageDims helper; reject off-grid init_image / control_frames dimensions when caller omits explicit width/height; unify control_frames error message to 'must be a non-empty Uint8Array' - test: remove flf2vid-specific tests (29,40,56,58,64-66); update test 63 error-message regex; update test 29 mode list regex Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): fix cpp-tests build failures - overlay portfile: bump stable-diffusion-cpp pin from 00cd2a09 (#4) to 747a1801 (#5) so EsrganUpscaler.cpp's sd_upscaler_device_t and new_upscaler_ctx_with_device resolve; patch still applies cleanly - SdModel.cpp processVideo: revert init_image / control_frames dimension mismatch from resize to throw, matching C++ unit test expectations - test_wan_video.cpp: remove all flf2vid and endImageBytes tests (flf2vid was removed from the C++ layer); update ValidationThrowClearsThreadLocalState to use img2vid instead Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): pass clipVisionPath to addon in ImgStableDiffusion Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): align init_images error messages with integration test expectations Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): fix 10 failing cpp-tests unit tests - Restore diffusionFlashAttn/diffusionConvDirect/vaeConvDirect defaults to true - Restore preview handlers (mode/interval/denoised/noisy) — revert C2 gating - Remove flf2vid from AcceptsTxt2VidImg2VidFlf2Vid test (renamed) - Add zero/negative/fractional/out-of-range validation to parseVaeTileSize Co-authored-by: Cursor <cursoragent@cursor.com> * 
fix(diffusion-cpp): apply FLUX img2img 1024 defaults when prediction is in load config Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): address PR review comments (jpgaribotti, jesusmb1995) - Remove generate:flf2vid npm script (example file was deleted) - Fix img2vid-wan-example.cpp default to GGUF path (not fp8_scaled) - Align Wan I2V spatial constraint to 16 (was 8) in video.js - Throw (not warn) when files.clipVision missing for img2vid - Remove endImageBuffer dead code from addon.js - Scrub stale flf2vid/end_image references from JSDoc and comments Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): update video-validation tests for alignTo=16 (Wan spatial multiple) Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): fix unit test regressions from alignTo=16 and clipVision throw - Add FAKE_CLIP_VISION to makeWanModel defaults so img2vid tests pass the new 'files.clipVision required' guard - Fix test 41: width/height 104 -> 112 (first multiple of 16 > 100) Co-authored-by: Cursor <cursoragent@cursor.com> * chore(diffusion-cpp): scrub all remaining FLF2V/end_image references Remove every comment, JSDoc, test, and CHANGELOG mention of flf2vid, FLF2V, first-last-frame, and end_image across the package. Also removes the end_image validation blocks in video.js and the two corresponding unit tests, since end_image was only ever used by the now-removed flf2vid mode. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(ci): remove stale vcpkg dir before clone on macOS self-hosted runners Self-hosted macOS runners persist the parent directory between runs, so a leftover vcpkg/ from a previous job causes `git clone` to fail with "destination path 'vcpkg' already exists". Add `rm -rf vcpkg` before the clone to ensure a clean state. 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(ci): update setup-vcpkg SHA to include stale-dir rm fix All workflow callers were pinned to 6e8d3c3 (original action commit) which didn't include the rm -rf vcpkg cleanup. Update all 7 callers to 80fdb78 so CI picks up the fix on macOS self-hosted runners. Co-authored-by: Cursor <cursoragent@cursor.com> * revert(ci): remove rm -rf vcpkg patch from setup-vcpkg action Runner-level cleanup to be handled by DevOps. Keeping the SHA bump in workflow callers to stay in sync with the current action commit. Co-authored-by: Cursor <cursoragent@cursor.com> * test(diffusion-cpp): add Wan 2.1 I2V smoke integration test Adds a CI smoke test for img2vid mode alongside the existing txt2vid test in generate-video-wan.test.js. Downloads the I2V 14B Q4_K_M GGUF, shared VAE/T5-XXL, and clip_vision_h models on demand; uses the existing von-neumann-colorized.jpg asset as init_image; runs 2 steps at 480x272 to keep wall-clock under 5 minutes on GPU runners. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): use city96 public repo for Wan I2V GGUF model download bartowski's wan2.1-i2v-14b-480p-GGUF repo requires authentication (401). Switch to city96/Wan2.1-I2V-14B-480P-gguf which is public (gated: false) and is the same source used by the download-model-wan-i2v.sh script. 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): resolve init_image dimension mismatch in I2V video generation - Remove hardcoded 480x272 dimensions from I2V test to prevent mismatch with 512x512 init_image - Infer video dimensions from init_image header when width/height are omitted - Add early JavaScript validation to catch dimension mismatches before C++ execution - Provide helpful error message guiding users to either omit dimensions or pre-scale the image Fixes Windows CI failure: "init_image dimensions 512x512 do not match video dimensions 480x272" Co-authored-by: Cursor <cursoragent@cursor.com> * ci(diffusion-cpp): skip Wan tests on CPU-only runners, enable on GPU darwin-arm64 - Remove blanket darwin skip to allow Wan tests on GPU-enabled darwin-arm64 - Only skip Wan tests on mobile and CPU-only runners (NO_GPU=true) - Fixes darwin-x64 CI timeout by skipping Wan tests on CPU-only macos-15-large - Allows Wan tests to run on GPU-enabled mac-mini-m4 (darwin-arm64) Resolves: darwin-x64 integration test taking 50+ minutes Co-authored-by: Cursor <cursoragent@cursor.com> * ci: add debug logging for Wan test skip behavior - Add workflow step to log NO_GPU and test configuration before tests run - Add console.log in Wan test module to show skip decision - Helps diagnose why darwin-x64 integration tests are taking too long This will show us: - If NO_GPU env var is properly set - Whether Wan tests are actually being skipped or running Co-authored-by: Cursor <cursoragent@cursor.com> * fix: resolve linting quote style error in Wan I2V test Co-authored-by: Cursor <cursoragent@cursor.com> * fix: revert overly strict init_image dimension validation The dimension mismatch check was catching a valid use case where: - caller passes off-grid init_image (e.g. 100x100) - caller explicitly specifies aligned width/height (e.g. 
112x112) - caller handles alignment themselves Removing this check restores the original behavior and allows callers to intentionally provide mismatched dimensions. The C++ layer will catch truly invalid combinations. Fixes failing unit test: "accepts off-grid init_image when caller passes explicit aligned width/height" Co-authored-by: Cursor <cursoragent@cursor.com> * fix: correct workspace cleanup condition for all self-hosted runners Replace restrictive startsWith(matrix.runner, 'qvac-') check with runner.environment != 'github-hosted' to properly apply workspace cleanup to ALL self-hosted runners, including mac-mini-m4-gpu and other runners that don't follow the qvac- naming convention. This ensures self-hosted runners (whether qvac-*, mac-mini-*, or others) get proper workspace cleanup, while github-hosted runners skip it. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: refine workspace cleanup condition to avoid GitHub-hosted ARM runners Use explicit exclusion of standard GitHub runner prefixes (ubuntu-, macos-, windows-) instead of runner.environment check, which may not work reliably with GitHub-hosted ARM runners like ubuntu-24.04-arm and ubuntu-22.04-arm. This ensures: - Self-hosted runners (qvac-*, mac-mini-*, etc.) get cleanup (✓) - GitHub-hosted runners (ubuntu-*, macos-*, windows-*) skip cleanup (✓) - GitHub-hosted ARM runners (ubuntu-*-arm) skip cleanup (✓) Co-authored-by: Cursor <cursoragent@cursor.com> * chore: sync CI/CD workflows from main Pulls latest workflow files from main branch to ensure feature/wan-i2v uses the current CI/CD configurations, including the workspace cleanup fixes for self-hosted macOS runners. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: use correct workspace cleanup condition instead of failed runner.environment The runner.environment != 'github-hosted' condition caused failures on GitHub-hosted ARM runners (ubuntu-*-arm). 
Use explicit prefix exclusion instead: - Skip cleanup for GitHub-provided runners (ubuntu-*, macos-*, windows-*) - Apply cleanup to all self-hosted runners (qvac-*, mac-mini-*, etc.) This is the correct fix that should have been in PR #2359. Co-authored-by: Cursor <cursoragent@cursor.com> * chore: sync workflows with main Pull all workflow files from main to keep feature/wan-i2v workflows identical to main. No custom CI/CD changes on this branch. Co-authored-by: Cursor <cursoragent@cursor.com> * chore: update vcpkg overlay to point to fix/wan-i2v-vae-tiling PR branch Point the stable-diffusion-cpp portfile to the fix/wan-i2v-vae-tiling branch from qvac-ext-stable-diffusion.cpp PR #9 instead of applying the patch overlay. This allows testing the upstream fix before it's merged. Once the PR is merged and published in the qvac registry, this overlay can be removed entirely. GitHub PR: tetherto/qvac-ext-stable-diffusion.cpp#9 Co-authored-by: Cursor <cursoragent@cursor.com> * fix: pin vcpkg overlay to exact commit SHA instead of branch name Using a branch name REF without SHA512 causes vcpkg to fail. Pin to exact commit 793d377 (HEAD of fix/wan-i2v-vae-tiling branch) with the correct SHA512 hash. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: point vcpkg overlay to clean cherry-pick on 2026-03-01 base Previous branch was based off master and included 9 upstream commits that shouldn't be in the PR (CI workflow changes, docs, etc.). New clean branch fix/wan-i2v-vae-tiling-clean is based directly off 2026-03-01 with only the VAE tiling fix cherry-picked. 
PR: tetherto/qvac-ext-stable-diffusion.cpp#10 Co-authored-by: Cursor <cursoragent@cursor.com> * fix: correct SHA512 to use zip hash (vcpkg downloads .zip not .tar.gz) Co-authored-by: Cursor <cursoragent@cursor.com> * chore: remove patch file — fix is baked into the pinned commit The portfile now points directly to the commit that already contains the VAE tiling fix, so the patch file is redundant and has been removed. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: use tar.gz SHA512 — vcpkg downloads .tar.gz not .zip Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): use 256x256 init image for Wan I2V to fit Metal GPU budget The Wan I2V 14B test OOM'd on the Mac mini M4 Metal backend during diffusion compute (kIOGPUCommandBufferCallbackErrorOutOfMemory). The 512x512 init image (inferred as the video resolution) was ~2x the pixels of the original 480x272 config and exceeded the GPU memory budget. Add a pre-resized 256x256 init image asset and point the I2V smoke test at it, shrinking the video latent/activation footprint so the 14B model fits in GPU memory on the Mac mini M4 runner. Co-authored-by: Cursor <cursoragent@cursor.com> * test(diffusion-cpp): skip Wan video tests on macOS/Metal due to GPU OOM The Wan 14B I2V model OOMs the Mac mini M4 Metal GPU during diffusion compute (kIOGPUCommandBufferCallbackErrorOutOfMemory), even after dropping the init image to 256x256. Exclude darwin entirely from the Wan suite; the tests still run on Linux/Windows GPU runners. Co-authored-by: Cursor <cursoragent@cursor.com> * test(diffusion-cpp): remove unused 256x256 init image Wan tests are now skipped on macOS/Metal, so the smaller init image added to work around the Metal GPU OOM is no longer needed. Revert the I2V smoke test back to the original 512x512 init image and delete the resized asset. 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): satisfy clang-tidy identifier-naming in addon clang-tidy readability-identifier-naming flagged six globals introduced by the Wan I2V wiring. Rename to match the package .clang-tidy convention: - global constants -> UPPER_CASE: kMaxSafeJsonInt, kAddonId, kCancelled, kJobCancelledMessage - thread_local globals -> g_ prefix: tl_progressCtx, tl_abortModel Co-authored-by: Cursor <cursoragent@cursor.com> * fix(diffusion-cpp): restore root VideoStableDiffusion export VideoStableDiffusion was dropped from index.js when the Wan 2.1 I2V bindings were ported (ca07e91), leaving require('@qvac/diffusion-cpp').VideoStableDiffusion undefined even though index.d.ts still declares it as a named export. Re-export it from the barrel to realign the runtime export with the type declarations. The subpath entry point (@qvac/diffusion-cpp/video) was unaffected. Co-authored-by: Cursor <cursoragent@cursor.com> * build(diffusion-cpp): consume sd.cpp 2026-03-01#6 from registry, drop overlay PR #10 (Wan 2.1 I2V VAE-tiling fix) is merged into the 2026-03-01 branch of qvac-ext-stable-diffusion.cpp and published to the registry as 2026-03-01#6. Remove the temporary package-local stable-diffusion-cpp vcpkg overlay port and its overlay-ports entry, bump the dependency to #6, and point the registry baseline at the commit that publishes it. Registry bump: tetherto/qvac-registry-vcpkg#175 Co-authored-by: Cursor <cursoragent@cursor.com> * build(diffusion-cpp): repoint vcpkg baseline to merged registry commit Registry PR tetherto/qvac-registry-vcpkg#175 is merged. Update the default-registry baseline from the temporary PR-branch commit to the registry main merge commit (8693af45) that publishes stable-diffusion-cpp 2026-03-01#6. 
Co-authored-by: Cursor <cursoragent@cursor.com>
* Update vcpkg-configuration.json
* Update vcpkg-configuration.json
* Update CHANGELOG.md
* bump version to 0.11.0
* fix(diffusion-cpp): remove broken Wan C++ example
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(diffusion-cpp): address PR review on Wan I2V video bindings
- Standardize video dimensions on multiples of 16 end-to-end: C++ width/height
handlers and video.d.ts now match the JS wrapper.
- requireRange: reject non-finite values (NaN/Inf) before range check.
- Video seed uses requireInt64 (parity with image path); no silent truncation of
fractional/out-of-range seeds.
- Use typed makeCancelledError() at all diffusion cancel sites.
- Docs: clipVision is required for img2vid and throws; preview-callback options
are parsed but not yet wired.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(diffusion-cpp): update unit tests for 16-aligned dims and typed cancel
- SdVidGenHandlers dimension tests now expect multiples of 16 (reject multiples
of 8 that aren't 16-aligned), matching the handler change.
- Cancel-context test expects the typed [ Diffusion :: Cancelled ] code emitted by
makeCancelledError() at all diffusion cancel sites.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
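The 16-alignment and non-finite checks described in the review fixes above can be sketched as follows. This is a minimal TypeScript illustration, not the package's code: `alignUp`, `requireAligned`, and this `requireRange` signature are hypothetical, and the real `requireRange` referenced in the commit message may differ in language and shape.

```typescript
// Hypothetical helpers for illustration only; names and signatures are assumptions.
const WAN_ALIGN = 16; // spatial multiple named in the commit messages

/** Round n up to the next multiple of `align`. */
function alignUp(n: number, align: number = WAN_ALIGN): number {
  return Math.ceil(n / align) * align;
}

/**
 * Reject NaN/Infinity before the range comparison. Without this guard NaN
 * slips through, because both `NaN < min` and `NaN > max` are false.
 */
function requireRange(name: string, value: number, min: number, max: number): number {
  if (!Number.isFinite(value)) {
    throw new RangeError(`${name} must be a finite number`);
  }
  if (value < min || value > max) {
    throw new RangeError(`${name} must be in [${min}, ${max}]`);
  }
  return value;
}

/** Accept only positive integer multiples of `align` (8-aligned but not 16-aligned values fail). */
function requireAligned(name: string, value: number, align: number = WAN_ALIGN): number {
  if (!Number.isInteger(value) || value <= 0 || value % align !== 0) {
    throw new RangeError(`${name} must be a positive multiple of ${align}`);
  }
  return value;
}
```

With these definitions, `alignUp(100)` gives 112, which matches the "104 -> 112" test adjustment mentioned earlier: 104 is a multiple of 8 but not of 16.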
simon-iribarren
added a commit
to simon-iribarren/qvac
that referenced
this pull request
Jun 8, 2026
Lifecycle correctness:
- Spawn lock: steal only when the owner pid is dead (with an mtime fallback for
an unreadable lock), so a legitimate multi-minute cold start no longer loses
its lock after 30s and spawns a duplicate runner/serve (tetherto#1).
- close(): the fetch path now bails out instead of re-resolving once closed, so
a request racing close() can't silently re-add a consumer / spawn a runner (tetherto#3).
- sweepServes: when an orphaned serve's pid is alive but its health check fails,
keep the record instead of dropping it; dropping stranded a live serve with
no registry trace. We only reap once it answers as ours, or drop once its pid
dies (tetherto#4).
- servePort: fold a pinned port into the fleet key so pinned-port callers don't
reuse an auto-allocated serve on a different port, and distinct pins don't
collide (tetherto#5).
- Respawn: expose baseURL/port/pid as getters over live state, updated on every
reconnect, so diagnostics/external clients see the real serve after recovery (tetherto#6).
- retargetUrl now handles Request inputs (not just string/URL) so a respawn stays
transparent if the SDK ever switches input shapes (tetherto#8).
Docs:
- README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend
liveness; document the long-lived-sentinel/wrapper pattern and fix the
misleading "the script doesn't have to stay running" note (tetherto#2).
- Reconcile version wording: README/changelog now describe managed mode as
unreleased (package is 0.1.0); docs-site integration page documents managed
mode + the async overload (tetherto#7).
Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the
runner-dead + serve-alive + health-failing sweep case. Build + suite green
(60 pass / 1 integration skip).
olyasir
added a commit
that referenced
this pull request
Jun 8, 2026
simon-iribarren
added a commit
that referenced
this pull request
Jun 10, 2026
* feat[api]: add managed mode to @qvac/ai-sdk-provider (QVAC-19900)
Add `mode: 'managed'` so the provider can synthesize an ephemeral
qvac.config.json from a model-constant list, spawn and supervise
`qvac serve` on a free port, and tear it down on host exit. External
mode is unchanged and stays synchronous; the managed supervisor is
lazily dynamic-imported so external-mode users pay no startup cost.
@qvac/cli becomes an optional peer dependency.
* fix: resolve @qvac/cli via main entry when its exports block package.json (QVAC-19900)
The published @qvac/cli ships a string `exports` field ("./dist/index.js"),
which makes the `./package.json` subpath non-resolvable
(ERR_PACKAGE_PATH_NOT_EXPORTED). Managed mode relied on resolving
`@qvac/cli/package.json` to locate the bin, so it would fail to find the CLI
on a clean install. Fall back to resolving the package main entry, which for
@qvac/cli is the same file as the `qvac` bin.
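The fallback described above can be sketched roughly as below. `resolveCliEntry` and the `Resolver` type are hypothetical names introduced for illustration; the resolver is injected so the logic can be exercised without `@qvac/cli` installed. The only facts taken from the commit message are the `ERR_PACKAGE_PATH_NOT_EXPORTED` error code and the two resolution targets.

```typescript
// A resolver with the same shape as Node's require.resolve (injected for testability).
type Resolver = (specifier: string) => string;

interface ResolvedCli {
  path: string;
  via: "package.json" | "main";
}

/**
 * Try the package.json subpath first; if the package's `exports` field hides
 * it, fall back to the package main entry. Any other error is rethrown.
 */
function resolveCliEntry(resolve: Resolver, pkg: string = "@qvac/cli"): ResolvedCli {
  try {
    return { path: resolve(`${pkg}/package.json`), via: "package.json" };
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code !== "ERR_PACKAGE_PATH_NOT_EXPORTED") throw err;
    return { path: resolve(pkg), via: "main" };
  }
}
```

In an ES module, a real resolver could be obtained with `createRequire(import.meta.url).resolve` from `node:module`. When `via` is `"package.json"` the caller still has to read the `bin` field; when it is `"main"`, the commit message states that the main entry is the same file as the `qvac` bin.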
* doc: update ai-sdk provider agent setup after queue (QVAC-19900)
* QVAC-19900 feat[api]: per-model config for managed mode
Managed mode `models` now accepts spec objects ({ name, config, preload,
default }) alongside bare constant names, so callers can set per-model serve
options — notably `ctx_size` and `reasoning_budget` — that coding agents like
OpenCode require. The synthesized qvac.config.json carries the config block,
honors explicit `preload`/`default`, and validates names inside spec objects.
Exports the new `QvacManagedModel` type and documents per-model config plus a
managed-mode OpenCode example in the README.
* QVAC-19900 feat[api]: shared idle-reaped managed serve daemon
Rework managed mode from a per-provider supervisor into a shared,
self-cleaning serve daemon so it is robust standalone and usable by any
tool, not just a single session.
- Reuse via a fleet key (model set + per-model config + host) keyed in a
cross-process registry under ~/.qvac/managed-serves/; createQvac attaches
to a matching healthy serve instead of cold-starting a duplicate.
- A detached runner owns the qvac serve child and reaps it once no consumer
process has been alive for serveIdleTimeout (default 5m). Liveness, not
request traffic, is the signal, so it works for tools that hit baseURL
directly (OpenCode/Cline/Aider).
- close() now detaches (deregisters the consumer) instead of killing; a
shared serve survives until its last user is gone.
- Sweep only reaps dead/orphaned serves, never a healthy serve a live
process owns (fixes a second session SIGKILLing a downloading serve).
- Respawn-on-failure: fetch re-resolves and retries once on ECONNREFUSED.
- reuse:false (or a pinned servePort) yields a private serve reaped as soon
as its owner exits.
Refactor into serve-process.ts (spawn/health/stop), registry.ts,
fleet-key.ts, runner.ts; remove supervisor.ts and pid-tracker.ts. Add
reuse and serveIdleTimeout options. Rewrite tests and add reuse/idle-reap
end-to-end coverage; document the shared lifecycle in the README.
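One way to derive a fleet key from the inputs named above (model set, per-model config, host) is to hash a canonical JSON form. The sketch below is an assumption about how such a key could be built, not the contents of fleet-key.ts; `fleetKey`, `canonical`, and the 16-character truncation are illustrative choices.

```typescript
import { createHash } from "node:crypto";

interface ModelSpec {
  name: string;
  config?: Record<string, unknown>;
}

/** Recursively sort object keys so equal configs serialize identically. */
function canonical(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonical);
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(obj).sort()) out[key] = canonical(obj[key]);
    return out;
  }
  return value;
}

/** Hash host + models; model order and config key order do not affect the result. */
function fleetKey(models: ModelSpec[], host: string): string {
  const normalized = models
    .map((m) => ({ name: m.name, config: canonical(m.config ?? {}) }))
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  const payload = JSON.stringify({ host, models: normalized });
  return createHash("sha256").update(payload).digest("hex").slice(0, 16);
}
```

Two callers that ask for the same models with the same config on the same host get the same key and can attach to one serve; changing any config value (for example `ctx_size`) yields a different key. Later commits in this PR fold `serveBinPath` and a pinned port into the key as well, which would amount to extra fields in the hashed payload.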
* QVAC-19900 fix: reject duplicate model names in managed mode
Each managed model maps to a single serve alias keyed by its name, so a
repeated name silently overwrote the earlier entry — and could drop its
`default: true`. Reject duplicates up front with DuplicateManagedModelError
instead of resolving them ambiguously. Addresses PR review feedback.
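A rough sketch of accepting both bare names and spec objects while rejecting repeated names is shown below. The `QvacManagedModel` and `DuplicateManagedModelError` names come from the commit messages, but the type's exact fields, the error's constructor, and the `normalizeModels` helper are assumptions made for illustration.

```typescript
// Assumed shape, based on the fields listed in the commit message.
type QvacManagedModel =
  | string
  | { name: string; config?: Record<string, unknown>; preload?: boolean; default?: boolean };

type ModelSpecObject = Exclude<QvacManagedModel, string>;

// Sketch of the error class; the real constructor signature is not shown in the source.
class DuplicateManagedModelError extends Error {
  constructor(public readonly modelName: string) {
    super(`Duplicate managed model name: ${modelName}`);
    this.name = "DuplicateManagedModelError";
  }
}

/** Expand bare names to spec objects and fail fast on a repeated name. */
function normalizeModels(models: QvacManagedModel[]): ModelSpecObject[] {
  const seen = new Set<string>();
  return models.map((entry) => {
    const spec: ModelSpecObject = typeof entry === "string" ? { name: entry } : entry;
    if (seen.has(spec.name)) throw new DuplicateManagedModelError(spec.name);
    seen.add(spec.name);
    return spec;
  });
}
```

Rejecting up front avoids the silent overwrite described above, where a later entry with the same name could replace an earlier one and drop its `default: true`.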
* QVAC-19900 fix[api]: address managed-mode self-review findings
- Per-instance consumer markers (<pid>.<rand>) so two providers in one
process sharing a fleet key don't deregister each other on close (A).
- Restrict respawn retry to ECONNREFUSED so an in-flight completion is
never blindly replayed on ECONNRESET/EPIPE (C).
- Health-check the recorded baseURL before SIGTERM-ing an orphaned serve,
guarding against killing a recycled pid (D).
- Use dirname() instead of a posix-only regex for ephemeral config cleanup (E).
- Fold serveBinPath into the fleet key so distinct local builds don't share
a serve (G).
- Export managed error classes + QvacManagedErrorCode for instanceof checks (H).
- Reject more than one explicit default: true (I).
- Deregister the consumer if resolveServe throws (F); drop dead
firstConsumerPid runner param (J).
Tests: per-instance markers, health-gated orphan sweep (kills serving
orphan, spares non-serving stranger pid), fleet-key serveBinPath sensitivity,
multiple-default rejection. README updated.
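The retry restriction in item (C) can be illustrated with a small predicate. This sketch assumes the socket error code is found on the thrown error or somewhere along its `cause` chain, which is how Node's built-in `fetch` typically reports connection failures; `connectErrorCode` and `shouldRespawnAndRetry` are hypothetical names.

```typescript
/** Walk the error and its `cause` chain (bounded depth) looking for a string `code`. */
function connectErrorCode(err: unknown): string | undefined {
  let current: unknown = err;
  for (let depth = 0; depth < 5; depth++) {
    if (current === null || typeof current !== "object") return undefined;
    const code = (current as { code?: unknown }).code;
    if (typeof code === "string") return code;
    current = (current as { cause?: unknown }).cause;
  }
  return undefined;
}

/**
 * Only ECONNREFUSED is safe to retry: the connection was never established,
 * so the serve cannot have started processing the request. ECONNRESET/EPIPE
 * can occur mid-request, and replaying could duplicate a completion.
 */
function shouldRespawnAndRetry(err: unknown): boolean {
  return connectErrorCode(err) === "ECONNREFUSED";
}
```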
* QVAC-19900 fix[api]: address managed-mode lifecycle review (round 2)
Lifecycle correctness:
- Spawn lock: steal only when the owner pid is dead (with an mtime fallback for
an unreadable lock), so a legitimate multi-minute cold start no longer loses
its lock after 30s and spawns a duplicate runner/serve (#1).
- close(): the fetch path now bails out instead of re-resolving once closed, so
a request racing close() can't silently re-add a consumer / spawn a runner (#3).
- sweepServes: when an orphaned serve's pid is alive but its health check fails,
keep the record instead of dropping it — dropping stranded a live serve with
no registry trace. We only reap once it answers as ours, or drop once its pid
dies (#4).
- servePort: fold a pinned port into the fleet key so pinned-port callers don't
reuse an auto-allocated serve on a different port, and distinct pins don't
collide (#5).
- Respawn: expose baseURL/port/pid as getters over live state, updated on every
reconnect, so diagnostics/external clients see the real serve after recovery (#6).
- retargetUrl now handles Request inputs (not just string/URL) so a respawn stays
transparent if the SDK ever switches input shapes (#8).
Docs:
- README + docs-site: direct-baseURL tools (OpenCode/Cline/Aider) don't extend
liveness; document the long-lived-sentinel/wrapper pattern and fix the
misleading "the script doesn't have to stay running" note (#2).
- Reconcile version wording: README/changelog now describe managed mode as
unreleased (package is 0.1.0); docs-site integration page documents managed
mode + the async overload (#7).
Tests: spawn-lock steal/keep matrix, fleet-key pinned-port sensitivity, and the
runner-dead + serve-alive + health-failing sweep case. Build + suite green
(60 pass / 1 integration skip).
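The spawn-lock and sweep rules above reduce to two small decision functions. The following is a sketch under assumptions: `LockInfo`, `shouldStealLock`, and `sweepOrphan` are hypothetical names, and the liveness probe is injected so the decision logic can be tested without real processes.

```typescript
interface LockInfo {
  ownerPid?: number; // undefined when the lock file could not be read
  mtimeMs: number;   // last-modified time of the lock file
}

/** Signal 0 probes for existence without delivering a signal. */
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

/** Steal only from a dead owner; fall back to lock age when the owner is unknown. */
function shouldStealLock(
  lock: LockInfo,
  nowMs: number,
  staleMs: number,
  alive: (pid: number) => boolean = isPidAlive,
): boolean {
  if (lock.ownerPid !== undefined) return !alive(lock.ownerPid);
  return nowMs - lock.mtimeMs > staleMs;
}

type SweepAction = "reap" | "keep" | "drop";

/** Orphaned-serve sweep: drop dead pids, reap only serves that answer as ours. */
function sweepOrphan(pidAlive: boolean, healthyAsOurs: boolean): SweepAction {
  if (!pidAlive) return "drop";
  return healthyAsOurs ? "reap" : "keep";
}
```

Keying the steal on owner liveness rather than lock age is what lets a multi-minute cold start hold its lock: a live owner is never pre-empted, however old the lock file is.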
* docs: use canonical qvac.tether.io URL in ai-sdk-provider README
* QVAC-19900 feat[api]: public model catalog + catalog-id aliases in managed mode
Add `models.qvacCatalog`, a public models.dev-style catalog that maps
friendly ids (`qwen3.5-9b`) to the SDK constant the serve loads
(`QWEN3_5_9B_MULTIMODAL_Q4_K_M`), so the id a user picks from models.dev
resolves end-to-end with no translation layer in front of the serve.
Managed mode now accepts catalog ids as model names: the synthesized
serve config keys the alias by the friendly id while `model` resolves to
the underlying SDK constant, so the serve answers `qwen3.5-9b` directly.
Bare SDK constants keep working unchanged. A drift unit test fails CI if
any catalog constant disappears from the generated SDK catalog.
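The alias resolution and drift check described above might look like the sketch below. The single catalog entry is the one quoted in the commit message; the `{ [alias]: { model } }` config shape and the helper names `synthesizeAliases` and `missingConstants` are assumptions for illustration, not the package's actual structures.

```typescript
// Friendly catalog id -> SDK constant (entry taken from the commit message).
const qvacCatalog: Record<string, string> = {
  "qwen3.5-9b": "QWEN3_5_9B_MULTIMODAL_Q4_K_M",
};

interface ServeAlias {
  model: string; // the SDK constant the serve actually loads
}

/** Key each alias by the name the caller used; resolve catalog ids to constants. */
function synthesizeAliases(
  names: string[],
  catalog: Record<string, string> = qvacCatalog,
): Record<string, ServeAlias> {
  const aliases: Record<string, ServeAlias> = {};
  for (const name of names) {
    const isCatalogId = Object.prototype.hasOwnProperty.call(catalog, name);
    aliases[name] = { model: isCatalogId ? catalog[name] : name };
  }
  return aliases;
}

/** Drift check: catalog constants that no longer exist in the SDK's constant set. */
function missingConstants(catalog: Record<string, string>, sdkConstants: Set<string>): string[] {
  return Object.values(catalog).filter((constant) => !sdkConstants.has(constant));
}
```

A unit test asserting that `missingConstants` returns an empty array against the generated SDK catalog would fail as soon as a constant is renamed or removed, which is the drift guard the commit message describes.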
* QVAC-19900 feat[api]: process-group serve teardown + closeOnParentExit
Harden managed-mode lifecycle so a managed serve never leaks its `bare`
inference worker or outlives the process that owns it.
- Process-group teardown: spawn `qvac serve` detached (its own group) and,
when stopServe must escalate past the grace window, SIGKILL the whole
group. A plain SIGKILL of the serve pid never cascades to the grandchild
bare worker, so previously a wedged serve orphaned the worker. The
graceful SIGTERM is still sent to the serve process only, so a healthy
serve orchestrates its own shutdown and releases the global worker lock
(no stale lock left behind); the group SIGKILL is the wedged-path fallback.
- `closeOnParentExit` option: for a daemon-style host whose sole job is to
keep a managed serve alive for a parent process (e.g. an editor/agent
plugin). The provider watches its parent pid and, the moment the parent
exits (on POSIX we are reparented to init, ppid → 1), closes itself —
deregistering the consumer so the runner reaps the serve — and exits.
Without it a hard-killed parent would leave a reparented host alive,
keeping its consumer marker forever so the serve was never reaped.
Tests: a stubborn-grandchild fake serve proves group teardown reaps the
worker; `parentIsGone` unit-tests the parent-watch decision.
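The two mechanisms above can be sketched as follows, assuming a POSIX host. Passing a negative pid to `process.kill` targets the whole process group, which works when the child was spawned with `detached: true` so that it leads its own group. `parentIsGone` is named in the commit message, but its signature here is a guess, and `gracefulStop`, `forceStopGroup`, and the injected `Kill` type are hypothetical.

```typescript
type Kill = (pid: number, signal: string) => void;

// Default implementation delegates to Node; tests can inject a recorder instead.
const nodeKill: Kill = (pid, signal) => {
  process.kill(pid, signal);
};

/** Graceful path: signal only the serve so it can run its own shutdown. */
function gracefulStop(servePid: number, kill: Kill = nodeKill): void {
  kill(servePid, "SIGTERM");
}

/** Wedged path: SIGKILL the whole group so grandchild workers die too. */
function forceStopGroup(servePid: number, kill: Kill = nodeKill): void {
  try {
    kill(-servePid, "SIGKILL"); // negative pid = process group (POSIX)
  } catch (err) {
    // ESRCH: the group is already gone, which is the outcome we wanted.
    if ((err as { code?: string }).code !== "ESRCH") throw err;
  }
}

/**
 * Parent-watch decision: the parent is gone once our ppid differs from the one
 * recorded at startup (on POSIX a reparented process usually sees ppid 1).
 */
function parentIsGone(originalPpid: number, currentPpid: number): boolean {
  return currentPpid !== originalPpid;
}
```

A host using `closeOnParentExit` could record `process.ppid` at startup and poll `parentIsGone(recorded, process.ppid)` on a timer, closing the provider and exiting when it returns true. The polling interval is not specified in the source.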
* QVAC-19900 fix: keep managed serve lifecycle correct under close() race and crash-respawn
- Undo the consumer re-registration when close() wins the race against an
in-flight fetch retry: resolveServe re-adds the marker after close() removed
it, which would keep the shared serve warm until the process exits.
- Preserve live consumer markers when sweepServes reaps a crashed/orphaned
serve, so a respawned runner inherits the still-alive sessions instead of
idle-reaping the fresh serve out from under them.
- docs: bump managed-mode ctx_size examples to 32768 for agent-sized prompts.
* QVAC-19900 fix: rename reresolve result to resolved for clarity in managed fetch
* QVAC-19900 mod: collapse redundant sync/async registry teardown helpers
removeConsumer/removeConsumerSync and removeRecord/removeRecordSync were a
confusing sync/async mirror: the async removeConsumer was only ever called right
after the sync one (a guaranteed no-op), and the removeRecord pair was really two
teardown semantics under near-identical names. Marker/record teardown is a single
unlink/rm, cheap enough to be synchronous everywhere — including process 'exit'
handlers where async can't run — so collapse each pair into one sync function.
No behaviour change; addresses review feedback on #2408.
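The reason a single synchronous helper suffices can be shown with a short sketch. `removeConsumer` is named in the commit message, but this body, `registerExitCleanup`, and `demo` are illustrative assumptions. Node's `'exit'` event only runs synchronous code, so a marker removed there must use the sync filesystem API.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

/** Remove a consumer marker; `force: true` makes a missing file a no-op. */
function removeConsumer(markerPath: string): void {
  rmSync(markerPath, { force: true });
}

/** 'exit' listeners cannot await, so the cleanup has to be synchronous. */
function registerExitCleanup(markerPath: string): void {
  process.once("exit", () => removeConsumer(markerPath));
}

/** Small demonstration: returns whether the marker still exists after removal. */
function demo(): boolean {
  const dir = mkdtempSync(join(tmpdir(), "qvac-marker-"));
  const marker = join(dir, "1234.abcd"); // <pid>.<rand> style name, illustrative
  writeFileSync(marker, "");
  removeConsumer(marker);
  removeConsumer(marker); // second call is a harmless no-op
  const stillThere = existsSync(marker);
  rmSync(dir, { recursive: true, force: true });
  return stillThere;
}
```

Because the operation is idempotent, calling it from both `close()` and an exit handler is safe, which is what made the separate async variant redundant.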
* QVAC-19900 mod: trim verbose comments in managed registry
Tighten the sync-rationale comments on removeRecord/removeConsumer and drop a
stale, broken leftover comment above ensureDirSync. Keeps the non-obvious intent
(why sync, preserveConsumers semantics) without the narration.
* QVAC-19900 mod: drop unused DEFAULT_SERVE_BIN and ephemeralConfigName
Both were dead: DEFAULT_SERVE_BIN was never imported (serve-process spawns the
resolved CLI path verbatim) and ephemeralConfigName was an unused helper
(writeEphemeralConfig uses a fixed name inside an mkdtemp dir). Removing the
latter also drops the now-unused randomBytes import.