qvac-fabric: bump to 8189.0.1 #138
Merged
Conversation
Updates the qvac-fabric port to upstream tag v8189.0.1 (commit 739b309ae, the SSH-resigned tip of temp-8189). Refreshes the source tarball SHA512, resets port-version to 0, and bumps the baseline plus per-version manifest. The portfile bug-fix for BUILD_LLAMA=OFF introduced in #136 (port-version 1) is preserved.

What's new in v8189.0.1 over v8189.0.0:
- Inject enable_thinking into the Jinja template context so models like Qwen 3.5 and Gemma 4 actually emit reasoning content (tetherto/qvac-fabric-llm.cpp#128).
- Add GGML_OP_DELTA_NET_AR Vulkan compute shader + dispatch path (#129) so Vulkan no longer falls back to CPU per token on Qwen 3.5 / DeltaNet decode.
- vulkan: Force f32 src1 through the strided cpy path to fix an embedding-model crash (#130).

Co-authored-by: Cursor <cursoragent@cursor.com>
jpgaribotti approved these changes May 8, 2026
gianni-cor added a commit to zoq/qvac-fork that referenced this pull request May 8, 2026
The 8189.0.1 port (tetherto/qvac-registry-vcpkg#138) drops port-version 1's BUILD_LLAMA=OFF portfile workaround and ships the new fabric tip 739b309ae.

Notable upstream fixes pulled in:
- Inject enable_thinking into the Jinja template context so Qwen 3.5 and Gemma 4 actually emit <think> reasoning content.
- GGML_OP_DELTA_NET_AR Vulkan compute shader (Qwen 3.5 / DeltaNet decode no longer falls back to CPU per token).
- vulkan: f32 src1 strided cpy fix (embedding-model crash).

Validated on macOS-arm64: vcpkg resolves qvac-fabric[core,gpu-backends,llama]:arm64-osx@8189.0.1 and the addon builds end-to-end.

Co-authored-by: Cursor <cursoragent@cursor.com>
gianni-cor added a commit to tetherto/qvac that referenced this pull request May 11, 2026
* Restore Qwen3.5 / Gemma4 / PaddleOCR-VL tests + Mali coopmat fix

Stack of three logical changes squashed into one commit so the test ports stay self-consistent with the build/runtime they depend on:

1. qvac-fabric overlay ports (LLM + embed + nmtcpp):
   - Pin to fabric 78db8bf4 (PR tetherto/qvac-fabric-llm.cpp#121 HEAD, includes c79a8851 "ggml-vulkan: Fix NaN outputs on Mali").
   - Drop -DGGML_VULKAN_DISABLE_COOPMAT*=ON for Android so coopmat shaders are compiled in. With coopmat off, runtime device->coopmat_support is false and the Mali fix's ARM-gated branches were skipped, leaving Qwen3-Q8_0 finetuning NaN on Pixel 9 Pro Mali.
   - Wire up overlay-ports in each package's vcpkg-configuration.json.
   - Add find_package(OpenSSL) before find_package(llama) in the LLM CMakeLists so llama-targets.cmake's transitive OpenSSL::SSL reference (via cpp-httplib) resolves on local builds.
2. utils.js downloadFile redirect race:
   - Track a handedOff flag set when the redirect branch hands off dest to a recursive call. All cleanup paths now skip fs.unlink once ownership is transferred, so a late error from the outer writestream can't delete the freshly-downloaded file (Pixel ENOENT after "successful" mmproj download).
3. Three new integration tests + their mobile harness wiring:
   - qwen3-5.test.js: basic / multi-turn / tool-calling
   - gemma4.test.js: text / multi-turn / image (forced to CPU on darwin + mobile because gemma4v projector SIGSEGVs on Metal and Adreno OpenCL) / tool-calling
   - ocr-paddle.test.js: OCR; mobile maxTokens capped to 768
   - Ported to the new addon API (files: { model: [absPath], projectionModel?: absPath }, config: …).
   - Added matching unit test test_text_llm_context_qwen3.cpp.
   - integration.auto.cjs registers runQwen35Test, runGemma4Test, runOcrPaddleTest dispatchers.
   - test-groups.json: iOS heavy4 cluster (Gemma4+OcrLighton+OcrPaddle), iOS lightB adds Qwen35, Android groupB has Qwen35 first then Gemma4 / OcrPaddle.
- Workflow: Android GroupB Device Farm jobTimeout 60→90 min. * API port + Gemma4 tool-call fix. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Wire addon/src/patches ahead of the vcpkg include path to pick up the LlamacppUtils.hpp ptr-API override. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * API port + Gemma4 tool-call fix. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Split iOS heavy4 into three single-test specs (heavy4 = OcrLighton, new heavy7 = Gemma4, new heavy8 = OcrPaddle) and schedule them as separate Device Farm runs to avoid memory pressure. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Drop LlamacppUtils.hpp patch override; bump addon-cpp to 1.1.7 The LlamacppUtils.hpp common_init_result_ptr API now ships in qvac-lib-inference-addon-cpp 1.1.7 (PR #1887), so the local addon/src/patches/qvac-lib-inference-addon-cpp/LlamacppUtils.hpp shim is no longer needed in the embed and llm addons. - Delete the patch headers in embed and llm. - Drop the BEFORE PRIVATE addon/src/patches include path from the embed/llm production and unit-test CMakeLists. - Bump qvac-lib-inference-addon-cpp version>= to 1.1.7 in the embed, llm, and nmtcpp vcpkg.json files so they pick up the upstream ptr-API header from the registry. The OpenSSL find_package() addition stays — it's an unrelated local-build fix. Co-authored-by: Cursor <cursoragent@cursor.com> * Cap ocr-lighton predict to 1800 (desktop) / 768 (mobile) so the LightOnOCR response can't overrun ctx_size=4096. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Rewrite sliding-context test to use the post-GGML_PAD effective n_ctx (512) and retune n_predict / n_discarded so all 8 cases match the current ContextSlider semantics. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Allow embed batching test to override ctx_size and pin gte-large to batch_size=512 / ctx_size=384 to probe the Mali Vulkan first-submit ErrorDeviceLost. 
Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Fix reverse-prompt scenario by removing comma, space, listing both 'pizza' and 'Pizza', and lowercasing the assertion comparisons to match 'Pizza' and 'pizza'. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Sanitize media Uint8Array prompts before logging to avoid V8 Zone OOM. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Use Qwen3 family chat-template to fix Qwen3.5-0.8B gibberish output on macOS Metal. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Update portfiles to point to the latest fabric. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Revert "Allow embed batching test to override ctx_size and pin gte-large to batch_size=512 / ctx_size=384 to probe the Mali Vulkan first-submit ErrorDeviceLost." This reverts commit 1408896. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Raise AfriqueGemma cancel maxWait to 60s, and apply the use_jinja gate-drop so Qwen3-family models always pick the fixed jinja template. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Drop the retired AfriqueGemma integration tests. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Update portfiles to point to the latest head. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Update portfiles to point to the latest head. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Drop qwen35 from the Qwen3-template detection and the supported-finetune-architecture list since neither path is actually validated for Qwen3.5. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Update portfiles to point to the latest head. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Enable coopmat. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Drop the Qwen3 use_jinja override pairing now that qwen35 is no longer treated as Qwen3-family. 
Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Use only general.architecture for Qwen3 detection so Qwen3.5 stops getting the Qwen3 chat-template via the model-name substring fallback. Drop modelNameLooksLikeQwen3 / getModelName and the modelName parameter from supportsToolsCompactForModelMetadata and selectToolsCompactMarkerForModelMetadata. The substring match on general.name treated "Qwen3.5-..." as Qwen3 and overrode the model's embedded tokenizer.chat_template, contradicting the recent decision to keep qwen35 out of the Qwen3 family. Update the LlamaModel call site and unit tests; add explicit qwen35/nullopt negative cases. Co-authored-by: Cursor <cursoragent@cursor.com> * Accept HuggingFace function-call XML in extractToolCalls so the Qwen3.5 tool-calling integration test parses the model's native <tool_call><function=...><parameter=...>...</parameter></function></tool_call> envelope produced by its embedded chat template, in addition to the Qwen3-style JSON envelope. Co-authored-by: Cursor <cursoragent@cursor.com> * Bump n_predict in the Qwen3.5 basic and multi-turn integration tests so the embedded chat-template's reasoning block has room to finish before the answer on slower CI backends. Co-authored-by: Cursor <cursoragent@cursor.com> * Enable coopmat and point to the latest fabric. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Route Qwen3.5 inference and all finetuning on Mali to CPU, disable Vulkan coopmat at build time, halve mobile finetune workload to account for CPU training. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Point to the latest fabric version. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Force Bert to the CPU on Mali. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Run finetuning on Mali GPU. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Run Qwen 3.5 on Mali GPU. 
Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Point to the latest fabric version and enable coopmat path. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * vcpkg: drop per-package qvac-fabric overlays Removes the qvac-fabric overlay-ports infrastructure from the LLM, Embed, and NMT manifests. The default-registry baseline is left untouched, so vcpkg now resolves qvac-fabric directly from the registry at the existing baseline (7248.2.3). Bumping to fabric 8189.0.0 will be handled by a separate baseline update; this commit only undoes the overlay-based development setup that was no longer needed. - vcpkg-configuration.json (3x): drop "overlay-ports" entry. - vcpkg/ports/qvac-fabric/ (3x): remove overlay portfile.cmake, vcpkg.json, and android-vulkan-version.cmake. Co-authored-by: Cursor <cursoragent@cursor.com> * vcpkg: bump qvac-fabric version constraint to 8189.0.0 Updates the consumer manifests in the LLM, Embed, and NMT packages to require qvac-fabric >= 8189.0.0. The default-registry baseline is intentionally left untouched. Co-authored-by: Cursor <cursoragent@cursor.com> * llm/embed/nmtcpp: bump versions for qvac-fabric 8189.0.0 - qvac-lib-infer-llamacpp-llm: 0.19.2 -> 0.20.0 (minor) - qvac-lib-infer-llamacpp-embed: 0.15.0 -> 0.16.0 (minor) - qvac-lib-infer-nmtcpp: 2.1.1 -> 3.0.0 (major) The nmtcpp major bump reflects a real behavioural regression: the previous overlay built ggml unconditionally with every GPU backend the platform supported (Vulkan/Metal/OpenCL); switching to the upstream registry port with the existing "default-features": false in nmtcpp's vcpkg.json now disables the new "gpu-backends" feature, so out-of-the-box ggml exposes only the CPU backend. Consumers that rely on GPU-accelerated nmt inference must add '"features": ["gpu-backends"]' to the qvac-fabric block of their nmtcpp build manifest. CHANGELOG entries added in all three packages. 
Co-authored-by: Cursor <cursoragent@cursor.com>

* nmtcpp: opt into qvac-fabric gpu-backends feature; downgrade bump to 2.2.0

The previous commit (3.0.0) flagged a breaking change: switching from the always-on overlay to the registry port with default-features:false disabled GPU backends in ggml. Adding "features": ["gpu-backends"] to nmtcpp's qvac-fabric dep restores the previous Vulkan/Metal/OpenCL behaviour, so the bump is now a non-breaking minor (2.2.0) and the BREAKING note in the changelog is replaced with a plain Changed entry.

Co-authored-by: Cursor <cursoragent@cursor.com>

* nmtcpp: re-bump to 3.0.0 (major)

Restores the major version bump for nmtcpp. The new fabric port schema (features split between gpu-backends/llama) and the move from a vendored overlay to the upstream registry are large enough downstream changes that consumers should treat this as a major release, even though runtime behaviour is preserved by opting into "gpu-backends".

Co-authored-by: Cursor <cursoragent@cursor.com>

* vcpkg: pin qvac-fabric to >=8189.0.0#1

The 8189.0.0 (port-version 0) qvac-fabric port shipped a configure-time bug for consumers without the "llama" feature (i.e. nmtcpp): -DLLAMA_MTMD=ON was passed unconditionally, which transitively enables LLAMA_BUILD_COMMON, which makes upstream call license_generate(common) -- but BUILD_LLAMA=OFF skips defining the 'common' target, so the cmake configure aborts.

The fix landed in tetherto/qvac-registry-vcpkg#136 as qvac-fabric port-version 1. Bumping the consumer constraint from "version>=": "8189.0.0" to "version>=": "8189.0.0#1" forces vcpkg to pick the fixed port-version (otherwise it picks the lowest satisfying version, which is the broken #0).

Validated: nmtcpp arm64-android cross-build now configures and builds end-to-end against the upstream registry, no overlay needed.
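The effect of the `#1` suffix can be sketched in a few lines of JavaScript. This is only an illustration of the "lowest satisfying version" behaviour as the commit message describes it; `parseVersion`, `compareVersions`, and `pickLowestSatisfying` are hypothetical helpers, not vcpkg APIs, and the candidate list is made up for the example.

```javascript
// Hypothetical helper: split "8189.0.0#1" into dotted version parts plus a port-version.
function parseVersion(text) {
  const [version, port = '0'] = text.split('#');
  return { parts: version.split('.').map(Number), port: Number(port) };
}

// Compare dotted parts first, then fall back to the port-version.
function compareVersions(a, b) {
  const va = parseVersion(a);
  const vb = parseVersion(b);
  const len = Math.max(va.parts.length, vb.parts.length);
  for (let i = 0; i < len; i++) {
    const diff = (va.parts[i] ?? 0) - (vb.parts[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return va.port - vb.port;
}

// Assumption taken from the commit message: the lowest version that
// satisfies the "version>=" constraint is the one that gets picked.
function pickLowestSatisfying(available, minimum) {
  return available
    .filter((v) => compareVersions(v, minimum) >= 0)
    .sort(compareVersions)[0];
}

const available = ['8189.0.0', '8189.0.0#1']; // illustrative candidates only
console.log(pickLowestSatisfying(available, '8189.0.0'));   // 8189.0.0 (port-version 0)
console.log(pickLowestSatisfying(available, '8189.0.0#1')); // 8189.0.0#1
```

Under that assumption, only the constraint with the explicit port-version excludes the port-version 0 entry.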
Co-authored-by: Cursor <cursoragent@cursor.com> * docs: drop overlay-removal note from changelogs Removes the changelog bullet describing the deletion of the per-package qvac-fabric vcpkg overlay. The overlay teardown is mechanical packaging plumbing rather than a user-facing change worth documenting. Co-authored-by: Cursor <cursoragent@cursor.com> * test/llm: restore AfriqueGemma integration tests (desktop-only) Reverts e257a19's deletion of the afriquegemma-edge-cases and afriquegemma-translation integration tests, and adds a 'desktopOnly' opt-out so they're skipped on mobile without breaking the per-test group coverage invariant. - packages/qvac-lib-infer-llamacpp-llm/test/integration/afriquegemma-edge-cases.test.js: restored. - packages/qvac-lib-infer-llamacpp-llm/test/integration/afriquegemma-translation.test.js: restored. - test/mobile/test-groups.json: new top-level "desktopOnly" array listing runAfriquegemmaEdgeCasesTest and runAfriquegemmaTranslationTest. - scripts/generate-mobile-integration-tests.js: validateGroups now reads the desktopOnly list; entries are still emitted into integration.auto.cjs (so validate-mobile-tests stays happy) but excluded from the per-platform "missing" check, so the mobile runners never invoke them. - test/mobile/integration.auto.cjs: regenerated by `npm run test:mobile:generate`. - CHANGELOG note in qvac-lib-infer-llamacpp-llm under Tests. Validated via `npm run test:mobile:generate` + `npm run test:mobile:validate`. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(llm): drop AfriqueGemma test restoration changelog note Co-authored-by: Cursor <cursoragent@cursor.com> * test/llm: switch AfriqueGemma desktop-only skip to in-test pattern Per review: don't change generate-mobile-integration-tests.js. 
Use the same skip:isMobile pattern other tests already use (config-parameters, tool-calling, image), and keep the AfriqueGemma functions in the iOS lightA / Android groupA groups so the existing per-test coverage invariant stays intact. - packages/qvac-lib-infer-llamacpp-llm/scripts/generate-mobile-integration-tests.js: reverted to upstream/main (drops the desktopOnly opt-out plumbing). - test/mobile/test-groups.json: drops 'desktopOnly', adds runAfriquegemmaEdgeCasesTest and runAfriquegemmaTranslationTest back to ios.lightA and android.groupA. - test/integration/afriquegemma-edge-cases.test.js, test/integration/afriquegemma-translation.test.js: add isMobile = platform === 'ios' || platform === 'android', and skip:isMobile to every test() options object (13 total). - test/mobile/integration.auto.cjs: regenerated. Validators both green: npm run test:mobile:generate -> "all tests assigned for every platform" npm run test:mobile:validate -> ok Co-authored-by: Cursor <cursoragent@cursor.com> * test/llm: skip ocr-lighton on mobile Adds skip:isMobile to the single test in ocr-lighton.test.js, matching the AfriqueGemma / config-parameters / tool-calling pattern. isMobile is already defined in this file. The test stays in ios.heavy4 / android.groupB so per-platform group coverage is unaffected; the brittle test itself just skips on mobile. Co-authored-by: Cursor <cursoragent@cursor.com> * ci: revert workflow timeout change for llm mobile integration Drops PR #1874's edit to .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (parameterised jobTimeoutMinutes + 90-minute override for Android GroupB). Workflow is restored to the upstream/main version. 
Co-authored-by: Cursor <cursoragent@cursor.com>

* addons: disable flash-attn by default on the OpenCL backend

Flash attention is not reliably supported by the OpenCL ggml backend (Adreno path), so when the chosen GPU backend ends up being OpenCL the addons now force "flash-attn=off" unless the user explicitly passed flash-attn / flash_attn in their config.

LLM (LlamaModel.cpp / LlamaModel.hpp):
- Add a bool isOpenCl parameter to tuneConfigMap (defaulted to false to keep the existing test_tune_config_map.cpp call sites working).
- Mirror the BitNet-disabling branch with an else-if for OpenCL + notUserSet("flash-attn", "flash_attn").
- At the call site, read chosenBackend.first/second after chooseBackend returns and pass isOpenCl through.

Embed (BertModel.cpp):
- No tuneConfigMap equivalent here. Inject the same logic inline immediately after chooseBackend, before configFilemap is serialised into configVector. Honour user-set "flash-attn"/"flash_attn".

Both packages compile cleanly via bare-make build on macOS-arm64.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fixup! tuneConfigMap: keep ABI for existing 4-arg test callers

CI failure on cpp-tests-darwin-arm64 (PR #1874):

    test/unit/test_tune_config_map.cpp:199:43: fatal error: no viable conversion from 'FtOverrides' to 'bool'

The previous commit inserted bool isOpenCl as the 4th parameter of tuneConfigMap, but several existing tests pass FtOverrides{...} as the 4th positional argument (relying on it being finetuneOverrides). Swap the order so the new isOpenCl parameter comes after the existing finetuneOverrides; both stay defaulted, so all old 3-arg and 4-arg call sites compile unchanged. The production call site in LlamaModel.cpp is updated accordingly.
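The OpenCL flash-attn default from the preceding commit can be sketched as follows. The real logic lives in C++ (tuneConfigMap and BertModel.cpp); this is only a JavaScript rendering of the rule as the commit message states it, and `applyOpenClFlashAttnDefault` is a hypothetical name.

```javascript
// Hypothetical JavaScript rendering of the rule described above; the addon
// implements this in C++ and may differ in detail.
function applyOpenClFlashAttnDefault(config, isOpenCl) {
  // Either spelling counts as an explicit user choice.
  const userSet = 'flash-attn' in config || 'flash_attn' in config;
  if (isOpenCl && !userSet) {
    return { ...config, 'flash-attn': 'off' };
  }
  return config;
}

console.log(applyOpenClFlashAttnDefault({}, true));                    // { 'flash-attn': 'off' }
console.log(applyOpenClFlashAttnDefault({ 'flash-attn': 'on' }, true)); // unchanged
console.log(applyOpenClFlashAttnDefault({ flash_attn: 'on' }, true));   // unchanged
console.log(applyOpenClFlashAttnDefault({}, false));                   // unchanged
```

The four calls correspond loosely to the default, hyphen, underscore, and non-OpenCL scenarios.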
Also adds 4 new TuneConfigMapTest cases covering the OpenCL branch: - OpenCl_NonBitnet_FlashAttnDisabledByDefault - OpenCl_UserSetFlashAttnHyphen_Respected - OpenCl_UserSetFlashAttnUnderscore_Respected - NotOpenCl_NonBitnet_FlashAttnUnchanged All 53 TuneConfigMapTest cases pass locally on macOS-arm64. Co-authored-by: Cursor <cursoragent@cursor.com> * Add QWen 3.5 vision test. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Route vision models with mmproj to CPU on Apple M1. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Route only the projector to CPU on Apple M1. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * run qwen3-5.test.js on IOS GPU * js lint * Recognize Gemma 4 channel reasoning markers in Qwen3ReasoningUtils, and bump gemma4 basic-test n_predict so the answer fits after the thinking preamble. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Wire reasoning-budget config to inputs.enable_thinking so passing reasoning-budget=0 disables the model's <think> reasoning channel, and add coverage for Qwen3, Qwen3.5, and Gemma 4. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * vcpkg: bump qvac-fabric to >=8189.0.1 The 8189.0.1 port (tetherto/qvac-registry-vcpkg#138) drops port-version 1's BUILD_LLAMA=OFF portfile workaround and ships the new fabric tip 739b309ae. Notable upstream fixes pulled in: - Inject enable_thinking into the Jinja template context so Qwen 3.5 and Gemma 4 actually emit <think> reasoning content. - GGML_OP_DELTA_NET_AR Vulkan compute shader (Qwen 3.5 / DeltaNet decode no longer falls back to CPU per token). - vulkan: f32 src1 strided cpy fix (embedding-model crash). Validated on macOS-arm64: vcpkg resolves qvac-fabric[core,gpu-backends,llama]:arm64-osx@8189.0.1 and the addon builds end-to-end. Co-authored-by: Cursor <cursoragent@cursor.com> * Disable the embed addon's BERT-on-Mali CPU override. 
Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Prepend <think> opener to the visible stream when the chat template force-opens the reasoning channel. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Remove the Mali detection plumbing from the embed addon now that BERT runs on Mali GPU. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Bump n_predict and ctx_size in the Qwen3.5 reasoning-budget baseline so the model reliably reaches </think>. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Restore the mobile finetune dataset to 8 samples. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * test: drop AfriqueGemma + MedGemma + Dolphin-MoE tests Per review: cull tests that exercise models we no longer want covered in the LLM/SDK CI matrix. LLM (packages/llm-llamacpp): - Delete integration tests: - test/integration/afriquegemma-edge-cases.test.js - test/integration/afriquegemma-translation.test.js - test/integration/moe.test.js (dolphin-mixtral-2x7b) - Delete docs/afriquegemma-translation.md (only documents the now-removed integration tests). - Strip the medgemma-4b-it variant from: - test/integration/tool-calling.test.js (collapses ALL_TOOL_MODEL_VARIANTS / TOOL_MODEL_VARIANTS to qwen3-1.7b only, drops the now-unused isMobile derived var). - test/integration/finetuning-pause-resume.test.js (drops the medgemma-4b-it-q4_0 entry from FINETUNE_MODELS). - test/unit/test_model_metadata.cpp: drop the gemma3Model_ fixture + the two Gemma3-specific TEST_F cases (DiskSingleFile_Gemma3Arch_*); update the comment block listing exercised arches accordingly. - test/unit/pick-primary-gguf-path.test.js: keep the tensors.txt-first ordering test, but rebase the fixture filenames on Qwen3-4B-Q4_K_M-* so no medgemma names remain in the test corpus. 
- test/mobile/test-groups.json + test/mobile/integration.auto.cjs: drop runAfriquegemmaEdgeCasesTest, runAfriquegemmaTranslationTest, runMoeTest from both ios and android groups; auto.cjs trimmed to match. `validate-mobile-tests.js` is green. SDK (packages/sdk/tests-qvac): - Delete tests/translation-afriquegemma-tests.ts. - tests/test-definitions.ts: drop translationAfriquegemmaTests import + spread. - tests/shared/executors/translation-executor.ts: drop the import, the spread, and the |afriquegemma branch from the dispatch regex. - tests/mobile/consumer.ts + tests/desktop/consumer.ts: drop the AFRICAN_4B_TRANSLATION_Q4_K_M import and the resources.define("afriquegemma", ...) block; mobile also drops the afriquegemma-only SkipExecutor. - tests/shared/resource-lifecycle.ts: rephrase the eviction-comment example to a generic "large translation model" so it no longer references the deleted resource. Not touched: NOTICE/CHANGELOG (auto-generated/historical), sdk/models/registry/* (model constants in the registry are data, not tests), sdk/examples/translation/translation-llm-afriquegemma.ts (consumer-facing example, not a test). * Revert "test: drop AfriqueGemma references from packages/sdk/tests-qvac" Per review: keep packages/sdk/tests-qvac/ untouched. Restore the SDK afriquegemma test file, the test-definitions / translation-executor / desktop+mobile consumer / resource-lifecycle edits to their state prior to commit 36de6ec. Only the LLM-side cull (packages/llm-llamacpp + the deleted afrique / moe / medgemma test files there) from 36de6ec is kept. * Restore packages/llm-llamacpp/docs/afriquegemma-translation.md Per review: keep the AfriqueGemma translation doc. Commit 36de6ec removed it together with the LLM AfriqueGemma test files; restore it unchanged from the merge tip (e29836d). 
* chore: pin qvac-fabric to 8189.0.2 via overlay-ports for testing Adds an overlay port copy of qvac-fabric pointing at v8189.0.2 of tetherto/qvac-fabric-llm.cpp (tetherto/qvac-registry-vcpkg#140) to llm-llamacpp, embed-llamacpp, and translation-nmtcpp, declared via each package's vcpkg-configuration.json. Lets this PR exercise the new fabric build (incl. the Mali coopmat1 BitNet TQ NaN fix) without waiting for the registry baseline bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: pin overlay qvac-fabric to temp-8189 tip f686a1324 Point REF at the latest qvac-fabric-llm.cpp temp-8189 commit (f686a1324e13184d3257cb74c1ba17f9cf8ef575) instead of v8189.0.2 so the overlay tracks branch tip while the branch is still moving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: extend Android LLM mobile test timeouts Allow slower Android Device Farm runs to finish model-heavy LLM tests before the harness marks them as timed out. Co-authored-by: Cursor <cursoragent@cursor.com> * vcpkg: drop qvac-fabric overlay-ports, bump version>= to 8189.0.2 tetherto/qvac-registry-vcpkg#140 publishes qvac-fabric@8189.0.2 in the default registry, so the temporary per-package overlay we used while the new fabric build was still being shaken out is no longer necessary. For llm-llamacpp, embed-llamacpp, and translation-nmtcpp: - Delete `packages/<pkg>/vcpkg/ports/qvac-fabric/` (portfile.cmake, vcpkg.json, android-vulkan-version.cmake) — the overlay copy. - Drop the `overlay-ports` entry from each package's vcpkg-configuration.json. The `default-registry` baseline is left untouched intentionally; the `version>=` constraints below are what forces vcpkg to resolve to the new fabric revision against the unchanged baseline. - Bump the `qvac-fabric` `version>=` pin from `8189.0.1` -> `8189.0.2` in each package's vcpkg.json. 
Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(llm): drop dead sawMali plumbing from BackendSelection

`sawMali` was threaded through `emplaceIfValidDevice` / `tryEmplaceDevice` / `chooseBackend` but never read by any caller -- leftover from the earlier "Force BERT/Qwen3.5 to CPU on Mali" iterations. The embed-side cleanup already landed in 2ac5de0 ("Remove the Mali detection plumbing from the embed addon now that BERT runs on Mali GPU."); this finishes the symmetric removal on the LLM side. `sawAppleM1` plumbing is preserved unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): explain why MtmdLlmContext skips inside_reasoning flip

TextLlmContext flips reasoningState_.inside_reasoning = true alongside the forced "<think>\n" opener; MtmdLlmContext doesn't because it doesn't carry a reasoningState_ today. Add an inline note so the asymmetry isn't read as a bug, and point at the symmetric site to update if reasoning-aware EOS replacement is later added on the multimodal path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(llm): narrow tool-call args quoter to leading bare key only

The previous post-generation regex (`([{,])(\s*)([A-Za-z_]…)(\s*):` -> quote the ident) was too broad: it also matched `, ident:` substrings sitting inside JSON string values, so a tool call with a free-form string argument like `{"query":"phase one, step: validate"}` came out corrupted as `{"query":"phase one, "step": validate"}`, which then failed JSON.parse on the consumer side.

In practice the rewrite is only needed for one upstream quirk: the Gemma 4 parser's `gemma4_args_to_json` (common/chat-parser.cpp) uses an `at_key_start()` helper that peeks backwards in the output buffer for a `{`/`,` -- so the very first top-level key is left bare while every nested or post-comma key is already quoted. All other tool dialects reach us via `json::dump()` upstream and already start with a quoted key.
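The corruption described above is easy to reproduce in JavaScript, alongside a version anchored at the opening brace. This is a sketch, not the addon's C++ code: the commit message elides the exact identifier character class, so `IDENT` below is an assumption, and `quoteKeysBroad` / `quoteLeadingKey` are hypothetical names.

```javascript
// Assumed identifier pattern; the commit message elides the exact character class.
const IDENT = '[A-Za-z_][A-Za-z0-9_]*';

// Broad form: quotes any identifier that follows "{" or "," and precedes ":".
const broad = new RegExp('([{,])(\\s*)(' + IDENT + ')(\\s*):', 'g');
// Anchored form: only the first key directly after the opening brace.
const narrow = new RegExp('^\\{(\\s*)(' + IDENT + ')\\s*:');

function quoteKeysBroad(args) {
  return args.replace(broad, '$1$2"$3"$4:');
}

function quoteLeadingKey(args) {
  return args.replace(narrow, '{$1"$2":');
}

function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Leading key bare, string value containing ", step:".
const raw = '{query:"phase one, step: validate"}';

console.log(quoteKeysBroad(raw));                // {"query":"phase one, "step": validate"}
console.log(isValidJson(quoteKeysBroad(raw)));   // false
console.log(quoteLeadingKey(raw));               // {"query":"phase one, step: validate"}
console.log(isValidJson(quoteLeadingKey(raw)));  // true
```

Because the anchored pattern can only match at position 0, text inside a string value is never rewritten.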
Replace the broad regex with one anchored at `^\{(\s*)<ident>\s*:`, which fixes exactly that single leading-bare-key case and cannot match anywhere inside a JSON string value.

Verified end-to-end on linux-x64 against gemma-4-E2B-it-Q8_0 (CPU):
- Adversarial prompt forcing `phase one, step: validate` as a tool arg string: baseline produced invalid JSON `{"query":"phase one, "step": validate"}` (parse fail at pos 55); this fix yields `{"query":"phase one, step: validate"}` and the test passes 7/7 assertions.
- Existing simple-args happy path (`get_weather` with city/unit) still passes 5/5.

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop synthetic <tool_call>{json}</tool_call> post-processing

Each model now streams only its own native tool-call dialect:
- Qwen3 / Hermes: <tool_call>{json}</tool_call> (already canonical)
- Qwen3.5: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
- Gemma 4: <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|>
- Mistral, DeepSeek-R1, Functionary, GPT-OSS, etc. emit their own markers.

The previous PR added a post-generation common_chat_parse pass that appended a uniform <tool_call>{json}</tool_call> envelope for every detected call. That duplicated tokens for Hermes-shape models (the envelope is already in the native stream) and inflated Gemma 4 output by ~14% with two synthetic copies per call. The leading-bare-key handling for Gemma 4's tc.arguments was also a constant source of sharp edges (broad regex corrupted string values containing ", ident:"; narrow anchored regex still required follow-up). Per-dialect parsing belongs at the SDK consumer layer, not in the addon.

Removed:
- Post-generation block in LlamaModel::processPromptImpl (synthesizer).
- needsOutputCapture widening to include !resolved.tools.empty().
- LlmContext::getLastChatFormat() virtual.
- lastChatFormat_ members + overrides in TextLlmContext, MtmdLlmContext.
- common_chat_format* outFormat parameter from getPrompt().
- <regex> include in LlamaModel.cpp (no remaining users).

Kept:
- outThinkingForcedOpen mechanism (independent reasoning-channel feature).
- toolsCompact_ controller and KV-cache trim logic.
- All other PR work.

Validated on linux-x64/CPU after incremental rebuild:
- Gemma 4 (gemma-4-E2B-it-Q8_0): 6/6 asserts pass with native-dialect parser, no synthetic envelope leaks, output 941 chars (down from ~1100 with synthesizer).
- Qwen3.5 (Qwen3.5-0.8B-Q8_0): 5/5 asserts pass with the existing parseXmlToolCall path, output 394 chars.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): parse Gemma 4 native tool-call dialect in gemma4.test.js

Without the synthetic <tool_call>{json}</tool_call> envelope reverted in the previous commit, Gemma 4 emits its own dialect:

    <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|>

Strings are wrapped in <|"|>...<|"|> instead of "...", keys are bare, and the closing tag is <tool_call|> (trailing pipe, no slash).

extractToolCalls now matches that shape directly and returns { name, argsRaw }. argsContainStringValue() helper checks the args body for a Gemma-4-quoted string literal. Substring-based assertion is sufficient to verify the model called the right tool with the right argument values; full dialect-to-JSON conversion lives upstream in fabric's gemma4_args_to_json and is not the addon test's job.

qwen3-5.test.js was unchanged: Qwen3.5 wraps its <function=name> <parameter=k>v</parameter></function> XML in <tool_call>...</tool_call> natively, so the existing parseXmlToolCall path keeps working.

Validated on linux-x64/CPU against gemma-4-E2B-it-Q8_0: 4/4 tests, 13/13 asserts (3 synthetic-input parser sanity checks + 1 live LLM run).
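A minimal sketch of what matching the Gemma 4 dialect could look like follows. It reuses the helper names from the commit message, but the bodies are a reconstruction for illustration, not the code in gemma4.test.js; the tool-name character class and the sample output string (including the Paris / celsius values) are assumptions.

```javascript
// Matches <|tool_call>call:NAME{...}<tool_call|>. The NAME character class is
// an assumption; the lazy body stops at the "}" right before the closing tag.
const GEMMA4_TOOL_CALL = /<\|tool_call>call:([A-Za-z_][\w.-]*)(\{[\s\S]*?\})<tool_call\|>/g;

// Returns one { name, argsRaw } entry per native tool call found in the text.
function extractToolCalls(text) {
  const calls = [];
  for (const match of text.matchAll(GEMMA4_TOOL_CALL)) {
    calls.push({ name: match[1], argsRaw: match[2] });
  }
  return calls;
}

// Substring check for a string literal wrapped in Gemma 4's <|"|> quotes.
function argsContainStringValue(argsRaw, value) {
  return argsRaw.includes('<|"|>' + value + '<|"|>');
}

// Illustrative model output, not captured from a real run.
const output =
  'Checking.<|tool_call>call:get_weather{city:<|"|>Paris<|"|>,unit:<|"|>celsius<|"|>}<tool_call|>';

const calls = extractToolCalls(output);
console.log(calls.length);                                     // 1
console.log(calls[0].name);                                    // get_weather
console.log(argsContainStringValue(calls[0].argsRaw, 'Paris')); // true
```

The arguments are deliberately left as a raw string, in line with the commit's point that dialect-to-JSON conversion is not the test's job.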
Co-authored-by: Cursor <cursoragent@cursor.com> * revert(llm): drop Apple M1 detection + projector-CPU routing The PR added an Apple-M1-specific code path that detected the chip via the GPU description string and routed `params.mmproj_use_gpu = false` so the vision projector ran on CPU instead of Metal, working around a SIGSEGV in the projector's image-encoding kernel observed on M1 Metal at the time. Re-tested on M1 with the current fabric tip: no SIGSEGV, projector runs fine on Metal end-to-end. The carve-out is no longer needed. Removed: - BackendSelection: `isAppleM1Device()` helper, `bool& sawAppleM1` plumbing through `emplaceIfValidDevice` / `tryEmplaceDevice` / `chooseBackend`, and `bool* outSawAppleM1` parameter on both `chooseBackend` overloads. - LlamaModel: the `bool sawAppleM1 = false` local, the call-site argument, and the `params.mmproj_use_gpu = !sawAppleM1` ternary; mmproj now uses GPU on every desktop platform (Android still hardcoded to false). - test_backend_selection.cpp: `APPLE_M{1,2,3,4}_DESC` constants, `chooseBackendWithM1Flag()` helper, and the four `AppleM*_*` test cases. - gemma4.test.js / qwen3-5.test.js: the comment blocks describing the M1 carve-out; `useCpuForVision` semantics are unchanged (`useCpu || isMobile` on gemma4 and `useCpu` on qwen3-5). Verified on linux-x64/CPU after rebuild: 148/148 C++ unit tests pass (BackendSelectionTest, TuneConfigMapTest, ChatTemplateUtilsTest). Co-authored-by: Cursor <cursoragent@cursor.com> * revert(llm): drop dead Gemma 4 markers from updateQwen3ReasoningBuffer The PR added two extra substring scans for Gemma 4's reasoning channel markers (<|channel>thought open, <channel|> close) to updateQwen3ReasoningBuffer. The intent was to extend the EOS-rescue path (handleQwen3ReasoningEOS rewrites EOS-while-thinking into a closing tag) to Gemma 4. 
That never actually fires though: both the buffer-update call and the EOS-rescue call in TextLlmContext are gated by `if (isQwen3Model_)`, and isQwen3Model_ resolves to `general.architecture == "qwen3"` only. Gemma 4 reports architecture "gemma4", so the gate never opens, the markers never get scanned, and the rescue path never runs for Gemma 4. In live runs Gemma 4 always emits <channel|> cleanly before <eos>, so the rescue isn't needed on the happy path; if Gemma 4 ever truncates mid-thought under context pressure we will need a real dialect-aware rescue (per-arch close-tag token + extended gate) and a follow-up will add that. For this PR we just want the dead code gone so it doesn't mislead future readers about what's actually wired up. Net: -9 lines, file is now identical to upstream main. Co-authored-by: Cursor <cursoragent@cursor.com> * test(llm): switch gemma4 fixtures from unsloth to bartowski The unsloth GGUF pack (huggingface.co/unsloth/gemma-4-E2B-it-GGUF) tags <turn|> as the EOG token in tokenizer.ggml.eos_token_id and leaves <eos> classified as a regular text token. Gemma 4's training-baked behaviour after assistant content is to emit a few <eos> tokens before <turn|>, so with that pack the addon's generation loop -- which terminates on llama_vocab_is_eog -- doesn't stop until <turn|> arrives. We were observing ~9 spurious <eos> tokens trailing every Gemma 4 response, eating into n_predict and KV cache for no gain. bartowski's GGUF (huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF) ships the exact same vocabulary but tags <eos> as EOG (matching the base google/gemma-4-E2B-it tokenizer config). With that pack the addon terminates on the first <eos> -- empirically 0 trailing tokens, ~30 % shorter completions on the same prompt, same dialect output that the native-dialect parser added in 87e6c35 handles unchanged. 
Verified on linux-x64/CPU (qvac-dev-linux-x64) with the same get_weather tool prompt:

  unsloth   Q8_0  : 941 chars, 9 trailing <eos>, EOG = {<turn|>, </s>}
  bartowski Q4_K_M: 676 chars, 0 trailing <eos>, EOG = {<eos>, </s>}

Note: the unsloth metadata bug deserves an upstream issue against the unsloth pack maintainers; this PR's scope is just to stop our tests paying the wasted-tokens tax.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): unblock gemma4 image test on mobile + fix ctx overflow

Three changes to packages/llm-llamacpp/test/integration/gemma4.test.js (image-describe subtest):

1. Drop the mobile CPU-vision carve-out. useCpuForVision used to force `device: 'cpu'` on Android/iOS to dodge Adreno OpenCL SIGABRT and Mali Vulkan instability that bit us with the unsloth mmproj. With bartowski's mmproj (now the fixture in 787c3322) we want CI to actually exercise the device-farm GPU code path for vision -- if that path regresses on a real Adreno or Mali chip we want to find out from CI, not by accident in production. Desktop x64-darwin / linux-arm64 keep CPU fallback because those hosts don't have a working GPU stack here.

2. Bump ctx_size 2048 -> 8192. A single elephant.jpg encodes to ~260 mtmd image tokens. With ctx_size=2048 plus Gemma 4's verbose CoT preamble the generation loop overflowed nPast > n_ctx during sampling (MtmdLlmContext.cpp:452), throwing 'processPromptImpl: context overflow'. 8192 leaves comfortable headroom on every backend.

3. Set reasoning-budget=0 for this test. We literally ask the model "Answer in one word" -- the <|channel>thought ...<channel|> CoT preamble that Gemma 4 wants to emit by default is wasted tokens here, and was the actual cause of the overflow above (CoT was running 8k+ tokens before the model reached the one-word answer and emitted <eos>). Disabling thinking gives us a deterministic ~10-token "Elephant" + <eos> response, which is what the substring-based assertion is testing for anyway.
Verified on linux-x64 (qvac-dev-linux-x64, 2x RTX 5090, Vulkan backend) end-to-end:

  output:     "Elephant"
  asserts:    3/3
  total time: ~2 s

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(llm): drop dead selectToolsCompactMarker(string) overload

selectToolsCompactMarker(const std::string& architecture) had no production callers anywhere -- only its two unit tests (SelectToolsCompactMarkerForQwen3, SelectToolsCompactMarkerForUnsupportedArchitecture) referenced it. Live production code goes through selectToolsCompactMarkerForModelMetadata (LlamaModel::resolveToolsCompactConfig calls that one), which takes std::optional<std::string> and is the only path that ever reaches the "qwen3" -> "<tool_call>" mapping at runtime.

Removed the .cpp definition, the .hpp declaration, and the two unit tests. selectToolsCompactMarkerForModelMetadata is unchanged and still covered by SelectToolsCompactMarkerForModelMetadataUsesArchitecture. ChatTemplateUtilsTest now runs 19/19 tests on linux-x64 (was 21/21).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): drop redundant useCpuForVision alias; vision runs on GPU on mobile

After we removed the per-mobile CPU carve-out for Gemma 4 vision (commit 2843297) and never had one for Qwen3.5 vision, useCpuForVision was just a no-op alias of useCpu used at exactly one call site each. Inline it.

Net effect on the device routing matrix is unchanged but explicit:

  platform/arch             useCpu  device used
  ---------------------------------------------------------------------------
  darwin-x64                true    cpu  (no working GPU here)
  linux-arm64               true    cpu  (no working GPU here)
  darwin-arm64 (M-series)   false   gpu  (Metal)
  linux-x64                 false   gpu  (Vulkan/OpenCL)
  ios                       false   gpu  (Metal -- device farm)
  android                   false   gpu  (Adreno OpenCL / Mali Vulkan -- device farm)

So on iOS / Android the gemma4 and qwen3-5 image-describe subtests run through the actual GPU vision path -- the same path users hit -- and will surface any regression from CI rather than from production.
Co-authored-by: Cursor <cursoragent@cursor.com> * docs(llm): correct thinkingForcedOpen_ comment re: gemma4 Gemma4 does not hit this code path: upstream common_chat_params_init_gemma4 explicitly leaves thinking_forced_open unset because gemma4's reasoning channel is model-emitted. Drop the misleading reference and call out the actual templates that trigger this path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(changelog): refresh PR-1874 entries to reflect actual shipped scope The original CHANGELOG entries for llm-llamacpp 0.20.0, embed-llamacpp 0.16.0, and translation-nmtcpp 3.0.0 were drafted before the synthesizer revert, the M1 / sawMali / dead-code cleanups, the bartowski fixture swap, the native-dialect tool-call parsing, the reasoning-budget knob, the thinkingForcedOpen synthetic-opener, the new integration tests, and the move from 8189.0.0 to 8189.0.2. They now match what the PR actually ships. Compressed every entry to a flat bullet list grouped by Keep-a-Changelog section (Changed / Added / Removed / Fixed / Deprecated / Internals) and bumped the date to 2026-05-10. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(changelog): trim items that round-trip to net-zero in the PR Removed lines that described code that's neither in upstream/main nor in the PR head (so it has no observable impact on consumers): - llm-llamacpp 0.20.0: * "tool-call streaming: each model now streams its native dialect / no re-shaping" -- main already streamed native dialects; the PR-internal synthesizer never shipped, so this is a non-change. * "Dropped sawMali plumbing / Apple-M1 detection / dead Gemma 4 markers in Qwen3ReasoningUtils" -- all three were added and removed inside this PR's commit history; net diff is zero. - embed-llamacpp 0.16.0: * "Dropped Mali-detection plumbing" -- same: added and removed within this PR's history, net diff is zero. Kept genuine net removals against upstream/main: - Qwen3 model-name-based fallback. 
- Dead `selectToolsCompactMarker(std::string)` overload (was pre-existing in main, only ever called from unit tests). Co-authored-by: Cursor <cursoragent@cursor.com> * docs(notice): regenerate NOTICE for embed-llamacpp, llm-llamacpp, translation-nmtcpp Re-ran the notice-generate skill (.cursor/skills/notice-generate) for the three addons whose dependency surfaces changed in this PR: - qvac-fabric bumped from 7248.x to 8189.0.2 -- different transitive C++ license set. - holepunch / hyperswarm libs moved to peerDependencies on main, so the JS attribution lists shrink accordingly. - @qvac/infer-base bumped to 0.4.1. Per-package C++ resolution after the run: embed-llamacpp : opencl/qvac-fabric/qvac-lib-inference-addon-cpp/ qvac-lint-cpp + libc++ (5 deps) llm-llamacpp : the above + picojson + nlohmann-json (7 deps) translation-nmtcpp : bergamot-translator/sentencepiece/ssplit/ qvac-fabric/qvac-lib-inference-addon-cpp/ qvac-lint-cpp + libc++ (7 deps) Net: +206 / -585 lines across the three NOTICE files (mostly transitive JS attribution shrink from the holepunch peerDeps refactor). Co-authored-by: Cursor <cursoragent@cursor.com> * test(llm): make gemma4 reasoning-budget test tolerate model-emitted reasoning Gemma 4's reasoning channel is model-emitted (no template force-open), so the model decides per-prompt whether to engage reasoning. For trivial prompts like "What is the capital of France?" the model can short-circuit and skip the <|channel>thought…<channel|> markers, which made the test flaky on CI. Gate the marker / length assertions on the baseline actually emitting the opening marker; if it didn't, log a comment and skip the dependent checks instead of failing. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * types(llm): declare reasoning_budget in LlamaConfig The C++ config parser already accepts `reasoning_budget` (and the kebab-case `reasoning-budget` alias), but neither was a typed property on `LlamaConfig` — they only typechecked via the catch-all index signature. Add a typed entry with JSDoc so TypeScript consumers get autocomplete and the accepted values (-1 default, 0 disabled). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(llm): allow per-request reasoning_budget override in run() `reasoning_budget` was load-time only. Add it to `GenerationParams` so `model.run(messages, { generationParams: { reasoning_budget: 0 } })` can disable reasoning for a single request without re-loading the model — same shape as `temp` / `top_p` / `seed` overrides. Wiring: - `LlmContext::GenerationParams` gains an optional `reasoning_budget` field and `hasOverrides()` covers it. - `applyGenerationParamsToContext` snapshots / overrides / restores `params.reasoning_budget` alongside `n_predict`. - `AddonJs::runJob` parses `generationParams.reasoning_budget` from JS and rejects values other than `-1` or `0`. - `index.d.ts` exposes `reasoning_budget?: -1 | 0` on `GenerationParams` with a JSDoc note. `tokenizeChat` already reads `params_.reasoning_budget`, so no change is needed in `TextLlmContext` / `MtmdLlmContext` — the temporary override naturally propagates to `inputs.enable_thinking`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(llm): cover per-request reasoning_budget override on Qwen3.5 Validates the new per-request `generationParams.reasoning_budget` override end-to-end in two runs against a single loaded model: 1. `reasoning_budget: 0` override suppresses the `<think>…</think>` reasoning markers for that one request. 2. 
The next `run()` with no override restores the load-time default (reasoning enabled), proving the override is request-scoped and not sticky. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(llm): case-insensitive antiprompt substring matching `checkAntiprompt` now lowercases both the recent output window and each antiprompt before the `find()` so a single `Pizza` entry catches the model's `pizza`, `Pizza`, `PIZZA`, etc. Callers no longer need to list every casing variant. Applied identically in `TextLlmContext` and `MtmdLlmContext`. The token-level early-exit path is unchanged (BPE tokens are case-specific; the substring path is the authoritative check). Also drop the stale comment on the `Reverse prompt stops generation` scenario in `config-parameters.test.js`: it claimed the addon split on `,` without trimming, but `LlamaModel.cpp::split()` already trims and drops empty segments. Replaced with a brief note that documents the new (current) behaviour and simplified the antiprompt list to `'network, Pizza, bitcoin, blockchain'` so the test exercises both the trim and the case-insensitive match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(llm): stress case-insensitive antiprompt with PiZzA mixed-case entry Swap the `Pizza` reverse_prompt entry for `PiZzA`. With case-sensitive matching `PiZzA` would never match the model's `pizza` / `Pizza` output; only case-insensitive comparison fires the stop. Verified locally — the test still completes with output length 5, so the antiprompt trips on the first emitted "Pizza". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(llm): validate reasoning_budget before truncating to int Address @jpgaribotti's review: previously the value was cast to int *before* the `0` / `-1` check, so fractional inputs like `0.5` or `-1.1` would silently truncate to a "valid" 0 / -1 and pass through. 
Validate against the exact double values (both `0` and `-1` are exactly representable in IEEE-754, so `==` comparison is safe) before casting to int when storing in `ov.reasoning_budget`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(llm): use std::from_chars for reasoning_budget load-time parse

Address @jpgaribotti's review: `std::stoi` silently accepts trailing garbage (`"0abc"` → `0`) and throws an uncaught `std::out_of_range` for inputs that overflow `int`. Switch to `std::from_chars`, which fails clean on non-numeric input, overflow (`errc::result_out_of_range`), and trailing garbage (`ptr != end`), then validate against the allowed `-1` / `0` values in the same check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
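The two reasoning_budget review fixes in this commit series (compare the exact double before narrowing to int; parse the load-time string with `std::from_chars` instead of `std::stoi`) can be sketched as follows. This is a minimal illustration under stated assumptions, not the addon's code: `parseReasoningBudget` and `checkReasoningBudget` are hypothetical helper names, and returning `std::optional<int>` for rejected input is an assumption made for the sketch.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical load-time parser: accepts only "-1" (default) or "0" (disabled).
std::optional<int> parseReasoningBudget(std::string_view text) {
  int value = 0;
  const char* begin = text.data();
  const char* end = begin + text.size();
  const auto [ptr, ec] = std::from_chars(begin, end, value);
  // ec reports non-numeric input and overflow; ptr != end reports trailing
  // garbage such as "0abc", which std::stoi would have accepted as 0.
  if (ec != std::errc{} || ptr != end || (value != -1 && value != 0)) {
    return std::nullopt;
  }
  return value;
}

// Hypothetical per-request check: compare the raw double first, then narrow.
// Casting first would turn 0.5 into 0 and -1.1 into -1 and let both through.
std::optional<int> checkReasoningBudget(double raw) {
  if (raw != 0.0 && raw != -1.0) {
    return std::nullopt;  // also rejects NaN, which compares unequal to both
  }
  return static_cast<int>(raw);
}
```

With these helpers, `parseReasoningBudget("0abc")` and `checkReasoningBudget(0.5)` both yield an empty optional, while `"0"`, `"-1"`, `0.0`, and `-1.0` pass through unchanged.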
Proletter pushed a commit to tetherto/qvac that referenced this pull request May 24, 2026
* Restore Qwen3.5 / Gemma4 / PaddleOCR-VL tests + Mali coopmat fix Stack of three logical changes squashed into one commit so the test ports stay self-consistent with the build/runtime they depend on: 1. qvac-fabric overlay ports (LLM + embed + nmtcpp): - Pin to fabric 78db8bf4 (PR tetherto/qvac-fabric-llm.cpp#121 HEAD, includes c79a8851 "ggml-vulkan: Fix NaN outputs on Mali"). - Drop -DGGML_VULKAN_DISABLE_COOPMAT*=ON for Android so coopmat shaders are compiled in. With coopmat off, runtime device->coopmat_support is false and the Mali fix's ARM-gated branches were skipped, leaving Qwen3-Q8_0 finetuning NaN on Pixel 9 Pro Mali. - Wire up overlay-ports in each package's vcpkg-configuration.json. - Add find_package(OpenSSL) before find_package(llama) in the LLM CMakeLists so llama-targets.cmake's transitive OpenSSL::SSL reference (via cpp-httplib) resolves on local builds. 2. utils.js downloadFile redirect race: - Track a handedOff flag set when the redirect branch hands off dest to a recursive call. All cleanup paths now skip fs.unlink once ownership is transferred, so a late error from the outer writestream can't delete the freshly-downloaded file (Pixel ENOENT after "successful" mmproj download). 3. Three new integration tests + their mobile harness wiring: - qwen3-5.test.js — basic / multi-turn / tool-calling - gemma4.test.js — text / multi-turn / image (forced to CPU on darwin + mobile because gemma4v projector SIGSEGVs on Metal and Adreno OpenCL) / tool-calling - ocr-paddle.test.js — OCR; mobile maxTokens capped to 768 - Ported to the new addon API (files: { model: [absPath], projectionModel?: absPath }, config: …). - Added matching unit test test_text_llm_context_qwen3.cpp. - integration.auto.cjs registers runQwen35Test, runGemma4Test, runOcrPaddleTest dispatchers. - test-groups.json: iOS heavy4 cluster (Gemma4+OcrLighton+OcrPaddle), iOS lightB adds Qwen35, Android groupB has Qwen35 first then Gemma4 / OcrPaddle. 
- Workflow: Android GroupB Device Farm jobTimeout 60→90 min. * API port + Gemma4 tool-call fix. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Wire addon/src/patches ahead of the vcpkg include path to pick up the LlamacppUtils.hpp ptr-API override. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * API port + Gemma4 tool-call fix. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Split iOS heavy4 into three single-test specs (heavy4 = OcrLighton, new heavy7 = Gemma4, new heavy8 = OcrPaddle) and schedule them as separate Device Farm runs to avoid memory pressure. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Drop LlamacppUtils.hpp patch override; bump addon-cpp to 1.1.7 The LlamacppUtils.hpp common_init_result_ptr API now ships in qvac-lib-inference-addon-cpp 1.1.7 (PR #1887), so the local addon/src/patches/qvac-lib-inference-addon-cpp/LlamacppUtils.hpp shim is no longer needed in the embed and llm addons. - Delete the patch headers in embed and llm. - Drop the BEFORE PRIVATE addon/src/patches include path from the embed/llm production and unit-test CMakeLists. - Bump qvac-lib-inference-addon-cpp version>= to 1.1.7 in the embed, llm, and nmtcpp vcpkg.json files so they pick up the upstream ptr-API header from the registry. The OpenSSL find_package() addition stays — it's an unrelated local-build fix. Co-authored-by: Cursor <cursoragent@cursor.com> * Cap ocr-lighton predict to 1800 (desktop) / 768 (mobile) so the LightOnOCR response can't overrun ctx_size=4096. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Rewrite sliding-context test to use the post-GGML_PAD effective n_ctx (512) and retune n_predict / n_discarded so all 8 cases match the current ContextSlider semantics. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Allow embed batching test to override ctx_size and pin gte-large to batch_size=512 / ctx_size=384 to probe the Mali Vulkan first-submit ErrorDeviceLost. 
Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Fix reverse-prompt scenario by removing comma, space, listing both 'pizza' and 'Pizza', and lowercasing the assertion comparisons to match 'Pizza' and 'pizza'. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Sanitize media Uint8Array prompts before logging to avoid V8 Zone OOM. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Use Qwen3 family chat-template to fix Qwen3.5-0.8B gibberish output on macOS Metal. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Update portfiles to point to the latest fabric. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Revert "Allow embed batching test to override ctx_size and pin gte-large to batch_size=512 / ctx_size=384 to probe the Mali Vulkan first-submit ErrorDeviceLost." This reverts commit 0e9eca7. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Raise AfriqueGemma cancel maxWait to 60s, and apply the use_jinja gate-drop so Qwen3-family models always pick the fixed jinja template. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Drop the retired AfriqueGemma integration tests. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Update portfiles to point to the latest head. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Update portfiles to point to the latest head. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Drop qwen35 from the Qwen3-template detection and the supported-finetune-architecture list since neither path is actually validated for Qwen3.5. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Update portfiles to point to the latest head. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Enable coopmat. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Drop the Qwen3 use_jinja override pairing now that qwen35 is no longer treated as Qwen3-family. 
Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Use only general.architecture for Qwen3 detection so Qwen3.5 stops getting the Qwen3 chat-template via the model-name substring fallback. Drop modelNameLooksLikeQwen3 / getModelName and the modelName parameter from supportsToolsCompactForModelMetadata and selectToolsCompactMarkerForModelMetadata. The substring match on general.name treated "Qwen3.5-..." as Qwen3 and overrode the model's embedded tokenizer.chat_template, contradicting the recent decision to keep qwen35 out of the Qwen3 family. Update the LlamaModel call site and unit tests; add explicit qwen35/nullopt negative cases. Co-authored-by: Cursor <cursoragent@cursor.com> * Accept HuggingFace function-call XML in extractToolCalls so the Qwen3.5 tool-calling integration test parses the model's native <tool_call><function=...><parameter=...>...</parameter></function></tool_call> envelope produced by its embedded chat template, in addition to the Qwen3-style JSON envelope. Co-authored-by: Cursor <cursoragent@cursor.com> * Bump n_predict in the Qwen3.5 basic and multi-turn integration tests so the embedded chat-template's reasoning block has room to finish before the answer on slower CI backends. Co-authored-by: Cursor <cursoragent@cursor.com> * Enable coopmat and point to the latest fabric. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Route Qwen3.5 inference and all finetuning on Mali to CPU, disable Vulkan coopmat at build time, halve mobile finetune workload to account for CPU training. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Point to the latest fabric version. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Force Bert to the CPU on Mali. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Run finetuning on Mali GPU. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Run Qwen 3.5 on Mali GPU. 
Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * Point to the latest fabric version and enable coopmat path. Signed-off-by: Marcus Edel <marcus.edel@collabora.com> * vcpkg: drop per-package qvac-fabric overlays Removes the qvac-fabric overlay-ports infrastructure from the LLM, Embed, and NMT manifests. The default-registry baseline is left untouched, so vcpkg now resolves qvac-fabric directly from the registry at the existing baseline (7248.2.3). Bumping to fabric 8189.0.0 will be handled by a separate baseline update; this commit only undoes the overlay-based development setup that was no longer needed. - vcpkg-configuration.json (3x): drop "overlay-ports" entry. - vcpkg/ports/qvac-fabric/ (3x): remove overlay portfile.cmake, vcpkg.json, and android-vulkan-version.cmake. Co-authored-by: Cursor <cursoragent@cursor.com> * vcpkg: bump qvac-fabric version constraint to 8189.0.0 Updates the consumer manifests in the LLM, Embed, and NMT packages to require qvac-fabric >= 8189.0.0. The default-registry baseline is intentionally left untouched. Co-authored-by: Cursor <cursoragent@cursor.com> * llm/embed/nmtcpp: bump versions for qvac-fabric 8189.0.0 - qvac-lib-infer-llamacpp-llm: 0.19.2 -> 0.20.0 (minor) - qvac-lib-infer-llamacpp-embed: 0.15.0 -> 0.16.0 (minor) - qvac-lib-infer-nmtcpp: 2.1.1 -> 3.0.0 (major) The nmtcpp major bump reflects a real behavioural regression: the previous overlay built ggml unconditionally with every GPU backend the platform supported (Vulkan/Metal/OpenCL); switching to the upstream registry port with the existing "default-features": false in nmtcpp's vcpkg.json now disables the new "gpu-backends" feature, so out-of-the-box ggml exposes only the CPU backend. Consumers that rely on GPU-accelerated nmt inference must add '"features": ["gpu-backends"]' to the qvac-fabric block of their nmtcpp build manifest. CHANGELOG entries added in all three packages. 
Co-authored-by: Cursor <cursoragent@cursor.com> * nmtcpp: opt into qvac-fabric gpu-backends feature; downgrade bump to 2.2.0 The previous commit (3.0.0) flagged a breaking change: switching from the always-on overlay to the registry port with default-features:false disabled GPU backends in ggml. Adding "features": ["gpu-backends"] to nmtcpp's qvac-fabric dep restores the previous Vulkan/Metal/OpenCL behaviour, so the bump is now a non-breaking minor (2.2.0) and the BREAKING note in the changelog is replaced with a plain Changed entry. Co-authored-by: Cursor <cursoragent@cursor.com> * nmtcpp: re-bump to 3.0.0 (major) Restores the major version bump for nmtcpp. The new fabric port schema (features split between gpu-backends/llama) and the move from a vendored overlay to the upstream registry are large enough downstream changes that consumers should treat this as a major release, even though runtime behaviour is preserved by opting into "gpu-backends". Co-authored-by: Cursor <cursoragent@cursor.com> * vcpkg: pin qvac-fabric to >=8189.0.0#1 The 8189.0.0 (port-version 0) qvac-fabric port shipped a configure-time bug for consumers without the "llama" feature (i.e. nmtcpp): -DLLAMA_MTMD=ON was passed unconditionally, which transitively enables LLAMA_BUILD_COMMON, which makes upstream call license_generate(common) -- but BUILD_LLAMA=OFF skips defining the 'common' target, so the cmake configure aborts. The fix landed in tetherto/qvac-registry-vcpkg#136 as qvac-fabric port-version 1. Bumping the consumer constraint from "version>=": "8189.0.0" to "version>=": "8189.0.0#1" forces vcpkg to pick the fixed port-version (otherwise it picks the lowest satisfying version, which is the broken #0). Validated: nmtcpp arm64-android cross-build now configures and builds end-to-end against the upstream registry, no overlay needed. 
Co-authored-by: Cursor <cursoragent@cursor.com> * docs: drop overlay-removal note from changelogs Removes the changelog bullet describing the deletion of the per-package qvac-fabric vcpkg overlay. The overlay teardown is mechanical packaging plumbing rather than a user-facing change worth documenting. Co-authored-by: Cursor <cursoragent@cursor.com> * test/llm: restore AfriqueGemma integration tests (desktop-only) Reverts e257a19's deletion of the afriquegemma-edge-cases and afriquegemma-translation integration tests, and adds a 'desktopOnly' opt-out so they're skipped on mobile without breaking the per-test group coverage invariant. - packages/qvac-lib-infer-llamacpp-llm/test/integration/afriquegemma-edge-cases.test.js: restored. - packages/qvac-lib-infer-llamacpp-llm/test/integration/afriquegemma-translation.test.js: restored. - test/mobile/test-groups.json: new top-level "desktopOnly" array listing runAfriquegemmaEdgeCasesTest and runAfriquegemmaTranslationTest. - scripts/generate-mobile-integration-tests.js: validateGroups now reads the desktopOnly list; entries are still emitted into integration.auto.cjs (so validate-mobile-tests stays happy) but excluded from the per-platform "missing" check, so the mobile runners never invoke them. - test/mobile/integration.auto.cjs: regenerated by `npm run test:mobile:generate`. - CHANGELOG note in qvac-lib-infer-llamacpp-llm under Tests. Validated via `npm run test:mobile:generate` + `npm run test:mobile:validate`. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(llm): drop AfriqueGemma test restoration changelog note Co-authored-by: Cursor <cursoragent@cursor.com> * test/llm: switch AfriqueGemma desktop-only skip to in-test pattern Per review: don't change generate-mobile-integration-tests.js. 
Use the same skip:isMobile pattern other tests already use (config-parameters, tool-calling, image), and keep the AfriqueGemma functions in the iOS lightA / Android groupA groups so the existing per-test coverage invariant stays intact. - packages/qvac-lib-infer-llamacpp-llm/scripts/generate-mobile-integration-tests.js: reverted to upstream/main (drops the desktopOnly opt-out plumbing). - test/mobile/test-groups.json: drops 'desktopOnly', adds runAfriquegemmaEdgeCasesTest and runAfriquegemmaTranslationTest back to ios.lightA and android.groupA. - test/integration/afriquegemma-edge-cases.test.js, test/integration/afriquegemma-translation.test.js: add isMobile = platform === 'ios' || platform === 'android', and skip:isMobile to every test() options object (13 total). - test/mobile/integration.auto.cjs: regenerated. Validators both green: npm run test:mobile:generate -> "all tests assigned for every platform" npm run test:mobile:validate -> ok Co-authored-by: Cursor <cursoragent@cursor.com> * test/llm: skip ocr-lighton on mobile Adds skip:isMobile to the single test in ocr-lighton.test.js, matching the AfriqueGemma / config-parameters / tool-calling pattern. isMobile is already defined in this file. The test stays in ios.heavy4 / android.groupB so per-platform group coverage is unaffected; the brittle test itself just skips on mobile. Co-authored-by: Cursor <cursoragent@cursor.com> * ci: revert workflow timeout change for llm mobile integration Drops PR #1874's edit to .github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml (parameterised jobTimeoutMinutes + 90-minute override for Android GroupB). Workflow is restored to the upstream/main version. 
Co-authored-by: Cursor <cursoragent@cursor.com> * addons: disable flash-attn by default on the OpenCL backend Flash attention is not reliably supported by the OpenCL ggml backend (Adreno path), so when the chosen GPU backend ends up being OpenCL the addons now force "flash-attn=off" unless the user explicitly passed flash-attn / flash_attn in their config. LLM (LlamaModel.cpp / LlamaModel.hpp): - Add a bool isOpenCl parameter to tuneConfigMap (defaulted to false to keep the existing test_tune_config_map.cpp call sites working). - Mirror the BitNet-disabling branch with an else-if for OpenCL + notUserSet("flash-attn", "flash_attn"). - At the call site, read chosenBackend.first/second after chooseBackend returns and pass isOpenCl through. Embed (BertModel.cpp): - No tuneConfigMap equivalent here. Inject the same logic inline immediately after chooseBackend, before configFilemap is serialised into configVector. Honour user-set "flash-attn"/"flash_attn". Both packages compile cleanly via bare-make build on macOS-arm64. Co-authored-by: Cursor <cursoragent@cursor.com> * fixup! tuneConfigMap: keep ABI for existing 4-arg test callers CI failure on cpp-tests-darwin-arm64 (PR #1874): test/unit/test_tune_config_map.cpp:199:43: fatal error: no viable conversion from 'FtOverrides' to 'bool' The previous commit inserted bool isOpenCl as the 4th parameter of tuneConfigMap, but several existing tests pass FtOverrides{...} as the 4th positional argument (relying on it being finetuneOverrides). Swap the order so the new isOpenCl parameter comes after the existing finetuneOverrides; both stay defaulted, so all old 3-arg and 4-arg call sites compile unchanged. The production call site in LlamaModel.cpp is updated accordingly. 
Also adds 4 new TuneConfigMapTest cases covering the OpenCL branch:
- OpenCl_NonBitnet_FlashAttnDisabledByDefault
- OpenCl_UserSetFlashAttnHyphen_Respected
- OpenCl_UserSetFlashAttnUnderscore_Respected
- NotOpenCl_NonBitnet_FlashAttnUnchanged

All 53 TuneConfigMapTest cases pass locally on macOS-arm64.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add Qwen 3.5 vision test. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Route vision models with mmproj to CPU on Apple M1. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Route only the projector to CPU on Apple M1. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Run qwen3-5.test.js on iOS GPU

* JS lint

* Recognize Gemma 4 channel reasoning markers in Qwen3ReasoningUtils, and bump gemma4 basic-test n_predict so the answer fits after the thinking preamble. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Wire reasoning-budget config to inputs.enable_thinking so passing reasoning-budget=0 disables the model's <think> reasoning channel, and add coverage for Qwen3, Qwen3.5, and Gemma 4. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* vcpkg: bump qvac-fabric to >=8189.0.1

The 8189.0.1 port (tetherto/qvac-registry-vcpkg#138) resets port-version to 0 while keeping port-version 1's BUILD_LLAMA=OFF portfile fix, and ships the new fabric tip 739b309ae. Notable upstream fixes pulled in:
- Inject enable_thinking into the Jinja template context so Qwen 3.5 and Gemma 4 actually emit <think> reasoning content.
- GGML_OP_DELTA_NET_AR Vulkan compute shader (Qwen 3.5 / DeltaNet decode no longer falls back to CPU per token).
- vulkan: f32 src1 strided cpy fix (embedding-model crash).

Validated on macOS-arm64: vcpkg resolves qvac-fabric[core,gpu-backends,llama]:arm64-osx@8189.0.1 and the addon builds end-to-end.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Disable the embed addon's BERT-on-Mali CPU override.
Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Prepend <think> opener to the visible stream when the chat template force-opens the reasoning channel.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Remove the Mali detection plumbing from the embed addon now that BERT runs on Mali GPU.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Bump n_predict and ctx_size in the Qwen3.5 reasoning-budget baseline so the model reliably reaches </think>.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Restore the mobile finetune dataset to 8 samples.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* test: drop AfriqueGemma + MedGemma + Dolphin-MoE tests

Per review: cull tests that exercise models we no longer want covered in the LLM/SDK CI matrix.

LLM (packages/llm-llamacpp):
- Delete integration tests:
  - test/integration/afriquegemma-edge-cases.test.js
  - test/integration/afriquegemma-translation.test.js
  - test/integration/moe.test.js (dolphin-mixtral-2x7b)
- Delete docs/afriquegemma-translation.md (only documents the now-removed integration tests).
- Strip the medgemma-4b-it variant from:
  - test/integration/tool-calling.test.js (collapses ALL_TOOL_MODEL_VARIANTS / TOOL_MODEL_VARIANTS to qwen3-1.7b only, drops the now-unused isMobile derived var).
  - test/integration/finetuning-pause-resume.test.js (drops the medgemma-4b-it-q4_0 entry from FINETUNE_MODELS).
- test/unit/test_model_metadata.cpp: drop the gemma3Model_ fixture + the two Gemma3-specific TEST_F cases (DiskSingleFile_Gemma3Arch_*); update the comment block listing exercised arches accordingly.
- test/unit/pick-primary-gguf-path.test.js: keep the tensors.txt-first ordering test, but rebase the fixture filenames on Qwen3-4B-Q4_K_M-* so no medgemma names remain in the test corpus.
- test/mobile/test-groups.json + test/mobile/integration.auto.cjs: drop runAfriquegemmaEdgeCasesTest, runAfriquegemmaTranslationTest, runMoeTest from both ios and android groups; auto.cjs trimmed to match. `validate-mobile-tests.js` is green.

SDK (packages/sdk/tests-qvac):
- Delete tests/translation-afriquegemma-tests.ts.
- tests/test-definitions.ts: drop translationAfriquegemmaTests import + spread.
- tests/shared/executors/translation-executor.ts: drop the import, the spread, and the |afriquegemma branch from the dispatch regex.
- tests/mobile/consumer.ts + tests/desktop/consumer.ts: drop the AFRICAN_4B_TRANSLATION_Q4_K_M import and the resources.define("afriquegemma", ...) block; mobile also drops the afriquegemma-only SkipExecutor.
- tests/shared/resource-lifecycle.ts: rephrase the eviction-comment example to a generic "large translation model" so it no longer references the deleted resource.

Not touched: NOTICE/CHANGELOG (auto-generated/historical), sdk/models/registry/* (model constants in the registry are data, not tests), sdk/examples/translation/translation-llm-afriquegemma.ts (consumer-facing example, not a test).

* Revert "test: drop AfriqueGemma references from packages/sdk/tests-qvac"

Per review: keep packages/sdk/tests-qvac/ untouched. Restore the SDK afriquegemma test file, the test-definitions / translation-executor / desktop+mobile consumer / resource-lifecycle edits to their state prior to commit 36de6ec. Only the LLM-side cull (packages/llm-llamacpp + the deleted afrique / moe / medgemma test files there) from 36de6ec is kept.

* Restore packages/llm-llamacpp/docs/afriquegemma-translation.md

Per review: keep the AfriqueGemma translation doc. Commit 36de6ec removed it together with the LLM AfriqueGemma test files; restore it unchanged from the merge tip (e29836d).
* chore: pin qvac-fabric to 8189.0.2 via overlay-ports for testing

Adds an overlay port copy of qvac-fabric pointing at v8189.0.2 of tetherto/qvac-fabric-llm.cpp (tetherto/qvac-registry-vcpkg#140) to llm-llamacpp, embed-llamacpp, and translation-nmtcpp, declared via each package's vcpkg-configuration.json. Lets this PR exercise the new fabric build (incl. the Mali coopmat1 BitNet TQ NaN fix) without waiting for the registry baseline bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: pin overlay qvac-fabric to temp-8189 tip f686a1324

Point REF at the latest qvac-fabric-llm.cpp temp-8189 commit (f686a1324e13184d3257cb74c1ba17f9cf8ef575) instead of v8189.0.2 so the overlay tracks branch tip while the branch is still moving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: extend Android LLM mobile test timeouts

Allow slower Android Device Farm runs to finish model-heavy LLM tests before the harness marks them as timed out.

Co-authored-by: Cursor <cursoragent@cursor.com>

* vcpkg: drop qvac-fabric overlay-ports, bump version>= to 8189.0.2

tetherto/qvac-registry-vcpkg#140 publishes qvac-fabric@8189.0.2 in the default registry, so the temporary per-package overlay we used while the new fabric build was still being shaken out is no longer necessary.

For llm-llamacpp, embed-llamacpp, and translation-nmtcpp:
- Delete `packages/<pkg>/vcpkg/ports/qvac-fabric/` (portfile.cmake, vcpkg.json, android-vulkan-version.cmake) -- the overlay copy.
- Drop the `overlay-ports` entry from each package's vcpkg-configuration.json. The `default-registry` baseline is left untouched intentionally; the `version>=` constraints below are what forces vcpkg to resolve to the new fabric revision against the unchanged baseline.
- Bump the `qvac-fabric` `version>=` pin from `8189.0.1` -> `8189.0.2` in each package's vcpkg.json.
Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(llm): drop dead sawMali plumbing from BackendSelection

`sawMali` was threaded through `emplaceIfValidDevice` / `tryEmplaceDevice` / `chooseBackend` but never read by any caller -- leftover from the earlier "Force BERT/Qwen3.5 to CPU on Mali" iterations. The embed-side cleanup already landed in 2ac5de0 ("Remove the Mali detection plumbing from the embed addon now that BERT runs on Mali GPU."); this finishes the symmetric removal on the LLM side. `sawAppleM1` plumbing is preserved unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): explain why MtmdLlmContext skips inside_reasoning flip

TextLlmContext flips reasoningState_.inside_reasoning = true alongside the forced "<think>\n" opener; MtmdLlmContext doesn't because it doesn't carry a reasoningState_ today. Add an inline note so the asymmetry isn't read as a bug, and point at the symmetric site to update if reasoning-aware EOS replacement is later added on the multimodal path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(llm): narrow tool-call args quoter to leading bare key only

The previous post-generation regex (`([{,])(\s*)([A-Za-z_]…)(\s*):` -> quote the ident) was too broad: it also matched `, ident:` substrings sitting inside JSON string values, so a tool call with a free-form string argument like `{"query":"phase one, step: validate"}` came out corrupted as `{"query":"phase one, "step": validate"}`, which then failed JSON.parse on the consumer side.

In practice the rewrite is only needed for one upstream quirk: the Gemma 4 parser's `gemma4_args_to_json` (common/chat-parser.cpp) uses an `at_key_start()` helper that peeks backwards in the output buffer for a `{`/`,` -- so the very first top-level key is left bare while every nested or post-comma key is already quoted. All other tool dialects reach us via `json::dump()` upstream and already start with a quoted key.
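The corruption described above can be reproduced with a small sketch. The identifier pattern `[A-Za-z_][A-Za-z0-9_]*` is an assumption (the commit message elides the original pattern), and both function names are hypothetical rather than the addon's own. The second function shows, for contrast, a rewrite anchored at the start of the string, which can only touch a bare key directly after the opening brace.

```cpp
#include <regex>
#include <string>

// Broad rewrite: quotes any identifier that follows '{' or ',' and precedes ':'.
// The identifier pattern is assumed for illustration.
std::string quoteKeysBroad(const std::string& args) {
    static const std::regex broad(
        R"(([{,])(\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*):)");
    return std::regex_replace(args, broad, R"($1$2"$3"$4:)");
}

// Anchored rewrite: quotes only a bare key right after the leading '{'.
std::string quoteLeadingKey(const std::string& args) {
    static const std::regex anchored(
        R"(^\{(\s*)([A-Za-z_][A-Za-z0-9_]*)\s*:)");
    return std::regex_replace(args, anchored, R"({$1"$2":)");
}
```

With the adversarial argument string from the commit message, the broad version rewrites text inside the string value, while the anchored version leaves it alone because the first key is already quoted.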
Replace the broad regex with one anchored at `^\{(\s*)<ident>\s*:`, which fixes exactly that single leading-bare-key case and cannot match anywhere inside a JSON string value.

Verified end-to-end on linux-x64 against gemma-4-E2B-it-Q8_0 (CPU):
- Adversarial prompt forcing `phase one, step: validate` as a tool arg string: baseline produced invalid JSON `{"query":"phase one, "step": validate"}` (parse fail at pos 55); this fix yields `{"query":"phase one, step: validate"}` and the test passes 7/7 assertions.
- Existing simple-args happy path (`get_weather` with city/unit) still passes 5/5.

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop synthetic <tool_call>{json}</tool_call> post-processing

Each model now streams only its own native tool-call dialect:
- Qwen3 / Hermes: <tool_call>{json}</tool_call> (already canonical)
- Qwen3.5: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
- Gemma 4: <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|>
- Mistral, DeepSeek-R1, Functionary, GPT-OSS, etc. emit their own markers.

The previous PR added a post-generation common_chat_parse pass that appended a uniform <tool_call>{json}</tool_call> envelope for every detected call. That duplicated tokens for Hermes-shape models (the envelope is already in the native stream) and inflated Gemma 4 output by ~14% with two synthetic copies per call. The leading-bare-key handling for Gemma 4's tc.arguments was also a constant source of sharp edges (broad regex corrupted string values containing ", ident:"; narrow anchored regex still required follow-up). Per-dialect parsing belongs at the SDK consumer layer, not in the addon.

Removed:
- Post-generation block in LlamaModel::processPromptImpl (synthesizer).
- needsOutputCapture widening to include !resolved.tools.empty().
- LlmContext::getLastChatFormat() virtual.
- lastChatFormat_ members + overrides in TextLlmContext, MtmdLlmContext.
- common_chat_format* outFormat parameter from getPrompt().
- <regex> include in LlamaModel.cpp (no remaining users).

Kept:
- outThinkingForcedOpen mechanism (independent reasoning-channel feature).
- toolsCompact_ controller and KV-cache trim logic.
- All other PR work.

Validated on linux-x64/CPU after incremental rebuild:
- Gemma 4 (gemma-4-E2B-it-Q8_0): 6/6 asserts pass with native-dialect parser, no synthetic envelope leaks, output 941 chars (down from ~1100 with synthesizer).
- Qwen3.5 (Qwen3.5-0.8B-Q8_0): 5/5 asserts pass with the existing parseXmlToolCall path, output 394 chars.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): parse Gemma 4 native tool-call dialect in gemma4.test.js

Without the synthetic <tool_call>{json}</tool_call> envelope reverted in the previous commit, Gemma 4 emits its own dialect:

    <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|>

Strings are wrapped in <|"|>...<|"|> instead of "...", keys are bare, and the closing tag is <tool_call|> (trailing pipe, no slash).

extractToolCalls now matches that shape directly and returns { name, argsRaw }. argsContainStringValue() helper checks the args body for a Gemma-4-quoted string literal. Substring-based assertion is sufficient to verify the model called the right tool with the right argument values; full dialect-to-JSON conversion lives upstream in fabric's gemma4_args_to_json and is not the addon test's job.

qwen3-5.test.js was unchanged: Qwen3.5 wraps its <function=name><parameter=k>v</parameter></function> XML in <tool_call>...</tool_call> natively, so the existing parseXmlToolCall path keeps working.

Validated on linux-x64/CPU against gemma-4-E2B-it-Q8_0: 4/4 tests, 13/13 asserts (3 synthetic-input parser sanity checks + 1 live LLM run).
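The matching approach described above can be sketched in C++ as well. This is not the JavaScript helper from `gemma4.test.js`, whose source is not shown here; it is a hypothetical equivalent that reuses the names `extractToolCalls` (here `extractGemma4ToolCalls`) and `argsContainStringValue` only to mirror the commit message, and it assumes the dialect shape quoted there.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One parsed call: the tool name plus the raw, unconverted argument body.
struct ToolCall {
    std::string name;
    std::string argsRaw;
};

// Scan for <|tool_call>call:NAME{...}<tool_call|> spans (assumed shape).
std::vector<ToolCall> extractGemma4ToolCalls(const std::string& text) {
    static const std::string open = "<|tool_call>call:";
    static const std::string close = "<tool_call|>";
    std::vector<ToolCall> calls;
    std::size_t pos = 0;
    while ((pos = text.find(open, pos)) != std::string::npos) {
        const std::size_t nameStart = pos + open.size();
        const std::size_t end = text.find(close, nameStart);
        if (end == std::string::npos) break;  // unterminated call: stop scanning
        const std::size_t brace = text.find('{', nameStart);
        const std::size_t lastBrace = text.rfind('}', end);
        pos = end + close.size();
        // Skip spans that lack a '{...}' body between the name and the close tag.
        if (brace == std::string::npos || brace > end) continue;
        if (lastBrace == std::string::npos || lastBrace < brace) continue;
        calls.push_back({text.substr(nameStart, brace - nameStart),
                         text.substr(brace + 1, lastBrace - brace - 1)});
    }
    return calls;
}

// Substring check for a string literal wrapped in Gemma 4's <|"|> quotes.
bool argsContainStringValue(const std::string& argsRaw,
                            const std::string& value) {
    return argsRaw.find("<|\"|>" + value + "<|\"|>") != std::string::npos;
}
```

As in the commit message, this stops at a substring check on the raw argument body and does not attempt conversion to JSON.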
Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop Apple M1 detection + projector-CPU routing

The PR added an Apple-M1-specific code path that detected the chip via the GPU description string and routed `params.mmproj_use_gpu = false` so the vision projector ran on CPU instead of Metal, working around a SIGSEGV in the projector's image-encoding kernel observed on M1 Metal at the time.

Re-tested on M1 with the current fabric tip: no SIGSEGV, projector runs fine on Metal end-to-end. The carve-out is no longer needed.

Removed:
- BackendSelection: `isAppleM1Device()` helper, `bool& sawAppleM1` plumbing through `emplaceIfValidDevice` / `tryEmplaceDevice` / `chooseBackend`, and `bool* outSawAppleM1` parameter on both `chooseBackend` overloads.
- LlamaModel: the `bool sawAppleM1 = false` local, the call-site argument, and the `params.mmproj_use_gpu = !sawAppleM1` ternary; mmproj now uses GPU on every desktop platform (Android still hardcoded to false).
- test_backend_selection.cpp: `APPLE_M{1,2,3,4}_DESC` constants, `chooseBackendWithM1Flag()` helper, and the four `AppleM*_*` test cases.
- gemma4.test.js / qwen3-5.test.js: the comment blocks describing the M1 carve-out; `useCpuForVision` semantics are unchanged (`useCpu || isMobile` on gemma4 and `useCpu` on qwen3-5).

Verified on linux-x64/CPU after rebuild: 148/148 C++ unit tests pass (BackendSelectionTest, TuneConfigMapTest, ChatTemplateUtilsTest).

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop dead Gemma 4 markers from updateQwen3ReasoningBuffer

The PR added two extra substring scans for Gemma 4's reasoning channel markers (<|channel>thought open, <channel|> close) to updateQwen3ReasoningBuffer. The intent was to extend the EOS-rescue path (handleQwen3ReasoningEOS rewrites EOS-while-thinking into a closing tag) to Gemma 4.
That never actually fires though: both the buffer-update call and the EOS-rescue call in TextLlmContext are gated by `if (isQwen3Model_)`, and isQwen3Model_ resolves to `general.architecture == "qwen3"` only. Gemma 4 reports architecture "gemma4", so the gate never opens, the markers never get scanned, and the rescue path never runs for Gemma 4.

In live runs Gemma 4 always emits <channel|> cleanly before <eos>, so the rescue isn't needed on the happy path; if Gemma 4 ever truncates mid-thought under context pressure we will need a real dialect-aware rescue (per-arch close-tag token + extended gate) and a follow-up will add that. For this PR we just want the dead code gone so it doesn't mislead future readers about what's actually wired up.

Net: -9 lines, file is now identical to upstream main.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): switch gemma4 fixtures from unsloth to bartowski

The unsloth GGUF pack (huggingface.co/unsloth/gemma-4-E2B-it-GGUF) tags <turn|> as the EOG token in tokenizer.ggml.eos_token_id and leaves <eos> classified as a regular text token. Gemma 4's training-baked behaviour after assistant content is to emit a few <eos> tokens before <turn|>, so with that pack the addon's generation loop -- which terminates on llama_vocab_is_eog -- doesn't stop until <turn|> arrives. We were observing ~9 spurious <eos> tokens trailing every Gemma 4 response, eating into n_predict and KV cache for no gain.

bartowski's GGUF (huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF) ships the exact same vocabulary but tags <eos> as EOG (matching the base google/gemma-4-E2B-it tokenizer config). With that pack the addon terminates on the first <eos> -- empirically 0 trailing tokens, ~30 % shorter completions on the same prompt, same dialect output that the native-dialect parser added in 87e6c35 handles unchanged.
Verified on linux-x64/CPU (qvac-dev-linux-x64) with the same get_weather tool prompt: unsloth Q8_0 : 941 chars, 9 trailing <eos>, EOG = {<turn|>, </s>} bartowski Q4_K_M: 676 chars, 0 trailing <eos>, EOG = {<eos>, </s>} Note: the unsloth metadata bug deserves an upstream issue against the unsloth pack maintainers; this PR's scope is just to stop our tests paying the wasted-tokens tax. Co-authored-by: Cursor <cursoragent@cursor.com> * test(llm): unblock gemma4 image test on mobile + fix ctx overflow Three changes to packages/llm-llamacpp/test/integration/gemma4.test.js (image-describe subtest): 1. Drop the mobile CPU-vision carve-out. useCpuForVision used to force `device: 'cpu'` on Android/iOS to dodge Adreno OpenCL SIGABRT and Mali Vulkan instability that bit us with the unsloth mmproj. With bartowski's mmproj (now the fixture in 787c3322) we want CI to actually exercise the device-farm GPU code path for vision -- if that path regresses on a real Adreno or Mali chip we want to find out from CI, not by accident in production. Desktop x64-darwin / linux-arm64 keep CPU fallback because those hosts don't have a working GPU stack here. 2. Bump ctx_size 2048 -> 8192. A single elephant.jpg encodes to ~260 mtmd image tokens. With ctx_size=2048 plus Gemma 4's verbose CoT preamble the generation loop overflowed nPast > n_ctx during sampling (MtmdLlmContext.cpp:452), throwing 'processPromptImpl: context overflow'. 8192 leaves comfortable headroom on every backend. 3. Set reasoning-budget=0 for this test. We literally ask the model "Answer in one word" -- the <|channel>thought ...<channel|> CoT preamble that Gemma 4 wants to emit by default is wasted tokens here, and was the actual cause of the overflow above (CoT was running 8k+ tokens before the model reached the one-word answer and emitted <eos>). Disabling thinking gives us a deterministic ~10-token "Elephant" + <eos> response, which is what the substring-based assertion is testing for anyway. 
Verified on linux-x64 (qvac-dev-linux-x64, 2x RTX 5090, Vulkan backend) end-to-end:

    output: "Elephant"
    asserts: 3/3
    total time: ~2 s

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(llm): drop dead selectToolsCompactMarker(string) overload

selectToolsCompactMarker(const std::string& architecture) had no production callers anywhere -- only its two unit tests (SelectToolsCompactMarkerForQwen3, SelectToolsCompactMarkerForUnsupportedArchitecture) referenced it. Live production code goes through selectToolsCompactMarkerForModelMetadata (LlamaModel::resolveToolsCompactConfig calls that one), which takes std::optional<std::string> and is the only path that ever reaches the "qwen3" -> "<tool_call>" mapping at runtime.

Removed the .cpp definition, the .hpp declaration, and the two unit tests. selectToolsCompactMarkerForModelMetadata is unchanged and still covered by SelectToolsCompactMarkerForModelMetadataUsesArchitecture. ChatTemplateUtilsTest now runs 19/19 tests on linux-x64 (was 21/21).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): drop redundant useCpuForVision alias; vision runs on GPU on mobile

After we removed the per-mobile CPU carve-out for Gemma 4 vision (commit 2843297) and never had one for Qwen3.5 vision, useCpuForVision was just a no-op alias of useCpu used at exactly one call site each. Inline it.

Net effect on the device routing matrix is unchanged but explicit:

    platform/arch            useCpu  device used
    --------------------------------------------------------
    darwin-x64               true    cpu (no working GPU here)
    linux-arm64              true    cpu (no working GPU here)
    darwin-arm64 (M-series)  false   gpu (Metal)
    linux-x64                false   gpu (Vulkan/OpenCL)
    ios                      false   gpu (Metal -- device farm)
    android                  false   gpu (Adreno OpenCL / Mali Vulkan -- device farm)

So on iOS / Android the gemma4 and qwen3-5 image-describe subtests run through the actual GPU vision path -- the same path users hit -- and will surface any regression from CI rather than from production.
Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): correct thinkingForcedOpen_ comment re: gemma4

Gemma4 does not hit this code path: upstream common_chat_params_init_gemma4 explicitly leaves thinking_forced_open unset because gemma4's reasoning channel is model-emitted. Drop the misleading reference and call out the actual templates that trigger this path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): refresh PR-1874 entries to reflect actual shipped scope

The original CHANGELOG entries for llm-llamacpp 0.20.0, embed-llamacpp 0.16.0, and translation-nmtcpp 3.0.0 were drafted before the synthesizer revert, the M1 / sawMali / dead-code cleanups, the bartowski fixture swap, the native-dialect tool-call parsing, the reasoning-budget knob, the thinkingForcedOpen synthetic-opener, the new integration tests, and the move from 8189.0.0 to 8189.0.2. They now match what the PR actually ships.

Compressed every entry to a flat bullet list grouped by Keep-a-Changelog section (Changed / Added / Removed / Fixed / Deprecated / Internals) and bumped the date to 2026-05-10.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(changelog): trim items that round-trip to net-zero in the PR

Removed lines that described code that's neither in upstream/main nor in the PR head (so it has no observable impact on consumers):

- llm-llamacpp 0.20.0:
  * "tool-call streaming: each model now streams its native dialect / no re-shaping" -- main already streamed native dialects; the PR-internal synthesizer never shipped, so this is a non-change.
  * "Dropped sawMali plumbing / Apple-M1 detection / dead Gemma 4 markers in Qwen3ReasoningUtils" -- all three were added and removed inside this PR's commit history; net diff is zero.
- embed-llamacpp 0.16.0:
  * "Dropped Mali-detection plumbing" -- same: added and removed within this PR's history, net diff is zero.

Kept genuine net removals against upstream/main:
- Qwen3 model-name-based fallback.
- Dead `selectToolsCompactMarker(std::string)` overload (was pre-existing in main, only ever called from unit tests).

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(notice): regenerate NOTICE for embed-llamacpp, llm-llamacpp, translation-nmtcpp

Re-ran the notice-generate skill (.cursor/skills/notice-generate) for the three addons whose dependency surfaces changed in this PR:

- qvac-fabric bumped from 7248.x to 8189.0.2 -- different transitive C++ license set.
- holepunch / hyperswarm libs moved to peerDependencies on main, so the JS attribution lists shrink accordingly.
- @qvac/infer-base bumped to 0.4.1.

Per-package C++ resolution after the run:

    embed-llamacpp     : opencl/qvac-fabric/qvac-lib-inference-addon-cpp/
                         qvac-lint-cpp + libc++ (5 deps)
    llm-llamacpp       : the above + picojson + nlohmann-json (7 deps)
    translation-nmtcpp : bergamot-translator/sentencepiece/ssplit/
                         qvac-fabric/qvac-lib-inference-addon-cpp/
                         qvac-lint-cpp + libc++ (7 deps)

Net: +206 / -585 lines across the three NOTICE files (mostly transitive JS attribution shrink from the holepunch peerDeps refactor).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): make gemma4 reasoning-budget test tolerate model-emitted reasoning

Gemma 4's reasoning channel is model-emitted (no template force-open), so the model decides per-prompt whether to engage reasoning. For trivial prompts like "What is the capital of France?" the model can short-circuit and skip the <|channel>thought…<channel|> markers, which made the test flaky on CI.

Gate the marker / length assertions on the baseline actually emitting the opening marker; if it didn't, log a comment and skip the dependent checks instead of failing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* types(llm): declare reasoning_budget in LlamaConfig

The C++ config parser already accepts `reasoning_budget` (and the kebab-case `reasoning-budget` alias), but neither was a typed property on `LlamaConfig` -- they only typechecked via the catch-all index signature. Add a typed entry with JSDoc so TypeScript consumers get autocomplete and the accepted values (-1 default, 0 disabled).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(llm): allow per-request reasoning_budget override in run()

`reasoning_budget` was load-time only. Add it to `GenerationParams` so `model.run(messages, { generationParams: { reasoning_budget: 0 } })` can disable reasoning for a single request without re-loading the model -- same shape as `temp` / `top_p` / `seed` overrides.

Wiring:
- `LlmContext::GenerationParams` gains an optional `reasoning_budget` field and `hasOverrides()` covers it.
- `applyGenerationParamsToContext` snapshots / overrides / restores `params.reasoning_budget` alongside `n_predict`.
- `AddonJs::runJob` parses `generationParams.reasoning_budget` from JS and rejects values other than `-1` or `0`.
- `index.d.ts` exposes `reasoning_budget?: -1 | 0` on `GenerationParams` with a JSDoc note.

`tokenizeChat` already reads `params_.reasoning_budget`, so no change is needed in `TextLlmContext` / `MtmdLlmContext` -- the temporary override naturally propagates to `inputs.enable_thinking`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(llm): cover per-request reasoning_budget override on Qwen3.5

Validates the new per-request `generationParams.reasoning_budget` override end-to-end in two runs against a single loaded model:

1. `reasoning_budget: 0` override suppresses the `<think>…</think>` reasoning markers for that one request.
2.
The next `run()` with no override restores the load-time default (reasoning enabled), proving the override is request-scoped and not sticky.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(llm): case-insensitive antiprompt substring matching

`checkAntiprompt` now lowercases both the recent output window and each antiprompt before the `find()` so a single `Pizza` entry catches the model's `pizza`, `Pizza`, `PIZZA`, etc. Callers no longer need to list every casing variant.

Applied identically in `TextLlmContext` and `MtmdLlmContext`. The token-level early-exit path is unchanged (BPE tokens are case-specific; the substring path is the authoritative check).

Also drop the stale comment on the `Reverse prompt stops generation` scenario in `config-parameters.test.js`: it claimed the addon split on `,` without trimming, but `LlamaModel.cpp::split()` already trims and drops empty segments. Replaced with a brief note that documents the new (current) behaviour and simplified the antiprompt list to `'network, Pizza, bitcoin, blockchain'` so the test exercises both the trim and the case-insensitive match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(llm): stress case-insensitive antiprompt with PiZzA mixed-case entry

Swap the `Pizza` reverse_prompt entry for `PiZzA`. With case-sensitive matching `PiZzA` would never match the model's `pizza` / `Pizza` output; only case-insensitive comparison fires the stop. Verified locally -- the test still completes with output length 5, so the antiprompt trips on the first emitted "Pizza".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(llm): validate reasoning_budget before truncating to int

Address @jpgaribotti's review: previously the value was cast to int *before* the `0` / `-1` check, so fractional inputs like `0.5` or `-1.1` would silently truncate to a "valid" 0 / -1 and pass through.
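The truncation pitfall described above can be shown with a short sketch. Both function names are hypothetical and only illustrate the order of operations; the addon's actual parsing code in `AddonJs::runJob` is not reproduced here.

```cpp
#include <optional>

// Pitfall: casting first discards the fraction, so 0.5 and -1.1 look valid.
// (Only meaningful for inputs that fit in int; this is an illustration.)
bool acceptsAfterTruncation(double raw) {
    const int truncated = static_cast<int>(raw);
    return truncated == 0 || truncated == -1;
}

// Safer order: compare the exact double value, then cast.
// Both 0 and -1 are exactly representable, so == is reliable here.
std::optional<int> validateReasoningBudget(double raw) {
    if (raw != 0.0 && raw != -1.0) {
        return std::nullopt;  // rejects fractions, NaN, and other integers
    }
    return static_cast<int>(raw);
}
```

Checking before the cast also means out-of-range or NaN inputs never reach the conversion at all.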
Validate against the exact double values (both `0` and `-1` are exactly representable in IEEE-754, so `==` comparison is safe) before casting to int when storing in `ov.reasoning_budget`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(llm): use std::from_chars for reasoning_budget load-time parse

Address @jpgaribotti's review: `std::stoi` silently accepts trailing garbage (`"0abc"` → `0`) and throws an uncaught `std::out_of_range` for inputs that overflow `int`.

Switch to `std::from_chars`, which fails clean on non-numeric input, overflow (`errc::result_out_of_range`), and trailing garbage (`ptr != end`), then validate against the allowed `-1` / `0` values in the same check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
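A minimal sketch of the strict load-time parse described in the last commit above, assuming C++17. `parseReasoningBudget` is a hypothetical name chosen for this example, not the addon's function; only `std::from_chars` and its result fields are taken from the standard library.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Parse a reasoning_budget string strictly: the whole input must be an
// integer, and only -1 (default) or 0 (disabled) are accepted.
std::optional<int> parseReasoningBudget(std::string_view text) {
    int value = 0;
    const char* first = text.data();
    const char* last = text.data() + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{}) {
        return std::nullopt;  // non-numeric input or overflow
    }
    if (ptr != last) {
        return std::nullopt;  // trailing garbage such as "0abc"
    }
    if (value != -1 && value != 0) {
        return std::nullopt;  // outside the allowed set
    }
    return value;
}
```

Unlike `std::stoi`, `std::from_chars` reports failures through its return value instead of throwing, so the caller decides how to surface the error.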
Summary
Bumps the `qvac-fabric` port to upstream tag `v8189.0.1` (commit `739b309ae`, the SSH-resigned tip of `temp-8189`).
What's new vs `v8189.0.0#1` (the previous registry pin):
Diff
Test plan
Made with Cursor