
qvac-fabric: fix BUILD_LLAMA=OFF configure failure (port-version 1)#136

Merged: gianni-cor merged 1 commit into `main` from `qvac-fabric-fix-llama-off` on May 7, 2026

Conversation

@gianni-cor

Contributor

Summary

Fixes a portfile bug introduced together with the new `gpu-backends`/`llama` feature split (#133) and the `8189.0.0` bump (#135): when a consumer requests `qvac-fabric` without the `llama` feature, the configure step aborts with:

```
-- BUILD_LLAMA is OFF: skipping LLAMA targets.
-- Generating embedded license file for target: common
CMake Error at cmake/license.cmake:38 (message):
Target 'common' does not exist
```

Root cause: the portfile passes `-DLLAMA_MTMD=ON` unconditionally. `LLAMA_MTMD` transitively enables `LLAMA_BUILD_COMMON`, which causes upstream's `CMakeLists.txt` to call `license_generate(common)` — but `BUILD_LLAMA=OFF` (which is what `vcpkg_check_features` derives when `llama` is not requested) skips defining the `common` target.

Fix: move the `LLAMA_*` configure flags into a feature-gated `LLAMA_OPTIONS` list. With `llama` on, behaviour is unchanged. With `llama` off, pass `-DLLAMA_MTMD=OFF` + `-DLLAMA_BUILD_COMMON=OFF` so the license-generation block is skipped.

Test plan

  • Reproduced the failure with tetherto/qvac#1874's `qvac-lib-infer-nmtcpp` (`qvac-fabric` declared as `default-features: false` + `features: ["gpu-backends"]`) cross-building for `arm64-android` from a macOS host.
  • After this PR merges, re-run the same build and confirm `qvac-fabric[core,gpu-backends]:arm64-android@8189.0.0#1` configures and installs successfully.
  • Confirm `qvac-fabric[core,gpu-backends,llama]` (LLM/Embed consumers, the previous behaviour) still builds — the `llama` branch of the new conditional emits the same flags as before, so this is a no-op for them.

Made with Cursor

@gianni-cor

Contributor Author

Validated locally by overlaying this fixed port (`VCPKG_OVERLAY_PORTS`) into the tetherto/qvac#1874 build of `qvac-lib-infer-nmtcpp` (`qvac-fabric` declared with `default-features: false` + `features: ["gpu-backends"]`) cross-compiling for `arm64-android` from a macOS-arm64 host:

```
Installing 5/7 qvac-fabric[core,gpu-backends]:arm64-android@8189.0.0#1...

All requested installations completed successfully in: 3.4 min

[14/14] Linking CXX shared library qvac__translation-nmtcpp@3.bare
```

Configure now finishes (no more `Target 'common' does not exist` error) and the addon builds end-to-end.

Spot-check that the `llama` branch is still fine: previous validation in #135 built `qvac-lib-infer-llamacpp-llm` (`features: [llama, gpu-backends]` implicitly) for `arm64-osx` against `8189.0.0#0`. The new `#1` portfile preserves that branch byte-for-byte (same `-DLLAMA_*` flag set), so no regression risk for LLM/Embed consumers.

@gianni-cor gianni-cor force-pushed the qvac-fabric-fix-llama-off branch from 528a31d to 5b9dd7c Compare May 7, 2026 18:01
@gianni-cor

Contributor Author

Force-pushed `5b9dd7c`: per review feedback, hoist all the always-OFF `LLAMA_*` flags (`LLAMA_CURL`, `LLAMA_BUILD_TESTS`, `LLAMA_BUILD_TOOLS`, `LLAMA_BUILD_EXAMPLES`, `LLAMA_BUILD_SERVER`, `LLAMA_ALL_WARNINGS`) into the `LLAMA_OPTIONS` initialization so they're set in both branches.

Original commit only set those flags in the `llama` branch, on the assumption that `BUILD_LLAMA=OFF` already prunes the upstream code paths they affect. That's true today (`-- BUILD_LLAMA is OFF: skipping LLAMA targets`) but defensive symmetry is cheaper than a future regression. Net structure:

```cmake
set(LLAMA_OPTIONS
    -DLLAMA_CURL=OFF
    -DLLAMA_BUILD_TESTS=OFF
    -DLLAMA_BUILD_TOOLS=OFF
    -DLLAMA_BUILD_EXAMPLES=OFF
    -DLLAMA_BUILD_SERVER=OFF
    -DLLAMA_ALL_WARNINGS=OFF
)
if("llama" IN_LIST FEATURES)
    list(APPEND LLAMA_OPTIONS -DLLAMA_MTMD=ON)
else()
    list(APPEND LLAMA_OPTIONS
        -DLLAMA_MTMD=OFF
        -DLLAMA_BUILD_COMMON=OFF
    )
endif()
```

Re-validated: nmtcpp arm64-android cross-build still configures (both `arm64-android-dbg` and `-rel`) with this revision. Behaviour for the `llama` branch is identical to before.

When the consumer requests qvac-fabric without the 'llama' feature
(e.g. tetherto/qvac's nmtcpp, which only needs ggml), the portfile
still passed -DLLAMA_MTMD=ON unconditionally. LLAMA_MTMD pulls in
LLAMA_BUILD_COMMON, which makes upstream's CMakeLists call
license_generate(common) -- but BUILD_LLAMA=OFF skips defining the
'common' target, so the build aborts:

  -- BUILD_LLAMA is OFF: skipping LLAMA targets.
  -- Generating embedded license file for target: common
  CMake Error at cmake/license.cmake:38 (message):
    Target 'common' does not exist

Move the LLAMA_* configure flags into a feature-gated LLAMA_OPTIONS
list. With the 'llama' feature on, behaviour is unchanged. Without
it, pass -DLLAMA_MTMD=OFF and -DLLAMA_BUILD_COMMON=OFF so the
license-generation block is skipped and the configure succeeds.

Validated by building qvac-lib-infer-nmtcpp (vcpkg.json:
qvac-fabric default-features:false + features:["gpu-backends"])
for arm64-android: configure now completes without error.

Co-authored-by: Cursor <cursoragent@cursor.com>
@gianni-cor gianni-cor force-pushed the qvac-fabric-fix-llama-off branch from 5b9dd7c to a47f4f2 Compare May 7, 2026 18:07
@gianni-cor gianni-cor merged commit f74fc66 into main May 7, 2026
2 checks passed
gianni-cor added a commit to zoq/qvac-fork that referenced this pull request May 7, 2026
The 8189.0.0 (port-version 0) qvac-fabric port shipped a
configure-time bug for consumers without the "llama" feature
(i.e. nmtcpp): -DLLAMA_MTMD=ON was passed unconditionally, which
transitively enables LLAMA_BUILD_COMMON, which makes upstream call
license_generate(common) -- but BUILD_LLAMA=OFF skips defining the
'common' target, so the cmake configure aborts.

The fix landed in tetherto/qvac-registry-vcpkg#136 as
qvac-fabric port-version 1. Bumping the consumer constraint from
"version>=": "8189.0.0" to "version>=": "8189.0.0#1" forces vcpkg
to pick the fixed port-version (otherwise it picks the lowest
satisfying version, which is the broken #0).

Validated: nmtcpp arm64-android cross-build now configures and
builds end-to-end against the upstream registry, no overlay needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
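
The effect of the `#1` suffix described in this commit message can be illustrated with a simplified model of minimum-version selection. This is only a JavaScript sketch under stated assumptions, not vcpkg's actual resolver: `parseVersion`, `compareVersions`, and `resolveMinimum` are hypothetical helpers, every lower bound is assumed to exist in the registry, and the `7248.2.3` baseline is the one mentioned elsewhere in this thread.

```javascript
// Simplified model of "pick the lowest version that satisfies every
// version>= constraint". Hypothetical helpers, not vcpkg's real code.

// "8189.0.0#1" -> { parts: [8189, 0, 0], port: 1 }; no "#N" means port-version 0.
function parseVersion (text) {
  const [version, portVersion = '0'] = text.split('#')
  return { parts: version.split('.').map(Number), port: Number(portVersion) }
}

// Negative if a < b, positive if a > b, zero if equal.
function compareVersions (a, b) {
  const left = parseVersion(a)
  const right = parseVersion(b)
  const n = Math.max(left.parts.length, right.parts.length)
  for (let i = 0; i < n; i++) {
    const diff = (left.parts[i] ?? 0) - (right.parts[i] ?? 0)
    if (diff !== 0) return diff
  }
  return left.port - right.port
}

// The lowest version that satisfies all lower bounds is the highest bound.
function resolveMinimum (lowerBounds) {
  return lowerBounds.reduce((best, v) => (compareVersions(v, best) > 0 ? v : best))
}

// Baseline plus the old constraint resolves to the broken port-version 0;
// raising the constraint to "8189.0.0#1" moves the floor to the fixed port.
const beforeBump = resolveMinimum(['7248.2.3', '8189.0.0'])
const afterBump = resolveMinimum(['7248.2.3', '8189.0.0#1'])
```

In this model the constraint acts purely as a floor, which is why the consumer manifest has to name the port-version explicitly to avoid `#0`.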
@gianni-cor gianni-cor mentioned this pull request May 8, 2026
4 tasks
jpgaribotti pushed a commit to jpgaribotti/qvac-registry-vcpkg that referenced this pull request May 8, 2026
Updates the qvac-fabric port to upstream tag v8189.0.1 (commit
739b309ae, the SSH-resigned tip of temp-8189). Refreshes the source
tarball SHA512, resets port-version to 0, and bumps the baseline
plus per-version manifest. The portfile bug-fix for BUILD_LLAMA=OFF
introduced in tetherto#136 (port-version 1) is preserved.

What's new in v8189.0.1 over v8189.0.0:

- Inject enable_thinking into the Jinja template context so models
  like Qwen 3.5 and Gemma 4 actually emit reasoning content
  (tetherto/qvac-fabric-llm.cpp#128).
- Add GGML_OP_DELTA_NET_AR Vulkan compute shader + dispatch path
  (tetherto#129) so Vulkan no longer falls back to CPU per token on
  Qwen 3.5 / DeltaNet decode.
- vulkan: Force f32 src1 through the strided cpy path to fix an
  embedding-model crash (tetherto#130).

Co-authored-by: Cursor <cursoragent@cursor.com>
jpgaribotti pushed a commit that referenced this pull request May 8, 2026
Updates the qvac-fabric port to upstream tag v8189.0.1 (commit
739b309ae, the SSH-resigned tip of temp-8189). Refreshes the source
tarball SHA512, resets port-version to 0, and bumps the baseline
plus per-version manifest. The portfile bug-fix for BUILD_LLAMA=OFF
introduced in #136 (port-version 1) is preserved.

What's new in v8189.0.1 over v8189.0.0:

- Inject enable_thinking into the Jinja template context so models
  like Qwen 3.5 and Gemma 4 actually emit reasoning content
  (tetherto/qvac-fabric-llm.cpp#128).
- Add GGML_OP_DELTA_NET_AR Vulkan compute shader + dispatch path
  (#129) so Vulkan no longer falls back to CPU per token on
  Qwen 3.5 / DeltaNet decode.
- vulkan: Force f32 src1 through the strided cpy path to fix an
  embedding-model crash (#130).
@gianni-cor gianni-cor deleted the qvac-fabric-fix-llama-off branch May 9, 2026 06:58
gianni-cor added a commit to tetherto/qvac that referenced this pull request May 11, 2026
* Restore Qwen3.5 / Gemma4 / PaddleOCR-VL tests + Mali coopmat fix

Stack of three logical changes squashed into one commit so the test
ports stay self-consistent with the build/runtime they depend on:

1. qvac-fabric overlay ports (LLM + embed + nmtcpp):
   - Pin to fabric 78db8bf4 (PR tetherto/qvac-fabric-llm.cpp#121 HEAD,
     includes c79a8851 "ggml-vulkan: Fix NaN outputs on Mali").
   - Drop -DGGML_VULKAN_DISABLE_COOPMAT*=ON for Android so coopmat
     shaders are compiled in. With coopmat off, runtime
     device->coopmat_support is false and the Mali fix's ARM-gated
     branches were skipped, leaving Qwen3-Q8_0 finetuning NaN on
     Pixel 9 Pro Mali.
   - Wire up overlay-ports in each package's vcpkg-configuration.json.
   - Add find_package(OpenSSL) before find_package(llama) in the LLM
     CMakeLists so llama-targets.cmake's transitive OpenSSL::SSL
     reference (via cpp-httplib) resolves on local builds.

2. utils.js downloadFile redirect race:
   - Track a handedOff flag set when the redirect branch hands off
     dest to a recursive call. All cleanup paths now skip fs.unlink
     once ownership is transferred, so a late error from the outer
     writestream can't delete the freshly-downloaded file (Pixel
     ENOENT after "successful" mmproj download).

3. Three new integration tests + their mobile harness wiring:
   - qwen3-5.test.js — basic / multi-turn / tool-calling
   - gemma4.test.js — text / multi-turn / image (forced to CPU on
     darwin + mobile because gemma4v projector SIGSEGVs on Metal and
     Adreno OpenCL) / tool-calling
   - ocr-paddle.test.js — OCR; mobile maxTokens capped to 768
   - Ported to the new addon API (files: { model: [absPath],
     projectionModel?: absPath }, config: …).
   - Added matching unit test test_text_llm_context_qwen3.cpp.
   - integration.auto.cjs registers runQwen35Test, runGemma4Test,
     runOcrPaddleTest dispatchers.
   - test-groups.json: iOS heavy4 cluster
     (Gemma4+OcrLighton+OcrPaddle), iOS lightB adds Qwen35,
     Android groupB has Qwen35 first then Gemma4 / OcrPaddle.
   - Workflow: Android GroupB Device Farm jobTimeout 60→90 min.
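
The `handedOff` guard from item 2 can be sketched in isolation. This is a minimal model, not the actual `utils.js` code: `createDestOwner` and the in-memory `files` set are hypothetical stand-ins for the download's cleanup paths and the filesystem.

```javascript
// Minimal model of the ownership hand-off. Hypothetical names; the real
// downloadFile works with streams and fs.unlink.
function createDestOwner (files, dest) {
  let handedOff = false
  return {
    // Called when the redirect branch passes dest to the recursive call.
    handOff () {
      handedOff = true
    },
    // Called from every error/cleanup path of this attempt.
    // Returns true if the file was removed, false if cleanup was skipped.
    cleanup () {
      if (handedOff) return false // dest now belongs to the follow-up call
      files.delete(dest)
      return true
    }
  }
}

// Usage: the outer attempt sees a redirect and hands off; the recursive
// attempt then writes the file. A late error on the outer stream must not
// delete it.
const files = new Set()
const outer = createDestOwner(files, 'mmproj.gguf')
outer.handOff()
files.add('mmproj.gguf') // written by the recursive download
outer.cleanup() // late outer error: skipped, so the file survives
```

Keeping the flag next to the cleanup routine means each error path only has to call `cleanup()`, rather than re-checking ownership itself.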

* API port + Gemma4 tool-call fix.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Wire addon/src/patches ahead of the vcpkg include path to pick up the LlamacppUtils.hpp ptr-API override.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* API port + Gemma4 tool-call fix.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Split iOS heavy4 into three single-test specs (heavy4 = OcrLighton, new heavy7 = Gemma4, new heavy8 = OcrPaddle) and schedule them as separate Device Farm runs to avoid memory pressure.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Drop LlamacppUtils.hpp patch override; bump addon-cpp to 1.1.7

The LlamacppUtils.hpp common_init_result_ptr API now ships in
qvac-lib-inference-addon-cpp 1.1.7 (PR #1887), so the local
addon/src/patches/qvac-lib-inference-addon-cpp/LlamacppUtils.hpp
shim is no longer needed in the embed and llm addons.

- Delete the patch headers in embed and llm.
- Drop the BEFORE PRIVATE addon/src/patches include path from the
  embed/llm production and unit-test CMakeLists.
- Bump qvac-lib-inference-addon-cpp version>= to 1.1.7 in the embed,
  llm, and nmtcpp vcpkg.json files so they pick up the upstream
  ptr-API header from the registry.

The OpenSSL find_package() addition stays — it's an unrelated
local-build fix.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Cap ocr-lighton predict to 1800 (desktop) / 768 (mobile) so the LightOnOCR response can't overrun ctx_size=4096.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Rewrite sliding-context test to use the post-GGML_PAD effective n_ctx (512) and retune n_predict / n_discarded so all 8 cases match the current ContextSlider semantics.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Allow embed batching test to override ctx_size and pin gte-large to batch_size=512 / ctx_size=384 to probe the Mali Vulkan first-submit ErrorDeviceLost.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix the reverse-prompt scenario by removing the comma and space, listing both 'pizza' and 'Pizza', and lowercasing the assertion comparisons so they match either casing.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Sanitize media Uint8Array prompts before logging to avoid V8 Zone OOM.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Use Qwen3 family chat-template to fix Qwen3.5-0.8B gibberish output on macOS Metal.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Update portfiles to point to the latest fabric.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Revert "Allow embed batching test to override ctx_size and pin gte-large to batch_size=512 / ctx_size=384 to probe the Mali Vulkan first-submit ErrorDeviceLost."

This reverts commit 1408896.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Raise AfriqueGemma cancel maxWait to 60s, and apply the use_jinja gate-drop so Qwen3-family models always pick the fixed jinja template.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Drop the retired AfriqueGemma integration tests.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Update portfiles to point to the latest head.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Update portfiles to point to the latest head.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Drop qwen35 from the Qwen3-template detection and the supported-finetune-architecture list since neither path is actually validated for Qwen3.5.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Update portfiles to point to the latest head.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Enable coopmat.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Drop the Qwen3 use_jinja override pairing now that qwen35 is no longer treated as Qwen3-family.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Use only general.architecture for Qwen3 detection so Qwen3.5 stops getting the Qwen3 chat-template via the model-name substring fallback.

Drop modelNameLooksLikeQwen3 / getModelName and the modelName parameter from supportsToolsCompactForModelMetadata and selectToolsCompactMarkerForModelMetadata. The substring match on general.name treated "Qwen3.5-..." as Qwen3 and overrode the model's embedded tokenizer.chat_template, contradicting the recent decision to keep qwen35 out of the Qwen3 family. Update the LlamaModel call site and unit tests; add explicit qwen35/nullopt negative cases.

Co-authored-by: Cursor <cursoragent@cursor.com>
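
The difference between the two detection strategies can be shown with a small JavaScript model. The addon implements this in C++, so the function names here are hypothetical, and the sketch assumes the Qwen3 architecture string is exactly `qwen3`.

```javascript
// Hypothetical model of the detection change; not the addon's C++ code.
// Assumes 'qwen3' is the general.architecture value for the Qwen3 family.
function isQwen3Architecture (metadata) {
  return metadata['general.architecture'] === 'qwen3'
}

// The removed fallback, shown for contrast: a substring match on the name.
function nameLooksLikeQwen3 (metadata) {
  return (metadata['general.name'] ?? '').toLowerCase().includes('qwen3')
}

// A Qwen3.5 model: the name contains "qwen3", the architecture does not match.
const qwen35Metadata = {
  'general.architecture': 'qwen35',
  'general.name': 'Qwen3.5-0.8B'
}

// nameLooksLikeQwen3(qwen35Metadata) is true, which is the misclassification;
// isQwen3Architecture(qwen35Metadata) is false, so the embedded template is kept.
```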

* Accept HuggingFace function-call XML in extractToolCalls so the Qwen3.5 tool-calling integration test parses the model's native <tool_call><function=...><parameter=...>...</parameter></function></tool_call> envelope produced by its embedded chat template, in addition to the Qwen3-style JSON envelope.

Co-authored-by: Cursor <cursoragent@cursor.com>
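
A parser that accepts both envelopes might look like the sketch below. The real `extractToolCalls` is not shown in this thread, so `extractToolCallsSketch`, its return shape, and the assumption that the JSON envelope carries `name` and `arguments` keys are all illustrative.

```javascript
// Hypothetical sketch; not the test suite's actual extractToolCalls.
// Returns [{ name, arguments }] for every <tool_call>...</tool_call> block.
function extractToolCallsSketch (text) {
  const calls = []
  const envelope = /<tool_call>([\s\S]*?)<\/tool_call>/g
  for (const [, body] of text.matchAll(envelope)) {
    const trimmed = body.trim()

    // XML form: <function=name><parameter=key>value</parameter></function>
    const fn = /^<function=([^>\s]+)>([\s\S]*?)<\/function>$/.exec(trimmed)
    if (fn) {
      const args = {}
      const param = /<parameter=([^>\s]+)>([\s\S]*?)<\/parameter>/g
      for (const [, key, value] of fn[2].matchAll(param)) {
        args[key] = value.trim() // XML parameter values are kept as strings
      }
      calls.push({ name: fn[1], arguments: args })
      continue
    }

    // JSON form (assumed shape): {"name": "...", "arguments": {...}}
    try {
      const parsed = JSON.parse(trimmed)
      calls.push({ name: parsed.name, arguments: parsed.arguments ?? {} })
    } catch {
      // Neither envelope matched; skip this block.
    }
  }
  return calls
}
```

Trying the XML form first keeps the JSON path unchanged for models that already emit it.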

* Bump n_predict in the Qwen3.5 basic and multi-turn integration tests so the embedded chat-template's reasoning block has room to finish before the answer on slower CI backends.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Enable coopmat and point to the latest fabric.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Route Qwen3.5 inference and all finetuning on Mali to CPU, disable Vulkan coopmat at build time, halve mobile finetune workload to account for CPU training.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Point to the latest fabric version.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Force Bert to the CPU on Mali.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Run finetuning on Mali GPU.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Run Qwen 3.5 on Mali GPU.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Point to the latest fabric version and enable coopmat path.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* vcpkg: drop per-package qvac-fabric overlays

Removes the qvac-fabric overlay-ports infrastructure from the LLM,
Embed, and NMT manifests. The default-registry baseline is left
untouched, so vcpkg now resolves qvac-fabric directly from the
registry at the existing baseline (7248.2.3).

Bumping to fabric 8189.0.0 will be handled by a separate baseline
update; this commit only undoes the overlay-based development setup
that was no longer needed.

- vcpkg-configuration.json (3x): drop "overlay-ports" entry.
- vcpkg/ports/qvac-fabric/ (3x): remove overlay portfile.cmake,
  vcpkg.json, and android-vulkan-version.cmake.

Co-authored-by: Cursor <cursoragent@cursor.com>

* vcpkg: bump qvac-fabric version constraint to 8189.0.0

Updates the consumer manifests in the LLM, Embed, and NMT packages
to require qvac-fabric >= 8189.0.0. The default-registry baseline
is intentionally left untouched.

Co-authored-by: Cursor <cursoragent@cursor.com>

* llm/embed/nmtcpp: bump versions for qvac-fabric 8189.0.0

- qvac-lib-infer-llamacpp-llm: 0.19.2 -> 0.20.0 (minor)
- qvac-lib-infer-llamacpp-embed: 0.15.0 -> 0.16.0 (minor)
- qvac-lib-infer-nmtcpp: 2.1.1 -> 3.0.0 (major)

The nmtcpp major bump reflects a real behavioural regression: the
previous overlay built ggml unconditionally with every GPU backend
the platform supported (Vulkan/Metal/OpenCL); switching to the
upstream registry port with the existing "default-features": false
in nmtcpp's vcpkg.json now disables the new "gpu-backends" feature,
so out-of-the-box ggml exposes only the CPU backend. Consumers that
rely on GPU-accelerated nmt inference must add
'"features": ["gpu-backends"]' to the qvac-fabric block of their
nmtcpp build manifest.

CHANGELOG entries added in all three packages.

Co-authored-by: Cursor <cursoragent@cursor.com>

* nmtcpp: opt into qvac-fabric gpu-backends feature; downgrade bump to 2.2.0

The previous commit (3.0.0) flagged a breaking change: switching from
the always-on overlay to the registry port with default-features:false
disabled GPU backends in ggml. Adding "features": ["gpu-backends"]
to nmtcpp's qvac-fabric dep restores the previous Vulkan/Metal/OpenCL
behaviour, so the bump is now a non-breaking minor (2.2.0) and the
BREAKING note in the changelog is replaced with a plain Changed entry.

Co-authored-by: Cursor <cursoragent@cursor.com>

* nmtcpp: re-bump to 3.0.0 (major)

Restores the major version bump for nmtcpp. The new fabric port schema
(features split between gpu-backends/llama) and the move from a vendored
overlay to the upstream registry are large enough downstream changes
that consumers should treat this as a major release, even though
runtime behaviour is preserved by opting into "gpu-backends".

Co-authored-by: Cursor <cursoragent@cursor.com>

* vcpkg: pin qvac-fabric to >=8189.0.0#1

The 8189.0.0 (port-version 0) qvac-fabric port shipped a
configure-time bug for consumers without the "llama" feature
(i.e. nmtcpp): -DLLAMA_MTMD=ON was passed unconditionally, which
transitively enables LLAMA_BUILD_COMMON, which makes upstream call
license_generate(common) -- but BUILD_LLAMA=OFF skips defining the
'common' target, so the cmake configure aborts.

The fix landed in tetherto/qvac-registry-vcpkg#136 as
qvac-fabric port-version 1. Bumping the consumer constraint from
"version>=": "8189.0.0" to "version>=": "8189.0.0#1" forces vcpkg
to pick the fixed port-version (otherwise it picks the lowest
satisfying version, which is the broken #0).

Validated: nmtcpp arm64-android cross-build now configures and
builds end-to-end against the upstream registry, no overlay needed.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs: drop overlay-removal note from changelogs

Removes the changelog bullet describing the deletion of the per-package
qvac-fabric vcpkg overlay. The overlay teardown is mechanical packaging
plumbing rather than a user-facing change worth documenting.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test/llm: restore AfriqueGemma integration tests (desktop-only)

Reverts e257a19's deletion of the afriquegemma-edge-cases and
afriquegemma-translation integration tests, and adds a 'desktopOnly'
opt-out so they're skipped on mobile without breaking the per-test
group coverage invariant.

- packages/qvac-lib-infer-llamacpp-llm/test/integration/afriquegemma-edge-cases.test.js: restored.
- packages/qvac-lib-infer-llamacpp-llm/test/integration/afriquegemma-translation.test.js: restored.
- test/mobile/test-groups.json: new top-level "desktopOnly" array
  listing runAfriquegemmaEdgeCasesTest and runAfriquegemmaTranslationTest.
- scripts/generate-mobile-integration-tests.js: validateGroups now
  reads the desktopOnly list; entries are still emitted into
  integration.auto.cjs (so validate-mobile-tests stays happy) but
  excluded from the per-platform "missing" check, so the mobile
  runners never invoke them.
- test/mobile/integration.auto.cjs: regenerated by
  `npm run test:mobile:generate`.
- CHANGELOG note in qvac-lib-infer-llamacpp-llm under Tests.

Validated via `npm run test:mobile:generate` + `npm run test:mobile:validate`.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): drop AfriqueGemma test restoration changelog note

Co-authored-by: Cursor <cursoragent@cursor.com>

* test/llm: switch AfriqueGemma desktop-only skip to in-test pattern

Per review: don't change generate-mobile-integration-tests.js. Use the
same skip:isMobile pattern other tests already use (config-parameters,
tool-calling, image), and keep the AfriqueGemma functions in the iOS
lightA / Android groupA groups so the existing per-test coverage
invariant stays intact.

- packages/qvac-lib-infer-llamacpp-llm/scripts/generate-mobile-integration-tests.js:
  reverted to upstream/main (drops the desktopOnly opt-out plumbing).
- test/mobile/test-groups.json: drops 'desktopOnly', adds
  runAfriquegemmaEdgeCasesTest and runAfriquegemmaTranslationTest
  back to ios.lightA and android.groupA.
- test/integration/afriquegemma-edge-cases.test.js,
  test/integration/afriquegemma-translation.test.js: add
  isMobile = platform === 'ios' || platform === 'android', and
  skip:isMobile to every test() options object (13 total).
- test/mobile/integration.auto.cjs: regenerated.

Validators both green:
  npm run test:mobile:generate -> "all tests assigned for every platform"
  npm run test:mobile:validate -> ok

Co-authored-by: Cursor <cursoragent@cursor.com>
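
The in-test skip pattern amounts to the following. The commit inlines the `isMobile` expression in each test file; `mobileSkipOptions` is a hypothetical helper used here only to make the sketch self-contained, and how `platform` is obtained is left as an assumption.

```javascript
// Hypothetical helper wrapping the pattern from the commit message.
// `platform` is assumed to be a string such as 'ios', 'android' or 'darwin'.
function mobileSkipOptions (platform, extra = {}) {
  const isMobile = platform === 'ios' || platform === 'android'
  return { ...extra, skip: isMobile }
}

// Intended use, assuming a test(name, options, fn) style runner:
//   test('edge case', mobileSkipOptions(platform), async (t) => { ... })
```

Because the test functions stay registered in `ios.lightA` and `android.groupA`, the coverage check still sees them; they simply report as skipped on mobile.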

* test/llm: skip ocr-lighton on mobile

Adds skip:isMobile to the single test in ocr-lighton.test.js,
matching the AfriqueGemma / config-parameters / tool-calling
pattern. isMobile is already defined in this file. The test stays
in ios.heavy4 / android.groupB so per-platform group coverage is
unaffected; the brittle test itself just skips on mobile.

Co-authored-by: Cursor <cursoragent@cursor.com>

* ci: revert workflow timeout change for llm mobile integration

Drops PR #1874's edit to
.github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml
(parameterised jobTimeoutMinutes + 90-minute override for Android
GroupB). Workflow is restored to the upstream/main version.

Co-authored-by: Cursor <cursoragent@cursor.com>

* addons: disable flash-attn by default on the OpenCL backend

Flash attention is not reliably supported by the OpenCL ggml backend
(Adreno path), so when the chosen GPU backend ends up being OpenCL
the addons now force "flash-attn=off" unless the user explicitly
passed flash-attn / flash_attn in their config.

LLM (LlamaModel.cpp / LlamaModel.hpp):

- Add a bool isOpenCl parameter to tuneConfigMap (defaulted to false
  to keep the existing test_tune_config_map.cpp call sites working).
- Mirror the BitNet-disabling branch with an else-if for OpenCL +
  notUserSet("flash-attn", "flash_attn").
- At the call site, read chosenBackend.first/second after chooseBackend
  returns and pass isOpenCl through.

Embed (BertModel.cpp):

- No tuneConfigMap equivalent here. Inject the same logic inline
  immediately after chooseBackend, before configFilemap is serialised
  into configVector. Honour user-set "flash-attn"/"flash_attn".

Both packages compile cleanly via bare-make build on macOS-arm64.

Co-authored-by: Cursor <cursoragent@cursor.com>
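
The defaulting rule can be modelled in a few lines. The addons implement it in C++ (`tuneConfigMap` for the LLM, inline in `BertModel.cpp` for embed); the JavaScript below is only an illustration, and `applyOpenClFlashAttnDefault` is a hypothetical name.

```javascript
// JavaScript model of the rule; not the addons' C++ code.
// Forces flash-attn off on OpenCL unless the user set either spelling.
function applyOpenClFlashAttnDefault (config, isOpenCl) {
  const userSet = 'flash-attn' in config || 'flash_attn' in config
  if (isOpenCl && !userSet) {
    return { ...config, 'flash-attn': 'off' }
  }
  return config // user choice or non-OpenCL backend: leave untouched
}

// OpenCL, nothing set by the user: the default is injected.
const openClDefault = applyOpenClFlashAttnDefault({}, true)
// OpenCL, user set the underscore spelling: respected as-is.
const openClUserSet = applyOpenClFlashAttnDefault({ flash_attn: 'on' }, true)
```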

* fixup! tuneConfigMap: keep ABI for existing 4-arg test callers

CI failure on cpp-tests-darwin-arm64 (PR #1874):
  test/unit/test_tune_config_map.cpp:199:43: fatal error: no viable
  conversion from 'FtOverrides' to 'bool'

The previous commit inserted bool isOpenCl as the 4th parameter of
tuneConfigMap, but several existing tests pass FtOverrides{...} as
the 4th positional argument (relying on it being finetuneOverrides).

Swap the order so the new isOpenCl parameter comes after the existing
finetuneOverrides; both stay defaulted, so all old 3-arg and 4-arg
call sites compile unchanged. The production call site in
LlamaModel.cpp is updated accordingly.

Also adds 4 new TuneConfigMapTest cases covering the OpenCL branch:
- OpenCl_NonBitnet_FlashAttnDisabledByDefault
- OpenCl_UserSetFlashAttnHyphen_Respected
- OpenCl_UserSetFlashAttnUnderscore_Respected
- NotOpenCl_NonBitnet_FlashAttnUnchanged

All 53 TuneConfigMapTest cases pass locally on macOS-arm64.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add Qwen 3.5 vision test.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Route vision models with mmproj to CPU on Apple M1.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Route only the projector to CPU on Apple M1.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Run qwen3-5.test.js on iOS GPU

* js lint

* Recognize Gemma 4 channel reasoning markers in Qwen3ReasoningUtils, and bump gemma4 basic-test n_predict so the answer fits after the thinking preamble.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Wire reasoning-budget config to inputs.enable_thinking so passing reasoning-budget=0 disables the model's <think> reasoning channel, and add coverage for Qwen3, Qwen3.5, and Gemma 4.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* vcpkg: bump qvac-fabric to >=8189.0.1

The 8189.0.1 port (tetherto/qvac-registry-vcpkg#138) resets
port-version to 0 while keeping port-version 1's BUILD_LLAMA=OFF
portfile fix, and ships the new fabric tip 739b309ae. Notable
upstream fixes pulled in:

- Inject enable_thinking into the Jinja template context so Qwen 3.5
  and Gemma 4 actually emit <think> reasoning content.
- GGML_OP_DELTA_NET_AR Vulkan compute shader (Qwen 3.5 / DeltaNet
  decode no longer falls back to CPU per token).
- vulkan: f32 src1 strided cpy fix (embedding-model crash).

Validated on macOS-arm64: vcpkg resolves
qvac-fabric[core,gpu-backends,llama]:arm64-osx@8189.0.1 and the
addon builds end-to-end.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Disable the embed addon's BERT-on-Mali CPU override.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Prepend <think> opener to the visible stream when the chat template force-opens the reasoning channel.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Remove the Mali detection plumbing from the embed addon now that BERT runs on Mali GPU.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Bump n_predict and ctx_size in the Qwen3.5 reasoning-budget baseline so the model reliably reaches </think>.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Restore the mobile finetune dataset to 8 samples.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* test: drop AfriqueGemma + MedGemma + Dolphin-MoE tests

Per review: cull tests that exercise models we no longer want covered
in the LLM/SDK CI matrix.

LLM (packages/llm-llamacpp):
- Delete integration tests:
  - test/integration/afriquegemma-edge-cases.test.js
  - test/integration/afriquegemma-translation.test.js
  - test/integration/moe.test.js (dolphin-mixtral-2x7b)
- Delete docs/afriquegemma-translation.md (only documents the
  now-removed integration tests).
- Strip the medgemma-4b-it variant from:
  - test/integration/tool-calling.test.js (collapses
    ALL_TOOL_MODEL_VARIANTS / TOOL_MODEL_VARIANTS to qwen3-1.7b only,
    drops the now-unused isMobile derived var).
  - test/integration/finetuning-pause-resume.test.js (drops the
    medgemma-4b-it-q4_0 entry from FINETUNE_MODELS).
- test/unit/test_model_metadata.cpp: drop the gemma3Model_ fixture +
  the two Gemma3-specific TEST_F cases
  (DiskSingleFile_Gemma3Arch_*); update the comment block listing
  exercised arches accordingly.
- test/unit/pick-primary-gguf-path.test.js: keep the tensors.txt-first
  ordering test, but rebase the fixture filenames on
  Qwen3-4B-Q4_K_M-* so no medgemma names remain in the test corpus.
- test/mobile/test-groups.json + test/mobile/integration.auto.cjs:
  drop runAfriquegemmaEdgeCasesTest, runAfriquegemmaTranslationTest,
  runMoeTest from both ios and android groups; auto.cjs trimmed to
  match. `validate-mobile-tests.js` is green.

SDK (packages/sdk/tests-qvac):
- Delete tests/translation-afriquegemma-tests.ts.
- tests/test-definitions.ts: drop translationAfriquegemmaTests
  import + spread.
- tests/shared/executors/translation-executor.ts: drop the import,
  the spread, and the |afriquegemma branch from the dispatch regex.
- tests/mobile/consumer.ts + tests/desktop/consumer.ts: drop the
  AFRICAN_4B_TRANSLATION_Q4_K_M import and the
  resources.define("afriquegemma", ...) block; mobile also drops the
  afriquegemma-only SkipExecutor.
- tests/shared/resource-lifecycle.ts: rephrase the eviction-comment
  example to a generic "large translation model" so it no longer
  references the deleted resource.

Not touched: NOTICE/CHANGELOG (auto-generated/historical),
sdk/models/registry/* (model constants in the registry are data, not
tests), sdk/examples/translation/translation-llm-afriquegemma.ts
(consumer-facing example, not a test).

* Revert "test: drop AfriqueGemma references from packages/sdk/tests-qvac"

Per review: keep packages/sdk/tests-qvac/ untouched. Restore the SDK
afriquegemma test file, the test-definitions / translation-executor /
desktop+mobile consumer / resource-lifecycle edits to their state
prior to commit 36de6ec.

Only the LLM-side cull (packages/llm-llamacpp + the deleted afrique /
moe / medgemma test files there) from 36de6ec is kept.

* Restore packages/llm-llamacpp/docs/afriquegemma-translation.md

Per review: keep the AfriqueGemma translation doc. Commit 36de6ec
removed it together with the LLM AfriqueGemma test files; restore it
unchanged from the merge tip (e29836d).

* chore: pin qvac-fabric to 8189.0.2 via overlay-ports for testing

Adds an overlay port copy of qvac-fabric pointing at v8189.0.2 of
tetherto/qvac-fabric-llm.cpp (tetherto/qvac-registry-vcpkg#140)
to llm-llamacpp, embed-llamacpp, and translation-nmtcpp, declared via
each package's vcpkg-configuration.json. Lets this PR exercise the new
fabric build (incl. the Mali coopmat1 BitNet TQ NaN fix) without
waiting for the registry baseline bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: pin overlay qvac-fabric to temp-8189 tip f686a1324

Point REF at the latest qvac-fabric-llm.cpp temp-8189 commit
(f686a1324e13184d3257cb74c1ba17f9cf8ef575) instead of v8189.0.2 so the
overlay tracks branch tip while the branch is still moving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: extend Android LLM mobile test timeouts

Allow slower Android Device Farm runs to finish model-heavy LLM tests before the harness marks them as timed out.

Co-authored-by: Cursor <cursoragent@cursor.com>

* vcpkg: drop qvac-fabric overlay-ports, bump version>= to 8189.0.2

tetherto/qvac-registry-vcpkg#140 publishes qvac-fabric@8189.0.2 in the
default registry, so the temporary per-package overlay we used while the
new fabric build was still being shaken out is no longer necessary.

For llm-llamacpp, embed-llamacpp, and translation-nmtcpp:

- Delete `packages/<pkg>/vcpkg/ports/qvac-fabric/` (portfile.cmake,
  vcpkg.json, android-vulkan-version.cmake) — the overlay copy.
- Drop the `overlay-ports` entry from each package's
  vcpkg-configuration.json. The `default-registry` baseline is left
  untouched intentionally; the `version>=` constraints below are what
  forces vcpkg to resolve to the new fabric revision against the
  unchanged baseline.
- Bump the `qvac-fabric` `version>=` pin from `8189.0.1` -> `8189.0.2`
  in each package's vcpkg.json.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(llm): drop dead sawMali plumbing from BackendSelection

`sawMali` was threaded through `emplaceIfValidDevice` / `tryEmplaceDevice` /
`chooseBackend` but never read by any caller — leftover from the earlier
"Force BERT/Qwen3.5 to CPU on Mali" iterations. The embed-side cleanup
already landed in 2ac5de0 ("Remove the Mali detection plumbing from the
embed addon now that BERT runs on Mali GPU."); this finishes the symmetric
removal on the LLM side. `sawAppleM1` plumbing is preserved unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): explain why MtmdLlmContext skips inside_reasoning flip

TextLlmContext flips reasoningState_.inside_reasoning = true alongside the
forced "<think>\n" opener; MtmdLlmContext doesn't because it doesn't carry a
reasoningState_ today. Add an inline note so the asymmetry isn't read as a
bug, and point at the symmetric site to update if reasoning-aware EOS
replacement is later added on the multimodal path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(llm): narrow tool-call args quoter to leading bare key only

The previous post-generation regex (`([{,])(\s*)([A-Za-z_]…)(\s*):` -> quote
the ident) was too broad: it also matched `, ident:` substrings sitting
inside JSON string values, so a tool call with a free-form string argument
like `{"query":"phase one, step: validate"}` came out corrupted as
`{"query":"phase one, "step": validate"}`, which then failed JSON.parse on
the consumer side.

In practice the rewrite is only needed for one upstream quirk: the Gemma 4
parser's `gemma4_args_to_json` (common/chat-parser.cpp) uses an
`at_key_start()` helper that peeks backwards in the output buffer for a
`{`/`,` -- so the very first top-level key is left bare while every nested
or post-comma key is already quoted. All other tool dialects reach us via
`json::dump()` upstream and already start with a quoted key.

Replace the broad regex with one anchored at `^\{(\s*)<ident>\s*:`, which
fixes exactly that single leading-bare-key case and cannot match anywhere
inside a JSON string value.
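
A minimal sketch of the anchored rewrite described above, assuming a plain ASCII identifier pattern (the addon's exact identifier class is not shown in this message). The helper name `quoteLeadingBareKey` is hypothetical:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: quotes only a bare first key such as {city:...}.
// The identifier pattern is an assumption, not the addon's exact regex.
std::string quoteLeadingBareKey(const std::string& args) {
  static const std::regex kLeadingBareKey(
      R"(\{(\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*):)");
  std::smatch m;
  // match_continuous forces the match to start at the first character, so
  // ", ident:" text inside a string value can never be rewritten.
  if (!std::regex_search(args, m, kLeadingBareKey,
                         std::regex_constants::match_continuous)) {
    return args;  // already quoted, or not an object: leave untouched
  }
  return "{" + m[1].str() + "\"" + m[2].str() + "\"" + m[3].str() + ":" +
         m.suffix().str();
}
```

Because the match is pinned to the start of the string, an argument such as `{"query":"phase one, step: validate"}` passes through unchanged.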

Verified end-to-end on linux-x64 against gemma-4-E2B-it-Q8_0 (CPU):

- Adversarial prompt forcing `phase one, step: validate` as a tool arg
  string: baseline produced invalid JSON
  `{"query":"phase one, "step": validate"}` (parse fail at pos 55);
  this fix yields `{"query":"phase one, step: validate"}` and the test
  passes 7/7 assertions.
- Existing simple-args happy path (`get_weather` with city/unit) still
  passes 5/5.

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop synthetic <tool_call>{json}</tool_call> post-processing

Each model now streams only its own native tool-call dialect:
- Qwen3 / Hermes: <tool_call>{json}</tool_call> (already canonical)
- Qwen3.5: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
- Gemma 4: <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|>
- Mistral, DeepSeek-R1, Functionary, GPT-OSS, etc. emit their own markers.

The previous PR added a post-generation common_chat_parse pass that
appended a uniform <tool_call>{json}</tool_call> envelope for every
detected call. That duplicated tokens for Hermes-shape models (the
envelope is already in the native stream) and inflated Gemma 4 output
by ~14% with two synthetic copies per call. The leading-bare-key
handling for Gemma 4's tc.arguments was also a constant source of sharp
edges (broad regex corrupted string values containing ", ident:";
narrow anchored regex still required follow-up). Per-dialect parsing
belongs at the SDK consumer layer, not in the addon.

Removed:
- Post-generation block in LlamaModel::processPromptImpl (synthesizer).
- needsOutputCapture widening to include !resolved.tools.empty().
- LlmContext::getLastChatFormat() virtual.
- lastChatFormat_ members + overrides in TextLlmContext, MtmdLlmContext.
- common_chat_format* outFormat parameter from getPrompt().
- <regex> include in LlamaModel.cpp (no remaining users).

Kept:
- outThinkingForcedOpen mechanism (independent reasoning-channel feature).
- toolsCompact_ controller and KV-cache trim logic.
- All other PR work.

Validated on linux-x64/CPU after incremental rebuild:
- Gemma 4 (gemma-4-E2B-it-Q8_0): 6/6 asserts pass with native-dialect
  parser, no synthetic envelope leaks, output 941 chars (down from
  ~1100 with synthesizer).
- Qwen3.5 (Qwen3.5-0.8B-Q8_0): 5/5 asserts pass with the existing
  parseXmlToolCall path, output 394 chars.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): parse Gemma 4 native tool-call dialect in gemma4.test.js

Without the synthetic <tool_call>{json}</tool_call> envelope reverted
in the previous commit, Gemma 4 emits its own dialect:

  <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|>

Strings are wrapped in <|"|>...<|"|> instead of "...", keys are bare,
and the closing tag is <tool_call|> (trailing pipe, no slash).

extractToolCalls now matches that shape directly and returns
{ name, argsRaw }. argsContainStringValue() helper checks the args
body for a Gemma-4-quoted string literal. Substring-based assertion
is sufficient to verify the model called the right tool with the
right argument values; full dialect-to-JSON conversion lives upstream
in fabric's gemma4_args_to_json and is not the addon test's job.
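
The real helpers live in the JavaScript test file. The sketch below restates the same matching idea in C++ for illustration only; the struct and function names are assumptions, and only the marker strings come from the dialect shown above:

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct ToolCall {
  std::string name;     // text between "call:" and the first '{'
  std::string argsRaw;  // raw "{...}" body, left in Gemma 4 dialect
};

// Scans for <|tool_call>call:NAME{...}<tool_call|> blocks.
std::vector<ToolCall> extractGemma4ToolCalls(const std::string& text) {
  static const std::string kOpen = "<|tool_call>call:";
  static const std::string kClose = "<tool_call|>";
  std::vector<ToolCall> calls;
  std::size_t pos = 0;
  while ((pos = text.find(kOpen, pos)) != std::string::npos) {
    const std::size_t nameStart = pos + kOpen.size();
    const std::size_t end = text.find(kClose, nameStart);
    if (end == std::string::npos) break;  // unterminated call: stop scanning
    const std::size_t brace = text.find('{', nameStart);
    if (brace != std::string::npos && brace < end) {
      calls.push_back({text.substr(nameStart, brace - nameStart),
                       text.substr(brace, end - brace)});
    }
    pos = end + kClose.size();
  }
  return calls;
}

// Substring check for a Gemma-4-quoted string literal: <|"|>value<|"|>.
bool argsContainStringValue(const std::string& argsRaw,
                            const std::string& value) {
  return argsRaw.find("<|\"|>" + value + "<|\"|>") != std::string::npos;
}
```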

qwen3-5.test.js was unchanged: Qwen3.5 wraps its <function=name>
<parameter=k>v</parameter></function> XML in <tool_call>...</tool_call>
natively, so the existing parseXmlToolCall path keeps working.

Validated on linux-x64/CPU against gemma-4-E2B-it-Q8_0:
4/4 tests, 13/13 asserts (3 synthetic-input parser sanity checks +
1 live LLM run).

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop Apple M1 detection + projector-CPU routing

The PR added an Apple-M1-specific code path that detected the chip via the
GPU description string and routed `params.mmproj_use_gpu = false` so the
vision projector ran on CPU instead of Metal, working around a SIGSEGV in
the projector's image-encoding kernel observed on M1 Metal at the time.

Re-tested on M1 with the current fabric tip: no SIGSEGV, projector runs
fine on Metal end-to-end. The carve-out is no longer needed.

Removed:
- BackendSelection: `isAppleM1Device()` helper, `bool& sawAppleM1` plumbing
  through `emplaceIfValidDevice` / `tryEmplaceDevice` / `chooseBackend`,
  and `bool* outSawAppleM1` parameter on both `chooseBackend` overloads.
- LlamaModel: the `bool sawAppleM1 = false` local, the call-site argument,
  and the `params.mmproj_use_gpu = !sawAppleM1` ternary; mmproj now uses
  GPU on every desktop platform (Android still hardcoded to false).
- test_backend_selection.cpp: `APPLE_M{1,2,3,4}_DESC` constants,
  `chooseBackendWithM1Flag()` helper, and the four `AppleM*_*` test cases.
- gemma4.test.js / qwen3-5.test.js: the comment blocks describing the M1
  carve-out; `useCpuForVision` semantics are unchanged (`useCpu || isMobile`
  on gemma4 and `useCpu` on qwen3-5).

Verified on linux-x64/CPU after rebuild: 148/148 C++ unit tests pass
(BackendSelectionTest, TuneConfigMapTest, ChatTemplateUtilsTest).

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop dead Gemma 4 markers from updateQwen3ReasoningBuffer

The PR added two extra substring scans for Gemma 4's reasoning channel
markers (<|channel>thought open, <channel|> close) to
updateQwen3ReasoningBuffer. The intent was to extend the EOS-rescue
path (handleQwen3ReasoningEOS rewrites EOS-while-thinking into a
closing tag) to Gemma 4. That never actually fires though: both the
buffer-update call and the EOS-rescue call in TextLlmContext are gated
by `if (isQwen3Model_)`, and isQwen3Model_ resolves to
`general.architecture == "qwen3"` only. Gemma 4 reports architecture
"gemma4", so the gate never opens, the markers never get scanned, and
the rescue path never runs for Gemma 4.

In live runs Gemma 4 always emits <channel|> cleanly before <eos>, so
the rescue isn't needed on the happy path; if Gemma 4 ever truncates
mid-thought under context pressure we will need a real dialect-aware
rescue (per-arch close-tag token + extended gate) and a follow-up will
add that. For this PR we just want the dead code gone so it doesn't
mislead future readers about what's actually wired up.

Net: -9 lines, file is now identical to upstream main.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): switch gemma4 fixtures from unsloth to bartowski

The unsloth GGUF pack
(huggingface.co/unsloth/gemma-4-E2B-it-GGUF) tags <turn|> as the EOG token
in tokenizer.ggml.eos_token_id and leaves <eos> classified as a regular
text token. Gemma 4's training-baked behaviour after assistant content is
to emit a few <eos> tokens before <turn|>, so with that pack the addon's
generation loop -- which terminates on llama_vocab_is_eog -- doesn't stop
until <turn|> arrives. We were observing ~9 spurious <eos> tokens
trailing every Gemma 4 response, eating into n_predict and KV cache for
no gain.

bartowski's GGUF
(huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF) ships the exact
same vocabulary but tags <eos> as EOG (matching the base
google/gemma-4-E2B-it tokenizer config). With that pack the addon
terminates on the first <eos> -- empirically 0 trailing tokens, ~30 %
shorter completions on the same prompt, same dialect output that the
native-dialect parser added in 87e6c35 handles unchanged.

Verified on linux-x64/CPU (qvac-dev-linux-x64) with the same
get_weather tool prompt:

  unsloth Q8_0     : 941 chars, 9 trailing <eos>, EOG = {<turn|>, </s>}
  bartowski Q4_K_M : 676 chars, 0 trailing <eos>, EOG = {<eos>, </s>}

Note: the unsloth metadata bug deserves an upstream issue against the
unsloth pack maintainers; this PR's scope is just to stop our tests
paying the wasted-tokens tax.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): unblock gemma4 image test on mobile + fix ctx overflow

Three changes to packages/llm-llamacpp/test/integration/gemma4.test.js
(image-describe subtest):

1. Drop the mobile CPU-vision carve-out.
   useCpuForVision used to force `device: 'cpu'` on Android/iOS to dodge
   Adreno OpenCL SIGABRT and Mali Vulkan instability that bit us with
   the unsloth mmproj. With bartowski's mmproj (now the fixture in
   787c3322) we want CI to actually exercise the device-farm GPU code
   path for vision -- if that path regresses on a real Adreno or Mali
   chip we want to find out from CI, not by accident in production.
   Desktop x64-darwin / linux-arm64 keep CPU fallback because those
   hosts don't have a working GPU stack here.

2. Bump ctx_size 2048 -> 8192. A single elephant.jpg encodes to ~260
   mtmd image tokens. With ctx_size=2048 plus Gemma 4's verbose CoT
   preamble the generation loop overflowed nPast > n_ctx during
   sampling (MtmdLlmContext.cpp:452), throwing
   'processPromptImpl: context overflow'. 8192 leaves comfortable
   headroom on every backend.

3. Set reasoning-budget=0 for this test. We literally ask the model
   "Answer in one word" -- the <|channel>thought ...<channel|> CoT
   preamble that Gemma 4 wants to emit by default is wasted tokens
   here, and was the actual cause of the overflow above (CoT was
   running 8k+ tokens before the model reached the one-word answer
   and emitted <eos>). Disabling thinking gives us a deterministic
   ~10-token "Elephant" + <eos> response, which is what the
   substring-based assertion is testing for anyway.

Verified on linux-x64 (qvac-dev-linux-x64, 2x RTX 5090, Vulkan
backend) end-to-end:
  output: "Elephant"
  asserts: 3/3
  total time: ~2 s

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(llm): drop dead selectToolsCompactMarker(string) overload

selectToolsCompactMarker(const std::string& architecture) had no production
callers anywhere -- only its two unit tests
(SelectToolsCompactMarkerForQwen3,
SelectToolsCompactMarkerForUnsupportedArchitecture) referenced it. Live
production code goes through selectToolsCompactMarkerForModelMetadata
(LlamaModel::resolveToolsCompactConfig calls that one), which takes
std::optional<std::string> and is the only path that ever reaches the
"qwen3" -> "<tool_call>" mapping at runtime.

Removed the .cpp definition, the .hpp declaration, and the two unit
tests. selectToolsCompactMarkerForModelMetadata is unchanged and still
covered by SelectToolsCompactMarkerForModelMetadataUsesArchitecture.

ChatTemplateUtilsTest now runs 19/19 tests on linux-x64 (was 21/21).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): drop redundant useCpuForVision alias; vision runs on GPU on mobile

Since we removed the per-mobile CPU carve-out for Gemma 4 vision (commit
2843297), and Qwen3.5 vision never had one, useCpuForVision was just a
no-op alias of useCpu used at exactly one call site in each test. Inline it.

Net effect on the device routing matrix is unchanged but explicit:

  platform/arch              useCpu  device used
  --------------------------------------------------------
  darwin-x64                 true    cpu  (no working GPU here)
  linux-arm64                true    cpu  (no working GPU here)
  darwin-arm64 (M-series)    false   gpu  (Metal)
  linux-x64                  false   gpu  (Vulkan/OpenCL)
  ios                        false   gpu  (Metal -- device farm)
  android                    false   gpu  (Adreno OpenCL / Mali Vulkan -- device farm)

So on iOS / Android the gemma4 and qwen3-5 image-describe subtests run
through the actual GPU vision path -- the same path users hit -- and
will surface any regression from CI rather than from production.
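
The routing matrix above reduces to a two-case predicate. The sketch below is an illustration in C++ (the tests themselves are JavaScript); the function names and the string values for platform and arch are assumptions:

```cpp
#include <string>

// Hypothetical mirror of the tests' useCpu flag: CPU only on the two hosts
// that the table lists as having no working GPU stack.
bool useCpu(const std::string& platform, const std::string& arch) {
  return (platform == "darwin" && arch == "x64") ||
         (platform == "linux" && arch == "arm64");
}

// Every other platform/arch pair in the table runs vision on the GPU.
std::string deviceFor(const std::string& platform, const std::string& arch) {
  return useCpu(platform, arch) ? "cpu" : "gpu";
}
```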

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): correct thinkingForcedOpen_ comment re: gemma4

Gemma4 does not hit this code path: upstream
common_chat_params_init_gemma4 explicitly leaves thinking_forced_open
unset because gemma4's reasoning channel is model-emitted. Drop the
misleading reference and call out the actual templates that trigger
this path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): refresh PR-1874 entries to reflect actual shipped scope

The original CHANGELOG entries for llm-llamacpp 0.20.0, embed-llamacpp
0.16.0, and translation-nmtcpp 3.0.0 were drafted before the synthesizer
revert, the M1 / sawMali / dead-code cleanups, the bartowski fixture
swap, the native-dialect tool-call parsing, the reasoning-budget knob,
the thinkingForcedOpen synthetic-opener, the new integration tests, and
the move from 8189.0.0 to 8189.0.2. They now match what the PR
actually ships.

Compressed every entry to a flat bullet list grouped by Keep-a-Changelog
section (Changed / Added / Removed / Fixed / Deprecated / Internals)
and bumped the date to 2026-05-10.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(changelog): trim items that round-trip to net-zero in the PR

Removed lines that described code that's neither in upstream/main nor in
the PR head (so it has no observable impact on consumers):

- llm-llamacpp 0.20.0:
  * "tool-call streaming: each model now streams its native dialect /
     no re-shaping" -- main already streamed native dialects; the
     PR-internal synthesizer never shipped, so this is a non-change.
  * "Dropped sawMali plumbing / Apple-M1 detection / dead Gemma 4
     markers in Qwen3ReasoningUtils" -- all three were added and
     removed inside this PR's commit history; net diff is zero.

- embed-llamacpp 0.16.0:
  * "Dropped Mali-detection plumbing" -- same: added and removed
     within this PR's history, net diff is zero.

Kept genuine net removals against upstream/main:
- Qwen3 model-name-based fallback.
- Dead `selectToolsCompactMarker(std::string)` overload (was
  pre-existing in main, only ever called from unit tests).

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(notice): regenerate NOTICE for embed-llamacpp, llm-llamacpp, translation-nmtcpp

Re-ran the notice-generate skill (.cursor/skills/notice-generate) for the
three addons whose dependency surfaces changed in this PR:

- qvac-fabric bumped from 7248.x to 8189.0.2 -- different transitive C++
  license set.
- holepunch / hyperswarm libs moved to peerDependencies on main, so the
  JS attribution lists shrink accordingly.
- @qvac/infer-base bumped to 0.4.1.

Per-package C++ resolution after the run:

  embed-llamacpp        : opencl/qvac-fabric/qvac-lib-inference-addon-cpp/
                          qvac-lint-cpp + libc++          (5 deps)
  llm-llamacpp          : the above + picojson + nlohmann-json (7 deps)
  translation-nmtcpp    : bergamot-translator/sentencepiece/ssplit/
                          qvac-fabric/qvac-lib-inference-addon-cpp/
                          qvac-lint-cpp + libc++          (7 deps)

Net: +206 / -585 lines across the three NOTICE files (mostly transitive
JS attribution shrink from the holepunch peerDeps refactor).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): make gemma4 reasoning-budget test tolerate model-emitted reasoning

Gemma 4's reasoning channel is model-emitted (no template force-open),
so the model decides per-prompt whether to engage reasoning. For
trivial prompts like "What is the capital of France?" the model can
short-circuit and skip the <|channel>thought…<channel|> markers, which
made the test flaky on CI.

Gate the marker / length assertions on the baseline actually emitting
the opening marker; if it didn't, log a comment and skip the dependent
checks instead of failing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* types(llm): declare reasoning_budget in LlamaConfig

The C++ config parser already accepts `reasoning_budget` (and the
kebab-case `reasoning-budget` alias), but neither was a typed property
on `LlamaConfig` — they only typechecked via the catch-all index
signature. Add a typed entry with JSDoc so TypeScript consumers get
autocomplete and the accepted values (-1 default, 0 disabled).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(llm): allow per-request reasoning_budget override in run()

`reasoning_budget` was load-time only. Add it to `GenerationParams` so
`model.run(messages, { generationParams: { reasoning_budget: 0 } })`
can disable reasoning for a single request without re-loading the
model — same shape as `temp` / `top_p` / `seed` overrides.

Wiring:
- `LlmContext::GenerationParams` gains an optional `reasoning_budget`
  field and `hasOverrides()` covers it.
- `applyGenerationParamsToContext` snapshots / overrides /
  restores `params.reasoning_budget` alongside `n_predict`.
- `AddonJs::runJob` parses `generationParams.reasoning_budget` from
  JS and rejects values other than `-1` or `0`.
- `index.d.ts` exposes `reasoning_budget?: -1 | 0` on
  `GenerationParams` with a JSDoc note.

`tokenizeChat` already reads `params_.reasoning_budget`, so no change
is needed in `TextLlmContext` / `MtmdLlmContext` — the temporary
override naturally propagates to `inputs.enable_thinking`.
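
A minimal sketch of the snapshot / override / restore pattern, assuming simplified stand-ins for the real types. `Params`, `GenerationParams`, and `ScopedParamsOverride` here are hypothetical and much smaller than the addon's actual structs:

```cpp
#include <optional>

// Simplified stand-in for the load-time parameters.
struct Params {
  int n_predict = -1;
  int reasoning_budget = -1;  // -1 default, 0 disabled
};

// Simplified stand-in for the per-request overrides.
struct GenerationParams {
  std::optional<int> n_predict;
  std::optional<int> reasoning_budget;
  bool hasOverrides() const {
    return n_predict.has_value() || reasoning_budget.has_value();
  }
};

// RAII guard: snapshots params, applies overrides, restores on scope exit.
class ScopedParamsOverride {
 public:
  ScopedParamsOverride(Params& params, const GenerationParams& ov)
      : params_(params), saved_(params) {
    if (ov.n_predict) params_.n_predict = *ov.n_predict;
    if (ov.reasoning_budget) params_.reasoning_budget = *ov.reasoning_budget;
  }
  ~ScopedParamsOverride() { params_ = saved_; }
  ScopedParamsOverride(const ScopedParamsOverride&) = delete;
  ScopedParamsOverride& operator=(const ScopedParamsOverride&) = delete;

 private:
  Params& params_;
  Params saved_;
};
```

Restoring in the destructor keeps the override request-scoped even if the request exits early through an exception.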

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(llm): cover per-request reasoning_budget override on Qwen3.5

Validates the new per-request `generationParams.reasoning_budget`
override end-to-end in two runs against a single loaded model:

1. `reasoning_budget: 0` override suppresses the `<think>…</think>`
   reasoning markers for that one request.
2. The next `run()` with no override restores the load-time default
   (reasoning enabled), proving the override is request-scoped and
   not sticky.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(llm): case-insensitive antiprompt substring matching

`checkAntiprompt` now lowercases both the recent output window and each
antiprompt before the `find()` so a single `Pizza` entry catches the
model's `pizza`, `Pizza`, `PIZZA`, etc. Callers no longer need to list
every casing variant. Applied identically in `TextLlmContext` and
`MtmdLlmContext`. The token-level early-exit path is unchanged (BPE
tokens are case-specific; the substring path is the authoritative
check).
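
A minimal sketch of the substring check, assuming ASCII-only lowercasing. The function names are hypothetical and the real `checkAntiprompt` signature is not shown in this message:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lowercases ASCII letters; non-ASCII bytes pass through unchanged.
std::string toLowerAscii(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Returns true if any antiprompt occurs in the recent output window,
// ignoring ASCII case on both sides.
bool containsAntiprompt(const std::string& recentOutput,
                        const std::vector<std::string>& antiprompts) {
  const std::string haystack = toLowerAscii(recentOutput);
  for (const std::string& antiprompt : antiprompts) {
    if (antiprompt.empty()) continue;  // an empty entry would match anything
    if (haystack.find(toLowerAscii(antiprompt)) != std::string::npos) {
      return true;
    }
  }
  return false;
}
```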

Also drop the stale comment on the `Reverse prompt stops generation`
scenario in `config-parameters.test.js`: it claimed the addon split
on `,` without trimming, but `LlamaModel.cpp::split()` already
trims and drops empty segments. Replaced with a brief note that
documents the new (current) behaviour and simplified the antiprompt
list to `'network, Pizza, bitcoin, blockchain'` so the test exercises
both the trim and the case-insensitive match.
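
For reference, a split that trims each segment and drops empty ones could look like the sketch below. This is not the body of `LlamaModel.cpp::split()`; the name `splitAndTrim` and the whitespace set are assumptions:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits on `sep`, trims surrounding whitespace, and skips empty segments.
std::vector<std::string> splitAndTrim(const std::string& input,
                                      char sep = ',') {
  static const std::string kWhitespace = " \t\r\n";
  std::vector<std::string> out;
  std::size_t start = 0;
  while (start <= input.size()) {
    std::size_t end = input.find(sep, start);
    if (end == std::string::npos) end = input.size();
    const std::size_t first = input.find_first_not_of(kWhitespace, start);
    if (first != std::string::npos && first < end) {
      const std::size_t last = input.find_last_not_of(kWhitespace, end - 1);
      out.push_back(input.substr(first, last - first + 1));
    }
    start = end + 1;  // move past the separator
  }
  return out;
}
```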

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(llm): stress case-insensitive antiprompt with PiZzA mixed-case entry

Swap the `Pizza` reverse_prompt entry for `PiZzA`. With case-sensitive
matching `PiZzA` would never match the model's `pizza` / `Pizza`
output; only case-insensitive comparison fires the stop. Verified
locally — the test still completes with output length 5, so the
antiprompt trips on the first emitted "Pizza".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(llm): validate reasoning_budget before truncating to int

Address @jpgaribotti's review: previously the value was cast to int
*before* the `0` / `-1` check, so fractional inputs like `0.5` or
`-1.1` would silently truncate to a "valid" 0 / -1 and pass through.

Validate against the exact double values (both `0` and `-1` are
exactly representable in IEEE-754, so `==` comparison is safe) before
casting to int when storing in `ov.reasoning_budget`.
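
The ordering can be sketched as follows; the function name is hypothetical, and returning `std::optional` is an illustrative choice rather than the addon's actual error path:

```cpp
#include <optional>

// Checks the raw double first, so 0.5 or -1.1 are rejected instead of
// being truncated to a "valid" 0 or -1 by the cast.
std::optional<int> validateReasoningBudget(double value) {
  if (value != 0.0 && value != -1.0) {
    return std::nullopt;  // also rejects NaN, which compares unequal to all
  }
  return static_cast<int>(value);
}
```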

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(llm): use std::from_chars for reasoning_budget load-time parse

Address @jpgaribotti's review: `std::stoi` silently accepts trailing
garbage (`"0abc"` → `0`) and throws an uncaught `std::out_of_range`
for inputs that overflow `int`. Switch to `std::from_chars`, which lets
us reject non-numeric input (`errc::invalid_argument`), overflow
(`errc::result_out_of_range`), and trailing garbage (`ptr != end`)
without exceptions, then validate against the allowed `-1` / `0` values
in the same check.
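
A minimal sketch of that parse, assuming C++17. The function name and the `std::optional` return are hypothetical; the addon's actual error reporting is not shown here:

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Parses a load-time reasoning_budget string; accepts only "-1" or "0".
std::optional<int> parseReasoningBudgetText(std::string_view text) {
  int value = 0;
  const char* begin = text.data();
  const char* end = begin + text.size();
  const auto [ptr, ec] = std::from_chars(begin, end, value);
  // ec covers non-numeric input and overflow; ptr != end covers trailing
  // garbage such as "0abc".
  if (ec != std::errc{} || ptr != end) return std::nullopt;
  if (value != -1 && value != 0) return std::nullopt;
  return value;
}
```

Note that `std::from_chars` does not skip leading whitespace or accept a leading `+`, so inputs like `" 0"` are rejected as well.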

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Proletter pushed a commit to tetherto/qvac that referenced this pull request May 24, 2026
* Restore Qwen3.5 / Gemma4 / PaddleOCR-VL tests + Mali coopmat fix

Stack of three logical changes squashed into one commit so the test
ports stay self-consistent with the build/runtime they depend on:

1. qvac-fabric overlay ports (LLM + embed + nmtcpp):
   - Pin to fabric 78db8bf4 (PR tetherto/qvac-fabric-llm.cpp#121 HEAD,
     includes c79a8851 "ggml-vulkan: Fix NaN outputs on Mali").
   - Drop -DGGML_VULKAN_DISABLE_COOPMAT*=ON for Android so coopmat
     shaders are compiled in. With coopmat off, runtime
     device->coopmat_support is false and the Mali fix's ARM-gated
     branches were skipped, leaving Qwen3-Q8_0 finetuning NaN on
     Pixel 9 Pro Mali.
   - Wire up overlay-ports in each package's vcpkg-configuration.json.
   - Add find_package(OpenSSL) before find_package(llama) in the LLM
     CMakeLists so llama-targets.cmake's transitive OpenSSL::SSL
     reference (via cpp-httplib) resolves on local builds.

2. utils.js downloadFile redirect race:
   - Track a handedOff flag set when the redirect branch hands off
     dest to a recursive call. All cleanup paths now skip fs.unlink
     once ownership is transferred, so a late error from the outer
     writestream can't delete the freshly-downloaded file (Pixel
     ENOENT after "successful" mmproj download).

3. Three new integration tests + their mobile harness wiring:
   - qwen3-5.test.js — basic / multi-turn / tool-calling
   - gemma4.test.js — text / multi-turn / image (forced to CPU on
     darwin + mobile because gemma4v projector SIGSEGVs on Metal and
     Adreno OpenCL) / tool-calling
   - ocr-paddle.test.js — OCR; mobile maxTokens capped to 768
   - Ported to the new addon API (files: { model: [absPath],
     projectionModel?: absPath }, config: …).
   - Added matching unit test test_text_llm_context_qwen3.cpp.
   - integration.auto.cjs registers runQwen35Test, runGemma4Test,
     runOcrPaddleTest dispatchers.
   - test-groups.json: iOS heavy4 cluster
     (Gemma4+OcrLighton+OcrPaddle), iOS lightB adds Qwen35,
     Android groupB has Qwen35 first then Gemma4 / OcrPaddle.
   - Workflow: Android GroupB Device Farm jobTimeout 60→90 min.

* API port + Gemma4 tool-call fix.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Wire addon/src/patches ahead of the vcpkg include path to pick up the LlamacppUtils.hpp ptr-API override.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* API port + Gemma4 tool-call fix.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Split iOS heavy4 into three single-test specs (heavy4 = OcrLighton, new heavy7 = Gemma4, new heavy8 = OcrPaddle) and schedule them as separate Device Farm runs to avoid memory pressure.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Drop LlamacppUtils.hpp patch override; bump addon-cpp to 1.1.7

The LlamacppUtils.hpp common_init_result_ptr API now ships in
qvac-lib-inference-addon-cpp 1.1.7 (PR #1887), so the local
addon/src/patches/qvac-lib-inference-addon-cpp/LlamacppUtils.hpp
shim is no longer needed in the embed and llm addons.

- Delete the patch headers in embed and llm.
- Drop the BEFORE PRIVATE addon/src/patches include path from the
  embed/llm production and unit-test CMakeLists.
- Bump qvac-lib-inference-addon-cpp version>= to 1.1.7 in the embed,
  llm, and nmtcpp vcpkg.json files so they pick up the upstream
  ptr-API header from the registry.

The OpenSSL find_package() addition stays — it's an unrelated
local-build fix.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Cap ocr-lighton predict to 1800 (desktop) / 768 (mobile) so the LightOnOCR response can't overrun ctx_size=4096.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Rewrite sliding-context test to use the post-GGML_PAD effective n_ctx (512) and retune n_predict / n_discarded so all 8 cases match the current ContextSlider semantics.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Allow embed batching test to override ctx_size and pin gte-large to batch_size=512 / ctx_size=384 to probe the Mali Vulkan first-submit ErrorDeviceLost.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix the reverse-prompt scenario: remove the comma-space separators, list both 'pizza' and 'Pizza', and lowercase the assertion comparisons so they match either casing.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Sanitize media Uint8Array prompts before logging to avoid V8 Zone OOM.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Use Qwen3 family chat-template to fix Qwen3.5-0.8B gibberish output on macOS Metal.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Update portfiles to point to the latest fabric.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Revert "Allow embed batching test to override ctx_size and pin gte-large to batch_size=512 / ctx_size=384 to probe the Mali Vulkan first-submit ErrorDeviceLost."

This reverts commit 0e9eca7.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Raise AfriqueGemma cancel maxWait to 60s, and apply the use_jinja gate-drop so Qwen3-family models always pick the fixed jinja template.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Drop the retired AfriqueGemma integration tests.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Update portfiles to point to the latest head.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Update portfiles to point to the latest head.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Drop qwen35 from the Qwen3-template detection and the supported-finetune-architecture list since neither path is actually validated for Qwen3.5.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Update portfiles to point to the latest head.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Enable coopmat.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Drop the Qwen3 use_jinja override pairing now that qwen35 is no longer treated as Qwen3-family.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Use only general.architecture for Qwen3 detection so Qwen3.5 stops getting the Qwen3 chat-template via the model-name substring fallback.

Drop modelNameLooksLikeQwen3 / getModelName and the modelName parameter from supportsToolsCompactForModelMetadata and selectToolsCompactMarkerForModelMetadata. The substring match on general.name treated "Qwen3.5-..." as Qwen3 and overrode the model's embedded tokenizer.chat_template, contradicting the recent decision to keep qwen35 out of the Qwen3 family. Update the LlamaModel call site and unit tests; add explicit qwen35/nullopt negative cases.
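
A minimal sketch of architecture-only detection, assuming the metadata value arrives as `std::optional<std::string>`. The function names and the optional return type are illustrative, not the addon's exact signatures:

```cpp
#include <optional>
#include <string>

// Exact match on general.architecture only; no model-name substring
// fallback, so "qwen35" and missing metadata both resolve to false.
bool isQwen3Architecture(const std::optional<std::string>& architecture) {
  return architecture.has_value() && *architecture == "qwen3";
}

// Hypothetical marker lookup: only the qwen3 architecture maps to a marker.
std::optional<std::string> selectMarkerForArchitecture(
    const std::optional<std::string>& architecture) {
  if (isQwen3Architecture(architecture)) return std::string("<tool_call>");
  return std::nullopt;
}
```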

Co-authored-by: Cursor <cursoragent@cursor.com>

* Accept HuggingFace function-call XML in extractToolCalls so the Qwen3.5 tool-calling integration test parses the model's native <tool_call><function=...><parameter=...>...</parameter></function></tool_call> envelope produced by its embedded chat template, in addition to the Qwen3-style JSON envelope.
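
The test helper itself is JavaScript. The sketch below shows one way to read the same envelope in C++, for illustration only; the struct and function names are assumptions, and parameter values are returned verbatim without trimming:

```cpp
#include <map>
#include <optional>
#include <regex>
#include <string>

struct XmlToolCall {
  std::string name;
  std::map<std::string, std::string> params;
};

// Parses the first <function=NAME>...</function> block and its
// <parameter=KEY>VALUE</parameter> children.
std::optional<XmlToolCall> parseXmlToolCall(const std::string& text) {
  static const std::regex kFunction(
      R"(<function=([^>]+)>([\s\S]*?)</function>)");
  static const std::regex kParameter(
      R"(<parameter=([^>]+)>([\s\S]*?)</parameter>)");
  std::smatch fn;
  if (!std::regex_search(text, fn, kFunction)) return std::nullopt;

  XmlToolCall call;
  call.name = fn[1].str();
  const std::string body = fn[2].str();  // must outlive the iterators below
  for (std::sregex_iterator it(body.begin(), body.end(), kParameter), end;
       it != end; ++it) {
    call.params[(*it)[1].str()] = (*it)[2].str();
  }
  return call;
}
```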

Co-authored-by: Cursor <cursoragent@cursor.com>

* Bump n_predict in the Qwen3.5 basic and multi-turn integration tests so the embedded chat-template's reasoning block has room to finish before the answer on slower CI backends.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Enable coopmat and point to the latest fabric.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Route Qwen3.5 inference and all finetuning on Mali to CPU, disable Vulkan coopmat at build time, halve mobile finetune workload to account for CPU training.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Point to the latest fabric version.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Force BERT to the CPU on Mali.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Run finetuning on Mali GPU.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Run Qwen 3.5 on Mali GPU.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Point to the latest fabric version and enable coopmat path.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* vcpkg: drop per-package qvac-fabric overlays

Removes the qvac-fabric overlay-ports infrastructure from the LLM,
Embed, and NMT manifests. The default-registry baseline is left
untouched, so vcpkg now resolves qvac-fabric directly from the
registry at the existing baseline (7248.2.3).

Bumping to fabric 8189.0.0 will be handled by a separate baseline
update; this commit only undoes the overlay-based development setup
that was no longer needed.

- vcpkg-configuration.json (3x): drop "overlay-ports" entry.
- vcpkg/ports/qvac-fabric/ (3x): remove overlay portfile.cmake,
  vcpkg.json, and android-vulkan-version.cmake.

Co-authored-by: Cursor <cursoragent@cursor.com>

* vcpkg: bump qvac-fabric version constraint to 8189.0.0

Updates the consumer manifests in the LLM, Embed, and NMT packages
to require qvac-fabric >= 8189.0.0. The default-registry baseline
is intentionally left untouched.

Co-authored-by: Cursor <cursoragent@cursor.com>

* llm/embed/nmtcpp: bump versions for qvac-fabric 8189.0.0

- qvac-lib-infer-llamacpp-llm: 0.19.2 -> 0.20.0 (minor)
- qvac-lib-infer-llamacpp-embed: 0.15.0 -> 0.16.0 (minor)
- qvac-lib-infer-nmtcpp: 2.1.1 -> 3.0.0 (major)

The nmtcpp major bump reflects a real behavioural regression: the
previous overlay built ggml unconditionally with every GPU backend
the platform supported (Vulkan/Metal/OpenCL); switching to the
upstream registry port with the existing "default-features": false
in nmtcpp's vcpkg.json now disables the new "gpu-backends" feature,
so out-of-the-box ggml exposes only the CPU backend. Consumers that
rely on GPU-accelerated nmt inference must add
'"features": ["gpu-backends"]' to the qvac-fabric block of their
nmtcpp build manifest.

CHANGELOG entries added in all three packages.

Co-authored-by: Cursor <cursoragent@cursor.com>

* nmtcpp: opt into qvac-fabric gpu-backends feature; downgrade bump to 2.2.0

The previous commit (3.0.0) flagged a breaking change: switching from
the always-on overlay to the registry port with default-features:false
disabled GPU backends in ggml. Adding "features": ["gpu-backends"]
to nmtcpp's qvac-fabric dep restores the previous Vulkan/Metal/OpenCL
behaviour, so the bump is now a non-breaking minor (2.2.0) and the
BREAKING note in the changelog is replaced with a plain Changed entry.

Co-authored-by: Cursor <cursoragent@cursor.com>

* nmtcpp: re-bump to 3.0.0 (major)

Restores the major version bump for nmtcpp. The new fabric port schema
(features split between gpu-backends/llama) and the move from a vendored
overlay to the upstream registry are large enough downstream changes
that consumers should treat this as a major release, even though
runtime behaviour is preserved by opting into "gpu-backends".

Co-authored-by: Cursor <cursoragent@cursor.com>

* vcpkg: pin qvac-fabric to >=8189.0.0#1

The 8189.0.0 (port-version 0) qvac-fabric port shipped a
configure-time bug for consumers without the "llama" feature
(i.e. nmtcpp): -DLLAMA_MTMD=ON was passed unconditionally, which
transitively enables LLAMA_BUILD_COMMON, which makes upstream call
license_generate(common) -- but BUILD_LLAMA=OFF skips defining the
'common' target, so the cmake configure aborts.

The fix landed in tetherto/qvac-registry-vcpkg#136 as
qvac-fabric port-version 1. Bumping the consumer constraint from
"version>=": "8189.0.0" to "version>=": "8189.0.0#1" forces vcpkg
to pick the fixed port-version (otherwise it picks the lowest
satisfying version, which is the broken #0).
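
A simplified model of that selection, as a sketch only: the real vcpkg resolver supports more version schemes and rules than this, and the function names below are made up for illustration.

```javascript
// Simplified model of minimum-version selection (an assumption for
// illustration, not vcpkg's actual implementation).
// A spec looks like "8189.0.0" or "8189.0.0#1"; a missing "#N" means port-version 0.
function parseSpec(spec) {
  const [version, port = '0'] = spec.split('#');
  return { parts: version.split('.').map(Number), port: Number(port) };
}

function compareSpecs(a, b) {
  const x = parseSpec(a);
  const y = parseSpec(b);
  const len = Math.max(x.parts.length, y.parts.length);
  for (let i = 0; i < len; i++) {
    const diff = (x.parts[i] || 0) - (y.parts[i] || 0);
    if (diff !== 0) return diff;
  }
  // Same upstream version: the port-version breaks the tie.
  return x.port - y.port;
}

// The smallest version satisfying the baseline and every "version>="
// constraint is the maximum of those lower bounds.
function resolveMinimum(baseline, constraints) {
  return [baseline, ...constraints].reduce((best, spec) =>
    compareSpecs(spec, best) > 0 ? spec : best);
}

const beforePin = resolveMinimum('7248.2.3', ['8189.0.0']);
const afterPin = resolveMinimum('7248.2.3', ['8189.0.0#1']);
```

Under this model `beforePin` is `8189.0.0` (port-version 0, the broken port) and `afterPin` is `8189.0.0#1`.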

Validated: nmtcpp arm64-android cross-build now configures and
builds end-to-end against the upstream registry, no overlay needed.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs: drop overlay-removal note from changelogs

Removes the changelog bullet describing the deletion of the per-package
qvac-fabric vcpkg overlay. The overlay teardown is mechanical packaging
plumbing rather than a user-facing change worth documenting.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test/llm: restore AfriqueGemma integration tests (desktop-only)

Reverts e257a19's deletion of the afriquegemma-edge-cases and
afriquegemma-translation integration tests, and adds a 'desktopOnly'
opt-out so they're skipped on mobile without breaking the per-test
group coverage invariant.

- packages/qvac-lib-infer-llamacpp-llm/test/integration/afriquegemma-edge-cases.test.js: restored.
- packages/qvac-lib-infer-llamacpp-llm/test/integration/afriquegemma-translation.test.js: restored.
- test/mobile/test-groups.json: new top-level "desktopOnly" array
  listing runAfriquegemmaEdgeCasesTest and runAfriquegemmaTranslationTest.
- scripts/generate-mobile-integration-tests.js: validateGroups now
  reads the desktopOnly list; entries are still emitted into
  integration.auto.cjs (so validate-mobile-tests stays happy) but
  excluded from the per-platform "missing" check, so the mobile
  runners never invoke them.
- test/mobile/integration.auto.cjs: regenerated by
  `npm run test:mobile:generate`.
- CHANGELOG note in qvac-lib-infer-llamacpp-llm under Tests.

Validated via `npm run test:mobile:generate` + `npm run test:mobile:validate`.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): drop AfriqueGemma test restoration changelog note

Co-authored-by: Cursor <cursoragent@cursor.com>

* test/llm: switch AfriqueGemma desktop-only skip to in-test pattern

Per review: don't change generate-mobile-integration-tests.js. Use the
same skip:isMobile pattern other tests already use (config-parameters,
tool-calling, image), and keep the AfriqueGemma functions in the iOS
lightA / Android groupA groups so the existing per-test coverage
invariant stays intact.

- packages/qvac-lib-infer-llamacpp-llm/scripts/generate-mobile-integration-tests.js:
  reverted to upstream/main (drops the desktopOnly opt-out plumbing).
- test/mobile/test-groups.json: drops 'desktopOnly', adds
  runAfriquegemmaEdgeCasesTest and runAfriquegemmaTranslationTest
  back to ios.lightA and android.groupA.
- test/integration/afriquegemma-edge-cases.test.js,
  test/integration/afriquegemma-translation.test.js: add
  isMobile = platform === 'ios' || platform === 'android', and
  skip:isMobile to every test() options object (13 total).
- test/mobile/integration.auto.cjs: regenerated.
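
A rough sketch of the in-test skip pattern. The `test()` stand-in and the test names are hypothetical, included only so the example is self-contained; the real suites use their own runner.

```javascript
// Sketch of the skip:isMobile pattern with a stand-in runner.
function runSuite(platform) {
  const isMobile = platform === 'ios' || platform === 'android';
  const results = { ran: [], skipped: [] };

  // Minimal stand-in for a test() function that honours a `skip` option.
  function test(name, options, body) {
    if (options.skip) {
      results.skipped.push(name);
      return;
    }
    body();
    results.ran.push(name);
  }

  // Every test carries the same option, so the functions stay registered in
  // the mobile groups but do no work there.
  test('translates a short sentence', { skip: isMobile }, () => {});
  test('handles empty input', { skip: isMobile }, () => {});
  return results;
}

const desktopRun = runSuite('darwin');
const androidRun = runSuite('android');
```

On a desktop platform both tests run; on `ios` or `android` both are reported as skipped, which is why the group coverage check still passes.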

Validators both green:
  npm run test:mobile:generate -> "all tests assigned for every platform"
  npm run test:mobile:validate -> ok

Co-authored-by: Cursor <cursoragent@cursor.com>

* test/llm: skip ocr-lighton on mobile

Adds skip:isMobile to the single test in ocr-lighton.test.js,
matching the AfriqueGemma / config-parameters / tool-calling
pattern. isMobile is already defined in this file. The test stays
in ios.heavy4 / android.groupB so per-platform group coverage is
unaffected; the brittle test itself just skips on mobile.

Co-authored-by: Cursor <cursoragent@cursor.com>

* ci: revert workflow timeout change for llm mobile integration

Drops PR #1874's edit to
.github/workflows/integration-mobile-test-qvac-lib-infer-llamacpp-llm.yml
(parameterised jobTimeoutMinutes + 90-minute override for Android
GroupB). Workflow is restored to the upstream/main version.

Co-authored-by: Cursor <cursoragent@cursor.com>

* addons: disable flash-attn by default on the OpenCL backend

Flash attention is not reliably supported by the OpenCL ggml backend
(Adreno path), so when the chosen GPU backend ends up being OpenCL
the addons now force "flash-attn=off" unless the user explicitly
passed flash-attn / flash_attn in their config.

LLM (LlamaModel.cpp / LlamaModel.hpp):

- Add a bool isOpenCl parameter to tuneConfigMap (defaulted to false
  to keep the existing test_tune_config_map.cpp call sites working).
- Mirror the BitNet-disabling branch with an else-if for OpenCL +
  notUserSet("flash-attn", "flash_attn").
- At the call site, read chosenBackend.first/second after chooseBackend
  returns and pass isOpenCl through.

Embed (BertModel.cpp):

- No tuneConfigMap equivalent here. Inject the same logic inline
  immediately after chooseBackend, before configFilemap is serialised
  into configVector. Honour user-set "flash-attn"/"flash_attn".
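
The rule itself, rendered as a JavaScript sketch for readability. The addon code is C++ and its real tuneConfigMap has a different signature; the function name below is hypothetical.

```javascript
// Hypothetical rendering of the rule: on OpenCL, default flash attention to
// "off" unless the user set either spelling of the key themselves.
function applyOpenClFlashAttnDefault(config, isOpenCl) {
  const userSet = 'flash-attn' in config || 'flash_attn' in config;
  if (isOpenCl && !userSet) {
    // Return a copy so the caller's config object is not mutated.
    return { ...config, 'flash-attn': 'off' };
  }
  return config;
}

const openClDefault = applyOpenClFlashAttnDefault({ ctx_size: '2048' }, true);
const openClUserSet = applyOpenClFlashAttnDefault({ flash_attn: 'on' }, true);
const otherBackend = applyOpenClFlashAttnDefault({ ctx_size: '2048' }, false);
```

Only `openClDefault` gains a `flash-attn` entry; the user-set and non-OpenCL configs pass through unchanged.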

Both packages compile cleanly via bare-make build on macOS-arm64.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fixup! tuneConfigMap: keep ABI for existing 4-arg test callers

CI failure on cpp-tests-darwin-arm64 (PR #1874):
  test/unit/test_tune_config_map.cpp:199:43: fatal error: no viable
  conversion from 'FtOverrides' to 'bool'

The previous commit inserted bool isOpenCl as the 4th parameter of
tuneConfigMap, but several existing tests pass FtOverrides{...} as
the 4th positional argument (relying on it being finetuneOverrides).

Swap the order so the new isOpenCl parameter comes after the existing
finetuneOverrides; both stay defaulted, so all old 3-arg and 4-arg
call sites compile unchanged. The production call site in
LlamaModel.cpp is updated accordingly.

Also adds 4 new TuneConfigMapTest cases covering the OpenCL branch:
- OpenCl_NonBitnet_FlashAttnDisabledByDefault
- OpenCl_UserSetFlashAttnHyphen_Respected
- OpenCl_UserSetFlashAttnUnderscore_Respected
- NotOpenCl_NonBitnet_FlashAttnUnchanged

All 53 TuneConfigMapTest cases pass locally on macOS-arm64.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add Qwen3.5 vision test.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Route vision models with mmproj to CPU on Apple M1.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Route only the projector to CPU on Apple M1.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Run qwen3-5.test.js on iOS GPU.

* Fix JS lint errors.

* Recognize Gemma 4 channel reasoning markers in Qwen3ReasoningUtils, and bump gemma4 basic-test n_predict so the answer fits after the thinking preamble.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Wire reasoning-budget config to inputs.enable_thinking so passing reasoning-budget=0 disables the model's <think> reasoning channel, and add coverage for Qwen3, Qwen3.5, and Gemma 4.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* vcpkg: bump qvac-fabric to >=8189.0.1

The 8189.0.1 port (tetherto/qvac-registry-vcpkg#138) drops
port-version 1's BUILD_LLAMA=OFF portfile workaround and ships the
new fabric tip 739b309ae. Notable upstream fixes pulled in:

- Inject enable_thinking into the Jinja template context so Qwen3.5
  and Gemma 4 actually emit <think> reasoning content.
- GGML_OP_DELTA_NET_AR Vulkan compute shader (Qwen3.5 / DeltaNet
  decode no longer falls back to CPU per token).
- vulkan: f32 src1 strided cpy fix (embedding-model crash).

Validated on macOS-arm64: vcpkg resolves
qvac-fabric[core,gpu-backends,llama]:arm64-osx@8189.0.1 and the
addon builds end-to-end.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Disable the embed addon's BERT-on-Mali CPU override.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Prepend <think> opener to the visible stream when the chat template force-opens the reasoning channel.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Remove the Mali detection plumbing from the embed addon now that BERT runs on Mali GPU.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Bump n_predict and ctx_size in the Qwen3.5 reasoning-budget baseline so the model reliably reaches </think>.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Restore the mobile finetune dataset to 8 samples.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* test: drop AfriqueGemma + MedGemma + Dolphin-MoE tests

Per review: cull tests that exercise models we no longer want covered
in the LLM/SDK CI matrix.

LLM (packages/llm-llamacpp):
- Delete integration tests:
  - test/integration/afriquegemma-edge-cases.test.js
  - test/integration/afriquegemma-translation.test.js
  - test/integration/moe.test.js (dolphin-mixtral-2x7b)
- Delete docs/afriquegemma-translation.md (only documents the
  now-removed integration tests).
- Strip the medgemma-4b-it variant from:
  - test/integration/tool-calling.test.js (collapses
    ALL_TOOL_MODEL_VARIANTS / TOOL_MODEL_VARIANTS to qwen3-1.7b only,
    drops the now-unused isMobile derived var).
  - test/integration/finetuning-pause-resume.test.js (drops the
    medgemma-4b-it-q4_0 entry from FINETUNE_MODELS).
- test/unit/test_model_metadata.cpp: drop the gemma3Model_ fixture +
  the two Gemma3-specific TEST_F cases
  (DiskSingleFile_Gemma3Arch_*); update the comment block listing
  exercised arches accordingly.
- test/unit/pick-primary-gguf-path.test.js: keep the tensors.txt-first
  ordering test, but rebase the fixture filenames on
  Qwen3-4B-Q4_K_M-* so no medgemma names remain in the test corpus.
- test/mobile/test-groups.json + test/mobile/integration.auto.cjs:
  drop runAfriquegemmaEdgeCasesTest, runAfriquegemmaTranslationTest,
  runMoeTest from both ios and android groups; auto.cjs trimmed to
  match. `validate-mobile-tests.js` is green.

SDK (packages/sdk/tests-qvac):
- Delete tests/translation-afriquegemma-tests.ts.
- tests/test-definitions.ts: drop translationAfriquegemmaTests
  import + spread.
- tests/shared/executors/translation-executor.ts: drop the import,
  the spread, and the |afriquegemma branch from the dispatch regex.
- tests/mobile/consumer.ts + tests/desktop/consumer.ts: drop the
  AFRICAN_4B_TRANSLATION_Q4_K_M import and the
  resources.define("afriquegemma", ...) block; mobile also drops the
  afriquegemma-only SkipExecutor.
- tests/shared/resource-lifecycle.ts: rephrase the eviction-comment
  example to a generic "large translation model" so it no longer
  references the deleted resource.

Not touched: NOTICE/CHANGELOG (auto-generated/historical),
sdk/models/registry/* (model constants in the registry are data, not
tests), sdk/examples/translation/translation-llm-afriquegemma.ts
(consumer-facing example, not a test).

* Revert "test: drop AfriqueGemma references from packages/sdk/tests-qvac"

Per review: keep packages/sdk/tests-qvac/ untouched. Restore the SDK
afriquegemma test file, the test-definitions / translation-executor /
desktop+mobile consumer / resource-lifecycle edits to their state
prior to commit 36de6ec.

Only the LLM-side cull (packages/llm-llamacpp + the deleted afrique /
moe / medgemma test files there) from 36de6ec is kept.

* Restore packages/llm-llamacpp/docs/afriquegemma-translation.md

Per review: keep the AfriqueGemma translation doc. Commit 36de6ec
removed it together with the LLM AfriqueGemma test files; restore it
unchanged from the merge tip (e29836d).

* chore: pin qvac-fabric to 8189.0.2 via overlay-ports for testing

Adds an overlay port copy of qvac-fabric pointing at v8189.0.2 of
tetherto/qvac-fabric-llm.cpp (tetherto/qvac-registry-vcpkg#140)
to llm-llamacpp, embed-llamacpp, and translation-nmtcpp, declared via
each package's vcpkg-configuration.json. Lets this PR exercise the new
fabric build (incl. the Mali coopmat1 BitNet TQ NaN fix) without
waiting for the registry baseline bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: pin overlay qvac-fabric to temp-8189 tip f686a1324

Point REF at the latest qvac-fabric-llm.cpp temp-8189 commit
(f686a1324e13184d3257cb74c1ba17f9cf8ef575) instead of v8189.0.2 so the
overlay tracks branch tip while the branch is still moving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: extend Android LLM mobile test timeouts

Allow slower Android Device Farm runs to finish model-heavy LLM tests before the harness marks them as timed out.

Co-authored-by: Cursor <cursoragent@cursor.com>

* vcpkg: drop qvac-fabric overlay-ports, bump version>= to 8189.0.2

tetherto/qvac-registry-vcpkg#140 publishes qvac-fabric@8189.0.2 in the
default registry, so the temporary per-package overlay we used while the
new fabric build was still being shaken out is no longer necessary.

For llm-llamacpp, embed-llamacpp, and translation-nmtcpp:

- Delete `packages/<pkg>/vcpkg/ports/qvac-fabric/` (portfile.cmake,
  vcpkg.json, android-vulkan-version.cmake) — the overlay copy.
- Drop the `overlay-ports` entry from each package's
  vcpkg-configuration.json. The `default-registry` baseline is left
  untouched intentionally; the `version>=` constraints below are what
  forces vcpkg to resolve to the new fabric revision against the
  unchanged baseline.
- Bump the `qvac-fabric` `version>=` pin from `8189.0.1` -> `8189.0.2`
  in each package's vcpkg.json.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(llm): drop dead sawMali plumbing from BackendSelection

`sawMali` was threaded through `emplaceIfValidDevice` / `tryEmplaceDevice` /
`chooseBackend` but never read by any caller — leftover from the earlier
"Force BERT/Qwen3.5 to CPU on Mali" iterations. The embed-side cleanup
already landed in 2ac5de0 ("Remove the Mali detection plumbing from the
embed addon now that BERT runs on Mali GPU."); this finishes the symmetric
removal on the LLM side. `sawAppleM1` plumbing is preserved unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): explain why MtmdLlmContext skips inside_reasoning flip

TextLlmContext flips reasoningState_.inside_reasoning = true alongside the
forced "<think>\n" opener; MtmdLlmContext doesn't because it doesn't carry a
reasoningState_ today. Add an inline note so the asymmetry isn't read as a
bug, and point at the symmetric site to update if reasoning-aware EOS
replacement is later added on the multimodal path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(llm): narrow tool-call args quoter to leading bare key only

The previous post-generation regex (`([{,])(\s*)([A-Za-z_]…)(\s*):` -> quote
the ident) was too broad: it also matched `, ident:` substrings sitting
inside JSON string values, so a tool call with a free-form string argument
like `{"query":"phase one, step: validate"}` came out corrupted as
`{"query":"phase one, "step": validate"}`, which then failed JSON.parse on
the consumer side.

In practice the rewrite is only needed for one upstream quirk: the Gemma 4
parser's `gemma4_args_to_json` (common/chat-parser.cpp) uses an
`at_key_start()` helper that peeks backwards in the output buffer for a
`{`/`,` -- so the very first top-level key is left bare while every nested
or post-comma key is already quoted. All other tool dialects reach us via
`json::dump()` upstream and already start with a quoted key.

Replace the broad regex with one anchored at `^\{(\s*)<ident>\s*:`, which
fixes exactly that single leading-bare-key case and cannot match anywhere
inside a JSON string value.

Verified end-to-end on linux-x64 against gemma-4-E2B-it-Q8_0 (CPU):

- Adversarial prompt forcing `phase one, step: validate` as a tool arg
  string: baseline produced invalid JSON
  `{"query":"phase one, "step": validate"}` (parse fail at pos 55);
  this fix yields `{"query":"phase one, step: validate"}` and the test
  passes 7/7 assertions.
- Existing simple-args happy path (`get_weather` with city/unit) still
  passes 5/5.
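
The difference between the two patterns can be reproduced in a few lines of JavaScript. This is a sketch, not the addon's C++ code: the commit message elides the identifier character class (`[A-Za-z_]…`), so the class used below is an assumption, and the function names are hypothetical.

```javascript
// ASSUMPTION: the identifier class is elided in the commit message; this one
// is a guess used for illustration only.
const IDENT = '[A-Za-z_][A-Za-z0-9_]*';

// Broad form: quotes any bare identifier that follows "{" or ",".
const broadRe = new RegExp(`([{,])(\\s*)(${IDENT})(\\s*):`, 'g');
function quoteKeysBroad(args) {
  return args.replace(broadRe, '$1$2"$3"$4:');
}

// Anchored form: quotes only a bare key directly after the opening brace.
const anchoredRe = new RegExp(`^\\{(\\s*)(${IDENT})\\s*:`);
function quoteLeadingKey(args) {
  return args.replace(anchoredRe, '{$1"$2":');
}

const stringArg = '{"query":"phase one, step: validate"}';
const bareLeading = '{city:"Paris","unit":"celsius"}';

const corrupted = quoteKeysBroad(stringArg);   // rewrites inside the string value
const untouched = quoteLeadingKey(stringArg);  // already valid JSON, left alone
const repaired = quoteLeadingKey(bareLeading); // leading bare key gets quoted
```

Here `corrupted` is `{"query":"phase one, "step": validate"}`, which no longer parses, while `untouched` and `repaired` both pass `JSON.parse`.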

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop synthetic <tool_call>{json}</tool_call> post-processing

Each model now streams only its own native tool-call dialect:
- Qwen3 / Hermes: <tool_call>{json}</tool_call> (already canonical)
- Qwen3.5: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
- Gemma 4: <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|>
- Mistral, DeepSeek-R1, Functionary, GPT-OSS, etc. emit their own markers.

The previous PR added a post-generation common_chat_parse pass that
appended a uniform <tool_call>{json}</tool_call> envelope for every
detected call. That duplicated tokens for Hermes-shape models (the
envelope is already in the native stream) and inflated Gemma 4 output
by ~14% with two synthetic copies per call. The leading-bare-key
handling for Gemma 4's tc.arguments was also a constant source of sharp
edges (broad regex corrupted string values containing ", ident:";
narrow anchored regex still required follow-up). Per-dialect parsing
belongs at the SDK consumer layer, not in the addon.

Removed:
- Post-generation block in LlamaModel::processPromptImpl (synthesizer).
- needsOutputCapture widening to include !resolved.tools.empty().
- LlmContext::getLastChatFormat() virtual.
- lastChatFormat_ members + overrides in TextLlmContext, MtmdLlmContext.
- common_chat_format* outFormat parameter from getPrompt().
- <regex> include in LlamaModel.cpp (no remaining users).

Kept:
- outThinkingForcedOpen mechanism (independent reasoning-channel feature).
- toolsCompact_ controller and KV-cache trim logic.
- All other PR work.

Validated on linux-x64/CPU after incremental rebuild:
- Gemma 4 (gemma-4-E2B-it-Q8_0): 6/6 asserts pass with native-dialect
  parser, no synthetic envelope leaks, output 941 chars (down from
  ~1100 with synthesizer).
- Qwen3.5 (Qwen3.5-0.8B-Q8_0): 5/5 asserts pass with the existing
  parseXmlToolCall path, output 394 chars.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): parse Gemma 4 native tool-call dialect in gemma4.test.js

Without the synthetic <tool_call>{json}</tool_call> envelope reverted
in the previous commit, Gemma 4 emits its own dialect:

  <|tool_call>call:NAME{key:<|"|>val<|"|>,...}<tool_call|>

Strings are wrapped in <|"|>...<|"|> instead of "...", keys are bare,
and the closing tag is <tool_call|> (trailing pipe, no slash).

extractToolCalls now matches that shape directly and returns
{ name, argsRaw }. argsContainStringValue() helper checks the args
body for a Gemma-4-quoted string literal. Substring-based assertion
is sufficient to verify the model called the right tool with the
right argument values; full dialect-to-JSON conversion lives upstream
in fabric's gemma4_args_to_json and is not the addon test's job.
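
A hypothetical re-creation of the two helpers, to show the shape being matched. Only the names extractToolCalls and argsContainStringValue come from this commit; the bodies and the regex are assumptions.

```javascript
// Gemma 4 wraps string literals in <|"|>...<|"|> instead of double quotes.
const GEMMA4_QUOTE = '<|"|>';

// Matches <|tool_call>call:NAME{...}<tool_call|> and keeps the args body raw.
function extractToolCalls(output) {
  const re =
    /<\|tool_call>call:([A-Za-z_][A-Za-z0-9_]*)\{([\s\S]*?)\}<tool_call\|>/g;
  return Array.from(output.matchAll(re), (m) => ({
    name: m[1],
    argsRaw: m[2]
  }));
}

// Substring check for a Gemma-4-quoted string literal inside the args body.
function argsContainStringValue(argsRaw, value) {
  return argsRaw.includes(GEMMA4_QUOTE + value + GEMMA4_QUOTE);
}

const gemmaOutput =
  'Sure.<|tool_call>call:get_weather{city:<|"|>Paris<|"|>,' +
  'unit:<|"|>celsius<|"|>}<tool_call|>';
const gemmaCalls = extractToolCalls(gemmaOutput);
```

No JSON conversion is attempted here, in line with the substring-based assertion described above.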

qwen3-5.test.js was unchanged: Qwen3.5 wraps its <function=name>
<parameter=k>v</parameter></function> XML in <tool_call>...</tool_call>
natively, so the existing parseXmlToolCall path keeps working.

Validated on linux-x64/CPU against gemma-4-E2B-it-Q8_0:
4/4 tests, 13/13 asserts (3 synthetic-input parser sanity checks +
1 live LLM run).

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop Apple M1 detection + projector-CPU routing

The PR added an Apple-M1-specific code path that detected the chip via the
GPU description string and routed `params.mmproj_use_gpu = false` so the
vision projector ran on CPU instead of Metal, working around a SIGSEGV in
the projector's image-encoding kernel observed on M1 Metal at the time.

Re-tested on M1 with the current fabric tip: no SIGSEGV, projector runs
fine on Metal end-to-end. The carve-out is no longer needed.

Removed:
- BackendSelection: `isAppleM1Device()` helper, `bool& sawAppleM1` plumbing
  through `emplaceIfValidDevice` / `tryEmplaceDevice` / `chooseBackend`,
  and `bool* outSawAppleM1` parameter on both `chooseBackend` overloads.
- LlamaModel: the `bool sawAppleM1 = false` local, the call-site argument,
  and the `params.mmproj_use_gpu = !sawAppleM1` ternary; mmproj now uses
  GPU on every desktop platform (Android still hardcoded to false).
- test_backend_selection.cpp: `APPLE_M{1,2,3,4}_DESC` constants,
  `chooseBackendWithM1Flag()` helper, and the four `AppleM*_*` test cases.
- gemma4.test.js / qwen3-5.test.js: the comment blocks describing the M1
  carve-out; `useCpuForVision` semantics are unchanged (`useCpu || isMobile`
  on gemma4 and `useCpu` on qwen3-5).

Verified on linux-x64/CPU after rebuild: 148/148 C++ unit tests pass
(BackendSelectionTest, TuneConfigMapTest, ChatTemplateUtilsTest).

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert(llm): drop dead Gemma 4 markers from updateQwen3ReasoningBuffer

The PR added two extra substring scans for Gemma 4's reasoning channel
markers (<|channel>thought open, <channel|> close) to
updateQwen3ReasoningBuffer. The intent was to extend the EOS-rescue
path (handleQwen3ReasoningEOS rewrites EOS-while-thinking into a
closing tag) to Gemma 4. That never actually fires though: both the
buffer-update call and the EOS-rescue call in TextLlmContext are gated
by `if (isQwen3Model_)`, and isQwen3Model_ resolves to
`general.architecture == "qwen3"` only. Gemma 4 reports architecture
"gemma4", so the gate never opens, the markers never get scanned, and
the rescue path never runs for Gemma 4.

In live runs Gemma 4 always emits <channel|> cleanly before <eos>, so
the rescue isn't needed on the happy path; if Gemma 4 ever truncates
mid-thought under context pressure we will need a real dialect-aware
rescue (per-arch close-tag token + extended gate) and a follow-up will
add that. For this PR we just want the dead code gone so it doesn't
mislead future readers about what's actually wired up.

Net: -9 lines, file is now identical to upstream main.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): switch gemma4 fixtures from unsloth to bartowski

The unsloth GGUF pack
(huggingface.co/unsloth/gemma-4-E2B-it-GGUF) tags <turn|> as the EOG token
in tokenizer.ggml.eos_token_id and leaves <eos> classified as a regular
text token. Gemma 4's training-baked behaviour after assistant content is
to emit a few <eos> tokens before <turn|>, so with that pack the addon's
generation loop -- which terminates on llama_vocab_is_eog -- doesn't stop
until <turn|> arrives. We were observing ~9 spurious <eos> tokens
trailing every Gemma 4 response, eating into n_predict and KV cache for
no gain.

bartowski's GGUF
(huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF) ships the exact
same vocabulary but tags <eos> as EOG (matching the base
google/gemma-4-E2B-it tokenizer config). With that pack the addon
terminates on the first <eos> -- empirically 0 trailing tokens, ~30%
shorter completions on the same prompt, same dialect output that the
native-dialect parser added in 87e6c35 handles unchanged.

Verified on linux-x64/CPU (qvac-dev-linux-x64) with the same
get_weather tool prompt:

  unsloth Q8_0    : 941 chars, 9 trailing <eos>, EOG = {<turn|>, </s>}
  bartowski Q4_K_M:  676 chars, 0 trailing <eos>, EOG = {<eos>,    </s>}
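
A toy model of why the EOG tag matters. The token stream below is invented for illustration; the real loop checks llama_vocab_is_eog on token ids rather than comparing strings.

```javascript
// Toy generation loop: keep emitting tokens until one is in the
// end-of-generation (EOG) set. Token strings are illustrative assumptions.
function generate(stream, eogTokens) {
  const emitted = [];
  for (const token of stream) {
    if (eogTokens.has(token)) break;
    emitted.push(token);
  }
  return emitted;
}

// The model emits a few <eos> tokens before <turn|> after its answer.
const modelStream = ['Paris', '<eos>', '<eos>', '<eos>', '<turn|>'];

// Metadata that tags <turn|> as EOG: the <eos> tokens leak into the output.
const withTurnAsEog = generate(modelStream, new Set(['<turn|>', '</s>']));

// Metadata that tags <eos> as EOG: generation stops at the first <eos>.
const withEosAsEog = generate(modelStream, new Set(['<eos>', '</s>']));
```

With `<turn|>` as the EOG token the three `<eos>` entries are emitted as ordinary text; with `<eos>` as EOG only `Paris` is emitted.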

Note: the unsloth metadata bug deserves an upstream issue against the
unsloth pack maintainers; this PR's scope is just to stop our tests
paying the wasted-tokens tax.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): unblock gemma4 image test on mobile + fix ctx overflow

Three changes to packages/llm-llamacpp/test/integration/gemma4.test.js
(image-describe subtest):

1. Drop the mobile CPU-vision carve-out.
   useCpuForVision used to force `device: 'cpu'` on Android/iOS to dodge
   Adreno OpenCL SIGABRT and Mali Vulkan instability that bit us with
   the unsloth mmproj. With bartowski's mmproj (now the fixture in
   787c3322) we want CI to actually exercise the device-farm GPU code
   path for vision -- if that path regresses on a real Adreno or Mali
   chip we want to find out from CI, not by accident in production.
   Desktop darwin-x64 / linux-arm64 keep the CPU fallback because those
   hosts don't have a working GPU stack here.

2. Bump ctx_size 2048 -> 8192. A single elephant.jpg encodes to ~260
   mtmd image tokens. With ctx_size=2048 plus Gemma 4's verbose CoT
   preamble the generation loop overflowed nPast > n_ctx during
   sampling (MtmdLlmContext.cpp:452), throwing
   'processPromptImpl: context overflow'. 8192 leaves comfortable
   headroom on every backend.

3. Set reasoning-budget=0 for this test. We literally ask the model
   "Answer in one word" -- the <|channel>thought ...<channel|> CoT
   preamble that Gemma 4 wants to emit by default is wasted tokens
   here, and was the actual cause of the overflow above (CoT was
   running 8k+ tokens before the model reached the one-word answer
   and emitted <eos>). Disabling thinking gives us a deterministic
   ~10-token "Elephant" + <eos> response, which is what the
   substring-based assertion is testing for anyway.

Verified on linux-x64 (qvac-dev-linux-x64, 2x RTX 5090, Vulkan
backend) end-to-end:
  output: "Elephant"
  asserts: 3/3
  total time: ~2 s

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(llm): drop dead selectToolsCompactMarker(string) overload

selectToolsCompactMarker(const std::string& architecture) had no production
callers anywhere -- only its two unit tests
(SelectToolsCompactMarkerForQwen3,
SelectToolsCompactMarkerForUnsupportedArchitecture) referenced it. Live
production code goes through selectToolsCompactMarkerForModelMetadata
(LlamaModel::resolveToolsCompactConfig calls that one), which takes
std::optional<std::string> and is the only path that ever reaches the
"qwen3" -> "<tool_call>" mapping at runtime.

Removed the .cpp definition, the .hpp declaration, and the two unit
tests. selectToolsCompactMarkerForModelMetadata is unchanged and still
covered by SelectToolsCompactMarkerForModelMetadataUsesArchitecture.

ChatTemplateUtilsTest now runs 19/19 tests on linux-x64 (was 21/21).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): drop redundant useCpuForVision alias; vision runs on GPU on mobile

Now that the per-mobile CPU carve-out for Gemma 4 vision is gone (commit
2843297), and since Qwen3.5 vision never had one, useCpuForVision is just
a no-op alias of useCpu, used at exactly one call site in each test file.
Inline it.

Net effect on the device routing matrix is unchanged but explicit:

  platform/arch              useCpu  device used
  --------------------------------------------------------
  darwin-x64                 true    cpu  (no working GPU here)
  linux-arm64                true    cpu  (no working GPU here)
  darwin-arm64 (M-series)    false   gpu  (Metal)
  linux-x64                  false   gpu  (Vulkan/OpenCL)
  ios                        false   gpu  (Metal -- device farm)
  android                    false   gpu  (Adreno OpenCL / Mali Vulkan -- device farm)

So on iOS / Android the gemma4 and qwen3-5 image-describe subtests run
through the actual GPU vision path -- the same path users hit -- and
will surface any regression from CI rather than from production.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(llm): correct thinkingForcedOpen_ comment re: gemma4

Gemma4 does not hit this code path: upstream
common_chat_params_init_gemma4 explicitly leaves thinking_forced_open
unset because gemma4's reasoning channel is model-emitted. Drop the
misleading reference and call out the actual templates that trigger
this path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): refresh PR-1874 entries to reflect actual shipped scope

The original CHANGELOG entries for llm-llamacpp 0.20.0, embed-llamacpp
0.16.0, and translation-nmtcpp 3.0.0 were drafted before the synthesizer
revert, the M1 / sawMali / dead-code cleanups, the bartowski fixture
swap, the native-dialect tool-call parsing, the reasoning-budget knob,
the thinkingForcedOpen synthetic-opener, the new integration tests, and
the move from 8189.0.0 to 8189.0.2. They now match what the PR
actually ships.

Compressed every entry to a flat bullet list grouped by Keep-a-Changelog
section (Changed / Added / Removed / Fixed / Deprecated / Internals)
and bumped the date to 2026-05-10.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(changelog): trim items that round-trip to net-zero in the PR

Removed lines that described code that's neither in upstream/main nor in
the PR head (so it has no observable impact on consumers):

- llm-llamacpp 0.20.0:
  * "tool-call streaming: each model now streams its native dialect /
     no re-shaping" -- main already streamed native dialects; the
     PR-internal synthesizer never shipped, so this is a non-change.
  * "Dropped sawMali plumbing / Apple-M1 detection / dead Gemma 4
     markers in Qwen3ReasoningUtils" -- all three were added and
     removed inside this PR's commit history; net diff is zero.

- embed-llamacpp 0.16.0:
  * "Dropped Mali-detection plumbing" -- same: added and removed
     within this PR's history, net diff is zero.

Kept genuine net removals against upstream/main:
- Qwen3 model-name-based fallback.
- Dead `selectToolsCompactMarker(std::string)` overload (was
  pre-existing in main, only ever called from unit tests).

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(notice): regenerate NOTICE for embed-llamacpp, llm-llamacpp, translation-nmtcpp

Re-ran the notice-generate skill (.cursor/skills/notice-generate) for the
three addons whose dependency surfaces changed in this PR:

- qvac-fabric bumped from 7248.x to 8189.0.2 -- different transitive C++
  license set.
- holepunch / hyperswarm libs moved to peerDependencies on main, so the
  JS attribution lists shrink accordingly.
- @qvac/infer-base bumped to 0.4.1.

Per-package C++ resolution after the run:

  embed-llamacpp        : opencl/qvac-fabric/qvac-lib-inference-addon-cpp/
                          qvac-lint-cpp + libc++          (5 deps)
  llm-llamacpp          : the above + picojson + nlohmann-json (7 deps)
  translation-nmtcpp    : bergamot-translator/sentencepiece/ssplit/
                          qvac-fabric/qvac-lib-inference-addon-cpp/
                          qvac-lint-cpp + libc++          (7 deps)

Net: +206 / -585 lines across the three NOTICE files (mostly transitive
JS attribution shrink from the holepunch peerDeps refactor).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(llm): make gemma4 reasoning-budget test tolerate model-emitted reasoning

Gemma 4's reasoning channel is model-emitted (no template force-open),
so the model decides per-prompt whether to engage reasoning. For
trivial prompts like "What is the capital of France?" the model can
short-circuit and skip the <|channel>thought…<channel|> markers, which
made the test flaky on CI.

Gate the marker / length assertions on the baseline actually emitting
the opening marker; if it didn't, log a comment and skip the dependent
checks instead of failing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* types(llm): declare reasoning_budget in LlamaConfig

The C++ config parser already accepts `reasoning_budget` (and the
kebab-case `reasoning-budget` alias), but neither was a typed property
on `LlamaConfig` — they only typechecked via the catch-all index
signature. Add a typed entry with JSDoc so TypeScript consumers get
autocomplete and the accepted values (-1 default, 0 disabled).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(llm): allow per-request reasoning_budget override in run()

`reasoning_budget` was load-time only. Add it to `GenerationParams` so
`model.run(messages, { generationParams: { reasoning_budget: 0 } })`
can disable reasoning for a single request without re-loading the
model — same shape as `temp` / `top_p` / `seed` overrides.

Wiring:
- `LlmContext::GenerationParams` gains an optional `reasoning_budget`
  field and `hasOverrides()` covers it.
- `applyGenerationParamsToContext` snapshots / overrides /
  restores `params.reasoning_budget` alongside `n_predict`.
- `AddonJs::runJob` parses `generationParams.reasoning_budget` from
  JS and rejects values other than `-1` or `0`.
- `index.d.ts` exposes `reasoning_budget?: -1 | 0` on
  `GenerationParams` with a JSDoc note.

`tokenizeChat` already reads `params_.reasoning_budget`, so no change
is needed in `TextLlmContext` / `MtmdLlmContext` — the temporary
override naturally propagates to `inputs.enable_thinking`.
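
The snapshot / override / restore wiring above can be sketched as an RAII guard. This is a minimal illustration, not the addon's code: `Params`, `GenerationParams`, `ScopedOverride` and `budgetDuringRequest` are hypothetical simplified stand-ins, and the real `applyGenerationParamsToContext` may be structured differently.

```cpp
#include <optional>

// Hypothetical stand-in for the context's load-time parameters.
struct Params {
    int n_predict = -1;
    int reasoning_budget = -1;  // -1 = default (reasoning enabled), 0 = disabled
};

// Hypothetical per-request overrides; an empty optional keeps the load-time value.
struct GenerationParams {
    std::optional<int> n_predict;
    std::optional<int> reasoning_budget;

    bool hasOverrides() const {
        return n_predict.has_value() || reasoning_budget.has_value();
    }
};

// Snapshots the params on construction, applies the overrides,
// and restores the snapshot when the request scope ends.
class ScopedOverride {
public:
    ScopedOverride(Params& params, const GenerationParams& ov)
        : params_(params), saved_(params) {
        if (ov.n_predict) params_.n_predict = *ov.n_predict;
        if (ov.reasoning_budget) params_.reasoning_budget = *ov.reasoning_budget;
    }
    ~ScopedOverride() { params_ = saved_; }

    ScopedOverride(const ScopedOverride&) = delete;
    ScopedOverride& operator=(const ScopedOverride&) = delete;

private:
    Params& params_;
    Params saved_;
};

// Demonstration helper: returns the budget visible while the override is active.
inline int budgetDuringRequest(Params& params, const GenerationParams& ov) {
    ScopedOverride guard(params, ov);
    return params.reasoning_budget;
}
```

Restoring in the destructor keeps the override request-scoped even if the request path throws.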

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(llm): cover per-request reasoning_budget override on Qwen3.5

Validates the new per-request `generationParams.reasoning_budget`
override end-to-end in two runs against a single loaded model:

1. `reasoning_budget: 0` override suppresses the `<think>…</think>`
   reasoning markers for that one request.
2. The next `run()` with no override restores the load-time default
   (reasoning enabled), proving the override is request-scoped and
   not sticky.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(llm): case-insensitive antiprompt substring matching

`checkAntiprompt` now lowercases both the recent output window and each
antiprompt before the `find()` so a single `Pizza` entry catches the
model's `pizza`, `Pizza`, `PIZZA`, etc. Callers no longer need to list
every casing variant. Applied identically in `TextLlmContext` and
`MtmdLlmContext`. The token-level early-exit path is unchanged (BPE
tokens are case-specific; the substring path is the authoritative
check).
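
A minimal sketch of the lowercase-then-`find()` substring check described above. The names `toLowerAscii` and `matchesAntiprompt` are hypothetical, the folding is ASCII-only, and the real `checkAntiprompt` also has the token-level early-exit path, which is not modelled here.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// ASCII-only lowercase copy (assumption: antiprompts are plain ASCII).
inline std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns true if any antiprompt occurs in the recent output window,
// ignoring ASCII case on both sides.
inline bool matchesAntiprompt(const std::string& recentOutput,
                              const std::vector<std::string>& antiprompts) {
    const std::string haystack = toLowerAscii(recentOutput);
    for (const std::string& antiprompt : antiprompts) {
        if (antiprompt.empty()) continue;  // an empty needle would match everything
        if (haystack.find(toLowerAscii(antiprompt)) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```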

Also drop the stale comment on the `Reverse prompt stops generation`
scenario in `config-parameters.test.js`: it claimed the addon split
on `,` without trimming, but `LlamaModel.cpp::split()` already
trims and drops empty segments. Replaced with a brief note that
documents the new (current) behaviour and simplified the antiprompt
list to `'network, Pizza, bitcoin, blockchain'` so the test exercises
both the trim and the case-insensitive match.
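
For illustration, a split that trims each segment and drops empty ones could look like the sketch below. `splitTrimmed` is a hypothetical helper written for this note; it is not the body of `LlamaModel.cpp::split()`, which is not shown in this log.

```cpp
#include <string>
#include <vector>

// Splits on `delim`, trims ASCII whitespace around each segment,
// and drops segments that are empty after trimming.
inline std::vector<std::string> splitTrimmed(const std::string& input,
                                             char delim = ',') {
    const char* whitespace = " \t\r\n";
    std::vector<std::string> out;
    std::size_t start = 0;
    while (start <= input.size()) {
        std::size_t end = input.find(delim, start);
        if (end == std::string::npos) end = input.size();
        // First non-whitespace character at or after `start`.
        const std::size_t first = input.find_first_not_of(whitespace, start);
        if (first != std::string::npos && first < end) {
            // Last non-whitespace character before the delimiter.
            const std::size_t last = input.find_last_not_of(whitespace, end - 1);
            out.push_back(input.substr(first, last - first + 1));
        }
        start = end + 1;
    }
    return out;
}
```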

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(llm): stress case-insensitive antiprompt with PiZzA mixed-case entry

Swap the `Pizza` reverse_prompt entry for `PiZzA`. With case-sensitive
matching `PiZzA` would never match the model's `pizza` / `Pizza`
output; only case-insensitive comparison fires the stop. Verified
locally — the test still completes with output length 5, so the
antiprompt trips on the first emitted "Pizza".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(llm): validate reasoning_budget before truncating to int

Address @jpgaribotti's review: previously the value was cast to int
*before* the `0` / `-1` check, so fractional inputs like `0.5` or
`-1.1` would silently truncate to a "valid" 0 / -1 and pass through.

Validate against the exact double values (both `0` and `-1` are
exactly representable in IEEE-754, so `==` comparison is safe) before
casting to int when storing in `ov.reasoning_budget`.
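
A minimal sketch of the validate-then-cast order described above. `validateBudgetDouble` is a hypothetical helper name; the actual check lives inside `AddonJs::runJob` and is not reproduced here.

```cpp
#include <optional>

// Compares the raw double against the two allowed values first and
// only then narrows to int, so 0.5 or -1.1 can never truncate to 0 / -1.
inline std::optional<int> validateBudgetDouble(double value) {
    if (value != 0.0 && value != -1.0) {
        return std::nullopt;  // also rejects NaN, which compares unequal to everything
    }
    return static_cast<int>(value);
}
```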

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(llm): use std::from_chars for reasoning_budget load-time parse

Address @jpgaribotti's review: `std::stoi` silently accepts trailing
garbage (`"0abc"` → `0`) and throws an uncaught `std::out_of_range`
for inputs that overflow `int`. Switch to `std::from_chars`, which
reports non-numeric input and overflow (`errc::result_out_of_range`)
through its error code and exposes trailing garbage via `ptr != end`,
then validate against the allowed `-1` / `0` values in the same check.
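
A minimal sketch of the `std::from_chars` check described above, assuming C++17. `parseBudgetString` is a hypothetical helper name and the surrounding config parser is omitted.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Parses a load-time reasoning_budget string. Rejects non-numeric input,
// overflow, trailing garbage, and any value other than -1 or 0.
inline std::optional<int> parseBudgetString(std::string_view text) {
    if (text.empty()) return std::nullopt;
    int value = 0;
    const char* begin = text.data();
    const char* end = begin + text.size();
    const auto [ptr, ec] = std::from_chars(begin, end, value);
    // ec covers invalid input and out-of-range; ptr != end covers "0abc".
    if (ec != std::errc{} || ptr != end) return std::nullopt;
    if (value != 0 && value != -1) return std::nullopt;
    return value;
}
```

Unlike `std::stoi`, `std::from_chars` does not skip leading whitespace or accept a leading `+`, so inputs such as `" 0"` are rejected by this sketch as well.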

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>