Merged
Commits
53 commits
541e3a4
chore[bc]: remove BaseInference inheritance and WeightsProvider from …
donriddo Apr 8, 2026
7494997
fix: correct FinetuneProgress and finetune terminal handling in outpu…
donriddo Apr 9, 2026
ace7b57
fix: update all LLM examples and model-loading test to new constructo…
donriddo Apr 9, 2026
91c2375
fix: update sharded model test to download shards to disk first
donriddo Apr 9, 2026
502be84
fix: update LLM benchmark tooling to new constructor shape
donriddo Apr 9, 2026
b3dc214
fix: update LLM perf benchmark sweep and judge to new constructor shape
donriddo Apr 10, 2026
a087ffd
docs: update LLM README, finetuning, and afriquegemma docs for new co…
donriddo Apr 10, 2026
ebcf734
fix: update LLM prepare-prompts and verify-prompts to new constructor
donriddo Apr 10, 2026
0ada4a5
fix: update LLM finetuning unit tests to new constructor and exclusiv…
donriddo Apr 10, 2026
26bb3c0
docs: update LLM architecture, data-flows, finetuning, README sharded…
donriddo Apr 10, 2026
2e8d063
docs: align LLM finetuning docs and mobile README with new constructor
donriddo Apr 10, 2026
c9cfb3e
chore[bc]: address PR #1494 review findings and bump to 0.15.0
donriddo Apr 10, 2026
bca021d
refactor: move LLM C++ event normalization into addon.js
donriddo Apr 10, 2026
90a07d3
fix: address PR #1494 second-round review findings
donriddo Apr 10, 2026
61350ae
Merge remote-tracking branch 'origin/main' into chore/llm-addon-inter…
donriddo Apr 10, 2026
4388126
fix: extract pickPrimaryGgufPath, restore multiModal example, fix docs
donriddo Apr 12, 2026
9088c9e
fix: correct version in architecture.md and remove stale dl-filesyste…
donriddo Apr 12, 2026
187e694
fix: align _hasActiveResponse clearing with embed pattern
donriddo Apr 14, 2026
754da22
fix: throw on second load(), log rejected responses, add mapAddonEven…
donriddo Apr 14, 2026
334950a
Merge remote-tracking branch 'upstream/main' into chore/llm-addon-int…
donriddo Apr 14, 2026
b0a6d08
fix: restore JSDoc on run() that was dropped during BaseInference rem…
donriddo Apr 14, 2026
80c63a8
Merge remote-tracking branch 'upstream/main' into chore/llm-addon-int…
donriddo Apr 14, 2026
8131641
fix: migrate afriquegemma-edge-cases test to new addon constructor
donriddo Apr 14, 2026
b51ca89
fix: make load() idempotent when already loaded
donriddo Apr 15, 2026
f6b17c0
Merge remote-tracking branch 'upstream/main' into chore/llm-addon-int…
donriddo Apr 15, 2026
408506d
test: regenerate mobile integration auto.cjs
donriddo Apr 15, 2026
bc7414b
Merge remote-tracking branch 'upstream/main' into chore/llm-addon-int…
donriddo Apr 15, 2026
39455e5
doc: document missing breaking changes from BaseInference removal
donriddo Apr 15, 2026
24f0a6c
fix: address lifecycle, validation, and CI-surface review findings
donriddo Apr 16, 2026
b5d9c96
doc: add CHANGELOG entries for load() serialization and absolute-path…
donriddo Apr 16, 2026
e9b0b27
fix[ci]: run test:unit via run-lint-and-unit-tests action
donriddo Apr 16, 2026
979a070
doc: fix mermaid parsing errors in architecture.md and finetuning.md
donriddo Apr 16, 2026
e578002
Merge remote-tracking branch 'upstream/main' into chore/llm-addon-int…
donriddo Apr 16, 2026
18c1d7e
chore[ci]: rename step to reflect what the action actually runs
donriddo Apr 16, 2026
87ece27
fix: readme, finetune lifecycle, multimodal type
donriddo Apr 16, 2026
e320f00
fix: preserve LogMsg event name in mapAddonEvent
donriddo Apr 16, 2026
3ba64ab
doc: restore class JSDoc, method JSDoc, and media-separation comments
donriddo Apr 16, 2026
3d219ca
doc: shorten pickPrimaryGgufPath JSDoc in d.ts to a single line
donriddo Apr 16, 2026
89649c6
doc: trim verbose comments added during the refactor
donriddo Apr 16, 2026
123337a
doc: drop narration comment on _addonOutputCallback
donriddo Apr 16, 2026
c95afd0
doc: restore FinetuneOptions JSDoc to pre-refactor forms
donriddo Apr 16, 2026
b855c60
doc: restore pre-refactor load/createAddon logs and JSDoc
donriddo Apr 16, 2026
b77629c
chore: drop unused 'test' script, inline into 'test:all'
donriddo Apr 16, 2026
94e4d27
doc: correct pre-refactor constructor marker to <= 0.15.x
donriddo Apr 16, 2026
64d4a52
test: run AfriqueGemma tests on mobile, matching main
donriddo Apr 16, 2026
98d52cd
doc, test: fix _createAddon JSDoc and cover string-path media content
donriddo Apr 16, 2026
06384c0
build: promote @qvac/logging to runtime dependency
donriddo Apr 16, 2026
f6170e0
doc: finish finetuning.md mermaid fix
donriddo Apr 17, 2026
de3d93e
Merge branch 'main' into chore/llm-addon-interface-refactor
donriddo Apr 17, 2026
196ac82
fix: move addon construction into crash-safe try block
donriddo Apr 17, 2026
25d37c2
Merge branch 'main' into chore/llm-addon-interface-refactor
donriddo Apr 17, 2026
24eb15c
Merge branch 'main' into chore/llm-addon-interface-refactor
donriddo Apr 17, 2026
a1dfe8f
Merge branch 'main' into chore/llm-addon-interface-refactor
gianni-cor Apr 17, 2026
9 changes: 9 additions & 0 deletions .github/workflows/on-pr-qvac-lib-infer-llamacpp-llm.yml
@@ -115,6 +115,15 @@ jobs:
        working-directory: packages/qvac-lib-infer-llamacpp-llm
        run: npm run test:dts

      - name: Run lint and unit tests
        id: run_lint_and_unit_tests
        uses: tetherto/oss-actions/.github/actions/run-lint-and-unit-tests@4c64bed91fc8eba3a201adb1495e61b4c1a2246d
        with:
          gpr-token: ${{ secrets.GITHUB_TOKEN }}
          pat-token: ${{ secrets.GITHUB_TOKEN }}
          registry-type: gpr
          workdir: packages/qvac-lib-infer-llamacpp-llm

  prebuild:
    needs: [authorize, sanity-checks]
    if: needs.authorize.outputs.allowed == 'true'
130 changes: 130 additions & 0 deletions packages/qvac-lib-infer-llamacpp-llm/CHANGELOG.md
@@ -1,5 +1,135 @@
# Changelog

## [0.16.0] - 2026-04-14

This release migrates the LLM addon off `BaseInference` inheritance and the `WeightsProvider` download layer onto the composable `createJobHandler` + `exclusiveRunQueue` utilities from `@qvac/infer-base@^0.4.0`. The constructor signature is replaced with a single object whose `files.model` field is an ordered array of absolute paths and `files.projectionModel` is an optional absolute path for multimodal models. This is a breaking change: every caller must update.

## Breaking Changes

### Constructor signature: single object with `files`, no `Loader`

`LlmLlamacpp` now takes a single `{ files, config, logger?, opts? }` object. The old `Loader` + `diskPath` + `modelName` + two-arg `(args, config)` shape is gone; callers pre-resolve absolute paths and supply them as `files.model`.

```js
// BEFORE (<= 0.15.x)
const FilesystemDL = require('@qvac/dl-filesystem')
const loader = new FilesystemDL({ dirPath: '/models' })
const model = new LlmLlamacpp({
  loader,
  modelName: 'Qwen3-1.7B-Q4_0.gguf',
  diskPath: '/models',
  logger: console,
  opts: { stats: true }
}, { ctx_size: '4096', gpu_layers: '99' })

// AFTER (0.16.0)
const model = new LlmLlamacpp({
  files: {
    model: ['/models/Qwen3-1.7B-Q4_0.gguf']
  },
  config: { ctx_size: '4096', gpu_layers: '99' },
  logger: console,
  opts: { stats: true }
})
```

For sharded models the caller passes the full ordered list: the `<basename>.tensors.txt` companion first, followed by every `<basename>-NNNNN-of-MMMMM.gguf` shard in ascending order. For multimodal models, `files.projectionModel` carries the absolute path to the mmproj file:

```js
const model = new LlmLlamacpp({
  files: {
    model: [
      '/models/medgemma-4b-it-Q4_1.tensors.txt',
      '/models/medgemma-4b-it-Q4_1-00001-of-00005.gguf',
      '/models/medgemma-4b-it-Q4_1-00002-of-00005.gguf',
      '/models/medgemma-4b-it-Q4_1-00003-of-00005.gguf',
      '/models/medgemma-4b-it-Q4_1-00004-of-00005.gguf',
      '/models/medgemma-4b-it-Q4_1-00005-of-00005.gguf'
    ],
    projectionModel: '/models/mmproj-model-f16.gguf'
  },
  config: { gpu_layers: '99' }
})
```
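
For codebases with many call sites, the mapping from the old argument shape to the new one can be sketched as a small adapter. `migrateArgs` is a hypothetical helper, not something the package exports. It assumes a single-file model whose `modelName` lives directly under `diskPath`, and that the old `projectionModel` was a file name in that same directory. It uses Node's `path` module; under Bare, `bare-path` would take its place.

```js
const path = require('path') // under Bare, require('bare-path') would take its place

// Hypothetical adapter (not part of @qvac/llm-llamacpp): maps the old
// two-argument constructor inputs to the new single-object shape.
// Assumes a single-file model; sharded models need the full ordered list.
function migrateArgs (oldArgs, config) {
  const { diskPath, modelName, projectionModel, logger, opts } = oldArgs
  const dir = path.resolve(diskPath)
  const files = { model: [path.join(dir, modelName)] }
  if (projectionModel) {
    // The old API took a file name; the new one takes an absolute path
    files.projectionModel = path.join(dir, projectionModel)
  }
  const args = { files, config }
  if (logger) args.logger = logger
  if (opts) args.opts = opts
  return args
}

const migrated = migrateArgs(
  { diskPath: '/models', modelName: 'Qwen3-1.7B-Q4_0.gguf', opts: { stats: true } },
  { ctx_size: '4096', gpu_layers: '99' }
)
```

The result can be passed straight to `new LlmLlamacpp(migrated)`. The old `loader` field is simply dropped, since nothing in the new constructor consumes it.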

### `BaseInference` inheritance and `WeightsProvider` removed

`LlmLlamacpp` no longer extends `BaseInference` and no longer touches the `WeightsProvider` download layer. The class composes `createJobHandler` and `exclusiveRunQueue` from `@qvac/infer-base@^0.4.0` directly. Public lifecycle methods (`load` / `run` / `finetune` / `pause` / `cancel` / `unload` / `getState`) are unchanged in shape, but `downloadWeights` and the loader-based progress callbacks are gone; the caller is responsible for placing files on disk before constructing the model.

In-memory streaming from network sources (URLs, Hyperdrive), previously possible through the `Loader` abstraction, is no longer supported by the current API. The SDK does not currently use it (models are stored to disk first); it can be re-added if the SDK later plans to support that feature.
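
Because file placement is now the caller's job, a pre-flight check before construction can surface missing downloads early. The following is a minimal sketch, assuming a hypothetical `findMissingFiles` helper that is not part of the package. It uses Node's `fs`; under Bare the equivalent would be `bare-fs`, assuming it offers the same call.

```js
const fs = require('fs') // assumption: under Bare, bare-fs would be used instead

// Hypothetical pre-flight check (not part of @qvac/llm-llamacpp):
// returns every configured path that does not exist on disk.
function findMissingFiles (files) {
  const all = [...files.model]
  if (files.projectionModel) all.push(files.projectionModel)
  return all.filter(filePath => !fs.existsSync(filePath))
}
```

An empty result means every path in `files.model` and `files.projectionModel` is present; anything else lists what still has to be downloaded before calling the constructor.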

### Dependency changes

- `@qvac/infer-base` bumped from `^0.3.0` to `^0.4.0`.
- `bare-fs` is now a runtime dependency (used to stream shards from disk).
- `@qvac/dl-base` and `@qvac/dl-filesystem` are no longer used by this package and have been removed from `devDependencies`.

### `getState()` returns a narrower shape

`getState()` previously returned `{ configLoaded, weightsLoaded, destroyed }` (the three-field shape inherited from `BaseInference`). It now returns `{ configLoaded }` only. The `weightsLoaded` and `destroyed` fields are gone: `weightsLoaded` collapsed into `configLoaded` because the refactored `load()` does both in one step, and `destroyed` is no longer tracked since `unload()` resets `configLoaded` and nulls the addon handle instead. Callers reading `state.weightsLoaded` or `state.destroyed` must switch to `state.configLoaded`.
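
Code that has to run against both the old and the new package versions during a migration could normalize the two state shapes. This is a sketch only; `isReady` is a hypothetical helper, and treating "ready" in the old shape as "config and weights loaded and not destroyed" is an assumption.

```js
// Hypothetical helper: reports readiness for both the old three-field
// state shape and the new `{ configLoaded }` shape.
function isReady (state) {
  if ('weightsLoaded' in state || 'destroyed' in state) {
    // Old shape inherited from BaseInference
    return Boolean(state.configLoaded && state.weightsLoaded && !state.destroyed)
  }
  // New shape: a single readiness flag
  return Boolean(state.configLoaded)
}
```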

### Public methods removed from `LlmLlamacpp`

`LlmLlamacpp` previously exposed these methods via `BaseInference` inheritance, all of which are now gone:

- `downloadWeights(onDownloadProgress, opts)`: the download layer is removed; the caller places files on disk and passes absolute paths in `files.model` / `files.projectionModel`.
- `unpause()` / `stop()`: BaseInference job-lifecycle helpers. The refactor still exposes `pause()` and `cancel()`; `unpause` is superseded by issuing a new `run()` after `cancel()`.
- `status()`: replaced by `getState()` for the static readiness flag; per-job state is observed via the `QvacResponse` returned by `run()`.
- `destroy()`: folded into `unload()`, which now both releases native resources and nulls `this.addon`.
- `getApiDefinition()`: no longer exposed; consumers should import types from `index.d.ts`.

### `load()` takes no arguments

`load()` previously forwarded `...args` through `BaseInference.load` into LLM's `_load(closeLoader, onDownloadProgress)`. Both arguments are gone: `closeLoader` is meaningless without a `Loader`, and `onDownloadProgress` is superseded by the caller owning download-and-placement before construction. Call `await model.load()` with no arguments.

### Type exports removed from `index.d.ts`

The following exports are no longer part of the package's public type surface because the loader/download layer they described is gone: `ReportProgressCallback`, `Loader`, `DownloadWeightsOptions`, `DownloadResult`. TypeScript consumers importing any of these must update to the new `LlmLlamacppArgs` / `files` shape.

## Features

### Constructor input validation

The constructor now throws `TypeError('files.model must be a non-empty array of absolute paths')` when `files` or `files.model` is missing or empty. This produces a clear error for callers porting old code instead of a confusing `Cannot read properties of undefined`.

### `run()`-before-`load()` guard

Calling `run()` before `load()` now throws `Error('Addon not initialized. Call load() first.')` instead of dereferencing `null` and crashing. `finetune()` has had this guard since the previous release.

### `load()` is now idempotent when already loaded

A second `load()` call on an already-loaded instance is now a silent no-op instead of unloading and reloading. This aligns with the ReadyResource pattern used elsewhere in QVAC and prevents accidental double-loads from triggering expensive work. Callers that intentionally want to swap weights must call `unload()` first (which clears `configLoaded`) and then `load()` again.

### Crash-safe shard streaming

If `_streamShards()` or `addon.activate()` throws mid-load (for example a corrupted shard file or a native init failure), the partially-initialized addon is now best-effort-unloaded and `this.addon` is reset to `null`. A subsequent `load()` call starts cleanly instead of leaking a zombie native instance.
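
The cleanup pattern described above can be sketched as follows. This is an illustration, not the package's source: `createAddon` and `streamShards` are hypothetical injected stand-ins for the native construction and shard-streaming steps.

```js
// Illustrative sketch only. `createAddon` and `streamShards` are
// hypothetical stand-ins for the native construction and streaming steps.
class CrashSafeLoader {
  constructor ({ createAddon, streamShards }) {
    this._createAddon = createAddon
    this._streamShards = streamShards
    this.addon = null
    this.configLoaded = false
  }

  async load () {
    if (this.configLoaded) return
    try {
      // Construction sits inside the try block so a throwing constructor
      // is handled the same way as a failed stream or activation
      this.addon = this._createAddon()
      await this._streamShards(this.addon)
      await this.addon.activate()
      this.configLoaded = true
    } catch (err) {
      // Best-effort cleanup: a secondary failure in unload() is ignored
      try { await this.addon?.unload() } catch (_) {}
      this.addon = null
      throw err
    }
  }
}
```

The original error is rethrown after cleanup, so the caller still sees why the load failed, while `addon` is back to `null` and a later `load()` starts from a clean state.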

### Restored JSDoc on `FinetuneOptions`

Every `FinetuneOptions` field carries a `/** … */` doc comment again, including the default values (`numberOfEpochs = 1`, `learningRate = 1e-4`, `batchSize = 128`, …) so IDE tooltips show them without needing to read `docs/finetuning.md`.

## Bug Fixes

### `unload()` clears the addon reference

`unload()` now sets `this.addon = null` after `await this.addon.unload()`, so post-unload `cancel()` / `pause()` / `run()` calls hit the explicit guards rather than dereferencing a disposed native handle. `pause()`, `cancel()`, and the job-handler cancel closure all use optional chaining for the same reason.

### Removed dead `_isSuppressedNoResponseLog` filter

The `_createFilteredLogger` infrastructure that wrapped the user-supplied logger to swallow `'No response found for job'` warnings was tied to the old `BaseInference` `_jobToResponse` Map. The new architecture cannot emit that message at all, so the filter, the wrapped logger, and the `_originalLogger` indirection are all removed. The user-supplied logger is now used directly.

### `load()` is serialized through the exclusive run queue

`load()` is now routed through the same `exclusiveRunQueue` used by `run()`, `finetune()`, and `unload()`. Previously two overlapping `load()` calls on the same instance could both pass the `configLoaded` guard before it flipped to `true`, both stream shards into and activate the native addon, and clobber `this.addon`, leaking one native handle. Concurrent `load()` on a single instance is now safe.
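
To see why serialization closes the race, here is a minimal sketch of a promise-chain queue combined with the idempotent guard. `createExclusiveQueue` is a hypothetical stand-in written for this illustration; it is not the `exclusiveRunQueue` implementation from `@qvac/infer-base`, whose internals are not shown in this changelog.

```js
// Hypothetical stand-in for an exclusive run queue: each task starts only
// after the previous one has settled.
function createExclusiveQueue () {
  let tail = Promise.resolve()
  return function run (task) {
    const result = tail.then(task)
    // Keep the chain usable even if a task rejects
    tail = result.catch(() => {})
    return result
  }
}

class SerializedModel {
  constructor (doLoad) {
    this._run = createExclusiveQueue()
    this._doLoad = doLoad // hypothetical: the expensive load work
    this.configLoaded = false
  }

  load () {
    return this._run(async () => {
      if (this.configLoaded) return // idempotent: already loaded
      await this._doLoad()
      this.configLoaded = true
    })
  }
}
```

With this shape, the second of two overlapping `load()` calls only evaluates the `configLoaded` guard after the first has finished, so the expensive work runs once.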

### Constructor rejects non-absolute path entries

Each entry in `files.model` is now validated with `path.isAbsolute()` (matching the existing error-message contract), and the same check now applies to the optional `files.projectionModel`, which previously had no validation at all. Relative paths are rejected at construction time instead of bubbling up from `bare-fs` or the native load.
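
Taken together with the constructor input validation listed under Features, the checks amount to something like the sketch below. `validateFiles` is a hypothetical function written for illustration, not the package's code. The `files.model` message is the one quoted in this changelog; the `files.projectionModel` message is an assumption, since the changelog does not state it.

```js
const path = require('path') // under Bare, require('bare-path') would take its place

// Illustrative sketch of the described constructor checks.
function validateFiles (files) {
  const message = 'files.model must be a non-empty array of absolute paths'
  if (!files || !Array.isArray(files.model) || files.model.length === 0) {
    throw new TypeError(message)
  }
  for (const entry of files.model) {
    if (typeof entry !== 'string' || !path.isAbsolute(entry)) {
      throw new TypeError(message)
    }
  }
  const projection = files.projectionModel
  if (projection !== undefined &&
      (typeof projection !== 'string' || !path.isAbsolute(projection))) {
    // Assumed wording: the changelog does not quote this message
    throw new TypeError('files.projectionModel must be an absolute path')
  }
}
```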

## Pull Requests

- [#1494](https://github.com/tetherto/qvac/pull/1494) - chore[bc]: LLM addon interface refactor - remove BaseInference and WeightsProvider

## [0.15.0] - 2026-04-09

### Breaking Changes
162 changes: 84 additions & 78 deletions packages/qvac-lib-infer-llamacpp-llm/README.md
@@ -8,13 +8,13 @@ This native C++ addon, built using the `Bare` Runtime, simplifies running Large
- [Building from Source](#building-from-source)
- [Usage](#usage)
- [1. Import the Model Class](#1-import-the-model-class)
- [2. Create a Data Loader](#2-create-a-data-loader)
- [3. Create the `args` obj](#3-create-the-args-obj)
- [4. Create the `config` obj](#4-create-the-config-obj)
- [5. Create Model Instance](#5-create-model-instance)
- [6. Load Model](#6-load-model)
- [7. Run Inference](#7-run-inference)
- [8. Release Resources](#8-release-resources)
- [2. Create the `args` obj](#2-create-the-args-obj)
- [Sharded models](#sharded-models)
- [3. Create the `config` obj](#3-create-the-config-obj)
- [4. Create Model Instance](#4-create-model-instance)
- [5. Load Model](#5-load-model)
- [6. Run Inference](#6-run-inference)
- [7. Release Resources](#7-release-resources)
- [API behavior by state](#api-behavior-by-state)
- [Fine-tuning](#fine-tuning)
- [Quickstart Example](#quickstart-example)
@@ -72,47 +72,77 @@ See [build.md](./build.md) for detailed instructions on how to build the addon f

```js
const LlmLlamacpp = require('@qvac/llm-llamacpp')
const path = require('bare-path')
```

### 2. Create a Data Loader

Data Loaders abstract the way model files are accessed. Use a [`FileSystemDataLoader`](../dl-filesystem) to load model files from your local file system. Models can be downloaded directly from HuggingFace.
### 2. Create the `args` obj

```js
const FilesystemDL = require('@qvac/dl-filesystem')

// Download model from HuggingFace (see examples/utils.js for downloadModel helper)
const [modelName, dirPath] = await downloadModel(
'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf',
'Llama-3.2-1B-Instruct-Q4_0.gguf'
)

const fsDL = new FilesystemDL({ dirPath })
```

### 3. Create the `args` obj
const dirPath = path.resolve('./models')
const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'

```js
const args = {
loader: fsDL,
files: {
model: [path.join(dirPath, modelName)]
// projectionModel: path.join(dirPath, 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf') // for multimodal support pass the projection model path
},
config,
opts: { stats: true },
logger: console,
diskPath: dirPath,
modelName,
// projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // for multimodal support you need to pass the projection model name
logger: console
}
```

The `args` obj contains the following properties:

* `loader`: The Data Loader instance from which the model file will be streamed.
* `logger`: This property is used to create a [`QvacLogger`](../logging) instance, which handles all logging functionality.
* `files.model`: Required. An array of absolute paths to the GGUF model file(s) to load. The caller is responsible for passing the complete set of files for the model, including every shard and the `.tensors.txt` companion for multi-shard models (see [Sharded models](#sharded-models) below).
* `files.projectionModel`: Optional. Absolute path to the projection model file. This is required for multimodal support.
* `config`: The model configuration object (see next section).
* `logger`: This property is used to create a [`QvacLogger`](../logging) instance, which handles all logging functionality.
* `opts.stats`: This flag determines whether to calculate inference stats.
* `diskPath`: The local directory where the model file will be downloaded to.
* `modelName`: The name of model file in the Data Loader.
* `projectionModel`: The name of the projection model file in the Data Loader. This is required for multimodal support.

### 4. Create the `config` obj
#### Sharded models

The addon no longer expands sharded models internally. If you are loading a multi-shard GGUF model, **the caller MUST pass every file**, including the `.tensors.txt` companion file that lives alongside the shards, in `files.model`. Anything missing will cause the addon to fail during weight streaming.

**Required ordering for multi-shard models:**
1. The `.tensors.txt` companion file **first**.
2. Each `*-NNNNN-of-MMMMM.gguf` shard in **numerical order** (shard `00001` before `00002`, and so on).

Example of loading a 5-shard model:

```js
const path = require('bare-path')
const LlmLlamacpp = require('@qvac/llm-llamacpp')

const dir = path.resolve('./models')
const modelBase = 'my-big-model-Q4_K_M'

const model = new LlmLlamacpp({
  files: {
    model: [
      path.join(dir, `${modelBase}.tensors.txt`),
      path.join(dir, `${modelBase}-00001-of-00005.gguf`),
      path.join(dir, `${modelBase}-00002-of-00005.gguf`),
      path.join(dir, `${modelBase}-00003-of-00005.gguf`),
      path.join(dir, `${modelBase}-00004-of-00005.gguf`),
      path.join(dir, `${modelBase}-00005-of-00005.gguf`)
    ]
  },
  config,
  logger: console,
  opts: { stats: true }
})

await model.load()
```

For single-file GGUF models, pass a one-element array:

```js
files: { model: [path.join(dir, 'Llama-3.2-1B-Instruct-Q4_0.gguf')] }
```
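
When the file names come from a directory listing rather than being written out by hand, the required ordering can be produced programmatically. The helper below is a hypothetical sketch, not part of the addon. It assumes the five-digit `-NNNNN-of-MMMMM.gguf` suffix shown above, and it only checks the shard count against `MMMMM`; it does not detect duplicate or skipped shard numbers.

```js
const path = require('path') // under Bare, require('bare-path') would take its place

// Matches the `-NNNNN-of-MMMMM.gguf` suffix (five-digit numbering assumed)
const SHARD_RE = /-(\d{5})-of-(\d{5})\.gguf$/

// Hypothetical helper, not part of @qvac/llm-llamacpp: returns absolute
// paths with the .tensors.txt companion first, then shards in numerical order.
function orderShardedFiles (dir, fileNames) {
  const companions = fileNames.filter(name => name.endsWith('.tensors.txt'))
  if (companions.length !== 1) {
    throw new Error('expected exactly one .tensors.txt companion file')
  }
  const shardIndex = name => Number(SHARD_RE.exec(name)[1])
  const shards = fileNames
    .filter(name => SHARD_RE.test(name))
    .sort((a, b) => shardIndex(a) - shardIndex(b))
  if (shards.length === 0) throw new Error('no shard files found')
  const total = Number(SHARD_RE.exec(shards[0])[2])
  if (shards.length !== total) {
    throw new Error(`expected ${total} shards, found ${shards.length}`)
  }
  return [...companions, ...shards].map(name => path.join(dir, name))
}
```

The returned array can be assigned to `files.model` as long as `dir` is an absolute path.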

### 3. Create the `config` obj

The `config` obj consists of a set of hyper-parameters which can be used to tweak the behaviour of the model.
*All parameters must be strings.*
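
If configuration values originate as numbers (for example from a JSON settings file), they can be converted before being handed to the constructor. `toStringConfig` below is a hypothetical convenience, not something the package provides.

```js
// Hypothetical helper: converts every config value to a string,
// since the addon expects string-valued parameters.
function toStringConfig (config) {
  return Object.fromEntries(
    Object.entries(config).map(([key, value]) => [key, String(value)])
  )
}

const stringConfig = toStringConfig({ ctx_size: 4096, gpu_layers: 99, temp: 0.1 })
// stringConfig is { ctx_size: '4096', gpu_layers: '99', temp: '0.1' }
```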
@@ -159,43 +189,21 @@ const config = {
| System with both | ✅ Uses dedicated GPU (preferred) | ✅ Uses dedicated GPU | ✅ Uses integrated GPU |


### 5. Create Model Instance
### 4. Create Model Instance

```js
const model = new LlmLlamacpp(args, config)
const model = new LlmLlamacpp(args)
```

### 6. Load Model
### 5. Load Model

```js
await model.load()
```

_Optionally_ you can pass the following parameters to tweak the loading behaviour.
* `close?`: This boolean value determines whether to close the Data Loader after loading. Defaults to `true`
* `reportProgressCallback?`: A callback function which gets called periodically with progress updates. It can be used to display overall progress percentage.
Loads the model file(s) passed in `files.model` and activates the native addon. If a projection model was provided (`files.projectionModel`), it is loaded as part of the same step.

_For example:_

```js
await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
```

**Progress Callback Data**

The progress callback receives an object with the following properties:

| Property | Type | Description |
|---------------------|--------|-----------------------------------------|
| `action` | string | Current operation being performed |
| `totalSize` | number | Total bytes to be loaded |
| `totalFiles` | number | Total number of files to process |
| `filesProcessed` | number | Number of files completed so far |
| `currentFile` | string | Name of file currently being processed |
| `currentFileProgress` | string | Percentage progress on current file |
| `overallProgress` | string | Overall loading progress percentage |

### 7. Run Inference
### 6. Run Inference

Pass an array of messages (following the chat completion format) to the `run` method. Process the generated tokens asynchronously:

@@ -227,14 +235,13 @@ try {

When `opts.stats` is enabled, `response.stats` includes runtime metrics such as `TTFT`, `TPS`, token counters, and `backendDevice` (`"cpu"` or `"gpu"`). `backendDevice` reflects the resolved device used at runtime after backend selection/fallback logic, not only the requested config.

### 8. Release Resources
### 7. Release Resources

Unload the model when finished:

```javascript
try {
await model.unload()
await fsDL.close()
} catch (error) {
console.error('Failed to unload model:', error)
}
@@ -341,24 +348,24 @@ In addition to ONNX-based OCR (`@qvac/ocr-onnx`), you can use vision-language mo

```js
const LlmLlamacpp = require('@qvac/llm-llamacpp')
const FilesystemDL = require('@qvac/dl-filesystem')
const fs = require('bare-fs')
const path = require('bare-path')

const dirPath = './models'
const loader = new FilesystemDL({ dirPath })
const dirPath = path.resolve('./models')

const model = new LlmLlamacpp({
modelName: 'LightOnOCR-2-1B-ocr-soup-Q4_K_M.gguf',
loader,
logger: console,
diskPath: dirPath,
projectionModel: 'mmproj-F16.gguf'
}, {
device: 'cpu',
gpu_layers: '0',
ctx_size: '4096',
temp: '0.1',
predict: '2048'
files: {
model: [path.join(dirPath, 'LightOnOCR-2-1B-ocr-soup-Q4_K_M.gguf')],
projectionModel: path.join(dirPath, 'mmproj-F16.gguf')
},
config: {
device: 'cpu',
gpu_layers: '0',
ctx_size: '4096',
temp: '0.1',
predict: '2048'
},
logger: console
})

await model.load()
@@ -382,7 +389,6 @@ await response.await()
console.log(output.join(''))

await model.unload()
await loader.close()
```

## Architecture