tetherto · gianni-cor · Apr 17, 2026 · Apr 8, 2026 · Apr 9, 2026 · Apr 9, 2026
@@ -115,6 +115,15 @@ jobs:
         working-directory: packages/qvac-lib-infer-llamacpp-llm
         run: npm run test:dts
 
+      - name: Run lint and unit tests
+        id: run_lint_and_unit_tests
+        uses: tetherto/oss-actions/.github/actions/run-lint-and-unit-tests@4c64bed91fc8eba3a201adb1495e61b4c1a2246d
+        with:
+          gpr-token: ${{ secrets.GITHUB_TOKEN }}
+          pat-token: ${{ secrets.GITHUB_TOKEN }}
+          registry-type: gpr
+          workdir: packages/qvac-lib-infer-llamacpp-llm
+
   prebuild:
     needs: [authorize, sanity-checks]
     if: needs.authorize.outputs.allowed == 'true'

@@ -1,5 +1,135 @@
 # Changelog
 
+## [0.16.0] - 2026-04-14
+
+This release migrates the LLM addon off `BaseInference` inheritance and the `WeightsProvider` download layer onto the composable `createJobHandler` + `exclusiveRunQueue` utilities from `@qvac/infer-base@^0.4.0`. The constructor signature is replaced with a single object whose `files.model` field is an ordered array of absolute paths and `files.projectionModel` is an optional absolute path for multimodal models. This is a breaking change — every caller must update.
+
+## Breaking Changes
+
+### Constructor signature: single object with `files`, no `Loader`
+
+`LlmLlamacpp` now takes a single `{ files, config, logger?, opts? }` object. The old `Loader` + `diskPath` + `modelName` + two-arg `(args, config)` shape is gone — callers pre-resolve absolute paths and supply them as `files.model`.
+
+```js
+// BEFORE (≤ 0.15.x)
+const FilesystemDL = require('@qvac/dl-filesystem')
+const loader = new FilesystemDL({ dirPath: '/models' })
+const model = new LlmLlamacpp({
+  loader,
+  modelName: 'Qwen3-1.7B-Q4_0.gguf',
+  diskPath: '/models',
+  logger: console,
+  opts: { stats: true }
+}, { ctx_size: '4096', gpu_layers: '99' })
+
+// AFTER (0.16.0)
+const model = new LlmLlamacpp({
+  files: {
+    model: ['/models/Qwen3-1.7B-Q4_0.gguf']
+  },
+  config: { ctx_size: '4096', gpu_layers: '99' },
+  logger: console,
+  opts: { stats: true }
+})
+```
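Reviewer note: for callers with many call sites, the BEFORE/AFTER mapping above could be wrapped in a small adapter. The sketch below is only an illustration; `toNewArgs` is a hypothetical helper, not something the package exports. It uses Node's `path` so it can run standalone (under Bare, `bare-path` is assumed to provide the same `resolve`). It covers single-file models only, since sharded models need the explicit ordered list.

```js
// Hypothetical adapter, not part of @qvac/llm-llamacpp.
// Maps the old (args, config) inputs onto the 0.16.0 single-object shape.
const path = require('path') // assumption: swap for 'bare-path' under Bare

function toNewArgs (oldArgs, config) {
  const { diskPath, modelName, projectionModel, logger, opts } = oldArgs

  // Old API: file names relative to diskPath. New API: absolute paths.
  const files = { model: [path.resolve(diskPath, modelName)] }
  if (projectionModel) {
    files.projectionModel = path.resolve(diskPath, projectionModel)
  }

  // The old `loader` is intentionally dropped; 0.16.0 has no Loader.
  const args = { files, config }
  if (logger) args.logger = logger
  if (opts) args.opts = opts
  return args
}

const migrated = toNewArgs(
  { diskPath: '/models', modelName: 'Qwen3-1.7B-Q4_0.gguf', opts: { stats: true } },
  { ctx_size: '4096', gpu_layers: '99' }
)
console.log(migrated.files.model)
```

The result can be passed straight to `new LlmLlamacpp(migrated)`, provided the files are already on disk.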
+
+For sharded models the caller passes the full ordered list — the `<basename>.tensors.txt` companion first, followed by every `<basename>-NNNNN-of-MMMMM.gguf` shard in ascending order. For multimodal models, `files.projectionModel` carries the absolute path to the mmproj file:
+
+```js
+const model = new LlmLlamacpp({
+  files: {
+    model: [
+      '/models/medgemma-4b-it-Q4_1.tensors.txt',
+      '/models/medgemma-4b-it-Q4_1-00001-of-00005.gguf',
+      '/models/medgemma-4b-it-Q4_1-00002-of-00005.gguf',
+      '/models/medgemma-4b-it-Q4_1-00003-of-00005.gguf',
+      '/models/medgemma-4b-it-Q4_1-00004-of-00005.gguf',
+      '/models/medgemma-4b-it-Q4_1-00005-of-00005.gguf'
+    ],
+    projectionModel: '/models/mmproj-model-f16.gguf'
+  },
+  config: { gpu_layers: '99' }
+})
+```
+
+### `BaseInference` inheritance and `WeightsProvider` removed
+
+`LlmLlamacpp` no longer extends `BaseInference` and no longer touches the `WeightsProvider` download layer. The class composes `createJobHandler` and `exclusiveRunQueue` from `@qvac/infer-base@^0.4.0` directly. Public lifecycle methods (`load` / `run` / `finetune` / `pause` / `cancel` / `unload` / `getState`) are unchanged in shape, but `downloadWeights` and the loader-based progress callbacks are gone — the caller is responsible for placing files on disk before constructing the model.
+
+In-memory streaming from network sources (URLs, Hyperdrive), previously possible through the `Loader` abstraction, is no longer supported. The SDK does not currently use it (models are stored to disk first), so it can be re-added if the SDK later plans to support that feature.
+
+### Dependency changes
+
+- `@qvac/infer-base` bumped from `^0.3.0` to `^0.4.0`.
+- `bare-fs` is now a runtime dependency (used to stream shards from disk).
+- `@qvac/dl-base` and `@qvac/dl-filesystem` are no longer used by this package and have been removed from `devDependencies`.
+
+### `getState()` returns a narrower shape
+
+`getState()` previously returned `{ configLoaded, weightsLoaded, destroyed }` (the three-field shape inherited from `BaseInference`). It now returns `{ configLoaded }` only. The `weightsLoaded` and `destroyed` fields are gone — `weightsLoaded` collapsed into `configLoaded` because the refactored `load()` does both in one step, and `destroyed` is no longer tracked since `unload()` resets `configLoaded` and nulls the addon handle instead. Callers reading `state.weightsLoaded` or `state.destroyed` must switch to `state.configLoaded`.
+
+### Public methods removed from `LlmLlamacpp`
+
+`LlmLlamacpp` previously exposed these methods via `BaseInference` inheritance, all of which are now gone:
+
+- `downloadWeights(onDownloadProgress, opts)` — the download layer is removed; the caller places files on disk and passes absolute paths in `files.model` / `files.projectionModel`.
+- `unpause()` / `stop()` — BaseInference job-lifecycle helpers. The refactor still exposes `pause()` and `cancel()`; `unpause` is superseded by issuing a new `run()` after `cancel()`.
+- `status()` — replaced by `getState()` for the static readiness flag; per-job state is observed via the `QvacResponse` returned by `run()`.
+- `destroy()` — folded into `unload()`, which now both releases native resources and nulls `this.addon`.
+- `getApiDefinition()` — no longer exposed; consumers should import types from `index.d.ts`.
+
+### `load()` takes no arguments
+
+`load()` previously forwarded `...args` through `BaseInference.load` into LLM's `_load(closeLoader, onDownloadProgress)`. Both arguments are gone — `closeLoader` is meaningless without a `Loader`, and `onDownloadProgress` is superseded by the caller owning download-and-placement before construction. Call `await model.load()` with no arguments.
+
+### Type exports removed from `index.d.ts`
+
+The following exports are no longer part of the package's public type surface because the loader/download layer they described is gone: `ReportProgressCallback`, `Loader`, `DownloadWeightsOptions`, `DownloadResult`. TypeScript consumers importing any of these must update to the new `LlmLlamacppArgs` / `files` shape.
+
+## Features
+
+### Constructor input validation
+
+The constructor now throws `TypeError('files.model must be a non-empty array of absolute paths')` when `files` or `files.model` is missing or empty. This produces a clear error for callers porting old code instead of a confusing `Cannot read properties of undefined`.
+
+### `run()`-before-`load()` guard
+
+Calling `run()` before `load()` now throws `Error('Addon not initialized. Call load() first.')` instead of dereferencing `null` and crashing. `finetune()` has had this guard since the previous release.
+
+### `load()` is now idempotent when already loaded
+
+A second `load()` call on an already-loaded instance is now a silent no-op instead of unloading and reloading. This aligns with the ReadyResource pattern used elsewhere in QVAC and prevents accidental double-loads from triggering expensive work. Callers that intentionally want to swap weights must call `unload()` first (which clears `configLoaded`) and then `load()` again.
+
+### Crash-safe shard streaming
+
+If `_streamShards()` or `addon.activate()` throws mid-load (for example a corrupted shard file or a native init failure), the partially-initialized addon is now best-effort-unloaded and `this.addon` is reset to `null`. A subsequent `load()` call starts cleanly instead of leaking a zombie native instance.
+
+### Restored JSDoc on `FinetuneOptions`
+
+Every `FinetuneOptions` field carries a `/** … */` doc comment again, including the default values (`numberOfEpochs = 1`, `learningRate = 1e-4`, `batchSize = 128`, …) so IDE tooltips show them without needing to read `docs/finetuning.md`.
+
+## Bug Fixes
+
+### `unload()` clears the addon reference
+
+`unload()` now sets `this.addon = null` after `await this.addon.unload()`, so post-unload `cancel()` / `pause()` / `run()` calls hit the explicit guards rather than dereferencing a disposed native handle. `pause()`, `cancel()`, and the job-handler cancel closure all use optional chaining for the same reason.
+
+### Removed dead `_isSuppressedNoResponseLog` filter
+
+The `_createFilteredLogger` infrastructure that wrapped the user-supplied logger to swallow `'No response found for job'` warnings was tied to the old `BaseInference` `_jobToResponse` Map. The new architecture cannot emit that message at all, so the filter, the wrapped logger, and the `_originalLogger` indirection are all removed. The user-supplied logger is now used directly.
+
+### `load()` is serialized through the exclusive run queue
+
+`load()` is now routed through the same `exclusiveRunQueue` used by `run()`, `finetune()`, and `unload()`. Previously two overlapping `load()` calls on the same instance could both pass the `configLoaded` guard before it flipped to `true`, both stream shards into and activate the native addon, and clobber `this.addon` — leaking one native handle. Concurrent `load()` on a single instance is now safe.
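Reviewer note: the race described above can be reproduced with a toy model. The sketch below is an assumption-laden illustration. `createExclusiveQueue` is a minimal promise-chain queue written for this example, and the real `exclusiveRunQueue` in `@qvac/infer-base` may work differently. `createFakeModel` stands in for the addon and only counts activations.

```js
// Illustrative only: a minimal promise-chain queue (hypothetical name).
function createExclusiveQueue () {
  let tail = Promise.resolve()
  return function runExclusive (task) {
    const result = tail.then(() => task())
    // Keep the chain usable even if a task rejects.
    tail = result.catch(() => {})
    return result
  }
}

// Stand-in for a model whose load() checks a flag, does slow work, then sets it.
function createFakeModel () {
  const runExclusive = createExclusiveQueue()
  const state = { configLoaded: false, activations: 0 }

  async function loadUnsafe () {
    if (state.configLoaded) return // guard checked before the slow part
    await new Promise(resolve => setTimeout(resolve, 10)) // simulated shard streaming
    state.activations++
    state.configLoaded = true
  }

  return { state, loadUnsafe, load: () => runExclusive(loadUnsafe) }
}

async function demo () {
  const unsafe = createFakeModel()
  await Promise.all([unsafe.loadUnsafe(), unsafe.loadUnsafe()])

  const safe = createFakeModel()
  await Promise.all([safe.load(), safe.load()])

  return {
    unsafeActivations: unsafe.state.activations,
    safeActivations: safe.state.activations
  }
}

const demoResult = demo()
demoResult.then(result => console.log(result))
// prints { unsafeActivations: 2, safeActivations: 1 }
```

Without the queue both calls pass the guard and activate twice; with it the second call runs only after the first has flipped the flag, so it returns early.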
+
+### Constructor rejects non-absolute path entries
+
+Each entry in `files.model` is now validated with `path.isAbsolute()` (matching the existing error-message contract), and the same check now applies to the optional `files.projectionModel` — previously it had no validation at all. Relative paths are rejected at construction time instead of bubbling up from `bare-fs` or the native load.
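Reviewer note: callers who want to fail even earlier (for example, before building the rest of the args object) could run an equivalent pre-check themselves. This is a sketch of the rules as described in this changelog, not the package's actual implementation. `assertFiles` is a hypothetical name, and the `projectionModel` error message is invented for illustration; only the `files.model` message comes from the changelog. Node's `path` is used so the snippet runs standalone.

```js
const path = require('path') // assumption: 'bare-path' under Bare

// Hypothetical caller-side pre-check mirroring the documented constructor rules.
function assertFiles (files) {
  const message = 'files.model must be a non-empty array of absolute paths'

  if (!files || !Array.isArray(files.model) || files.model.length === 0) {
    throw new TypeError(message)
  }
  for (const entry of files.model) {
    if (typeof entry !== 'string' || !path.isAbsolute(entry)) {
      throw new TypeError(message)
    }
  }

  // projectionModel is optional, but must be absolute when present.
  const projection = files.projectionModel
  if (projection !== undefined &&
      (typeof projection !== 'string' || !path.isAbsolute(projection))) {
    // Message text is an assumption, not taken from the package.
    throw new TypeError('files.projectionModel must be an absolute path')
  }
}

try {
  assertFiles({ model: ['models/Qwen3-1.7B-Q4_0.gguf'] }) // relative path
} catch (error) {
  console.log(error.message)
  // prints: files.model must be a non-empty array of absolute paths
}
```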
+
+## Pull Requests
+
+- [#1494](https://github.com/tetherto/qvac/pull/1494) - chore[bc]: LLM addon interface refactor — remove BaseInference and WeightsProvider
+
 ## [0.15.0] - 2026-04-09
 
 ### Breaking Changes

@@ -8,13 +8,13 @@ This native C++ addon, built using the `Bare` Runtime, simplifies running Large
 - [Building from Source](#building-from-source)
 - [Usage](#usage)
   - [1. Import the Model Class](#1-import-the-model-class)
-  - [2. Create a Data Loader](#2-create-a-data-loader)
-  - [3. Create the `args` obj](#3-create-the-args-obj)
-  - [4. Create the `config` obj](#4-create-the-config-obj)
-  - [5. Create Model Instance](#5-create-model-instance)
-  - [6. Load Model](#6-load-model)
-  - [7. Run Inference](#7-run-inference)
-  - [8. Release Resources](#8-release-resources)
+  - [2. Create the `args` obj](#2-create-the-args-obj)
+    - [Sharded models](#sharded-models)
+  - [3. Create the `config` obj](#3-create-the-config-obj)
+  - [4. Create Model Instance](#4-create-model-instance)
+  - [5. Load Model](#5-load-model)
+  - [6. Run Inference](#6-run-inference)
+  - [7. Release Resources](#7-release-resources)
 - [API behavior by state](#api-behavior-by-state)
 - [Fine-tuning](#fine-tuning)
 - [Quickstart Example](#quickstart-example)
@@ -72,47 +72,77 @@ See [build.md](./build.md) for detailed instructions on how to build the addon f
 
 ```js
 const LlmLlamacpp = require('@qvac/llm-llamacpp')
+const path = require('bare-path')
 ```
 
-### 2. Create a Data Loader
-
-Data Loaders abstract the way model files are accessed. Use a [`FileSystemDataLoader`](../dl-filesystem) to load model files from your local file system. Models can be downloaded directly from HuggingFace.
+### 2. Create the `args` obj
 
 ```js
-const FilesystemDL = require('@qvac/dl-filesystem')
-
-// Download model from HuggingFace (see examples/utils.js for downloadModel helper)
-const [modelName, dirPath] = await downloadModel(
-  'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf',
-  'Llama-3.2-1B-Instruct-Q4_0.gguf'
-)
-
-const fsDL = new FilesystemDL({ dirPath })
-```
-
-### 3. Create the `args` obj
+const dirPath = path.resolve('./models')
+const modelName = 'Llama-3.2-1B-Instruct-Q4_0.gguf'
 
-```js
 const args = {
-  loader: fsDL,
+  files: {
+    model: [path.join(dirPath, modelName)]
+    // projectionModel: path.join(dirPath, 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf') // for multimodal support pass the projection model path
+  },
+  config, // the `config` object from step 3; define it before building `args`
   opts: { stats: true },
-  logger: console,
-  diskPath: dirPath,
-  modelName,
-  // projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // for multimodal support you need to pass the projection model name
+  logger: console
 }
 ```
 
 The `args` obj contains the following properties:
 
-* `loader`: The Data Loader instance from which the model file will be streamed.
-* `logger`: This property is used to create a [`QvacLogger`](../logging) instance, which handles all logging functionality. 
+* `files.model`: Required. An array of absolute paths to the GGUF model file(s) to load. The caller is responsible for passing the complete set of files for the model, including every shard and the `.tensors.txt` companion for multi-shard models (see [Sharded models](#sharded-models) below).
+* `files.projectionModel`: Optional for text-only models; required for multimodal support. Absolute path to the projection model file.
+* `config`: The model configuration object (see next section).
+* `logger`: This property is used to create a [`QvacLogger`](../logging) instance, which handles all logging functionality.
 * `opts.stats`: This flag determines whether to calculate inference stats.
-* `diskPath`: The local directory where the model file will be downloaded to.
-* `modelName`: The name of model file in the Data Loader.
-* `projectionModel`: The name of the projection model file in the Data Loader. This is required for multimodal support.
 
-### 4. Create the `config` obj
+#### Sharded models
+
+The addon no longer expands sharded models internally. If you are loading a multi-shard GGUF model, **the caller MUST pass every file** — including the `.tensors.txt` companion file that lives alongside the shards — in `files.model`. Anything missing will cause the addon to fail during weight streaming.
+
+**Required ordering for multi-shard models:**
+1. The `.tensors.txt` companion file **first**.
+2. Each `*-NNNNN-of-MMMMM.gguf` shard in **numerical order** (shard `00001` before `00002`, and so on).
+
+Example — loading a 5-shard model:
+
+```js
+const path = require('bare-path')
+const LlmLlamacpp = require('@qvac/llm-llamacpp')
+
+const dir = path.resolve('./models')
+const modelBase = 'my-big-model-Q4_K_M'
+
+const model = new LlmLlamacpp({
+  files: {
+    model: [
+      path.join(dir, `${modelBase}.tensors.txt`),
+      path.join(dir, `${modelBase}-00001-of-00005.gguf`),
+      path.join(dir, `${modelBase}-00002-of-00005.gguf`),
+      path.join(dir, `${modelBase}-00003-of-00005.gguf`),
+      path.join(dir, `${modelBase}-00004-of-00005.gguf`),
+      path.join(dir, `${modelBase}-00005-of-00005.gguf`)
+    ]
+  },
+  config,
+  logger: console,
+  opts: { stats: true }
+})
+
+await model.load()
+```
+
+For single-file GGUF models, pass a one-element array:
+
+```js
+files: { model: [path.join(dir, 'Llama-3.2-1B-Instruct-Q4_0.gguf')] }
+```
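Reviewer note: writing out every shard by hand is error-prone for large models, so the ordering rule above could be generated instead. The helper below is a hypothetical sketch (`shardedModelFiles` is not exported by the package). It assumes the five-digit zero-padded `NNNNN-of-MMMMM` naming shown in the example and uses Node's `path` so it runs standalone; under Bare, `bare-path` is assumed to offer the same functions.

```js
const path = require('path') // assumption: 'bare-path' under Bare

// Hypothetical helper: builds the ordered `files.model` list for a sharded model.
// Order: the .tensors.txt companion first, then shards 00001..N ascending.
function shardedModelFiles (dir, modelBase, shardCount) {
  const pad = n => String(n).padStart(5, '0')
  const files = [path.join(dir, `${modelBase}.tensors.txt`)]
  for (let i = 1; i <= shardCount; i++) {
    files.push(path.join(dir, `${modelBase}-${pad(i)}-of-${pad(shardCount)}.gguf`))
  }
  return files
}

const shardFiles = shardedModelFiles(path.resolve('./models'), 'my-big-model-Q4_K_M', 5)
console.log(shardFiles.map(file => path.basename(file)))
```

The returned array can be passed as `files.model`. The helper does not check that the files exist, so a missing shard would still surface during `load()`.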
+
+### 3. Create the `config` obj
 
 The `config` obj consists of a set of hyper-parameters which can be used to tweak the behaviour of the model.  
 *All parameters must be strings.*
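Reviewer note: since every value has to be a string, settings kept as numbers elsewhere in an application need converting before they reach the constructor. A minimal sketch, assuming a plain `String()` conversion is acceptable for the values involved; `stringifyConfig` is a hypothetical helper, and the keys are the ones used in this README's examples.

```js
// Hypothetical helper: converts config values to strings and skips unset ones.
function stringifyConfig (settings) {
  const config = {}
  for (const [key, value] of Object.entries(settings)) {
    if (value === undefined || value === null) continue // leave unset keys out
    config[key] = String(value)
  }
  return config
}

const exampleConfig = stringifyConfig({
  device: 'cpu',
  gpu_layers: 0,
  ctx_size: 4096,
  temp: 0.1,
  predict: undefined
})
console.log(exampleConfig)
// prints { device: 'cpu', gpu_layers: '0', ctx_size: '4096', temp: '0.1' }
```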
@@ -159,43 +189,21 @@ const config = {
 | System with both                | ✅ Uses dedicated GPU (preferred)     | ✅ Uses dedicated GPU               | ✅ Uses integrated GPU              |
 
 
-### 5. Create Model Instance
+### 4. Create Model Instance
 
 ```js
-const model = new LlmLlamacpp(args, config)
+const model = new LlmLlamacpp(args)
 ```
 
-### 6. Load Model
+### 5. Load Model
 
 ```js
 await model.load()
 ```
 
-_Optionally_ you can pass the following parameters to tweak the loading behaviour.
-* `close?`: This boolean value determines whether to close the Data Loader after loading. Defaults to `true`
-* `reportProgressCallback?`: A callback function which gets called periodically with progress updates. It can be used to display overall progress percentage.
+Loads the model file(s) passed in `files.model` and activates the native addon. If a projection model was provided (`files.projectionModel`), it is loaded as part of the same step.
 
-_For example:_
-
-```js
-await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))
-```
-
-**Progress Callback Data**
-
-The progress callback receives an object with the following properties:
-
-| Property            | Type   | Description                             |
-|---------------------|--------|-----------------------------------------|
-| `action`            | string | Current operation being performed       |
-| `totalSize`         | number | Total bytes to be loaded                |
-| `totalFiles`        | number | Total number of files to process        |
-| `filesProcessed`    | number | Number of files completed so far        |
-| `currentFile`       | string | Name of file currently being processed  |
-| `currentFileProgress` | string | Percentage progress on current file     |
-| `overallProgress`   | string | Overall loading progress percentage     |
-
-### 7. Run Inference
+### 6. Run Inference
 
 Pass an array of messages (following the chat completion format) to the `run` method. Process the generated tokens asynchronously:
 
@@ -227,14 +235,13 @@ try {
 
 When `opts.stats` is enabled, `response.stats` includes runtime metrics such as `TTFT`, `TPS`, token counters, and `backendDevice` (`"cpu"` or `"gpu"`). `backendDevice` reflects the resolved device used at runtime after backend selection/fallback logic, not only the requested config.
 
-### 8. Release Resources
+### 7. Release Resources
 
 Unload the model when finished:
 
 ```javascript
 try {
   await model.unload()
-  await fsDL.close()
 } catch (error) {
   console.error('Failed to unload model:', error)
 }
@@ -341,24 +348,24 @@ In addition to ONNX-based OCR (`@qvac/ocr-onnx`), you can use vision-language mo
 
 ```js
 const LlmLlamacpp = require('@qvac/llm-llamacpp')
-const FilesystemDL = require('@qvac/dl-filesystem')
 const fs = require('bare-fs')
+const path = require('bare-path')
 
-const dirPath = './models'
-const loader = new FilesystemDL({ dirPath })
+const dirPath = path.resolve('./models')
 
 const model = new LlmLlamacpp({
-  modelName: 'LightOnOCR-2-1B-ocr-soup-Q4_K_M.gguf',
-  loader,
-  logger: console,
-  diskPath: dirPath,
-  projectionModel: 'mmproj-F16.gguf'
-}, {
-  device: 'cpu',
-  gpu_layers: '0',
-  ctx_size: '4096',
-  temp: '0.1',
-  predict: '2048'
+  files: {
+    model: [path.join(dirPath, 'LightOnOCR-2-1B-ocr-soup-Q4_K_M.gguf')],
+    projectionModel: path.join(dirPath, 'mmproj-F16.gguf')
+  },
+  config: {
+    device: 'cpu',
+    gpu_layers: '0',
+    ctx_size: '4096',
+    temp: '0.1',
+    predict: '2048'
+  },
+  logger: console
 })
 
 await model.load()
@@ -382,7 +389,6 @@ await response.await()
 console.log(output.join(''))
 
 await model.unload()
-await loader.close()
 ```
 
 ## Architecture