Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ You can load any [`llama.cpp`](https://github.com/ggml-org/llama.cpp)-compatible

## Features

* Event stream: `completion()` exposes a single ordered `events` async iterable plus an aggregated `final` promise. Events are discriminated by `type` — `contentDelta`, `thinkingDelta`, `toolCall`, `toolError`, `completionStats`, `completionDone`, `rawDelta`.
* Event stream: `completion()` exposes a single ordered `events` async iterable plus an aggregated `final` promise. Events are discriminated by `type` — `contentDelta`, `thinkingDelta`, `toolCall`, `toolError`, `completionStats`, `completionDone`, `rawDelta`. The terminal `completionDone` event carries a `stopReason` (e.g. `"eos"`, `"length"`, `"cancelled"`).
* Thinking content: models that emit `<think>` blocks surface them as dedicated `thinkingDelta` events (enable with `captureThinking: true`), so consumers don't have to parse tags from raw text.
* Tool calls: the model emits structured tool calls as `toolCall` events ordered alongside content and thinking in the same stream.
* MCP: plug MCP servers into `completion()` so the model can use external tools (e.g., web search) via the same tool-call mechanism.
1 change: 1 addition & 0 deletions docs/website/content/docs/introduction.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ The JS SDK is cross-platform, type-safe, and pluggable, exposing all QVAC capabi
* [**Profiler:**](/runtime/profiler) measure and export timing metrics across model loading, inference, and P2P delegation.
* [**Download Lifecycle:**](/models/download-lifecycle) pause and resume model downloads.
* [**Runtime lifecycle:**](/runtime/lifecycle) suspend and resume the SDK runtime (e.g., on app background/foreground) and query lifecycle state.
* [**Cancellation:**](/runtime/cancellation) cancel any in-flight inference, model load, or download by `requestId`, or broad-cancel by `modelId` for unload/shutdown.
* [**Sharded models:**](/models/sharded-models) download a model that is sharded into multiple parts.

## Flow
38 changes: 32 additions & 6 deletions docs/website/content/docs/models/download-lifecycle.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -6,24 +6,50 @@ schemaType: HowTo

## Overview

Downloads in QVAC are _resumable by default_. When you download an asset via [`downloadAsset()`](/reference/api#downloadasset) or [`loadModel()`](/reference/api#loadmodel)), the SDK writes partial files to disk so the next run can continue from where it left off. The progress callback provides a `downloadKey`. Persist it if you want to programmatically pause/cancel the download.
Downloads in QVAC are _resumable by default_. When you download an asset via [`downloadAsset()`](/reference/api#downloadasset) or [`loadModel()`](/reference/api#loadmodel), the SDK writes partial files to disk so the next run can continue from where it left off. The progress callback provides a `downloadKey` that identifies the underlying transfer (useful for dedup and cache identification), but cancellation is targeted by `requestId`.

Both `downloadAsset()` and `loadModel()` return a decorated promise (`Promise<string> & { requestId: string }`) that exposes a synchronous `requestId` field, so you can wire a stop button to a specific in-flight call without waiting for the first progress event. See [Cancel a specific call by `requestId`](#cancel-a-specific-call-by-requestid) below.

## Functions

1. [`downloadAsset()`](/reference/api#downloadasset) or [`loadModel()`](/reference/api#loadmodel) — with `onProgress` for progress tracking
2. [`cancel()`](/reference/api#cancel) — pause/cancel a download using the `downloadKey`
1. [`downloadAsset()`](/reference/api#downloadasset) or [`loadModel()`](/reference/api#loadmodel) — with `onProgress` for progress tracking; both return a decorated promise that exposes `op.requestId` synchronously.
2. [`cancel()`](/reference/api#cancel) — either:
- `cancel({ requestId: op.requestId })` — pause this specific call (preserves the partial file for automatic resume on the next run).
- `cancel({ requestId: op.requestId, clearCache: true })` — discard the partial file along with the cancel.
- `cancel({ modelId })` — broad sweep that cancels every in-flight request on the given model, including non-download ops. See [Cancellation — broad cancel by `modelId`](/runtime/cancellation#broad-cancel-by-modelid-escape-hatch).

For how to use each function, see [SDK — API reference](/reference/api/).

## Cancel a specific call by `requestId`

Both `downloadAsset()` and `loadModel()` return `Promise<string> & { requestId: string }`. The await result is unchanged (the asset path or model id, respectively), but `op.requestId` is available **synchronously** before `await` resolves — so a stop button can be wired immediately, before the first progress event arrives:

```ts
const op = downloadAsset({ assetSrc: "https://example.com/big.gguf" });
op.requestId; // synchronously available, before await

// Pause: preserves the partial file for automatic resume on the next call.
stopButton.onclick = () => cancel({ requestId: op.requestId });

// Or: discard the partial file along with the cancel.
clearButton.onclick = () => cancel({ requestId: op.requestId, clearCache: true });

await op; // rejects with InferenceCancelledError if cancelled
```

When two callers request the same artifact, the SDK deduplicates them onto a single underlying transfer. `cancel({ requestId })` rejects only the cancelling subscriber's promise; the underlying transfer keeps running to serve any other subscribers. The transfer is aborted only when the **last** subscriber leaves.
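
The subscriber behavior described above can be modeled as simple reference counting. The following is a minimal sketch, not the SDK's implementation: `SharedTransfer`, `Subscriber`, and the plain `Error` used for rejection are hypothetical names chosen for illustration.

```ts
// Hypothetical model of a deduplicated transfer; none of these names come from the SDK.
type Subscriber = { requestId: string; reject: (reason: Error) => void };

class SharedTransfer {
  private subscribers = new Map<string, Subscriber>();
  aborted = false;

  constructor(readonly downloadKey: string) {}

  subscribe(sub: Subscriber): void {
    this.subscribers.set(sub.requestId, sub);
  }

  // Reject one subscriber; abort the transfer only when nobody is left.
  cancel(requestId: string): boolean {
    const sub = this.subscribers.get(requestId);
    if (!sub) return false; // unknown or already-removed id: nothing to do
    this.subscribers.delete(requestId);
    sub.reject(new Error(`cancelled: ${requestId}`));
    if (this.subscribers.size === 0) this.aborted = true;
    return true;
  }
}

const transfer = new SharedTransfer("big.gguf");
const rejected: string[] = [];
transfer.subscribe({ requestId: "a", reject: () => rejected.push("a") });
transfer.subscribe({ requestId: "b", reject: () => rejected.push("b") });

transfer.cancel("a");
const abortedAfterFirst = transfer.aborted; // "b" still needs the file
transfer.cancel("b");
const abortedAfterLast = transfer.aborted; // the last subscriber has left
```

In this model the first cancel only rejects caller `a`; the transfer is marked aborted once `b` also cancels.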

For the broader cancellation contract (errors, decorated-promise pattern across other SDK operations, broad cancel by `modelId`), see [Cancellation](/runtime/cancellation).

## Flow

- Pause: call [`cancel()`](/reference/api#cancel) with `operation: "downloadAsset"` and the `downloadKey` from progress events.
- Pause: call [`cancel()`](/reference/api#cancel) with `requestId: op.requestId` from the decorated promise returned by `downloadAsset()` / `loadModel()`.
- Resume: run the same `downloadAsset()` / `loadModel()` call again — the SDK will reuse the partial file and continue downloading.
- Discard partial file: call `cancel({ ..., clearCache: true })`.
- Discard partial file: call `cancel({ requestId: op.requestId, clearCache: true })`.

## Example

The following script shows an example of pausing and resuming a download using `cancel()` + `downloadKey`:
The following script shows an example of pausing and resuming a download using `cancel({ requestId })` + the decorated-promise pattern:

<Tabs>
<Tab value="js" label="JavaScript" default>
171 changes: 171 additions & 0 deletions docs/website/content/docs/runtime/cancellation.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,171 @@
---
title: Cancellation
description: Cancel any in-flight long-running SDK operation by requestId, or broad-cancel by modelId.
schemaType: HowTo
---

## Overview

Every long-running SDK operation that goes through the request registry can be cancelled at any point during execution. Coverage spans inference (`completion()`, `embed()`, `transcribe()`, `translate()`, `textToSpeech()`, `ocr()`, `diffusion()`, `upscale()`), workspace operations (`rag*()`), and resource-acquisition calls (`loadModel()`, `downloadAsset()`). The cancel surface differs by operation — see the coverage callout below for which path applies to which call.

The mental model is: **the primary path is `requestId`** — pass the run's `requestId` to `cancel()` to stop that exact call. **The `modelId` path is an escape hatch** — use it for model unload, app shutdown, admin sweeps, or for ops whose addons cannot interrupt mid-decode (`translate`, `textToSpeech`, `ocr`, `diffusion`, `upscale`).

<Callout title="Coverage" type="info">
**Targeted cancel by `requestId`** works for: `completion()`, `loadModel()`, `embed()`, `transcribe()`, `downloadAsset()`, and `rag*()` (`ragIngest`, `ragSaveEmbeddings`, `ragReindex`).

**Broad cancel by `modelId`** additionally covers `translate()`, `textToSpeech()`, `ocr()`, `diffusion()`, and `upscale()`. These accept `cancel({ modelId })` but their addons cannot interrupt mid-decode — the in-flight call stops yielding when `signal.aborted` flips on the next yield point, and the C++ work runs to completion in the background.

**Duplex sessions** — `transcribeStream(...)` and `textToSpeechStream(...)` use `.destroy()` on the returned session.

**Finetune** — keeps its own cancel surface: `finetune({ operation: "cancel", ... })`. See [Fine-tuning](/ai-capabilities/fine-tuning).
</Callout>

## Functions

1. [`completion()`](/reference/api#completion) — returns a `CompletionRun` that exposes a synchronous `requestId` field.
2. [`loadModel()`](/reference/api#loadmodel), [`downloadAsset()`](/reference/api#downloadasset), [`embed()`](/reference/api#embed), [`transcribe()`](/reference/api#transcribe), and the [`rag*()`](/ai-capabilities/rag) workspace operations (`ragIngest`, `ragSaveEmbeddings`, `ragReindex`) — return a decorated promise (`Promise<T> & { requestId: string }`) that exposes a synchronous `requestId` field before the first await resolves.
3. [`cancel()`](/reference/api#cancel) — cancel by `requestId` (targeted) or by `modelId` (broad, optionally narrowed by `kind`).

For how to use each function, see [SDK — API reference](/reference/api/).

## Where `requestId` comes from

Two shapes show up across the SDK surface:

- **`completion()`** — returns a `CompletionRun` with `run.requestId` (UUIDv4 generated client-side, available synchronously on the returned run).
- **`loadModel()`, `downloadAsset()`, `embed()`, `transcribe()`, `ragIngest()`, `ragSaveEmbeddings()`, `ragReindex()`** — return `Promise<T> & { requestId: string }`. The await result is unchanged (`await loadModel(...)` still resolves to the model id, `await embed(...)` still resolves to the embedding vector, etc.), but `op.requestId` is available synchronously *before* `await` resolves so a stop button can be wired immediately.

```ts
// Pattern 1 — completion: requestId is on the returned run
const run = completion({ modelId, history, stream: true });
await cancel({ requestId: run.requestId });

// Pattern 2 — decorated promise: op.requestId is synchronously available
const op = loadModel({ modelSrc: "..." });
op.requestId; // synchronously available, before await
stopButton.onclick = () => cancel({ requestId: op.requestId });
const id = await op; // legacy unwrap still returns the modelId
```
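
For readers unfamiliar with the `Promise<T> & { requestId: string }` shape, here is one way such a decorated promise could be built. This is a sketch under assumptions: `decorate` and `nextRequestId` are hypothetical helpers, and a counter stands in for the client-side UUIDv4 so the block needs no imports.

```ts
// Hypothetical helpers; the SDK's real internals may differ.
type Decorated<T> = Promise<T> & { requestId: string };

let counter = 0;
// Stand-in for the client-side UUIDv4 described above.
function nextRequestId(): string {
  counter += 1;
  return `req-${counter}`;
}

function decorate<T>(start: (requestId: string) => Promise<T>): Decorated<T> {
  const requestId = nextRequestId(); // chosen before any async work starts
  // Attach the id to the promise object itself so callers can read it synchronously.
  return Object.assign(start(requestId), { requestId });
}

// Usage: the id is readable immediately; awaiting still yields the plain result.
const op = decorate(async (id) => `model-for-${id}`);
const idBeforeAwait = op.requestId;
op.then((modelId) => console.log(idBeforeAwait, "->", modelId));
```

Because the id is attached to the promise object rather than wrapped around it, existing `await` call sites keep working unchanged.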

## Targeted cancel by `requestId`

Once you have a `requestId` (via either of the two patterns above), cancel is a single call. The `requestId` is available **synchronously** — before the first network round-trip — so you can wire a stop button to it immediately, without waiting for the first chunk to arrive.

There are two equivalent forms:

```ts
const run = completion({ modelId, history, stream: true });

// Sugar form (recommended for most callers)
await cancel({ requestId: run.requestId });

// Explicit form (same behavior)
await cancel({ operation: "request", requestId: run.requestId });
```

Outcome on the consumer side (using `completion()` as the reference):

- The `events` async iterable closes cleanly.
- The terminal `completionDone` event carries `stopReason: "cancelled"`.
- The `final` promise rejects with [`InferenceCancelledError`](/reference/api#errors) (code `52419`).

Other operations that go through `cancel({ requestId })` (`loadModel`, `downloadAsset`, `embed`, `transcribe`, `rag*`) all reject their returned promise with the same `InferenceCancelledError` (code `52419`). The error class is reused across non-inference handlers; no new error code was added.

Only the targeted call is affected — other in-flight calls on the same `modelId` keep running. To cancel `translate`, `textToSpeech`, `ocr`, `diffusion`, or `upscale` — or to sweep every in-flight call on a model in one shot — use the broad-cancel form below.

## Broad cancel by `modelId` (escape hatch)

When you don't have a `requestId` — typically because you're unloading the model, shutting down the app, or sweeping stale requests from admin code — use the broad-cancel form. The canonical 0.11.0 shape is `{ modelId, kind? }`:

```ts
// Cancel every in-flight request on this model, regardless of kind
await cancel({ modelId });

// Narrow to a specific kind
await cancel({ modelId, kind: "completion" });
await cancel({ modelId, kind: "embeddings" });
await cancel({ modelId, kind: "transcribe" });
await cancel({ modelId, kind: "translate" });

// Legacy per-kind sugars — still supported via the client wrapper.
await cancel({ operation: "inference", modelId });
await cancel({ operation: "embeddings", modelId });
```

Broad cancel terminates **every** in-flight request matching the target on the model simultaneously. Prefer the targeted `{ requestId }` form when you do have a `requestId` — it scopes the cancellation precisely and avoids killing unrelated work that happens to share the model.
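
The difference in scope between the targeted and broad forms can be illustrated with a toy registry. This is a sketch only: `RegistryEntry`, `CancelTarget`, and `cancelMatching` are hypothetical names, and the real registry's matching rules may differ.

```ts
// Hypothetical in-memory registry; names and shapes are illustrative only.
type Kind = "completion" | "embeddings" | "transcribe" | "translate";
type RegistryEntry = { requestId: string; modelId: string; kind: Kind; cancelled: boolean };
type CancelTarget = { requestId: string } | { modelId: string; kind?: Kind };

const registry: RegistryEntry[] = [
  { requestId: "r1", modelId: "m1", kind: "completion", cancelled: false },
  { requestId: "r2", modelId: "m1", kind: "embeddings", cancelled: false },
  { requestId: "r3", modelId: "m2", kind: "completion", cancelled: false },
];

// Returns how many live requests this call cancelled.
function cancelMatching(entries: RegistryEntry[], target: CancelTarget): number {
  let count = 0;
  for (const entry of entries) {
    if (entry.cancelled) continue; // already terminated: not counted again
    const hit =
      "requestId" in target
        ? entry.requestId === target.requestId
        : entry.modelId === target.modelId &&
          (target.kind === undefined || entry.kind === target.kind);
    if (hit) {
      entry.cancelled = true;
      count += 1;
    }
  }
  return count;
}

const targeted = cancelMatching(registry, { requestId: "r1" }); // touches r1 only
const narrowed = cancelMatching(registry, { modelId: "m1", kind: "completion" }); // r1 is already gone
const broad = cancelMatching(registry, { modelId: "m1" }); // sweeps what is left on m1 (r2)
```

Note that `r3` on model `m2` is never touched, and that the broad form without `kind` would have taken out both `r1` and `r2` had it run first.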

For ops whose addon does not support mid-decode abort (`translate`, `textToSpeech`, `ocr`, `diffusion`, `upscale`), broad cancel by `modelId` is the only cancel path, and it is **soft** — the in-flight call stops yielding when `signal.aborted` flips on the next yield point, but the underlying C++ work runs to completion in the background. The client's promise still rejects with `InferenceCancelledError`; just don't expect the model to stop computing immediately.

<Callout type="info">
`loadModel` is per-`requestId` only: the registry slot for an in-progress load is keyed by `requestId` (the model id isn't known until the config hash is computed inside the handler), so `cancel({ modelId })` is a no-op against an in-progress load.
</Callout>

## Soft-cancel caveat for `loadModel`

The download phase of `loadModel()` honors `cancel({ requestId })` end-to-end. The subsequent **addon load phase** (`plugin.createModel(...)` / `model.load(false)`) does not accept a cancellation signal today — a cancel that lands during the load phase still rejects the client's promise with `InferenceCancelledError`, but the addon finishes loading the model into memory in the background.

The result is an **orphan model**: registered as loaded server-side, but the client believes the call failed. If you re-trigger `loadModel()` shortly after a cancel, prefer calling `unloadModel({ modelId })` first (using the model id you can derive deterministically from `modelSrc`) to avoid leaking RAM. A per-load cancel surface on the addon would close this gap; tracked as a follow-up.

## `cancelFinetune` timing change

`finetune({ operation: "cancel", modelId })` (the legacy domain-specific cancel surface for fine-tunes) now returns `{ status: "CANCELLED" }` immediately — the cancel is dispatched synchronously through the registry and the addon's `model.cancel()` runs out-of-band on the in-flight `startFinetune` promise. Previously, the call awaited the addon ack before resolving.

Functionally, the cancel still lands; the observable difference is that `await finetune({ operation: "cancel", ... })` now resolves before the addon has acknowledged it. If you previously gated subsequent calls on when the cancel call resolved, await the original `finetune(...)` handle's `result` instead to observe the actual training-side termination. The `cancel({ requestId })` path is unchanged across milestones: it has always been synchronous-after-abort.

## History-trim

A cancelled assistant turn is **partial** — the model stopped mid-decode, so its content cuts off in the middle of a thought. Drop it (or mark it as partial) before appending the next user turn to `history` on the follow-up `completion()`. Otherwise the model sees a truncated assistant message as if it were complete, which biases subsequent generations:

```ts
const run = completion({ modelId, history, stream: true });
let cancelled = false;

for await (const event of run.events) {
  if (event.type === "completionDone" && event.stopReason === "cancelled") {
    cancelled = true;
  }
}

const nextHistory = cancelled
  ? history // drop the partial assistant turn
  : [...history, { role: "assistant", content: (await run.final).contentText }];
```

<Callout type="info">
The same partial-turn rule applies if you abort the `events` iterator early (e.g., `break` out of the `for await` loop) without calling `cancel()`. The model still committed a truncated turn — treat it as partial.
</Callout>
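
The partial-turn rule, including the early-`break` case, can be factored into a small pure helper. The sketch below is an assumption-laden illustration: `Turn`, `Outcome`, `abortedEarly`, and `buildNextHistory` are hypothetical names, and only `role`, `content`, `stopReason`, and `contentText` mirror identifiers used in this page.

```ts
// Hypothetical types; they only mirror the fields used in the example above.
type Turn = { role: "user" | "assistant"; content: string };
type Outcome = { stopReason: string; contentText: string; abortedEarly?: boolean };

// Build the history for the follow-up call, keeping only complete assistant turns.
function buildNextHistory(history: Turn[], outcome: Outcome, userMessage: string): Turn[] {
  const partial = outcome.stopReason === "cancelled" || outcome.abortedEarly === true;
  const base: Turn[] = partial
    ? history // drop the truncated assistant turn
    : [...history, { role: "assistant", content: outcome.contentText }];
  return [...base, { role: "user", content: userMessage }];
}

const priorTurns: Turn[] = [{ role: "user", content: "Summarize this file." }];

const afterCancel = buildNextHistory(
  priorTurns,
  { stopReason: "cancelled", contentText: "The file de" },
  "Try again, shorter.",
);
const afterEos = buildNextHistory(
  priorTurns,
  { stopReason: "eos", contentText: "It defines a parser." },
  "Thanks.",
);
```

Here `afterCancel` holds two user turns and no assistant turn, while `afterEos` keeps the completed assistant reply between them.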

## Example

The following script loads a model, starts a streaming `completion()`, cancels it shortly after by `requestId`, and prints how many content deltas streamed before the cancel took effect:

<Tabs>
<Tab value="js" label="JavaScript" default>
<WrapCode>

```js file=<rootDir>/packages/sdk/dist/examples/cancel-by-request-id.js title="cancel-by-request-id.js" lineNumbers
```
</WrapCode>
</Tab>

<Tab value="ts" label="TypeScript">
<WrapCode>

```ts file=<rootDir>/packages/sdk/examples/cancel-by-request-id.ts title="cancel-by-request-id.ts" lineNumbers
```
</WrapCode>
</Tab>
</Tabs>

<Callout type="success">
**Tip:** all examples throughout this documentation are self-contained and runnable. For instructions on how to run them, see [SDK quickstart](/quickstart).
</Callout>

## Errors

- `InferenceCancelledError` (code `52419`) — expected on the `final` promise (and any aggregate promise) after a consumer-initiated cancel. Treat it as a normal outcome, not a failure. Carries `requestId` plus a `partial: { text?, toolCalls?, stats? }` payload with whatever was accumulated before the cancel point.
- `RequestNotFoundError` (code `52418`) — registry lookup miss for the given `requestId`. Rare in practice because `cancel({ requestId })` against an already-terminated id is a no-op on the handler (returns `success: true, cancelled: 0`); consumer code that narrows by class will see this for other call sites that look up a request by id.
- `RequestIdConflictError` (code `52417`) — two requests landed with the same `requestId`. Astronomically unlikely with UUIDv4; if you see it, report.
- `RequestRejectedByPolicyError` (code `52420`) — the registry's concurrency policy rejected the request before it began (e.g. `oneAtATimePerModel` for `completion` — the second concurrent completion against the same model is admissibility-rejected). Carries `requestId`, `kind`, `modelId`, and a human-readable `reason`.
- `AsyncDisposeUnavailableError` (code `53503`) — the runtime is missing `Symbol.asyncDispose` (older Bare builds). Upgrade Bare.
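
One way to act on these errors is to branch on the numeric code. The sketch below uses the codes listed above; the assumption that the thrown error exposes them on a `code` property, and the `classify` helper itself, are hypothetical and not confirmed by this page.

```ts
// Codes are taken from the list above; the `code` property name is an assumption.
const CANCELLED = 52419;
const REQUEST_NOT_FOUND = 52418;
const ID_CONFLICT = 52417;
const REJECTED_BY_POLICY = 52420;

type Disposition =
  | "cancelled"
  | "request-not-found"
  | "id-conflict"
  | "rejected-by-policy"
  | "other";

function classify(err: unknown): Disposition {
  // Read `code` defensively, since `err` may be any thrown value.
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? (err as { code: unknown }).code
      : undefined;
  switch (code) {
    case CANCELLED:
      return "cancelled"; // normal outcome of a consumer-initiated cancel
    case REQUEST_NOT_FOUND:
      return "request-not-found";
    case ID_CONFLICT:
      return "id-conflict";
    case REJECTED_BY_POLICY:
      return "rejected-by-policy";
    default:
      return "other";
  }
}

// Usage: swallow expected cancels, rethrow everything else.
function rethrowUnlessCancelled(err: unknown): void {
  if (classify(err) !== "cancelled") throw err;
}
```

Callers that prefer narrowing by class can use `instanceof` against the SDK's exported error classes instead, if those are available in their build.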

8 changes: 7 additions & 1 deletion docs/website/src/lib/custom-tree.ts
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,13 @@ export const customTree: Node[] = [
name: 'Runtime',
},
{
name: 'Runtime lifecycle',
name: 'Cancellation',
url: '/runtime/cancellation',
type: 'page',
icon: resolveIcon('CircleStop'),
},
{
name: 'Lifecycle',
url: '/runtime/lifecycle',
type: 'page',
icon: resolveIcon('Moon'),