10 changes: 5 additions & 5 deletions .cursor/rules/sdk/docs/kv-cache-system.mdc
@@ -88,28 +88,28 @@ When a new cache key is used for the first time:

| File | Purpose |
|------|---------|
| `server/bare/plugins/llamacpp-completion/ops/kv-cache-session.ts` | **`KvCacheSession` - single owner of the three KV-cache bookkeeping layers** (on-disk `.bin`, `initializedCaches`, `cachedMessageCounts`). Exposes `beginTurn` / `commitTurn` / `rollback` / `dropStaleSavedCount` plus the module-level `deleteKvCacheState(...)` administrative API. M2 (QVAC-18182). |
| `server/bare/plugins/llamacpp-completion/ops/kv-cache-session.ts` | **`KvCacheSession` - single owner of the three KV-cache bookkeeping layers** (on-disk `.bin`, `initializedCaches`, `cachedMessageCounts`). Exposes `beginTurn` / `commitTurn` / `rollback` / `dropStaleSavedCount` plus the module-level `deleteKvCacheState(...)` administrative API. (QVAC-18182). |
| `server/bare/plugins/llamacpp-completion/ops/completion-stream.ts` | Completion handler. Calls `session.beginTurn(...)`, registers `scope.defer(() => session.rollback(turn))` once, and calls `session.commitTurn(...)` on the happy path (which suppresses the deferred rollback). No direct references to the three layers. |
| `server/bare/plugins/llamacpp-completion/ops/kv-cache-state.ts` | Pure `decideCachedHistorySlice(...)` helper used by the session - slice decision for the next addon call. No state. |
| `server/bare/ops/kv-cache-utils.ts` | Path / hash / fs utilities: `getCacheFilePath`, `generateConfigHash`, `findMatchingCache`, `getCurrentCacheInfo`, `renameCacheFile`, `deleteCache`. No in-memory state. |
| `server/bare/plugins/llamacpp-completion/ops/cache-logger.ts` | Debug logging for cache operations |
| `server/rpc/handlers/delete-cache.ts` | `handleDeleteCache` RPC entry point. Delegates to `deleteKvCacheState(...)` - zero direct references to the three layers (M2 deliverable 5). |
| `server/rpc/handlers/delete-cache.ts` | `handleDeleteCache` RPC entry point. Delegates to `deleteKvCacheState(...)` - zero direct references to the three layers. |
| `server/utils/cache.ts` | `getKVCacheDir()` base directory |
| `client/api/delete-cache.ts` | Client-side delete cache API |

## Key Behaviors

### `KvCacheSession` Ownership (M2)
### `KvCacheSession` Ownership

Before M2 the completion handler coordinated three independent bookkeeping layers around every cancel/error branch:
Before 0.11.0 the completion handler coordinated three independent bookkeeping layers around every cancel/error branch:

1. An in-memory `Set<string>` of "initialized caches" (`kv-cache-utils.ts`).
2. A `Map<string, number>` of saved-message counts (`kv-cache-state.ts`).
3. The on-disk `.bin` files written by the addon.

Three near-identical cleanup blocks in `completion-stream.ts` had to touch all three on every cancel / zero-token / rename-failed / tool-call exit. Any one of those blocks forgetting a layer produced the drift bugs the pitch documents (QVAC-17780 family).

**M2 collapses this into `KvCacheSession`**, the **single mutation point** for the three layers. The handler's loop is now:
**0.11.0 collapses this into `KvCacheSession`**, the **single mutation point** for the three layers. The handler's loop is now:

```typescript
const session = createKvCacheSession(modelId);
6 changes: 3 additions & 3 deletions .cursor/rules/sdk/docs/request-lifecycle-system.mdc
@@ -74,7 +74,7 @@ Concretely, Vercel's AI SDK (`streamText`) is a public-codebase example of the s

`request-lifecycle-primitives.mdc` has the worked code examples and the dispatch-level truth table for which `RequestKind`s currently route through the registry.

For kinds that haven't been migrated onto the registry yet, the broad-cancel path (`cancel({ operation: <kind>, modelId })`) falls back to `addon.cancel()` directly - see the fallback in `server/bare/ops/cancel.ts`. The wire contract for non-migrated kinds is unchanged: callers continue to use `cancel({ operation: <kind>, modelId })` exactly as before.
Every server-side cancellable handler is on the registry as of 0.11.0. The broad-cancel path (`cancel({ modelId, kind? })` and its legacy `{ operation: "inference"|"embeddings", modelId }` aliases) is a single registry walk; the legacy pre-registry addon-cancel fallback in `server/bare/ops/cancel.ts` was removed in 0.11.0. Handlers whose addon declares `cancel: { scope: "none" }` (TTS, OCR, NMT, upscale) still respect a broad cancel at the registry layer - the in-flight call yields when `ctx.signal.aborted` flips on its next yield point - they just don't get a hard mid-decode abort.

## FAQ

@@ -192,8 +192,8 @@ Test coverage: `same-tick cancel-before-begin retroactively aborts the later beg
| `server/bare/runtime/with-request-context.ts` | `withRequestContext(logger, ctx)` - per-request logger wrapper prefixing every emit with the lifecycle correlation tuple |
| `server/bare/runtime/request-id.ts` | UUID generation helper for caller-provided ids |
| `server/bare/runtime/index.ts` | Public re-exports - handlers import from `@/server/bare/runtime` |
| `server/bare/ops/cancel.ts` | Broad-cancel op: registry-routed with addon fallback for non-migrated handlers |
| `server/rpc/handlers/cancelHandler.ts` | RPC entry point: dispatches by `operation` (inference / embeddings / request / downloadAsset / rag) |
| `server/bare/ops/cancel.ts` | Broad-cancel op: pure registry walk, legacy addon-cancel fallback removed in 0.11.0 |
| `server/rpc/handlers/cancelHandler.ts` | RPC entry point: 2-arm `request` / `broad` dispatch (5-arm union collapsed in 0.11.0). Targeted `request` goes through `RequestRegistry.cancel({ requestId })` plus an optional `markClearCacheForRequest(...)` for downloads; `broad` delegates to `server/bare/ops/cancel.ts` |
| `server/rpc/handlers/delete-cache.ts` | Delegates to `deleteKvCacheState(...)` - zero direct references to the three KV-cache layers |
| `server/bare/plugins/llamacpp-completion/plugin.ts` | Reference plugin manifest; declares `cancel: { scope: "model", hard: true }`; builds `withRequestContext(...)` once per request and threads it into `completion(...)`; `finetune` declares `{ scope: "model", hard: true }`; `translate` handler threads `requestId` into the shared bare op |
| `server/bare/plugins/llamacpp-completion/ops/completion-stream.ts` | Reference implementation of the canonical handler shape; uses `KvCacheSession`; accepts a request-scoped `logger` |
22 changes: 18 additions & 4 deletions .cursor/rules/sdk/error-handling.mdc
@@ -177,10 +177,24 @@ Located in `@/utils/errors-server`
- `AttachmentNotFoundError` - Attachment not found
- `CancelFailedError` - Cancel failed
- `TextToSpeechFailedError` - TTS failed
- `RequestIdConflictError` (52417) - `registry.begin(...)` called with a `requestId` already present
- `RequestNotFoundError` (52418) - registry lookup miss (no in-flight request for the given id)
- `InferenceCancelledError` (52419) - cancelled inference run; carries `requestId` + `partial: { text?, toolCalls?, stats? }`. Constructed client-side on `stopReason: "cancelled"` (event stream ends normally; promise-aggregates reject with this). Re-exported from `@qvac/sdk` for `instanceof` checks.
- `RequestRejectedByPolicyError` (52420) - registry concurrency-policy admission failure (e.g. `oneAtATimePerModel`); carries `requestId`, `kind`, `modelId`, and a `reason` string. Re-exported from `@qvac/sdk` for `instanceof` checks. See `.cursor/rules/sdk/request-lifecycle-primitives.mdc` for the policy contract.
- `RequestIdConflictError` (52417) - `registry.begin(...)` called with a `requestId` already present. Carries `requestId`. Re-exported from `@qvac/sdk` for `instanceof` checks (reconstructed across RPC by the typed-error reconstructor - see "Typed errors across RPC" below).
- `RequestNotFoundError` (52418) - registry lookup miss (no in-flight request for the given id). Carries `requestId`. Re-exported from `@qvac/sdk` for `instanceof` checks (reconstructed across RPC).
- `InferenceCancelledError` (52419) - cancelled inference run; carries `requestId` + `partial: { text?, toolCalls?, stats? }`. Constructed client-side on `stopReason: "cancelled"` (event stream ends normally; promise-aggregates reject with this). Re-exported from `@qvac/sdk` for `instanceof` checks. **Not** RPC-reconstructed - client-side only.
- `RequestRejectedByPolicyError` (52420) - registry concurrency-policy admission failure (e.g. `oneAtATimePerModel`); carries `requestId`, `kind`, `modelId`, and a `reason` string. Re-exported from `@qvac/sdk` for `instanceof` checks (reconstructed across RPC). See `.cursor/rules/sdk/request-lifecycle-primitives.mdc` for the policy contract.

### Typed errors across RPC

Server-thrown `QvacError` subclasses that need to survive the RPC boundary as their original class (so `err instanceof RequestRejectedByPolicyError` works on the consumer side) are wired through a small reconstructor pipeline:

1. The class extends `QvacErrorBase` and implements `toErrorResponseFields(): Record<string, unknown>` listing the named constructor fields the client needs to rebuild it. The base envelope (`name`, `code`, `message`, `stack`, `timestamp`, `cause`) is already carried by `createErrorResponse(...)`; `typedFields` is the per-class extension.
2. The class is re-exported from `@qvac/sdk` (root `index.ts`). Forgetting this means consumers can't `import { Foo } from "@qvac/sdk"` even though the reconstructor builds a `Foo` instance, and `instanceof` regresses.
3. A row is added to the `RECONSTRUCTORS` map in `client/rpc/rpc-error.ts`, keyed by the class `name`. The row reads from `response.typedFields` (defaulting missing fields defensively) and forwards `response.cause`.

`client/rpc/rpc-client.ts` calls `reconstructError(response)` instead of `new RPCError(response)`: a registered class is rebuilt; an unknown `name` falls through to `RPCError` so consumers using `code`-based predicates still work.

Three classes are wired today: `RequestIdConflictError`, `RequestNotFoundError`, `RequestRejectedByPolicyError`. Add new rows whenever a new cross-RPC server-thrown class is introduced - the maintenance contract lives at the top of the `RECONSTRUCTORS` map.

`InferenceCancelledError` is **not** in this map: it's constructed client-side in `client/api/completion-stream.ts` from the aggregated partial state on `stopReason: "cancelled"`. Adding a reconstructor for a client-constructed class creates a parallel construction path and is a smell.
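
The three-step pipeline above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in `client/rpc/rpc-error.ts`: the `ErrorResponse` shape, the `originalCause` field, the stand-in class bodies, and the `'<unknown>'` default are all assumptions; only the names `RECONSTRUCTORS`, `reconstructError`, `RPCError`, `typedFields`, and `toErrorResponseFields` come from the text.

```typescript
// Assumed wire envelope. Only `name`, `code`, `message`, `typedFields` and
// `cause` are named in the text above; the exact shape is illustrative.
interface ErrorResponse {
  name: string
  code: number
  message: string
  typedFields?: Record<string, unknown>
  cause?: unknown
}

// Stand-in for the generic fallback used when no reconstructor is registered.
class RPCError extends Error {
  readonly code: number

  constructor (response: ErrorResponse) {
    super(response.message)
    this.name = 'RPCError'
    this.code = response.code
  }
}

// Stand-in for a server-thrown class that must keep its identity across RPC.
class RequestNotFoundError extends Error {
  readonly code = 52418
  readonly requestId: string
  readonly originalCause: unknown // hypothetical field, for illustration only

  constructor (requestId: string, originalCause?: unknown) {
    super(`no in-flight request for id ${requestId}`)
    this.name = 'RequestNotFoundError'
    this.requestId = requestId
    this.originalCause = originalCause
  }

  // Server side: the per-class extension that travels as `typedFields`.
  toErrorResponseFields (): Record<string, unknown> {
    return { requestId: this.requestId }
  }
}

type Reconstructor = (response: ErrorResponse) => Error

// One row per class, keyed by the class `name`.
const RECONSTRUCTORS = new Map<string, Reconstructor>([
  ['RequestNotFoundError', (response) => {
    const raw = response.typedFields?.['requestId']
    // Default a missing or malformed field instead of throwing.
    const requestId = typeof raw === 'string' ? raw : '<unknown>'
    return new RequestNotFoundError(requestId, response.cause)
  }]
])

// Registered names are rebuilt as their class; anything else falls through.
function reconstructError (response: ErrorResponse): Error {
  const rebuild = RECONSTRUCTORS.get(response.name)
  return rebuild !== undefined ? rebuild(response) : new RPCError(response)
}
```

With this shape, a consumer-side `err instanceof RequestNotFoundError` check holds for a rebuilt error, while an unregistered `name` still yields an `RPCError` that carries the numeric `code`.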

#### RAG Operations (52,800-52,999)
- `RAGSaveFailedError` - Save failed
12 changes: 9 additions & 3 deletions .cursor/rules/sdk/request-lifecycle-primitives.mdc
@@ -16,7 +16,7 @@ Server-side long-running operations (`completion`, `embeddings`, `transcribe`, `
- **`RequestContext`** - per-request handle bundling `requestId`, `kind`, `modelId`, `signal`, `scope`, `state`.
- **`RequestRegistry`** - module-scoped registry that mints contexts via `begin(...)` and routes `cancel(...)` by `requestId` or `modelId`.

The contract below applies to every cancellable server-side handler. The truth table further down ("Truth table for built-in plugins") tracks each handler's addon-level cancel surface; the dispatch-level table ("What's on the registry today") tracks which `RequestKind`s are currently routed through the registry. Kinds not on the registry use the broad-cancel fallback in `server/bare/ops/cancel.ts`.
The contract below applies to every cancellable server-side handler. The truth table further down ("Truth table for built-in plugins") tracks each handler's addon-level cancel surface; the dispatch-level table ("What's on the registry today") tracks which `RequestKind`s are currently routed through the registry. As of 0.11.0 every handler in the SDK is registered - the legacy pre-registry addon-cancel fallback in `server/bare/ops/cancel.ts` has been removed.

## Canonical Handler Shape

@@ -244,7 +244,7 @@ The truth table above describes the addon-level capability for plugin handlers.
| `downloadAsset` | `server/rpc/handlers/download-asset.ts` | Hard (signal threaded to `resolveModelPath`) | Per-`requestId` cancel preserves the content-addressed dedup in `download-manager.ts` - two subscribers on the same `downloadKey` share one transfer, and the transfer aborts only when the **last** subscriber leaves. |
| `rag` | `server/rpc/handlers/rag.ts` | Soft (workspace-bound; ingest/saveEmbeddings/reindex) | Dispatcher-level pre-emption: starting a new RAG op on a workspace cancels the prior in-flight op on the same workspace **before** `registry.begin(...)`. Workspace admission lives in the dispatcher rather than as a registry policy primitive. |

Kinds **not** in this table (e.g. `textToSpeech`, `ocr`, `diffusion`, `upscale`) still use the broad-cancel fallback in `server/bare/ops/cancel.ts`.
Kinds **not** in this table (e.g. `textToSpeech`, `ocr`, `diffusion`, `upscale`) declare `cancel: { scope: "none" }` at the addon level - they do not expose a hard mid-decode abort surface but still respect `cancel({ requestId })` and `cancel({ modelId })` at the registry layer (the in-flight call yields when `ctx.signal.aborted` flips on the next yield point). The broad-cancel path is a single registry walk; the per-kind fallback was removed in 0.11.0.
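
To illustrate what "yields when `ctx.signal.aborted` flips on the next yield point" means for a `scope: "none"` handler, here is a minimal sketch of cooperative cancellation. The generator `softCancellableRun`, the chunked input, and the upper-casing "work" are hypothetical stand-ins; only the use of an `AbortSignal` checked between units of work reflects the text.

```typescript
// Hypothetical soft-cancel handler: each chunk stands for one addon call that
// cannot be interrupted mid-decode, so the signal is only checked in between.
async function * softCancellableRun (
  chunks: string[],
  signal: AbortSignal
): AsyncGenerator<string> {
  for (const chunk of chunks) {
    // Yield point: stop before starting the next unit of work.
    if (signal.aborted) return
    await Promise.resolve() // stand-in for the uninterruptible addon call
    yield chunk.toUpperCase()
  }
}

// Consumer that cancels right after the first result arrives.
async function demo (): Promise<string[]> {
  const controller = new AbortController()
  const seen: string[] = []
  for await (const out of softCancellableRun(['a', 'b', 'c'], controller.signal)) {
    seen.push(out)
    if (seen.length === 1) controller.abort()
  }
  return seen
}
```

In this sketch `demo()` resolves to `['A']`: the first chunk finishes because nothing can interrupt it, and the remaining chunks are skipped at the next check. A hard-cancel handler would instead thread the signal into the call itself.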


## Concurrency Policy
@@ -361,11 +361,17 @@ await sdk.cancel({ requestId: op.requestId });
Cancel every in-flight request matching a `modelId` - for model unload, app shutdown, admin sweeps. Kept stable from pre-0.11.0:

```typescript
// Generic broad-cancel - preferred shape going forward (0.11.0+).
await sdk.cancel({ modelId });
await sdk.cancel({ modelId, kind: "completion" });
await sdk.cancel({ modelId, kind: "embeddings" });

// Legacy per-kind sugars - still supported via the client wrapper.
await sdk.cancel({ operation: "inference", modelId });
await sdk.cancel({ operation: "embeddings", modelId });
```

Internally, both paths land on `RequestRegistry.cancel(...)`. The broad path falls back to `addon.cancel()` for handler kinds that haven't been registry-migrated yet - see the "What's on the registry today" table above for the current set of registry-routed kinds.
Internally every path lands on `RequestRegistry.cancel(...)`. The legacy pre-registry addon-cancel fallback in `server/bare/ops/cancel.ts` was removed in 0.11.0 - every handler is now on the registry, so a broad cancel is one registry walk and nothing else. See the "What's on the registry today" table above; it lists "all kinds" because the migration is complete in 0.11.0.
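
As a rough mental model of "one registry walk", the sketch below shows a toy registry where targeted and broad cancels share a single loop over in-flight entries. `MiniRegistry`, its `Entry` shape, and the returned count are assumptions made for illustration; the real `RequestRegistry` is not shown in this excerpt and may differ.

```typescript
type RequestKind = 'completion' | 'embeddings'

interface Entry {
  requestId: string
  kind: RequestKind
  modelId: string
  controller: AbortController
}

type CancelTarget =
  | { requestId: string }
  | { modelId: string, kind?: RequestKind }

// Hypothetical stand-in for the registry; not the SDK implementation.
class MiniRegistry {
  private readonly entries = new Map<string, Entry>()

  // Registers an in-flight request and hands back its abort signal.
  begin (requestId: string, kind: RequestKind, modelId: string): AbortSignal {
    if (this.entries.has(requestId)) {
      throw new Error(`requestId already in flight: ${requestId}`)
    }
    const controller = new AbortController()
    this.entries.set(requestId, { requestId, kind, modelId, controller })
    return controller.signal
  }

  // Called by the handler when the request settles.
  end (requestId: string): void {
    this.entries.delete(requestId)
  }

  // One walk serves both shapes; returns how many requests were newly aborted.
  cancel (target: CancelTarget): number {
    let cancelled = 0
    for (const entry of this.entries.values()) {
      const matches = 'requestId' in target
        ? entry.requestId === target.requestId
        : entry.modelId === target.modelId &&
          (target.kind === undefined || entry.kind === target.kind)
      if (matches && !entry.controller.signal.aborted) {
        entry.controller.abort()
        cancelled++
      }
    }
    return cancelled
  }
}
```

Because every handler holds a signal minted by `begin(...)`, no per-kind fallback is needed in this model: a broad cancel simply aborts every matching entry.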

## Decorated-Promise Pattern

22 changes: 18 additions & 4 deletions packages/cli/src/serve/adapters/openai/routes/chat.ts
@@ -3,6 +3,7 @@ import { readBody, sendJson, sendError, initSSE, sendSSE, endSSE } from '../../.
import { resolveModelAlias } from '../../../config.js'
import { sdkCompletion } from '../../../core/sdk.js'
import type { SDKTool, SDKGenerationParams, SDKResponseFormat } from '../../../core/sdk.js'
import { bindClientDisconnectCancel } from '../../../core/cancel-bridge.js'
import {
openaiMessagesToHistory,
openaiToolsToSdk,
@@ -96,9 +97,9 @@ export async function handleChatCompletions (req: IncomingMessage, res: ServerRe

try {
if (streaming) {
await handleStreamingCompletion(res, { sdkModelId, history, tools, generationParams, responseFormat, modelAlias, logger: ctx.logger })
await handleStreamingCompletion(req, res, { sdkModelId, history, tools, generationParams, responseFormat, modelAlias, logger: ctx.logger })
} else {
await handleBlockingCompletion(res, { sdkModelId, history, tools, generationParams, responseFormat, modelAlias, logger: ctx.logger })
await handleBlockingCompletion(req, res, { sdkModelId, history, tools, generationParams, responseFormat, modelAlias, logger: ctx.logger })
}
} catch (err) {
const message = err instanceof Error ? err.message : String(err)
@@ -124,7 +125,7 @@ function completionTokensFromStats (text: string, stats: { generatedTokens?: num
return text ? text.split(/\s+/).filter(Boolean).length : 0
}

async function handleBlockingCompletion (res: ServerResponse, params: CompletionParams): Promise<void> {
async function handleBlockingCompletion (req: IncomingMessage, res: ServerResponse, params: CompletionParams): Promise<void> {
const result = await sdkCompletion({
modelId: params.sdkModelId,
history: params.history,
@@ -134,6 +135,12 @@ async function handleBlockingCompletion (res: ServerResponse, params: Completion
responseFormat: params.responseFormat
})

// Bridge HTTP client disconnect → SDK cancel. Bound after the
// wrapper await but before any `await` on the result aggregates,
// so a fetch-abort mid-completion lands on the in-flight requestId
// before tokens have fully resolved.
bindClientDisconnectCancel(req, res, result.requestId, params.logger)

const text = await result.text
const toolCalls = await result.toolCalls
const stats = await result.stats
@@ -171,7 +178,7 @@ async function handleBlockingCompletion (res: ServerResponse, params: Completion
})
}

async function handleStreamingCompletion (res: ServerResponse, params: CompletionParams): Promise<void> {
async function handleStreamingCompletion (req: IncomingMessage, res: ServerResponse, params: CompletionParams): Promise<void> {
const result = await sdkCompletion({
modelId: params.sdkModelId,
history: params.history,
@@ -181,6 +188,13 @@ async function handleStreamingCompletion (res: ServerResponse, params: Completio
responseFormat: params.responseFormat
})

// Bridge HTTP client disconnect → SDK cancel. The synchronous
// `result.requestId` (decorated on the `CompletionRun`) is what makes
// this work: we can bind the listener before the first SSE frame
// streams, so a fetch-abort during inference aborts the in-flight
// SDK request rather than letting it run to natural completion.
bindClientDisconnectCancel(req, res, result.requestId, params.logger)

initSSE(res)

const id = `chatcmpl-${randomId()}`
10 changes: 9 additions & 1 deletion packages/cli/src/serve/adapters/openai/routes/embeddings.ts
@@ -2,6 +2,7 @@ import type { IncomingMessage, ServerResponse } from 'node:http'
import { readBody, sendJson, sendError } from '../../../http.js'
import { resolveModelAlias } from '../../../config.js'
import { sdkEmbed } from '../../../core/sdk.js'
import { bindClientDisconnectCancel } from '../../../core/cancel-bridge.js'
import type { RouteContext } from '../../types.js'

export async function handleEmbeddings (req: IncomingMessage, res: ServerResponse, ctx: RouteContext): Promise<void> {
@@ -60,11 +61,18 @@ export async function handleEmbeddings (req: IncomingMessage, res: ServerRespons
ctx.logger.info(` embed model=${modelAlias} inputs=${inputs.length}`)

try {
const embeddings = await sdkEmbed({
const op = await sdkEmbed({
modelId: sdkModelId,
text: inputs.length === 1 ? inputs[0]! : inputs
})

// Bind the disconnect bridge before awaiting the result so a
// client-abort during a long batch embed lands on the in-flight
// requestId rather than completing the whole batch.
bindClientDisconnectCancel(req, res, op.requestId, ctx.logger)

const embeddings = await op.result

const isBatch = Array.isArray(embeddings[0])
const vectors = isBatch ? embeddings as number[][] : [embeddings as number[]]

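The hunks above only show call sites of `bindClientDisconnectCancel(req, res, requestId, logger)`; its body in `core/cancel-bridge.js` is not part of this excerpt. The sketch below is one plausible shape for such a bridge, under stated assumptions: it uses a reduced signature (`res`, `requestId`, an injected `cancel` function), a hypothetical `ResponseLike` interface instead of Node's `ServerResponse`, and relies on the documented Node behaviour that `'close'` fires both on client disconnect and after a normal `end()`, with `writableEnded` telling the two apart.

```typescript
// Minimal structural view of the response object; a real ServerResponse
// satisfies a superset of this. Hypothetical, for illustration only.
interface ResponseLike {
  writableEnded: boolean
  once: (event: 'close', listener: () => void) => unknown
  off: (event: 'close', listener: () => void) => unknown
}

type CancelFn = (target: { requestId: string }) => Promise<void>

// Hypothetical bridge: cancels the SDK request if the connection closes
// before the response was finished. Returns an unbind function.
function bindDisconnectCancel (
  res: ResponseLike,
  requestId: string,
  cancel: CancelFn
): () => void {
  const onClose = (): void => {
    // 'close' also fires after a normal end(); only cancel on early close.
    if (res.writableEnded) return
    // Best-effort: a failed cancel must not crash the server process.
    cancel({ requestId }).catch(() => {})
  }
  res.once('close', onClose)
  return () => { res.off('close', onClose) }
}
```

Binding right after the SDK call returns its `requestId`, and before awaiting the result, is what lets an early disconnect reach the in-flight request, which matches the ordering described in the comments in the hunks above.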