QVAC-18036 fix: prime system-prompt KV cache via addon prefill instead of cancel-on-first-token by maxim-smotrov · Pull Request #1880 · tetherto/qvac

maxim-smotrov · 2026-05-04T10:25:40Z

🎯 What problem does this PR solve?

initSystemPromptCache primed a new KV cache by starting generation and then cancelling on the first output token. That still produced one token of work, relied on a race between output and cancel, and created unnecessary load when warming caches at startup or on cache-key churn.

📝 How does it solve it?

Switch system-prompt cache priming to the addon's new prefill: true runtime option. The prompt + tools are ingested into the KV cache and persisted to disk without producing any output tokens, so runModel resolves as soon as priming finishes. Specifically:
- Extend the internal CompletionRunOptions type to forward prefill alongside cacheKey / saveCacheToDisk.
- Drop the output → cancel race in initSystemPromptCache; just await the prime response.
- Bump @qvac/llm-llamacpp from ^0.17.1 to ^0.17.3 to pick up the prefill flag.
- Update kv-cache-system.mdc to document the new prefill-based priming flow.

🧪 How was it tested?

Ran examples/kv-cache-custom-key.js.
Ran e2e tests for cache

…ancel-on-first-token

simon-iribarren · 2026-05-04T13:01:24Z

Following up on the missing-regression-test nit from review — here are concrete suggestions, scoped to what's testable without a real model.

Suggested regression tests

The behavior to lock down is narrow: initSystemPromptCache must call the addon with prefill: true and must not wire an output → cancel race anymore. That's verifiable with a fake AnyModel whose run() records the RunOptions and returns a stub QvacResponse.

1. Unit — `prefill: true` is forwarded, `cancel()` is never called

File: packages/sdk/test/unit/init-system-prompt-cache.test.ts (new), using brittle like kv-cache-cancel.test.ts.

import test from "brittle";
import { initSystemPromptCache } from "@/server/bare/plugins/llamacpp-completion/ops/completion-stream";

function makeFakeModel() {
  const calls: { prompt: unknown; opts: unknown }[] = [];
  let cancelCount = 0;
  const fakeResponse = {
    once: (_event: string, _cb: () => void) => fakeResponse,
    cancel: async () => { cancelCount++; },
    await: async () => {},
  };
  return {
    model: { run: (prompt: unknown, opts: unknown) => {
      calls.push({ prompt, opts });
      return fakeResponse;
    }},
    calls,
    getCancelCount: () => cancelCount,
  };
}

test("initSystemPromptCache primes with prefill: true and never cancels", async (t) => {
  const { model, calls, getCancelCount } = makeFakeModel();
  await initSystemPromptCache(model as any, '/tmp/cache.bin', 'sys', 'key', []);
  t.is(calls.length, 1);
  t.alike(calls[0].opts, { cacheKey: '/tmp/cache.bin', saveCacheToDisk: true, prefill: true });
  t.is(getCancelCount(), 0, 'must not cancel — prefill produces no output tokens');
});

(Requires exporting initSystemPromptCache from completion-stream.ts — currently file-private. One-line change.)

2. Unit — priming errors propagate (regression for the masked-failure class)

The old code's once('output', cancel) could mask priming failures if the addon ever emitted output after a failed prefill. The new code makes await primeResponse.await() the only signal, so an error must surface:

test("initSystemPromptCache propagates addon errors from await()", async (t) => {
  const failingResponse = {
    once: () => failingResponse,
    cancel: async () => {},
    await: async () => { throw new Error('addon prefill failed'); },
  };
  const model = { run: () => failingResponse };
  await t.exception(() => initSystemPromptCache(model as any, '/tmp/cache.bin', 'sys', 'key', []));
});

3. Unit — tools are forwarded into the prime payload

Confirms the system-prompt + tools transformation didn't regress when the cancel block was removed:

test("initSystemPromptCache includes tools in the prime payload", async (t) => {
  const { model, calls } = makeFakeModel();
  const tools = [{ type: 'function', name: 'add', description: 'sum', parameters: {} }];
  await initSystemPromptCache(model as any, '/tmp/cache.bin', 'sys', 'key', tools as any);
  const prompt = calls[0].prompt as any[];
  t.is(prompt[0].role, 'system');
  t.ok(prompt.some((m: any) => m.type === 'function' && m.name === 'add'));
});

4. (Optional, lives outside unit suite) Integration assertion

In whichever e2e the team already runs for examples/kv-cache-custom-key.ts, add two post-conditions:

The prime call's QvacResponse emits zero output events end-to-end (proves prefill: true truly suppresses generation).
The cache file exists on disk after await primeResponse.await() resolves.

These three unit tests are cheap, don't require a model, and pin exactly the contract this PR is establishing — they would have caught the prior race regressing back as well.

github-actions · 2026-05-04T13:01:48Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

github-actions · 2026-05-05T07:43:00Z

QVAC E2E — `windows` — ⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

The test job did not produce a results artifact. Check the run for job-level failures.

github-actions · 2026-05-05T07:43:00Z

QVAC E2E — `linux` — ✅ all tests passed (89/89, 466s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

github-actions · 2026-05-05T07:43:01Z

QVAC E2E — `macos` — ✅ all tests passed (89/89, 332s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

github-actions · 2026-05-05T07:43:01Z

QVAC E2E — `ios` — ❌ failed

Totals: 18/89 passed · 71 failed · 20.2% · 660s
Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

Results by section

addon-logging: 0/2 ❌
config: 0/1 ❌
config-reload: 0/2 ❌
delegated-inference: 0/4 ❌
diffusion: 0/2 ❌
download: 0/1 ❌
embedding: 0/2 ❌
error: 0/5 ❌
finetune: 0/1 ❌
http: 0/2 ❌
kv-cache: 0/4 ❌
lifecycle: 0/2 ❌
model: 6/8 ❌
model-info: 0/4 ❌
ocr: 0/4 ❌
rag: 0/3 ❌
registry: 0/4 ❌
sharded-model: 0/3 ❌
tools: 0/2 ❌
transcription: 0/5 ❌
translation-afriquegemma: 0/1 ❌
translation-bergamot: 0/3 ❌
translation-indictrans: 0/2 ❌
translation-llm: 0/2 ❌
translation-salamandra: 0/2 ❌
tts: 0/2 ❌
vision: 0/3 ❌
wrong-model: 0/1 ❌

Failed tests

addon-logging-llm: Consumer died before test could be executed
addon-logging-during-inference: Consumer died before test could be executed
config-reload-invalid-model-id: Consumer died before test could be executed
config-reload-then-transcribe: Consumer died before test could be executed
config-registry-download-smoke: Consumer died before test could be executed
delegated-provider-start: Consumer died before test could be executed
delegated-provider-stop: Consumer died before test could be executed
delegated-load-model-fallback-local: Consumer died before test could be executed
delegated-connection-failure: Consumer died before test could be executed
diffusion-basic-txt2img: Consumer died before test could be executed
diffusion-streaming-progress: Consumer died before test could be executed
download-cancel-isolation: Consumer died before test could be executed
embed-simple-text: Consumer died before test could be executed
embed-batch: Consumer died before test could be executed
error-invalid-model-id: Consumer died before test could be executed
error-structured-error-code: Consumer died before test could be executed
error-rag-operation-failed: Consumer died before test could be executed
error-completion-negative-temperature: Consumer died before test could be executed
error-use-unloaded-model: Consumer died before test could be executed
finetune-start-complete: Consumer died before test could be executed
http-sharded-embed-load: Consumer died before test could be executed
http-sharded-embed-inference: Consumer died before test could be executed
kv-cache-delete-all: Consumer died before test could be executed
kv-cache-sliding-window: Consumer died before test could be executed
kv-cache-streaming-sliding-window: Consumer died before test could be executed
kv-cache-stats-verification: Consumer died before test could be executed
lifecycle-suspend-resume-inference: Consumer died before test could be executed
lifecycle-state-transitions: Consumer died before test could be executed
model-info-get: Consumer died before test could be executed
model-info-persists-after-unload: Consumer died before test could be executed
model-info-loaded-get: Consumer died before test could be executed
model-info-loaded-not-found: Consumer died before test could be executed
model-load-inferred-type: Consumer died before test could be executed
model-load-missing-type-string-src: Consumer died before test could be executed
ocr-basic-png: Consumer died before test could be executed
ocr-streaming: Consumer died before test could be executed
ocr-stats: Consumer died before test could be executed
ocr-block-structure: Consumer died before test could be executed
rag-embeddings-small-chunks: Consumer died before test could be executed
rag-large-document-32kb: Consumer died before test could be executed
rag-medium-document-10kb: Consumer died before test could be executed
registry-list-returns-models: Consumer died before test could be executed
registry-search-by-engine-llm: Consumer died before test could be executed
registry-get-model-valid: Consumer died before test could be executed
registry-get-model-not-found: Consumer died before test could be executed
sharded-model-load: Consumer died before test could be executed
sharded-model-detection: Consumer died before test could be executed
sharded-model-inference: Consumer died before test could be executed
tools-simple-function: Consumer died before test could be executed
tools-multiple-functions: Consumer died before test could be executed
transcription-short-wav: Consumer became unresponsive (no heartbeat for 121s)
transcription-short-mp3: Consumer died before test could be executed
transcription-streaming: Consumer died before test could be executed
transcription-corrupted-mp3: Consumer died before test could be executed
transcription-with-prompt: Consumer died before test could be executed
translation-indictrans-en-hi-basic: Consumer died before test could be executed
translation-indictrans-hi-en-basic: Consumer died before test could be executed
translation-bergamot-en-fr-basic: Consumer died before test could be executed
translation-bergamot-en-fr-streaming: Consumer died before test could be executed
translation-bergamot-en-es-basic: Consumer died before test could be executed
translation-llm-en-es: Consumer died before test could be executed
translation-llm-streaming: Consumer died before test could be executed
translation-salamandra-en-es: Consumer died before test could be executed
translation-salamandra-streaming: Consumer died before test could be executed
translation-afriquegemma-sw-en: Consumer died before test could be executed
tts-chatterbox-short-text: Consumer died before test could be executed
tts-supertonic-streaming: Consumer died before test could be executed
vision-basic: Consumer died before test could be executed
vision-streaming: Consumer died before test could be executed
vision-error-missing-image: Consumer died before test could be executed
wrong-model-transcribe-on-llm: Consumer died before test could be executed

github-actions · 2026-05-05T07:43:01Z

QVAC E2E — `android` — ⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

The test job did not produce a results artifact. Check the run for job-level failures.

maxim-smotrov · 2026-05-05T08:17:35Z

/review

maxim-smotrov · 2026-05-05T10:31:09Z

/review

…ancel-on-first-token (#1880)

fix(llm): prime system-prompt KV cache via addon prefill instead of c…

cf0106a

…ancel-on-first-token

maxim-smotrov requested review from a team as code owners May 4, 2026 10:25

Merge branch 'main' into fix/llm-use-prefill

7be92ba

maxim-smotrov temporarily deployed to release May 4, 2026 10:26 — with GitHub Actions Inactive

Merge branch 'main' into fix/llm-use-prefill

4a13b73

maxim-smotrov requested a review from a team as a code owner May 4, 2026 12:42

maxim-smotrov had a problem deploying to release May 4, 2026 12:42 — with GitHub Actions Failure

Merge branch 'main' into fix/llm-use-prefill

960009a

maxim-smotrov had a problem deploying to release May 4, 2026 14:15 — with GitHub Actions Failure

Merge branch 'main' into fix/llm-use-prefill

1fdce5a

maxim-smotrov had a problem deploying to release May 4, 2026 14:16 — with GitHub Actions Failure

Merge branch 'main' into fix/llm-use-prefill

5360bf4

maxim-smotrov temporarily deployed to release May 4, 2026 14:27 — with GitHub Actions Inactive

Merge branch 'main' into fix/llm-use-prefill

6fa9086

maxim-smotrov temporarily deployed to release May 4, 2026 15:07 — with GitHub Actions Inactive

Merge branch 'main' into fix/llm-use-prefill

1c9df8c

maxim-smotrov had a problem deploying to release May 4, 2026 16:46 — with GitHub Actions Failure

simon-iribarren approved these changes May 4, 2026

View reviewed changes

Merge branch 'main' into fix/llm-use-prefill

540c72b

maxim-smotrov had a problem deploying to release May 5, 2026 07:34 — with GitHub Actions Failure

maxim-smotrov added the test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] label May 5, 2026

maxim-smotrov temporarily deployed to release May 5, 2026 07:42 — with GitHub Actions Inactive

maxim-smotrov had a problem deploying to release May 5, 2026 07:42 — with GitHub Actions Error

maxim-smotrov temporarily deployed to release May 5, 2026 07:42 — with GitHub Actions Inactive

maxim-smotrov temporarily deployed to release May 5, 2026 07:55 — with GitHub Actions Inactive

maxim-smotrov had a problem deploying to release May 5, 2026 07:55 — with GitHub Actions Error

maxim-smotrov had a problem deploying to release May 5, 2026 07:56 — with GitHub Actions Failure

maxim-smotrov temporarily deployed to release May 5, 2026 07:56 — with GitHub Actions Inactive

NamelsKing approved these changes May 5, 2026

View reviewed changes

maxim-smotrov temporarily deployed to release May 5, 2026 08:30 — with GitHub Actions Inactive

donriddo approved these changes May 5, 2026

View reviewed changes

maxim-smotrov temporarily deployed to release May 5, 2026 09:28 — with GitHub Actions Inactive

Merge branch 'main' into fix/llm-use-prefill

f1bb27a

maxim-smotrov had a problem deploying to release May 5, 2026 10:31 — with GitHub Actions Failure

maxim-smotrov merged commit 0d755b4 into tetherto:main May 5, 2026
12 of 13 checks passed

maxim-smotrov had a problem deploying to release May 5, 2026 10:31 — with GitHub Actions Failure

tamer-hassan-tether pushed a commit that referenced this pull request May 5, 2026

fix(llm): prime system-prompt KV cache via addon prefill instead of c…

ad9753f

…ancel-on-first-token (#1880)

Proletter pushed a commit that referenced this pull request May 24, 2026

fix(llm): prime system-prompt KV cache via addon prefill instead of c…

3ae54b2

…ancel-on-first-token (#1880)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QVAC-18036 fix: prime system-prompt KV cache via addon prefill instead of cancel-on-first-token#1880

QVAC-18036 fix: prime system-prompt KV cache via addon prefill instead of cancel-on-first-token#1880
maxim-smotrov merged 10 commits into
tetherto:mainfrom
maxim-smotrov:fix/llm-use-prefill

maxim-smotrov commented May 4, 2026

Uh oh!

simon-iribarren commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 5, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 5, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 5, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 5, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 5, 2026 •

edited

Loading

Uh oh!

maxim-smotrov commented May 5, 2026

Uh oh!

maxim-smotrov commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

maxim-smotrov commented May 4, 2026

🎯 What problem does this PR solve?

📝 How does it solve it?

🧪 How was it tested?

Uh oh!

simon-iribarren commented May 4, 2026

Suggested regression tests

1. Unit — prefill: true is forwarded, cancel() is never called

2. Unit — priming errors propagate (regression for the masked-failure class)

3. Unit — tools are forwarded into the prime payload

4. (Optional, lives outside unit suite) Integration assertion

Uh oh!

github-actions Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Tier-based Approval Status

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

QVAC E2E — windows — ⚠️ no results

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

QVAC E2E — linux — ✅ all tests passed (89/89, 466s)

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

QVAC E2E — macos — ✅ all tests passed (89/89, 332s)

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

QVAC E2E — ios — ❌ failed

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

QVAC E2E — android — ⚠️ no results

Uh oh!

maxim-smotrov commented May 5, 2026

Uh oh!

maxim-smotrov commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. Unit — `prefill: true` is forwarded, `cancel()` is never called

github-actions Bot commented May 4, 2026 •

edited

Loading

github-actions Bot commented May 5, 2026 •

edited

Loading

QVAC E2E — `windows` — ⚠️ no results

github-actions Bot commented May 5, 2026 •

edited

Loading

QVAC E2E — `linux` — ✅ all tests passed (89/89, 466s)

github-actions Bot commented May 5, 2026 •

edited

Loading

QVAC E2E — `macos` — ✅ all tests passed (89/89, 332s)

github-actions Bot commented May 5, 2026 •

edited

Loading

QVAC E2E — `ios` — ❌ failed

github-actions Bot commented May 5, 2026 •

edited

Loading

QVAC E2E — `android` — ⚠️ no results