Skip to content

QVAC-18036 fix: prime system-prompt KV cache via addon prefill instead of cancel-on-first-token#1880

Merged
maxim-smotrov merged 10 commits into
tetherto:mainfrom
maxim-smotrov:fix/llm-use-prefill
May 5, 2026
Merged

QVAC-18036 fix: prime system-prompt KV cache via addon prefill instead of cancel-on-first-token#1880
maxim-smotrov merged 10 commits into
tetherto:mainfrom
maxim-smotrov:fix/llm-use-prefill

Conversation

@maxim-smotrov

Copy link
Copy Markdown
Contributor

🎯 What problem does this PR solve?

  • initSystemPromptCache primed a new KV cache by starting generation and then cancelling on the first output token. That still produced one token of work, relied on a race between output and cancel, and created unnecessary load when warming caches at startup or on cache-key churn.

📝 How does it solve it?

  • Switch system-prompt cache priming to the addon's new prefill: true runtime option. The prompt + tools are ingested into the KV cache and persisted to disk without producing any output tokens, so runModel resolves as soon as priming finishes. Specifically:
    • Extend the internal CompletionRunOptions type to forward prefill alongside cacheKey / saveCacheToDisk.
    • Drop the outputcancel race in initSystemPromptCache; just await the prime response.
    • Bump @qvac/llm-llamacpp from ^0.17.1 to ^0.17.3 to pick up the prefill flag.
    • Update kv-cache-system.mdc to document the new prefill-based priming flow.

🧪 How was it tested?

  • Ran examples/kv-cache-custom-key.js.
  • Ran e2e tests for cache

@maxim-smotrov maxim-smotrov requested review from a team as code owners May 4, 2026 10:25
@maxim-smotrov maxim-smotrov requested a review from a team as a code owner May 4, 2026 12:42
@simon-iribarren

Copy link
Copy Markdown
Contributor

Following up on the missing-regression-test nit from review — here are concrete suggestions, scoped to what's testable without a real model.

Suggested regression tests

The behavior to lock down is narrow: initSystemPromptCache must call the addon with prefill: true and must not wire an output → cancel race anymore. That's verifiable with a fake AnyModel whose run() records the RunOptions and returns a stub QvacResponse.

1. Unit — prefill: true is forwarded, cancel() is never called

File: packages/sdk/test/unit/init-system-prompt-cache.test.ts (new), using brittle like kv-cache-cancel.test.ts.

import test from "brittle";
import { initSystemPromptCache } from "@/server/bare/plugins/llamacpp-completion/ops/completion-stream";

function makeFakeModel() {
  const calls: { prompt: unknown; opts: unknown }[] = [];
  let cancelCount = 0;
  const fakeResponse = {
    once: (_event: string, _cb: () => void) => fakeResponse,
    cancel: async () => { cancelCount++; },
    await: async () => {},
  };
  return {
    model: { run: (prompt: unknown, opts: unknown) => {
      calls.push({ prompt, opts });
      return fakeResponse;
    }},
    calls,
    getCancelCount: () => cancelCount,
  };
}

test("initSystemPromptCache primes with prefill: true and never cancels", async (t) => {
  const { model, calls, getCancelCount } = makeFakeModel();
  await initSystemPromptCache(model as any, '/tmp/cache.bin', 'sys', 'key', []);
  t.is(calls.length, 1);
  t.alike(calls[0].opts, { cacheKey: '/tmp/cache.bin', saveCacheToDisk: true, prefill: true });
  t.is(getCancelCount(), 0, 'must not cancel — prefill produces no output tokens');
});

(Requires exporting initSystemPromptCache from completion-stream.ts — currently file-private. One-line change.)

2. Unit — priming errors propagate (regression for the masked-failure class)

The old code's once('output', cancel) could mask priming failures if the addon ever emitted output after a failed prefill. The new code makes await primeResponse.await() the only signal, so an error must surface:

test("initSystemPromptCache propagates addon errors from await()", async (t) => {
  const failingResponse = {
    once: () => failingResponse,
    cancel: async () => {},
    await: async () => { throw new Error('addon prefill failed'); },
  };
  const model = { run: () => failingResponse };
  await t.exception(() => initSystemPromptCache(model as any, '/tmp/cache.bin', 'sys', 'key', []));
});

3. Unit — tools are forwarded into the prime payload

Confirms the system-prompt + tools transformation didn't regress when the cancel block was removed:

test("initSystemPromptCache includes tools in the prime payload", async (t) => {
  const { model, calls } = makeFakeModel();
  const tools = [{ type: 'function', name: 'add', description: 'sum', parameters: {} }];
  await initSystemPromptCache(model as any, '/tmp/cache.bin', 'sys', 'key', tools as any);
  const prompt = calls[0].prompt as any[];
  t.is(prompt[0].role, 'system');
  t.ok(prompt.some((m: any) => m.type === 'function' && m.name === 'add'));
});

4. (Optional, lives outside unit suite) Integration assertion

In whichever e2e the team already runs for examples/kv-cache-custom-key.ts, add two post-conditions:

  • The prime call's QvacResponse emits zero output events end-to-end (proves prefill: true truly suppresses generation).
  • The cache file exists on disk after await primeResponse.await() resolves.

These three unit tests are cheap, don't require a model, and pin exactly the contract this PR is establishing — they would have caught the prior race regressing back as well.

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — windows⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

The test job did not produce a results artifact. Check the run for job-level failures.

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — linux — ✅ all tests passed (89/89, 466s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — macos — ✅ all tests passed (89/89, 332s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — ios — ❌ failed

Totals: 18/89 passed · 71 failed · 20.2% · 660s
Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

Results by section

  • addon-logging: 0/2 ❌
  • config: 0/1 ❌
  • config-reload: 0/2 ❌
  • delegated-inference: 0/4 ❌
  • diffusion: 0/2 ❌
  • download: 0/1 ❌
  • embedding: 0/2 ❌
  • error: 0/5 ❌
  • finetune: 0/1 ❌
  • http: 0/2 ❌
  • kv-cache: 0/4 ❌
  • lifecycle: 0/2 ❌
  • model: 6/8 ❌
  • model-info: 0/4 ❌
  • ocr: 0/4 ❌
  • rag: 0/3 ❌
  • registry: 0/4 ❌
  • sharded-model: 0/3 ❌
  • tools: 0/2 ❌
  • transcription: 0/5 ❌
  • translation-afriquegemma: 0/1 ❌
  • translation-bergamot: 0/3 ❌
  • translation-indictrans: 0/2 ❌
  • translation-llm: 0/2 ❌
  • translation-salamandra: 0/2 ❌
  • tts: 0/2 ❌
  • vision: 0/3 ❌
  • wrong-model: 0/1 ❌

Failed tests

  • addon-logging-llm: Consumer died before test could be executed
  • addon-logging-during-inference: Consumer died before test could be executed
  • config-reload-invalid-model-id: Consumer died before test could be executed
  • config-reload-then-transcribe: Consumer died before test could be executed
  • config-registry-download-smoke: Consumer died before test could be executed
  • delegated-provider-start: Consumer died before test could be executed
  • delegated-provider-stop: Consumer died before test could be executed
  • delegated-load-model-fallback-local: Consumer died before test could be executed
  • delegated-connection-failure: Consumer died before test could be executed
  • diffusion-basic-txt2img: Consumer died before test could be executed
  • diffusion-streaming-progress: Consumer died before test could be executed
  • download-cancel-isolation: Consumer died before test could be executed
  • embed-simple-text: Consumer died before test could be executed
  • embed-batch: Consumer died before test could be executed
  • error-invalid-model-id: Consumer died before test could be executed
  • error-structured-error-code: Consumer died before test could be executed
  • error-rag-operation-failed: Consumer died before test could be executed
  • error-completion-negative-temperature: Consumer died before test could be executed
  • error-use-unloaded-model: Consumer died before test could be executed
  • finetune-start-complete: Consumer died before test could be executed
  • http-sharded-embed-load: Consumer died before test could be executed
  • http-sharded-embed-inference: Consumer died before test could be executed
  • kv-cache-delete-all: Consumer died before test could be executed
  • kv-cache-sliding-window: Consumer died before test could be executed
  • kv-cache-streaming-sliding-window: Consumer died before test could be executed
  • kv-cache-stats-verification: Consumer died before test could be executed
  • lifecycle-suspend-resume-inference: Consumer died before test could be executed
  • lifecycle-state-transitions: Consumer died before test could be executed
  • model-info-get: Consumer died before test could be executed
  • model-info-persists-after-unload: Consumer died before test could be executed
  • model-info-loaded-get: Consumer died before test could be executed
  • model-info-loaded-not-found: Consumer died before test could be executed
  • model-load-inferred-type: Consumer died before test could be executed
  • model-load-missing-type-string-src: Consumer died before test could be executed
  • ocr-basic-png: Consumer died before test could be executed
  • ocr-streaming: Consumer died before test could be executed
  • ocr-stats: Consumer died before test could be executed
  • ocr-block-structure: Consumer died before test could be executed
  • rag-embeddings-small-chunks: Consumer died before test could be executed
  • rag-large-document-32kb: Consumer died before test could be executed
  • rag-medium-document-10kb: Consumer died before test could be executed
  • registry-list-returns-models: Consumer died before test could be executed
  • registry-search-by-engine-llm: Consumer died before test could be executed
  • registry-get-model-valid: Consumer died before test could be executed
  • registry-get-model-not-found: Consumer died before test could be executed
  • sharded-model-load: Consumer died before test could be executed
  • sharded-model-detection: Consumer died before test could be executed
  • sharded-model-inference: Consumer died before test could be executed
  • tools-simple-function: Consumer died before test could be executed
  • tools-multiple-functions: Consumer died before test could be executed
  • transcription-short-wav: Consumer became unresponsive (no heartbeat for 121s)
  • transcription-short-mp3: Consumer died before test could be executed
  • transcription-streaming: Consumer died before test could be executed
  • transcription-corrupted-mp3: Consumer died before test could be executed
  • transcription-with-prompt: Consumer died before test could be executed
  • translation-indictrans-en-hi-basic: Consumer died before test could be executed
  • translation-indictrans-hi-en-basic: Consumer died before test could be executed
  • translation-bergamot-en-fr-basic: Consumer died before test could be executed
  • translation-bergamot-en-fr-streaming: Consumer died before test could be executed
  • translation-bergamot-en-es-basic: Consumer died before test could be executed
  • translation-llm-en-es: Consumer died before test could be executed
  • translation-llm-streaming: Consumer died before test could be executed
  • translation-salamandra-en-es: Consumer died before test could be executed
  • translation-salamandra-streaming: Consumer died before test could be executed
  • translation-afriquegemma-sw-en: Consumer died before test could be executed
  • tts-chatterbox-short-text: Consumer died before test could be executed
  • tts-supertonic-streaming: Consumer died before test could be executed
  • vision-basic: Consumer died before test could be executed
  • vision-streaming: Consumer died before test could be executed
  • vision-error-missing-image: Consumer died before test could be executed
  • wrong-model-transcribe-on-llm: Consumer died before test could be executed

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — android⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

The test job did not produce a results artifact. Check the run for job-level failures.

@maxim-smotrov

Copy link
Copy Markdown
Contributor Author

/review

@maxim-smotrov

Copy link
Copy Markdown
Contributor Author

/review

@maxim-smotrov maxim-smotrov merged commit 0d755b4 into tetherto:main May 5, 2026
12 of 13 checks passed
tamer-hassan-tether pushed a commit that referenced this pull request May 5, 2026
Proletter pushed a commit that referenced this pull request May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants