QVAC-18036 fix: prime system-prompt KV cache via addon prefill instead of cancel-on-first-token #1880
Following up on the missing-regression-test nit from review: here are concrete suggestions, scoped to what's testable without a real model.
Suggested regression tests
The behavior to lock down is narrow:
1. Unit —
🎯 What problem does this PR solve?
`initSystemPromptCache` primed a new KV cache by starting generation and then cancelling on the first output token. That still produced one token of work, relied on a race between `output` and `cancel`, and created unnecessary load when warming caches at startup or on cache-key churn.
📝 How does it solve it?
Priming now uses the addon's `prefill: true` runtime option. The prompt and tools are ingested into the KV cache and persisted to disk without producing any output tokens, so `runModel` resolves as soon as priming finishes. Specifically:
- Extends the `CompletionRunOptions` type to forward `prefill` alongside `cacheKey`/`saveCacheToDisk`.
- Removes the `output`→`cancel` race in `initSystemPromptCache`; it now just `await`s the prime response.
- Bumps `@qvac/llm-llamacpp` from `^0.17.1` to `^0.17.3` to pick up the `prefill` flag.
- Updates `kv-cache-system.mdc` to document the new prefill-based priming flow.
🧪 How was it tested?
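One way to check this behavior without a real model, in the spirit of the regression tests suggested in review, is sketched below. It is only an illustration: the `RunModel` signature, the `RunRecord` type, `createFakeModel`, and the three-argument form of `initSystemPromptCache` are assumptions made for the example, and the real SDK types and the addon's actual prefill handling may differ.

```typescript
// Hypothetical, simplified option shape; the field names come from the PR text,
// but the real CompletionRunOptions type may carry more fields.
interface CompletionRunOptions {
  cacheKey: string;
  saveCacheToDisk: boolean;
  prefill?: boolean;
}

// Hypothetical record of one call into the fake model.
interface RunRecord {
  prompt: string;
  options: CompletionRunOptions;
  tokensEmitted: number;
}

// Assumed signature: resolves with the output tokens once the run finishes.
type RunModel = (
  prompt: string,
  options: CompletionRunOptions
) => Promise<string[]>;

// Fake model: records every call and emits no tokens when prefill is set.
function createFakeModel(records: RunRecord[]): RunModel {
  return async (prompt, options) => {
    const tokens = options.prefill ? [] : ["hello", "world"];
    records.push({ prompt, options, tokensEmitted: tokens.length });
    return tokens;
  };
}

// Simplified stand-in for initSystemPromptCache: no output/cancel race,
// it only awaits the prime response.
async function initSystemPromptCache(
  runModel: RunModel,
  systemPrompt: string,
  cacheKey: string
): Promise<void> {
  await runModel(systemPrompt, {
    cacheKey,
    saveCacheToDisk: true,
    prefill: true,
  });
}
```

A unit test built on this sketch would call `initSystemPromptCache` with the fake model and then assert that exactly one run was recorded, that its options carried `prefill: true` together with the cache key, and that no tokens were emitted.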