Closed
Labels: enhancement (New feature or request)
Description
llama.cpp server accepts an optional `cache_prompt` parameter in the request to reuse the KV cache for matching prompt prefixes; see https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md. This can massively speed up prompt processing. Furthermore, it is a required parameter for using the newly introduced speculative decoding; see ggml-org/llama.cpp#10455 (comment).
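For illustration, here is a minimal Go sketch of a `/completion` request that sets `cache_prompt`. The server address (`localhost:8080`), prompt, and `n_predict` value are assumptions; the parameter itself is documented in the README linked above:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Request body for llama.cpp server's /completion endpoint.
	// cache_prompt asks the server to reuse the KV cache for a
	// matching prompt prefix instead of reprocessing it.
	body, _ := json.Marshal(map[string]any{
		"prompt":       "Hello, world",
		"n_predict":    64,
		"cache_prompt": true,
	})

	// localhost:8080 is an assumption; use your server's address.
	resp, err := http.Post("http://localhost:8080/completion",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```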
Very few clients support setting this optional parameter, so there is no easy way to use this functionality. Therefore, I believe it would be ideal if llama-swap were capable of adding this parameter to requests. That way, the feature becomes available without having to implement it on a client-by-client basis; see the sketch below.
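A minimal sketch of the proposed behavior, assuming llama-swap forwards requests as an HTTP reverse proxy: decode the JSON body, inject `cache_prompt` when the client did not set it, and pass the rewritten request upstream. The handler name, port, and upstream address are all hypothetical, not llama-swap's actual implementation:

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// injectCachePrompt is a hypothetical middleware: it sets cache_prompt
// on JSON request bodies that do not already specify it, then hands the
// request to the next handler unchanged otherwise.
func injectCachePrompt(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		raw, err := io.ReadAll(r.Body)
		var payload map[string]any
		if err == nil && json.Unmarshal(raw, &payload) == nil {
			if _, set := payload["cache_prompt"]; !set {
				payload["cache_prompt"] = true // inject the parameter
			}
			raw, _ = json.Marshal(payload)
		}
		// Restore the (possibly rewritten) body for the proxy.
		r.Body = io.NopCloser(bytes.NewReader(raw))
		r.ContentLength = int64(len(raw))
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Upstream address is an assumption; point it at llama.cpp server.
	upstream, _ := url.Parse("http://localhost:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	http.ListenAndServe(":9090", injectCachePrompt(proxy))
}
```

Only adding the key when the client omits it keeps the proxy from overriding clients that deliberately set `cache_prompt: false`.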