[Feature] Support llama.cpp cache_prompt parameter #16

@Mushoz

Description

The llama.cpp server accepts an optional `cache_prompt` parameter in the request body, which reuses the KV cache for matching prompt prefixes; see: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md . This can massively speed up prompt processing. Furthermore, it is a required parameter to make use of the newly introduced speculative decoding; see: ggml-org/llama.cpp#10455 (comment)
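
For reference, here is a minimal sketch (in Go) of what such a request looks like when sent straight to the llama.cpp server. The endpoint and field names follow the server README linked above; the host, port, prompt, and `n_predict` value are just placeholders:

```go
// Sketch: a direct completion request to a llama.cpp server with cache_prompt enabled.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"prompt":       "Once upon a time",
		"n_predict":    64,
		"cache_prompt": true, // reuse the KV cache for the matching prompt prefix
	})
	resp, err := http.Post("http://localhost:8080/completion", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```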

Very few clients support setting this optional parameter, so there is currently no easy way to use this functionality. Therefore, I believe it would be ideal if llama-swap could add this parameter to requests itself. That way, the feature becomes available without having to be implemented on a client-by-client basis.
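
To illustrate the idea, here is a rough, hypothetical sketch of how a reverse proxy could inject the parameter into JSON request bodies before forwarding them upstream. None of the names below come from the llama-swap codebase, and a client that sets `cache_prompt` explicitly is left untouched:

```go
// Hypothetical sketch only; not llama-swap's actual implementation.
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// injectCachePrompt adds "cache_prompt": true to a JSON request body unless
// the client already set it, so explicit client choices are preserved.
func injectCachePrompt(req *http.Request) {
	if req.Method != http.MethodPost || req.Body == nil {
		return
	}
	raw, err := io.ReadAll(req.Body)
	req.Body.Close()
	if err != nil {
		return
	}
	var payload map[string]any
	if json.Unmarshal(raw, &payload) == nil {
		if _, present := payload["cache_prompt"]; !present {
			payload["cache_prompt"] = true
			if patched, err := json.Marshal(payload); err == nil {
				raw = patched
			}
		}
	}
	req.Body = io.NopCloser(bytes.NewReader(raw))
	req.ContentLength = int64(len(raw))
}

func main() {
	upstream, _ := url.Parse("http://localhost:8080") // the llama.cpp server
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	base := proxy.Director
	proxy.Director = func(req *http.Request) {
		base(req)
		injectCachePrompt(req)
	}
	log.Fatal(http.ListenAndServe(":9090", proxy)) // clients point at the proxy instead
}
```

Whether the injected value should be unconditional or configurable per model could be left to the llama-swap config.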
