[Feature] Support llama.cpp cache_prompt parameter #16

@Mushoz

Description

The llama.cpp server accepts an optional `cache_prompt` parameter in the request body, which reuses the KV cache for matching prompt prefixes; see: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md . This can massively speed up prompt processing. Furthermore, it is a required parameter to make use of the newly introduced speculative decoding; see: ggml-org/llama.cpp#10455 (comment)
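
For reference, here is a minimal sketch (in Go) of what such a request looks like when sent straight to the llama.cpp server. The endpoint and field names follow the server README linked above; the host, port, prompt, and `n_predict` value are just placeholders:

```go
// Sketch: a direct completion request to a llama.cpp server with cache_prompt enabled.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"prompt":       "Once upon a time",
		"n_predict":    64,
		"cache_prompt": true, // reuse the KV cache for the matching prompt prefix
	})
	resp, err := http.Post("http://localhost:8080/completion", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```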

Very few clients support setting this optional parameter, so there is currently no easy way to use this functionality. Therefore, I believe it would be ideal if llama-swap could add this parameter to requests itself. That way, the feature becomes available without having to be implemented on a client-by-client basis.
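
To illustrate the idea, here is a rough, hypothetical sketch of how a reverse proxy could inject the parameter into JSON request bodies before forwarding them upstream. None of the names below come from the llama-swap codebase, and a client that sets `cache_prompt` explicitly is left untouched:

```go
// Hypothetical sketch only; not llama-swap's actual implementation.
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// injectCachePrompt adds "cache_prompt": true to a JSON request body unless
// the client already set it, so explicit client choices are preserved.
func injectCachePrompt(req *http.Request) {
	if req.Method != http.MethodPost || req.Body == nil {
		return
	}
	raw, err := io.ReadAll(req.Body)
	req.Body.Close()
	if err != nil {
		return
	}
	var payload map[string]any
	if json.Unmarshal(raw, &payload) == nil {
		if _, present := payload["cache_prompt"]; !present {
			payload["cache_prompt"] = true
			if patched, err := json.Marshal(payload); err == nil {
				raw = patched
			}
		}
	}
	req.Body = io.NopCloser(bytes.NewReader(raw))
	req.ContentLength = int64(len(raw))
}

func main() {
	upstream, _ := url.Parse("http://localhost:8080") // the llama.cpp server
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	base := proxy.Director
	proxy.Director = func(req *http.Request) {
		base(req)
		injectCachePrompt(req)
	}
	log.Fatal(http.ListenAndServe(":9090", proxy)) // clients point at the proxy instead
}
```

Whether the injected value should be unconditional or configurable per model could be left to the llama-swap config.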
