
LLamaDiskCache: needs a RO / 'static' disk cache for RAG use cases #1737

Open
tc-wolf opened this issue Sep 11, 2024 · 0 comments
Contributor

tc-wolf commented Sep 11, 2024

Is your feature request related to a problem? Please describe.
I have a dataset that I'm using for RAG. The user's question drives a lookup for the top N most relevant documents, which are then used to build a prompt that looks like:

<system>
You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>
<user>
Document 1: Some information that is relevant
Document 2: Other information
Document 3: Final information

User question: "How do I do <y>?"<eot>
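
For concreteness, assembling such a prompt from the retrieved documents might look roughly like this (the template tags and helper name here are illustrative only, not code from my branch):

```python
SYSTEM = (
    "<system>\n"
    "You are an assistant for domain <x>, you summarize information and "
    "blah blah blah.<eot>\n"
)


def build_prompt(question: str, documents: list[str]) -> str:
    # Number the retrieved documents in relevance order, then append the question.
    doc_lines = "\n".join(
        f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1)
    )
    return f'{SYSTEM}<user>\n{doc_lines}\n\nUser question: "{question}"<eot>\n'
```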

In order to minimize latency, I've developed a "static" disk cache that, for every document in my dataset, contains the state for the system prompt + that document as the first document in context. (An example script for building this, though an old one, is also in my branch.)

This way, I only need to ingest the remaining documents + the user question during prompt processing, which saves a lot of time-to-first-token for this use case.
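
As a rough sketch of the idea (using only the existing llama-cpp-python API; the model path, system prompt, and document list below are placeholders, and the proposed class would wrap this loop in a factory method), building the cache amounts to evaluating each "system prompt + first document" prefix once and persisting the resulting state keyed by its tokens:

```python
from llama_cpp import Llama, LlamaDiskCache

SYSTEM_PROMPT = "<system>\nYou are an assistant for domain <x> ...<eot>\n"  # placeholder
documents = ["Some information that is relevant", "Other information"]      # placeholder corpus

llama = Llama(model_path="model.gguf", n_ctx=4096)         # placeholder model path
cache = LlamaDiskCache(cache_dir="./static_prompt_cache")  # placeholder cache dir

for doc in documents:
    # One cached prefix per document: system prompt + that document as "Document 1".
    prefix = f"{SYSTEM_PROMPT}<user>\nDocument 1: {doc}\n"
    tokens = llama.tokenize(prefix.encode("utf-8"))

    llama.reset()                       # start from an empty context
    llama.eval(tokens)                  # ingest the prefix once, offline
    cache[tokens] = llama.save_state()  # persist the state keyed by the token prefix
```

At inference time the cache would be attached with llama.set_cache(cache) and hit whenever the incoming prompt starts with one of the stored prefixes.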

Describe the solution you'd like
I'd like to upstream the LlamaStaticDiskCache class in my branch. It's very similar to the existing LlamaDiskCache but:

  • Cache is not mutable once built
    • (Does not pop in __getitem__)
  • It uses a trie for finding the longest matching prefix (if any) in the cache
  • It has a convenience factory method for building from a list of prompts

So it's well-suited for use cases where you want to build the cache once (for a given model + context size + batch size) and then reload at inference time based on matching the prefix of the prompt.
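
To illustrate the longest-matching-prefix lookup mentioned above (a minimal sketch of the technique only, not the code in the branch; the class and method names are made up):

```python
from __future__ import annotations


class TokenPrefixTrie:
    """Minimal trie over token ids, mapping stored cache keys to their prefixes."""

    def __init__(self) -> None:
        self.children: dict[int, TokenPrefixTrie] = {}
        self.key: tuple[int, ...] | None = None  # set when a stored key ends at this node

    def insert(self, key: tuple[int, ...]) -> None:
        node = self
        for tok in key:
            node = node.children.setdefault(tok, TokenPrefixTrie())
        node.key = key

    def longest_prefix(self, tokens: list[int]) -> tuple[int, ...] | None:
        """Return the longest stored key that is a prefix of `tokens`, or None."""
        node, best = self, None
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            if node.key is not None:
                best = node.key
        return best


# Usage: keys are the token sequences of the cached prefixes.
trie = TokenPrefixTrie()
trie.insert((1, 15, 22, 8))                     # e.g. system prompt + document A
trie.insert((1, 15, 31))                        # e.g. system prompt + document B
print(trie.longest_prefix([1, 15, 22, 8, 99]))  # (1, 15, 22, 8)
```

Since the cache is read-only, __getitem__ can then return the saved state for the matched key without popping it, unlike the existing LlamaDiskCache.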

Complication / Details with this

I've found that when running locally (macOS + Metal GPU) and deploying on different hardware (Linux + CPU), I've had to make a minor change to llama.cpp to avoid serializing / deserializing the RNG state.

I.e., skip loading and set seed for reproducibility: tc-wolf/llama.cpp@ea43d92

I don't think that this will be a factor anymore because ggerganov/llama.cpp#9294 has removed serializing / deserializing the RNG when saving.

Describe alternatives you've considered

  • Use lower-level state-saving functions (rather than pickling llama.save_state()) so that each entry takes less space on disk than the full model file.
  • Use a more efficient strategy for saving - right now, if every key shares the same system prompt (for example), that prefix is saved independently for every stored prompt. A lot of space could be saved by deduplicating and only saving each shared prefix once, but it complicates the saving/loading logic (a rough sketch of the idea follows after this list).
    • Could also allow partial matches when checking the cache - right now a key has to be a full prefix of the input tokens, but we could look for a partial match to allow for more graceful degradation.
    • This also complicates the __getitem__ logic.
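
A rough sketch of the deduplicated layout idea (purely illustrative; none of these names exist in the library): store each shared-prefix state once and have each full key reference it.

```python
from __future__ import annotations

# Hypothetical split storage: one state blob per unique shared prefix,
# plus a small index mapping each full cache key to its shared prefix.
prefix_states: dict[tuple[int, ...], bytes] = {}        # shared prefix tokens -> pickled state (stored once)
key_index: dict[tuple[int, ...], tuple[int, ...]] = {}  # full key -> shared prefix tokens


def put(full_key: tuple[int, ...], shared_prefix: tuple[int, ...], state_blob: bytes) -> None:
    prefix_states.setdefault(shared_prefix, state_blob)  # the shared prefix is written only once
    key_index[full_key] = shared_prefix


def get(full_key: tuple[int, ...]) -> bytes | None:
    prefix = key_index.get(full_key)
    return prefix_states.get(prefix) if prefix is not None else None
```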