ggerganov
Member

Implemented just for CPU and Metal.
With Q4_0 there is a ~8% performance hit for TG 500 and ~4% for PP 512, but it might be possible to optimize the rope kernel further to compensate.

@ggerganov added the "demo" label (Demonstrate some concept or idea, not intended to be merged) on Sep 17, 2023
@slaren
Member

slaren commented Sep 18, 2023

These are the results I obtained on CPU (13900k):

| model | size | params | backend | th | test | Master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp 512 | 34.35 ± 0.31 | 36.52 ± 0.70 | 1.06 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 128 | 16.03 ± 0.21 | 15.59 ± 0.16 | 0.97 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 256 | 15.69 ± 0.35 | 14.97 ± 0.28 | 0.95 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 512 | 15.56 ± 1.06 | 13.87 ± 0.06 | 0.89 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 1024 | 15.12 ± 0.17 | 11.44 ± 0.22 | 0.76 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 2048 | 14.60 ± 0.22 | 8.48 ± 0.10 | 0.58 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 4096 | 12.89 ± 0.10 | 5.52 ± 0.04 | 0.42 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 8192 | 10.55 ± 0.11 | 3.25 ± 0.01 | 0.31 |

@ggerganov
Member Author

Yup, these results might be a strong argument against the non-RoPEd K cache.

@slaren
Member

slaren commented Sep 18, 2023

I suspect that the main cost is the copy of K, but the computational cost of calculating the RoPE shouldn't be too bad, and most of it could be replaced with a lookup table if needed. So a fused attention op that applies RoPE on the fly during the computation of KQ could be a viable way to do this.
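As a rough illustration of the fused idea, the sketch below keeps K un-rotated in the cache and applies RoPE to each key while the KQ scores are computed, reading the cos/sin values from a precomputed lookup table. It is plain Python, not ggml code: the function names, the toy sizes, and the pairing of consecutive dimensions are assumptions made for the example.

```python
import math

def build_rope_table(n_pos, head_dim, freq_base=10000.0):
    """Precompute (cos, sin) for every position and every pair of dimensions."""
    table = []
    for pos in range(n_pos):
        row = []
        for i in range(head_dim // 2):
            theta = pos * freq_base ** (-2.0 * i / head_dim)
            row.append((math.cos(theta), math.sin(theta)))
        table.append(row)
    return table

def rope(vec, pos, table):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the angles stored for pos."""
    out = []
    for i, (c, s) in enumerate(table[pos]):
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.extend((x0 * c - x1 * s, x0 * s + x1 * c))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scores_roped_cache(q, q_pos, k_cache_roped, table):
    """Baseline: keys were rotated once, when they were written to the cache."""
    q_rot = rope(q, q_pos, table)
    return [dot(q_rot, k) for k in k_cache_roped]

def scores_fused(q, q_pos, k_cache_raw, table):
    """Fused variant: keys are stored un-rotated and rotated on the fly, with no K copy."""
    q_rot = rope(q, q_pos, table)
    return [dot(q_rot, rope(k, pos, table)) for pos, k in enumerate(k_cache_raw)]

# Toy data (hypothetical values): 3 cached keys, head_dim = 4, query at position 2.
table = build_rope_table(n_pos=8, head_dim=4)
k_raw = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 1.0], [0.1, 0.2, 0.3, 0.4]]
q = [1.0, 2.0, -1.0, 0.5]
k_roped = [rope(k, pos, table) for pos, k in enumerate(k_raw)]

baseline = scores_roped_cache(q, 2, k_roped, table)
fused = scores_fused(q, 2, k_raw, table)
```

Both paths produce the same scores. The fused path trades the extra K copy for one rotation per key per query, which is the cost the lookup table is meant to keep small.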

@ggerganov
Member Author

> I suspect that the main cost is the copy of K

This is likely the case. We can verify this by replacing the rope with a cpy.

@ggerganov force-pushed the custom-attention-mask branch from 5bda9e2 to 0161372 on September 18, 2023 17:37
@Olexorus

Do I understand correctly that this makes the cache values independent of the token position?

Would this make it possible to precompute the KV cache for all (or maybe just the most common) tokens in the vocabulary, so that during inference time you only need to copy it and apply RoPE?

@ggerganov
Member Author

The memory requirements would be too huge for this to work

@Olexorus

Olexorus commented Oct 2, 2023

> The memory requirements would be too huge for this to work

Really? When I run a 13B model with 4096 context, I get the following output:
    llama_new_context_with_model: n_ctx = 4096
    llama_new_context_with_model: freq_base = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_new_context_with_model: kv self size = 3200.00 MB
    llama_new_context_with_model: compute buffer total size = 363.88 MB

I'm guessing this means that 4096 tokens of context require 3.2GB of memory. I believe Llama 2 has a vocabulary size of 32000. Wouldn't that mean that precomputing the cache for all tokens requires 32000 / 4096 * 3.2GB = 25GB?
While that is a lot, it doesn't seem unrealistic, at least when talking about CPU RAM. Also I'm guessing this could be halved with #2969, so only 12.5GB, and this could probably be decreased much further by only storing the most common tokens. I imagine this could massively speed up prompt processing on CPU. It might also be useful for very long context lengths, since it would actually consume less memory if the context is larger than the vocabulary size.

Though I don't know if memory requirements can actually be calculated like this, maybe all of this is wrong.
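The estimate can be cross-checked from the model shape. The sketch below assumes the usual LLaMA 13B dimensions (40 layers, 5120-wide embeddings), an f16 cache, and that the "MB" in the log means 1024 * 1024 bytes; none of these are stated in the thread, so treat them as assumptions.

```python
# Assumed model shape and cache type (not taken from the thread).
N_LAYER = 40          # transformer layers in LLaMA 13B
N_EMBD = 5120         # embedding width
BYTES_PER_ELEM = 2    # f16 cache entries
N_CTX = 4096          # from the log above
N_VOCAB = 32000       # Llama 2 vocabulary size quoted in the comment
MIB = 1024 * 1024

def kv_bytes_per_token(n_layer, n_embd, bytes_per_elem):
    """One K row and one V row of n_embd elements for every layer."""
    return 2 * n_layer * n_embd * bytes_per_elem

per_token = kv_bytes_per_token(N_LAYER, N_EMBD, BYTES_PER_ELEM)
ctx_mib = per_token * N_CTX / MIB      # cache for a full 4096-token context
vocab_mib = per_token * N_VOCAB / MIB  # one cache entry per vocabulary token

print(f"per token: {per_token} bytes")
print(f"n_ctx = {N_CTX}: {ctx_mib:.2f} MiB")
print(f"n_vocab = {N_VOCAB}: {vocab_mib:.2f} MiB")
```

Under these assumptions the context figure matches the 3200.00 MB reported in the log, and the per-vocabulary figure comes out at 25000 MiB, which agrees with the 25GB estimate.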

@ggerganov
Member Author

Thinking more about this, I guess the idea would work, but only if the transformer had a single layer. In that case the KV is always computed from the model's token embeddings; there are exactly n_vocab of those, so the results could be precomputed. However, in each layer after that, the embeddings used for the KV carry extra information mixed in from the other tokens in the context, due to the attention in the previous layer. Therefore I think the idea breaks down.

But let's give it some more thought - I could be missing something
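A toy example makes the argument concrete. The sketch below is not a real transformer: the embeddings and the weight matrix are made-up values, the same key projection is reused for both layers, and attention is replaced by a causal running mean as a stand-in for "information mixed in from other tokens".

```python
# Hypothetical 2-dimensional embeddings for a 3-token vocabulary.
EMB = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
# Hypothetical key projection, shared by both layers for brevity.
W_K = [[0.5, -1.0], [2.0, 0.25]]

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def causal_mix(hidden):
    """Stand-in for attention: add the mean of all states up to and including each position."""
    out = []
    for t in range(len(hidden)):
        prefix = hidden[: t + 1]
        mean = [sum(col) / len(prefix) for col in zip(*prefix)]
        out.append([h + m for h, m in zip(hidden[t], mean)])
    return out

def keys_per_layer(tokens, n_layer=2):
    """Return keys[layer][position] for a token sequence."""
    hidden = [EMB[t] for t in tokens]
    keys = []
    for _ in range(n_layer):
        keys.append([matvec(W_K, h) for h in hidden])
        hidden = causal_mix(hidden)
    return keys

# The same token (id 2) preceded by two different tokens.
keys_a = keys_per_layer([0, 2])
keys_b = keys_per_layer([1, 2])

print("layer 0 key for token 2:", keys_a[0][1], keys_b[0][1])
print("layer 1 key for token 2:", keys_a[1][1], keys_b[1][1])
```

In the first layer the key for token 2 is identical in both contexts, so it could be looked up from a per-token table. In the second layer the key already differs, because its input depends on what came before.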

@Olexorus

Olexorus commented Oct 3, 2023

> However, in each layer after that, the embeddings for the KV would have some extra information intermingled from the other tokens in the context due to the attention from the previous layer. And therefore I think the idea breaks down.

Oh, I didn't realize that (my understanding of how transformers work is extremely basic), thank you for clarifying. Though I probably should've guessed that it wouldn't work out as nicely as I imagined.

@slaren
Member

slaren commented Oct 3, 2023

You can test the best-case scenario for this by removing the mul mats with wk, wq and wv. I tested this, and with 7B models on CPU I got between 10% and 40% higher t/s, depending on the model. GQA models (mistral) benefit less. But I agree with @ggerganov that this wouldn't work for anything other than the first layer, and in that case the performance difference would very likely be negligible.

@cmp-nct
Contributor

cmp-nct commented Jan 25, 2024

I'm a bit confused about the reasoning behind this: what do we use a non-RoPE'd cache for?
Isn't RoPE additive? If we need to change the RoPE of the cache, can't we reprocess it to add or remove positional rotations?

So when a position-0 RoPE is needed temporarily, couldn't we just run a "rope graph" on a copy of the cache to "un-rope" it?

If that's the case (and I think it is), can't we just add a couple of nice API functions to re-RoPE the KV cache (either modifying it in place or writing into a copy/seq)? I'd guess that's useful for sliding window and similar tricks.
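The additivity in question can be checked numerically. The sketch below is a plain-Python illustration with hypothetical values, assuming the variant of RoPE that rotates consecutive pairs of dimensions. Rotating a cached key by a position delta is the same as computing it at the shifted position, and rotating by the negated position recovers the original key.

```python
import math

def rope(vec, pos, freq_base=10000.0):
    """Rotate consecutive pairs of vec by pos * theta_i; pos may be negative."""
    head_dim = len(vec)
    out = []
    for i in range(head_dim // 2):
        theta = pos * freq_base ** (-2.0 * i / head_dim)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.extend((x0 * c - x1 * s, x0 * s + x1 * c))
    return out

def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for x, y in zip(a, b))

k = [0.3, -1.2, 0.7, 2.0]   # hypothetical un-rotated key
cached = rope(k, 7)         # key as stored in a RoPE'd cache at position 7

shifted = rope(cached, -3)  # move the entry from position 7 to position 4
unroped = rope(cached, -7)  # remove the positional rotation entirely

print(close(shifted, rope(k, 4)))
print(close(unroped, k))
```

This holds up to floating-point rounding. With an f16 or quantized cache, each re-rotation would add some rounding error, so how well repeated shifts work in practice is a separate question from the identity shown here.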
