memory : remove KV cache size padding #16812
Conversation
FYI in the CUDA backend, while non-padded inputs will produce correct results, the performance will be worse. One reason is that the vector and mma kernels don't have support for it (in principle also the wmma kernel, but I want to remove that one soon). Another reason is that, like with e.g. ALiBi, I've added support for a non-padded KV cache only to the template specialization that doesn't have GQA-specific optimizations (to keep compilation times low and to mitigate the few % performance penalty from the OOB checks).

What is the optimal padding for CUDA - 128 or 256?

As of right now the required padding is still 256. The vector and tile kernels I've already adapted to only need a padding of 128 (without OOB checks). The mma kernel, wmma kernel, and some utility kernels still need a padding of 256. My plan is to implement support for Volta, AMD WMMA instructions, and AMD MFMA instructions directly in the mma kernel; then I can just remove the wmma kernel without having to make any changes to it. The hardware I need for development is a V100 (which should arrive today), an MI100 (which arrived yesterday), and an RDNA3+ GPU (already in hand). The mma kernel and the utility kernels can then be made to work with a padding of 128 without issue. Caveat: the MI100 doesn't seem to work with the motherboard that I intended to use with it, so I may need to shuffle around my hardware a bit.
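To make the numbers concrete, here is a minimal standalone sketch of what this padding requirement means for a caller; the `round_up` helper and the sizes are illustrative only, not part of llama.cpp:

```cpp
// Hypothetical helper for illustration: round a requested KV cache size up to the
// padding that the CUDA FlashAttention kernels currently prefer (see discussion above).
#include <cstdint>
#include <cstdio>

static uint32_t round_up(uint32_t n, uint32_t pad) {
    return ((n + pad - 1) / pad) * pad;
}

int main() {
    const uint32_t requested = 1100;                          // arbitrary KV cache size
    printf("padded to 256: %u\n", round_up(requested, 256));  // -> 1280 (current requirement)
    printf("padded to 128: %u\n", round_up(requested, 128));  // -> 1152 (planned requirement)
    return 0;
}
```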
@JohannesGaessler The updated PR now continues to pad the `n_kv` used in the compute graphs, while the KV cache buffer itself is no longer padded. For example, we can now allocate a KV cache with a size that is not a multiple of 256. The main goal here is to simplify the logic around context size allocation per sequence and localize it during the context creation.
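As a hedged usage sketch (placeholder model path and sizes, assuming the current `llama.h` API names such as `llama_model_load_from_file` and `llama_init_from_model`), a caller could now request a context whose KV cache is not sized to a multiple of 256:

```cpp
// Sketch only: the model path and the context size are placeholders.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 1000; // no longer rounded up to a multiple of 256 by the KV cache

    llama_context * ctx = llama_init_from_model(model, cparams);

    // ... run decoding as usual ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```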
Force-pushed from 1473d59 to 7ebe7f7
Review comment on tools/server/server.cpp (outdated):
```diff
 const auto n_ctx_train = llama_model_n_ctx_train(model);

-if (slot.task->params.n_predict < 1 && slot.n_prompt_tokens() + slot.n_decoded >= n_ctx_train) {
+if (slot.task->params.n_predict < 1 && slot.n_prompt_tokens() + slot.n_decoded >= slot.n_ctx) {
```
The old logic of using the training context as a generation limit seems dubious - the context size of the slot should impose the limit.
After #16736, slot.n_ctx will be capped to the training context anyway, so this change should not make a big difference either way.
In this case, do you think this code branch can be removed altogether? The n_ctx cap is already imposed by the slot.n_past >= slot.n_ctx condition above (L2933).
Yes, good point.
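For reference, a minimal standalone sketch of the conclusion above, using a toy struct rather than the actual server.cpp types: once `slot.n_ctx` is capped to the training context, the generic context-full guard alone covers what the removed branch used to check:

```cpp
#include <cstdint>
#include <cstdio>

// toy stand-in for the server slot state (not the real server.cpp struct)
struct slot_state {
    int32_t n_ctx;  // per-slot context size (capped to the training context after #16736)
    int32_t n_past; // tokens already stored in the KV cache for this slot
};

// the remaining guard (cf. the slot.n_past >= slot.n_ctx check mentioned above)
static bool context_exhausted(const slot_state & slot) {
    return slot.n_past >= slot.n_ctx;
}

int main() {
    slot_state slot = { /*n_ctx =*/ 4096, /*n_past =*/ 4096 };
    printf("stop generation: %s\n", context_exhausted(slot) ? "yes" : "no"); // -> yes
    return 0;
}
```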
```cpp
        }

        // if context shifting is disabled, make sure that we don't run out of context
        if (!params_base.ctx_shift && slot.n_past + 1 >= slot.n_ctx) {
```
Just note that slot.n_past won't reflect the correct number of tokens in the KV cache in case the model uses M-RoPE. We should fix this later, but I'm not entirely sure how. I'm thinking of these 2 solutions:
- Rely on server_tokens::size()
- Add an API like llama_memory_seq_is_full which returns true if the memory is full (sketched below)
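A rough sketch of the second option follows; `llama_memory_seq_is_full` does not exist in `llama.h` today, and the toy memory type below only models the idea:

```cpp
#include <cstdint>
#include <cstdio>

typedef int32_t llama_seq_id;

// toy stand-in for the memory module; the real one would track cells per sequence
struct toy_memory {
    uint32_t size; // total number of KV cells
    uint32_t used; // cells currently occupied
};

// proposed shape of the API, applied to the toy type: true when no more cells can be used
static bool llama_memory_seq_is_full(const toy_memory & mem, llama_seq_id /*seq_id*/) {
    return mem.used >= mem.size;
}

int main() {
    toy_memory mem = { 4096, 4096 };
    printf("full: %s\n", llama_memory_seq_is_full(mem, 0) ? "yes" : "no"); // -> yes
    return 0;
}
```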
Yes, I'll open a PR to fix this.
ref #16736 (comment)
Simplify the logic during memory module creation. Previously, for KV caches, we used to pad the buffer size up to a multiple of 256 cells because the flash attention implementations did not support arbitrary K, V sizes. After the improvements in #16148 and related changes, we no longer need this explicit padding.
For now, keeping support for `llama_kv_cache` size padding via the constructor's `n_pad` argument, although it is not currently used anymore.

Note that we continue to pad `n_kv` - this is the tensor shape for the K and V tensors for each graph:

llama.cpp/src/llama-kv-cache.cpp, lines 957 to 972 in 1473d59
We need to do this in order to reuse most of the compute graphs during the text generation phase. Additionally, it helps performance on some of the backends.
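For illustration only (this is not the code at the referenced lines), a small sketch of how padding `n_kv` keeps graph shapes stable during generation; the helper names and numbers are made up:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// round n up to a multiple of pad
static uint32_t pad_up(uint32_t n, uint32_t pad) {
    return ((n + pad - 1) / pad) * pad;
}

// n_cells: total KV cache size, n_used: number of cells in use, n_pad: padding granularity
static uint32_t get_n_kv(uint32_t n_cells, uint32_t n_used, uint32_t n_pad) {
    return std::min(n_cells, std::max(n_pad, pad_up(n_used, n_pad)));
}

int main() {
    // during generation the used-cell count grows by one per token, but the padded n_kv
    // changes only every n_pad tokens, so the graphs in between keep identical K/V shapes
    for (uint32_t used = 254; used <= 258; ++used) {
        printf("n_used = %u -> n_kv = %u\n", used, get_n_kv(4096, used, 256));
    }
    return 0;
}
```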
Also, `llama_model::create_memory()` no longer mutates `cparams`.

Next, I will rebase #16736 on top of this change and finish it.