cuda: reserve space for quantize kv-cache at startup #23907
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
|
Did you test |
|
I was going to say that |
|
I did not check yet, we can wait for #23792 to get merged and check again because currently it will throw. |
I have, after also merging #23792, and as far as I can see there is no more memory creep on my system (5060 Ti + 2060 Super). In all of the following cases there is a 125 MB allocation during the first prompt processing: without kv quant; with or without kv quant + ngram-mod; and, on a different issue with ngram-mod, with or without kv quant. |
|
A couple of other things that I believe the fit algorithm is not reserving space for: |
|
I think this should be ok to merge. |
|
@ggml-org/ggml-cuda can I get another approval? |
|
cc @ggml-org/maintainers, need another approval |
|
Thank you! This commit solves the issue #23978 |
|
Hello @am17an, this change has resulted in an additional increase in GPU memory usage. b9488 -> GPU:0 13.8/16.0 GB GPU:1 13.8/16.0 GB |
|
An increase in VRAM consumption is expected since llama.cpp is now pre-allocating the VRAM as part of the compute graphs. On master the initial VRAM consumption would be lower but eventually end up higher as the context fills up because the VRAM for converting the KV cache cannot be recycled for other operations. And previously the crash from OOMing would only happen after the program has already been running for some time which is undesirable if you're not babysitting it. In any case, the VRAM consumption for KV cache conversion can be reduced by setting |
|
After the pre-allocation adjustment was made, I ran multiple tests with long conversations, and the GPU memory usage on my computer only increased by approximately 300 MB. Does that mean it is now absolutely safe from OOM?
Overview
ref #23646 (comment). Quantized kv-cache can lead to OOM even when using `--fit`, since it does not know about these backend allocations. There are some other quantization buffers in FA and MMQ which should also be removed, but this one seems to take the most space as it scales with ctx size.
Additional information
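To illustrate why a conversion buffer that scales with ctx size matters for memory fitting, here is a rough back-of-the-envelope sketch. It assumes the scratch space holds a temporary F16 copy of the K and V tensors of a single layer; that assumption, the struct `kv_layer_shape`, and both function names are hypothetical and are not taken from the llama.cpp or ggml sources.

```cpp
#include <cstdint>

// Hypothetical description of one layer's KV cache. These names are
// illustrative only and are not llama.cpp or ggml identifiers.
struct kv_layer_shape {
    std::uint64_t n_ctx;     // context length in tokens
    std::uint64_t n_head_kv; // number of KV heads
    std::uint64_t head_dim;  // elements per head
};

constexpr std::uint64_t BYTES_PER_F16 = 2;

// Bytes for a temporary F16 copy of one tensor (K or V) of a single layer.
constexpr std::uint64_t f16_copy_bytes(const kv_layer_shape & s) {
    return s.n_ctx * s.n_head_kv * s.head_dim * BYTES_PER_F16;
}

// Assumption: the K and V copies are both live at the same time, and only
// one layer is converted at a time.
constexpr std::uint64_t conversion_scratch_bytes(const kv_layer_shape & s) {
    return 2 * f16_copy_bytes(s);
}

// 32768 tokens, 8 KV heads, head size 128: 64 MiB per tensor, 128 MiB total.
static_assert(conversion_scratch_bytes(kv_layer_shape{32768, 8, 128}) == 128ull * 1024 * 1024,
              "scratch size for the example shape");
```

Under these assumptions the scratch requirement is linear in `n_ctx`, so doubling the context doubles the space that has to be reserved up front.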
Requirements