Reduce size of compute buffers #237
Conversation
Here is some example usage. My GPU is an RTX-4080. We get this:
Let's now run it with the option added in this PR. We get this:
Limiting the size of the compute buffer via the new option works. Can we not achieve the same without it, simply by lowering the u-batch size? Clearly no. Let's look at the compute buffer sizes for this context of 16k tokens and different u-batch sizes:
Why is the concatenation not done in a loop?
If all the op does is to concatenate the second tensor to the first, why would we want to have a loop?
This has been an incredible PR, hugely beneficial in multiple ways. The compute buffer is drastically lower, I can now run at the maximum context with no issues, and it has also let me increase other settings. For reference, on 15x3090 (360 GB total) with Q2_K R1 (~230 GB), I'm able to run the full context. Huge improvement across the board! From a speed perspective, I've seen over 200 t/s. I still want to experiment with lower values for the new option. I cannot express how much this starts to make the model usable. I wonder what could either: a) reduce the compute buffer even further, because then I could run a much higher quant, or b) speed up PP or TG further. Even during inference the GPUs are really only at 5-10% utilization! I am picking up another 3090 tomorrow, so I'll have 16 in total, and provided I can get the model loaded, I'll have more VRAM to play with and potentially a higher quant. Excellent work on this.
Also, I gave the 84853b9 commit a test run and it seems to produce different outcomes each time on regeneration with a fixed seed. Not sure if it's something I'm doing wrong on my end.
I wouldn't know why that could affect your results. The change in 84853b9 only runs on the CPU, so it never gets executed in your case.
Ah weird. Maybe I'm going insane. It was late last night! Thank you again 👌🏽
I have been focusing on reducing the KV cache size, but as per the lengthy exchange in #235 the actual issue for using a very long context is the size of the compute buffers. E.g., if one attempted to run DeepSeekV3/R1 with the claimed 163k tokens maximum context length, one would need over 40 GB of CUDA compute buffer per GPU. But even if running on the CPU, 40 GB is nothing to sneeze at.
This PR solves the problem, for both GPU and CPU inference.
Where is the issue?

The `K*Q` tensor, computed in the attention portion of the network, is of size `n_ctx x n_ubatch x n_head x sizeof(float)`. One also needs `softmax(K*Q)` (of the same size), but the back-end is fortunately clever enough to reuse the same buffer. DeepSeekV3/R1 has `n_head = 128`, so with the default u-batch size of 512 tokens this works out to 256 kB per token in the KV cache. During model load, a test compute graph is run where the KV cache has the maximum context length (specified by the model or set on the command line) to determine the size of the compute buffer. For very long context lengths, the determined size is dominated by the size of the `K*Q` tensor. For 163k tokens it is 163,000 x 256 kB = 42.7 GB. One can of course reduce the compute buffer size by using a smaller u-batch, but this comes at a heavy performance hit for prompt processing speed. E.g., to reduce the 42.7 GB compute buffer to, say, 5 GB, so that enough VRAM is left for the KV cache and at least the attention tensors of DeepSeekV3/R1, one needs to lower the u-batch size to 64, and this comes at the price of 3X slower prefill.
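To make the scaling concrete, here is a small stand-alone sketch (my own illustration, not code from this PR) that reproduces the above arithmetic for DeepSeekV3/R1 and shows why shrinking the u-batch is such a blunt instrument:

```cpp
#include <cstdio>

int main() {
    // DeepSeekV3/R1 attention parameters from the discussion above.
    const long long n_head = 128;
    const long long n_ctx  = 163840;   // the "163k" maximum context, assumed to be 160 * 1024 here
    const double GiB = 1024.0 * 1024.0 * 1024.0;

    // K*Q needs n_ctx * n_ubatch * n_head floats; softmax(K*Q) reuses the same buffer.
    for (long long n_ubatch : {512LL, 256LL, 128LL, 64LL}) {
        const double bytes = 1.0 * n_ctx * n_ubatch * n_head * sizeof(float);
        printf("u-batch %3lld -> K*Q ~ %4.1f GiB\n", n_ubatch, bytes / GiB);
    }
    // Prints 40.0, 20.0, 10.0 and 5.0 GiB: only u-batch = 64 gets the buffer
    // near 5 GiB, and that costs roughly 3x in prefill speed.
    return 0;
}
```

The 40 GiB at `u_batch = 512` is the same quantity quoted as 42.7 GB above; the small difference comes from rounding 163,840 tokens to 163,000 and from decimal vs. binary units.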
How do we solve it?

We add a command line parameter that specifies the maximum `K*Q` size we want to tolerate. Let's call this $M_{\rm max}$. During inference, before performing the `K*Q` multiplication, the size $M$ required by `K*Q` is computed. If $M \le M_{\rm max}$, the computation proceeds as usual. If $M > M_{\rm max}$, the computation of `V*softmax(K*Q)` is performed in $n = (M + M_{\rm max} - 1) / M_{\rm max}$ steps ($M$ and $M_{\rm max}$ are integers rounded to the nearest MiB). If the number of heads is $K$, each step computes $K/n$ heads. In each step the `K*Q` tensor is $n$ times smaller. After multiplication with `V`, the resulting tensor contains only `n_embd * n_token` entries, which is negligible compared to the size of `K*Q` for such a long context. The final `V*softmax(K*Q)` result is assembled by concatenating the results of the $n$ steps.
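Here is a small, self-contained sketch of that bookkeeping (my own illustration, not the code added by this PR; the even split of heads across steps is an assumption):

```cpp
#include <cstdio>

int main() {
    // Example parameters: DeepSeekV3/R1 heads, full context, default u-batch,
    // and M_max = 2 GiB (the "amb = 2048" value used in the examples below).
    const long long n_head   = 128;
    const long long n_ctx    = 163840;   // the text rounds this to 163,000
    const long long n_ubatch = 512;
    const long long M_max    = 2048LL * 1024 * 1024;

    const long long M = n_ctx * n_ubatch * n_head * (long long)sizeof(float);
    const long long n = M <= M_max ? 1 : (M + M_max - 1) / M_max;   // ceil(M / M_max)
    printf("K*Q needs %lld MiB -> %lld step(s)\n", M >> 20, n);

    // Spread the heads as evenly as possible over the n steps. Each step computes
    // its slice of K*Q, softmax(K*Q) and V*softmax(K*Q); the per-step results are
    // concatenated into the final tensor.
    for (long long step = 0, head = 0; step < n; ++step) {
        const long long heads = (n_head - head + (n - step) - 1) / (n - step);
        printf("  step %2lld: %lld heads (%lld..%lld)\n", step, heads, head, head + heads - 1);
        head += heads;
    }
    return 0;
}
```

With these numbers it prints 20 steps of 6 or 7 heads each, which is what the second example below arrives at.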
Let's look at some examples for DeepSeekV3/R1 using the full 163k context and `amb = 2048` (so, 2 GiB):

* For token generation (`u_batch = 1`), the `K*Q` size is 163,000 x 128 x 1 x 4 = 79 MiB, so the computation will proceed as usual.
* For prompt processing with `u_batch = 512`, `K*Q` will be 163,000 x 128 x 512 x 4 = 40750 MiB. Hence, the computation will be done in 20 steps, each step processing 6 or 7 heads. The back-end will record 2 GiB as the size of the `K*Q` tensor, so the compute buffer will be only slightly larger than that (to accommodate other intermediate results).
* While the prompt is being processed, the 2 GiB limit for `K*Q` will not be exceeded before there are 8k tokens in the KV cache. After that and up to 16k tokens the `V*softmax(K*Q)` calculation will be done in 2 steps, from 16k to 24k in 3 steps, etc.

For such large `K` and `Q` tensors, the cost of the matrix multiplication is many times higher than the cost of launching 2, 3, etc. matrix multiplications and soft-max computations. Hence, there will be negligible impact on performance.
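As a quick sanity check on the 8k-token threshold in the third example (my own arithmetic): with `u_batch = 512` and `n_head = 128`, each token in the KV cache contributes $512 \times 128 \times 4\,\mathrm{B} = 256$ KiB to `K*Q`, so the 2 GiB limit is reached at

$$\frac{2\,\mathrm{GiB}}{256\,\mathrm{KiB\ per\ token}} = \frac{2^{31}}{2^{18}}\ \mathrm{tokens} = 8192\ \mathrm{tokens} \approx 8\mathrm{k}.$$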
As a side note: I wasted at least 2 hours trying to figure out why my implementation wasn't working. In the end it turned out to be a bug in the CUDA implementation of `GGML_OP_CONCAT`, used to concatenate the step results. This PR fixes the issue for the use case required by the PR (contiguous tensors, second tensor simply appended at the end of the first).

As another side note: I wasted at least another two hours fighting with the `ggml` back-end. I was trying to avoid the $2n$ copies needed to concatenate the intermediate results by first allocating the final result and then simply storing the step results at the appropriate offset. The back-end did not like this idea at all, and was crashing on a null pointer access.
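To illustrate the contiguous use case mentioned in the first side note (and in the loop question earlier in the conversation): concatenating two contiguous tensors along the final dimension reduces to appending the second tensor's bytes after the first. Below is a minimal CPU-side sketch of that idea; it is my own illustration, not the CUDA code touched by this PR.

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

// Concatenate two contiguous tensors (flattened to float arrays here): the
// result is simply a's bytes followed by b's bytes. On the GPU the same idea
// is two device-to-device copies -- no per-element loop over both inputs.
static std::vector<float> concat_contiguous(const std::vector<float>& a,
                                            const std::vector<float>& b) {
    std::vector<float> out(a.size() + b.size());
    std::memcpy(out.data(),            a.data(), a.size() * sizeof(float));
    std::memcpy(out.data() + a.size(), b.data(), b.size() * sizeof(float));
    return out;
}

int main() {
    // Two per-step results (e.g. the heads computed in two consecutive steps).
    const std::vector<float> step1 = {1, 2, 3, 4, 5, 6};
    const std::vector<float> step2 = {7, 8, 9, 10, 11, 12};
    for (float x : concat_contiguous(step1, step2)) printf("%g ", x);
    printf("\n");
    return 0;
}
```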