cuda: reserve space for quantize kv-cache at startup by am17an · Pull Request #23907 · ggml-org/llama.cpp

am17an · 2026-05-30T10:25:34Z

Overview

ref #23646 (comment). A quantized kv-cache can lead to OOM even when using --fit, since --fit does not know about these backend allocations. There are some other quantization buffers in FA and MMQ which should also be removed, but this one seems to take the most space, as it scales with ctx size.
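To give a rough sense of why such a buffer scales with context size, here is a back-of-the-envelope sketch in Python. It assumes the scratch buffer holds an F16 copy of one layer's K and V over the full allocated context. That sizing rule, the function names, and the model dimensions are illustrative assumptions, not taken from this PR. The q8_0 figure uses ggml's block layout of 32 int8 values plus one F16 scale (34 bytes per 32 elements).

```python
# Rough estimate only: all names and model dimensions here are hypothetical.

Q8_0_BLOCK_ELEMS = 32   # elements per q8_0 block
Q8_0_BLOCK_BYTES = 34   # 32 int8 values + one 2-byte F16 scale
F16_BYTES = 2


def q8_0_bytes(n_elems: int) -> int:
    """Bytes needed to store n_elems values as q8_0 (n_elems must be a multiple of 32)."""
    assert n_elems % Q8_0_BLOCK_ELEMS == 0
    return n_elems // Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES


def f16_scratch_bytes(n_ctx: int, n_head_kv: int, head_dim: int) -> int:
    """Assumed scratch size: an F16 copy of K and V for one layer over the full context."""
    elems_per_tensor = n_ctx * n_head_kv * head_dim
    return 2 * elems_per_tensor * F16_BYTES  # factor 2: one copy for K, one for V


if __name__ == "__main__":
    for n_ctx in (8192, 32768, 131072):
        mib = f16_scratch_bytes(n_ctx, n_head_kv=8, head_dim=128) / 2**20
        print(f"n_ctx={n_ctx:>6}: ~{mib:.0f} MiB of F16 scratch")
```

Under these assumptions the scratch space grows linearly with n_ctx, which is why an estimator that ignores it becomes less reliable as the context gets larger.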

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, codex wrote this at my direction. I tested it on a few devices.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

ggerganov · 2026-05-31T16:05:03Z

Did you test -sm tensor setups? I think it should work, but might be worth double-checking.

JohannesGaessler · 2026-05-31T16:09:22Z

I was going to say that -sm tensor is implicitly being tested via test-llama-archs but there only the FP16/FP16 configuration is being tested. More generally: do we already have automated tests for different KV cache types? If not it may make sense to add some in test-llama-archs.

am17an · 2026-05-31T17:01:19Z

I have not checked yet. We can wait for #23792 to be merged and check again, because currently it will throw.

AbdulrahmanHashem · 2026-05-31T17:23:29Z

> Did you test -sm tensor setups? I think it should work, but might be worth double-checking.

I have, after also merging #23792, and as far as I can see there is no more memory creep on my system (5060 Ti + 2060 Super).

In all of the following cases there is a 125 MB allocation during the first prompt processing:

without kv quant
without kv quant + MTP
without kv quant + ngram-mod
with kv quant
with kv quant + MTP
with kv quant + ngram-mod

With ngram-mod (with or without kv quant) there is an additional, variable allocation of under 150 MB during tg.

On a separate issue with ngram-mod (with or without kv quant): I tried ngram-mod + MTP and it causes a crash with no error during tg after thinking is done. It lags the llama UI very hard and then tg just stops, with no logs before the crash.

coder543 · 2026-05-31T17:36:03Z

A couple of other things that I believe the fit algorithm is not reserving space for: cache-ram and ctx-checkpoints. On unified memory systems like the DGX Spark, this makes it hard to rely on the fit algorithm without specifying an arbitrarily large fit-target.
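The accounting gap described here can be illustrated with a toy budget check in Python. This is not the actual fit algorithm. The function, its parameters, and the numbers are hypothetical and only show how an allocation the estimator does not know about can turn a "fits" verdict into an OOM.

```python
# Toy model only: not llama.cpp's fit logic. All names and numbers are hypothetical.

def fits(free_mib: int, target_mib: int, allocations_mib: list[int]) -> bool:
    """Return True if the given allocations leave at least target_mib free."""
    return free_mib - sum(allocations_mib) >= target_mib


free_mib = 16_000                  # memory reported as free on the device
target_mib = 1_024                 # margin the user asks to keep free
tracked = [9_000, 4_000, 1_500]    # e.g. weights, KV cache, compute buffers
untracked = [800]                  # e.g. a buffer the estimator does not know about

print(fits(free_mib, target_mib, tracked))              # the estimator's view
print(fits(free_mib, target_mib, tracked + untracked))  # with the hidden allocation included
```

In this toy setup the only workaround without better accounting is to raise the target margin until it covers the untracked allocations, which is the "arbitrarily large fit-target" problem mentioned above.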

am17an · 2026-06-02T03:51:37Z

I think this should be ok to merge.

am17an · 2026-06-03T08:20:26Z

@ggml-org/ggml-cuda can I get another approval?

am17an · 2026-06-03T10:13:30Z

cc @ggml-org/maintainers, need another approval

TomTheWise · 2026-06-03T15:07:54Z

Thank you! This commit solves issue #23978.

thomasbergersen · 2026-06-05T09:55:44Z

Hello @am17an, this change has resulted in an additional increase in GPU memory usage.

b9488 -> GPU:0 13.8/16.0 GB GPU:1 13.8/16.0 GB
b9489 -> GPU:0 14.9/16.0 GB GPU:1 14.2/16.0 GB

JohannesGaessler · 2026-06-05T10:14:59Z

An increase in VRAM consumption is expected since llama.cpp is now pre-allocating the VRAM as part of the compute graphs. On master the initial VRAM consumption would be lower but eventually end up higher as the context fills up because the VRAM for converting the KV cache cannot be recycled for other operations. And previously the crash from OOMing would only happen after the program has already been running for some time which is undesirable if you're not babysitting it.

In any case, the VRAM consumption for KV cache conversion can be reduced by setting -ub to a lower value than 512.
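The trade-off described above can be sketched with a toy model in Python. It is not how llama.cpp allocates memory, and the function and all numbers are hypothetical. It assumes the conversion scratch grows linearly as the context fills, and compares reserving the worst case at startup against allocating on demand.

```python
# Toy model only: hypothetical names and numbers, not llama.cpp's allocator.
from typing import Optional


def first_oom(budget_mib: float, base_mib: float, scratch_mib_per_1k: float,
              n_ctx: int, reserve_at_startup: bool) -> Optional[int]:
    """Return the number of filled tokens at which the budget is exceeded.

    0 means the failure is detected at startup; None means it never happens.
    """
    if reserve_at_startup:
        # Reserve the worst case up front, so a failure shows up immediately.
        worst_case = base_mib + scratch_mib_per_1k * n_ctx / 1000
        return 0 if worst_case > budget_mib else None
    # Allocate on demand: usage grows in steps of 1000 filled tokens.
    for used in range(1000, n_ctx + 1, 1000):
        if base_mib + scratch_mib_per_1k * used / 1000 > budget_mib:
            return used
    return None


args = dict(budget_mib=16_000, base_mib=15_000, scratch_mib_per_1k=16, n_ctx=128_000)
print(first_oom(**args, reserve_at_startup=True))   # failure detected at startup
print(first_oom(**args, reserve_at_startup=False))  # failure only after part of the context has filled
```

In both cases the same configuration does not fit. The difference is only when the failure surfaces: at startup with up-front reservation, or mid-run with on-demand allocation.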

thomasbergersen · 2026-06-05T11:08:05Z

After the pre-allocation change, I ran multiple tests with long conversations, and the GPU memory usage on my computer only increased by approximately 300 MB. Does that mean it is now completely safe from OOM?

cuda: reserve space for quantize kv-cache at startup

a4273ef

am17an requested a review from a team as a code owner May 30, 2026 10:25

github-actions bot added the "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels May 30, 2026

JohannesGaessler reviewed May 30, 2026

Comment thread ggml/src/ggml-cuda/fattn-common.cuh Outdated

Comment thread ggml/src/ggml-cuda/fattn.cu

Comment thread ggml/src/ggml-cuda/fattn.cu Outdated

address review comments

9f584d3

JohannesGaessler approved these changes May 31, 2026

Comment thread ggml/src/ggml-cuda/common.cuh Outdated

remove forward decl

32e6898

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

JohannesGaessler reviewed May 31, 2026

Comment thread ggml/src/ggml-cuda/fattn-common.cuh Outdated

remove assert in ggml-cuda.cu

b324987

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

JohannesGaessler approved these changes May 31, 2026

Kononnable mentioned this pull request Jun 1, 2026

Misc. bug: Memory leak (ggml_cuda_pool_vmm) during prompt prefill with MTP and quantized KV cache #23635

Closed

SarcasticBaka29 mentioned this pull request Jun 1, 2026

Misc. bug: llama-server vram usage gradually increasing each run until OOM #23446

Closed

This was referenced Jun 2, 2026

Misc. bug: -sm tensor + MTP + ngram-mod = crash #23929

Closed

Eval bug: ROCm/CUDA tensor-split segfault in meta backend during ON_DEVICE prompt checkpoint save after #22616 #23719

Closed

ServeurpersoCom approved these changes Jun 3, 2026

am17an merged commit f8f0a47 into ggml-org:master Jun 3, 2026
31 checks passed

am17an deleted the fattn-static-kv-cache branch June 3, 2026 10:40

TomTheWise mentioned this pull request Jun 3, 2026

Eval bug: --swa-full incompatible with cache quantization (on gemma4 at least) VRAM usage expands heavily with use #23978

Closed

johnbelo mentioned this pull request Jun 4, 2026

Eval bug: #23907's persistent flash-attention dequant-KV reservation has no opt-out — regresses multi-context MTP + quantized KV on a single 16 GiB GPU that ran fine pre-#23907 #24135

Closed

bobvious mentioned this pull request Jun 5, 2026

[CUDA] PR #23907 flash-attn F16 KV dequant-scratch sized by allocated (not used) KV -> large-context q8_0 decode regression / VRAM thrashing #24166

Closed

JohannesGaessler mentioned this pull request Jun 5, 2026

Misc. bug: Is this a backend VRAM tracking bug? #24159

Open


8 participants
