fix: step35 MTP does not allocate KV cache for all layers by forforever73 · Pull Request #24125 · ggml-org/llama.cpp

forforever73 · 2026-06-04T13:27:00Z

Overview

While testing the Step3.5 MTP feature from #23274 (cc @pwilkin), I noticed the memory watermark was higher than expected. It turns out the draft context allocates a KV cache for all layers, even though it only runs the NextN block(s).

STEP35 isn't a hybrid arch, so it misses the per-context KV layer filter that Qwen3.5 already has. This just adds the same filter for STEP35: the MTP context keeps only the NextN blocks (il >= n_main), the main context keeps the trunk (il < n_main).
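The split described above can be sketched as a pair of per-layer predicates. This is a minimal illustration, not the code from the PR: `layer_filter`, `make_kv_layer_filter`, and `count_kv_layers` are hypothetical names, and it assumes `n_main` is the total layer count minus the number of NextN layers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a per-layer KV filter (not the llama.cpp type).
using layer_filter = std::function<bool(int32_t il)>;

// Build the filter for one context.
//   n_layer    : total number of blocks (trunk + NextN), assumed
//   n_nextn    : number of NextN blocks at the end of the stack, assumed
//   is_mtp_ctx : true for the draft (MTP) context, false for the main context
inline layer_filter make_kv_layer_filter(uint32_t n_layer, uint32_t n_nextn, bool is_mtp_ctx) {
    assert(n_nextn <= n_layer);
    const uint32_t n_main = n_layer - n_nextn; // index of the first NextN block

    if (is_mtp_ctx) {
        // MTP context: keep only the NextN blocks.
        return [n_main](int32_t il) { return (uint32_t) il >= n_main; };
    }
    // Main context: keep only the trunk.
    return [n_main](int32_t il) { return (uint32_t) il < n_main; };
}

// Count how many layers would get a KV cache under a given filter.
inline uint32_t count_kv_layers(uint32_t n_layer, const layer_filter & keep) {
    uint32_t n = 0;
    for (uint32_t il = 0; il < n_layer; ++il) {
        if (keep((int32_t) il)) {
            ++n;
        }
    }
    return n;
}
```

With 48 layers in total and 3 NextN layers (the counts implied by the logs in this description), `count_kv_layers(48, make_kv_layer_filter(48, 3, true))` yields 3 and the main-context filter yields 45.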

Before:

5051:0.03.567.532 I srv    load_model: loading draft model '/xxx/Step3.7-flash-mtp-Q8_0.gguf'
llama_kv_cache: size = 1644.00 MiB (35072 cells, 12 layers, 1/1 seqs), K (f16): 822.00 MiB, V (f16): 822.00 MiB
llama_kv_cache: size =  216.00 MiB ( 1536 cells, 36 layers, 1/1 seqs), K (f16): 108.00 MiB, V (f16): 108.00 MiB

After:

5050:0.03.544.909 I srv    load_model: loading draft model '/xxx/Step3.7-flash-mtp-Q8_0.gguf'
llama_kv_cache: size =    0.00 MiB (35072 cells,  0 layers, 1/1 seqs), K (f16):   0.00 MiB, V (f16):   0.00 MiB
llama_kv_cache: size =   18.00 MiB ( 1536 cells,  3 layers, 1/1 seqs), K (f16):   9.00 MiB, V (f16):   9.00 MiB
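The logged sizes can be checked with a little arithmetic: each K (or V) buffer is cells × layers × per-cell width × 2 bytes for f16. The sketch below uses hypothetical helper names, and the per-cell width of 1024 elements is not taken from any model documentation; it is simply back-derived from the log (108 MiB / (1536 cells × 36 layers × 2 bytes) = 1024).

```cpp
#include <cassert>
#include <cstdint>

// Bytes per f16 element.
constexpr uint64_t F16_BYTES = 2;
constexpr uint64_t MIB       = 1024ull * 1024ull;

// Size in bytes of one K (or V) buffer.
//   n_cells : KV cells in the cache (from the log)
//   n_layers: layers that received a KV cache (from the log)
//   n_embd  : f16 elements per cell per layer (derived from the log, assumed)
constexpr uint64_t kv_buffer_bytes(uint64_t n_cells, uint64_t n_layers, uint64_t n_embd) {
    return n_cells * n_layers * n_embd * F16_BYTES;
}

// Before: 35072 cells x 12 layers and 1536 cells x 36 layers.
static_assert(kv_buffer_bytes(35072, 12, 1024) == 822 * MIB, "K of first cache, before");
static_assert(kv_buffer_bytes( 1536, 36, 1024) == 108 * MIB, "K of second cache, before");

// After: 0 layers and 3 layers.
static_assert(kv_buffer_bytes(35072,  0, 1024) ==   0 * MIB, "K of first cache, after");
static_assert(kv_buffer_bytes( 1536,  3, 1024) ==   9 * MIB, "K of second cache, after");
```

Doubling each figure for K plus V gives the logged totals: 1644 + 216 = 1860 MiB before and 0 + 18 = 18 MiB after, a saving of 1842 MiB for the draft context.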

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: No

CISC

Nice catch.

CISC · 2026-06-04T13:38:26Z

                        filter = [n_main](int32_t il) { return (uint32_t)il >= n_main; };
                    }

+                    if (arch == LLM_ARCH_STEP35 && hparams.nextn_predict_layers > 0) {


This should probably be reworked to not use an arch check.

I see @pwilkin volunteered. :)

lol, I was just about to fix it. I'll park it for the next PR now.

No worries, it was meant as a follow-up comment anyway.

* origin/master: (57 commits)
  server : disable on-device spec checkpoints (ggml-org#24108)
  arg: fix double mtp downloads (ggml-org#24128)
  webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (ggml-org#23132)
  Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445)
  ui: Fixed packages (ggml-org#24119)
  ui: added single line reasoning preview (ggml-org#23601)
  return filter to save memory (ggml-org#24125)
  convert: Fix Gemma 4 Unified conversion (ggml-org#24118)
  ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (ggml-org#22209)
  server: avoid unnecessary checkpoint restore when new tokens are present (ggml-org#24110)
  agents: refactor, include more guidelines (ggml-org#24111)
  webui: fix tool selector toggle/counter, key tools by stable identity (ggml-org#24065)
  build : use umbrella Headers directory for XCFramework module map (ggml-org#23974)
  server : add header to tools/server/server-http.h (ggml-org#24089)
  cmake: skip cvector-generator and export-lora when CPU backend is disabled (ggml-org#24053)
  fix(mtmd): handle Gemma 4 audio projector embedding size (ggml-org#24091)
  readme : add status badges (ggml-org#24104)
  tests : refactor test-save-load-state to accept token input (ggml-org#24073)
  metal : reduce rset heartbeat from 500ms -> 5ms (ggml-org#24074)
  ggml-webgpu: FlashAttention refactor + standardize quantization support (ggml-org#23834)
  ...

return filter to save memory

d377215

forforever73 requested a review from CISC as a code owner June 4, 2026 13:27

CISC approved these changes Jun 4, 2026


CISC requested a review from am17an June 4, 2026 13:38

am17an approved these changes Jun 4, 2026


pwilkin merged commit 0dbfa66 into ggml-org:master Jun 4, 2026
23 of 25 checks passed

pwilkin merged 1 commit into ggml-org:master from forforever73:step35-mtp-kv-filter
