
fix: step35 MTP does not allocate KV cache for all layers#24125

Merged
pwilkin merged 1 commit into
ggml-org:masterfrom
forforever73:step35-mtp-kv-filter
Jun 4, 2026

Conversation

@forforever73
Contributor

@forforever73 forforever73 commented Jun 4, 2026

Overview

While testing the Step3.5 MTP feature from #23274 (cc @pwilkin), memory usage looked higher than expected. It turns out the draft context allocates a KV cache for all layers, even though it only runs the NextN block(s).

STEP35 isn't a hybrid arch, so it misses the per-context KV layer filter that Qwen3.5 already has. This just adds the same filter for STEP35: the MTP context keeps only the NextN blocks (il >= n_main), the main context keeps the trunk (il < n_main).
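
A minimal standalone sketch of the filter described above, not the code from this PR. The names `layer_filter`, `make_kv_layer_filter`, and `count_kv_layers` are hypothetical, and computing `n_main` as the total layer count minus the number of NextN blocks is an assumption.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a per-layer predicate: returns true if layer `il`
// should get a KV cache in the given context.
using layer_filter = std::function<bool(int32_t)>;

// n_layer    : total number of layers, trunk plus NextN blocks (assumed layout)
// n_nextn    : number of NextN blocks placed after the trunk
// is_mtp_ctx : true for the draft (MTP) context, false for the main context
layer_filter make_kv_layer_filter(uint32_t n_layer, uint32_t n_nextn, bool is_mtp_ctx) {
    // Without NextN blocks (or with an inconsistent count) keep every layer.
    if (n_nextn == 0 || n_nextn >= n_layer) {
        return [](int32_t) { return true; };
    }

    // Assumption: the trunk occupies layers [0, n_main), NextN blocks follow.
    const uint32_t n_main = n_layer - n_nextn;

    if (is_mtp_ctx) {
        // Draft context: keep only the NextN blocks.
        return [n_main](int32_t il) { return il >= 0 && (uint32_t) il >= n_main; };
    }
    // Main context: keep only the trunk.
    return [n_main](int32_t il) { return il >= 0 && (uint32_t) il < n_main; };
}

// Counts how many of the n_layer layers pass the filter.
uint32_t count_kv_layers(const layer_filter & filter, uint32_t n_layer) {
    uint32_t n = 0;
    for (uint32_t il = 0; il < n_layer; ++il) {
        if (filter((int32_t) il)) {
            ++n;
        }
    }
    return n;
}
```

With 48 layers in total and 3 NextN blocks (the counts the logs below suggest: 12 + 36 layers before, 3 after), the draft filter keeps 3 layers and the main filter keeps 45.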

Before:

5051:0.03.567.532 I srv    load_model: loading draft model '/xxx/Step3.7-flash-mtp-Q8_0.gguf'
llama_kv_cache: size = 1644.00 MiB (35072 cells, 12 layers, 1/1 seqs), K (f16): 822.00 MiB, V (f16): 822.00 MiB
llama_kv_cache: size =  216.00 MiB ( 1536 cells, 36 layers, 1/1 seqs), K (f16): 108.00 MiB, V (f16): 108.00 MiB

After:

5050:0.03.544.909 I srv    load_model: loading draft model '/xxx/Step3.7-flash-mtp-Q8_0.gguf'
llama_kv_cache: size =    0.00 MiB (35072 cells,  0 layers, 1/1 seqs), K (f16):   0.00 MiB, V (f16):   0.00 MiB
llama_kv_cache: size =   18.00 MiB ( 1536 cells,  3 layers, 1/1 seqs), K (f16):   9.00 MiB, V (f16):   9.00 MiB
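
The sizes in these logs scale linearly with the layer count, which a short calculation can check. The sketch below is illustrative only: `kv_cache_mib` is a hypothetical helper, and the figure of 4096 bytes per cell per layer (K and V combined, f16) is derived from the log lines above, not read from the model's hyperparameters.

```cpp
#include <cstdint>

// Total KV cache size in MiB for a cache with n_cells cells and n_layers layers,
// given the combined K+V footprint of one cell in one layer.
double kv_cache_mib(uint64_t n_cells, uint64_t n_layers, uint64_t bytes_per_cell_per_layer) {
    const uint64_t bytes = n_cells * n_layers * bytes_per_cell_per_layer;
    return (double) bytes / (1024.0 * 1024.0);
}
```

Under that assumption, `kv_cache_mib(1536, 36, 4096)` gives 216 MiB and `kv_cache_mib(1536, 3, 4096)` gives 18 MiB, matching the second log line before and after the change; `kv_cache_mib(35072, 12, 4096)` gives the 1644 MiB of the first cache, which drops to zero once no layers are kept.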

Requirements

@forforever73 forforever73 requested a review from CISC as a code owner June 4, 2026 13:27
Member

@CISC CISC left a comment

Nice catch.

Comment thread src/llama-model.cpp
filter = [n_main](int32_t il) { return (uint32_t)il >= n_main; };
}

if (arch == LLM_ARCH_STEP35 && hparams.nextn_predict_layers > 0) {
Member

This should probably be reworked to not use an arch check.

Member

I see @pwilkin volunteered. :)

Contributor Author

lol I was just about to fix. now park it in next pr

Member

No worries, it was meant as a follow-up comment anyway.

@CISC CISC requested a review from am17an June 4, 2026 13:38
@pwilkin pwilkin merged commit 0dbfa66 into ggml-org:master Jun 4, 2026
23 of 25 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 4, 2026
* origin/master: (57 commits)
server : disable on-device spec checkpoints (ggml-org#24108)
arg: fix double mtp downloads (ggml-org#24128)
webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (ggml-org#23132)
Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445)
ui: Fixed packages (ggml-org#24119)
ui: added single line reasoning preview (ggml-org#23601)
return filter to save memory (ggml-org#24125)
convert: Fix Gemma 4 Unified conversion (ggml-org#24118)
ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (ggml-org#22209)
server: avoid unnecessary checkpoint restore when new tokens are present (ggml-org#24110)
agents: refactor, include more guidelines (ggml-org#24111)
webui: fix tool selector toggle/counter, key tools by stable identity (ggml-org#24065)
build : use umbrella Headers directory for XCFramework module map (ggml-org#23974)
server : add header to tools/server/server-http.h (ggml-org#24089)
cmake: skip cvector-generator and export-lora when CPU backend is disabled (ggml-org#24053)
fix(mtmd): handle Gemma 4 audio projector embedding size (ggml-org#24091)
readme : add status badges (ggml-org#24104)
tests : refactor test-save-load-state to accept token input (ggml-org#24073)
metal : reduce rset heartbeat from 500ms -> 5ms (ggml-org#24074)
ggml-webgpu: FlashAttention refactor + standardize quantization support (ggml-org#23834)
...