fix: step35 MTP does not allocate KV cache for all layers#24125
Merged
Conversation
CISC
approved these changes
Jun 4, 2026
| filter = [n_main](int32_t il) { return (uint32_t)il >= n_main; }; | ||
| } | ||
|
|
||
| if (arch == LLM_ARCH_STEP35 && hparams.nextn_predict_layers > 0) { |
Member
There was a problem hiding this comment.
This should probably be reworked to not use an arch check.
Contributor
Author
There was a problem hiding this comment.
lol I was just about to fix. now park it in next pr
Member
There was a problem hiding this comment.
No worries, it was meant as a follow-up comment anyway.
am17an
approved these changes
Jun 4, 2026
gabe-l-hart
added a commit
to gabe-l-hart/llama.cpp
that referenced
this pull request
Jun 4, 2026
* origin/master: (57 commits) server : disable on-device spec checkpoints (ggml-org#24108) arg: fix double mtp downloads (ggml-org#24128) webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (ggml-org#23132) Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445) ui: Fixed packages (ggml-org#24119) ui: added single line reasoning preview (ggml-org#23601) return filter to save memory (ggml-org#24125) convert: Fix Gemma 4 Unified conversion (ggml-org#24118) ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (ggml-org#22209) server: avoid unnecessary checkpoint restore when new tokens are present (ggml-org#24110) agents: refactor, include more guidelines (ggml-org#24111) webui: fix tool selector toggle/counter, key tools by stable identity (ggml-org#24065) build : use umbrella Headers directory for XCFramework module map (ggml-org#23974) server : add header to tools/server/server-http.h (ggml-org#24089) cmake: skip cvector-generator and export-lora when CPU backend is disabled (ggml-org#24053) fix(mtmd): handle Gemma 4 audio projector embedding size (ggml-org#24091) readme : add status badges (ggml-org#24104) tests : refactor test-save-load-state to accept token input (ggml-org#24073) metal : reduce rset heartbeat from 500ms -> 5ms (ggml-org#24074) ggml-webgpu: FlashAttention refactor + standardize quantization support (ggml-org#23834) ...
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
While testing the Step3.5 mtp feature from #23274 (cc @pwilkin ), the memory watermark felt high. Turns out draft context allocates a KV cache for all layers, even though it only runs the NextN block(s).
STEP35 isn't a hybrid arch, so it misses the per-context KV layer filter that Qwen3.5 already has. This just adds the same filter for STEP35: the MTP context keeps only the NextN blocks (
il >= n_main), the main context keeps the trunk (il < n_main).Before:
After :
Requirements