Glm4 mtp optimizations #4
base: glm4-moe-mtp
Conversation
Force-pushed from b229c6a to 3bfa5d3
Hi @F1LM1, I've implemented the logic we discussed. One issue I'm currently investigating is a potential bug with position handling in large prompts: the model sometimes ignores the stop token for "thinking" or repeats the thought process in the final reply.

In terms of results, I haven't observed a major boost in tokens/s yet, but the code is now much cleaner to maintain and sets a better foundation for applying optimizations like graph reuse and compute management.

I also wanted to touch base regarding the future of this MTP implementation. I shared my roadmap in the original PR, but since you are the owner, I'd like to know your plans/availability. Thanks for your time!
Hey @SamuelOliveirads, I've been following along with the PRs even though I've been quiet these past few weeks. I've been busier and feel I have little experience on the optimization front, so I haven't been able to contribute much, sadly. Outside of random bugs/lower-level optimizations like the one just found in ggml-org#15225 (comment), the two things that stand out to me:
As for high-level questions about this PR's future, I'm not sure what standard we would have to hit to get it pushed through to the main llama.cpp branch. Noticed you asked ggerganov the same in the original PR; it's a question I would love to know the answer to as well. But I'd guess that if we can get this to the point where it's providing a meaningful speedup over no speculative draft, it will be worth merging the original PR, at which point further optimizations can go in their own PRs. I'm hoping 7c4b2c1 gets overhead low enough that, combined with my first suggestion here and maybe some more minor optimizations, it wouldn't be a reach to hit something like 30-40% speedup over no MTP, which I think would be a reasonable checkpoint to push to get the original PR done with. Does that seem reasonable to you?
I've created this draft to share my findings on what to fix or improve to make MTP usable. Currently, MTP's output quality is good, but its performance is worse than not using it at all. Therefore, it's not enough to be on par with the baseline; we need to be faster.
My initial plan is to find areas for improvement. It's not necessary to implement everything at once, but some of these should be on our radar for the future. They are:
- Graph reuse
- `llama_context::decode` calls
- Multi-token drafts

There are likely more things to improve, but for now, I find these to be the most impactful. Below are my thoughts on each:
1) Graph reuse: The baseline implementation always reuses the graph. The process is simple: it stores the graph and, in the next call to `llama_context::process_ubatch`, checks whether the stored graph can be reused; if not, it is deleted and the new one is stored. This works well after the first token is generated, since subsequent graphs are identical. The main bottleneck isn't calling `llama_model::build_graph` constantly, but rather `ggml_backend_sched_alloc_graph`, which has to allocate resources and set up the computation on the backend.

The first fix was simple: just store one graph. In this case, the main model's token-generation graph, which is one of the most expensive, is always reused. On my machine, this gave an uplift of 13.8% for small prompts (a sketch of this single-graph approach follows at the end of this section).
Current state: Halted.
After that, I tried to store the graph for every operation, or at least the ones that don't involve the KV cache. By applying `llm_graph_context::cb` to certain layers, I could store and reuse the graph, and I was able to compile and test this using only the CPU backend. However, I was unable to get it working with the offload policy. In theory, the `cb` function should handle that, but something else seems to be preventing specifically the allocation and computation. Is it mixing the offload policies of the main model and the MTP? This needs a deeper investigation, and I lack the proper knowledge in this area, so I'm setting it aside for now.
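To make the single stored graph from 1) concrete, here is a minimal sketch of the caching shape, not the actual implementation: `single_graph_cache`, `shape_unchanged`, and `build_token_gen_graph` are hypothetical stand-ins for the reuse check in `llama_context::process_ubatch` and for `llama_model::build_graph`, while `ggml_backend_sched_reset`/`ggml_backend_sched_alloc_graph` are the real ggml calls, the latter being the expensive step noted above.

```cpp
#include "ggml-backend.h"

#include <functional>

// Hedged sketch: cache exactly one graph (the main model's token-generation graph)
// and skip the backend re-allocation whenever it can be reused.
struct single_graph_cache {
    ggml_cgraph * gf = nullptr; // last token-generation graph, kept across decode calls

    ggml_cgraph * get(ggml_backend_sched_t sched,
                      bool shape_unchanged, // e.g. same ubatch shape as the previous call
                      const std::function<ggml_cgraph * ()> & build_token_gen_graph) {
        if (gf != nullptr && shape_unchanged) {
            // reuse path: skip the rebuild and, more importantly, the backend
            // allocation that dominates the per-token overhead
            return gf;
        }

        gf = build_token_gen_graph();              // roughly llama_model::build_graph
        ggml_backend_sched_reset(sched);           // drop the previous allocation
        ggml_backend_sched_alloc_graph(sched, gf); // the expensive step we want to amortize
        return gf;
    }
};
```

Since generation graphs are identical from the second token onward, the reuse branch is taken on almost every step, which is where the 13.8% uplift comes from.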
2) `llama_context::decode` calls: MTP was successfully implemented inside `decode`, but it uses the old logic where each operation requires an expensive function call. Here is a comparison of how many calls we make in different scenarios:

LLM - Normal:
Draft Model:
MTP (Current Slow Implementation):
One way to make MTP more usable is to match the number of calls of a typical draft model. To do that, it's necessary to combine the KV cache update and the draft generation into a single call.
Current state: In progress.
I successfully merged the KV cache update with the draft generation. This required creating a custom batch and `sinfo`, and changing some logic regarding the embeddings and hidden states necessary for the MTP to work. The version in this branch works in terms of output, meaning it's not breaking quality. However, the draft acceptance rate has dropped to around 25%. I believe this happens because while the first step (KV update) works using the correct hidden state from the main model, the subsequent operation (draft) is using a new hidden state generated by the MTP itself during the update. I still need to confirm this theory and apply a fix to hopefully see the acceptance rate rise back to its previous level.

One last thing: this change will still require a separate warmup call on the first interaction, but this is less impactful than merging the update and draft steps. To merge the warmup step, it would be necessary to track the `sinfo` to know when the prompt processing has finished its last batch, and then insert a new slot for the draft token.
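To make 2) more concrete, here is one possible shape for the combined batch, sketched against the public `llama_batch` API rather than the internal ubatch/`sinfo` plumbing the branch actually uses. Everything MTP-specific is assumed: how `decode` routes these slots through the MTP layer is omitted, the main-model hidden states that the real implementation also carries are left out, and `build_mtp_update_and_draft_batch` is a hypothetical helper.

```cpp
#include "llama.h"

// Hypothetical helper: lays out one batch whose first n_accepted slots extend the
// MTP KV cache and whose last slot requests the draft logits, so a single
// llama_decode() covers both the KV update and the draft generation.
static llama_batch build_mtp_update_and_draft_batch(const llama_token * accepted, int n_accepted,
                                                    llama_token last_token, llama_pos pos0,
                                                    llama_seq_id seq_id) {
    // n_accepted slots for the KV update plus one slot for the draft token
    llama_batch batch = llama_batch_init(n_accepted + 1, /*embd =*/ 0, /*n_seq_max =*/ 1);

    for (int i = 0; i < n_accepted; ++i) {
        batch.token   [i]    = accepted[i];
        batch.pos     [i]    = pos0 + i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = seq_id;
        batch.logits  [i]    = false;   // KV-cache update only, no output needed
    }

    const int d = n_accepted;           // the draft slot
    batch.token   [d]    = last_token;  // last token sampled by the main model
    batch.pos     [d]    = pos0 + n_accepted;
    batch.n_seq_id[d]    = 1;
    batch.seq_id  [d][0] = seq_id;
    batch.logits  [d]    = true;        // only this slot produces the draft distribution

    batch.n_tokens = n_accepted + 1;
    return batch;                       // caller runs one llama_decode(), then llama_batch_free()
}
```

The point of the layout is that a single `decode` call then matches a typical draft model's call count instead of paying one call per operation.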
3) Multi-token drafts: We discussed this in another PR. The problem was that for each new draft token, the MTP's KV cache needed to be updated, which was painful to do before. Now that we are using the `decode` function, it's more feasible. If the unified update/draft implementation works, we could simply increase the batch and `sinfo` size to make the model draft more tokens (a rough sketch of the loop follows at the end).

These are some of my ideas. I'd appreciate any insights you might have on how to better handle some of these things, or even new ideas for improvements that I haven't spotted here.
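And one reading of the multi-token idea in 3), assuming the unified update/draft step from 2) exists as a cheap per-call operation: `mtp_update_and_draft` below is hypothetical and stands in for that step, and the "increase the batch and `sinfo` size" variant would fold this loop into a single call.

```cpp
#include "llama.h"

#include <vector>

// Hypothetical: one decode-based step that appends `prev` to the MTP KV cache at
// position `pos` and returns the next drafted token (the combined call from 2).
llama_token mtp_update_and_draft(llama_context * ctx, llama_token prev, llama_pos pos);

static std::vector<llama_token> mtp_draft_n(llama_context * ctx, llama_token last_token,
                                            llama_pos pos0, int n_draft) {
    std::vector<llama_token> draft;
    llama_token prev = last_token;       // last token accepted from the main model

    for (int k = 0; k < n_draft; ++k) {
        // each drafted token conditions on the previous one, so the MTP KV cache
        // has to grow between steps; with the decode-based path this is now cheap
        llama_token next = mtp_update_and_draft(ctx, prev, pos0 + k);
        draft.push_back(next);
        prev = next;
    }
    return draft;                        // handed back to the main model for verification
}
```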