qwen35: use post-norm hidden state for MTP by am17an · Pull Request #24025 · ggml-org/llama.cpp

am17an · 2026-06-02T13:49:45Z

Overview

It looks like Qwen3.6 MTP actually uses the post-norm hidden state rather than the pre-norm hidden state. All credit to @jtjstock for pointing this out. Unfortunately, DeepSeek/GLM do use the pre-norm hidden state, so perhaps adding an API such as get_embedding_post_norm makes sense? @ggerganov

Additional information

Master

  code_python        pred= 192 draft= 166 acc= 134 rate=0.807 tok/s=18.4
  code_cpp           pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=16.7
  explain_concept    pred= 192 draft= 190 acc= 127 rate=0.668 tok/s=15.6
  summarize          pred= 192 draft= 183 acc= 129 rate=0.705 tok/s=16.3
  qa_factual         pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=15.6
  translation        pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=15.4
  creative_short     pred= 192 draft= 195 acc= 126 rate=0.646 tok/s=15.7
  stepwise_math      pred= 192 draft= 175 acc= 132 rate=0.754 tok/s=18.1
  long_code_review   pred= 192 draft= 210 acc= 121 rate=0.576 tok/s=14.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1681,
  "total_draft_accepted": 1153,
  "aggregate_accept_rate": 0.6859,
  "wall_s_total": 113.03
}

PR

  code_python        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=19.3
  code_cpp           pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=17.6
  explain_concept    pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=18.0
  summarize          pred= 192 draft= 171 acc= 134 rate=0.784 tok/s=17.8
  qa_factual         pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=15.5
  translation        pred= 192 draft= 174 acc= 132 rate=0.759 tok/s=17.4
  creative_short     pred= 192 draft= 204 acc= 122 rate=0.598 tok/s=15.6
  stepwise_math      pred= 192 draft= 169 acc= 134 rate=0.793 tok/s=18.5
  long_code_review   pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1620,
  "total_draft_accepted": 1173,
  "aggregate_accept_rate": 0.7241,
  "wall_s_total": 105.99
}
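
As a sanity check on the two aggregates above, the per-prompt rows can be re-summed. This is a minimal sketch, not the benchmark script used for the PR (which is not shown in the thread); the tuple layout and the helper name `aggregate` are assumptions made for illustration.

```python
# (draft, accepted) per prompt, copied from the Master and PR tables above.
master = [(166, 134), (181, 130), (190, 127), (183, 129), (189, 128),
          (192, 126), (195, 126), (175, 132), (210, 121)]
pr = [(163, 136), (181, 130), (178, 131), (171, 134), (189, 128),
      (174, 132), (204, 122), (169, 134), (191, 126)]


def aggregate(rows):
    """Return (total_draft, total_accepted, accept_rate rounded to 4 places)."""
    total_draft = sum(d for d, _ in rows)
    total_acc = sum(a for _, a in rows)
    return total_draft, total_acc, round(total_acc / total_draft, 4)


print(aggregate(master))  # (1681, 1153, 0.6859)
print(aggregate(pr))      # (1620, 1173, 0.7241)

# Wall-clock ratio for the same 1728 predicted tokens (Master / PR).
speedup = 113.03 / 105.99
print(f"{speedup:.3f}x")  # 1.066x
```

The recomputed totals match both reported aggregates: fewer tokens are drafted on the PR branch, and more of them are accepted.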

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, for confirming that the post-norm hidden state is used in both vLLM and SGLang.

ggerganov · 2026-06-02T14:02:10Z

> get_embedding_post_norm

Unless I am missing something, we already have an API for the post-norm embeddings. These are the regular embeddings that we have always supported. The pre-norm embeddings extraction was added for MTP. AFAIU, it's not needed and we just have to use the regular (a.k.a. post-norm) embeddings?

If yes, I'd suggest to move this to draft and properly extract the embeddings using the correct API. I.e. no need to use llama_get/set_embeddings_pre_norm if it is not really necessary.

am17an · 2026-06-02T14:10:04Z

I think a separate API or decoupling is required, because the embeddings API is currently coupled with n_outputs, which would again make us reserve logits space for all tokens in a batch.

ggerganov · 2026-06-02T14:39:49Z

> I think a separate API or decoupling is required, because the embeddings API is currently coupled with n_outputs, which would again make us reserve logits space for all tokens in a batch.

Wouldn't extending llama_set_embeddings() to also accept a bool masked be enough?

ngxson · 2026-06-02T17:01:50Z

@ggerganov I think there is another problem (as described in the PR description): some MTP models may want to use pre-norm, while others may want to use post-norm. If I understand correctly, that will require another API like llama_model_get_nextn_embd_type just to check which embd type the MTP head wants.

Rather than exposing too many low-level APIs on the surface, we should probably make a use-case-specific API for MTP?

am17an · 2026-06-03T05:40:42Z

I think set_mtp_hidden_states should be a good API? That way "embeddings" can stay a separate concept, and pre-norm and post-norm can exist as "mtp_hidden_state" depending on model. This is how it is implemented in other inference engines from a cursory glance.

ggerganov · 2026-06-03T05:54:54Z

Yes, I think we can introduce llama_get/set_embeddings_nextn() for now in llama-ext.h and it should be OK. These embeddings will not be at a fixed position but will be placed depending on the model graph implementation. Functionality-wise, it will be the same as the existing pre-norm embeddings.
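
To make the pre-norm/post-norm distinction in this thread concrete, here is a toy sketch in Python. It assumes the model's final output norm is an RMSNorm (an assumption for illustration; the thread does not say), and every name in it (`rms_norm`, `nextn_hidden_state`, `uses_post_norm`) is hypothetical and not part of the llama.cpp API.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then scale per channel."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def nextn_hidden_state(pre_norm, norm_weight, uses_post_norm):
    """Pick the hidden state handed to the MTP head.

    pre_norm       -- residual stream after the last decoder layer
    uses_post_norm -- hypothetical per-model flag: True for Qwen3.6 (per this
                      PR), False for models that consume the pre-norm state
    """
    if uses_post_norm:
        return rms_norm(pre_norm, norm_weight)
    return list(pre_norm)


h = [3.0, 4.0]  # toy 2-channel hidden state
w = [1.0, 1.0]  # toy norm weights
pre = nextn_hidden_state(h, w, uses_post_norm=False)
post = nextn_hidden_state(h, w, uses_post_norm=True)
print(pre)   # unchanged copy of h
print(post)  # same direction as h, rescaled to roughly unit RMS
```

The point of a model-specific "nextn" hidden state, as discussed above, is that the caller does not need to know which of the two branches a given model takes.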

R-SITES · 2026-06-04T14:27:42Z

@am17an — tested your post-norm PR (#24025) on Intel Arc Pro B70 (Battlemage, PCI 8086:e223, 32GB) with oneAPI 2026.0 + SYCL graphs enabled. Full results:

Test setup:

Build: mainline b9484 (63e66fd) + qwen35: use post-norm hidden state for MTP #24025 patch
Compiler: Intel oneAPI 2026.0.0 (DPC++/C++, Level Zero V2)
SYCL graphs: enabled and confirmed used (graphs reused = 62-124)
Model: Qwen3.6-35B-A3B MoE Q4_K_M (GGUF, Native MTP preserved)
GPU: Intel Arc Pro B70, all 99 layers offloaded
Flags: GGML_SYCL_F16=1, --spec-type draft-mtp, --spec-draft-n-max 3

Pre-norm (baseline, same b9484 + 2026.0 + graph, no patch):

Run	Draft Accept	Eval Time	Server tok/s
1	71.9% (136/189)	3904ms / 200tok	51.23
no-MTP baseline	—	—	82.17

Post-norm (with #24025, cold then warm cache):

Run	Draft Accept	Eval Time	Server tok/s
1 (cold)	72.7% (136/187)	3859ms / 200tok	51.83
2 (warm)	76.2% (138/181)	3636ms / 200tok	55.00
no-MTP baseline	—	—	82.17
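
The acceptance and throughput columns in the two tables can be re-derived from the raw counts. The sketch below assumes that "Server tok/s" is simply 200 tokens divided by the eval time, which the thread does not state; the helper names are made up for illustration.

```python
# (accepted, drafted, eval_ms) copied from the two tables above.
RUNS = {
    "pre-norm": (136, 189, 3904),
    "post-norm cold": (136, 187, 3859),
    "post-norm warm": (138, 181, 3636),
}
N_TOKENS = 200  # tokens generated per run, from the "Eval Time" column


def accept_pct(accepted, drafted):
    """Draft acceptance as a percentage."""
    return 100.0 * accepted / drafted


def tok_per_s(eval_ms, n_tokens=N_TOKENS):
    """Throughput implied by the eval time (assumed definition)."""
    return n_tokens / (eval_ms / 1000.0)


for name, (acc, drafted, eval_ms) in RUNS.items():
    print(f"{name:15s} accept={accept_pct(acc, drafted):5.2f}% "
          f"tok/s={tok_per_s(eval_ms):.2f}")
```

Under that assumption the derived values agree with the reported ones to within rounding in the last digit.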

Analysis:

The post-norm change is directionally correct — draft acceptance improved from 71.9% to 72.7-76.2%, and server-side throughput lifted from 51.23 to 51.83-55.00 tok/s. The warm-cache run at 76.2% acceptance is the best we've seen on SYCL.

However, MTP remains 33% slower than no-MTP on this hardware (55 vs 82 tok/s). The draft generation compute cost dominates (~413ms per 63 draft calls = 6.55ms/call). At ~12ms/token for plain AR decode, generating 3 draft tokens in 6.55ms and accepting ~2.3 of them is a net loss.
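
The figures quoted in the paragraph above follow from the reported measurements. This is only an arithmetic check of those numbers, not a performance model; the variable names are illustrative.

```python
no_mtp_tps = 82.17      # no-MTP baseline, tok/s
mtp_tps = 55.00         # best MTP run (post-norm, warm cache), tok/s
draft_total_ms = 413.0  # approximate total draft time quoted above
draft_calls = 63

slowdown = 1.0 - mtp_tps / no_mtp_tps             # fraction slower than no-MTP
ar_ms_per_token = 1000.0 / no_mtp_tps             # plain AR decode cost
mtp_ms_per_token = 1000.0 / mtp_tps               # effective cost with MTP
draft_ms_per_call = draft_total_ms / draft_calls  # cost of one draft call

print(f"slowdown          = {slowdown:.1%}")               # 33.1%
print(f"AR ms/token       = {ar_ms_per_token:.2f}")        # 12.17
print(f"MTP ms/token      = {mtp_ms_per_token:.2f}")       # 18.18
print(f"draft ms per call = {draft_ms_per_call:.2f}")      # 6.56
```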

A couple of observations:

Draft generation time is the bottleneck, not acceptance. Even at 100% acceptance, the math doesn't close the gap — you'd save maybe 150ms over 63 draft calls, still far from beating 12ms/token AR decode.
SYCL graphs are functional (62-124 graph reuses confirmed) but don't meaningfully reduce draft gen time on Battlemage. The GPU is still executing the GATED_DELTA_NET + MUL_MAT + SOFTMAX chain as separate compute operations internally regardless of graph-level fusion.
Pure AR decode on this same GPU with Intel's own fused INT4 kernels (vLLM, Intel oneAPI stack) hits ~125 tok/s — nearly 2x our ggml-sycl AR decode. The gap is in kernel fusion and quant format, not MTP architecture.

The post-norm patch is worth landing — it's architecturally correct, marginally improves acceptance, and matches what other engines do. But closing the SYCL MTP performance gap likely requires kernel-level fusion beyond what the graph API can deliver.

Happy to run additional tests with different draft counts or profiling if helpful.

am17an requested a review from CISC as a code owner June 2, 2026 13:49

github-actions Bot added the model Model specific label Jun 2, 2026

am17an requested a review from ggerganov June 2, 2026 13:59

am17an marked this pull request as draft June 2, 2026 14:03

github-actions Bot added examples server labels Jun 3, 2026

am17an added 3 commits June 3, 2026 18:09

qwen35: use post-norm hidden state for MTP

e94341d

rename pre_norm to nextn

7e00869

fix step35

13d07d2

am17an force-pushed the mtp-fix-post-norm branch from 9140cb9 to 13d07d2 Compare June 3, 2026 10:16

am17an marked this pull request as ready for review June 3, 2026 13:47

am17an requested review from a team as code owners June 3, 2026 13:47

ggerganov approved these changes Jun 3, 2026

ServeurpersoCom approved these changes Jun 3, 2026

am17an merged commit 166fe29 into ggml-org:master Jun 3, 2026
24 of 25 checks passed

Godl1nk mentioned this pull request Jun 4, 2026

Feature Request: Anbeeld/beellama.cpp#53

Closed

4 tasks

ggerganov mentioned this pull request Jun 5, 2026

[Speculative decoding] feat: add EAGLE3 speculative decoding support #18039

Draft

am17an merged 3 commits into ggml-org:master from am17an:mtp-fix-post-norm