
qwen35: use post-norm hidden state for MTP #24025

Merged
am17an merged 3 commits into ggml-org:master from am17an:mtp-fix-post-norm on Jun 3, 2026

@am17an
Contributor

@am17an am17an commented Jun 2, 2026

Overview

It looks like qwen3.6 MTP actually uses the post-norm hidden state rather than the pre-norm hidden state. All credit to @jtjstock for pointing this out. Unfortunately, deepseek/glm do use the pre-norm hidden state, so perhaps adding an API such as get_embedding_post_norm makes sense? @ggerganov
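To make the pre-norm/post-norm distinction concrete, here is a minimal sketch, assuming the final norm is an RMSNorm-style layer. The values and the `rms_norm` helper are illustrative only and are not taken from llama.cpp or from any real model.

```python
import math

def rms_norm(hidden, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square, then scale by a learned weight."""
    mean_sq = sum(x * x for x in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(hidden, weight)]

# Toy values (assumption): output of the last transformer block for one token.
pre_norm = [3.0, -4.0, 0.0, 5.0]
# Hypothetical final-norm weight; a real model would load this from its checkpoint.
norm_weight = [1.0, 1.0, 1.0, 1.0]

# "Post-norm" hidden state: the same vector after the final norm has been applied.
post_norm = rms_norm(pre_norm, norm_weight)

print("pre-norm :", pre_norm)
print("post-norm:", [round(x, 4) for x in post_norm])
```

The two vectors point in the same direction but differ in scale, so an MTP head trained on one sees out-of-distribution inputs when fed the other, which is consistent with the acceptance-rate change reported in this PR.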

Additional information

Master

  code_python        pred= 192 draft= 166 acc= 134 rate=0.807 tok/s=18.4
  code_cpp           pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=16.7
  explain_concept    pred= 192 draft= 190 acc= 127 rate=0.668 tok/s=15.6
  summarize          pred= 192 draft= 183 acc= 129 rate=0.705 tok/s=16.3
  qa_factual         pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=15.6
  translation        pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=15.4
  creative_short     pred= 192 draft= 195 acc= 126 rate=0.646 tok/s=15.7
  stepwise_math      pred= 192 draft= 175 acc= 132 rate=0.754 tok/s=18.1
  long_code_review   pred= 192 draft= 210 acc= 121 rate=0.576 tok/s=14.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1681,
  "total_draft_accepted": 1153,
  "aggregate_accept_rate": 0.6859,
  "wall_s_total": 113.03
}

PR

  code_python        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=19.3
  code_cpp           pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=17.6
  explain_concept    pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=18.0
  summarize          pred= 192 draft= 171 acc= 134 rate=0.784 tok/s=17.8
  qa_factual         pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=15.5
  translation        pred= 192 draft= 174 acc= 132 rate=0.759 tok/s=17.4
  creative_short     pred= 192 draft= 204 acc= 122 rate=0.598 tok/s=15.6
  stepwise_math      pred= 192 draft= 169 acc= 134 rate=0.793 tok/s=18.5
  long_code_review   pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1620,
  "total_draft_accepted": 1173,
  "aggregate_accept_rate": 0.7241,
  "wall_s_total": 105.99
}
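As a cross-check of the two aggregates, the per-prompt lines can be summed directly. This is a small sketch, not the benchmark script used for the PR; the `aggregate` helper and the regular expression are assumptions about the log format shown above.

```python
import re

# Matches lines such as "  code_python  pred= 192 draft= 166 acc= 134 rate=..."
LINE_RE = re.compile(r"^\s*(\w+)\s+pred=\s*(\d+)\s+draft=\s*(\d+)\s+acc=\s*(\d+)")

def aggregate(log_text):
    """Sum drafted and accepted tokens over all prompt lines; return the overall rate."""
    drafted = accepted = 0
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            drafted += int(m.group(3))
            accepted += int(m.group(4))
    return drafted, accepted, round(accepted / drafted, 4)

MASTER = """
  code_python        pred= 192 draft= 166 acc= 134
  code_cpp           pred= 192 draft= 181 acc= 130
  explain_concept    pred= 192 draft= 190 acc= 127
  summarize          pred= 192 draft= 183 acc= 129
  qa_factual         pred= 192 draft= 189 acc= 128
  translation        pred= 192 draft= 192 acc= 126
  creative_short     pred= 192 draft= 195 acc= 126
  stepwise_math      pred= 192 draft= 175 acc= 132
  long_code_review   pred= 192 draft= 210 acc= 121
"""

PR = """
  code_python        pred= 192 draft= 163 acc= 136
  code_cpp           pred= 192 draft= 181 acc= 130
  explain_concept    pred= 192 draft= 178 acc= 131
  summarize          pred= 192 draft= 171 acc= 134
  qa_factual         pred= 192 draft= 189 acc= 128
  translation        pred= 192 draft= 174 acc= 132
  creative_short     pred= 192 draft= 204 acc= 122
  stepwise_math      pred= 192 draft= 169 acc= 134
  long_code_review   pred= 192 draft= 191 acc= 126
"""

print("master:", aggregate(MASTER))  # (1681, 1153, 0.6859)
print("PR:    ", aggregate(PR))      # (1620, 1173, 0.7241)
```

The sums reproduce the reported totals: the PR drafts fewer tokens (1620 vs 1681) while more of them are accepted (1173 vs 1153), which is where the aggregate rate gain from 0.6859 to 0.7241 comes from.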

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for confirming that the post-norm hidden state is used in both vLLM and SGLang.

@am17an am17an requested a review from CISC as a code owner June 2, 2026 13:49
@github-actions github-actions Bot added the model Model specific label Jun 2, 2026
@am17an am17an requested a review from ggerganov June 2, 2026 13:59
@ggerganov
Member

> get_embedding_post_norm

Unless I am missing something, we already have an API for the post-norm embeddings. These are the regular embeddings that we have always supported. The pre-norm embeddings extraction was added for MTP. AFAIU, it's not needed and we just have to use the regular (a.k.a. post-norm) embeddings?

If yes, I'd suggest moving this to draft and properly extracting the embeddings using the correct API, i.e. there is no need to use llama_get/set_embeddings_pre_norm if it is not really necessary.

@am17an am17an marked this pull request as draft June 2, 2026 14:03
@am17an
Contributor Author

am17an commented Jun 2, 2026

I think a separate API or decoupling is required, because the embeddings API is currently coupled with n_outputs, which would again make us reserve logits space for all tokens in a batch.

@ggerganov
Member

> I think a separate API or decoupling is required, because the embeddings API is currently coupled with n_outputs, which would again make us reserve logits space for all tokens in a batch.

Wouldn't extending llama_set_embeddings() to also accept a bool masked be enough?

@ngxson
Contributor

ngxson commented Jun 2, 2026

@ggerganov I think there is another problem (as described in the PR description): some MTP models may want to use pre-norm, while others may want to use post-norm. If I understand correctly, that will require another API like llama_model_get_nextn_embd_type just to check which embd type the MTP head wants.

Rather than exposing too many low-level APIs to the surface, we should probably make a use-case-specific API for MTP?

@am17an
Contributor Author

am17an commented Jun 3, 2026

I think set_mtp_hidden_states should be a good API? That way "embeddings" can stay a separate concept, and pre-norm and post-norm can exist as "mtp_hidden_state" depending on the model. From a cursory glance, this is how it is implemented in other inference engines.

@ggerganov
Member

Yes, I think we can introduce llama_get/set_embeddings_nextn() for now in llama-ext.h and it should be OK. These embeddings will not be at a fixed position but will be placed depending on the model graph implementation. Functionality-wise, it will be the same as the existing pre-norm embeddings.
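The idea of a single "nextn" hidden state whose position depends on the model graph can be sketched as follows. This is purely illustrative Python, not the llama.cpp implementation: the enum, the lookup table, the architecture labels, and `select_nextn_hidden` are all hypothetical names, and only the pre-norm/post-norm assignments come from this thread.

```python
from enum import Enum

class NextnHiddenKind(Enum):
    """Which hidden state a model's MTP head expects (hypothetical enum)."""
    PRE_NORM = "pre_norm"
    POST_NORM = "post_norm"

# Hypothetical table. The keys are illustrative labels, not real architecture ids;
# the assignments follow what is stated in this PR discussion.
NEXTN_HIDDEN_KIND = {
    "qwen35": NextnHiddenKind.POST_NORM,
    "deepseek": NextnHiddenKind.PRE_NORM,
    "glm": NextnHiddenKind.PRE_NORM,
}

def select_nextn_hidden(arch, pre_norm, post_norm):
    """Return the hidden state that should be exposed as the MTP input for `arch`."""
    kind = NEXTN_HIDDEN_KIND[arch]
    return post_norm if kind is NextnHiddenKind.POST_NORM else pre_norm

# The caller never needs to know which tap point was used.
print(select_nextn_hidden("qwen35", "h_pre", "h_post"))
print(select_nextn_hidden("deepseek", "h_pre", "h_post"))
```

With this shape, the choice lives next to each model's graph definition, so the public surface needs only one getter/setter pair rather than a separate query for the embedding type.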

@am17an am17an force-pushed the mtp-fix-post-norm branch from 9140cb9 to 13d07d2 Compare June 3, 2026 10:16
@am17an am17an marked this pull request as ready for review June 3, 2026 13:47
@am17an am17an requested review from a team as code owners June 3, 2026 13:47
@am17an am17an merged commit 166fe29 into ggml-org:master Jun 3, 2026
24 of 25 checks passed
@Godl1nk Godl1nk mentioned this pull request Jun 4, 2026
4 tasks
@R-SITES

R-SITES commented Jun 4, 2026

@am17an — tested your post-norm PR (#24025) on Intel Arc Pro B70 (Battlemage, PCI 8086:e223, 32GB) with oneAPI 2026.0 + SYCL graphs enabled. Full results:

Test setup:

  • Build: mainline b9484 (63e66fd) + patch from #24025 (qwen35: use post-norm hidden state for MTP)
  • Compiler: Intel oneAPI 2026.0.0 (DPC++/C++, Level Zero V2)
  • SYCL graphs: enabled and confirmed used (graphs reused = 62-124)
  • Model: Qwen3.6-35B-A3B MoE Q4_K_M (GGUF, Native MTP preserved)
  • GPU: Intel Arc Pro B70, all 99 layers offloaded
  • Flags: GGML_SYCL_F16=1, --spec-type draft-mtp, --spec-draft-n-max 3

Pre-norm (baseline, same b9484 + 2026.0 + graph, no patch):

| Run | Draft Accept | Eval Time | Server tok/s |
|---|---|---|---|
| 1 | 71.9% (136/189) | 3904ms / 200tok | 51.23 |
| no-MTP baseline | - | - | 82.17 |

Post-norm (with #24025, cold then warm cache):

| Run | Draft Accept | Eval Time | Server tok/s |
|---|---|---|---|
| 1 (cold) | 72.7% (136/187) | 3859ms / 200tok | 51.83 |
| 2 (warm) | 76.2% (138/181) | 3636ms / 200tok | 55.00 |
| no-MTP baseline | - | - | 82.17 |

Analysis:

The post-norm change is directionally correct — draft acceptance improved from 71.9% to 72.7-76.2%, and server-side throughput lifted from 51.23 to 51.83-55.00 tok/s. The warm-cache run at 76.2% acceptance is the best we've seen on SYCL.

However, MTP remains 33% slower than no-MTP on this hardware (55 vs 82 tok/s). The draft generation compute cost dominates (~413ms per 63 draft calls = 6.55ms/call). At ~12ms/token for plain AR decode, generating 3 draft tokens in 6.55ms and accepting ~2.3 of them is a net loss.
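The figures quoted above can be re-derived from the raw numbers in the two tables. The sketch below only repeats that arithmetic; the variable and function names are made up for illustration, and small differences in the last digit are expected because the server reports throughput from unrounded timings.

```python
def tok_per_s(tokens, eval_ms):
    """Throughput from a token count and an eval time in milliseconds."""
    return tokens / (eval_ms / 1000.0)

TOKENS = 200
NO_MTP_TOK_S = 82.17  # no-MTP baseline from the tables

# name -> (accepted draft tokens, drafted tokens, eval time in ms)
runs = {
    "pre-norm": (136, 189, 3904),
    "post-norm cold": (136, 187, 3859),
    "post-norm warm": (138, 181, 3636),
}

for name, (accepted, drafted, eval_ms) in runs.items():
    rate = accepted / drafted
    print(f"{name:15s} accept={rate:.3f} tok/s={tok_per_s(TOKENS, eval_ms):.2f}")

# Best MTP run relative to plain autoregressive decode.
best_mtp = tok_per_s(TOKENS, runs["post-norm warm"][2])
slowdown = 1.0 - best_mtp / NO_MTP_TOK_S

# Per-call draft cost and per-token AR cost, both in milliseconds.
draft_ms_per_call = 413 / 63
ar_ms_per_token = 1000.0 / NO_MTP_TOK_S

print(f"slowdown vs no-MTP: {slowdown:.1%}")
print(f"draft cost per call: {draft_ms_per_call:.2f} ms")
print(f"AR decode per token: {ar_ms_per_token:.2f} ms")
```

This confirms the roughly 33% gap and the ~12 ms/token AR cost; it does not by itself explain where the remaining time goes, which would need per-stage profiling.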

A couple of observations:

  1. Draft generation time is the bottleneck, not acceptance. Even at 100% acceptance, the math doesn't close the gap — you'd save maybe 150ms over 63 draft calls, still far from beating 12ms/token AR decode.

  2. SYCL graphs are functional (62-124 graph reuses confirmed) but don't meaningfully reduce draft gen time on Battlemage. The GPU is still executing the GATED_DELTA_NET + MUL_MAT + SOFTMAX chain as separate compute operations internally regardless of graph-level fusion.

  3. Pure AR decode on this same GPU with Intel's own fused INT4 kernels (vLLM, Intel oneAPI stack) hits ~125 tok/s, roughly 1.5x our ggml-sycl AR decode (82.17 tok/s). The gap is in kernel fusion and quant format, not MTP architecture.

The post-norm patch is worth landing — it's architecturally correct, marginally improves acceptance, and matches what other engines do. But closing the SYCL MTP performance gap likely requires kernel-level fusion beyond what the graph API can deliver.

Happy to run additional tests with different draft counts or profiling if helpful.

5 participants