qwen35: use post-norm hidden state for MTP #24025
Conversation
Unless I am missing something, we already have an API for the post-norm embeddings. These are the regular embeddings that we have always supported. The pre-norm embeddings extraction was added for MTP. AFAIU, it's not needed and we just have to use the regular (a.k.a. post-norm) embeddings? If yes, I'd suggest moving this to draft and properly extracting the embeddings using the correct API. I.e. no need to use
I think a separate API or decoupling is required, because the embeddings API is currently coupled with
Wouldn't extending
@ggerganov I think there is another problem (as described in the PR description): some MTP models may want to use pre-norm, while others may want to use post-norm. If I understand correctly, that will require another API. Rather than exposing too many low-level APIs on the surface, we should probably make a use-case-specific API for MTP?
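To make the pre-norm versus post-norm distinction in this thread concrete, here is a minimal Python sketch. It assumes the final normalization is an RMSNorm, which the thread does not state. The names `rms_norm`, `MTP_USES_POST_NORM`, and `mtp_input_hidden` are hypothetical and are not part of the llama.cpp API. The per-architecture choices follow the PR description (post-norm for qwen, pre-norm for deepseek/glm).

```python
import math

def rms_norm(h, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of h, then apply a per-channel weight."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [w * x / rms for x, w in zip(h, weight)]

# Hypothetical lookup table; the keys are illustrative labels, not real identifiers.
MTP_USES_POST_NORM = {
    "qwen35": True,     # post-norm, per the PR description
    "deepseek": False,  # pre-norm, per the PR description
    "glm": False,       # pre-norm, per the PR description
}

def mtp_input_hidden(arch, pre_norm_h, norm_weight):
    """Return the hidden state an MTP head would consume for the given architecture."""
    if MTP_USES_POST_NORM[arch]:
        # Post-norm: output of the last layer after the final normalization.
        return rms_norm(pre_norm_h, norm_weight)
    # Pre-norm: output of the last layer before the final normalization.
    return list(pre_norm_h)

if __name__ == "__main__":
    h = [3.0, 4.0]  # toy last-layer output
    w = [1.0, 1.0]  # toy norm weights
    print("pre-norm :", mtp_input_hidden("deepseek", h, w))
    print("post-norm:", mtp_input_hidden("qwen35", h, w))
```

A single use-case-specific entry point of this shape keeps the pre/post decision in one place, instead of exposing two low-level getters to callers.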
I think
Yes, I think introduce
@am17an, I tested your post-norm PR (#24025) on Intel Arc Pro B70 (Battlemage, PCI 8086:e223, 32GB) with oneAPI 2026.0 + SYCL graphs enabled. Full results:
Test setup:
Pre-norm (baseline, same b9484 + 2026.0 + graph, no patch):
Post-norm (with #24025, cold then warm cache):
Analysis: The post-norm change is directionally correct: draft acceptance improved from 71.9% to 72.7-76.2%, and server-side throughput lifted from 51.23 to 51.83-55.00 tok/s. The warm-cache run at 76.2% acceptance is the best we've seen on SYCL.
However, MTP remains 33% slower than no-MTP on this hardware (55 vs 82 tok/s). The draft generation compute cost dominates (~413ms per 63 draft calls = 6.55ms/call). At ~12ms/token for plain AR decode, generating 3 draft tokens in 6.55ms and accepting ~2.3 of them is a net loss.
A couple of observations:
The post-norm patch is worth landing: it's architecturally correct, marginally improves acceptance, and matches what other engines do. But closing the SYCL MTP performance gap likely requires kernel-level fusion beyond what the graph API can deliver. Happy to run additional tests with different draft counts or profiling if helpful.
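The arithmetic quoted in the analysis above can be re-derived with a short Python script. This is only a check of the numbers reported in the comment (413 ms over 63 draft calls, 55 vs 82 tok/s). The helper names are made up for illustration, and nothing here is measured.

```python
# Figures quoted in the comment above (Intel Arc Pro B70, SYCL backend).
DRAFT_TOTAL_MS = 413.0  # approximate total draft-generation time
DRAFT_CALLS = 63        # number of draft calls
MTP_TOK_S = 55.0        # best post-norm MTP throughput (rounded)
PLAIN_TOK_S = 82.0      # throughput without MTP

def per_call_ms(total_ms, calls):
    """Average draft cost per call in milliseconds."""
    return total_ms / calls

def slowdown(mtp_tok_s, plain_tok_s):
    """Fraction by which MTP throughput trails plain decoding."""
    return 1.0 - mtp_tok_s / plain_tok_s

def ms_per_token(tok_s):
    """Convert tokens per second into milliseconds per token."""
    return 1000.0 / tok_s

if __name__ == "__main__":
    print(f"draft cost per call : {per_call_ms(DRAFT_TOTAL_MS, DRAFT_CALLS):.2f} ms")
    print(f"MTP slowdown        : {slowdown(MTP_TOK_S, PLAIN_TOK_S):.1%}")
    print(f"plain decode        : {ms_per_token(PLAIN_TOK_S):.1f} ms/token")
    print(f"MTP decode          : {ms_per_token(MTP_TOK_S):.1f} ms/token")
```

The script reproduces the roughly 33% gap and the ~12 ms/token plain decode figure. It does not model verification cost or acceptance rate, so it cannot by itself explain why MTP is a net loss on this hardware.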
Overview
It looks like qwen3.6 MTP actually uses the post-norm hidden state rather than the pre-norm hidden state. All credit to @jtjstock for pointing this out. Unfortunately deepseek/glm actually do use pre_norm hidden state, so perhaps adding an API get_embedding_post_norm makes sense? @ggerganov
Additional information
Master
PR
Requirements