Support multi-token prediction (Eagle MTP)#369
|
Without:
vllm-rs-svc0 | 2026-05-27T14:13:41.648765Z WARN xinfer::server::server: --- Performance Metrics ---
vllm-rs-svc0 | 2026-05-27T14:13:41.648790Z INFO xinfer::server::server: [Seq 1] ⏱️ Prompt: 7540 tokens in 2.44s (3086.37 t/s)
vllm-rs-svc0 | 2026-05-27T14:13:41.648793Z INFO xinfer::server::server: [Seq 1] ⏱️ Decoded: 130 tokens in 3.89s (33.44 t/s)
and with:
vllm-rs-svc0 | 2026-05-27T14:26:29.014323Z WARN xinfer::server::server: --- Performance Metrics ---
vllm-rs-svc0 | 2026-05-27T14:26:29.014339Z INFO xinfer::server::server: [Seq 1] ⏱️ Prompt: 7585 tokens in 2.48s (3063.41 t/s)
vllm-rs-svc0 | 2026-05-27T14:26:29.014343Z INFO xinfer::server::server: [Seq 1] ⏱️ Decoded: 129 tokens in 3.81s (33.85 t/s)
which seemed odd until I remembered that only the larger models can do MTP and I'm using will retry with |
I'm optimizing the MTP decode path. If it works with CUDA graphs, and the duplicate verification and commit forward passes can be combined into a single pass, that should deliver a 50% decoding speedup over the baseline. |
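As a rough illustration of why verification and commit can share one forward pass: under greedy decoding, a single verification pass over the draft tokens already yields the main model's choice at every draft position, so the accepted prefix and the next token come out of the same result. The sketch below assumes exactly that; `accept_drafts` and its token-slice interface are hypothetical and not taken from this PR, and sampling with a temperature would need a rejection-sampling rule instead of exact matching.

```rust
/// Greedy verification of speculative (MTP) draft tokens.
///
/// `draft[i]` is the i-th speculative token proposed by the MTP head.
/// `target[i]` is the token the main model picks at the same position,
/// taken from ONE verification forward pass over the drafts, so
/// `target.len() == draft.len() + 1` (the last entry is the "bonus" token).
///
/// Returns the tokens that can be committed this step: every draft that
/// matches the main model, followed by the main model's own token at the
/// first mismatch (or the bonus token when all drafts match).
fn accept_drafts(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(target.len(), draft.len() + 1);
    // Length of the longest prefix on which draft and target agree.
    let n = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();
    // target[..n] equals the accepted drafts; target[n] is the correction
    // or bonus token, so no second forward pass is needed to obtain it.
    target[..=n].to_vec()
}
```

With two drafts, a step therefore commits between one and three tokens, e.g. `accept_drafts(&[5, 7], &[5, 8, 9])` commits `[5, 8]`.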
|
Which model are you using to test? I'm cautiously hopeful that fixing the FP4 alignment issue will let the 397B run correctly, at which point a 2X decoding speedup is very serious stuff. Separately: how stable do you expect this API to be?
|
|
the NVFP4 397B seems unhappy:
2026-05-28T04:58:28.600861Z WARN xinfer::core::runner: Failed to load MTP head: cannot find tensor mtp.layers.0.mlp.gate_proj.weight
0: candle_core::error::Error::bt
1: candle_core::safetensors::MmapedSafetensors::get
2: <candle_nn::var_builder::ShardedSafeTensors as candle_nn::var_builder::Backend>::get
3: candle_nn::var_builder::VarBuilderArgs<B>::get_with_hints_dtype
4: xinfer::models::layers::linear::linear_no_bias
5: xinfer::models::layers::linear::linear_no_bias_x
6: xinfer::models::layers::mlp::MLP::new
7: xinfer::models::qwen3_5_mtp::Qwen3_5MtpHead::new
8: xinfer::core::runner::ModelRunner::new
9: xinfer::runner::run_runner_process
10: xinfer::main::{{closure}}
11: tokio::runtime::park::CachedParkThread::block_on
12: xinfer::main
13: std::sys::backtrace::__rust_begin_short_backtrace
14: std::rt::lang_start::{{closure}}
15: std::rt::lang_start_internal
16: main
17: <unknown>
18: __libc_start_main
19: _start
. MTP disabled.
but what's more interesting is that it locks up the model after the first token is emitted:
$ aichat -m g60 hello
<think>
Thinking
and that's all - the logs don't even show decoding of that one word:
2026-05-28T04:59:25.760462Z WARN xinfer::core::engine: [Stream] New request [Seq_id 0, 11 tokens] received! (session_id: None)
2026-05-28T04:59:25.850888Z INFO xinfer::core::runner: User's thinking preference for reasoning models: None
2026-05-28T04:59:25.850913Z WARN xinfer::core::runner: Using user's sampling params: temp=Some(0.6), top_k=Some(20), top_p=Some(0.95), freq_penalty=None, pres_penalty=None
2026-05-28T04:59:25.858259Z INFO xinfer::core::engine: Prefilling 1 seq(s) [0]: 12 total tokens in 0.13s (95.24 tokens/s)
so it might need a graceful fallback, which is currently missing, for when MTP is requested but fails to load (or just a bail-out in that condition so the user has to deal with it). |
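The two options mentioned here (fall back or bail out) could look roughly like the sketch below. It is only an illustration: `MtpFallback`, `RunnerConfig` and `resolve_mtp` are hypothetical names, not xinfer's API, and it assumes the hang comes from the decode loop still expecting draft tokens after the head failed to load, which the logs above do not confirm.

```rust
/// What to do when MTP was requested but the MTP head cannot be loaded.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MtpFallback {
    /// Log a warning and run plain (non-speculative) decoding.
    DisableAndContinue,
    /// Refuse to start so the user has to fix the configuration.
    Bail,
}

/// Hypothetical runner setting derived from the load outcome.
#[derive(Debug, PartialEq)]
struct RunnerConfig {
    /// Number of draft tokens per step; 0 means MTP is off everywhere.
    num_speculative_tokens: usize,
}

/// Decide the effective speculative-token count from the requested value
/// and the result of loading the MTP head.
fn resolve_mtp(
    requested: usize,
    load_result: Result<(), String>,
    policy: MtpFallback,
) -> Result<RunnerConfig, String> {
    match (requested, load_result, policy) {
        // MTP not requested: the load result is irrelevant.
        (0, _, _) => Ok(RunnerConfig { num_speculative_tokens: 0 }),
        // Head loaded: keep the requested draft count.
        (n, Ok(()), _) => Ok(RunnerConfig { num_speculative_tokens: n }),
        // Load failed, fall back: the draft count must be reset to 0 so the
        // scheduler and decode loop never wait for a head that is missing.
        (_, Err(e), MtpFallback::DisableAndContinue) => {
            eprintln!("Failed to load MTP head: {e}. MTP disabled.");
            Ok(RunnerConfig { num_speculative_tokens: 0 })
        }
        // Load failed, bail out with an actionable error.
        (n, Err(e), MtpFallback::Bail) => {
            Err(format!("--mtp {n} requested but MTP head failed to load: {e}"))
        }
    }
}
```

The point of routing everything through one function is that "MTP disabled" becomes a single value the rest of the runner reads, rather than a log line that other code paths may not honor.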
|
Sidenote: i think this might be why the Spark's shared memory thing cannot leverage ~size of the model for KV cache: |
|
Also of note: setting |
This will be fixed. Currently, if the mtp parameter is given, it always tries to load the MTP layers. |
This model may not support MTP, so specifying mtp might cause issues. |
It's Eagle-style but not Eagle3; Eagle3 may not be compatible with Qwen3.5/Qwen3.6 and performs worse on long contexts. |
I forgot to mention that currently the Qwen3.5/Qwen3.6 dense models are supported; I'm working on the MoE ones. |
|
The eagle3 one I linked is q3n which is 3.5 without the visual layer AFAIK. |
I got MoE models working with MTP and optimized the MTP decoding path; I will push another commit soon. |
|
Any thoughts on specialized MTP setups like Would be handy to figure out how to move GDN states between hosts for disag PD - use Blackwells to prefill and decode on V100s, but I'm guessing they'd need the same flash back-end for that to work. |
MoE MTP is now supported and MTP speed is optimized; it currently delivers a 10 - 30% decoding speedup. Concurrent MTP decoding is not supported yet. |
Seems promising: Qwen3.5 27B goes from 42 tokens/s to 56 tokens/s (mtp=2) for a short prompt.
2026-05-28T14:33:17.769192Z INFO xinfer::server::server: [Seq 0] ⏱️ Prompt: 15 tokens in 0.19s (77.32 t/s) |
It is already supported. Sorry, I think we already support turboquant kvcache for PD. |
This should be supported under MTP. |
But it seems we don't support the modified Qwen3.5, where they removed the vision part. You can still try it, though. |
|
sm70 with NVFP4 RedHat Qwen 3.6 35B throws:
2026-05-28T19:17:34.651763Z WARN xinfer::utils::kvcache_allocator: KV cache dtype: Auto, cache dtype F16
thread 'main' (92) panicked at src/models/layers/moe.rs:497:9:
Invalid quantization format!
stack backtrace:
0: 0x55a1f5854e8a - <<std[52919eca6bce4da3]::sys::backtrace::BacktraceLock>::print::DisplayBacktrace as core[18c8dd30382e7099]::fmt::Display>::fmt
1: 0x55a1f587189a - core[18c8dd30382e7099]::fmt::write
2: 0x55a1f585d0b2 - <std[52919eca6bce4da3]::sys::stdio::unix::Stderr as std[52919eca6bce4da3]::io::Write>::write_fmt
3: 0x55a1f582f0ff - std[52919eca6bce4da3]::panicking::default_hook::{closure#0}
4: 0x55a1f584bac1 - std[52919eca6bce4da3]::panicking::default_hook
5: 0x55a1f584bd3b - std[52919eca6bce4da3]::panicking::panic_with_hook
6: 0x55a1f582f1ea - std[52919eca6bce4da3]::panicking::panic_handler::{closure#0}
7: 0x55a1f58239e9 - std[52919eca6bce4da3]::sys::backtrace::__rust_end_short_backtrace::<std[52919eca6bce4da3]::panicking::panic_handler::{closure#0}, !>
8: 0x55a1f583037d - __rustc[8068f81614cfe5c]::rust_begin_unwind
9: 0x55a1f587203c - core[18c8dd30382e7099]::panicking::panic_fmt
10: 0x55a1f45b7175 - xinfer::models::layers::moe::FusedMoe::new_with_gate::h89398d76d26597e9
11: 0x55a1f45b7b70 - xinfer::models::layers::moe::FusedMoe::new::hd8295fc49c97b5aa
12: 0x55a1f474ecf7 - xinfer::models::qwen3_5_mtp::Qwen3_5MtpHead::new::h0d038e5deef18880
13: 0x55a1f44d1e10 - xinfer::core::runner::ModelRunner::new::hd14df76a8d766038
14: 0x55a1f451708a - xinfer::runner::run_runner_process::h46054a23cc697256
15: 0x55a1f42332f2 - xinfer::main::{{closure}}::h842fad2f7d6b634a
16: 0x55a1f4231c8d - tokio::runtime::park::CachedParkThread::block_on::h6f407c769fb9c7d8
17: 0x55a1f4264743 - xinfer::main::h7b904e468da45404
18: 0x55a1f42489a6 - std::sys::backtrace::__rust_begin_short_backtrace::h0cc7bdcf367325ef
19: 0x55a1f4337bf5 - std::rt::lang_start::{{closure}}::h68963b64e105cd7d
20: 0x55a1f584a974 - std[52919eca6bce4da3]::rt::lang_start_internal
21: 0x55a1f4268c75 - main
22: 0x7ffb184a9d90 - <unknown>
23: 0x7ffb184a9e40 - __libc_start_main
24: 0x55a1f41766e5 - _start
25: 0x0 - <unknown>
does the MTP head stay in BF16 or some fun format? |
|
@guoqingbao: I think the modest performance gains aren't due to the implementation - it's the model family. I've seen a number of references on HF about the MTP head not being well-trained and having a fairly high miss rate. I think having the infrastructure to get tokens into the sequence by other means than |
Right, I need to make it support MTP batching, and it needs to pass several tests before merging. |
|
Some quantizations seem to leave the MTP head at 16b - not sure exactly what
(two processes printed the same backtrace at once; one copy shown, with the 0x55aa... addresses)
vllm-rs-svc0 | thread 'main' (75) panicked at src/models/layers/moe.rs:497:9:
vllm-rs-svc0 | Invalid quantization format!
vllm-rs-svc0 | stack backtrace:
vllm-rs-svc0 | 0: 0x55aa818a1b0a - <<std[52919eca6bce4da3]::sys::backtrace::BacktraceLock>::print::DisplayBacktrace as core[18c8dd30382e7099]::fmt::Display>::fmt
vllm-rs-svc0 | 1: 0x55aa818be51a - core[18c8dd30382e7099]::fmt::write
vllm-rs-svc0 | 2: 0x55aa818a9d32 - <std[52919eca6bce4da3]::sys::stdio::unix::Stderr as std[52919eca6bce4da3]::io::Write>::write_fmt
vllm-rs-svc0 | 3: 0x55aa8187c0ef - std[52919eca6bce4da3]::panicking::default_hook::{closure#0}
vllm-rs-svc0 | 4: 0x55aa818987e1 - std[52919eca6bce4da3]::panicking::default_hook
vllm-rs-svc0 | 5: 0x55aa81898a5b - std[52919eca6bce4da3]::panicking::panic_with_hook
vllm-rs-svc0 | 6: 0x55aa8187c1da - std[52919eca6bce4da3]::panicking::panic_handler::{closure#0}
vllm-rs-svc0 | 7: 0x55aa818709d9 - std[52919eca6bce4da3]::sys::backtrace::__rust_end_short_backtrace::<std[52919eca6bce4da3]::panicking::panic_handler::{closure#0}, !>
vllm-rs-svc0 | 8: 0x55aa8187d36d - __rustc[8068f81614cfe5c]::rust_begin_unwind
vllm-rs-svc0 | 9: 0x55aa818becbc - core[18c8dd30382e7099]::panicking::panic_fmt
vllm-rs-svc0 | 10: 0x55aa80478065 - xinfer::models::layers::moe::FusedMoe::new_with_gate::h0a5b9e8671ec3593
vllm-rs-svc0 | 11: 0x55aa80478a60 - xinfer::models::layers::moe::FusedMoe::new::h8f1f8d37ce189be3
vllm-rs-svc0 | 12: 0x55aa804a7df1 - xinfer::models::qwen3_5_mtp::Qwen3_5MtpHead::new::hca61aceeea96f95e
vllm-rs-svc0 | 13: 0x55aa8042fb2d - xinfer::core::runner::ModelRunner::new::hcf474f20453f1a9d
|
There is a bug: we assume the MTP layer has the same quantization format as the main model, but it does not. So it currently only works with unquantized models or FP8 models. Another issue that needs fixing is mamba cache alignment under prefix cache. |
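One way to avoid assuming a shared format is to look at each weight's stored dtype before choosing a loader. The sketch below is an assumption-laden illustration, not the fix in this PR: `WeightFormat` and `detect_format` are hypothetical, the dtype strings are the ones used in safetensors headers, treating `U8` as "packed 4-bit" is a simplification (real NVFP4 checkpoints also carry scale tensors whose names vary by exporter), and the main-model tensor name in the usage note is made up for the example.

```rust
use std::collections::HashMap;

/// Coarse storage format of a single weight tensor.
#[derive(Debug, PartialEq, Clone, Copy)]
enum WeightFormat {
    /// Plain 16-bit weights (BF16 or F16).
    Dense16,
    /// 8-bit floating-point weights.
    Fp8,
    /// Byte-packed 4-bit weights (assumed whenever the dtype is U8).
    Packed4Bit,
}

/// Look up the stored dtype of `weight_name` and map it to a format.
/// `dtypes` maps tensor name -> dtype string from the checkpoint header.
/// Returns `None` when the tensor is missing or the dtype is unrecognized,
/// so the caller can fall back or report an error instead of panicking.
fn detect_format(dtypes: &HashMap<String, String>, weight_name: &str) -> Option<WeightFormat> {
    match dtypes.get(weight_name)?.as_str() {
        "BF16" | "F16" => Some(WeightFormat::Dense16),
        "F8_E4M3" => Some(WeightFormat::Fp8),
        "U8" => Some(WeightFormat::Packed4Bit),
        _ => None,
    }
}
```

With a header where `model.layers.0.mlp.gate_proj.weight` is `U8` and `mtp.layers.0.mlp.gate_proj.weight` is `BF16`, the two lookups return different formats, which is the case the current code does not handle.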

Tested case:
Supported Model Arch: Qwen3.5/Qwen3.6 (BF16, FP8; NVFP4 - untested)
Acceptance rate is around 60% and speedup is 10 - 30%, e.g., the Qwen3.5 27B model goes from 42 tokens/s to 56.20 tokens/s under mtp=2.
Number of speculative tokens is stable at 2 (--mtp 2).
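
To relate the acceptance rate to the measured throughput, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that "acceptance rate" means each draft token is accepted independently with probability 0.6 and that drafts after the first rejection are discarded; the PR does not say how the rate was measured, and all function names are hypothetical.

```rust
/// Expected tokens committed per verification step with `k` draft tokens,
/// when each draft is accepted independently with probability `alpha`:
/// 1 + alpha + alpha^2 + ... + alpha^k (the leading 1 is the token the
/// main model always contributes).
fn expected_tokens_per_step(alpha: f64, k: u32) -> f64 {
    (0..=k).map(|i| alpha.powi(i as i32)).sum()
}

/// Measured decoding speedup from two throughput figures (tokens/s).
fn speedup(baseline_tps: f64, mtp_tps: f64) -> f64 {
    mtp_tps / baseline_tps
}

/// How much more expensive one MTP step must be than one plain decode
/// step for the expected tokens per step to produce the measured speedup.
fn implied_step_cost(alpha: f64, k: u32, baseline_tps: f64, mtp_tps: f64) -> f64 {
    expected_tokens_per_step(alpha, k) / speedup(baseline_tps, mtp_tps)
}
```

Under these assumptions, `expected_tokens_per_step(0.6, 2)` is 1.96 tokens per step, `speedup(42.0, 56.20)` is about 1.34, and `implied_step_cost(0.6, 2, 42.0, 56.20)` is about 1.46, i.e. each MTP step would cost roughly 1.46 plain decode steps.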