Support multi-token prediction (Eagle MTP) #369

Open
guoqingbao wants to merge 6 commits into main from eagle_mtp
@guoqingbao guoqingbao commented May 27, 2026

Owner

Tested case:

Supported Model Arch: Qwen3.5/Qwen3.6 (BF16, FP8; NVFP4 - untested)

cargo run --release --features cuda,nccl,flashinfer,cutlass
xinfer --m Qwen/Qwen3.5-27B-FP8/ --mtp 2 --ui-server

Acceptance rate is around 60% and the decoding speedup is 10 - 30%, e.g., the Qwen3.5 27B model goes from 42 tokens/s to 56.20 tokens/s with mtp=2.

The number of speculative tokens is stable at 2 (--mtp 2).

Details
2026-05-28T14:32:58.852530Z  WARN xinfer::server::server: Stream request has session_id c3a435a5-736e-4031-9984-d0f6c82e2f96
2026-05-28T14:32:58.930939Z  WARN xinfer::core::engine: [Stream] New request [Seq_id 0, 15 tokens] received! (session_id: Some("c3a435a5-736e-4031-9984-d0f6c82e2f96"))

2026-05-28T14:32:59.036833Z  INFO xinfer::core::runner: User's thinking preference for reasoning models: Some(true)
2026-05-28T14:32:59.036866Z  WARN xinfer::core::runner: Using user's sampling params: temp=Some(0.6), top_k=Some(20), top_p=Some(0.95), freq_penalty=None, pres_penalty=None
2026-05-28T14:32:59.047469Z  INFO xinfer::core::engine: Prefilling 1 seq(s) [0]: 16 total tokens in 0.19s (82.47 tokens/s)
2026-05-28T14:32:59.972350Z  INFO xinfer::core::runner: MTP Stats: proposed=32, accepted=21, acceptance_rate=65.62%, avg_tokens/step=3.31
2026-05-28T14:33:00.895722Z  INFO xinfer::core::runner: MTP Stats: proposed=64, accepted=39, acceptance_rate=60.94%, avg_tokens/step=3.22
2026-05-28T14:33:01.822555Z  INFO xinfer::core::runner: MTP Stats: proposed=96, accepted=61, acceptance_rate=63.54%, avg_tokens/step=3.27
2026-05-28T14:33:02.747718Z  INFO xinfer::core::runner: MTP Stats: proposed=128, accepted=79, acceptance_rate=61.72%, avg_tokens/step=3.23
2026-05-28T14:33:03.674480Z  INFO xinfer::core::runner: MTP Stats: proposed=160, accepted=96, acceptance_rate=60.00%, avg_tokens/step=3.20
2026-05-28T14:33:04.602557Z  INFO xinfer::core::runner: MTP Stats: proposed=192, accepted=118, acceptance_rate=61.46%, avg_tokens/step=3.23
2026-05-28T14:33:05.531340Z  INFO xinfer::core::runner: MTP Stats: proposed=224, accepted=140, acceptance_rate=62.50%, avg_tokens/step=3.25
2026-05-28T14:33:06.463584Z  INFO xinfer::core::runner: MTP Stats: proposed=256, accepted=159, acceptance_rate=62.11%, avg_tokens/step=3.24
2026-05-28T14:33:07.396372Z  INFO xinfer::core::runner: MTP Stats: proposed=288, accepted=180, acceptance_rate=62.50%, avg_tokens/step=3.25
2026-05-28T14:33:08.332577Z  INFO xinfer::core::runner: MTP Stats: proposed=320, accepted=205, acceptance_rate=64.06%, avg_tokens/step=3.28
2026-05-28T14:33:09.270424Z  INFO xinfer::core::runner: MTP Stats: proposed=352, accepted=224, acceptance_rate=63.64%, avg_tokens/step=3.27
2026-05-28T14:33:10.208793Z  INFO xinfer::core::runner: MTP Stats: proposed=384, accepted=245, acceptance_rate=63.80%, avg_tokens/step=3.28
2026-05-28T14:33:11.146838Z  INFO xinfer::core::runner: MTP Stats: proposed=416, accepted=269, acceptance_rate=64.66%, avg_tokens/step=3.29
2026-05-28T14:33:12.088962Z  INFO xinfer::core::runner: MTP Stats: proposed=448, accepted=286, acceptance_rate=63.84%, avg_tokens/step=3.28
2026-05-28T14:33:13.032053Z  INFO xinfer::core::runner: MTP Stats: proposed=480, accepted=302, acceptance_rate=62.92%, avg_tokens/step=3.26
2026-05-28T14:33:13.976385Z  INFO xinfer::core::runner: MTP Stats: proposed=512, accepted=322, acceptance_rate=62.89%, avg_tokens/step=3.26
2026-05-28T14:33:14.921817Z  INFO xinfer::core::runner: MTP Stats: proposed=544, accepted=348, acceptance_rate=63.97%, avg_tokens/step=3.28
2026-05-28T14:33:15.867525Z  INFO xinfer::core::runner: MTP Stats: proposed=576, accepted=371, acceptance_rate=64.41%, avg_tokens/step=3.29
2026-05-28T14:33:16.816091Z  INFO xinfer::core::runner: MTP Stats: proposed=608, accepted=392, acceptance_rate=64.47%, avg_tokens/step=3.29
2026-05-28T14:33:17.765741Z  INFO xinfer::core::runner: MTP Stats: proposed=640, accepted=415, acceptance_rate=64.84%, avg_tokens/step=3.30
2026-05-28T14:33:17.767200Z  INFO xinfer::core::block_manager: Prefix cache insert seq 0 (1067 tokens, 16 blocks)
2026-05-28T14:33:17.769180Z  WARN xinfer::server::server: --- Performance Metrics ---
2026-05-28T14:33:17.769192Z  INFO xinfer::server::server: [Seq 0] ⏱️ Prompt: 15 tokens in 0.19s (77.32 t/s)
2026-05-28T14:33:17.769198Z  INFO xinfer::server::server: [Seq 0] ⏱️ Decoded: 1052 tokens in 18.72s (56.20 t/s)
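
The acceptance rate in the log above is the ratio of the `accepted` counter to the `proposed` counter, and the quoted speedup is a ratio of decode rates. A minimal sketch of that arithmetic follows; the function names are illustrative and are not taken from xinfer.

```rust
/// Acceptance rate in percent: accepted draft tokens over proposed draft tokens.
/// Returns 0.0 when nothing has been proposed yet.
fn acceptance_rate_percent(proposed: u64, accepted: u64) -> f64 {
    if proposed == 0 {
        return 0.0;
    }
    100.0 * accepted as f64 / proposed as f64
}

/// Relative decoding speedup in percent, given tokens/s without and with MTP.
fn speedup_percent(baseline_tps: f64, mtp_tps: f64) -> f64 {
    100.0 * (mtp_tps - baseline_tps) / baseline_tps
}
```

With the first log line (proposed=32, accepted=21) this gives 65.625%, which the log prints as 65.62%. For the quoted decode rates (42 tokens/s to 56.20 tokens/s) the relative speedup works out to roughly 33.8%.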

@sempervictus

Contributor

Without --mtp 2:

vllm-rs-svc0  | 2026-05-27T14:13:41.648765Z  WARN xinfer::server::server: --- Performance Metrics ---
vllm-rs-svc0  | 2026-05-27T14:13:41.648790Z  INFO xinfer::server::server: [Seq 1] ⏱️ Prompt: 7540 tokens in 2.44s (3086.37 t/s)
vllm-rs-svc0  | 2026-05-27T14:13:41.648793Z  INFO xinfer::server::server: [Seq 1] ⏱️ Decoded: 130 tokens in 3.89s (33.44 t/s)

and with --mtp 2:

vllm-rs-svc0  | 2026-05-27T14:26:29.014323Z  WARN xinfer::server::server: --- Performance Metrics ---
vllm-rs-svc0  | 2026-05-27T14:26:29.014339Z  INFO xinfer::server::server: [Seq 1] ⏱️ Prompt: 7585 tokens in 2.48s (3063.41 t/s)
vllm-rs-svc0  | 2026-05-27T14:26:29.014343Z  INFO xinfer::server::server: [Seq 1] ⏱️ Decoded: 129 tokens in 3.81s (33.85 t/s)

which seemed odd until i remembered that only the larger models can do MTP and i'm using Qwen/Qwen3.6-35B-A3B-FP8 to test (the 80B coder also lacks an MTP head):

2026-05-27T14:25:16.753522Z  WARN xinfer::core::runner: Failed to load MTP head: cannot find tensor mtp.layers.0.mlp.gate_proj.weight

will retry with RedHatAI/Qwen3.5-122B-A10B-NVFP4 once it's on the Spark. Need to figure out how to run the 397 with scaled context on the 120s and not have them blow up trying to prefill (the chunk size setting unfortunately doesn't help).

@guoqingbao

Owner Author

and with --mtp 2:

I'm optimizing the MTP decode path. If it works with CUDA graphs and the duplicate verification and commit forward passes can be combined within a single path, that will deliver a 50% decoding speedup compared to the baseline.
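
For readers unfamiliar with the verification step mentioned here, the following is a minimal sketch of greedy verification in speculative decoding. It is a simplification under stated assumptions: it is not xinfer's code, `verify_greedy` is a hypothetical name, and it assumes argmax decoding. With sampled decoding (the logs in this thread use temp=0.6) a probabilistic accept/reject rule would be needed instead.

```rust
/// Greedy verification: keep the longest prefix of draft tokens that matches
/// the target model's own choice at each position, then append one token from
/// the target model (the correction at the first mismatch, or the bonus token
/// when every draft token was accepted).
///
/// `target` must hold one more token than `draft`: target[i] is the target
/// model's choice given the context extended by draft[..i].
fn verify_greedy(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(target.len(), draft.len() + 1);
    // Count how many leading draft tokens agree with the target model.
    let accepted = draft
        .iter()
        .zip(target.iter())
        .take_while(|(d, t)| d == t)
        .count();
    let mut out = draft[..accepted].to_vec();
    // The target model always contributes exactly one token per step.
    out.push(target[accepted]);
    out
}
```

In this scheme a step with two draft tokens commits between one and three tokens, so the benefit depends directly on how often the draft head agrees with the target model.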

@sempervictus

sempervictus commented May 28, 2026

Contributor

Which model are you using to test? I'm cautiously hopeful that fixing the FP4 alignment issue will let the 397B run correctly at which point a 2X decoding speedup is very serious stuff.

Separately: how stable do you expect this API to be?

  1. LLG: Comprehensive Guided Decoding Infrastructure #265 offers infallible tokens at various seqpos. With tool-call and reasoning grammars enabled (and with constraint-based tests) i've seen north of 30 "free tokens pending" for a seqpos which cannot be anything but the grammar-prescribed ones meaning we get them before MTP in CPU context and can skip to the seqpos immediately after them right away.
  2. I have a prototype ngram branch but that is a lot more context dependent. I have a sneaking suspicion that it can be combined with prefix cache to extract repeat patterns from precomputed history in the sequence, but even without a novel implementation like that it was regularly hitting 3-6 tokens of valid candidates (although i couldn't get them to append correctly in the scheduler - just got log outputs telling me how much potential throughput i'm missing 😁)
  3. Does the current branch support EAGLE3 mechanics to enable things like the aurora speculative coder model (no MTP but similar concept as a separate head as i understand it)? There's a branch somewhere in my repo for loading those alongside the base model using a common KV-cache since they're geometrically identical and working on the same sequence at once - far from done but some of the basic logic is there.
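
The ngram idea in item 2 can be sketched as a lookup over the sequence's own token history: match the last few tokens against an earlier occurrence and propose whatever followed it. This is only an illustration of the general technique, not the branch described above; `propose_ngram` and its parameters are hypothetical names.

```rust
/// Propose up to `k` draft tokens by matching the last `n` tokens of
/// `history` against earlier positions, preferring the most recent match.
/// Returns an empty vector when there is no earlier occurrence.
fn propose_ngram(history: &[u32], n: usize, k: usize) -> Vec<u32> {
    if n == 0 || k == 0 || history.len() <= n {
        return Vec::new();
    }
    let suffix_start = history.len() - n;
    let suffix = &history[suffix_start..];
    // Scan candidate start positions from newest to oldest. A candidate must
    // start before the suffix itself, so at least one token follows it.
    for start in (0..suffix_start).rev() {
        if &history[start..start + n] == suffix {
            let from = start + n;
            let to = (from + k).min(history.len());
            return history[from..to].to_vec();
        }
    }
    Vec::new()
}
```

The proposals run entirely on the CPU and still have to be verified by a forward pass of the target model before they are committed to the sequence.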

@sempervictus

Contributor

the NVFP4 397B seems unhappy:

2026-05-28T04:58:28.600861Z  WARN xinfer::core::runner: Failed to load MTP head: cannot find tensor mtp.layers.0.mlp.gate_proj.weight
   0: candle_core::error::Error::bt
   1: candle_core::safetensors::MmapedSafetensors::get
   2: <candle_nn::var_builder::ShardedSafeTensors as candle_nn::var_builder::Backend>::get
   3: candle_nn::var_builder::VarBuilderArgs<B>::get_with_hints_dtype
   4: xinfer::models::layers::linear::linear_no_bias
   5: xinfer::models::layers::linear::linear_no_bias_x
   6: xinfer::models::layers::mlp::MLP::new
   7: xinfer::models::qwen3_5_mtp::Qwen3_5MtpHead::new
   8: xinfer::core::runner::ModelRunner::new
   9: xinfer::runner::run_runner_process
  10: xinfer::main::{{closure}}
  11: tokio::runtime::park::CachedParkThread::block_on
  12: xinfer::main
  13: std::sys::backtrace::__rust_begin_short_backtrace
  14: std::rt::lang_start::{{closure}}
  15: std::rt::lang_start_internal
  16: main
  17: <unknown>
  18: __libc_start_main
  19: _start
. MTP disabled.

but what's more interesting is that it locks up the model after the first token emitted:

$ aichat -m g60 hello
<think>

Thinking

and that's all - the logs don't even show decoding of that one word:

2026-05-28T04:59:25.760462Z  WARN xinfer::core::engine: [Stream] New request [Seq_id 0, 11 tokens] received! (session_id: None)

2026-05-28T04:59:25.850888Z  INFO xinfer::core::runner: User's thinking preference for reasoning models: None
2026-05-28T04:59:25.850913Z  WARN xinfer::core::runner: Using user's sampling params: temp=Some(0.6), top_k=Some(20), top_p=Some(0.95), freq_penalty=None, pres_penalty=None
2026-05-28T04:59:25.858259Z  INFO xinfer::core::engine: Prefilling 1 seq(s) [0]: 12 total tokens in 0.13s (95.24 tokens/s)
    

so might need some graceful fallback which is missing when MTP is requested but fails to load (or just a bail-out in that condition to make the user deal with it).
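
A minimal sketch of the two options described here (fall back or bail out) is shown below. All names are hypothetical and none of this is xinfer's actual code. The point it illustrates is that the decode loop should use the resolved speculative-token count rather than the raw `--mtp` value, so that a failed load cannot leave the engine expecting a head that does not exist.

```rust
use std::fmt;

/// What to do when `--mtp N` is requested but the MTP head cannot be loaded.
#[derive(Clone, Copy, Debug, PartialEq)]
enum OnMtpLoadFailure {
    /// Continue with ordinary one-token-per-step decoding.
    Fallback,
    /// Refuse to start so the user has to fix the configuration.
    Bail,
}

/// Hypothetical error type standing in for whatever the loader returns.
#[derive(Debug, PartialEq)]
struct MtpLoadError(String);

impl fmt::Display for MtpLoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Returns the number of speculative tokens the decode loop should use.
fn resolve_num_speculative(
    requested: usize,
    load_result: Result<(), MtpLoadError>,
    policy: OnMtpLoadFailure,
) -> Result<usize, MtpLoadError> {
    match (load_result, policy) {
        // Head loaded: honor the requested value.
        (Ok(()), _) => Ok(requested),
        // Head missing: warn and decode without speculation.
        (Err(e), OnMtpLoadFailure::Fallback) => {
            eprintln!("Failed to load MTP head: {}. MTP disabled.", e);
            Ok(0)
        }
        // Head missing: surface the error to the caller.
        (Err(e), OnMtpLoadFailure::Bail) => Err(e),
    }
}
```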

@sempervictus

Contributor

Sidenote: i think this might be why the Spark's shared memory thing cannot leverage ~size of the model for KV cache: candle_core::safetensors::MmapedSafetensors::get indicates that it's mmapped to host-allocated RAM, which in turn means that when GPU RAM availability is calculated it cannot expect to use that space for KV. :-\

@sempervictus

Contributor

Also of note: setting --mtp 2 on a model that doesn't support MTP at all (instead of just having a crash recorded trying to load it) produces the same "one word -> hang" effect (tested on FP8 coder 80B x4 SM120).

@guoqingbao

Owner Author
2026-05-28T04:58:28.600861Z  WARN xinfer::core::runner: Failed to load MTP head: cannot find tensor mtp.layers.0.mlp.gate_proj.weight

This will be fixed. Currently, if the mtp parameter is given, it always tries to load the MTP layers.

@guoqingbao

Owner Author

but what's more interesting is that it locks up the model after the first token emitted:

This model may not support MTP, so specifying mtp might cause issues.

@guoqingbao

Owner Author

3. Does the current branch support EAGLE3 mechanics to enable things like the aurora speculative coder model (no MTP but similar concept as a separate head as i understand it)? There's a branch somewhere in my repo for loading those alongside the base model using a common KV-cache since they're geometrically identical and working on the same sequence at once - far from done but some of the basic logic is there.

It's EAGLE-style but not EAGLE3; EAGLE3 may not be compatible with Qwen3.5/Qwen3.6 and has poorer performance for long contexts.

@guoqingbao

Owner Author

Also of note: setting --mtp 2 on a model that doesn't support MTP at all (instead of just having a crash recorded trying to load it) produces the same "one word -> hang" effect (tested on FP8 coder 80B x4 SM120).

I forgot to mention that currently only the Qwen3.5/Qwen3.6 dense models are supported; I'm working on the MoE ones.

@sempervictus

Contributor

The eagle3 one I linked is q3n which is 3.5 without the visual layer AFAIK.

@guoqingbao

Owner Author

The eagle3 one I linked is q3n which is 3.5 without the visual layer AFAIK.

I got MoE models working with MTP and optimized the MTP decoding path; I will push another commit soon.

@sempervictus

Contributor

Any thoughts on specialized MTP setups like sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP?

Would be handy to figure out how to move GDN states between hosts for disag PD - use blackwells to prefill and decode on v100's but i'm guessing they'd need the same flash back-end for that to work.

@guoqingbao

Owner Author

Any thoughts on specialized MTP setups like sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP?

Would be handy to figure out how to move GDN states between hosts for disag PD - use blackwells to prefill and decode on v100's but i'm guessing they'd need the same flash back-end for that to work.

MoE MTP is now supported and MTP speed has been optimized; it currently delivers a 10 - 30% decoding speedup. Concurrent MTP decoding is not supported yet.

@guoqingbao

Owner Author

Any thoughts on specialized MTP setups like sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP?

Would be handy to figure out how to move GDN states between hosts for disag PD - use blackwells to prefill and decode on v100's but i'm guessing they'd need the same flash back-end for that to work.

Seems promising: the Qwen3.5 27B goes from 42 tokens/s to 56 tokens/s (mtp=2) for a short prompt.

2026-05-28T14:33:17.769192Z INFO xinfer::server::server: [Seq 0] ⏱️ Prompt: 15 tokens in 0.19s (77.32 t/s)
2026-05-28T14:33:17.769198Z INFO xinfer::server::server: [Seq 0] ⏱️ Decoded: 1052 tokens in 18.72s (56.20 t/s)

@guoqingbao

guoqingbao commented May 28, 2026

Owner Author

Would be handy to figure out how to move GDN states between hosts for disag PD - use blackwells to prefill and decode on v100's but i'm guessing they'd need the same flash back-end for that to work.

It's already supported.

Sorry, I think what we support is the turboquant KV cache for PD.

@guoqingbao

Owner Author

Any thoughts on specialized MTP setups like sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP?

This should be supported under MTP.

@guoqingbao

Owner Author

Any thoughts on specialized MTP setups like sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP?

This should be supported under MTP.

But it seems we don't support the modified Qwen3.5, since they removed the vision part. You can still try it.

@sempervictus

Contributor

sm70 with NVFP4 RedHat Qwen 3.6 35B throws:

2026-05-28T19:17:34.651763Z  WARN xinfer::utils::kvcache_allocator: KV cache dtype: Auto, cache dtype F16

thread 'main' (92) panicked at src/models/layers/moe.rs:497:9:
Invalid quantization format!
stack backtrace:
   0:     0x55a1f5854e8a - <<std[52919eca6bce4da3]::sys::backtrace::BacktraceLock>::print::DisplayBacktrace as core[18c8dd30382e7099]::fmt::Display>::fmt
   1:     0x55a1f587189a - core[18c8dd30382e7099]::fmt::write
   2:     0x55a1f585d0b2 - <std[52919eca6bce4da3]::sys::stdio::unix::Stderr as std[52919eca6bce4da3]::io::Write>::write_fmt
   3:     0x55a1f582f0ff - std[52919eca6bce4da3]::panicking::default_hook::{closure#0}
   4:     0x55a1f584bac1 - std[52919eca6bce4da3]::panicking::default_hook
   5:     0x55a1f584bd3b - std[52919eca6bce4da3]::panicking::panic_with_hook
   6:     0x55a1f582f1ea - std[52919eca6bce4da3]::panicking::panic_handler::{closure#0}
   7:     0x55a1f58239e9 - std[52919eca6bce4da3]::sys::backtrace::__rust_end_short_backtrace::<std[52919eca6bce4da3]::panicking::panic_handler::{closure#0}, !>
   8:     0x55a1f583037d - __rustc[8068f81614cfe5c]::rust_begin_unwind
   9:     0x55a1f587203c - core[18c8dd30382e7099]::panicking::panic_fmt
  10:     0x55a1f45b7175 - xinfer::models::layers::moe::FusedMoe::new_with_gate::h89398d76d26597e9
  11:     0x55a1f45b7b70 - xinfer::models::layers::moe::FusedMoe::new::hd8295fc49c97b5aa
  12:     0x55a1f474ecf7 - xinfer::models::qwen3_5_mtp::Qwen3_5MtpHead::new::h0d038e5deef18880
  13:     0x55a1f44d1e10 - xinfer::core::runner::ModelRunner::new::hd14df76a8d766038
  14:     0x55a1f451708a - xinfer::runner::run_runner_process::h46054a23cc697256
  15:     0x55a1f42332f2 - xinfer::main::{{closure}}::h842fad2f7d6b634a
  16:     0x55a1f4231c8d - tokio::runtime::park::CachedParkThread::block_on::h6f407c769fb9c7d8
  17:     0x55a1f4264743 - xinfer::main::h7b904e468da45404
  18:     0x55a1f42489a6 - std::sys::backtrace::__rust_begin_short_backtrace::h0cc7bdcf367325ef
  19:     0x55a1f4337bf5 - std::rt::lang_start::{{closure}}::h68963b64e105cd7d
  20:     0x55a1f584a974 - std[52919eca6bce4da3]::rt::lang_start_internal
  21:     0x55a1f4268c75 - main
  22:     0x7ffb184a9d90 - <unknown>
  23:     0x7ffb184a9e40 - __libc_start_main
  24:     0x55a1f41766e5 - _start
  25:                0x0 - <unknown>

does MTP head stay in BF16 or some fun format?

@sempervictus

Contributor

@guoqingbao: I think the modest performance gains aren't due to the implementation - it's the model family. I've seen a number of references on HF about the MTP head not being well-trained and having a fairly high miss rate. I think having the infrastructure to get tokens into the sequence by means other than sample() (one per sequence) is very valuable in itself, since ff-tokens and ngrams can be produced on the CPU side without even bothering the devices. That said, the efficient dispatch system powering all of this is non-trivial to access mutably, as are the relevant states of sequences, caches, and, in the case of ff-tokens, the FSM, all of which need to be aligned before the next forward() call (and potentially before other data-dependent work)... if there were a "@guoqingbao-approved" infrastructure layer for that sort of function, even if the infra itself wasn't doubling decode rates, the rest of us could build atop it to at least provide initial implementations, if not actual merge-ready features for the runtime.

@guoqingbao

Owner Author

I think the modest performance gains aren't due to the implementation - it's the model family. I've seen a number of references on HF about the MTP head not being well-trained and having a fairly high miss rate.

Right, I need to make it support MTP batching, and it needs to pass several tests before merging.

@sempervictus

Contributor

Some quantizations seem to leave the MTP head at 16b - not sure exactly what nvidia/Qwen3.5-397B-A17B-NVFP4 uses but it crashes in current state with

Details
vllm-rs-svc0  | thread 'main' (75) panicked at src/models/layers/moe.rs:497:9:
vllm-rs-svc0  | Invalid quantization format!
vllm-rs-svc0  | stack backtrace:
vllm-rs-svc0  |    0:     0x55aa818a1b0a - <<std[52919eca6bce4da3]::sys::backtrace::BacktraceLock>::print::DisplayBacktrace as core[18c8dd30382e7099]::fmt::Display>::fmt
vllm-rs-svc0  |    1:     0x55aa818be51a - core[18c8dd30382e7099]::fmt::write
vllm-rs-svc0  |    2:     0x55aa818a9d32 - <std[52919eca6bce4da3]::sys::stdio::unix::Stderr as std[52919eca6bce4da3]::io::Write>::write_fmt
vllm-rs-svc0  |    3:     0x55aa8187c0ef - std[52919eca6bce4da3]::panicking::default_hook::{closure#0}
vllm-rs-svc0  |    4:     0x55aa818987e1 - std[52919eca6bce4da3]::panicking::default_hook
vllm-rs-svc0  |    5:     0x55aa81898a5b - std[52919eca6bce4da3]::panicking::panic_with_hook
vllm-rs-svc0  |    6:     0x55aa8187c1da - std[52919eca6bce4da3]::panicking::panic_handler::{closure#0}
vllm-rs-svc0  |    7:     0x55aa818709d9 - std[52919eca6bce4da3]::sys::backtrace::__rust_end_short_backtrace::<std[52919eca6bce4da3]::panicking::panic_handler::{closure#0}, !>
vllm-rs-svc0  |    8:     0x55aa8187d36d - __rustc[8068f81614cfe5c]::rust_begin_unwind
vllm-rs-svc0  |    9:     0x55aa818becbc - core[18c8dd30382e7099]::panicking::panic_fmt
vllm-rs-svc0  |   10:     0x55aa80478065 - xinfer::models::layers::moe::FusedMoe::new_with_gate::h0a5b9e8671ec3593
vllm-rs-svc0  |   11:     0x55aa80478a60 - xinfer::models::layers::moe::FusedMoe::new::h8f1f8d37ce189be3
vllm-rs-svc0  |   12:     0x55aa804a7df1 - xinfer::models::qwen3_5_mtp::Qwen3_5MtpHead::new::hca61aceeea96f95e
vllm-rs-svc0  |   13:     0x55aa8042fb2d - xinfer::core::runner::ModelRunner::new::hcf474f20453f1a9d

Invalid quantization format! from ::Qwen3_5MtpHead::new::hca61aceeea96f95

@guoqingbao

guoqingbao commented Jun 4, 2026

Owner Author

Some quantizations seem to leave the MTP head at 16b - not sure exactly what nvidia/Qwen3.5-397B-A17B-NVFP4 uses but it crashes in current state with

There is a bug: we assume the MTP layer has the same quantization format as the main model, but it does not. So it currently only works with unquantized or FP8 models. Another issue that needs fixing is Mamba cache alignment under prefix cache.
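
One way to avoid inheriting the main model's format is to decide it per module from that module's own stored weight. The sketch below is an illustration under loud assumptions: the dtype strings follow the safetensors header naming, treating `U8` as packed 4-bit weights is a guess about how such checkpoints store them, and `WeightFormat`, `format_from_dtype`, and `module_format` are hypothetical names rather than xinfer APIs.

```rust
use std::collections::HashMap;

/// Weight storage formats distinguished by this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum WeightFormat {
    Unquantized,
    Fp8,
    PackedFp4,
}

/// Map a stored dtype name to a weight format.
/// Assumption: packed 4-bit weights appear as `U8` in the checkpoint.
fn format_from_dtype(dtype: &str) -> Option<WeightFormat> {
    match dtype {
        "BF16" | "F16" | "F32" => Some(WeightFormat::Unquantized),
        "F8_E4M3" => Some(WeightFormat::Fp8),
        "U8" => Some(WeightFormat::PackedFp4),
        _ => None,
    }
}

/// Pick the format for one module from its own weight tensor instead of
/// assuming it matches the rest of the model. `dtypes` maps tensor names to
/// dtype names; `None` means the tensor is missing or its dtype is unknown.
fn module_format(dtypes: &HashMap<String, String>, weight_name: &str) -> Option<WeightFormat> {
    dtypes.get(weight_name).and_then(|d| format_from_dtype(d))
}
```

With a lookup like this, an MTP head left at 16 bits inside an otherwise quantized checkpoint would be built as an unquantized layer, and a head that is absent would show up as `None` before any layer construction is attempted.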

@sempervictus

Contributor

There is a bug: we assume the MTP layer has the same quantization format as the main model, but it does not. So it currently only works with unquantized models. Another issue that needs fixing is Mamba cache alignment under prefix cache.

Will keep an eye out for the cache alignment commit and can test against a 122B FP8 while the quantization format assumption exists
