Support mxfp4 and nvfp4 models #285
|
Any good test cases to try? I'm working on MTP for Qwen35/Qwen3Next (it's a lot easier than full draft, which didn't go so well; I'll need to learn more and probably ask for help, since the MLA models can use EAGLE), but I should be in a good place to test in the next hour or so (and hopefully have a PR for review). |
Drafted on my Mac, will test it shortly on Hopper. |
|
I was under the impression that FP4 needed Blackwell. Happy to test; I can try to find a Qwen3.5 mxfp4 model so I can test MTP and this together. |
If hardware acceleration is not used, the mxfp4 model can even work well on a V100. |
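For readers wondering why a software path can run on pre-Blackwell GPUs: MXFP4 decoding is a table lookup plus a power-of-two scale, so it needs no FP4 hardware. The sketch below is illustrative only and is not the kernel from this PR. It follows the OCP Microscaling layout (32 E2M1 elements sharing one E8M0 scale); the function names and the low-nibble-first packing order are assumptions.

```python
import math

# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # elements that share one E8M0 scale in MXFP4


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code to a float."""
    magnitude = E2M1_VALUES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e8m0(scale_byte: int) -> float:
    """Decode an E8M0 scale: 2^(e - 127), with 0xFF reserved for NaN."""
    if scale_byte == 0xFF:
        return math.nan
    return 2.0 ** (scale_byte - 127)


def dequantize_block(packed: bytes, scale_byte: int) -> list[float]:
    """Dequantize one block of 32 FP4 values stored as 16 packed bytes."""
    if len(packed) != BLOCK_SIZE // 2:
        raise ValueError(f"expected {BLOCK_SIZE // 2} bytes, got {len(packed)}")
    scale = decode_e8m0(scale_byte)
    out = []
    for byte in packed:
        # Assumption: the low nibble holds the earlier element.
        out.append(decode_e2m1(byte & 0x0F) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

A real kernel would fuse this decode into the matmul rather than materializing the dequantized weights, but the arithmetic per element is the same.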
That's the best thing i've heard all day. Thanks :-) |
|
Pulling
|
|
Similarly, V100s do not compile:
|
It couldn't compile when you tested the draft. I have it working now (tested on Hopper, no hardware acceleration). On Blackwell it uses the flashinfer path, and I'm not sure whether that works too. |
|
@sempervictus I've optimized the decoding, here is the performance on Hopper for Qwen3-30B-A3B-MXFP4A16:
|
|
Will rebuild after next call :-) |
|
Catching this on ...
Multimodel model Some(["Qwen3_5ForConditionalGeneration"]) detected!
vllm-rs-svc2 | Error: missing field `quant_method`
vllm-rs-svc2 | 0: candle_core::error::Error::bt
vllm-rs-svc2 | 1: vllm_rs::utils::merge_multimodal_top_level_config
vllm-rs-svc2 | 2: vllm_rs::utils::init_config_tokenizer
vllm-rs-svc2 | 3: vllm_rs::core::engine::LLMEngine::new
vllm-rs-svc2 | 4: vllm_rs::main::{{closure}}
vllm-rs-svc2 | 5: tokio::runtime::context::runtime::enter_runtime
vllm-rs-svc2 | 6: tokio::runtime::runtime::Runtime::block_on
vllm-rs-svc2 | 7: vllm_rs::main
vllm-rs-svc2 | 8: std::sys::backtrace::__rust_begin_short_backtrace
vllm-rs-svc2 | 9: std::rt::lang_start::{{closure}}
vllm-rs-svc2 | 10: std::rt::lang_start_internal
vllm-rs-svc2 | 11: main
vllm-rs-svc2 | 12: <unknown>
vllm-rs-svc2 | 13: __libc_start_main
vllm-rs-svc2 | 14: _start
vllm-rs-svc2 |
vllm-rs-svc2 exited with code 1
Might be a collision with #287 though, so pulling that out of the test stack and rebuilding. |
This is used for mlx inference, not a standard mxfp4 model. |
|
Ah 🤦 - will try to find a non-MLX one |
|
Getting
vllm-rs-svc0 | 2026-04-02T14:50:13.980464Z WARN vllm_rs::utils: Multimodel model Some(["Qwen3_5MoeForConditionalGeneration"]) detected!
vllm-rs-svc0 |
vllm-rs-svc0 | thread 'main' (1) panicked at src/utils/mod.rs:1005:13:
vllm-rs-svc0 | Invalid quantization format! Only `gptq`, `awq`, `fp8` and `mxfp4` supported, got `quark`
vllm-rs-svc0 | stack backtrace:
vllm-rs-svc0 | 0: 0x5651256cf0b2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
vllm-rs-svc0 | 1: 0x5651256e499a - core::fmt::write::hed7b5c73d82ecb7c
vllm-rs-svc0 | 2: 0x565125696ae6 - std::io::Write::write_fmt::h6f0185aecf0ed75f
vllm-rs-svc0 | 3: 0x5651256ab329 - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
vllm-rs-svc0 | 4: 0x5651256ab189 - std::panicking::default_hook::hf0ea8939246f43a9
vllm-rs-svc0 | 5: 0x5651256ab61b - std::panicking::panic_with_hook::hb4bd9ac1123582a0
vllm-rs-svc0 | 6: 0x5651256ab3e8 - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
vllm-rs-svc0 | 7: 0x5651256a5319 - std::sys::backtrace::__rust_end_short_backtrace::hb72197fa777c1785
vllm-rs-svc0 | 8: 0x565125689c2d - __rustc[4425a7e20b4c8619]::rust_begin_unwind
vllm-rs-svc0 | 9: 0x5651256eff5c - core::panicking::panic_fmt::ha59b517dd231f4da
vllm-rs-svc0 | 10: 0x56512477da3c - vllm_rs::utils::init_config_tokenizer::hbf4dac29d3d83e3a
vllm-rs-svc0 | 11: 0x5651244e6575 - vllm_rs::core::engine::LLMEngine::new::h5762c1e83748c450
vllm-rs-svc0 | 12: 0x5651242d7158 - vllm_rs::main::{{closure}}::hae563abfdf0565ee
vllm-rs-svc0 | 13: 0x5651242d12a7 - tokio::runtime::context::runtime::enter_runtime::h7550c2b2fa15f4b7
vllm-rs-svc0 | 14: 0x565124311179 - tokio::runtime::runtime::Runtime::block_on::hccb8530d2c14b7e0
vllm-rs-svc0 | 15: 0x5651242d4619 - vllm_rs::main::h2f9b88b9f462d808
vllm-rs-svc0 | 16: 0x5651243f1726 - std::sys::backtrace::__rust_begin_short_backtrace::h27e44127a90ae7f5
vllm-rs-svc0 | 17: 0x565124372385 - std::rt::lang_start::{{closure}}::h1c2ae6c52f7dcb05
vllm-rs-svc0 | 18: 0x565125698e86 - std::rt::lang_start_internal::h9f282d832ae47dd5
vllm-rs-svc0 | 19: 0x5651242e3545 - main
vllm-rs-svc0 | 20: 0x7f3d4cbe5d90 - <unknown>
vllm-rs-svc0 | 21: 0x7f3d4cbe5e40 - __libc_start_main
vllm-rs-svc0 | 22: 0x565124230b25 - _start
vllm-rs-svc0 | 23: 0x0 - <unknown>
from |
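Both failures so far (`missing field quant_method` and `Invalid quantization format!`) come from the checkpoint's config, so they can be caught before pulling a multi-gigabyte model. Below is a minimal pre-flight sketch. The supported set is copied from the panic message above; the helper names are hypothetical, and looking one level down (for example under `text_config`) for `quantization_config` is an assumption about where multimodal checkpoints keep it.

```python
# Formats named in the loader's panic message in this thread.
SUPPORTED_QUANT_METHODS = {"gptq", "awq", "fp8", "mxfp4"}


def find_quant_configs(config: dict) -> list[dict]:
    """Collect quantization_config blocks at the top level and one level down."""
    nested = [value for value in config.values() if isinstance(value, dict)]
    found = []
    for node in [config, *nested]:
        qcfg = node.get("quantization_config")
        if isinstance(qcfg, dict):
            found.append(qcfg)
    return found


def check_quant_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for qcfg in find_quant_configs(config):
        method = qcfg.get("quant_method")
        if method is None:
            keys = ", ".join(sorted(qcfg))
            problems.append(f"quantization_config has no quant_method (keys: {keys})")
        elif method not in SUPPORTED_QUANT_METHODS:
            problems.append(f"unsupported quant_method {method!r}")
    return problems
```

Usage would be along the lines of `check_quant_config(json.load(open("config.json")))` on the downloaded config file before fetching the weights.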
|
V100 build still failing w/
c-sections" "-pie" "-Wl,-z,relro,-z,now" "-Wl,-O1" "-Wl,--strip-debug" "-nodefaultlibs"
95.57 = note: some arguments are omitted. use `--verbose` to show all linker arguments
95.57 = note: rust-lld: error: undefined symbol: mxfp4_matmul_smallm_bf16
95.57 >>> referenced by attention_rs.87a1e8bca058f2d2-cgu.08
95.57 >>> attention_rs-6cc4b20ace5bb03a.attention_rs.87a1e8bca058f2d2-cgu.08.rcgu.o:(attention_rs::mxfp4_linear::mxfp4_matmul::hf5c0f5bd0b9d3685) in archive /vllm.rs/target/release/deps/libattention_rs-6cc4b20ace5bb03a.rlib
95.57 >>> did you mean: mxfp4_matmul_smallm_f16
95.57 >>> defined in: /vllm.rs/target/release/build/kernels-2d474296a48f5f72/out/libpagedattention.a(mxfp4_gemm-8a897dfa7eabfb51.o)
95.57
95.57 rust-lld: error: undefined symbol: mxfp4_matmul_wmma_bf16
95.57 >>> referenced by attention_rs.87a1e8bca058f2d2-cgu.08
95.57 >>> attention_rs-6cc4b20ace5bb03a.attention_rs.87a1e8bca058f2d2-cgu.08.rcgu.o:(attention_rs::mxfp4_linear::mxfp4_matmul::hf5c0f5bd0b9d3685) in archive /vllm.rs/target/release/deps/libattention_rs-6cc4b20ace5bb03a.rlib
95.57 >>> did you mean: mxfp4_matmul_wmma_f16
95.57 >>> defined in: /vllm.rs/target/release/build/kernels-2d474296a48f5f72/out/libpagedattention.a(mxfp4_gemm_wmma-f5a159591facd2a0.o)
95.57
95.57 rust-lld: error: undefined symbol: mxfp4_moe_grouped_gemm_wmma_bf16
95.57 >>> referenced by attention_rs.87a1e8bca058f2d2-cgu.08
95.57 >>> attention_rs-6cc4b20ace5bb03a.attention_rs.87a1e8bca058f2d2-cgu.08.rcgu.o:(attention_rs::mxfp4_linear::mxfp4_moe_gemm::hbc948a0bb53fda90) in archive /vllm.rs/target/release/deps/libattention_rs-6cc4b20ace5bb03a.rlib
95.57
95.57 rust-lld: error: undefined symbol: mxfp4_indexed_moe_gemm_bf16
95.57 >>> referenced by attention_rs.87a1e8bca058f2d2-cgu.08
95.57 >>> attention_rs-6cc4b20ace5bb03a.attention_rs.87a1e8bca058f2d2-cgu.08.rcgu.o:(attention_rs::mxfp4_linear::mxfp4_moe_gemm::hbc948a0bb53fda90) in archive /vllm.rs/target/release/deps/libattention_rs-6cc4b20ace5bb03a.rlib
95.57 collect2: error: ld returned 1 exit status
95.57
95.57
95.58 error: could not compile `vllm-rs` (bin "runner") due to 1 previous error
Think we need an fp16 branch there. |
It is quantized with a different approach, but you could try renaming "quant_method" from "quark" to "mxfp4" in config.json. |
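A small sketch of that rename, for anyone who wants to script it rather than edit by hand. It walks the whole config so that a nested `quantization_config` is also covered, and it writes a `.bak` copy first. The function names are hypothetical. The rename only changes the label: if the checkpoint's tensor names or scale layout differ from what the loader expects, loading can still fail.

```python
import json
from pathlib import Path


def rename_quant_method(node, old="quark", new="mxfp4") -> int:
    """Recursively rewrite quant_method values; return how many were changed."""
    changed = 0
    if isinstance(node, dict):
        if node.get("quant_method") == old:
            node["quant_method"] = new
            changed += 1
        for value in node.values():
            changed += rename_quant_method(value, old, new)
    elif isinstance(node, list):
        for item in node:
            changed += rename_quant_method(item, old, new)
    return changed


def patch_config(path) -> int:
    """Patch config.json in place, keeping the original next to it as config.json.bak."""
    path = Path(path)
    original = path.read_text()
    config = json.loads(original)
    changed = rename_quant_method(config)
    if changed:
        path.with_name(path.name + ".bak").write_text(original)
        path.write_text(json.dumps(config, indent=2) + "\n")
    return changed
```

Calling `patch_config("/path/to/model/config.json")` returns the number of entries rewritten, so a result of 0 means the file was left untouched.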
We have one; it's missing the dummy functions. |
|
@sempervictus Based on current vLLM and quantization ecosystem documentation, AMD Quark ("quant_method": "quark") and MXFP4 ("quant_method": "mxfp4") checkpoints are distinct quantization paths, primarily aimed at different hardware ecosystems. AMD Quark is a quantization framework developed by AMD, with support being added to vLLM specifically to enable FP8/MXFP4 quantization paths for ROCm/AMD GPUs. MXFP4 (an OCP Microscaling format) on NVIDIA GPUs typically requires specific libraries such as NVIDIA TensorRT Model Optimizer or Hugging Face's transformers MXFP4 implementation. |
|
vllm-rs-svc2 | 2026-04-02T15:02:12.473014Z INFO vllm_rs::models::qwen3_vl: Loading language model...
vllm-rs-svc2 |
vllm-rs-svc2 | thread 'main' (74) panicked at src/models/layers/wna16.rs:67:35:
vllm-rs-svc2 | attempt to divide by zero
vllm-rs-svc2 | stack backtrace:
vllm-rs-svc2 | 0: 0x556921242312 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
vllm-rs-svc2 | 1: 0x55692125545a - core::fmt::write::hed7b5c73d82ecb7c
vllm-rs-svc2 | 2: 0x55692120f376 - std::io::Write::write_fmt::h6f0185aecf0ed75f
vllm-rs-svc2 | 3: 0x556921221439 - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
vllm-rs-svc2 | 4: 0x556921221299 - std::panicking::default_hook::hf0ea8939246f43a9
vllm-rs-svc2 | 5: 0x55692122172b - std::panicking::panic_with_hook::hb4bd9ac1123582a0
vllm-rs-svc2 | 6: 0x55692122152a - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
vllm-rs-svc2 | 7: 0x55692121be09 - std::sys::backtrace::__rust_end_short_backtrace::hb72197fa777c1785
vllm-rs-svc2 | 8: 0x556921202e0d - __rustc[4425a7e20b4c8619]::rust_begin_unwind
vllm-rs-svc2 | 9: 0x55692125fa7c - core::panicking::panic_fmt::ha59b517dd231f4da
vllm-rs-svc2 | 10: 0x55692125f759 - core::panicking::panic_const::panic_const_div_by_zero::h7055d39cb8d892a6
vllm-rs-svc2 | 11: 0x556920a2c453 - vllm_rs::models::layers::wna16::WNA16::new::h6cccafa2c8abfb09
vllm-rs-svc2 | 12: 0x556920a81fbc - vllm_rs::models::layers::linear::linear_no_bias_x::h95ef5b3c91cffe25
vllm-rs-svc2 | 13: 0x556920994a57 - vllm_rs::models::layers::mlp::MLP::new::hdc00785dd2706bb0
vllm-rs-svc2 | 14: 0x556920a329b5 - vllm_rs::models::qwen3_5::Qwen3_5DecoderLayer::new::ha5035a5736a7280a
vllm-rs-svc2 | 15: 0x556920a2fcfe - vllm_rs::models::qwen3_5::Qwen3_5ForCausalLM::new_with_prefix::ha8547ccb27703ce7
vllm-rs-svc2 | 16: 0x5569209fc4c8 - vllm_rs::models::qwen3_vl::Qwen3VLForConditionalGeneration::new::h2f84af46a73059d8
vllm-rs-svc2 | 17: 0x5569209e81b5 - vllm_rs::core::runner::ModelRunner::new::h1077f7a353d57b77
vllm-rs-svc2 | 18: 0x5569208c3f41 - runner::main::he5a2645db115e166
vllm-rs-svc2 | 19: 0x556920909ce3 - std::sys::backtrace::__rust_begin_short_backtrace::h4f0204d7cbb3d317
vllm-rs-svc2 | 20: 0x55692092368d - std::rt::lang_start::{{closure}}::ha967de510420f25f
vllm-rs-svc2 | 21: 0x556921210e16 - std::rt::lang_start_internal::h9f282d832ae47dd5
vllm-rs-svc2 | 22: 0x5569208cbe35 - main
vllm-rs-svc2 | 23: 0x7f218c1d4d90 - <unknown>
vllm-rs-svc2 | 24: 0x7f218c1d4e40 - __libc_start_main
vllm-rs-svc2 | 25: 0x5569208a1c75 - _start
vllm-rs-svc2 | 26: 0x0 - <unknown>
vllm-rs-svc2 |
vllm-rs-svc2 | thread 'main' (73) panicked at src/models/layers/wna16.rs:67:35:
vllm-rs-svc2 | attempt to divide by zero
vllm-rs-svc2 | stack backtrace:
vllm-rs-svc2 |
vllm-rs-svc2 | thread 'main' (76) panicked at src/models/layers/wna16.rs:67:35:
vllm-rs-svc2 | attempt to divide by zero
vllm-rs-svc2 | stack backtrace:
vllm-rs-svc2 |
vllm-rs-svc2 | thread 'main' (75) panicked at src/models/layers/wna16.rs:67:35:
vllm-rs-svc2 | attempt to divide by zero
vllm-rs-svc2 | stack backtrace:
vllm-rs-svc2 | [interleaved output from three ranks panicking concurrently; each backtrace repeats the frames shown above, from core::panicking::panic_const::panic_const_div_by_zero through vllm_rs::models::layers::wna16::WNA16::new up to runner::main]
vllm-rs-svc2 |
vllm-rs-svc2 | thread '<unnamed>' (179) panicked at src/utils/progress.rs:106:17:
vllm-rs-svc2 | Error when loading model!
vllm-rs-svc2 | stack backtrace:
vllm-rs-svc2 | 0: 0x559c5a199c72 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
vllm-rs-svc2 | 1: 0x559c5a1af55a - core::fmt::write::hed7b5c73d82ecb7c
vllm-rs-svc2 | 2: 0x559c5a1616a6 - std::io::Write::write_fmt::h6f0185aecf0ed75f
vllm-rs-svc2 | 3: 0x559c5a175ee9 - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
vllm-rs-svc2 | 4: 0x559c5a175d49 - std::panicking::default_hook::hf0ea8939246f43a9
vllm-rs-svc2 | 5: 0x559c5a1761db - std::panicking::panic_with_hook::hb4bd9ac1123582a0
vllm-rs-svc2 | 6: 0x559c5a175fda - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
vllm-rs-svc2 | 7: 0x559c5a16fed9 - std::sys::backtrace::__rust_end_short_backtrace::hb72197fa777c1785
vllm-rs-svc2 | 8: 0x559c5a1547ed - __rustc[4425a7e20b4c8619]::rust_begin_unwind
vllm-rs-svc2 | 9: 0x559c5a1bab1c - core::panicking::panic_fmt::ha59b517dd231f4da
vllm-rs-svc2 | 10: 0x559c58ee88a5 - <vllm_rs::utils::progress::RemoteProgressReporter as vllm_rs::utils::progress::ProgressLike>::get_progress::h526a7595287f5e14
vllm-rs-svc2 | 11: 0x559c58f26e89 - std::sys::backtrace::__rust_begin_short_backtrace::h19afc98525959cb4
vllm-rs-svc2 | 12: 0x559c5927767e - core::ops::function::FnOnce::call_once{{vtable.shim}}::h3f586c80ff2d26d7
vllm-rs-svc2 | 13: 0x559c5a16ae5f - std::sys::thread::unix::Thread::new::thread_start::h982f9ea829d1b5fb
vllm-rs-svc2 | 14: 0x7f0363cf4ac3 - <unknown>
vllm-rs-svc2 | 15: 0x7f0363d85a74 - clone
vllm-rs-svc2 | 16: 0x0 - <unknown>
vllm-rs-svc2 | Error: failed to fill whole buffer
vllm-rs-svc2 exited with code 1 |
|
Interesting, same effect from V100s built correctly, pulling model now. |
|
V100 trying to run
vllm-rs-svc1 |
vllm-rs-svc1 | thread 'main' (90) panicked at src/models/layers/wna16.rs:67:35:
vllm-rs-svc1 | attempt to divide by zero
vllm-rs-svc1 | stack backtrace:
vllm-rs-svc1 | 0: 0x5641f3bade22 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h1851ca2a850bd9a9
vllm-rs-svc1 | 1: 0x5641f3bc0cf7 - core::fmt::write::h22467d3ad5dd5554
vllm-rs-svc1 | 2: 0x5641f3b7af06 - std::io::Write::write_fmt::h5e3b6a876f7a20bf
vllm-rs-svc1 | 3: 0x5641f3b8ce39 - std::panicking::default_hook::{{closure}}::he43c3ac33dfa4b50
vllm-rs-svc1 | 4: 0x5641f3b8cc99 - std::panicking::default_hook::hd124da54acf1152f
vllm-rs-svc1 | 5: 0x5641f3b8d12b - std::panicking::panic_with_hook::h9b5f1f19954f65a8
vllm-rs-svc1 | 6: 0x5641f3b8cf2a - std::panicking::panic_handler::{{closure}}::hf431df8c849ee0d6
vllm-rs-svc1 | 7: 0x5641f3b87859 - std::sys::backtrace::__rust_end_short_backtrace::hf97362b31a346cc0
vllm-rs-svc1 | 8: 0x5641f3b6ed5d - __rustc[9e6a08e89e4b9111]::rust_begin_unwind
vllm-rs-svc1 | 9: 0x5641f3bcb44c - core::panicking::panic_fmt::ha4414e4328fe24a0
vllm-rs-svc1 | 10: 0x5641f3bcb129 - core::panicking::panic_const::panic_const_div_by_zero::h9a45e37423e1c559
vllm-rs-svc1 | 11: 0x5641f33c9ff8 - vllm_rs::models::layers::wna16::WNA16::new::h19f406cbb70e7b87
vllm-rs-svc1 | 12: 0x5641f33482cc - vllm_rs::models::layers::linear::linear_no_bias_x::h4e845ecd41194d58
vllm-rs-svc1 | 13: 0x5641f33c5d23 - vllm_rs::models::layers::mlp::MLP::new::h999ec0615f659689
vllm-rs-svc1 | 14: 0x5641f34b3396 - vllm_rs::models::qwen3_5::Qwen3_5DecoderLayer::new::h70eb6a07201339b6
vllm-rs-svc1 | 15: 0x5641f34b0819 - vllm_rs::models::qwen3_5::Qwen3_5ForCausalLM::new_with_prefix::h7a011929eda8e6c8
vllm-rs-svc1 | 16: 0x5641f33d7703 - vllm_rs::models::qwen3_vl::Qwen3VLForConditionalGeneration::new::h7fbf1e1de19193ae
vllm-rs-svc1 | 17: 0x5641f3323f6c - vllm_rs::core::runner::ModelRunner::new::h9eb68c33c6a166dc
vllm-rs-svc1 | 18: 0x5641f322bf84 - runner::main::h3be843f7a926180f
vllm-rs-svc1 | 19: 0x5641f32829b3 - std::sys::backtrace::__rust_begin_short_backtrace::h8d990b0e738845ea
vllm-rs-svc1 | 20: 0x5641f329e56d - std::rt::lang_start::{{closure}}::hede23b9634dc3de1
vllm-rs-svc1 | 21: 0x5641f3b7c9a6 - std::rt::lang_start_internal::hb84cc625940d332a
vllm-rs-svc1 | 22: 0x5641f3233e85 - main
vllm-rs-svc1 | 23: 0x7f834b6edd90 - <unknown>
vllm-rs-svc1 | 24: 0x7f834b6ede40 - __libc_start_main
vllm-rs-svc1 | 25: 0x5641f3209695 - _start
vllm-rs-svc1 | 26: 0x0 - <unknown>
vllm-rs-svc1 |
vllm-rs-svc1 | thread 'main' (89) panicked at src/models/layers/wna16.rs:67:35:
vllm-rs-svc1 | attempt to divide by zero
vllm-rs-svc1 | stack backtrace:
vllm-rs-svc1 | 0: 0x559fdbe82e22 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h1851ca2a850bd9a9
vllm-rs-svc1 | 1: 0x559fdbe95cf7 - core::fmt::write::h22467d3ad5dd5554
vllm-rs-svc1 | 2: 0x559fdbe4ff06 - std::io::Write::write_fmt::h5e3b6a876f7a20bf
vllm-rs-svc1 | 3: 0x559fdbe61e39 - std::panicking::default_hook::{{closure}}::he43c3ac33dfa4b50
vllm-rs-svc1 | 4: 0x559fdbe61c99 - std::panicking::default_hook::hd124da54acf1152f
vllm-rs-svc1 | 5: 0x559fdbe6212b - std::panicking::panic_with_hook::h9b5f1f19954f65a8
vllm-rs-svc1 | 6: 0x559fdbe61f2a - std::panicking::panic_handler::{{closure}}::hf431df8c849ee0d6
vllm-rs-svc1 | 7: 0x559fdbe5c859 - std::sys::backtrace::__rust_end_short_backtrace::hf97362b31a346cc0
vllm-rs-svc1 | 8: 0x559fdbe43d5d - __rustc[9e6a08e89e4b9111]::rust_begin_unwind
vllm-rs-svc1 | 9: 0x559fdbea044c - core::panicking::panic_fmt::ha4414e4328fe24a0
vllm-rs-svc1 | 10: 0x559fdbea0129 - core::panicking::panic_const::panic_const_div_by_zero::h9a45e37423e1c559
vllm-rs-svc1 | 11: 0x559fdb69eff8 - vllm_rs::models::layers::wna16::WNA16::new::h19f406cbb70e7b87
vllm-rs-svc1 | 12: 0x559fdb61d2cc - vllm_rs::models::layers::linear::linear_no_bias_x::h4e845ecd41194d58
vllm-rs-svc1 | 13: 0x559fdb69ad23 - vllm_rs::models::layers::mlp::MLP::new::h999ec0615f659689
vllm-rs-svc1 | 14: 0x559fdb788396 - vllm_rs::models::qwen3_5::Qwen3_5DecoderLayer::new::h70eb6a07201339b6
vllm-rs-svc1 | 15: 0x559fdb785819 - vllm_rs::models::qwen3_5::Qwen3_5ForCausalLM::new_with_prefix::h7a011929eda8e6c8
vllm-rs-svc1 | 16: 0x559fdb6ac703 - vllm_rs::models::qwen3_vl::Qwen3VLForConditionalGeneration::new::h7fbf1e1de19193ae
vllm-rs-svc1 | 17: 0x559fdb5f8f6c - vllm_rs::core::runner::ModelRunner::new::h9eb68c33c6a166dc
vllm-rs-svc1 | 18: 0x559fdb500f84 - runner::main::h3be843f7a926180f
vllm-rs-svc1 | 19: 0x559fdb5579b3 - std::sys::backtrace::__rust_begin_short_backtrace::h8d990b0e738845ea
vllm-rs-svc1 | 20: 0x559fdb57356d - std::rt::lang_start::{{closure}}::hede23b9634dc3de1
vllm-rs-svc1 | 21: 0x559fdbe519a6 - std::rt::lang_start_internal::hb84cc625940d332a
vllm-rs-svc1 | 22: 0x559fdb508e85 - main
vllm-rs-svc1 | 23: 0x7f410ffd4d90 - <unknown>
vllm-rs-svc1 | 24: 0x7f410ffd4e40 - __libc_start_main
vllm-rs-svc1 | 25: 0x559fdb4de695 - _start
vllm-rs-svc1 | 26: 0x0 - <unknown>
vllm-rs-svc1 |
vllm-rs-svc1 | thread '<unnamed>' (187) panicked at src/utils/progress.rs:106:17:
vllm-rs-svc1 | Error when loading model!
vllm-rs-svc1 | stack backtrace:
vllm-rs-svc1 | 0: 0x556aa386fce2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h1851ca2a850bd9a9
vllm-rs-svc1 | 1: 0x556aa38852f7 - core::fmt::write::h22467d3ad5dd5554
vllm-rs-svc1 | 2: 0x556aa3837656 - std::io::Write::write_fmt::h5e3b6a876f7a20bf
vllm-rs-svc1 | 3: 0x556aa384bdd9 - std::panicking::default_hook::{{closure}}::he43c3ac33dfa4b50
vllm-rs-svc1 | 4: 0x556aa384bc39 - std::panicking::default_hook::hd124da54acf1152f
vllm-rs-svc1 | 5: 0x556aa384c0cb - std::panicking::panic_with_hook::h9b5f1f19954f65a8
vllm-rs-svc1 | 6: 0x556aa384beca - std::panicking::panic_handler::{{closure}}::hf431df8c849ee0d6
vllm-rs-svc1 | 7: 0x556aa3845e29 - std::sys::backtrace::__rust_end_short_backtrace::hf97362b31a346cc0
vllm-rs-svc1 | 8: 0x556aa382ab6d - __rustc[9e6a08e89e4b9111]::rust_begin_unwind
vllm-rs-svc1 | 9: 0x556aa3890a8c - core::panicking::panic_fmt::ha4414e4328fe24a0
vllm-rs-svc1 | 10: 0x556aa26353c4 - <vllm_rs::utils::progress::RemoteProgressReporter as vllm_rs::utils::progress::ProgressLike>::get_progress::hf20e43804cee2510
vllm-rs-svc1 | 11: 0x556aa2636ab7 - std::sys::backtrace::__rust_begin_short_backtrace::h8103b01699aca63e
vllm-rs-svc1 | 12: 0x556aa29806be - core::ops::function::FnOnce::call_once{{vtable.shim}}::h7d67afec98b95d57
vllm-rs-svc1 | 13: 0x556aa384138f - std::sys::thread::unix::Thread::new::thread_start::hc71bde616ea8b6e9
vllm-rs-svc1 | 14: 0x7fdf7d714ac3 - <unknown>
vllm-rs-svc1 | 15: 0x7fdf7d7a5a04 - clone
vllm-rs-svc1 | 16: 0x0 - <unknown>
vllm-rs-svc1 | Error: failed to fill whole buffer
Building on SM89 to test that as well. |
I only tested this model: nm-testing/Qwen3-30B-A3B-MXFP4A16. The loader still needs to be adapted to other mxfp4 quantization metadata layouts. |
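The `attempt to divide by zero` panics in `WNA16::new` are what one would expect if a layout parameter such as `bits` or `group_size` came back as 0 from metadata the loader did not recognize. The thread does not confirm that this is the cause, so treat it as a guess. The sketch below shows the kind of guard that turns such a panic into a readable error; the function name and parameters are hypothetical and do not mirror the code in `wna16.rs`. The `32 // bits` packing factor is the usual GPTQ/AWQ convention of packing values into 32-bit words.

```python
def wna16_layout(in_features: int, bits: int, group_size: int) -> tuple[int, int]:
    """Return (pack_factor, num_groups) or raise a descriptive error.

    pack_factor: quantized values packed into one 32-bit word.
    num_groups:  scale groups along the input dimension.
    """
    if bits <= 0 or 32 % bits != 0:
        raise ValueError(f"bits must be a positive divisor of 32, got {bits}")
    if group_size <= 0:
        raise ValueError(f"group_size must be positive, got {group_size}")
    if in_features % group_size != 0:
        raise ValueError(
            f"in_features ({in_features}) is not a multiple of group_size ({group_size})"
        )
    return 32 // bits, in_features // group_size
```

With a check like this at load time, a checkpoint whose metadata is missing or uses different field names fails with a message naming the offending value instead of a bare arithmetic panic.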
|
So this is weird: SM89 throws a completely different error for
Apr 02 13:14:33 unknown vllm-rs[3218650]: 2026-04-02T17:14:33.244308Z INFO runner: Loading model at rank 0
Apr 02 13:14:33 unknown vllm-rs[3218650]: 2026-04-02T17:14:33.244322Z INFO vllm_rs::utils::progress: Remote progress reporter initialized for rank 0
Apr 02 13:14:35 unknown vllm-rs[3218568]: thread '<unnamed>' (3218706) panicked at src/utils/progress.rs:106:17:
Apr 02 13:14:35 unknown vllm-rs[3218568]: Error when loading model!
Apr 02 13:14:35 unknown vllm-rs[3218568]: stack backtrace:
Apr 02 13:14:35 unknown vllm-rs[3218568]: 0: 0x562f3312a732 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::hac1d885928ba8582
Apr 02 13:14:35 unknown vllm-rs[3218568]: 1: 0x562f3313fd47 - core::fmt::write::h83ebb4d32483be9e
Apr 02 13:14:35 unknown vllm-rs[3218568]: 2: 0x562f330f20a6 - std::io::Write::write_fmt::ha6a1d6c1ea64b2d0
Apr 02 13:14:35 unknown vllm-rs[3218568]: 3: 0x562f33106829 - std::panicking::default_hook::{{closure}}::h8e9c4d1276f0925f
Apr 02 13:14:35 unknown vllm-rs[3218568]: 4: 0x562f33106689 - std::panicking::default_hook::h2b2078d38b534dfb
Apr 02 13:14:35 unknown vllm-rs[3218568]: 5: 0x562f33106b1b - std::panicking::panic_with_hook::h39b739724e701bfd
Apr 02 13:14:35 unknown vllm-rs[3218568]: 6: 0x562f3310691a - std::panicking::panic_handler::{{closure}}::he540c4833054e458
Apr 02 13:14:35 unknown vllm-rs[3218568]: 7: 0x562f33100879 - std::sys::backtrace::__rust_end_short_backtrace::hfa179d89deec8aed
Apr 02 13:14:35 unknown vllm-rs[3218568]: 8: 0x562f330e55bd - __rustc[d131491b17107b07]::rust_begin_unwind
Apr 02 13:14:35 unknown vllm-rs[3218568]: 9: 0x562f3314b4dc - core::panicking::panic_fmt::ha564519d657d9c46
Apr 02 13:14:35 unknown vllm-rs[3218568]: 10: 0x562f31f0ffc4 - <vllm_rs::utils::progress::RemoteProgressReporter as vllm_rs::utils::progress::ProgressLike>::get_progress::h3cd093e99ee77d74
Apr 02 13:14:35 unknown vllm-rs[3218568]: 11: 0x562f32064699 - std::sys::backtrace::__rust_begin_short_backtrace::he69e84ef4bad6211
Apr 02 13:14:35 unknown vllm-rs[3218568]: 12: 0x562f3206852d - core::ops::function::FnOnce::call_once{{vtable.shim}}::h1ce6bf15df09b8ec
Apr 02 13:14:35 unknown vllm-rs[3218568]: 13: 0x562f330fbddf - std::sys::thread::unix::Thread::new::thread_start::h45cc87bb053add0f
Apr 02 13:14:35 unknown vllm-rs[3218568]: 14: 0x7f7b75f7e97a - <unknown>
Apr 02 13:14:35 unknown vllm-rs[3218568]: 15: 0x7f7b760022bc - <unknown>
Apr 02 13:14:35 unknown vllm-rs[3218568]: 16: 0x0 - <unknown>
Apr 02 13:14:35 unknown vllm-rs[3218650]: Error: cannot find tensor model.embed_tokens.weight
Apr 02 13:14:35 unknown vllm-rs[3218650]: 0: candle_core::error::Error::bt
Apr 02 13:14:35 unknown vllm-rs[3218650]: 1: candle_core::safetensors::MmapedSafetensors::get
Apr 02 13:14:35 unknown vllm-rs[3218650]: 2: candle_core::safetensors::MmapedSafetensors::load
Apr 02 13:14:35 unknown vllm-rs[3218650]: 3: <candle_core::safetensors::MmapedSafetensors as candle_nn::var_builder::SimpleBackend>::get
Apr 02 13:14:35 unknown vllm-rs[3218650]: 4: <candle_nn::var_builder::ShardedSafeTensors as candle_nn::var_builder::Backend>::get
Apr 02 13:14:35 unknown vllm-rs[3218650]: 5: candle_nn::var_builder::VarBuilderArgs<B>::get_with_hints_dtype
Apr 02 13:14:35 unknown vllm-rs[3218650]: 6: vllm_rs::models::layers::others::embedding
Apr 02 13:14:35 unknown vllm-rs[3218650]: 7: vllm_rs::models::qwen3_5::Qwen3_5ForCausalLM::new_with_prefix
Apr 02 13:14:35 unknown vllm-rs[3218650]: 8: vllm_rs::core::runner::ModelRunner::new
Apr 02 13:14:35 unknown vllm-rs[3218650]: 9: runner::main
Apr 02 13:14:35 unknown vllm-rs[3218650]: 10: std::sys::backtrace::__rust_begin_short_backtrace
Apr 02 13:14:35 unknown vllm-rs[3218650]: 11: std::rt::lang_start::{{closure}}
Apr 02 13:14:35 unknown vllm-rs[3218650]: 12: std::rt::lang_start_internal
Apr 02 13:14:35 unknown vllm-rs[3218650]: 13: main
Apr 02 13:14:35 unknown vllm-rs[3218650]: 14: <unknown>
Apr 02 13:14:35 unknown vllm-rs[3218650]: 15: __libc_start_main
Apr 02 13:14:35 unknown vllm-rs[3218650]: 16: _start
Apr 02 13:14:35 unknown vllm-rs[3218650]: Stack backtrace:
Apr 02 13:14:35 unknown vllm-rs[3218650]: 0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
Apr 02 13:14:35 unknown vllm-rs[3218650]: 1: runner::main
Apr 02 13:14:35 unknown vllm-rs[3218650]: 2: std::sys::backtrace::__rust_begin_short_backtrace
Apr 02 13:14:35 unknown vllm-rs[3218650]: 3: std::rt::lang_start::{{closure}}
Apr 02 13:14:35 unknown vllm-rs[3218650]: 4: std::rt::lang_start_internal
Apr 02 13:14:35 unknown vllm-rs[3218650]: 5: main
Apr 02 13:14:35 unknown vllm-rs[3218650]: 6: <unknown>
Apr 02 13:14:35 unknown vllm-rs[3218650]: 7: __libc_start_main
Apr 02 13:14:35 unknown vllm-rs[3218650]: 8: _start
Apr 02 13:14:35 unknown vllm-rs[3218568]: Error: failed to fill whole buffer
IIRC the Ada Lovelace generation has hardware MXFP8 support in the PTX intrinsics, but it's sort of like the mess between SM100/SM120/SM121 "Blackwell" gear: getting the same computational result for these primitives requires taking different code paths (with varying performance profiles). |
Taking a shot at this presently on the V100s since i have Qwen3-30 in BF16/FP32 to which i can compare. The Next/3.5 series have me spoiled with GDN benefits 😉 |
|
$ aichat -f ReadMe.md
gam gam gam am gam gam gam, ga, ga, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, g, gError: Aborted. |
|
Oddly the software path seems to produce errant output of different kinds on sm70 and 121 - something specific used from sm90 feature set? |
Yes - just main and this PR. Will also do a no-cache build to verify |
|
Clean uncached build of just this branch trying to run
vllm-rs-svc2 | thread '<unnamed>' (159) panicked at src/utils/progress.rs:106:17:
vllm-rs-svc2 | Error when loading model!
vllm-rs-svc2 | stack backtrace:
vllm-rs-svc2 | 0: 0x55da289cfcc2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
vllm-rs-svc2 | 1: 0x55da289e55aa - core::fmt::write::hed7b5c73d82ecb7c
vllm-rs-svc2 | 2: 0x55da289976f6 - std::io::Write::write_fmt::h6f0185aecf0ed75f
vllm-rs-svc2 | 3: 0x55da289abf39 - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
vllm-rs-svc2 | 4: 0x55da289abd99 - std::panicking::default_hook::hf0ea8939246f43a9
vllm-rs-svc2 | 5: 0x55da289ac22b - std::panicking::panic_with_hook::hb4bd9ac1123582a0
vllm-rs-svc2 | 6: 0x55da289ac02a - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
vllm-rs-svc2 | 7: 0x55da289a5f29 - std::sys::backtrace::__rust_end_short_backtrace::hb72197fa777c1785
vllm-rs-svc2 | 8: 0x55da2898a83d - __rustc[4425a7e20b4c8619]::rust_begin_unwind
vllm-rs-svc2 | 9: 0x55da289f0b6c - core::panicking::panic_fmt::ha59b517dd231f4da
vllm-rs-svc2 | 10: 0x55da2777d3e5 - <vllm_rs::utils::progress::RemoteProgressReporter as vllm_rs::utils::progress::ProgressLike>::get_progress::haf97b0aa9b4dfcb3
vllm-rs-svc2 | 11: 0x55da278e4fa9 - std::sys::backtrace::__rust_begin_short_backtrace::h43f77b025c2aa5d8
vllm-rs-svc2 | 12: 0x55da2789298e - core::ops::function::FnOnce::call_once{{vtable.shim}}::h50a3dffe9c28764f
vllm-rs-svc2 | 13: 0x55da289a0eaf - std::sys::thread::unix::Thread::new::thread_start::h982f9ea829d1b5fb
vllm-rs-svc2 | 14: 0x7f4338e70ac3 - <unknown>
vllm-rs-svc2 | 15: 0x7f4338f01a74 - clone
vllm-rs-svc2 | 16: 0x0 - <unknown>
vllm-rs-svc2 | Error: Unable to load TP-safe quantized Qwen3.5 split in_proj_qkv: Merged quantized weight is not supported at the moment, using ISQ instead!
vllm-rs-svc2 | 0: candle_core::error::Error::bt
vllm-rs-svc2 | 1: vllm_rs::models::layers::distributed::MergedParallelColumnLinear::load_merged_chunks
vllm-rs-svc2 | 2: vllm_rs::models::layers::deltanet::GatedDeltaNet::load_projection
vllm-rs-svc2 | 3: vllm_rs::models::layers::deltanet::GatedDeltaNet::new
vllm-rs-svc2 | 4: vllm_rs::models::qwen3_5::Qwen3_5DecoderLayer::new
vllm-rs-svc2 | 5: vllm_rs::models::qwen3_5::Qwen3_5ForCausalLM::new_with_prefix
vllm-rs-svc2 | 6: vllm_rs::models::qwen3_vl::Qwen3VLForConditionalGeneration::new
vllm-rs-svc2 | 7: vllm_rs::core::runner::ModelRunner::new
vllm-rs-svc2 | 8: runner::main
vllm-rs-svc2 | 9: std::sys::backtrace::__rust_begin_short_backtrace
vllm-rs-svc2 | 10: std::rt::lang_start::{{closure}}
vllm-rs-svc2 | 11: std::rt::lang_start_internal
vllm-rs-svc2 | 12: main
vllm-rs-svc2 | 13: <unknown>
vllm-rs-svc2 | 14: __libc_start_main
vllm-rs-svc2 | 15: _start
Seems like a format error reading TensorRT-produced models. I recall there being something about that format having changed at one point in a breaking manner for different versions of the py |
|
This is a tiny model; you may try running it on a single device. |
|
@guoqingbao interesting (i run the BF16 on two GPUs to test) - that worked 💥 although without the LLG PR i'm running into the "models not trained to think will vomit output forever" problem:
If that is only being caused by not being trained to reason... "i've got something for that" which i can revive in #265 (currently commented-out) out of lessons learned from q3next "reasoning" behaviors. As far as throughput - on a single SM120 that 0.8 shows
vllm-rs-svc2 |
vllm-rs-svc2 | 2026-04-04T03:47:33.798328Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 8 (7404 tokens, 58 cached blocks, raw_match=0 blocks)
vllm-rs-svc2 | 2026-04-04T03:47:33.905328Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc2 | 2026-04-04T03:47:33.905340Z WARN vllm_rs::core::runner: No generation_config, using default sampling (temperature=0.7, top_k=32, top_p=0.95)
vllm-rs-svc2 | 2026-04-04T03:47:34.252992Z INFO vllm_rs::core::engine: Prefilling [seq_id 8]: 7405 tokens in 0.50s (14959.60 tokens/s)
By the way, that run-on generation in reasoning blocks for models not trained to use them also happens with much larger models when run using naive attention on the SM70s - it gets worse as context gets longer and gets pretty pathological at 4X+ scale factors. Originally i was just cleaning those up entirely, but after #281 i've adopted a more graceful way to cope with their presence: only trimming the opening start when generation is going to replace it verbatim, to ensure positions match up on prefix search |
On Hopper, I got over 16000 tokens/s prefill speed for 35B qwen3.5 mxfp4 model. |
|
Theoretically i should be seeing more on Blackwell (once HW support is in-place) - HGX systems should decode faster but prefill on this format should be significantly better on blackwell. Will try to get the coder running on all 4 and see how it compares to the FP8 once i finish up the initial ngram push. Still need to figure out how to properly index into MTP layers (the 122B seems to have them but fails loading right now) but hoping to have a few spec decoding options shortly. Eagle after that which is a lot more complicated :-) |
|
Multi-GPU on |
Yes, it's related to sharding; I haven't been able to test multi-rank for fp4 models. |
|
Single rank might still show us what's wrong - this is on the SM121, note the 2048 input number (i gave it the CN readme and asked for an English summary):
vllm-rs-svc0 | 2026-04-04T05:38:06.064596Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 2 (2048 tokens, 0 cached blocks, raw_match=0 blocks)
vllm-rs-svc0 | 2026-04-04T05:38:10.641089Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0 | 2026-04-04T05:38:10.641109Z WARN vllm_rs::core::runner: No generation_config, using default sampling (temperature=0.7, top_k=32, top_p=0.95)
vllm-rs-svc0 | 2026-04-04T05:38:12.217076Z INFO vllm_rs::core::engine: Prefilling [seq_id 2]: 2049 tokens in 6.18s (331.55 tokens/s)
vllm-rs-svc0 | 2026-04-04T05:38:17.256135Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [2]], avg. 25 tokens/s per request (total: 25 tokens/s)
vllm-rs-svc0 | 2026-04-04T05:38:22.280030Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [2]], avg. 25 tokens/s per request (total: 25 tokens/s)
vllm-rs-svc0 | 2026-04-04T05:38:27.293948Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [2]], avg. 25 tokens/s per request (total: 25 tokens/s)
vllm-rs-svc0 | 2026-04-04T05:38:27.293970Z INFO vllm_rs::core::scheduler: GPU Kvcache: 8154 blocks (521856 tokens) free, used 0.5% (0.06GB/12.00GB); CPU swap used NaN% (NaNGB/0.00GB)
vllm-rs-svc0 | 2026-04-04T05:38:27.293974Z INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 28 slots used (3.6%), approx 0.07GB/2.02GB (slot 73.69MB)
vllm-rs-svc0 | 2026-04-04T05:38:31.837700Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 2 (2535 tokens, 39 blocks)
vllm-rs-svc0 | 2026-04-04T05:38:31.837859Z WARN vllm_rs::server::server: --- Performance Metrics ---
vllm-rs-svc0 | 2026-04-04T05:38:31.837869Z INFO vllm_rs::server::server: [Seq 2] ⏱️ Prompt: 2048 tokens in 6.18s (331.39 t/s)
vllm-rs-svc0 | 2026-04-04T05:38:31.837873Z INFO vllm_rs::server::server: [Seq 2] ⏱️ Decoded: 487 tokens in 19.62s (24.82 t/s)
vllm-rs-svc0 | 2026-04-04T05:38:31.842903Z INFO vllm_rs::core::scheduler: GPU Kvcache: 8153 blocks (521792 tokens) free, used 0.5% (0.06GB/12.00GB); CPU swap used NaN% (NaNGB/0.00GB)
vllm-rs-svc0 | 2026-04-04T05:38:31.842925Z INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 28 slots used (3.6%), approx 0.07GB/2.02GB (slot 73.69MB)
The output is the exact same mess i see on the 4-way:
|
This may not be wrong, because I count the first generated token as part of prefill. |
But the result is not garbled; the output is meaningful Chinese text related to the topic. |
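The one-token difference can be checked against the SM121 log above: the engine line reports 2049 tokens and the server metrics line reports 2048, both over the same 6.18s. A minimal sketch, assuming the elapsed time is exactly the rounded 6.18s printed in the log (the real timer value is not shown) and using a hypothetical helper `tokens_per_sec`:

```rust
/// Hypothetical helper (not part of vllm-rs): throughput in tokens per second.
fn tokens_per_sec(tokens: u32, seconds: f64) -> f64 {
    f64::from(tokens) / seconds
}

fn main() {
    // Rounded elapsed time as printed in the log; the exact value is an assumption.
    let elapsed = 6.18;
    let prompt_tokens = 2048;

    // Engine line: prompt plus the first generated token.
    let with_first_token = tokens_per_sec(prompt_tokens + 1, elapsed);
    // Server metrics line: prompt tokens only.
    let prompt_only = tokens_per_sec(prompt_tokens, elapsed);

    println!("engine: {with_first_token:.2} tokens/s");
    println!("server: {prompt_only:.2} t/s");
}
```

Both figures land on the values in the log (331.55 and 331.39), so the 2049 vs 2048 discrepancy looks like a counting convention rather than a second symptom of the truncation.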
|
Individually maybe, but the instruction provided was to summarize in english words and that was from the very start of generation. Here's the English version:
|
There are ~7K tokens in there - here's the same request dispatched to a V100:
2026-04-04T06:22:36.689361Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 241, 7248 tokens] received! (session_id: None)
2026-04-04T06:22:36.689406Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 241 (7248 tokens, 2048 cached blocks, raw_match=0 blocks)
2026-04-04T06:22:44.718354Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
2026-04-04T06:22:44.718373Z WARN vllm_rs::core::runner: Using sampling from generation_config: temp=Some(0.5), top_k=Some(20), top_p=Some(0.95), freq_penalty=Some(1.2), pres_penalty=Some(1.2)
2026-04-04T06:22:44.766080Z INFO vllm_rs::core::engine: Prefilling [seq_id 241]: 7249 tokens in 8.11s (894.06 tokens/s)
2026-04-04T06:22:49.781287Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [241]], avg. 29 tokens/s per request (total: 29 tokens/s)
2026-04-04T06:22:51.036928Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 241 (7429 tokens, 116 blocks)
2026-04-04T06:22:51.037142Z WARN vllm_rs::server::server: --- Performance Metrics ---
2026-04-04T06:22:51.037161Z INFO vllm_rs::server::server: [Seq 241] ⏱️ Prompt: 7248 tokens in 8.11s (893.93 t/s)
2026-04-04T06:22:51.037171Z INFO vllm_rs::server::server: [Seq 241] ⏱️ Decoded: 181 tokens in 6.27s (28.86 t/s)
And on the Spark it reports 2048 for both the CN and EN versions of the document, whereas a normal model reports the CN version at 7248:
vllm-rs-svc0 | 2026-04-04T05:38:06.064596Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 2 (2048 tokens, 0 cached blocks, raw_match=0 blocks)
vllm-rs-svc0 | 2026-04-04T05:38:10.641089Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0 | 2026-04-04T05:38:10.641109Z WARN vllm_rs::core::runner: No generation_config, using default sampling (temperature=0.7, top_k=32, top_p=0.95)
vllm-rs-svc0 | 2026-04-04T05:38:12.217076Z INFO vllm_rs::core::engine: Prefilling [seq_id 2]: 2049 tokens in 6.18s (331.55 tokens/s)
vllm-rs-svc0 | 2026-04-04T05:38:31.837700Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 2 (2535 tokens, 39 blocks)
vllm-rs-svc0 | 2026-04-04T05:38:31.837859Z WARN vllm_rs::server::server: --- Performance Metrics ---
vllm-rs-svc0 | 2026-04-04T05:38:31.837869Z INFO vllm_rs::server::server: [Seq 2] ⏱️ Prompt: 2048 tokens in 6.18s (331.39 t/s)
vllm-rs-svc0 | 2026-04-04T05:38:31.837873Z INFO vllm_rs::server::server: [Seq 2] ⏱️ Decoded: 487 tokens in 19.62s (24.82 t/s)
vllm-rs-svc0 | 2026-04-04T05:38:31.842903Z INFO vllm_rs::core::scheduler: GPU Kvcache: 8153 blocks (521792 tokens) free, used 0.5% (0.06GB/12.00GB); CPU swap used NaN% (NaNGB/0.00GB)
vllm-rs-svc0 | 2026-04-04T05:38:31.842925Z INFO vllm_rs::core::scheduler: GPU MambaState: 1 / 28 slots used (3.6%), approx 0.07GB/2.02GB (slot 73.69MB)
vllm-rs-svc0 | 2026-04-04T06:19:36.936110Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 3, 2048 tokens] received! (session_id: None)
vllm-rs-svc0 |
vllm-rs-svc0 | 2026-04-04T06:19:36.936538Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 3 (2048 tokens, 39 cached blocks, raw_match=0 blocks)
vllm-rs-svc0 | 2026-04-04T06:19:48.409674Z WARN vllm_rs::server::server: [Seq 3] Stream client disconnected during prefill/stream
vllm-rs-svc0 | 2026-04-04T06:19:48.491414Z ERROR vllm_rs::core::engine: Error when sending token to client [seq_id 3]
vllm-rs-svc0 | 2026-04-04T06:19:48.491453Z WARN vllm_rs::core::scheduler: Seq 3 - cancel requested (status Running) |
That might be a client issue, with different locale settings. |
|
Same client, same command, same content: one went to a q8_0 model on a V100 and showed 7k (as it does on all fp16 or 8 targets), but the nvfp4 models on either sm120 or 121 show the wrong input size and output fragments of decoded sequences. The q35 0.8 on a single rank works, but the q3n on a single rank or multi rank has that problem. |
The precision issue has been fixed in the latest commit. |
|
Unfortunately still seeing the same effect on Spark - a 7.6K token file being interpreted as exactly 2048: output is still nonsense:
|
|
Any inputs above 2048 seem to cause this - small inputs are fine like
The rust logs for those two requests are:
2026-04-05T13:42:25.090262Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 9 (2048 tokens, 87 cached blocks, raw_match=0 blocks)
2026-04-05T13:42:30.173182Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
2026-04-05T13:42:30.173200Z WARN vllm_rs::core::runner: No generation_config, using default sampling (temperature=0.7, top_k=32, top_p=0.95)
2026-04-05T13:42:31.961293Z INFO vllm_rs::core::engine: Prefilling [seq_id 9]: 2049 tokens in 6.89s (297.39 tokens/s)
2026-04-05T13:42:36.990139Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [9]], avg. 25 tokens/s per request (total: 25 tokens/s)
2026-04-05T13:42:40.448830Z WARN vllm_rs::server::server: [Seq 9] Stream client disconnected during prefill/stream
2026-04-05T13:42:40.492322Z ERROR vllm_rs::core::engine: Error when sending token to client [seq_id 9]
2026-04-05T13:42:40.492339Z WARN vllm_rs::core::scheduler: Seq 9 - cancel requested (status Running)
2026-04-05T13:42:56.813600Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 10, 113 tokens] received! (session_id: None)
2026-04-05T13:42:56.813984Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 10 (113 tokens, 87 cached blocks, raw_match=0 blocks)
2026-04-05T13:42:57.109200Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
2026-04-05T13:42:57.109217Z WARN vllm_rs::core::runner: No generation_config, using default sampling (temperature=0.7, top_k=32, top_p=0.95)
2026-04-05T13:42:57.210513Z INFO vllm_rs::core::engine: Prefilling [seq_id 10]: 114 tokens in 0.41s (274.70 tokens/s)
2026-04-05T13:43:02.236565Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [10]], avg. 24 tokens/s per request (total: 24 tokens/s)
2026-04-05T13:43:02.929812Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 10 (252 tokens, 3 blocks)
2026-04-05T13:43:02.930217Z WARN vllm_rs::server::server: --- Performance Metrics ---
2026-04-05T13:43:02.930225Z INFO vllm_rs::server::server: [Seq 10] ⏱️ Prompt: 113 tokens in 0.41s (272.29 t/s)
2026-04-05T13:43:02.930230Z INFO vllm_rs::server::server: [Seq 10] ⏱️ Decoded: 139 tokens in 5.72s (24.30 t/s) |
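The block counts in these logs are consistent with 64-token KV-cache blocks where only full blocks enter the prefix cache. A rough sketch of that arithmetic follows; the block size is inferred from the scheduler lines earlier in the thread (521856 tokens / 8154 blocks = 64) rather than taken from the source, and `full_blocks` is a hypothetical helper, not a vllm-rs function:

```rust
/// Assumed block size, inferred from the scheduler log line
/// "8154 blocks (521856 tokens) free"; not confirmed against the source.
const BLOCK_SIZE: usize = 64;

/// Hypothetical helper: number of completely filled blocks for a sequence.
fn full_blocks(tokens: usize) -> usize {
    tokens / BLOCK_SIZE
}

fn main() {
    // (tokens, blocks) pairs copied from the "Prefix cache insert" log lines.
    let logged = [(2535, 39), (7429, 116), (252, 3)];
    for (tokens, blocks) in logged {
        println!(
            "{tokens} tokens -> {} full blocks (log says {blocks})",
            full_blocks(tokens)
        );
    }
}
```

All three pairs match under this assumption, which suggests the cache bookkeeping itself is self-consistent for whatever token count the engine received.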
Which model did you use? nvfp4? |
|
The q3 0.8B and the gadfly 80B coder - models seem fine so long as I don't give them more than 2k of input at which point they seem to truncate/split it somehow and generate from whatever pieces are read into prefill |
|
Bit more testing -
vllm-rs-svc0 | 2026-04-05T19:17:43.182993Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 4 (1862 tokens, 0 cached blocks, raw_match=0 blocks)
vllm-rs-svc0 | 2026-04-05T19:17:47.434233Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0 | 2026-04-05T19:17:47.434250Z WARN vllm_rs::core::runner: No generation_config, using default sampling (temperature=0.7, top_k=32, top_p=0.95)
vllm-rs-svc0 | 2026-04-05T19:17:48.908395Z INFO vllm_rs::core::engine: Prefilling [seq_id 4]: 1863 tokens in 5.76s (323.66 tokens/s)
vllm-rs-svc0 | 2026-04-05T19:17:50.254845Z WARN vllm_rs::server::server: [Seq 4] Stream client disconnected during prefill/stream
vllm-rs-svc0 | 2026-04-05T19:17:50.314990Z ERROR vllm_rs::core::engine: Error when sending token to client [seq_id 4]
vllm-rs-svc0 | 2026-04-05T19:17:50.315034Z WARN vllm_rs::core::scheduler: Seq 4 - cancel requested (status Running)
vllm-rs-svc0 | 2026-04-05T19:17:53.160495Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 5, 1906 tokens] received! (session_id: None)
vllm-rs-svc0 |
vllm-rs-svc0 | 2026-04-05T19:17:53.160648Z INFO vllm_rs::core::block_manager: Prefix cache miss seq 5 (1906 tokens, 0 cached blocks, raw_match=0 blocks)
vllm-rs-svc0 | 2026-04-05T19:17:57.480202Z INFO vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0 | 2026-04-05T19:17:57.480221Z WARN vllm_rs::core::runner: No generation_config, using default sampling (temperature=0.7, top_k=32, top_p=0.95)
vllm-rs-svc0 | 2026-04-05T19:17:58.991237Z INFO vllm_rs::core::engine: Prefilling [seq_id 5]: 1907 tokens in 5.86s (325.26 tokens/s)
vllm-rs-svc0 | 2026-04-05T19:18:03.996902Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [5]], avg. 24 tokens/s per request (total: 24 tokens/s)
vllm-rs-svc0 | 2026-04-05T19:18:09.028967Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [5]], avg. 24 tokens/s per request (total: 24 tokens/s)
When i push to 140 lines i can get clean output with BUT as soon as i cross 2048 (
to
|
* Support mxfp4 models
* Working on Hopper
* Fail fast if server port not available
* Optimize decoding speed
* Fix build on V100
* Support NVFP4 & fix mxfp4 model loading
* Update ReadMe
* Compatible with more meta format
* Explicit error for unsupported MLX format
* Fix compressed-tensors nvfp4 precision issue
* Fix weight global scale sharding
* Fix models & add test-model skill
* Fix precision issue for mxfp4 models.
* Update dependency
Tested case
Performance on Hopper (No FP4 hardware acceleration) for Qwen3-30B-A3B-MXFP4A16: