Qwen3.5-MoE support #1288

Merged
ikawrakow merged 2 commits into main from ik/qwen35moe
Feb 21, 2026

Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Feb 19, 2026

~~It loads and runs, but it does not work.~~

Adding it as a draft PR in case someone wants to try to figure out where I have gone wrong.

It works now with the following CAVEAT (which, btw, applies to Qwen3-Next as well): one cannot have more than one sequence. The implementation here ended up being quite different from llama.cpp, so I cannot copy from there, and since I haven't yet fully wrapped my head around the delta-net thing, I haven't figured out how to handle multiple sequences. This is not relevant for "normal" usage, but if you try to, e.g., calculate perplexity for a context of 512 using u-batches > 512 (as one usually does for hybrid CPU/GPU inference), that will not work. Neither will pipeline parallelism.
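For intuition on why mixing sequences breaks things: delta-net style linear attention keeps one fixed-size state blob per sequence instead of a growing KV cache. The sketch below is a toy version of the delta rule only, not this PR's implementation; all names and the simplified update are illustrative.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One simplified delta-rule update: the fixed-size state S is
    corrected toward the new (k, v) pair instead of appending to a cache."""
    pred = S.T @ k                             # value the state predicts for k
    return S + np.outer(k, beta * (v - pred))  # rank-1 "delta" correction

d_k, d_v = 4, 4
S = np.zeros((d_k, d_v))  # ONE opaque state blob per sequence: interleaving
                          # tokens from two sequences would corrupt it for both
rng = np.random.default_rng(0)
for _ in range(8):                    # tokens must be applied strictly in
    k = rng.standard_normal(d_k)      # order; there is no "rewind" operator
    v = rng.standard_normal(d_v)
    S = delta_rule_step(S, k, v, beta=0.5)

print(S.shape)  # (4, 4) no matter how long the context gets
```

Because the state has no positional structure, a multi-sequence batch needs a separate state per sequence, which is exactly the part not implemented yet.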

As far as I can tell, this implementation is quite a bit faster than llama.cpp. Below is a comparison with the latest llama.cpp version as of this writing (build: 8111 (11c325c6e)). CPU-only benchmark is on a Ryzen-3995WX CPU. The "CUDA" benchmark is on a 3090 with all MoE tensors left in RAM (full GPU offload is hopeless for this model).

| model | backend | test | t/s (llama.cpp) | t/s (ik_llama.cpp) | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen35moe 397B.A17B IQ4_XS | CPU | pp512 | 38.98 ± 0.10 | 88.53 ± 0.77 | 2.271 |
| qwen35moe 397B.A17B IQ4_XS | CPU | tg256 | 3.81 ± 0.01 | 7.58 ± 0.01 | 1.990 |
| qwen35moe 397B.A17B IQ4_XS | CUDA | pp4096 | 175.37 ± 0.13 | 312.39 ± 4.03 | 1.781 |
| qwen35moe 397B.A17B IQ4_XS | CUDA | tg256 | 19.50 ± 0.12 | 20.22 ± 0.06 | 1.037 |
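For the record, the Speedup column is just the ratio of the two mean t/s values:

```python
# Reproducing the Speedup column of the table above from the mean t/s values.
results = {
    "CPU pp512":   (88.53, 38.98),
    "CPU tg256":   (7.58, 3.81),
    "CUDA pp4096": (312.39, 175.37),
    "CUDA tg256":  (20.22, 19.50),
}
for test, (ik_tps, mainline_tps) in results.items():
    print(f"{test}: {ik_tps / mainline_tps:.3f}x")
```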

Oh, given the caveat above, if you do want to run a perplexity check and want to use a batch/u-batch size > 512 (because of hybrid inference), just use a context that is the same as or larger than the u-batch size. I have used -c 4096 -b 4096 -ub 4096 for my own testing.
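To see why this works, a quick arithmetic check (the 294912-token count is taken from the perplexity logs later in this thread; the rest is simple division):

```python
# With the context equal to the u-batch size, every u-batch contains tokens
# from a single sequence, so the single-sequence limitation never triggers.
n_tokens = 294912   # evaluated wiki.test.raw tokens (from the logs)
n_ctx = 4096        # -c 4096
n_ubatch = 4096     # -ub 4096
chunks = n_tokens // n_ctx
seqs_per_ubatch = max(1, n_ubatch // n_ctx)  # must stay 1 for this model
print(chunks, seqs_per_ubatch)  # 72 chunks, 1 sequence per u-batch
```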

@ubergarm
Contributor

ubergarm commented Feb 19, 2026

Just pulled and built d7269f1 on the CPU-only rig, and managed to load and run perplexity just long enough to get some numbers to compare. I'll circle back and see how it chats now too.

tl;dr: the perplexity values on this PR look higher than mainline at the moment, and it throws a warning: qwen3next mixed-sequence batch contains repeated seq_id values; falling back to single-token chunking

ik

[1]8.2884,[2]10.1913,[3]6.9527,[4]5.5820,[5]5.6511,[6]5.6996,[7]5.8253,[8]4.8905,llama_decode_internal: qwen3next mixed-sequence batch contains repeated seq_id values; falling back to single-token chunking
[9]4.9977,[10]4.9563,[11]5.0721,[12]5.3878,[13]5.6697,[14]5.6432,[15]5.8091,[16]5.8180,llama_decode_internal: qwen3next mixed-sequence batch contains repeated seq_id values; falling back to single-token chunking

mainline

[1]3.2614,[2]4.5283,[3]3.3583,[4]2.6455,[5]2.3004,[6]2.1341,[7]2.1067,[8]2.0086,[9]1.9741,[10]1.9739,[11]1.9068,[12]2.0586,[13]2.1284,[14]2.1717,[15]2.3020,[16]2.4441,^C

full command and debug log here: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/raw/main/logs/perplexity-Qwen3.5-397B-A17B-Q3_K.log

@ikawrakow ikawrakow marked this pull request as ready for review February 19, 2026 17:12
@ikawrakow
Owner Author

ikawrakow commented Feb 19, 2026

@ubergarm

Please see the updated comment, and get the latest version that I just pushed. For PPL testing, use -c 4096 -b 4096 -ub 4096 for hybrid inference. For CPU-only, it is best to stick to a context of 512 with the default batch/u-batch size.

If you see the log

llama_decode_internal: qwen3next mixed-sequence batch contains repeated seq_id values; falling back to single-token chunking

that's no good: it is going to be very slow.

@ubergarm
Contributor

It does produce coherent output that looks normal at first glance, and is 3x faster than mainline.

ik

prompt eval time =     333.01 ms /    32 tokens (   10.41 ms per token,    96.09 tokens per second)
       eval time =   26933.97 ms /   406 tokens (   66.34 ms per token,    15.07 tokens per second)
      total time =   27266.98 ms /   438 tokens

mainline

prompt eval time =     643.24 ms /    21 tokens (   30.63 ms per token,    32.65 tokens per second)
       eval time =  102551.77 ms /   477 tokens (  214.99 ms per token,     4.65 tokens per second)
      total time =  103195.01 ms /   498 tokens

I'll pull the most recent changes and check, thanks!

@ubergarm
Contributor

Great, going back to the default batches, e.g. -ub 512 -b 2048, the CPU-only perplexity test is looking good now! I'll let it run and get some comparisons today; if things still look good I'll cook up some ik quants! Thanks for digging into this challenging implementation!

llm_load_tensors:        CPU buffer size = 184290.32 MiB
....................................................................................................
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: attn_max_b    = 0
llama_init_from_model: fused_moe     = 1
llama_init_from_model: grouped er    = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad    = 1
llama_init_from_model: rope_cache    = 0
llama_init_from_model: graph_reuse   = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type   = f16
llama_init_from_model: sched_async   = 0
llama_init_from_model: ser           = -1, 0
llama_init_from_model: freq_base     = 10000000.0
llama_init_from_model: freq_scale    = 1
llama_kv_cache_init:        CPU KV buffer size =   805.31 MiB
llama_init_from_model: KV self size  =   60.00 MiB, K (f16):   30.00 MiB, V (f16):   30.00 MiB
llama_init_from_model:        CPU  output buffer size =     3.79 MiB
llama_init_from_model:        CPU compute buffer size =   493.00 MiB
llama_init_from_model: graph nodes  = 15205
llama_init_from_model: graph splits = 1
llama_init_from_model: enabling only_active_experts scheduling

system_info: n_threads = 96 (n_threads_batch = 128) / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 397.55 ms
perplexity: calculating perplexity over 580 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 8.80 seconds per pass - ETA 21.25 minutes
===================================== llama_init_from_model: f16
======================================= HAVE_FANCY_SIMD is defined
[1]3.2803,[2]4.5420,[3]3.3625,[4]2.6480,[5]2.3048,[6]2.1311,[7]2.1014,[8]2.0040,[9]1.9718,[10]1.9690,[11]1.9009,[12]2.0532,

@ikawrakow
Owner Author

Yes, so here are my PPL runs. 2x3090 + Ryzen-3995WX. I could have offloaded some of the experts to the GPUs, but didn't want to bother with that. llama.cpp actually became slower when I didn't use --cpu-moe and the automatic fit was invoked.

ik_llama.cpp GPU/CPU

./bin/llama-perplexity -m Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf -f wiki.test.raw \
                 -c 4096 -b 4096 -ub 4096 -t 64 -ngl 100 --cpu-moe

perplexity: calculating perplexity over 72 chunks, n_ctx=4096, batch_size=4096, n_seq=1
perplexity: 14.54 seconds per pass - ETA 17.45 minutes 
[1]1.3938,[2]2.2598,[3]2.3024,[4]2.5078,[5]2.8840,[6]3.1975,[7]3.4323,[8]3.6297,[9]3.7199,[10]3.5278,[11]3.2466,[12]2.9809,[13]2.7980,[14]2.9990,[15]2.8489,[16]2.7450,[17]2.7146,[18]2.7583,[19]2.7281,[20]2.7149,[21]2.7801,[22]2.9267,[23]2.8404,[24]2.9333,[25]2.9752,[26]2.9976,[27]3.0260,[28]3.0491,[29]3.0536,[30]3.0751,[31]3.1215,[32]3.1867,[33]3.2110,[34]3.1894,[35]3.1965,[36]3.1570,[37]3.2159,[38]3.2353,[39]3.2652,[40]3.2490,[41]3.1704,[42]3.0948,[43]3.1294,[44]3.1471,[45]3.1136,[46]3.1814,[47]3.2218,[48]3.2796,[49]3.2989,[50]3.3185,[51]3.2742,[52]3.2116,[53]3.1680,[54]3.1371,[55]3.1237,[56]3.0877,[57]3.1249,[58]3.1514,[59]3.1680,[60]3.1928,[61]3.2064,[62]3.2268,[63]3.2478,[64]3.2547,[65]3.2698,[66]3.2380,[67]3.2255,[68]3.2301,[69]3.2020,[70]3.2294,[71]3.1828,[72]3.1910,
Final estimate: PPL over 72 chunks for n_ctx=4096 = 3.1910 +/- 0.01636

llama_print_timings:        load time =    2474.59 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  987252.39 ms / 294912 tokens (    3.35 ms per token,   298.72 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  990517.69 ms / 294913 tokens

llama.cpp, GPU/CPU

./bin/llama-perplexity -m Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf -f wiki.test.raw \
                 -c 4096 -b 4096 -ub 4096 -t 64 -ngl 100 --cpu-moe

perplexity: tokenization took 646.275 ms
perplexity: calculating perplexity over 72 chunks, n_ctx=4096, batch_size=4096, n_seq=1
perplexity: 27.44 seconds per pass - ETA 32.92 minutes
[1]1.3931,[2]2.2620,[3]2.3016,[4]2.5120,[5]2.8871,[6]3.1999,[7]3.4348,[8]3.6321,[9]3.7225,[10]3.5294,[11]3.2462,[12]2.9794,[13]2.7964,[14]2.9973,[15]2.8467,[16]2.7431,[17]2.7142,[18]2.7582,[19]2.7279,[20]2.7148,[21]2.7798,[22]2.9264,[23]2.8402,[24]2.9330,[25]2.9748,[26]2.9975,[27]3.0259,[28]3.0490,[29]3.0532,[30]3.0746,[31]3.1210,[32]3.1860,[33]3.2104,[34]3.1891,[35]3.1960,[36]3.1566,[37]3.2153,[38]3.2347,[39]3.2646,[40]3.2483,[41]3.1697,[42]3.0943,[43]3.1289,[44]3.1467,[45]3.1134,[46]3.1811,[47]3.2218,[48]3.2795,[49]3.2986,[50]3.3182,[51]3.2743,[52]3.2118,[53]3.1680,[54]3.1371,[55]3.1236,[56]3.0876,[57]3.1248,[58]3.1512,[59]3.1678,[60]3.1926,[61]3.2062,[62]3.2267,[63]3.2476,[64]3.2546,[65]3.2696,[66]3.2376,[67]3.2253,[68]3.2297,[69]3.2017,[70]3.2291,[71]3.1825,[72]3.1906,
Final estimate: PPL = 3.1906 +/- 0.01636

llama_perf_context_print:        load time =    8660.10 ms
llama_perf_context_print: prompt eval time = 1888579.66 ms / 294912 tokens (    6.40 ms per token,   156.16 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time = 1919351.60 ms / 294913 tokens

@ikawrakow
Owner Author

Here is the beginning of the llama.cpp perplexity run without -ngl 100 --cpu-moe.

common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  24101 total, 105720 used, -81999 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  24124 total, 102500 used, -78728 free vs. target of   1024
llama_params_fit_impl: projected to use 208220 MiB of device memory vs. 47492 MiB of free device memory
llama_params_fit_impl: cannot meet free memory targets on all devices, need to use 162776 MiB less in total
llama_params_fit_impl: context size set by user to 4096 -> no change
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 30800 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090): 61 layers,  15691 MiB used,   8079 MiB free
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  0 layers,   2247 MiB used,  21472 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  7 layers ( 1 overflowing),  22403 MiB used,   1317 MiB free
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090): 54 layers (52 overflowing),  21665 MiB used,   2106 MiB free
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 8.35 seconds

So far so good, it did offload some MoE layers to the GPUs. But then we get

perplexity: calculating perplexity over 72 chunks, n_ctx=4096, batch_size=4096, n_seq=1
perplexity: 35.60 seconds per pass - ETA 42.70 minutes
[1]1.3931,[2]2.2620,[3]2.3016,[4]2.5120,[5]2.8871,[6]3.1999,

so basically 25% slower, which rather defeats the purpose of using as much of the available VRAM as possible.

@ubergarm
Contributor

Great, I'm using this PR to cook a new imatrix from the full BF16 now, and will get some quants added to the collection!

Perplexity with n_ctx=512 is a bit higher, which seems fine. I haven't checked n_ctx=4096 yet, but it is likely similar to what you're getting. My initial findings suggest this model is very compressible when quantizing only the routed experts: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF#quant-collection

so basically 25% slower, which rather defeats the purpose of using as much of the available VRAM as possible.

Interesting, thanks for the heads up; I'll have to try some different offload strategies and compare speeds then. I recall that on year-old ktransformers, offloading additional layers could supposedly slow down DeepSeek for CUDA-graphs-related reasons.

I believe this is similar to what @magikRUKKOLA is trying to tell me here: #1268 (comment) ...

Okay, gonna cook some ik_llama.cpp quants for Qwen3.5 MoE!

@magikRUKKOLA

magikRUKKOLA commented Feb 19, 2026

smol-IQ2_XS, 8x3090:

 main: n_kv_max = 262144, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64

 |    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
 |-------|--------|--------|----------|----------|----------|----------|
 |  4096 |   1024 |      0 |    5.089 |   804.82 |   27.987 |    36.59 |
 |  4096 |   1024 |   4096 |    5.196 |   788.25 |   27.316 |    37.49 |
 |  4096 |   1024 |   8192 |    5.345 |   766.29 |   28.092 |    36.45 |
 |  4096 |   1024 |  12288 |    5.471 |   748.72 |   28.931 |    35.39 |
 |  4096 |   1024 |  16384 |    5.597 |   731.86 |   29.363 |    34.87 |
 |  4096 |   1024 |  20480 |    5.719 |   716.20 |   30.014 |    34.12 |
 |  4096 |   1024 |  24576 |    5.852 |   699.89 |   30.641 |    33.42 |
 |  4096 |   1024 |  28672 |    5.973 |   685.77 |   31.196 |    32.83 |
 |  4096 |   1024 |  32768 |    6.117 |   669.59 |   31.973 |    32.03 |
 |  4096 |   1024 |  36864 |    6.244 |   655.96 |   32.476 |    31.53 |
 |  4096 |   1024 |  40960 |    6.371 |   642.96 |   32.860 |    31.16 |
 |  4096 |   1024 |  45056 |    6.499 |   630.24 |   33.713 |    30.37 |
 |  4096 |   1024 |  49152 |    6.621 |   618.68 |   34.260 |    29.89 |
 |  4096 |   1024 |  53248 |    6.752 |   606.66 |   34.888 |    29.35 |
 |  4096 |   1024 |  57344 |    6.873 |   595.98 |   35.762 |    28.63 |
 |  4096 |   1024 |  61440 |    7.023 |   583.23 |   35.814 |    28.59 |
 |  4096 |   1024 |  65536 |    7.139 |   573.73 |   36.550 |    28.02 |
 |  4096 |   1024 |  69632 |    7.259 |   564.25 |   37.379 |    27.39 |
 |  4096 |   1024 |  73728 |    7.380 |   555.05 |   37.640 |    27.20 |
...
perplexity: calculating perplexity over 580 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 5.74 seconds per pass - ETA 13.85 minutes

...
Final estimate: PPL over 580 chunks for n_ctx=512 = 3.8706 +/- 0.02130


The **IQ2_KL** seems to be a little bit faster in prefill:

main: n_kv_max = 262144, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |    4.671 |   876.92 |   30.972 |    33.06 |
|  4096 |   1024 |   4096 |    4.778 |   857.20 |   30.428 |    33.65 |
|  4096 |   1024 |   8192 |    4.939 |   829.29 |   31.134 |    32.89 |
|  4096 |   1024 |  12288 |    5.048 |   811.34 |   31.851 |    32.15 |
|  4096 |   1024 |  16384 |    5.183 |   790.35 |   32.514 |    31.49 |
|  4096 |   1024 |  20480 |    5.300 |   772.81 |   33.036 |    31.00 |
|  4096 |   1024 |  24576 |    5.446 |   752.11 |   33.515 |    30.55 |
|  4096 |   1024 |  28672 |    5.560 |   736.67 |   34.255 |    29.89 |
|  4096 |   1024 |  32768 |    5.690 |   719.90 |   34.906 |    29.34 |
|  4096 |   1024 |  36864 |    5.817 |   704.11 |   35.614 |    28.75 |
|  4096 |   1024 |  40960 |    5.952 |   688.21 |   35.947 |    28.49 |
|  4096 |   1024 |  45056 |    6.070 |   674.83 |   36.542 |    28.02 |
|  4096 |   1024 |  49152 |    6.218 |   658.78 |   37.186 |    27.54 |
|  4096 |   1024 |  53248 |    6.333 |   646.77 |   37.898 |    27.02 |
|  4096 |   1024 |  57344 |    6.447 |   635.35 |   38.417 |    26.65 |
|  4096 |   1024 |  61440 |    6.591 |   621.49 |   38.924 |    26.31 |
|  4096 |   1024 |  65536 |    6.715 |   610.00 |   39.782 |    25.74 |

perplexity: calculating perplexity over 580 chunks, n_ctx=512, batch_size=4096, n_seq=8
perplexity: 9.86 seconds per pass - ETA 11.90 minutes

Final estimate: PPL over 580 chunks for n_ctx=512 = 3.6537 +/- 0.02000

[EDIT]:

IQ2_KL 8x3090, f16 kv, 256k ctx:

note: kv cache f16 seems to be faster in decode.

main: n_kv_max = 262144, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |    4.664 |   878.13 |   30.759 |    33.29 |
|  4096 |   1024 |   4096 |    4.785 |   855.93 |   30.143 |    33.97 |
|  4096 |   1024 |   8192 |    4.911 |   833.96 |   30.669 |    33.39 |
|  4096 |   1024 |  12288 |    5.034 |   813.69 |   31.054 |    32.97 |
|  4096 |   1024 |  16384 |    5.165 |   792.96 |   31.565 |    32.44 |
|  4096 |   1024 |  20480 |    5.297 |   773.20 |   31.918 |    32.08 |
|  4096 |   1024 |  24576 |    5.422 |   755.47 |   32.234 |    31.77 |
|  4096 |   1024 |  28672 |    5.550 |   738.00 |   32.542 |    31.47 |
|  4096 |   1024 |  32768 |    5.686 |   720.41 |   33.039 |    30.99 |
|  4096 |   1024 |  36864 |    5.804 |   705.77 |   33.408 |    30.65 |
|  4096 |   1024 |  40960 |    5.941 |   689.46 |   33.595 |    30.48 |
|  4096 |   1024 |  45056 |    6.070 |   674.83 |   34.239 |    29.91 |
|  4096 |   1024 |  49152 |    6.198 |   660.84 |   34.497 |    29.68 |
|  4096 |   1024 |  53248 |    6.322 |   647.92 |   34.852 |    29.38 |
|  4096 |   1024 |  57344 |    6.456 |   634.42 |   35.292 |    29.01 |
|  4096 |   1024 |  61440 |    6.583 |   622.22 |   35.599 |    28.76 |
|  4096 |   1024 |  65536 |    6.706 |   610.83 |   35.931 |    28.50 |
|  4096 |   1024 |  69632 |    6.836 |   599.19 |   36.186 |    28.30 |
|  4096 |   1024 |  73728 |    6.952 |   589.22 |   36.594 |    27.98 |
|  4096 |   1024 |  77824 |    7.088 |   577.92 |   37.066 |    27.63 |
|  4096 |   1024 |  81920 |    7.210 |   568.06 |   37.383 |    27.39 |
|  4096 |   1024 |  86016 |    7.359 |   556.56 |   37.715 |    27.15 |
|  4096 |   1024 |  90112 |    7.479 |   547.68 |   38.131 |    26.85 |
|  4096 |   1024 |  94208 |    7.593 |   539.46 |   38.804 |    26.39 |
|  4096 |   1024 |  98304 |    7.737 |   529.42 |   38.961 |    26.28 |
|  4096 |   1024 | 102400 |    7.861 |   521.04 |   39.392 |    25.99 |
|  4096 |   1024 | 106496 |    7.984 |   513.00 |   39.778 |    25.74 |
|  4096 |   1024 | 110592 |    8.126 |   504.09 |   40.129 |    25.52 |
|  4096 |   1024 | 114688 |    8.255 |   496.20 |   40.358 |    25.37 |
|  4096 |   1024 | 118784 |    8.374 |   489.11 |   40.818 |    25.09 |
|  4096 |   1024 | 122880 |    8.506 |   481.55 |   41.107 |    24.91 |
|  4096 |   1024 | 126976 |    8.682 |   471.80 |   41.531 |    24.66 |
|  4096 |   1024 | 131072 |    8.780 |   466.51 |   41.888 |    24.45 |
|  4096 |   1024 | 135168 |    8.894 |   460.52 |   42.327 |    24.19 |
|  4096 |   1024 | 139264 |    9.025 |   453.86 |   42.699 |    23.98 |
|  4096 |   1024 | 143360 |    9.166 |   446.87 |   43.218 |    23.69 |
|  4096 |   1024 | 147456 |    9.296 |   440.63 |   43.465 |    23.56 |
|  4096 |   1024 | 151552 |    9.433 |   434.22 |   43.842 |    23.36 |
|  4096 |   1024 | 155648 |    9.558 |   428.55 |   44.096 |    23.22 |
|  4096 |   1024 | 159744 |    9.679 |   423.17 |   44.521 |    23.00 |
|  4096 |   1024 | 163840 |    9.814 |   417.36 |   45.058 |    22.73 |
|  4096 |   1024 | 167936 |    9.954 |   411.49 |   45.304 |    22.60 |
|  4096 |   1024 | 172032 |   10.070 |   406.77 |   45.638 |    22.44 |
|  4096 |   1024 | 176128 |   10.192 |   401.88 |   46.037 |    22.24 |
|  4096 |   1024 | 180224 |   10.318 |   396.96 |   46.397 |    22.07 |
|  4096 |   1024 | 184320 |   10.473 |   391.10 |   46.785 |    21.89 |
|  4096 |   1024 | 188416 |   10.601 |   386.36 |   47.012 |    21.78 |
|  4096 |   1024 | 192512 |   10.689 |   383.19 |   47.367 |    21.62 |
|  4096 |   1024 | 196608 |   10.856 |   377.30 |   47.918 |    21.37 |
|  4096 |   1024 | 200704 |   10.976 |   373.19 |   48.175 |    21.26 |
|  4096 |   1024 | 204800 |   11.082 |   369.62 |   48.607 |    21.07 |
|  4096 |   1024 | 208896 |   11.253 |   363.98 |   48.911 |    20.94 |
|  4096 |   1024 | 212992 |   11.362 |   360.51 |   49.297 |    20.77 |
|  4096 |   1024 | 217088 |   11.506 |   356.00 |   49.762 |    20.58 |
|  4096 |   1024 | 221184 |   11.623 |   352.40 |   50.101 |    20.44 |
|  4096 |   1024 | 225280 |   11.766 |   348.12 |   50.360 |    20.33 |
|  4096 |   1024 | 229376 |   11.887 |   344.58 |   50.828 |    20.15 |
|  4096 |   1024 | 233472 |   12.010 |   341.05 |   51.044 |    20.06 |
|  4096 |   1024 | 237568 |   12.160 |   336.85 |   51.511 |    19.88 |
|  4096 |   1024 | 241664 |   12.290 |   333.28 |   51.924 |    19.72 |
|  4096 |   1024 | 245760 |   12.431 |   329.50 |   52.217 |    19.61 |
|  4096 |   1024 | 249856 |   12.564 |   326.02 |   52.482 |    19.51 |
|  4096 |   1024 | 253952 |   12.683 |   322.96 |   53.054 |    19.30 |
|  4096 |   1024 | 258048 |   12.875 |   318.14 |   53.375 |    19.18 |

[EDIT#n]:

IQ4_KSS, 2x3090, 3975wx, DDR4:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   13.304 |   307.87 |   57.791 |    17.72 |
|  4096 |   1024 |   4096 |   13.462 |   304.26 |   57.370 |    17.85 |
|  4096 |   1024 |   8192 |   13.563 |   302.00 |   57.730 |    17.74 |
|  4096 |   1024 |  12288 |   13.621 |   300.71 |   58.153 |    17.61 |
|  4096 |   1024 |  16384 |   13.745 |   297.99 |   58.499 |    17.50 |
|  4096 |   1024 |  20480 |   13.871 |   295.30 |   58.829 |    17.41 |
|  4096 |   1024 |  24576 |   14.072 |   291.08 |   58.988 |    17.36 |

@MrHills-rs

MrHills-rs commented Feb 20, 2026

There's an issue. The model works OK generally, but after a while it stops producing any output. Sending a message won't work anymore, swiping doesn't work, and changing the context doesn't work either. It won't output any error, but the model will only produce one extremely fast token (4184.10 tokens per second), and the message ends. The problem disappears when I restart ik_llama.cpp. I'm trying to figure out what I'm doing to trigger the bug, but it's hard to replicate. This happened at various context lengths, both at 5000 and 30000. Using IQ2_XS quants.
https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF

Also, idk if it's expected, but changing -ctk q8_0 -ctv q8_0 to different values, such as q4_0, doesn't affect tg or pp performance at all.

INFO [ log_server_request] request | tid="139794080301056" timestamp=1771544772 remote_addr="127.0.0.1" remote_port=60996 status=200 method="POST" path="/v1/chat/completions" params={}
[INFO] Request 127.0.0.1 "POST /v1/chat/completions HTTP/1.1" 200 755 "node-fetch" 1.179624584s
======== Prompt cache: cache size: 37214, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 10000, f_keep: 1.00, cache_ram_similarity: 1.00
 - looking for better prompt, base f_keep = 1.000, sim = 1.000, n_keep = 0, n_discarded_prompt = 0
 - cache state: 4 prompts, 3206.256 MiB (limits: 4096.000 MiB, 0 tokens, 201896 est)
 - prompt 0x7f233c0f13f0: 40460 tokens, 0 discarded, checkpoints: 0, 816.357 MiB
 - prompt 0x7f23080f1260: 40261 tokens, 0 discarded, checkpoints: 0, 813.258 MiB
 - prompt 0x7f230c0ea8f0: 40105 tokens, 0 discarded, checkpoints: 0, 810.829 MiB
 - prompt 0x7f22f00ea8f0: 37214 tokens, 0 discarded, checkpoints: 0, 765.812 MiB
prompt cache load took 6.57 ms
INFO [ launch_slot_with_task] slot is processing task | tid="139804366508032" timestamp=1771544790 id_slot=0 id_task=618
======== Cache: cache_size = 37214, n_past0 = 37214, n_past1 = 37214, n_past_prompt1 = 37214, n_past2 = 37214, n_past_prompt2 = 37214
INFO [ batch_pending_prompt] we have to evaluate at least 1 token to generate logits | tid="139804366508032" timestamp=1771544790 id_slot=0 id_task=618
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139804366508032" timestamp=1771544790 id_slot=0 id_task=618 p0=37213
slot print_timing: id 0 | task -1 |
prompt eval time =      74.69 ms /     1 tokens (   74.69 ms per token,    13.39 tokens per second)
       eval time =       0.24 ms /     1 tokens (    0.24 ms per token,  4184.10 tokens per second)
      total time =      74.93 ms /     2 tokens
statistics ngram_simple: #calls(b,g,a) = 10 587 39, #gen drafts = 47, #acc drafts = 39, #gen tokens = 752, #acc tokens = 125, dur(b,g,a) = 0.002, 20.623, 0.009 ms
INFO [ release_slots] slot released | tid="139804366508032" timestamp=1771544790 id_slot=0 id_task=618
n_ctx=131072 n_past=37214 n_system_tokens=0 n_cache_tokens=37214 truncated=false
INFO [ slots_idle] all slots are idle | tid="139804366508032" timestamp=1771544790

I'll simplify my arguments for testing from now on, but this is what I used up until this point.

./build/bin/llama-server -m ~/AI/ik/models/Qwen3.5-397B-A17B-smol-IQ2_XS.gguf --no-mmproj-offload --slot-save-path ~/AI/ik/slots --context-shift on -ot "blk\.(?:[0-9]|[1-4][0-9]|[5][0-6])\.ffn.*_exps.*=CPU" -c 131072 -cram 0 -b 4096 -ub 4096 -ctk q8_0 -ctv q8_0 --cache-ram-n-min 10000 --cache-ram-similarity 1 --slot-prompt-similarity 0.45 --threads 8 -ngl 95 -cuda fusion=1,offload-batch-size=4,mmq-id-size=128 -amb 512 --host 127.0.0.1 --port ${PORT} --webui none --repeat-last-n 2048 --reasoning-format none --jinja --chat-template-file ~/AI/ik/jinja/qwen-3.5.jinja --draft-min 1 --spec-ngram-size-n 8 --draft-max 4 --spec-ngram-size-m 16 --spec-type ngram-simple --draft-p-min 0.5

@magikRUKKOLA

magikRUKKOLA commented Feb 20, 2026

@MrHills-rs

There's an issue. The model works ok generally, but after a while it will stop any output.

Ha. Indeed.

Another interesting thing is that llama-server stopped working with jemalloc, due to memory access errors.

[EDIT]: I just noticed that this topic is about Qwen3.5. I had these problems with GLM4.7. Hm ...

[EDIT2]: As related to Qwen3.5: if the conversation is interrupted and I send a new request (with the same data), the LLM appears to actually continue the old conversation. Something is wrong with the attention, perhaps?

[EDIT3]:

illustration:

(2nd conversation after the interrupted (?) first one, related to the coding and debugging)

high -m Qwen3 -f -p "output a mandelbrot set via awk"
The user is asking me to fix the code. Looking at the conversation, it seems like I was providing code assistance and the user wants me to continue. However, the last message from the user is "You are a helpful assistant." which is just acknowledging my role.

Wait, looking more carefully at the conversation history, it seems like there was a code-related discussion about fixing some code, but the actual conversation shows:

1. User asked me to be a helpful assistant
2. I acknowledged
3. User said "You are a helpful assistant."

So Qwen3.5 has an imprint of the first conversation in the second one. Ha.

@ikawrakow
Owner Author

Also, idk if it's expected, but changing -ctk q8_0 -ctv q8_0 to different values, such as q4_0, doesn't affect tg or pp performance at all.

Well, the thing about these models is that they spend most of their time in the linear attention, so the standard transformer self-attention, which is used only in 1 out of 4 layers, does not play a major role in performance. Only at very long context will it contribute in a more significant way to the observed TG and PP. Hence, yes, it is expected that you do not see significant differences between different KV cache quantization types.
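A rough back-of-envelope supports this, using the 60 MiB KV size (n_ctx = 2048) and the 184290.32 MiB weight buffer from the CPU log earlier in this thread; the bits-per-weight figures for q8_0/q4_0 are the usual ggml block layouts, assumed here for illustration:

```python
# Back-of-envelope: why -ctk/-ctv quantization barely moves the needle here.
kv_f16_mib = 60.0        # KV self size at n_ctx = 2048, f16 (from the log)
weights_mib = 184290.32  # CPU buffer size for the model weights (from the log)
for name, bpw in [("f16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)]:
    kv = kv_f16_mib * bpw / 16.0
    print(f"{name}: KV = {kv:6.2f} MiB = {kv / weights_mib * 100:.4f}% of the weights")
```

Even at f16 the KV cache is a tiny fraction of the data being read per token, so shrinking it further cannot show up in tg/pp numbers.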

@ikawrakow
Owner Author

I think the issues that have been observed are related to the fact that the server currently does not handle correctly the recurrent cache. One cannot simply rewind it, as one does with standard transformer KV cache. The recurrent cache is just a blob of floating point values that somehow encode the past context, and there is no rewind operator. Instead, one needs to take frequent snapshots, and then only restart a conversation from the closest snapshot available. If you don't do that, eventually the recurrent cache will contain a salad of unrelated contexts, so it is kind of expected that it will eventually stop working altogether.

So, I guess, this is a serious limitation for Qwen3-Next and Qwen-3.5. I have zero interest in the server codebase, so hopefully @firecoperana will want to take it on.
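The snapshot scheme described above could be sketched as follows. This is purely illustrative, not server code: the class, method names, and the snapshot interval are all inventions.

```python
# Illustrative checkpoint/restore scheme for a recurrent cache: since the
# state cannot be rewound token-by-token, keep periodic deep copies and
# restart from the closest snapshot at or before the desired position,
# then re-decode the remaining tokens.
import bisect, copy

class RecurrentCacheCheckpoints:
    def __init__(self, interval=256):
        self.interval = interval
        self.positions = []   # token positions of stored snapshots
        self.snapshots = []   # deep copies of the opaque state blob

    def maybe_snapshot(self, n_past, state):
        if not self.positions or n_past - self.positions[-1] >= self.interval:
            self.positions.append(n_past)
            self.snapshots.append(copy.deepcopy(state))

    def restore(self, target_n_past):
        """Return (position, state) of the closest snapshot <= target."""
        i = bisect.bisect_right(self.positions, target_n_past) - 1
        if i < 0:
            return 0, None    # no usable snapshot: re-decode from scratch
        return self.positions[i], copy.deepcopy(self.snapshots[i])

ckpt = RecurrentCacheCheckpoints(interval=256)
for pos in range(0, 1024, 64):
    ckpt.maybe_snapshot(pos, {"n_past": pos})
pos, state = ckpt.restore(700)
print(pos)  # 512: re-decode tokens 512..700 instead of everything
```

The trade-off is memory (each snapshot is a full copy of the recurrent state) versus re-decode time after a rewind.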

@ikawrakow ikawrakow changed the title WIP: Qwen3.5-MoE support Qwen3.5-MoE support Feb 20, 2026
@ikawrakow
Owner Author

@magikRUKKOLA

Another interesting thing is that the LLM stopped working with jemalloc due to the memory access errors.

Can you debug it? What kind of memory errors?

So Qwen3.5 have an imprint of the first conversation in the second one. Ha.

This is because the cache is not being handled correctly, see my comment above.

@magikRUKKOLA

@ikawrakow

Can you debug it?

Well, it's kinda hard to reproduce... Overall, it seems to happen at very long context.

What kind of memory errors?

Not sure what it was.

[Sun Feb 15 04:21:42 2026] llama-server[158344]: segfault at 21750 ip 00007fbcecc22b17 sp 00007ffe842e7820 error 4 in libjemalloc.so.2[22b17,7fbcecc09000+9e000] likely on CPU 26 (core 26, socket 0)
[Tue Feb 17 10:14:10 2026] llama-perplexit[1833915]: segfault at 0 ip 00007f82f7368f5e sp 00007ffcdb832460 error 4 in libggml.so[168f5e,7f82f7239000+1758000] likely on CPU 24 (core 24, socket 0)
[Tue Feb 17 10:14:32 2026] llama-perplexit[1834196]: segfault at 0 ip 00007fd3f2568f5e sp 00007ffe807fc4e0 error 4 in libggml.so[168f5e,7fd3f2439000+1758000] likely on CPU 89 (core 25, socket 0)
[Tue Feb 17 10:14:56 2026] llama-perplexit[1834488]: segfault at 0 ip 00007f521ff68f5e sp 00007ffe56381260 error 4 in libggml.so[168f5e,7f521fe39000+1758000] likely on CPU 13 (core 13, socket 0)
[Tue Feb 17 14:34:36 2026] llama-server[2180172]: segfault at 0 ip 00007f325329e56c sp 00007f2e76fd6dc0 error 4 in libjemalloc.so.2[9e56c,7f3253209000+9e000] likely on CPU 17 (core 17, socket 0)
[Tue Feb 17 14:37:05 2026] llama-server[2186846]: segfault at 0 ip 00007f7fc589e3be sp 00007f7b221d7dc0 error 4 in libjemalloc.so.2[9e3be,7f7fc5809000+9e000] likely on CPU 26 (core 26, socket 0)
[Tue Feb 17 14:39:56 2026] llama-server[2193899]: segfault at 40 ip 00007fe4aa280a07 sp 00007ffcc3697dc0 error 6 in libjemalloc.so.2[80a07,7fe4aa209000+9e000] likely on CPU 89 (core 25, socket 0)
[Tue Feb 17 14:41:38 2026] llama-server[2200096]: segfault at 0 ip 00007fbbaa69e512 sp 00007fb70efd7dc0 error 4 in libjemalloc.so.2[9e512,7fbbaa609000+9e000] likely on CPU 26 (core 26, socket 0)
[Tue Feb 17 14:44:44 2026] llama-server[2211797]: segfault at 0 ip 00007fc92229e54e sp 00007fc47e3d7dc0 error 4 in libjemalloc.so.2[9e54e,7fc922209000+9e000] likely on CPU 62 (core 62, socket 0)
[Tue Feb 17 14:46:12 2026] llama-server[2217582]: segfault at 0 ip 00007f20e3c9e62b sp 00007f17dd1d8dc0 error 4 in libjemalloc.so.2[9e62b,7f20e3c09000+9e000] likely on CPU 29 (core 29, socket 0)
[Tue Feb 17 14:47:41 2026] llama-server[2219584]: segfault at e007 ip 00007f026e2809e3 sp 00007fffbe5e9a50 error 4 in libjemalloc.so.2[809e3,7f026e209000+9e000] likely on CPU 24 (core 24, socket 0)
[Tue Feb 17 14:49:06 2026] llama-server[2225609]: segfault at 0 ip 00007f8381c9e5b0 sp 00007f7ee71d8dc0 error 4 in libjemalloc.so.2[9e5b0,7f8381c09000+9e000] likely on CPU 26 (core 26, socket 0)
[Tue Feb 17 14:50:25 2026] llama-server[2231376]: segfault at 0 ip 00007fe492a9e512 sp 00007fdf41fd8dc0 error 4 in libjemalloc.so.2[9e512,7fe492a09000+9e000] likely on CPU 7 (core 7, socket 0)
[Tue Feb 17 14:51:57 2026] llama-server[2237202]: segfault at 38 ip 00007f74c127c721 sp 00007f701dbd8d88 error 4 in libjemalloc.so.2[7c721,7f74c1209000+9e000] likely on CPU 89 (core 25, socket 0)
[Fri Feb 20 03:22:24 2026] llama-server[3777195]: segfault at 0 ip 00007f0c1796aa9e sp 00007ffc08904ea0 error 4 in libggml.so[16aa9e,7f0c17839000+175e000] likely on CPU 24 (core 24, socket 0)
CUDA error: unspecified launch failure
  current device: 0, in function ggml_backend_cuda_synchronize at /opt/ik_llama.cpp/ik_llama.cpp/ggml/src/ggml-cuda.cu:3894
  cudaStreamSynchronize(cuda_ctx->stream())
/opt/ik_llama.cpp/ik_llama.cpp/ggml/src/ggml-cuda.cu:131: CUDA error
[New LWP 2615522]
[New LWP 2615521]

@firecoperana
Copy link
Collaborator

I think the issues that have been observed are related to the fact that the server currently does not handle correctly the recurrent cache. One cannot simply rewind it, as one does with standard transformer KV cache. The recurrent cache is just a blob of floating point values that somehow encode the past context, and there is no rewind operator. Instead, one needs to take frequent snapshots, and then only restart a conversation from the closest snapshot available. If you don't do that, eventually the recurrent cache will contain a salad of unrelated contexts, so it is kind of expected that it will eventually stop working altogether.

So, I guess, this is a serious limitation for Qwen3-Next and Qwen-3.5. I have zero interest in the server codebase, so hopefully @firecoperana will want to take it on.

Mainline has dedicated cache management for recurrent, hybrid and isswa model. I might need to port most of the kv cache/memory related code from mainline. Are you fine with this?

@ikawrakow
Copy link
Owner Author

Mainline has dedicated cache management for recurrent, hybrid and isswa model. I might need to port most of the kv cache/memory related code from mainline. Are you fine with this?

I cannot say that I particularly like what they have done.

It is not that I like the state of affairs in ik_llama.cpp better. But I also thought that their constant refactoring wasn't particularly useful, to the point where, at some level, it looked like they were doing it for the sake of demonstrating project activity rather than anything else. Because of that, I do find it satisfying that ik_llama.cpp can mostly keep pace in terms of features without constantly pushing the same pieces of code from here to there.

Given this, you don't think it can be done without copying their unified cache management?

@firecoperana
Copy link
Collaborator

I will just port the recurrent and hybrid parts then. It should be possible.

@sayap
Copy link
Contributor

sayap commented Feb 20, 2026

I am getting the same error as @magikRUKKOLA on certain prompts, and the backtrace looks like this:

(gdb) bt
#0  0x00007fffe45099e6 in _Fork () from /usr/lib64/libc.so.6
#1  0x00007fffe450f3d1 in fork () from /usr/lib64/libc.so.6
#2  0x00007fffeaad6e90 in ggml_print_backtrace () at /home/sayap/repo/ik_llama.cpp/ggml/src/ggml.c:217
#3  ggml_abort (file=0x7fffebc47688 "/home/sayap/repo/ik_llama.cpp/ggml/src/ggml-cuda.cu", line=131, fmt=0x7fffebc2cc68 "CUDA error") at /home/sayap/repo/ik_llama.cpp/ggml/src/ggml.c:263
#4  0x00007fffeac71ae5 in ggml_cuda_error (stmt=stmt@entry=0x7fffebc47f30 "cudaStreamSynchronize(cuda_ctx->stream())", func=func@entry=0x7fffebc2cd63 "ggml_backend_cuda_synchronize",
    file=file@entry=0x7fffebc47688 "/home/sayap/repo/ik_llama.cpp/ggml/src/ggml-cuda.cu", line=line@entry=3894, msg=0x7fffe4097208 "an illegal memory access was encountered")
    at /home/sayap/repo/ik_llama.cpp/ggml/src/ggml-cuda.cu:131
#5  0x00007fffeac72650 in ggml_backend_cuda_synchronize (backend=<optimized out>) at /home/sayap/repo/ik_llama.cpp/ggml/src/ggml-cuda.cu:3894
#6  0x00007fffeab201ad in ggml_backend_sched_synchronize (sched=sched@entry=0x555557132150) at /home/sayap/repo/ik_llama.cpp/ggml/src/ggml-backend.cpp:2641
#7  0x00007fffeab22335 in ggml_backend_sched_alloc_splits (sched=0x555557132150) at /home/sayap/repo/ik_llama.cpp/ggml/src/ggml-backend.cpp:1955
#8  ggml_backend_sched_alloc_graph (sched=0x555557132150, graph=<optimized out>) at /home/sayap/repo/ik_llama.cpp/ggml/src/ggml-backend.cpp:2609
#9  0x00007ffff7c8c7db in llama_decode_internal (lctx=..., batch_all=...) at /home/sayap/repo/ik_llama.cpp/src/llama.cpp:3317
#10 0x00007ffff7c8d66d in llama_decode (ctx=<optimized out>, batch=...) at /home/sayap/repo/ik_llama.cpp/src/llama.cpp:6997
#11 0x00005555556ae18b in server_context::process_batch_tokens (this=this@entry=0x7fffffffc630, n_batch=@0x7fffffff9788: 2048) at /home/sayap/repo/ik_llama.cpp/examples/server/server-context.cpp:3177
#12 0x00005555556b0019 in server_context::update_slots (this=0x7fffffffc630) at /home/sayap/repo/ik_llama.cpp/examples/server/server-context.cpp:3337
#13 0x00005555556501c6 in std::function<void()>::operator() (this=0x7fffffffd8a0) at /usr/lib/gcc/x86_64-pc-linux-gnu/14/include/g++-v14/bits/std_function.h:591
#14 server_queue::start_loop (this=this@entry=0x7fffffffd748) at /home/sayap/repo/ik_llama.cpp/examples/server/server-queue.cpp:133
#15 0x00005555555db126 in main (argc=<optimized out>, argv=<optimized out>) at /home/sayap/repo/ik_llama.cpp/examples/server/server.cpp:2139

@sayap
Copy link
Contributor

sayap commented Feb 21, 2026

Ohh the error happens with -ngl 999 -ncmoe 58 (CUDA0 buffer size = 13047.12 MiB), but goes away with -ngl 999 -ncmoe 59 (CUDA0 buffer size = 10977.12 MiB), so it might just be a side-effect of my broken 3090.

@magikRUKKOLA
Copy link

magikRUKKOLA commented Feb 21, 2026

@sayap

so it might just be a side-effect of my broken 3090.

Yeah, that is what I was thinking. In my case it could be bad risers. At one point I connected two of them in series (without retimers etc.), and that turned out to be the problem. As of now I am still not sure whether some of the risers are bad, because I see this:

lspci -vvv | grep -F -A 5 --colour 'LaneErr at lane'
                LaneErrStat: LaneErr at lane: 5
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Capabilities: [380 v1] Downstream Port Containment
--
                LaneErrStat: LaneErr at lane: 3 7
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 32GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s
--
                LaneErrStat: LaneErr at lane: 0 1 2 3 4 5 6 7
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Capabilities: [380 v1] Downstream Port Containment
--
                LaneErrStat: LaneErr at lane: 0 1 2 3 4
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Capabilities: [380 v1] Downstream Port Containment
--
                LaneErrStat: LaneErr at lane: 0 1 2 3 4 5 6 7
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Capabilities: [380 v1] Downstream Port Containment
--
                LaneErrStat: LaneErr at lane: 0 1 2 3 4 5 6 7
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 32GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s
--
                LaneErrStat: LaneErr at lane: 0 1 2 3 4 5 7 8 13 14
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Capabilities: [380 v1] Downstream Port Containment
--
                LaneErrStat: LaneErr at lane: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 32GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s

Alternatively, it could be some quirk of the motherboard or of the SlimSAS risers, so I am not sure what it is. Can you check whether the command above detects any lane errors?
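As a quick summary on top of the full `lspci -vvv` dump, one can simply count how many links report lane errors; the `printf` below just feeds in sample lines mimicking the output format shown above, so the pipeline is reproducible:

```shell
# Count PCIe links reporting lane errors. On a real system, replace the
# printf with `lspci -vvv` (run as root for full capability dumps).
printf '%s\n' \
  'LaneErrStat: LaneErr at lane: 5' \
  'L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+' \
  'LaneErrStat: LaneErr at lane: 0 1 2 3' \
| grep -c 'LaneErr at lane'
```

A non-zero count flags links worth identifying with `lspci -s <addr>` to see which slot (and thus which riser) is affected.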

[EDIT]: Just got another problem:

[Sat Feb 21 02:56:55 2026] NVRM: Xid (PCI:0000:41:00): 31, pid=232127, name=llama-server, channel 0x02000014, intr 00000000. MMU Fault: ENGINE CE3 HUBCLIENT_CE1 faulted @ 0x7f43_33c00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE

Hm ... looks like the hardware issue again.

@ikawrakow
Copy link
Owner Author

@magikRUKKOLA

Hm ... looks like the hardware issue again.

Yes, I also think that there may be a hardware issue. I had a few occasions where inference would simply lock up, similar to the way it behaved before you changed the risers. It is much less frequent, but it does happen from time to time.

@ikawrakow
Copy link
Owner Author

So, despite the limitations outlined above, I'll merge the PR. Proper recurrent cache management will be added later.

@ikawrakow ikawrakow merged commit 13c3d83 into main Feb 21, 2026
@chulucninh09
Copy link

Is the caveat related to my issue? I get random output after sending the same prompt with prompt caching enabled:
#1294
