Conversation
|
Just pulled and built. tl;dr: the perplexity values on this PR look higher than mainline at the moment, and it throws a warning. Full command and debug log here: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/raw/main/logs/perplexity-Qwen3.5-397B-A17B-Q3_K.log |
|
Please see the updated comment, and get the latest version that I just pushed. For PPL testing, use the command from the updated comment. If you see that warning in the log, that's no good; it is going to be very slow. |
|
It does produce coherent output that looks normal at first glance, and is 3x faster than mainline. |
I'll pull the most recent changes and check, thanks! |
|
Great, going to try the default batches, e.g. |
|
Yes, so here are my PPL runs. 2x3090 + Ryzen-3995WX. I could have offloaded some of the experts to the GPUs, but didn't want to bother with that.
ik_llama.cpp, GPU/CPU: `./bin/llama-perplexity -m Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf -f wiki.test.raw`
llama.cpp, GPU/CPU: |
|
Here is the beginning of the log. So far so good: it did offload some MoE layers to the GPUs. But then it ends up basically 25% slower, which is not exactly the purpose of using as much of the available VRAM as possible.
|
Great, I'm using this PR to cook a new imatrix from the full BF16 now, and will get some quants added to the collection! |
Interesting, thanks for the heads-up, I'll have to try some different offload strategies and compare speeds then. I recall that on a year-old ktransformers build, offloading additional layers could slow down DeepSeek, supposedly due to CUDA graphs. I believe this is similar to what @magikRUKKOLA is trying to tell me here: #1268 (comment) ... Okay, gonna cook some ik_llama.cpp quants for Qwen3.5 MoE! |
|
...
```
main: n_kv_max = 262144, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64
perplexity: calculating perplexity over 580 chunks, n_ctx=512, batch_size=4096, n_seq=8
Final estimate: PPL over 580 chunks for n_ctx=512 = 3.6537 +/- 0.02000
```
[EDIT]: IQ2_KL, 8x3090, f16 kv, 256k ctx: *note: kv cache f16 seems to be faster in decode.
[EDIT#n]: IQ4_KSS, 2x3090, 3975wx, DDR4:
|
|
There's an issue. The model works ok generally, but after a while it will stop any output. Sending a message won't work anymore, swiping doesn't work, changing the context doesn't work either. It won't output any error, but the model will only produce one extremely fast token (4184.10 tokens per second), and the message ends. The problem disappears when I restart ik_llama.cpp. I'm trying to figure out what I'm doing to trigger the bug, but it's hard to replicate. This happened at various context lengths, both at 5000 and 30000. Using IQ2_XS quants. Also, idk if it's expected, but changing -ctk q8_0 -ctv q8_0 to different values, such as q4_0, doesn't affect tg or pp performance at all.
I'll simplify my arguments for testing from now on, but this is what I used up until this point.
|
Ha. Indeed. Another interesting thing is that the LLM stopped working with [EDIT]: I just noticed that it's a topic about Qwen3.5. I had these problems with GLM4.7. Hm ... [EDIT2]: As related to Qwen3.5: in case of the conversation interruption, on sending a new request (with the same data) I do see the LLM as if it is actually continuing the old conversation. Something is wrong with the attention, perhaps? [EDIT3]: illustration (2nd conversation after the interrupted (?) first one, related to coding and debugging): So Qwen3.5 has an imprint of the first conversation in the second one. Ha. |
Well, the thing about these models is that they spend most of their time in the linear attention, so the standard transformer self-attention, which is used only in 1 out of 4 layers, does not play a major role in performance. Only at very long context will it contribute in a more significant way to the observed TG and PP. Hence, yes, it is expected that you do not see significant differences between different KV cache quantization types. |
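A quick Amdahl's-law sketch of the point above. The 25% fraction and the 1.5x local speedup are illustrative assumptions, not measured values for this model:

```python
def overall_speedup(fraction_affected: float, local_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only a fraction of the
    per-token work is sped up by local_speedup."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)

# Suppose standard (KV-cached) attention is 1 layer in 4 and, generously,
# accounts for ~25% of per-token time at short context. Even if a cheaper
# KV cache type made those layers 1.5x faster, the end-to-end gain is small:
print(round(overall_speedup(0.25, 1.5), 3))  # -> 1.091, i.e. under 10%
```

Which is within run-to-run noise, so identical-looking tg/pp numbers across cache types are unsurprising.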
|
I think the issues that have been observed are related to the fact that the server currently does not handle correctly the recurrent cache. One cannot simply rewind it, as one does with standard transformer KV cache. The recurrent cache is just a blob of floating point values that somehow encode the past context, and there is no rewind operator. Instead, one needs to take frequent snapshots, and then only restart a conversation from the closest snapshot available. If you don't do that, eventually the recurrent cache will contain a salad of unrelated contexts, so it is kind of expected that it will eventually stop working altogether. So, I guess, this is a serious limitation for Qwen3-Next and Qwen-3.5. I have zero interest in the server codebase, so hopefully @firecoperana will want to take it on. |
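A minimal sketch of the snapshot scheme described above (illustrative Python, not the actual ik_llama.cpp data structures): since the recurrent state is an opaque blob with no rewind operator, "rewinding" means restoring the newest snapshot taken at or before the target position and re-processing tokens from there.

```python
import copy

class RecurrentCache:
    """Toy model of a recurrent cache with snapshot-based rewind."""

    def __init__(self):
        self.state = {}      # opaque recurrent state (placeholder)
        self.n_past = 0      # number of tokens encoded into the state
        self.snapshots = {}  # n_past -> deep copy of the state at that point

    def append(self, n_tokens: int):
        # ... the real code would update self.state with n_tokens here ...
        self.n_past += n_tokens
        # take a frequent snapshot so a later rewind has something to restore
        self.snapshots[self.n_past] = copy.deepcopy(self.state)

    def rewind(self, target_pos: int) -> int:
        """Restore the newest snapshot at or before target_pos.
        Returns the position from which tokens must be re-processed."""
        usable = [p for p in self.snapshots if p <= target_pos]
        if not usable:
            # no usable snapshot: reset and re-process the whole context
            self.state, self.n_past = {}, 0
            return 0
        best = max(usable)
        self.state = copy.deepcopy(self.snapshots[best])
        self.n_past = best
        return best
```

Without something like this, edits or swipes that rewind the conversation leave stale context baked into the state, which matches the "imprint of the first conversation" symptom reported above.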
Can you debug it? What kind of memory errors?
This is because the cache is not being handled correctly, see my comment above. |
Well, it's kinda hard to reproduce... Overall, it seems to happen at very long context.
Not sure what it was. |
Very high PPL, empty TG.
Mainline has dedicated cache management for recurrent, hybrid and iSWA models. I might need to port most of the kv cache/memory related code from mainline. Are you fine with this? |
I cannot say that I particularly like what they have done. It is not that I like the current state of affairs either. Given this, you don't think it can be done without copying their unified cache management? |
|
I will just port recurrent and hybrid part then. It should be possible. |
|
I am getting the same error as @magikRUKKOLA on certain prompts, and the backtrace looks like this: |
|
Ohh the error happens with |
Yeah, that is what I was thinking. In my case it could be the bad risers. At one point I connected two of them in series (w/o retimers etc.) and that turned out to be the problem. As of now I am still not sure if some of the risers are bad, because I have this:
`lspci -vvv | grep -F -A 5 --colour 'LaneErr at lane'`
Alternatively, it could be some quirk of the motherboard or the SlimSAS risers, so I am not sure what it is. Can you check if the command above detects any Lane Errors? [EDIT]: Just got another problem: Hm ... looks like a hardware issue again. |
Yes, I also think that there may be a hardware issue. I had a few occasions where inference will simply lock up, similar to the way it behaved before you changed the risers. It is much less frequent, but it does happen from time to time. |
|
So, despite the limitations outlined above, I'll merge the PR. Proper recurrent cache management will be added later. |
|
Is the caveat related to my issue? I got random output after sending the same prompt with prompt caching enabled |
~~It loads and runs, but it does not work.~~
Adding it as a draft PR in case someone wants to try to figure out where I have gone wrong. It works now with the following CAVEAT (which, btw, applies to Qwen3-Next as well): one cannot have more than one sequence. The implementation here ended up being quite different from llama.cpp, so I cannot copy from there, and as I haven't yet fully wrapped my head around the delta-net thing, I haven't figured out how to do multiple sequences yet. This is not relevant for "normal" usage, but if you try to e.g. calculate perplexity for a context of 512 using u-batches > 512 (as one usually does for hybrid CPU/GPU inference), that will not work. Neither will pipeline parallelism.

As far as I can tell, this implementation is quite a bit faster than llama.cpp. Below is a comparison with the latest llama.cpp version as of this writing (build: 8111 (11c325c6e)). The CPU-only benchmark is on a Ryzen-3995WX CPU. The "CUDA" benchmark is on a 3090 with all MoE tensors left in RAM (full GPU offload is hopeless for this model).

Oh, given the caveat above, if you do want to run a perplexity check and want to use batch/u-batch size > 512 (because of hybrid inference), just use a context that is the same or larger than the u-batch size. I have used `-c 4096 -b 4096 -ub 4096` for my own testing.
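For reference, an invocation respecting that caveat (context >= u-batch size) might look like the following; the model path and quant name are placeholders, not the exact files used above:

```shell
# Hypothetical paths; the key point is -c >= -ub per the caveat above.
./bin/llama-perplexity \
  -m Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf \
  -f wiki.test.raw \
  -c 4096 -b 4096 -ub 4096
```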