
Add PEG parser and new jinja template engine#1369

Merged
ikawrakow merged 1 commit into main from fcp/auto_parser
Mar 9, 2026

Conversation

@firecoperana
Collaborator

@firecoperana firecoperana commented Mar 6, 2026

This PR syncs the PEG parser, the new jinja template engine, and other jinja-related updates from mainline.
ggml-org/llama.cpp#17136
ggml-org/llama.cpp#18462

@sayap Mainline has merged qwen3-coder and nemotron nano 3 parsers. When I ported it, qwen3 next coder still uses common_chat_params_init_qwen3_coder_xml. Can you test if it still works?

@hksdpc255 Mainline disables Minja polyfill in chat.cpp. In my simple test, MiroThinker tool call works. Can you check if MiroThinker tool call still works?

@chulucninh09 Can you also test this PR?

@firecoperana firecoperana marked this pull request as draft March 6, 2026 03:13
@sayap
Contributor

sayap commented Mar 6, 2026

Just tested the PR branch, and it has the same 2 issues as the mainline autoparser branch that impact Q3C / Q3CN / Q3.5 (#1352 (comment)):

  • The grammar doesn't allow parallel tool calls.
  • The grammar enforces a strict argument order that these models are not well trained to follow (even the 397B).

Currently, common_chat_params_init_qwen3_coder_xml from the main branch here should have good support for Q3C / Q3CN / Q3.5 / Step-3.5-Flash, but I don't think we support nemotron nano 3, so I haven't done any testing with it.

@hksdpc255
Contributor

@sayap The grammar should allow parallel tool calls. Are there any bugs?

auto tool_call_once = builder.add_rule("root-tool-call-once", string_join(tool_rules, " | "));
auto tool_call_more = builder.add_rule("root-tool-call-more", gbnf_format_literal(form.tool_end) + " " + tool_call_once);
auto call_end = builder.add_rule("root-call-end", form.last_tool_end ? gbnf_format_literal(*form.last_tool_end) : gbnf_format_literal(form.tool_end));
auto tool_call_multiple_with_end = builder.add_rule("root-tool-call-multiple-with-end", tool_call_once + " " + tool_call_more + "* " + call_end);
builder.add_rule("root",
    (form.scope_start.empty() ? "" : gbnf_format_literal(form.scope_start) + " ") +
    tool_call_multiple_with_end + "?" +
    (form.scope_end.empty() ? "" : " " + gbnf_format_literal(form.scope_end))
);
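For orientation, assuming form.tool_end is `</tool_call>`, no last_tool_end, and empty scope markers, the rules above expand to roughly this GBNF shape (an illustrative sketch, not the literal generated grammar):

```
root                             ::= root-tool-call-multiple-with-end?
root-tool-call-multiple-with-end ::= root-tool-call-once root-tool-call-more* root-call-end
root-tool-call-once              ::= tool-rule-1 | tool-rule-2 | ...
root-tool-call-more              ::= "</tool_call>" root-tool-call-once
root-call-end                    ::= "</tool_call>"
```

So `root-tool-call-more*` is what permits repeated calls; a non-empty scope_start/scope_end would wrap the whole sequence once rather than each call.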

@firecoperana
Collaborator Author

firecoperana commented Mar 6, 2026

I just reverted it back to common_chat_params_init_qwen3_coder_xml.

With common_chat_params_init_qwen3_coder_xml, I hit an exception in regex_to_reversed_partial_regex during debug, but common_chat_params_init_qwen3_coder does not have this error. It's hidden in the release build, though. Can you check whether this is expected?

I tested with the message below with Qwen3 Coder Next:

{
  "messages": [
    {"role": "system", "content": "You are a chatbot that uses tools/functions. Dont overthink things."},
    {"role": "user", "content": "What is the weather in Istanbul?"}
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "description": "Get the current weather in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and country/state, e.g. San Francisco, CA, or Paris, France"
          }
        },
        "required": ["location"]
      }
    }
  }]
}

@sayap
Contributor

sayap commented Mar 6, 2026

@hksdpc255 In #1300, I had to change scope_start and scope_end to empty string, to allow parallel tool calls. Otherwise, the grammar would only allow one <tool_call>...</tool_call>

@bitraft

bitraft commented Mar 6, 2026

The 27B and 122B crash in non-think mode; the 35B works.

args:

--no-mmap --mlock  --jinja --chat-template-kwargs {"enable_thinking": false } --reasoning-budget 0 --reasoning-format none --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 -c 65536 -np 1 -ub 4096 -b 4096 -ctk bf16 -ctv bf16  --no-display-prompt -co --webui llamacpp -vq --check-tensors --context-shift 0 -cram 0

logs:

Decode process is cancelled by user.                                                                                                                                                                                                                                  
INFO [           release_slots] slot released | tid="139706766753152" timestamp=1772793865 id_slot=0 id_task=125 n_ctx=65536 n_past=8192 n_system_tokens=0 n_cache_tokens=8192 truncated=false                                                                        
INFO [              slots_idle] all slots are idle | tid="139706766753152" timestamp=1772793865                                                                                                                                                                       
======== Prompt cache: cache size: 8192, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.42, cache_ram_similarity: 0.50                                                                                                                               
INFO [   launch_slot_with_task] slot is processing task | tid="139706766753152" timestamp=1772793865 id_slot=0 id_task=128                                                                                                                                            
======== Cache: cache_size = 8192, n_past0 =  3457, n_past1 =  3457, n_past_prompt1 = 3457,  n_past2 =  3457, n_past_prompt2 =  3457                                                                                                                                  
Common part does not match fully                                                                                                                                                                                                                                      
cache :  via channel plugins. Supports actions: send, broadcast, poll, react, delete, edit, topic-create.", "parameters": {"type": "object                                                                                                                            
prompt:  via channel plugins. Current channel (telegram) supports: delete, edit, poll, react, send, topic-create.", "parameters": {"type":                                                                                                                            
slot apply_checkp: id  0 | task 128 | n_past = 3457, slot.prompt.tokens.size() = 8192, seq_id = 0, pos_min = 8191                                                                                                                                                     
slot apply_checkp: id  0 | task 128 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)                                        
slot apply_checkp: id  0 | task 128 | erased invalidated context checkpoint (pos_min = 4095, pos_max = 4095, size = 149.659 MiB)                                                                                                                                      
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793865 id_slot=0 id_task=128 p0=0                                                                                                                                         
slot create_check: id  0 | task 128 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 20.80 ms)                                                                                                                           
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793894 id_slot=0 id_task=128 p0=4096                                                                                                                                      
slot create_check: id  0 | task 128 | created context checkpoint 2 of 8 (pos_min = 8191, pos_max = 8191, size = 149.690 MiB, took 21.55 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793923 id_slot=0 id_task=128 p0=8192                                                                                                                                      
slot create_check: id  0 | task 128 | created context checkpoint 3 of 8 (pos_min = 12287, pos_max = 12287, size = 149.721 MiB, took 21.26 ms)                                                                                                                         
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793955 id_slot=0 id_task=128 p0=12288                                                                                                                                     
slot create_check: id  0 | task 128 | created context checkpoint 4 of 8 (pos_min = 13905, pos_max = 13905, size = 149.734 MiB, took 21.61 ms)                                                                                                                         
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793968 id_slot=0 id_task=128 p0=13906  

ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed

....

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f100441fc17 in __GI___wait4 (pid=1321516, stat_loc=0x7fff0c47ce5c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f100441fc17 in __GI___wait4 (pid=1321516, stat_loc=0x7fff0c47ce5c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007f100441fb97 in __GI___waitpid (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>) at ./posix/waitpid.c:38
38      ./posix/waitpid.c: No such file or directory.
#2  0x000055b827e31fe5 in ggml_print_backtrace () at ik_llama.cpp/ggml/src/ggml.c:236
236             waitpid(pid, &wstatus, 0);
#3  0x000055b827e31f98 in ggml_abort (file=0x55b82779eae1 "ik_llama.cpp/src/llama-sampling.cpp", line=733, fmt=0x55b8277afaec <str.3.llvm> "GGML_ASSERT(%s) failed") at ik_llama.cpp/ggml/src/ggml.c:263
263         ggml_print_backtrace();
#4  0x000055b827d18ae5 in llama_sample_token_with_rng_impl (smpl=0x55b82cb761f8, candidates=<optimized out>, rng=...) at ik_llama.cpp/src/llama-sampling.cpp:733
733         GGML_ASSERT(iter != probs.end());
#5  0x000055b827bc0de2 in llama_sampling_sample_impl (ctx_sampling=0x55b82cdbf100, ctx_main=0x55b82cb76140, ctx_cfg=0x0, idx=0, is_resampling=false) at ik_llama.cpp/common/sampling.cpp:435
435                 id = llama_sample_token_with_rng(ctx_main, &cur_p, ctx_sampling->rng);
#6  0x000055b827ab4907 in server_context::process_batch_tokens (this=this@entry=0x7fff0c47f9f0, n_batch=@0x7fff0c47d610: 4096) at ik_llama.cpp/examples/server/server-context.cpp:3456
3456                const llama_token id = common_sampler_sample(slot.ctx_sampling, ctx, tok_idx);
#7  0x000055b827ab6537 in server_context::update_slots (this=0x7fff0c47f9f0) at ik_llama.cpp/examples/server/server-context.cpp:3582
3582        process_batch_tokens(n_batch); // Decode with batch
#8  0x000055b827a24580 in std::__1::__function::__value_func<void ()>::operator()[abi:ne210108]() const (this=0x7fff0c480a50) at ../include/c++/v1/__functional/function.h:274
274         return (*__f_)(std::forward<_ArgTypes>(__args)...);
#9  std::__1::function<void ()>::operator()() const (this=0x7fff0c480a50) at ../include/c++/v1/__functional/function.h:772
772       return __f_(std::forward<_ArgTypes>(__arg)...);
#10 server_queue::start_loop (this=0x7fff0c480910) at ik_llama.cpp/examples/server/server-queue.cpp:133
133             callback_update_slots();
#11 0x000055b8278f5cc6 in main (argc=<optimized out>, argv=<optimized out>) at ik_llama.cpp/examples/server/server.cpp:2139
2139        ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 1312989) detached]
Aborted

@ikawrakow
Owner

I'm not following the PEG progress in mainline, but my impression was that it was still immature. Has that changed?

@firecoperana
Collaborator Author

firecoperana commented Mar 6, 2026

Not sure about the status either, but PEG is not going away. Mainline has an auto parser PR that is about to be merged, which relies on PEG. It's also a complete refactor, so I wanted to create this PR to get us up to date before that large refactor lands. This PR and the future port of the auto parser PR can stay on a separate branch until it's mature enough, though.

common_chat_params_init_qwen3_coder uses the PEG parser.
common_chat_params_init_qwen3_coder_xml uses the old XML parser.

For people who want to test: by default, the PEG parser is not used for qwen3 coder/qwen3.5. Change common_chat_params_init_qwen3_coder_xml to common_chat_params_init_qwen3_coder to test the PEG parser that is used in mainline.

@bitraft Try -cuda fa-offset=2

@firecoperana
Collaborator Author

firecoperana commented Mar 6, 2026

@hksdpc255 In #1300, I had to change scope_start and scope_end to empty string, to allow parallel tool calls. Otherwise, the grammar would only allow one <tool_call>...</tool_call>

In #1365, I fixed a bug where the trigger pattern in the grammar only accepted one trigger. Could it be related to this bug?

@bitraft

bitraft commented Mar 6, 2026

@bitraft Try -cuda fa-offset=2

I get this error without this patch, and my setup is CPU-only.

@ikawrakow
Owner

@bitraft

Can you share model/command line/etc.?

@bitraft

bitraft commented Mar 6, 2026

Can you share model/command line/etc.?

The CPU is a 9454P. Qwen3.5-27B and Qwen3.5-122B-A10B give the same error; the 35B is OK.

llama-server  -m Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf --threads 48 --threads-batch 48  --no-mmap --mlock  --jinja --chat-template-kwargs '{"enable_thinking":false}' --reasoning-budget 0 --reasoning-format none --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 -c 65536 -np 1 -ub 4096 -b 4096 -ctk bf16 -ctv bf16  --no-display-prompt -co --webui llamacpp -vq --check-tensors --context-shift 0 -cram 0

The Q4_K_M.gguf gives the same error.

@ikawrakow
Owner

@bitraft

I still don't know enough, you are very stingy with details.

But to make sure, I downloaded the UD-Q4_K_XL quantization of Qwen3.5-27B from Unsloth, and it works perfectly fine.
I'm not going to also download the 122B model as there is no reason to believe it wouldn't work.

@firecoperana
Collaborator Author

ggml-org/llama.cpp#20171 seems to address the issue with the order of multiple arguments in tool calls for qwen3.5 models.

@sayap
Contributor

sayap commented Mar 6, 2026

I just tested ggml-org/llama.cpp#20171, and indeed it fixes the argument-ordering issue. It is better than my quick fix in #1352:

  • It allows a loose argument order for all models. There is no good reason why a strict argument order is needed in the first place.
  • It still enforces that the required arguments must be set.
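The loose-order behavior can be sketched as a validator that ignores ordering while still checking the required set (a hypothetical Python helper, `check_args`, not the actual grammar change in #20171):

```python
def check_args(args: dict, required: list, properties: dict) -> bool:
    # Loose-order sketch: accept arguments in any order, but every
    # required key must be present and every provided key must be a
    # known property of the tool's JSON schema.
    return (all(k in args for k in required)
            and all(k in properties for k in args))
```

In grammar terms this corresponds to allowing any permutation of the argument rules instead of one fixed sequence, while still rejecting outputs that omit a required argument.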

However, it still doesn't allow parallel tool calls for Q3C / Q3CN / Qwen3.5.

@hksdpc255 In #1300, I had to change scope_start and scope_end to empty string, to allow parallel tool calls. Otherwise, the grammar would only allow one <tool_call>...</tool_call>

In #1365, I fixed a bug that trigger pattern in grammar only accepted one trigger. Could it be related to this bug?

Don't think it is related. The thing with the template used by Q3C / Q3CN / Qwen3.5 is that there is no definitive scope_end. For parallel tool calls, it will just be one <tool_call>...</tool_call> followed by either another <tool_call>...</tool_call> or an EOT.

@firecoperana
Collaborator Author

firecoperana commented Mar 6, 2026

I just ported the whole autoparser in #1376.
You can submit a follow-up PR to address the parallel tool calls issue.

… and new jinja template engine

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>

common : add nemotron 3 parsing (#18077)

common : add parser for ministral/mistral large 3/devstral 2 (#17713)

common : default content to an empty string (#18485)

chat: make tool description and parameters optional per OpenAI spec (#18478)

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667
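The fix amounts to defaulting the optional fields instead of raising; a minimal Python sketch of that behavior (the helper name `parse_tool_function` is hypothetical, not the actual parser code):

```python
def parse_tool_function(fn: dict) -> dict:
    # Per the OpenAI tool spec, 'description' and 'parameters' are
    # optional; default them rather than throwing when absent.
    if "name" not in fn:
        raise ValueError("tool function requires a name")
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "parameters": fn.get("parameters", {"type": "object", "properties": {}}),
    }
```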

common : implement new jinja template engine (#18462)
---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

jinja: correct member access rule (#18905)

jinja : fix lexing of float literals with sign (#18901)

jinja : add missing tojson filter for bool (#18900)

jinja : attribute support for join, map and sort (#18883)

jinja : fix object item order (and properly implement dictsort) (#18904)

tests : add test-jinja -py option for cross-checking (#18906)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : run test-jinja -py on high perf [no ci] (#18916)

jinja : fix undefined keys and attributes and int/float as bool (#18924)

jinja: support none|string (#18995)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

jinja : implement mixed type object keys (#18955)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)

`tojson` is not a supported `undefined` filter

keep it DRY and fix some types

jinja : do not pass empty tools and add some none filters (#19176)

jinja : add unordered_map include to value.h [no ci] (#19205)

jinja : add missing 'in' test to template engine (#19004) (#19239)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Add Jinja support for "indent" string filter (#19529)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

add vendor

refactor chat

server : support preserving reasoning_content in assistant message (#18994)

chat : fix translategemma crash on common_chat_format_example (#19019)

chat: fix language input for translategemma (#19052)

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>

chat: fix case where template accepts type content only (#19419)

mtmd : chat : Fix extra \n between text and media marker (#19595)

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.

However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.

PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
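The newline-avoidance logic described above can be sketched like this (hypothetical helper `join_oai_parts`; not the actual to_json_oaicompat code):

```python
MEDIA_MARKER = "<__media__>"

def join_oai_parts(parts):
    # Join text parts with "\n", but never insert a newline directly
    # before or after a media marker, so the token count matches the
    # HF implementation.
    out = []
    for part in parts:
        if out and part != MEDIA_MARKER and out[-1] != MEDIA_MARKER:
            out.append("\n")
        out.append(part)
    return "".join(out)
```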

common : merge qwen3-coder and nemotron nano 3 parsers (#19765)

common : fix improper trimming in XML parser on complete message (#19805)

Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>

jinja: correct stats for tojson and string filters (#19785)

jinja : correct default size for string slices (#19913)

common : handle unicode during partial json parsing (#16526)

common : fix json schema with '\' in literals (#17307)

add back qwen_coder_xml and mirothinker
@sayap
Contributor

sayap commented Mar 7, 2026

@firecoperana I think we should fix parallel tool calls in the autoparser before merging it? Otherwise it will be a regression.

https://gist.github.com/sayap/33550a28a18f29081dcc0b832f675c48 - You can use this gist for reference. The key is to make sure that all the logprobs during tool calls are close to 0.
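That check can be sketched as a small filter over per-token logprobs from the server response (hypothetical helper; the -1.0 threshold is an arbitrary assumption):

```python
def suspicious_tokens(logprobs, threshold=-1.0):
    # Grammar-forced tokens the model "agrees" with have logprob near 0
    # (probability near 1). Strongly negative logprobs inside a tool
    # call suggest the grammar is forcing tokens the model did not
    # intend to emit, e.g. disallowed parallel calls or argument order.
    return [(i, lp) for i, lp in enumerate(logprobs) if lp < threshold]
```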

@firecoperana
Collaborator Author

That's fine with me. Do you want to write it? I'm not familiar with tool calls.

@firecoperana
Collaborator Author

Did you find any regressions with this PR?

@bitraft

bitraft commented Mar 7, 2026

I still don't know enough, you are very stingy with details.

But to make sure, I downloaded the UD-Q4_K_XL quantization of Qwen3.5-27B from Unsloth, and it works perfectly fine. I'm not going to also download the 122B model as there is no reason to believe it wouldn't work.

Here are the full logs, triggered with https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF/blob/main/Qwen_Qwen3.5-27B-Q2_K.gguf and the openclaw /reset command:
llama-server  -m bartowski/Qwen_Qwen3.5-27B-Q2_K.gguf --threads 40 --threads-batch 40  --no-mmap --mlock  --jinja --chat-template-kwargs '{"enable_thinking":false}' --reasoning-budget 0 --reasoning-format none --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 -c 65536 -np 1 -ub 4096 -b 4096 -ctk bf16 -ctv bf16  --no-display-prompt -co --webui llamacpp -vq --check-tensors --context-shift 0 -cram 0
INFO [                    main] build info | tid="140147593637248" timestamp=1772845651 build=4264 commit="277fc1d2"
INFO [                    main] system info | tid="140147593637248" timestamp=1772845651 n_threads=40 n_threads_batch=40 total_threads=96 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
CPU: using device CPU - 0 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 851 tensors from bartowski/Qwen_Qwen3.5-27B-Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.600000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.5 27B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.5
llama_model_loader: - kv   7:                         general.size_label str              = 27B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-2...
llama_model_loader: - kv  10:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  11:                         qwen35.block_count u32              = 64
llama_model_loader: - kv  12:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  13:                    qwen35.embedding_length u32              = 5120
llama_model_loader: - kv  14:                 qwen35.feed_forward_length u32              = 17408
llama_model_loader: - kv  15:                qwen35.attention.head_count u32              = 24
llama_model_loader: - kv  16:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  17:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  18:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  19:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  21:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  22:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  23:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  24:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  25:                  qwen35.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  26:                      qwen35.ssm.inner_size u32              = 6144
llama_model_loader: - kv  27:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  28:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 10
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = /models_out/Qwen3.5-27B-GGUF/Qwen_Qwe...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
llama_model_loader: - kv  42:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count u32              = 802
llama_model_loader: - type  f32:  449 tensors
llama_model_loader: - type q8_0:   24 tensors
llama_model_loader: - type q2_K:  281 tensors
llama_model_loader: - type q3_K:   80 tensors
llama_model_loader: - type q4_K:   16 tensors
llama_model_loader: - type q6_K:    1 tensors
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen35
llm_load_print_meta: n_ctx_train      = 262144
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 17408
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 40
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 262144
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: mrope sections   = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv       = 4
llm_load_print_meta: ssm_d_inner      = 6144
llm_load_print_meta: ssm_d_state      = 128
llm_load_print_meta: ssm_dt_rank      = 48
llm_load_print_meta: ssm_n_group      = 16
llm_load_print_meta: model type       = 27B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 26.896 B
llm_load_print_meta: model size       = 10.013 GiB (3.198 BPW) 
llm_load_print_meta: repeating layers = 8.654 GiB (3.052 BPW, 24.353 B parameters)
llm_load_print_meta: general.name     = Qwen3.5 27B
print_info: vocab type       = BPE
print_info: n_vocab          = 248320
print_info: n_merges         = 247587
print_info: BOS token        = 11 ','
print_info: EOS token        = 248046 '<|im_end|>'
print_info: EOT token        = 248046 '<|im_end|>'
print_info: PAD token        = 248044 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 248060 '<|fim_prefix|>'
print_info: FIM SUF token    = 248062 '<|fim_suffix|>'
print_info: FIM MID token    = 248061 '<|fim_middle|>'
print_info: FIM PAD token    = 248063 '<|fim_pad|>'
print_info: FIM REP token    = 248064 '<|repo_name|>'
print_info: FIM SEP token    = 248065 '<|file_sep|>'
print_info: EOG token        = 248044 '<|endoftext|>'
print_info: EOG token        = 248046 '<|im_end|>'
print_info: EOG token        = 248063 '<|fim_pad|>'
print_info: EOG token        = 248064 '<|repo_name|>'
print_info: EOG token        = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size =    0.37 MiB
llm_load_tensors:        CPU buffer size = 10253.82 MiB
.........................................................................................
llama_init_from_model: n_ctx         = 65536
llama_init_from_model: n_batch       = 4096
llama_init_from_model: n_ubatch      = 4096
llama_init_from_model: flash_attn    = 1
llama_init_from_model: attn_max_b    = 0
llama_init_from_model: fused_moe     = 1
llama_init_from_model: grouped er    = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad    = 1
llama_init_from_model: rope_cache    = 0
llama_init_from_model: graph_reuse   = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type   = f16
llama_init_from_model: sched_async   = 0
llama_init_from_model: ser           = -1, 0
llama_init_from_model: freq_base     = 10000000.0
llama_init_from_model: freq_scale    = 1
llama_kv_cache_init:        CPU KV buffer size =  4245.63 MiB
llama_init_from_model: KV self size  = 4096.00 MiB, K (bf16): 2048.00 MiB, V (bf16): 2048.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.95 MiB
llama_init_from_model:        CPU compute buffer size =  3960.00 MiB
llama_init_from_model: graph nodes  = 3269
llama_init_from_model: graph splits = 1
llama_init_from_model: enabling only_active_experts scheduling
======================================= HAVE_FANCY_SIMD is defined
INFO [                    init] initializing slots | tid="140147593637248" timestamp=1772845653 n_slots=1
srv          init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
INFO [                    init] new slot | tid="140147593637248" timestamp=1772845653 id_slot=0 n_ctx_slot=65536
no implementations specified for speculative decoding
slot         init: id  0 | task -1 | speculative decoding context not initialized
prompt cache is disabled - use `--cache-ram N` to enable it
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
INFO [                    main] model loaded | tid="140147593637248" timestamp=1772845654
srv          init: init: chat template, thinking = 0
INFO [                    main] HTTP server listening | tid="140147593637248" timestamp=1772845654 port="1024" n_threads_http="95" hostname="192.0.0.1"
INFO [              slots_idle] all slots are idle | tid="140147593637248" timestamp=1772845654
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="140147593637248" timestamp=1772845672 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845672 id_slot=0 id_task=0 p0=0
INFO [      log_server_request] request | tid="140125019256512" timestamp=1772845687 remote_addr="192.0.0.3" remote_port=43884 status=200 method="POST" path="/v1/chat/completions" params={}
srv          stop: cancel task, id_task = 0
slot create_check: id  0 | task 0 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 20.61 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=0 p0=4096
llama_decode: failed to decode, ret = -3
Decode process is cancelled by user.
INFO [           release_slots] slot released | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=0 n_ctx=65536 n_past=8192 n_system_tokens=0 n_cache_tokens=8192 truncated=false
INFO [              slots_idle] all slots are idle | tid="140147593637248" timestamp=1772845690
======== Prompt cache: cache size: 8192, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.42, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=3
======== Cache: cache_size = 8192, n_past0 =  3457, n_past1 =  3457, n_past_prompt1 = 3457,  n_past2 =  3457, n_past_prompt2 =  3457
Common part does not match fully
cache :  via channel plugins. Supports actions: send, broadcast, poll, react, delete, edit, topic-create.", "parameters": {"type": "object
prompt:  via channel plugins. Current channel (telegram) supports: delete, edit, poll, react, send, topic-create.", "parameters": {"type":
slot apply_checkp: id  0 | task 3 | n_past = 3457, slot.prompt.tokens.size() = 8192, seq_id = 0, pos_min = 8191
slot apply_checkp: id  0 | task 3 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot apply_checkp: id  0 | task 3 | erased invalidated context checkpoint (pos_min = 4095, pos_max = 4095, size = 149.659 MiB)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=3 p0=0
slot create_check: id  0 | task 3 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 20.84 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845709 id_slot=0 id_task=3 p0=4096
slot create_check: id  0 | task 3 | created context checkpoint 2 of 8 (pos_min = 8191, pos_max = 8191, size = 149.690 MiB, took 20.92 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845730 id_slot=0 id_task=3 p0=8192
slot create_check: id  0 | task 3 | created context checkpoint 3 of 8 (pos_min = 12287, pos_max = 12287, size = 149.721 MiB, took 20.79 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845752 id_slot=0 id_task=3 p0=12288
slot create_check: id  0 | task 3 | created context checkpoint 4 of 8 (pos_min = 13905, pos_max = 13905, size = 149.734 MiB, took 21.64 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845761 id_slot=0 id_task=3 p0=13906
ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
[New LWP 3160176]
[New LWP 3160177]
[New LWP 3160178]
[New LWP 3160179]
[New LWP 3160180]
[New LWP 3160181]
[New LWP 3160182]
[New LWP 3160183]
[New LWP 3160184]
[New LWP 3160185]
[New LWP 3160186]
[New LWP 3160187]
[New LWP 3160188]
[New LWP 3160189]
[New LWP 3160190]
[New LWP 3160191]
[New LWP 3160192]
[New LWP 3160193]
[New LWP 3160194]
[New LWP 3160195]
[New LWP 3160196]
[New LWP 3160197]
[New LWP 3160198]
[New LWP 3160199]
[New LWP 3160200]
[New LWP 3160201]
[New LWP 3160202]
[New LWP 3160203]
[New LWP 3160204]
[New LWP 3160205]
[New LWP 3160206]
[New LWP 3160207]
[New LWP 3160208]
[New LWP 3160209]
[New LWP 3160210]
[New LWP 3160211]
[New LWP 3160212]
[New LWP 3160213]
[New LWP 3160214]
[New LWP 3160216]
[New LWP 3160282]
[New LWP 3160283]
[New LWP 3160284]
[New LWP 3160285]
[New LWP 3160286]
[New LWP 3160287]
[New LWP 3160288]
[New LWP 3160289]
[New LWP 3160290]
[New LWP 3160291]
[New LWP 3160292]
[New LWP 3160293]
[New LWP 3160294]
[New LWP 3160295]
[New LWP 3160296]
[New LWP 3160297]
[New LWP 3160298]
[New LWP 3160299]
[New LWP 3160300]
[New LWP 3160301]
[New LWP 3160302]
[New LWP 3160303]
[New LWP 3160304]
[New LWP 3160305]
[New LWP 3160306]
[New LWP 3160307]
[New LWP 3160308]
[New LWP 3160309]
[New LWP 3160310]
[New LWP 3160311]
[New LWP 3160312]
[New LWP 3160313]
[New LWP 3160314]
[New LWP 3160315]
[New LWP 3160316]
[New LWP 3160317]
[New LWP 3160318]
[New LWP 3160319]
[New LWP 3160320]
[New LWP 3160321]
[New LWP 3160322]
[New LWP 3160323]
[New LWP 3160324]
[New LWP 3160325]
[New LWP 3160326]
[New LWP 3160327]
[New LWP 3160328]
[New LWP 3160329]
[New LWP 3160330]
[New LWP 3160331]
[New LWP 3160332]
[New LWP 3160333]
[New LWP 3160334]
[New LWP 3160335]
[New LWP 3160336]
[New LWP 3160337]
[New LWP 3160338]
[New LWP 3160339]
[New LWP 3160340]
[New LWP 3160341]
[New LWP 3160342]
[New LWP 3160343]
[New LWP 3160344]
[New LWP 3160345]
[New LWP 3160346]
[New LWP 3160347]
[New LWP 3160348]
[New LWP 3160349]
[New LWP 3160350]
[New LWP 3160351]
[New LWP 3160352]
[New LWP 3160353]
[New LWP 3160354]
[New LWP 3160355]
[New LWP 3160356]
[New LWP 3160357]
[New LWP 3160358]
[New LWP 3160359]
[New LWP 3160360]
[New LWP 3160361]
[New LWP 3160362]
[New LWP 3160363]
[New LWP 3160364]
[New LWP 3160365]
[New LWP 3160366]
[New LWP 3160367]
[New LWP 3160368]
[New LWP 3160369]
[New LWP 3160370]
[New LWP 3160371]
[New LWP 3160372]
[New LWP 3160373]
[New LWP 3160374]
[New LWP 3160375]
[New LWP 3160376]
[New LWP 3160377]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f76a7966c17 in __GI___wait4 (pid=3163505, stat_loc=0x7ffec57ad44c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f76a7966c17 in __GI___wait4 (pid=3163505, stat_loc=0x7ffec57ad44c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007f76a7966b97 in __GI___waitpid (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>) at ./posix/waitpid.c:38
38	./posix/waitpid.c: No such file or directory.
#2  0x00005628de190f25 in ggml_print_backtrace () at ik_llama.cpp/ggml/src/ggml.c:236
236	        waitpid(pid, &wstatus, 0);
#3  0x00005628de190ed8 in ggml_abort (file=0x5628ddb4d275 "ik_llama.cpp/src/llama-sampling.cpp", line=733, fmt=0x5628ddb5ddcc <str.3.llvm> "GGML_ASSERT(%s) failed") at ik_llama.cpp/ggml/src/ggml.c:263
263	    ggml_print_backtrace();
#4  0x00005628de077c45 in llama_sample_token_with_rng_impl (smpl=0x5628e4148498, candidates=<optimized out>, rng=...) at ik_llama.cpp/src/llama-sampling.cpp:733
733	    GGML_ASSERT(iter != probs.end());
#5  0x00005628dde379d2 in llama_sampling_sample_impl (ctx_sampling=0x5628e84c96d0, ctx_main=0x5628e41483e0, ctx_cfg=0x0, idx=0, is_resampling=false) at ik_llama.cpp/common/sampling.cpp:435
435	            id = llama_sample_token_with_rng(ctx_main, &cur_p, ctx_sampling->rng);
#6  0x00005628ddde09b5 in server_context::process_batch_tokens (this=this@entry=0x7ffec57b00d0, n_batch=@0x7ffec57adbf0: 4096) at ik_llama.cpp/examples/server/server-context.cpp:3525
3525	            const llama_token id = common_sampler_sample(slot.ctx_sampling, ctx, tok_idx);
#7  0x00005628ddde218a in server_context::update_slots (this=0x7ffec57b00d0) at ik_llama.cpp/examples/server/server-context.cpp:3651
3651	    process_batch_tokens(n_batch); // Decode with batch
#8  0x00005628ddd7ef34 in std::__1::__function::__value_func<void ()>::operator()[abi:ne210108]() const (this=0x7ffec57b1130) at ../include/c++/v1/__functional/function.h:274
274	    return (*__f_)(std::forward<_ArgTypes>(__args)...);
#9  std::__1::function<void ()>::operator()() const (this=0x7ffec57b1130) at ../include/c++/v1/__functional/function.h:772
772	  return __f_(std::forward<_ArgTypes>(__arg)...);
#10 server_queue::start_loop (this=0x7ffec57b0ff0) at ik_llama.cpp/examples/server/server-queue.cpp:133
133	        callback_update_slots();
#11 0x00005628ddcb3781 in main (argc=<optimized out>, argv=<optimized out>) at ik_llama.cpp/examples/server/server.cpp:2143
2143	    ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 3159265) detached]
Aborted

You can try with a simple input like "write a good story"; I get a crash every time with this.

@sayap
Contributor

sayap commented Mar 7, 2026

I am still trying to understand whether parallel tool calls are actually supported by the autoparser.

Tested these combinations using the current mainline master branch:

| model name | trained to do parallel tool calls | supports_parallel_calls: true from autoparser check | can make parallel tool calls |
| --- | --- | --- | --- |
| Qwen3.5 397B A17B | yes | yes | no |
| Qwen3.5 27B | yes | yes | no |
| GLM-4.7-Flash | yes | yes | no |

The thinking is that if I can find a yes/yes/yes combination, then it will be easier to start from there.

@sayap
Contributor

sayap commented Mar 7, 2026

Ah my bad. I just gathered that the original intention since ggml-org/llama.cpp@8b576b6c55 is to enable parallel tool calls explicitly via a request parameter:

inputs.parallel_tool_calls = json_value(body, "parallel_tool_calls", false);

I wrongly assumed that it was auto-enabled. After setting this request parameter, I got yes/yes/yes for the combinations above 👍
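Since the flag defaults to false on the server side, clients have to opt in per request. Below is a minimal sketch of that defaulting behavior; the helper name and the request body shape are illustrative (modeled on the OpenAI-compatible `/v1/chat/completions` schema), not code from this repository:

```python
import json

def get_parallel_tool_calls(body: dict) -> bool:
    # Python analogue of the C++ json_value(body, "parallel_tool_calls", false):
    # an absent key falls back to false, so parallel tool calls stay disabled
    # unless the client asks for them explicitly.
    return bool(body.get("parallel_tool_calls", False))

# Illustrative request body that opts in to parallel tool calls.
request = json.loads("""
{
  "model": "qwen3.5",
  "messages": [{"role": "user", "content": "check the weather and the time"}],
  "parallel_tool_calls": true
}
""")

print(get_parallel_tool_calls(request))         # True: client opted in
print(get_parallel_tool_calls({"model": "x"}))  # False: key absent, default wins
```

This matches the yes/yes/no results above: the grammar supports parallel calls, but they only activate when the request carries the flag.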

@bitraft

bitraft commented Mar 7, 2026

I still don't know enough; you are very stingy with details.
But to make sure, I downloaded the UD-Q4_K_XL quantization of Qwen3.5-27B from Unsloth, and it works perfectly fine. I'm not going to also download the 122B model, as there is no reason to believe it wouldn't work.

You can try with a simple input like "write a good story"; I get a crash every time with this.

Hi @ikawrakow, here is the full log. Please let me know if you need any additional information to locate the issue.

full log
llama-server  -m bartowski/Qwen_Qwen3.5-27B-Q2_K.gguf --threads 40 --threads-batch 40  --no-mmap --mlock  --jinja --chat-template-kwargs '{"enable_thinking":false}' --reasoning-budget 0 --reasoning-format none --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 -c 65536 -np 1 -ub 4096 -b 4096 -ctk bf16 -ctv bf16  --no-display-prompt -co --webui llamacpp -vq --check-tensors --context-shift 0 -cram 0
INFO [                    main] build info | tid="140147593637248" timestamp=1772845651 build=4264 commit="277fc1d2"
INFO [                    main] system info | tid="140147593637248" timestamp=1772845651 n_threads=40 n_threads_batch=40 total_threads=96 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
CPU: using device CPU - 0 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 851 tensors from bartowski/Qwen_Qwen3.5-27B-Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.600000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.5 27B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.5
llama_model_loader: - kv   7:                         general.size_label str              = 27B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-2...
llama_model_loader: - kv  10:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  11:                         qwen35.block_count u32              = 64
llama_model_loader: - kv  12:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  13:                    qwen35.embedding_length u32              = 5120
llama_model_loader: - kv  14:                 qwen35.feed_forward_length u32              = 17408
llama_model_loader: - kv  15:                qwen35.attention.head_count u32              = 24
llama_model_loader: - kv  16:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  17:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  18:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  19:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  21:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  22:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  23:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  24:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  25:                  qwen35.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  26:                      qwen35.ssm.inner_size u32              = 6144
llama_model_loader: - kv  27:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  28:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 10
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = /models_out/Qwen3.5-27B-GGUF/Qwen_Qwe...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
llama_model_loader: - kv  42:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count u32              = 802
llama_model_loader: - type  f32:  449 tensors
llama_model_loader: - type q8_0:   24 tensors
llama_model_loader: - type q2_K:  281 tensors
llama_model_loader: - type q3_K:   80 tensors
llama_model_loader: - type q4_K:   16 tensors
llama_model_loader: - type q6_K:    1 tensors
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen35
llm_load_print_meta: n_ctx_train      = 262144
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
772	  return __f_(std::forward<_ArgTypes>(__arg)...);
#10 server_queue::start_loop (this=0x7ffec57b0ff0) at ik_llama.cpp/examples/server/server-queue.cpp:133
133	        callback_update_slots();
#11 0x00005628ddcb3781 in main (argc=<optimized out>, argv=<optimized out>) at ik_llama.cpp/examples/server/server.cpp:2143
2143	    ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 3159265) detached]
Aborted

@ikawrakow ikawrakow merged commit ab1d740 into main Mar 9, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 10, 2026
… parsing and new jinja template engine (ikawrakow#1369)"

This reverts commit ab1d740.