Add PEG parser and new jinja template engine#1369
Conversation
|
Just tested the PR branch, and it has the same 2 issues as the mainline autoparser branch that impact Q3C / Q3CN / Q3.5 (#1352 (comment)):
Currently, |
|
@sayap The grammar should allow parallel tool calls. Are there any bugs? ik_llama.cpp/common/chat-parser-xml-toolcall.cpp Lines 301 to 309 in 1ef4b5e |
|
I just reverted it back to common_chat_params_init_qwen3_coder_xml. I tested with the message below with Qwen3 Coder Next: |
|
@hksdpc255 In #1300, I had to change |
|
27b, 122b args: --no-mmap --mlock --jinja --chat-template-kwargs '{"enable_thinking": false}' --reasoning-budget 0 --reasoning-format none --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 -c 65536 -np 1 -ub 4096 -b 4096 -ctk bf16 -ctv bf16 --no-display-prompt -co --webui llamacpp -vq --check-tensors --context-shift 0 -cram 0
logs:
Decode process is cancelled by user.
INFO [ release_slots] slot released | tid="139706766753152" timestamp=1772793865 id_slot=0 id_task=125 n_ctx=65536 n_past=8192 n_system_tokens=0 n_cache_tokens=8192 truncated=false
INFO [ slots_idle] all slots are idle | tid="139706766753152" timestamp=1772793865
======== Prompt cache: cache size: 8192, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.42, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="139706766753152" timestamp=1772793865 id_slot=0 id_task=128
======== Cache: cache_size = 8192, n_past0 = 3457, n_past1 = 3457, n_past_prompt1 = 3457, n_past2 = 3457, n_past_prompt2 = 3457
Common part does not match fully
cache : via channel plugins. Supports actions: send, broadcast, poll, react, delete, edit, topic-create.", "parameters": {"type": "object
prompt: via channel plugins. Current channel (telegram) supports: delete, edit, poll, react, send, topic-create.", "parameters": {"type":
slot apply_checkp: id 0 | task 128 | n_past = 3457, slot.prompt.tokens.size() = 8192, seq_id = 0, pos_min = 8191
slot apply_checkp: id 0 | task 128 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot apply_checkp: id 0 | task 128 | erased invalidated context checkpoint (pos_min = 4095, pos_max = 4095, size = 149.659 MiB)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793865 id_slot=0 id_task=128 p0=0
slot create_check: id 0 | task 128 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 20.80 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793894 id_slot=0 id_task=128 p0=4096
slot create_check: id 0 | task 128 | created context checkpoint 2 of 8 (pos_min = 8191, pos_max = 8191, size = 149.690 MiB, took 21.55 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793923 id_slot=0 id_task=128 p0=8192
slot create_check: id 0 | task 128 | created context checkpoint 3 of 8 (pos_min = 12287, pos_max = 12287, size = 149.721 MiB, took 21.26 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793955 id_slot=0 id_task=128 p0=12288
slot create_check: id 0 | task 128 | created context checkpoint 4 of 8 (pos_min = 13905, pos_max = 13905, size = 149.734 MiB, took 21.61 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="139706766753152" timestamp=1772793968 id_slot=0 id_task=128 p0=13906
ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
....
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f100441fc17 in __GI___wait4 (pid=1321516, stat_loc=0x7fff0c47ce5c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f100441fc17 in __GI___wait4 (pid=1321516, stat_loc=0x7fff0c47ce5c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007f100441fb97 in __GI___waitpid (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>) at ./posix/waitpid.c:38
38 ./posix/waitpid.c: No such file or directory.
#2 0x000055b827e31fe5 in ggml_print_backtrace () at ik_llama.cpp/ggml/src/ggml.c:236
236 waitpid(pid, &wstatus, 0);
#3 0x000055b827e31f98 in ggml_abort (file=0x55b82779eae1 "ik_llama.cpp/src/llama-sampling.cpp", line=733, fmt=0x55b8277afaec <str.3.llvm> "GGML_ASSERT(%s) failed") at ik_llama.cpp/ggml/src/ggml.c:263
263 ggml_print_backtrace();
#4 0x000055b827d18ae5 in llama_sample_token_with_rng_impl (smpl=0x55b82cb761f8, candidates=<optimized out>, rng=...) at ik_llama.cpp/src/llama-sampling.cpp:733
733 GGML_ASSERT(iter != probs.end());
#5 0x000055b827bc0de2 in llama_sampling_sample_impl (ctx_sampling=0x55b82cdbf100, ctx_main=0x55b82cb76140, ctx_cfg=0x0, idx=0, is_resampling=false) at ik_llama.cpp/common/sampling.cpp:435
435 id = llama_sample_token_with_rng(ctx_main, &cur_p, ctx_sampling->rng);
#6 0x000055b827ab4907 in server_context::process_batch_tokens (this=this@entry=0x7fff0c47f9f0, n_batch=@0x7fff0c47d610: 4096) at ik_llama.cpp/examples/server/server-context.cpp:3456
3456 const llama_token id = common_sampler_sample(slot.ctx_sampling, ctx, tok_idx);
#7 0x000055b827ab6537 in server_context::update_slots (this=0x7fff0c47f9f0) at ik_llama.cpp/examples/server/server-context.cpp:3582
3582 process_batch_tokens(n_batch); // Decode with batch
#8 0x000055b827a24580 in std::__1::__function::__value_func<void ()>::operator()[abi:ne210108]() const (this=0x7fff0c480a50) at ../include/c++/v1/__functional/function.h:274
274 return (*__f_)(std::forward<_ArgTypes>(__args)...);
#9 std::__1::function<void ()>::operator()() const (this=0x7fff0c480a50) at ../include/c++/v1/__functional/function.h:772
772 return __f_(std::forward<_ArgTypes>(__arg)...);
#10 server_queue::start_loop (this=0x7fff0c480910) at ik_llama.cpp/examples/server/server-queue.cpp:133
133 callback_update_slots();
#11 0x000055b8278f5cc6 in main (argc=<optimized out>, argv=<optimized out>) at ik_llama.cpp/examples/server/server.cpp:2139
2139 ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 1312989) detached]
Aborted
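For context on the backtrace above: the assert fires inside `llama_sample_token_with_rng_impl`, where the token drawn from the distribution is looked up again in the saved probability list. Below is a hedged Python sketch of that invariant (not the actual implementation); the function name and the NaN check are illustrative assumptions about how a corrupted logit row from a broken decode can trip the assert.

```python
import math
import random

def sample_token(candidates, rng=random):
    """Sketch of the invariant behind GGML_ASSERT(iter != probs.end()) at
    llama-sampling.cpp:733. 'candidates' is a list of (token_id, prob)
    pairs left after the sampler chain. In the C++ code the token drawn by
    std::discrete_distribution is looked up again in the saved probability
    list; non-finite probabilities (e.g. a NaN logit row produced by a
    broken decode) make that lookup fail and abort the server."""
    if not candidates or any(not math.isfinite(p) or p < 0 for _, p in candidates):
        raise AssertionError("iter != probs.end(): invalid probabilities")
    ids = [t for t, _ in candidates]
    probs = [p for _, p in candidates]
    return rng.choices(ids, weights=probs, k=1)[0]
```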
|
|
I'm not following the PEG progress in mainline, but my impression was that it was still immature. Has that changed? |
|
Not sure about the status either, but PEG is not going away. They have an auto parser PR that is soon to be merged, which relies on PEG. It's also a complete refactor, so I want to create this PR to get things up to date before that large refactor lands. This PR and the future port of the auto parser PR can stay on a separate branch until they are mature enough, though.
For people who want to test: by default, the PEG parser is not used for Qwen3 Coder / Qwen3.5. @bitraft Try |
In #1365, I fixed a bug where the trigger pattern in the grammar only accepted one trigger. Could it be related to this bug? |
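A hypothetical sketch of what that fix means in practice: grammar activation should fire on any of the configured trigger patterns, not only the first one in the list. The patterns below are made-up examples, not the ones from the actual grammar.

```python
import re

# Assumed example patterns -- not taken from the real grammar definition
TRIGGERS = [r"<tool_call>", r"<function=[\w.\-]+>"]

def find_trigger(text: str):
    """Return (position, pattern) of the earliest matching trigger, or None.
    Checking every pattern (instead of stopping at the first configured one)
    is the behavior the #1365 fix restores."""
    hits = []
    for pat in TRIGGERS:
        m = re.search(pat, text)
        if m:
            hits.append((m.start(), pat))
    return min(hits) if hits else None
```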
I get this error without this patch, and my setup is CPU only. |
|
Can you share model/command line/etc.? |
fb7e04b to
d4b5eb6
Compare
CPU is llama-server -m Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf --threads 48 --threads-batch 48 --no-mmap --mlock --jinja --chat-template-kwargs '{"enable_thinking":false}' --reasoning-budget 0 --reasoning-format none --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 -c 65536 -np 1 -ub 4096 -b 4096 -ctk bf16 -ctv bf16 --no-display-prompt -co --webui llamacpp -vq --check-tensors --context-shift 0 -cram 0
|
|
I still don't know enough; you are very stingy with details. But to make sure, I downloaded the UD-Q4_K_XL quantization of Qwen3.5-27B from Unsloth, and it works perfectly fine. |
|
ggml-org/llama.cpp#20171 This seems to address the issue with the order of multiple arguments in tool call for qwen3.5 models. |
d4b5eb6 to
7723283
Compare
|
I just tested ggml-org/llama.cpp#20171, and indeed it fixes the arguments ordering issue. It is better than my quick fix in #1352:
However, it still doesn't allow parallel tool calls for Q3C / Q3CN / Qwen3.5.
I don't think it is related. The thing with the template used by Q3C / Q3CN / Qwen3.5 is that there is no definitive |
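To make the "parallel tool calls" requirement concrete, here is a simplified sketch (not the actual grammar or parser): with Qwen3-Coder-style output, parallel calls show up as several `<tool_call>...</tool_call>` blocks in a single assistant turn, and the parser must return every block rather than stopping after the first.

```python
import re

# Simplified: real Qwen3-Coder output nests <function=...>/<parameter=...>
# inside each block; here we only demonstrate the "return all blocks" part.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def split_parallel_tool_calls(text: str):
    """Extract the body of every <tool_call> block in an assistant reply."""
    return [m.group(1).strip() for m in TOOL_CALL_RE.finditer(text)]
```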
|
I just ported the whole autoparser in #1376 |
… and new jinja template engine
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
common : add nemotron 3 parsing (#18077)
common : add parser for ministral/mistral large 3/devstral 2 (#17713)
common : default content to an empty string (#18485)
chat: make tool description and parameters optional per OpenAI spec (#18478)
Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.
Attempts to fix #17667
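A minimal sketch of the described behavior, assuming plain dict handling; the exact default values here are illustrative, not copied from the PR:

```python
def normalize_tool(tool: dict) -> dict:
    """Per the OpenAI spec, 'description' and 'parameters' in a function
    definition are optional, so the parser should default them instead of
    throwing. Default values below are assumptions for illustration."""
    fn = dict(tool.get("function", {}))
    fn.setdefault("description", "")
    fn.setdefault("parameters", {"type": "object", "properties": {}})
    return {"type": "function", "function": fn}
```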
common : implement new jinja template engine (#18462)
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
jinja: correct member access rule (#18905)
jinja : fix lexing of float literals with sign (#18901)
jinja : add missing tojson filter for bool (#18900)
jinja : attribute support for join, map and sort (#18883)
jinja : fix object item order (and properly implement dictsort) (#18904)
tests : add test-jinja -py option for cross-checking (#18906)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
ci : run test-jinja -py on high perf [no ci] (#18916)
jinja : fix undefined keys and attributes and int/float as bool (#18924)
jinja: support none|string (#18995)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
jinja : implement mixed type object keys (#18955)
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
`tojson` is not a supported `undefined` filter
keep it DRY and fix some types
jinja : do not pass empty tools and add some none filters (#19176)
jinja : add unordered_map include to value.h [no ci] (#19205)
jinja : add missing 'in' test to template engine (#19004) (#19239)
The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".
This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.
Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.
Includes test cases for all three containment types plus
reject/select filter usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
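The expected behavior can be cross-checked against the reference Python Jinja2 implementation (which `test-jinja -py` compares against); the `in` test has been available there since Jinja 2.10:

```python
from jinja2 import Environment  # reference engine used by test-jinja -py

env = Environment()

# 'x is in(y)' as a test ...
assert env.from_string("{{ 2 is in([1, 2, 3]) }}").render() == "True"

# ... and via select/reject, which is the usage that failed in the templates
out = env.from_string("{{ [1, 2, 3] | reject('in', [2, 3]) | list }}").render()
assert out == "[1]"
```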
Add Jinja support for "indent" string filter (#19529)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
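For reference, the behavior the new filter presumably mirrors, demonstrated with Python's Jinja2: `indent(width)` indents every line after the first by `width` spaces and leaves the first line alone by default.

```python
from jinja2 import Environment  # reference behavior for the "indent" filter

env = Environment()
out = env.from_string("{{ text | indent(2) }}").render(text="a\nb\nc")
# First line intentionally unindented; following lines get 2 spaces each
assert out == "a\n  b\n  c"
```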
add vendor
refactor chat
server : support preserving reasoning_content in assistant message (#18994)
chat : fix translategemma crash on common_chat_format_example (#19019)
chat: fix language input for translategemma (#19052)
Co-authored-by: Aldehir Rojas <hello@alde.dev>
---------
Co-authored-by: Aldehir Rojas <hello@alde.dev>
chat: fix case where template accepts type content only (#19419)
mtmd : chat : Fix extra \n between text and media marker (#19595)
Thanks to @tugot17 for detecting and reporting the issue.
For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.
However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.
This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.
PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.
With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.
I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`
Please propose alternative ways of fixing this issue.
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
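The joining logic described above can be sketched as follows; this is a simplified stand-in for `to_json_oaicompat`, with the marker name taken from the description and the function shape assumed:

```python
MEDIA_MARKER = "<__media__>"  # marker name from the description above

def join_parts(parts):
    """Join text parts with '\n', but never insert a newline directly
    before or after a media marker, so the token count matches the HF
    implementation. Simplified sketch, not the actual C++ code."""
    out = ""
    for part in parts:
        if not out:
            out = part
        elif part == MEDIA_MARKER or out.endswith(MEDIA_MARKER):
            out += part          # media boundary: concatenate directly
        else:
            out += "\n" + part   # plain text parts keep the separator
    return out
```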
common : merge qwen3-coder and nemotron nano 3 parsers (#19765)
common : fix improper trimming in XML parser on complete message (#19805)
Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>
jinja: correct stats for tojson and string filters (#19785)
jinja : correct default size for string slices (#19913)
common : handle unicode during partial json parsing (#16526)
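A sketch of the unicode hazard that #16526 addresses, under the assumption that it concerns strings truncated mid-escape during streaming (function name and cut-off logic are illustrative):

```python
def safe_partial_string(buf: str) -> str:
    """A streamed JSON string can be cut in the middle of a \\uXXXX escape;
    a partial parser should hold back the incomplete escape instead of
    handing it to the decoder. Illustrative sketch only."""
    i = buf.rfind("\\u")
    if i != -1 and len(buf) - i < 6:  # a full escape is 6 chars: \uXXXX
        return buf[:i]                # defer the truncated escape
    return buf
```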
common : fix json schema with '\' in literals (#17307)
add back qwen_coder_xml and mirothinker
7723283 to
d1a4b71
Compare
|
@firecoperana I think we should fix parallel tool calls in the autoparser before merging it; otherwise it will be a regression. https://gist.github.com/sayap/33550a28a18f29081dcc0b832f675c48 - You can use this gist for reference. The key is to make sure that all the logprobs during tool calls are close to 0. |
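The logprob check described above can be sketched like this; the input shape mimics the per-token logprobs of a `/v1/chat/completions` response, and the threshold is an assumed cut-off, not a value from the gist:

```python
def toolcall_logprobs_ok(token_logprobs, threshold=-0.5):
    """With healthy grammar-constrained decoding, every token inside a tool
    call should be near-deterministic, i.e. have a logprob close to 0
    (probability close to 1). Returns (ok, offending_tokens)."""
    bad = [t for t in token_logprobs if t["logprob"] < threshold]
    return (len(bad) == 0, bad)
```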
|
That's fine with me. Do you want to write it? I'm not familiar with tool calls. |
|
Do you find any regression with this PR? |
Here are the full logs, triggered by https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF/blob/main/Qwen_Qwen3.5-27B-Q2_K.gguf with the openclaw /reset command:
llama-server -m bartowski/Qwen_Qwen3.5-27B-Q2_K.gguf --threads 40 --threads-batch 40 --no-mmap --mlock --jinja --chat-template-kwargs '{"enable_thinking":false}' --reasoning-budget 0 --reasoning-format none --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 -c 65536 -np 1 -ub 4096 -b 4096 -ctk bf16 -ctv bf16 --no-display-prompt -co --webui llamacpp -vq --check-tensors --context-shift 0 -cram 0
INFO [ main] build info | tid="140147593637248" timestamp=1772845651 build=4264 commit="277fc1d2"
INFO [ main] system info | tid="140147593637248" timestamp=1772845651 n_threads=40 n_threads_batch=40 total_threads=96 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
CPU: using device CPU - 0 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 851 tensors from bartowski/Qwen_Qwen3.5-27B-Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.600000
llama_model_loader: - kv 5: general.name str = Qwen3.5 27B
llama_model_loader: - kv 6: general.basename str = Qwen3.5
llama_model_loader: - kv 7: general.size_label str = 27B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-2...
llama_model_loader: - kv 10: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 11: qwen35.block_count u32 = 64
llama_model_loader: - kv 12: qwen35.context_length u32 = 262144
llama_model_loader: - kv 13: qwen35.embedding_length u32 = 5120
llama_model_loader: - kv 14: qwen35.feed_forward_length u32 = 17408
llama_model_loader: - kv 15: qwen35.attention.head_count u32 = 24
llama_model_loader: - kv 16: qwen35.attention.head_count_kv u32 = 4
llama_model_loader: - kv 17: qwen35.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 18: qwen35.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: qwen35.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: qwen35.attention.key_length u32 = 256
llama_model_loader: - kv 21: qwen35.attention.value_length u32 = 256
llama_model_loader: - kv 22: qwen35.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 23: qwen35.ssm.state_size u32 = 128
llama_model_loader: - kv 24: qwen35.ssm.group_count u32 = 16
llama_model_loader: - kv 25: qwen35.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 26: qwen35.ssm.inner_size u32 = 6144
llama_model_loader: - kv 27: qwen35.full_attention_interval u32 = 4
llama_model_loader: - kv 28: qwen35.rope.dimension_count u32 = 64
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,248320] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 248044
llama_model_loader: - kv 36: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 37: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 10
llama_model_loader: - kv 40: quantize.imatrix.file str = /models_out/Qwen3.5-27B-GGUF/Qwen_Qwe...
llama_model_loader: - kv 41: quantize.imatrix.dataset str = /training_dir/calibration_datav5.txt
llama_model_loader: - kv 42: quantize.imatrix.entries_count u32 = 496
llama_model_loader: - kv 43: quantize.imatrix.chunks_count u32 = 802
llama_model_loader: - type f32: 449 tensors
llama_model_loader: - type q8_0: 24 tensors
llama_model_loader: - type q2_K: 281 tensors
llama_model_loader: - type q3_K: 80 tensors
llama_model_loader: - type q4_K: 16 tensors
llama_model_loader: - type q6_K: 1 tensors
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen35
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 17408
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 40
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: mrope sections = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 6144
llm_load_print_meta: ssm_d_state = 128
llm_load_print_meta: ssm_dt_rank = 48
llm_load_print_meta: ssm_n_group = 16
llm_load_print_meta: model type = 27B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 26.896 B
llm_load_print_meta: model size = 10.013 GiB (3.198 BPW)
llm_load_print_meta: repeating layers = 8.654 GiB (3.052 BPW, 24.353 B parameters)
llm_load_print_meta: general.name = Qwen3.5 27B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248044 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: CPU buffer size = 10253.82 MiB
.........................................................................................
llama_init_from_model: n_ctx = 65536
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4245.63 MiB
llama_init_from_model: KV self size = 4096.00 MiB, K (bf16): 2048.00 MiB, V (bf16): 2048.00 MiB
llama_init_from_model: CPU output buffer size = 0.95 MiB
llama_init_from_model: CPU compute buffer size = 3960.00 MiB
llama_init_from_model: graph nodes = 3269
llama_init_from_model: graph splits = 1
llama_init_from_model: enabling only_active_experts scheduling
======================================= HAVE_FANCY_SIMD is defined
INFO [ init] initializing slots | tid="140147593637248" timestamp=1772845653 n_slots=1
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
INFO [ init] new slot | tid="140147593637248" timestamp=1772845653 id_slot=0 n_ctx_slot=65536
no implementations specified for speculative decoding
slot init: id 0 | task -1 | speculative decoding context not initialized
prompt cache is disabled - use `--cache-ram N` to enable it
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
</think>
'
INFO [ main] model loaded | tid="140147593637248" timestamp=1772845654
srv init: init: chat template, thinking = 0
INFO [ main] HTTP server listening | tid="140147593637248" timestamp=1772845654 port="1024" n_threads_http="95" hostname="192.0.0.1"
INFO [ slots_idle] all slots are idle | tid="140147593637248" timestamp=1772845654
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="140147593637248" timestamp=1772845672 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845672 id_slot=0 id_task=0 p0=0
INFO [ log_server_request] request | tid="140125019256512" timestamp=1772845687 remote_addr="192.0.0.3" remote_port=43884 status=200 method="POST" path="/v1/chat/completions" params={}
srv stop: cancel task, id_task = 0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 20.61 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=0 p0=4096
llama_decode: failed to decode, ret = -3
Decode process is cancelled by user.
INFO [ release_slots] slot released | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=0 n_ctx=65536 n_past=8192 n_system_tokens=0 n_cache_tokens=8192 truncated=false
INFO [ slots_idle] all slots are idle | tid="140147593637248" timestamp=1772845690
======== Prompt cache: cache size: 8192, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.42, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=3
======== Cache: cache_size = 8192, n_past0 = 3457, n_past1 = 3457, n_past_prompt1 = 3457, n_past2 = 3457, n_past_prompt2 = 3457
Common part does not match fully
cache : via channel plugins. Supports actions: send, broadcast, poll, react, delete, edit, topic-create.", "parameters": {"type": "object
prompt: via channel plugins. Current channel (telegram) supports: delete, edit, poll, react, send, topic-create.", "parameters": {"type":
slot apply_checkp: id 0 | task 3 | n_past = 3457, slot.prompt.tokens.size() = 8192, seq_id = 0, pos_min = 8191
slot apply_checkp: id 0 | task 3 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot apply_checkp: id 0 | task 3 | erased invalidated context checkpoint (pos_min = 4095, pos_max = 4095, size = 149.659 MiB)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=3 p0=0
slot create_check: id 0 | task 3 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 20.84 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845709 id_slot=0 id_task=3 p0=4096
slot create_check: id 0 | task 3 | created context checkpoint 2 of 8 (pos_min = 8191, pos_max = 8191, size = 149.690 MiB, took 20.92 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845730 id_slot=0 id_task=3 p0=8192
slot create_check: id 0 | task 3 | created context checkpoint 3 of 8 (pos_min = 12287, pos_max = 12287, size = 149.721 MiB, took 20.79 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845752 id_slot=0 id_task=3 p0=12288
slot create_check: id 0 | task 3 | created context checkpoint 4 of 8 (pos_min = 13905, pos_max = 13905, size = 149.734 MiB, took 21.64 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845761 id_slot=0 id_task=3 p0=13906
ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
[New LWP 3160176]
...
[New LWP 3160377]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f76a7966c17 in __GI___wait4 (pid=3163505, stat_loc=0x7ffec57ad44c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f76a7966c17 in __GI___wait4 (pid=3163505, stat_loc=0x7ffec57ad44c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007f76a7966b97 in __GI___waitpid (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>) at ./posix/waitpid.c:38
38 ./posix/waitpid.c: No such file or directory.
#2 0x00005628de190f25 in ggml_print_backtrace () at ik_llama.cpp/ggml/src/ggml.c:236
236 waitpid(pid, &wstatus, 0);
#3 0x00005628de190ed8 in ggml_abort (file=0x5628ddb4d275 "ik_llama.cpp/src/llama-sampling.cpp", line=733, fmt=0x5628ddb5ddcc <str.3.llvm> "GGML_ASSERT(%s) failed") at ik_llama.cpp/ggml/src/ggml.c:263
263 ggml_print_backtrace();
#4 0x00005628de077c45 in llama_sample_token_with_rng_impl (smpl=0x5628e4148498, candidates=<optimized out>, rng=...) at ik_llama.cpp/src/llama-sampling.cpp:733
733 GGML_ASSERT(iter != probs.end());
#5 0x00005628dde379d2 in llama_sampling_sample_impl (ctx_sampling=0x5628e84c96d0, ctx_main=0x5628e41483e0, ctx_cfg=0x0, idx=0, is_resampling=false) at ik_llama.cpp/common/sampling.cpp:435
435 id = llama_sample_token_with_rng(ctx_main, &cur_p, ctx_sampling->rng);
#6 0x00005628ddde09b5 in server_context::process_batch_tokens (this=this@entry=0x7ffec57b00d0, n_batch=@0x7ffec57adbf0: 4096) at ik_llama.cpp/examples/server/server-context.cpp:3525
3525 const llama_token id = common_sampler_sample(slot.ctx_sampling, ctx, tok_idx);
#7 0x00005628ddde218a in server_context::update_slots (this=0x7ffec57b00d0) at ik_llama.cpp/examples/server/server-context.cpp:3651
3651 process_batch_tokens(n_batch); // Decode with batch
#8 0x00005628ddd7ef34 in std::__1::__function::__value_func<void ()>::operator()[abi:ne210108]() const (this=0x7ffec57b1130) at ../include/c++/v1/__functional/function.h:274
274 return (*__f_)(std::forward<_ArgTypes>(__args)...);
#9 std::__1::function<void ()>::operator()() const (this=0x7ffec57b1130) at ../include/c++/v1/__functional/function.h:772
772 return __f_(std::forward<_ArgTypes>(__arg)...);
#10 server_queue::start_loop (this=0x7ffec57b0ff0) at ik_llama.cpp/examples/server/server-queue.cpp:133
133 callback_update_slots();
#11 0x00005628ddcb3781 in main (argc=<optimized out>, argv=<optimized out>) at ik_llama.cpp/examples/server/server.cpp:2143
2143 ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 3159265) detached]
Aborted
You can try it with a simple input like: |
|
I am still trying to understand whether parallel tool calls are actually supported by the autoparser. I tested these combinations using the current mainline master branch:
The thinking is that if I can find a yes/yes/yes combination, then it will be easier to start from there. |
|
Ah, my bad. I just gathered that, since ggml-org/llama.cpp@8b576b6c55, the original intention is to enable parallel tool calls explicitly via a request parameter:

inputs.parallel_tool_calls = json_value(body, "parallel_tool_calls", false);

I wrongly assumed that it was auto-enabled. After setting this request parameter, I got yes/yes/yes for the combinations above 👍 |
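Since the server-side default is `false`, the parameter has to be sent explicitly in the request body. A sketch of such a request (OpenAI-compatible field names; model name and tool schema are placeholders):

```json
{
  "model": "Qwen3-Coder",
  "messages": [{"role": "user", "content": "..."}],
  "tools": [{"type": "function", "function": {"name": "...", "parameters": {}}}],
  "parallel_tool_calls": true
}
```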
Hi @ikawrakow, here is the full log. Please let me know if you need any additional information to locate the issue.

full log

llama-server -m bartowski/Qwen_Qwen3.5-27B-Q2_K.gguf --threads 40 --threads-batch 40 --no-mmap --mlock --jinja --chat-template-kwargs '{"enable_thinking":false}' --reasoning-budget 0 --reasoning-format none --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 -c 65536 -np 1 -ub 4096 -b 4096 -ctk bf16 -ctv bf16 --no-display-prompt -co --webui llamacpp -vq --check-tensors --context-shift 0 -cram 0
INFO [ main] build info | tid="140147593637248" timestamp=1772845651 build=4264 commit="277fc1d2"
INFO [ main] system info | tid="140147593637248" timestamp=1772845651 n_threads=40 n_threads_batch=40 total_threads=96 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
CPU: using device CPU - 0 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 851 tensors from bartowski/Qwen_Qwen3.5-27B-Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.600000
llama_model_loader: - kv 5: general.name str = Qwen3.5 27B
llama_model_loader: - kv 6: general.basename str = Qwen3.5
llama_model_loader: - kv 7: general.size_label str = 27B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-2...
llama_model_loader: - kv 10: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 11: qwen35.block_count u32 = 64
llama_model_loader: - kv 12: qwen35.context_length u32 = 262144
llama_model_loader: - kv 13: qwen35.embedding_length u32 = 5120
llama_model_loader: - kv 14: qwen35.feed_forward_length u32 = 17408
llama_model_loader: - kv 15: qwen35.attention.head_count u32 = 24
llama_model_loader: - kv 16: qwen35.attention.head_count_kv u32 = 4
llama_model_loader: - kv 17: qwen35.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 18: qwen35.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: qwen35.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: qwen35.attention.key_length u32 = 256
llama_model_loader: - kv 21: qwen35.attention.value_length u32 = 256
llama_model_loader: - kv 22: qwen35.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 23: qwen35.ssm.state_size u32 = 128
llama_model_loader: - kv 24: qwen35.ssm.group_count u32 = 16
llama_model_loader: - kv 25: qwen35.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 26: qwen35.ssm.inner_size u32 = 6144
llama_model_loader: - kv 27: qwen35.full_attention_interval u32 = 4
llama_model_loader: - kv 28: qwen35.rope.dimension_count u32 = 64
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,248320] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 248044
llama_model_loader: - kv 36: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 37: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 10
llama_model_loader: - kv 40: quantize.imatrix.file str = /models_out/Qwen3.5-27B-GGUF/Qwen_Qwe...
llama_model_loader: - kv 41: quantize.imatrix.dataset str = /training_dir/calibration_datav5.txt
llama_model_loader: - kv 42: quantize.imatrix.entries_count u32 = 496
llama_model_loader: - kv 43: quantize.imatrix.chunks_count u32 = 802
llama_model_loader: - type f32: 449 tensors
llama_model_loader: - type q8_0: 24 tensors
llama_model_loader: - type q2_K: 281 tensors
llama_model_loader: - type q3_K: 80 tensors
llama_model_loader: - type q4_K: 16 tensors
llama_model_loader: - type q6_K: 1 tensors
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen35
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 24
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 17408
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 40
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: mrope sections = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 6144
llm_load_print_meta: ssm_d_state = 128
llm_load_print_meta: ssm_dt_rank = 48
llm_load_print_meta: ssm_n_group = 16
llm_load_print_meta: model type = 27B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 26.896 B
llm_load_print_meta: model size = 10.013 GiB (3.198 BPW)
llm_load_print_meta: repeating layers = 8.654 GiB (3.052 BPW, 24.353 B parameters)
llm_load_print_meta: general.name = Qwen3.5 27B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248044 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: CPU buffer size = 10253.82 MiB
.........................................................................................
llama_init_from_model: n_ctx = 65536
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4245.63 MiB
llama_init_from_model: KV self size = 4096.00 MiB, K (bf16): 2048.00 MiB, V (bf16): 2048.00 MiB
llama_init_from_model: CPU output buffer size = 0.95 MiB
llama_init_from_model: CPU compute buffer size = 3960.00 MiB
llama_init_from_model: graph nodes = 3269
llama_init_from_model: graph splits = 1
llama_init_from_model: enabling only_active_experts scheduling
======================================= HAVE_FANCY_SIMD is defined
INFO [ init] initializing slots | tid="140147593637248" timestamp=1772845653 n_slots=1
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
INFO [ init] new slot | tid="140147593637248" timestamp=1772845653 id_slot=0 n_ctx_slot=65536
no implementations specified for speculative decoding
slot init: id 0 | task -1 | speculative decoding context not initialized
prompt cache is disabled - use `--cache-ram N` to enable it
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
</think>
'
INFO [ main] model loaded | tid="140147593637248" timestamp=1772845654
srv init: init: chat template, thinking = 0
INFO [ main] HTTP server listening | tid="140147593637248" timestamp=1772845654 port="1024" n_threads_http="95" hostname="192.0.0.1"
INFO [ slots_idle] all slots are idle | tid="140147593637248" timestamp=1772845654
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="140147593637248" timestamp=1772845672 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845672 id_slot=0 id_task=0 p0=0
INFO [ log_server_request] request | tid="140125019256512" timestamp=1772845687 remote_addr="192.0.0.3" remote_port=43884 status=200 method="POST" path="/v1/chat/completions" params={}
srv stop: cancel task, id_task = 0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 20.61 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=0 p0=4096
llama_decode: failed to decode, ret = -3
Decode process is cancelled by user.
INFO [ release_slots] slot released | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=0 n_ctx=65536 n_past=8192 n_system_tokens=0 n_cache_tokens=8192 truncated=false
INFO [ slots_idle] all slots are idle | tid="140147593637248" timestamp=1772845690
======== Prompt cache: cache size: 8192, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.42, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=3
======== Cache: cache_size = 8192, n_past0 = 3457, n_past1 = 3457, n_past_prompt1 = 3457, n_past2 = 3457, n_past_prompt2 = 3457
Common part does not match fully
cache : via channel plugins. Supports actions: send, broadcast, poll, react, delete, edit, topic-create.", "parameters": {"type": "object
prompt: via channel plugins. Current channel (telegram) supports: delete, edit, poll, react, send, topic-create.", "parameters": {"type":
slot apply_checkp: id 0 | task 3 | n_past = 3457, slot.prompt.tokens.size() = 8192, seq_id = 0, pos_min = 8191
slot apply_checkp: id 0 | task 3 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot apply_checkp: id 0 | task 3 | erased invalidated context checkpoint (pos_min = 4095, pos_max = 4095, size = 149.659 MiB)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845690 id_slot=0 id_task=3 p0=0
slot create_check: id 0 | task 3 | created context checkpoint 1 of 8 (pos_min = 4095, pos_max = 4095, size = 149.659 MiB, took 20.84 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845709 id_slot=0 id_task=3 p0=4096
slot create_check: id 0 | task 3 | created context checkpoint 2 of 8 (pos_min = 8191, pos_max = 8191, size = 149.690 MiB, took 20.92 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845730 id_slot=0 id_task=3 p0=8192
slot create_check: id 0 | task 3 | created context checkpoint 3 of 8 (pos_min = 12287, pos_max = 12287, size = 149.721 MiB, took 20.79 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845752 id_slot=0 id_task=3 p0=12288
slot create_check: id 0 | task 3 | created context checkpoint 4 of 8 (pos_min = 13905, pos_max = 13905, size = 149.734 MiB, took 21.64 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140147593637248" timestamp=1772845761 id_slot=0 id_task=3 p0=13906
ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
[New LWP 3160176]
[New LWP 3160177]
[New LWP 3160178]
[New LWP 3160179]
[New LWP 3160180]
[New LWP 3160181]
[New LWP 3160182]
[New LWP 3160183]
[New LWP 3160184]
[New LWP 3160185]
[New LWP 3160186]
[New LWP 3160187]
[New LWP 3160188]
[New LWP 3160189]
[New LWP 3160190]
[New LWP 3160191]
[New LWP 3160192]
[New LWP 3160193]
[New LWP 3160194]
[New LWP 3160195]
[New LWP 3160196]
[New LWP 3160197]
[New LWP 3160198]
[New LWP 3160199]
[New LWP 3160200]
[New LWP 3160201]
[New LWP 3160202]
[New LWP 3160203]
[New LWP 3160204]
[New LWP 3160205]
[New LWP 3160206]
[New LWP 3160207]
[New LWP 3160208]
[New LWP 3160209]
[New LWP 3160210]
[New LWP 3160211]
[New LWP 3160212]
[New LWP 3160213]
[New LWP 3160214]
[New LWP 3160216]
[New LWP 3160282]
[New LWP 3160283]
[New LWP 3160284]
[New LWP 3160285]
[New LWP 3160286]
[New LWP 3160287]
[New LWP 3160288]
[New LWP 3160289]
[New LWP 3160290]
[New LWP 3160291]
[New LWP 3160292]
[New LWP 3160293]
[New LWP 3160294]
[New LWP 3160295]
[New LWP 3160296]
[New LWP 3160297]
[New LWP 3160298]
[New LWP 3160299]
[New LWP 3160300]
[New LWP 3160301]
[New LWP 3160302]
[New LWP 3160303]
[New LWP 3160304]
[New LWP 3160305]
[New LWP 3160306]
[New LWP 3160307]
[New LWP 3160308]
[New LWP 3160309]
[New LWP 3160310]
[New LWP 3160311]
[New LWP 3160312]
[New LWP 3160313]
[New LWP 3160314]
[New LWP 3160315]
[New LWP 3160316]
[New LWP 3160317]
[New LWP 3160318]
[New LWP 3160319]
[New LWP 3160320]
[New LWP 3160321]
[New LWP 3160322]
[New LWP 3160323]
[New LWP 3160324]
[New LWP 3160325]
[New LWP 3160326]
[New LWP 3160327]
[New LWP 3160328]
[New LWP 3160329]
[New LWP 3160330]
[New LWP 3160331]
[New LWP 3160332]
[New LWP 3160333]
[New LWP 3160334]
[New LWP 3160335]
[New LWP 3160336]
[New LWP 3160337]
[New LWP 3160338]
[New LWP 3160339]
[New LWP 3160340]
[New LWP 3160341]
[New LWP 3160342]
[New LWP 3160343]
[New LWP 3160344]
[New LWP 3160345]
[New LWP 3160346]
[New LWP 3160347]
[New LWP 3160348]
[New LWP 3160349]
[New LWP 3160350]
[New LWP 3160351]
[New LWP 3160352]
[New LWP 3160353]
[New LWP 3160354]
[New LWP 3160355]
[New LWP 3160356]
[New LWP 3160357]
[New LWP 3160358]
[New LWP 3160359]
[New LWP 3160360]
[New LWP 3160361]
[New LWP 3160362]
[New LWP 3160363]
[New LWP 3160364]
[New LWP 3160365]
[New LWP 3160366]
[New LWP 3160367]
[New LWP 3160368]
[New LWP 3160369]
[New LWP 3160370]
[New LWP 3160371]
[New LWP 3160372]
[New LWP 3160373]
[New LWP 3160374]
[New LWP 3160375]
[New LWP 3160376]
[New LWP 3160377]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f76a7966c17 in __GI___wait4 (pid=3163505, stat_loc=0x7ffec57ad44c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f76a7966c17 in __GI___wait4 (pid=3163505, stat_loc=0x7ffec57ad44c, options=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007f76a7966b97 in __GI___waitpid (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>) at ./posix/waitpid.c:38
38 ./posix/waitpid.c: No such file or directory.
#2 0x00005628de190f25 in ggml_print_backtrace () at ik_llama.cpp/ggml/src/ggml.c:236
236 waitpid(pid, &wstatus, 0);
#3 0x00005628de190ed8 in ggml_abort (file=0x5628ddb4d275 "ik_llama.cpp/src/llama-sampling.cpp", line=733, fmt=0x5628ddb5ddcc <str.3.llvm> "GGML_ASSERT(%s) failed") at ik_llama.cpp/ggml/src/ggml.c:263
263 ggml_print_backtrace();
#4 0x00005628de077c45 in llama_sample_token_with_rng_impl (smpl=0x5628e4148498, candidates=<optimized out>, rng=...) at ik_llama.cpp/src/llama-sampling.cpp:733
733 GGML_ASSERT(iter != probs.end());
#5 0x00005628dde379d2 in llama_sampling_sample_impl (ctx_sampling=0x5628e84c96d0, ctx_main=0x5628e41483e0, ctx_cfg=0x0, idx=0, is_resampling=false) at ik_llama.cpp/common/sampling.cpp:435
435 id = llama_sample_token_with_rng(ctx_main, &cur_p, ctx_sampling->rng);
#6 0x00005628ddde09b5 in server_context::process_batch_tokens (this=this@entry=0x7ffec57b00d0, n_batch=@0x7ffec57adbf0: 4096) at ik_llama.cpp/examples/server/server-context.cpp:3525
3525 const llama_token id = common_sampler_sample(slot.ctx_sampling, ctx, tok_idx);
#7 0x00005628ddde218a in server_context::update_slots (this=0x7ffec57b00d0) at ik_llama.cpp/examples/server/server-context.cpp:3651
3651 process_batch_tokens(n_batch); // Decode with batch
#8 0x00005628ddd7ef34 in std::__1::__function::__value_func<void ()>::operator()[abi:ne210108]() const (this=0x7ffec57b1130) at ../include/c++/v1/__functional/function.h:274
274 return (*__f_)(std::forward<_ArgTypes>(__args)...);
#9 std::__1::function<void ()>::operator()() const (this=0x7ffec57b1130) at ../include/c++/v1/__functional/function.h:772
772 return __f_(std::forward<_ArgTypes>(__arg)...);
#10 server_queue::start_loop (this=0x7ffec57b0ff0) at ik_llama.cpp/examples/server/server-queue.cpp:133
133 callback_update_slots();
#11 0x00005628ddcb3781 in main (argc=<optimized out>, argv=<optimized out>) at ik_llama.cpp/examples/server/server.cpp:2143
2143 ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 3159265) detached]
Aborted
|
… parsing and new jinja template engine (ikawrakow#1369)" This reverts commit ab1d740.
This PR syncs the PEG parser, the new Jinja template engine, and other Jinja-related updates from mainline.
ggml-org/llama.cpp#17136
ggml-org/llama.cpp#18462
@sayap Mainline has merged the qwen3-coder and Nemotron Nano 3 parsers. In this port, Qwen3 Next Coder still uses common_chat_params_init_qwen3_coder_xml. Can you test whether it still works?

@hksdpc255 Mainline disables the Minja polyfill in chat.cpp. In my simple test, MiroThinker tool calls work. Can you check whether MiroThinker tool calls still work?
@chulucninh09 Can you also test this PR?