- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.4k
Description
Name and Version
Environment (compiled from master)
root@77c821627b43:/app# ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
version: 6118 (6c7e9a54)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Issue
Runtime runs in
Operating systems
Linux
GGML backends
CUDA
Hardware
Environment (Compiled from master):
root@77c821627b43:/app# ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no  
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no  
ggml_cuda_init: found 2 CUDA devices:  
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes  
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes  
load_backend: loaded CUDA backend from /app/libggml-cuda.so  
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so  
version: 6118 (6c7e9a54)  
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu  
Models
gpt-oss-20b-BF16.gguf
Problem description & steps to reproduce
Issue:
The server fails at runtime when attempting to use ChatBox with tools enabled. Everything builds fine from master, and the server starts up without issues. However, once I initiate a session using the ChatBox frontend with tools turned on, the process crashes or becomes unresponsive.
Expected behavior:
The server should operate normally with tools enabled in ChatBox.
Steps to reproduce:
Build llama-server from latest master
Start the server
Connect with ChatBox
Enable tools (MSP Server)
Attempt to start a chat
Runtime fails with error
srv  log_server_r: request: POST /v1/chat/completions 192.168.1.248 200
slot      release: id  0 | task 2 | stop processing: n_past = 419, truncated = 0
slot print_timing: id  0 | task 2 |
prompt eval time =     109.57 ms /   327 tokens (    0.34 ms per token,  2984.31 tokens per second)
       eval time =     291.01 ms /    34 tokens (    8.56 ms per token,   116.83 tokens per second)
      total time =     400.59 ms /   361 tokens
libggml-base.so(+0x16d4b)[0x7f5eb1ed3d4b]
libggml-base.so(ggml_print_backtrace+0x21f)[0x7f5eb1ed41af]
libggml-base.so(+0x28aaf)[0x7f5eb1ee5aaf]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f5eb1d3d20c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7f5eb1d3d277]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7f5eb1d3d4d8]
/app/llama-server(+0x44fb2)[0x5593b03aefb2]
/app/llama-server(+0x158ce8)[0x5593b04c2ce8]
/app/llama-server(+0xb0f14)[0x5593b041af14]
/app/llama-server(+0xb321c)[0x5593b041d21c]
/app/llama-server(+0xdf406)[0x5593b0449406]
/app/llama-server(+0x856fd)[0x5593b03ef6fd]
/app/llama-server(+0x4d5e5)[0x5593b03b75e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f5eb1988d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f5eb1988e40]
/app/llama-server(+0x4f035)[0x5593b03b9035]
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unexpected content at end of input
Additional context:
Both CUDA and CPU backends load successfully.
No errors during build or initial startup.
ChatBox works fine without tools.
The failure happens only when tools are enabled.
Let me know if logs or stack traces are needed.
Runtime configuration
Configuration:
  GPU Layers: 99
  Threads: -1
  Context Size: 16384
  Temperature: 1.0
  Top-p: 1.0
  Top-k: 0
  Jinja: true
First Bad Commit
No response
Relevant log output
srv log_server_r: request: POST /v1/chat/completions 192.168.1.248 200
slot release: id 0 | task 2 | stop processing: n_past = 419, truncated = 0
slot print_timing: id 0 | task 2 |
prompt eval time = 109.57 ms / 327 tokens ( 0.34 ms per token, 2984.31 tokens per second)
eval time = 291.01 ms / 34 tokens ( 8.56 ms per token, 116.83 tokens per second)
total time = 400.59 ms / 361 tokens
libggml-base.so(+0x16d4b)[0x7f5eb1ed3d4b]
libggml-base.so(ggml_print_backtrace+0x21f)[0x7f5eb1ed41af]
libggml-base.so(+0x28aaf)[0x7f5eb1ee5aaf]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f5eb1d3d20c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7f5eb1d3d277]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7f5eb1d3d4d8]
/app/llama-server(+0x44fb2)[0x5593b03aefb2]
/app/llama-server(+0x158ce8)[0x5593b04c2ce8]
/app/llama-server(+0xb0f14)[0x5593b041af14]
/app/llama-server(+0xb321c)[0x5593b041d21c]
/app/llama-server(+0xdf406)[0x5593b0449406]
/app/llama-server(+0x856fd)[0x5593b03ef6fd]
/app/llama-server(+0x4d5e5)[0x5593b03b75e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f5eb1988d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f5eb1988e40]
/app/llama-server(+0x4f035)[0x5593b03b9035]
terminate called after throwing an instance of 'std::runtime_error'
what(): Unexpected content at end of input