Name and Version
llama.cpp server 7097 (b03ebc2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
This is not a pure stock master build: it includes
- The GLM-4.5/GLM-4.6 chat-template / tool-calling support from PR #15904 (“common: Yet another add GLM-4.5/GLM-4.6 tool calling support”).
- The server-side prompt cache support from PR #16391 (“server : host-memory prompt caching”), which introduces `--cache-ram`/`server_prompt_cache` and logs:

```
srv init: prompt cache is enabled, size limit: ...
srv init: use `--cache-ram 0` to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
```
Operating systems
Linux
GGML backends
CUDA
Hardware
CPU: AMD EPYC 9255
RAM: 768 GB
GPUs:
GPU 0–1: NVIDIA Blackwell (used for chat model)
GPU 2: NVIDIA RTX 4090 (used by a separate llama-server process for embeddings; that one is stable)
Models
GLM-4.6 GGUF from Unsloth (REAP variant)
Example file:
GLM-4.6-REAP-268B-A32B-Q5_K_S-00001-of-00004.gguf
Quantization: Q5_K_S
Qwen3-Embedding-8B, GGUF Q5_K_M
File: Qwen3-Embedding-8B-Q5_K_M.gguf
Problem description & steps to reproduce
Summary:
When running llama-server with GLM-4.6 GGUF and using the OpenAI-compatible /v1/chat/completions endpoint under heavy “structured output” load (BAML / Cognee knowledge-graph extraction), the server eventually becomes unstable and crashes with a segmentation fault, or becomes unresponsive and clients get timeouts.
The workload is many non-streaming chat requests, each with:
A large system prompt (structured-output / graph extraction instructions, JSON-like schema)
A medium-sized user input (~500–1,000 tokens) containing HTML/Markdown from documentation
Moderately high concurrency (multiple parallel requests via asyncio.gather)
The same crash pattern occurred earlier with a MiniMax-M2 GGUF model; I’ve switched to GLM-4.6 but the behaviour persists.
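For context, the client-side load pattern looks roughly like the following sketch (a hypothetical, stdlib-only stand-in for the BAML/Cognee pipeline; the URL, model alias, prompts, and concurrency limit here are assumptions, not the real client code):

```python
import asyncio
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed local chat server

# Stand-in for the large structured-output / graph-extraction system prompt.
SYSTEM_PROMPT = "You are a graph extractor. Return JSON matching this schema: ..."

def build_payload(user_text: str) -> dict:
    # Non-streaming chat request, as issued by the structured-output client.
    return {
        "model": "glm-4.6-reap",
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    }

def post_chat(user_text: str) -> dict:
    # Blocking HTTP POST; run in a thread so asyncio can overlap requests.
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(user_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

async def run(docs: list[str], concurrency: int = 8) -> list[dict]:
    # Moderately high concurrency via asyncio.gather, mirroring the workload.
    sem = asyncio.Semaphore(concurrency)

    async def one(doc: str) -> dict:
        async with sem:
            return await asyncio.to_thread(post_chat, doc)

    return await asyncio.gather(*(one(d) for d in docs))
```

Each `docs` entry is ~500–1,000 tokens of HTML/Markdown; the crash shows up after many such batches, not on the first request.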
Runtime commands
Chat server (the one that crashes), on two Blackwell GPUs:
```sh
cd /path/to/llama.cpp/build/bin
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export GGML_CUDA_USE_GRAPHS=0
CHAT_MODEL="/path/to/GLM-4.6-REAP-268B-A32B-Q5_K_S-00001-of-00004.gguf"

./llama-server \
    --host 127.0.0.1 \
    --port 8080 \
    -m "$CHAT_MODEL" \
    --alias glm-4.6-reap \
    --ctx-size 16384 \
    --batch-size 512 \
    --ubatch-size 128 \
    --kv-unified \
    --cache-ram 0 \
    -ngl 999 \
    --flash-attn auto \
    --jinja
```
Embedding server (for completeness, but this one is stable):
```sh
cd /path/to/llama.cpp/build/bin
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=2
export GGML_CUDA_USE_GRAPHS=0
EMB_MODEL="/path/to/Qwen3-Embedding-8B-Q5_K_M.gguf"

./llama-server \
    --host 127.0.0.1 \
    --port 8082 \
    -m "$EMB_MODEL" \
    --alias text-embedding-3-large \
    --embedding \
    --pooling last \
    --batch-size 64 \
    --ubatch-size 64 \
    -ngl 999
```
First Bad Commit
I don’t have a bisect yet.
The issue is present on the current master + the two PRs mentioned above (prompt cache and Unsloth GLM chat format).
I previously saw similar instability using a MiniMax-M2 GGUF model (also with --jinja and a custom template) on an earlier build.
I have not yet tested a clean master build without those patches or with a smaller model; happy to do that if that would be useful.
If you’d like, I can:
- Rebuild on a specific known-good tag or commit you suggest.
- Retry with debug symbols and run under gdb to capture a full backtrace at the point of the segfault.
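For reference, the debug capture I have in mind would look roughly like this (a sketch, assuming a CUDA cmake build from the repo root; server flags copied from the chat-server command above):

```shell
# 1) Rebuild with debug info (CUDA build assumed):
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_CUDA=ON
cmake --build build -j

# 2) Run the crashing server under gdb and dump a full backtrace on segfault:
gdb -ex run -ex "bt full" -ex "info threads" --args \
    ./build/bin/llama-server -m "$CHAT_MODEL" --host 127.0.0.1 --port 8080 \
    --ctx-size 16384 --kv-unified --cache-ram 0 -ngl 999 --flash-attn auto --jinja
```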
Relevant log output
Below is a representative snippet from the llama-server log right before the crash, plus the client-side error.
Server-side (llama.cpp):
```
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_seq     = 16384
llama_context: n_batch       = 512
llama_context: n_ubatch      = 128
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
...
srv  init: prompt cache is enabled, size limit: 8192 MiB
srv  init: use `--cache-ram 0` to disable the prompt cache
srv  init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv  init: thinking = 1
main: model loaded
main: chat template, chat_template: {# Unsloth template fixes #}
...
srv  update_slots: id 0 | task 4125 | n_tokens = 680, memory_seq_rm [680, end)
srv  update_slots: id 0 | task 4125 | prompt processing progress, n_tokens = 680, batch.n_tokens = 68, progress = 1.000000
srv  update_slots: id 0 | task 4125 | prompt done, n_tokens = 680, batch.n_tokens = 68
srv  stop: cancel task, id_task = 4169
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: GLM 4.5
srv  stop: cancel task, id_task = 4170
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: GLM 4.5
...
# many repeated lines of "srv stop: cancel task, id_task = XXXX" and "srv params_from_: Chat format: GLM 4.5"
...
Segmentation fault (core dumped)
```
On another run, with a Ctrl+C during shutdown I also saw:
```
^Csrv  operator(): operator(): cleaning up before exit...
libggml-base.so.0(+0x1840b)[0x...]
libggml-base.so.0(ggml_print_backtrace+0x21f)[0x...]
libggml-base.so.0(+0x2ba6f)[0x...]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x...]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x...]
./llama-server(+0xe4520)[0x...]
./llama-server(+0xe465c)[0x...]
./llama-server(+0x80ee4)[0x...]
./llama-server(+0x874af)[0x...]
./llama-server(+0x24ff62)[0x...]
./llama-server(+0x27da64)[0x...]
./llama-server(+0x27e3b8)[0x...]
./llama-server(+0x93e85)[0x...]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x...]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x...]
```
Client-side (BAML / Cognee), at the point where the server stops responding:
```
BamlClientHttpError(client_name=openai, message=reqwest::Error {
    kind: Request,
    url: "http://127.0.0.1:8080/v1/chat/completions",
    source: hyper_util::client::legacy::Error(
        Connect,
        ConnectError("tcp connect error", 127.0.0.1:8080, TimedOut)
    )
}, status_code=503)
...
  File ".../extract_content_graph.py", line 32, in extract_content_graph
    content_graph = await LLMGateway.acreate_structured_output(...)
...
baml_py.internal_monkeypatch.BamlClientHttpError: status_code=503
```