Name and Version
llama.cpp server 7097 (b03ebc2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
This is not a pure stock master build: it includes
- The GLM-4.5/GLM-4.6 chat-template / tool-calling support from PR #15904 (“common: Yet another add GLM-4.5/GLM-4.6 tool calling support”).
- The server-side prompt cache support from PR #16391 (“server : host-memory prompt caching”), which introduces `--cache-ram`/`server_prompt_cache` and logs:

```
srv init: prompt cache is enabled, size limit: ...
srv init: use `--cache-ram 0` to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
```
Operating systems
Linux
GGML backends
CUDA
Hardware
CPU: AMD EPYC 9255
RAM: 768 GB
GPUs:
GPU 0–1: NVIDIA Blackwell (used for chat model)
GPU 2: NVIDIA RTX 4090 (used by a separate llama-server process for embeddings; that one is stable)
Models
GLM-4.6 GGUF from Unsloth (REAP variant)
Example file:
GLM-4.6-REAP-268B-A32B-Q5_K_S-00001-of-00004.gguf
Quantization: Q5_K_S
Qwen3-Embedding-8B, GGUF Q5_K_M
File: Qwen3-Embedding-8B-Q5_K_M.gguf
Problem description & steps to reproduce
Summary:
When running llama-server with GLM-4.6 GGUF and using the OpenAI-compatible /v1/chat/completions endpoint under heavy “structured output” load (BAML / Cognee knowledge-graph extraction), the server eventually becomes unstable and crashes with a segmentation fault, or becomes unresponsive and clients get timeouts.
The workload is many non-streaming chat requests, each with:
A large system prompt (structured-output / graph extraction instructions, JSON-like schema)
A medium-sized user input (~500–1,000 tokens) containing HTML/Markdown from documentation
Moderately high concurrency (multiple parallel requests via asyncio.gather)
The same crash pattern occurred earlier with a MiniMax-M2 GGUF model; I’ve switched to GLM-4.6 but the behaviour persists.
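For context, the client-side load pattern looks roughly like the following sketch (a hypothetical, stdlib-only stand-in for the BAML/Cognee pipeline; the URL, model alias, prompts, and concurrency limit here are assumptions, not the real client code):

```python
import asyncio
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed local chat server

# Stand-in for the large structured-output / graph-extraction system prompt.
SYSTEM_PROMPT = "You are a graph extractor. Return JSON matching this schema: ..."

def build_payload(user_text: str) -> dict:
    # Non-streaming chat request, as issued by the structured-output client.
    return {
        "model": "glm-4.6-reap",
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    }

def post_chat(user_text: str) -> dict:
    # Blocking HTTP POST; run in a thread so asyncio can overlap requests.
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(user_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

async def run(docs: list[str], concurrency: int = 8) -> list[dict]:
    # Moderately high concurrency via asyncio.gather, mirroring the workload.
    sem = asyncio.Semaphore(concurrency)

    async def one(doc: str) -> dict:
        async with sem:
            return await asyncio.to_thread(post_chat, doc)

    return await asyncio.gather(*(one(d) for d in docs))
```

Each `docs` entry is ~500–1,000 tokens of HTML/Markdown; the crash shows up after many such batches, not on the first request.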
Runtime commands
Chat server (the one that crashes), on two Blackwell GPUs:
```sh
cd /path/to/llama.cpp/build/bin
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export GGML_CUDA_USE_GRAPHS=0
CHAT_MODEL="/path/to/GLM-4.6-REAP-268B-A32B-Q5_K_S-00001-of-00004.gguf"

./llama-server \
    --host 127.0.0.1 \
    --port 8080 \
    -m "$CHAT_MODEL" \
    --alias glm-4.6-reap \
    --ctx-size 16384 \
    --batch-size 512 \
    --ubatch-size 128 \
    --kv-unified \
    --cache-ram 0 \
    -ngl 999 \
    --flash-attn auto \
    --jinja
```
Embedding server (for completeness, but this one is stable):
```sh
cd /path/to/llama.cpp/build/bin
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=2
export GGML_CUDA_USE_GRAPHS=0
EMB_MODEL="/path/to/Qwen3-Embedding-8B-Q5_K_M.gguf"

./llama-server \
    --host 127.0.0.1 \
    --port 8082 \
    -m "$EMB_MODEL" \
    --alias text-embedding-3-large \
    --embedding \
    --pooling last \
    --batch-size 64 \
    --ubatch-size 64 \
    -ngl 999
```
First Bad Commit
I don’t have a bisect yet.
The issue is present on the current master + the two PRs mentioned above (prompt cache and Unsloth GLM chat format).
I previously saw similar instability using a MiniMax-M2 GGUF model (also with --jinja and a custom template) on an earlier build.
I have not yet tested a clean master build without those patches or with a smaller model; happy to do that if that would be useful.
If you’d like, I can:
- Rebuild on a specific known-good tag or commit you suggest.
- Retry with debug symbols and run under gdb to capture a full backtrace at the point of the segfault.
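For reference, the debug capture I have in mind would look roughly like this (a sketch, assuming a CUDA cmake build from the repo root; server flags copied from the chat-server command above):

```shell
# 1) Rebuild with debug info (CUDA build assumed):
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_CUDA=ON
cmake --build build -j

# 2) Run the crashing server under gdb and dump a full backtrace on segfault:
gdb -ex run -ex "bt full" -ex "info threads" --args \
    ./build/bin/llama-server -m "$CHAT_MODEL" --host 127.0.0.1 --port 8080 \
    --ctx-size 16384 --kv-unified --cache-ram 0 -ngl 999 --flash-attn auto --jinja
```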
Relevant log output
Below is a representative snippet from the llama-server log right before the crash, plus the client-side error.
Server-side (llama.cpp):
```
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_seq     = 16384
llama_context: n_batch       = 512
llama_context: n_ubatch      = 128
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
...
srv  init: prompt cache is enabled, size limit: 8192 MiB
srv  init: use `--cache-ram 0` to disable the prompt cache
srv  init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv  init: thinking = 1
main: model loaded
main: chat template, chat_template: {# Unsloth template fixes #}
...
srv  update_slots: id 0 | task 4125 | n_tokens = 680, memory_seq_rm [680, end)
srv  update_slots: id 0 | task 4125 | prompt processing progress, n_tokens = 680, batch.n_tokens = 68, progress = 1.000000
srv  update_slots: id 0 | task 4125 | prompt done, n_tokens = 680, batch.n_tokens = 68
srv  stop: cancel task, id_task = 4169
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: GLM 4.5
srv  stop: cancel task, id_task = 4170
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: GLM 4.5
...
# many repeated lines of "srv stop: cancel task, id_task = XXXX" and "srv params_from_: Chat format: GLM 4.5"
...
Segmentation fault (core dumped)
```
On another run, with a Ctrl+C during shutdown I also saw:
```
^Csrv  operator(): operator(): cleaning up before exit...
libggml-base.so.0(+0x1840b)[0x...]
libggml-base.so.0(ggml_print_backtrace+0x21f)[0x...]
libggml-base.so.0(+0x2ba6f)[0x...]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x...]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x...]
./llama-server(+0xe4520)[0x...]
./llama-server(+0xe465c)[0x...]
./llama-server(+0x80ee4)[0x...]
./llama-server(+0x874af)[0x...]
./llama-server(+0x24ff62)[0x...]
./llama-server(+0x27da64)[0x...]
./llama-server(+0x27e3b8)[0x...]
./llama-server(+0x93e85)[0x...]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x...]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x...]
```
Client-side (BAML / Cognee), at the point where the server stops responding:
```
BamlClientHttpError(client_name=openai, message=reqwest::Error {
    kind: Request,
    url: "http://127.0.0.1:8080/v1/chat/completions",
    source: hyper_util::client::legacy::Error(
        Connect,
        ConnectError("tcp connect error", 127.0.0.1:8080, TimedOut)
    )
}, status_code=503)
...
  File ".../extract_content_graph.py", line 32, in extract_content_graph
    content_graph = await LLMGateway.acreate_structured_output(...)
...
baml_py.internal_monkeypatch.BamlClientHttpError: status_code=503
```