server : print warning when HTTP timeout exceeded by ggerganov · Pull Request #22907 · ggml-org/llama.cpp

ggerganov · 2026-05-10T13:40:32Z

Overview

Long-lasting (i.e. more than 10 mins) generations with stream: false, in router mode, get terminated by the should_stop() condition. However, we don't see any information about it in the logs.

Promoting the debug message to a warning to help understand what is happening in such cases.
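To illustrate the pattern described above, here is a minimal Python sketch of a blocking wait that polls a stop predicate and logs at warning level when it gives up. It is not the llama.cpp implementation (which is C++); the names `wait_next`, `deadline_stop`, and `poll_s` are hypothetical and chosen only for this example.

```python
import logging
import queue
import time
from typing import Callable, Optional

log = logging.getLogger("srv")


def deadline_stop(timeout_s: float) -> Callable[[], bool]:
    """Return a predicate that becomes true once timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    return lambda: time.monotonic() >= deadline


def wait_next(
    results: "queue.Queue[str]",
    should_stop: Callable[[], bool],
    poll_s: float = 0.05,
) -> Optional[str]:
    """Block until a result arrives, or return None once should_stop() is true."""
    while True:
        try:
            # Wake up periodically so the stop condition can be re-checked.
            return results.get(timeout=poll_s)
        except queue.Empty:
            if should_stop():
                # Logged as a warning (not debug) so the abort is visible by default.
                log.warning(
                    "stopping wait for next result due to should_stop condition"
                )
                return None
```

With a debug-level message the `None` return would be silent under default logging, which is the situation this PR addresses.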

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

(cherry picked from commit 389ff61)

kripper · 2026-05-24T02:31:59Z

This is caused either by an HTTP timeout (llama-server is not respecting --timeout / the HTTP libraries handle different timeouts) or by the client disconnecting unexpectedly (client-side timeout).
The task is canceled and the computed cache is released (which IMO shouldn't happen), so when the client retries the request it starts from scratch, the timeout kicks in again, and we end up in a deadlock (#22160).

ngxson · 2026-05-28T20:21:45Z

I don't think there is a problem with server --timeout. It seems like most reported problems are due to the timeout from client side, not the server.

In any case, we can simply bump --timeout to a very large number and remove this message, which can be a bit misleading.

kripper · 2026-05-28T21:29:00Z

> I don't think there is a problem with server --timeout. It seems like most reported problems are due to the timeout from client side, not the server.

Yes, it could be a client-side timeout.

> In any case, we can simply bump --timeout to a very large number and remove this message, which can be a bit misleading.

I think it is important to detect both situations and explicitly show a message saying "Client disconnected unexpectedly" or "The request timed out on the server (adjust the --timeout argument if needed)."
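A minimal sketch of the distinction proposed here, written in Python for brevity. It assumes the server can observe two things: whether the connection is still alive and how long the request has been running. The names `StopReason`, `classify_stop`, and `stop_message` are hypothetical and are not part of llama-server; only the two message strings come from the comment above.

```python
from enum import Enum
from typing import Optional


class StopReason(Enum):
    NONE = "none"
    CLIENT_DISCONNECTED = "client_disconnected"
    SERVER_TIMEOUT = "server_timeout"


def classify_stop(connection_alive: bool, elapsed_s: float, timeout_s: float) -> StopReason:
    """Decide why a request should stop, checking the client connection first."""
    if not connection_alive:
        return StopReason.CLIENT_DISCONNECTED
    if elapsed_s >= timeout_s:
        return StopReason.SERVER_TIMEOUT
    return StopReason.NONE


# Message texts taken from the proposal in this comment.
_MESSAGES = {
    StopReason.CLIENT_DISCONNECTED: "Client disconnected unexpectedly",
    StopReason.SERVER_TIMEOUT: (
        "The request timed out on the server "
        "(adjust the --timeout argument if needed)"
    ),
}


def stop_message(reason: StopReason) -> Optional[str]:
    """Return the log message for a stop reason, or None if nothing should be logged."""
    return _MESSAGES.get(reason)
```

Checking the connection before the elapsed time means a client that gave up early is never reported as a server-side timeout.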

alexhalf · 2026-05-30T20:35:05Z

I updated llama.cpp from b8895 to b9296 and switched to the MTP versions of the same models (qwen3.5-35b-3ab and qwen3.5-9B), and I now hit the problem with both. It happens first with the 35b model (which is slow for me with a cached 25k-token prompt), and also with qwen3.5-9b when my context is about 190k tokens and the server, even with the fast 9b model, has to spend more time invalidating the whole prompt cache. llama-swap is the same version as before. I now often see the stop message, and I cannot continue the chat with the existing context because processing is stopped.

alexhalf · 2026-05-31T21:41:31Z

My apologies.

I set this experimental config in my llama-swap:

"Qwen3.5-9B-Q4_K_M":
  cmd: |
    /models/llm/llama-b9296/bin/helper.sh --port 8038 --model /models/Qwen3.5-9B-MTP-Q4_K_M.gguf --mmproj /models/qwen3.5-9b-mmproj-BF16.gguf
    --reasoning off
    -ngl 99
    --ctx-size 262144
    --spec-type draft-mtp
    --batch-size 512
    --ubatch-size 256
    --flash-attn on
    --jinja
    --metrics
    --mlock
    --no-mmap
    --cache-type-k q4_0
    --cache-type-v q4_0
    --parallel 1
    --cont-batching
    --temp 0.7
    --top-p 0.95
    --top-k 40
    --slot-save-path /models/cache/qwen3.5-9b
    --ctx-checkpoints 1024
    --timeout 10
  proxy: http://127.0.0.1:8038
  proxyTimeout: 20
  timeouts:
    read: 30
    write: 40
    responseHeader: 50

and it makes no difference, because after 120 seconds I get the stop signal:

2.04.697.982 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 116736, progress = 0.54, t = 119.08 s / 980.30 tokens per second
2.05.359.947 W srv next: stopping wait for next result due to should_stop condition (adjust the --timeout argument if needed)
2.05.359.969 W srv next: ref: #22907

However, I checked the request headers in the llama-swap log.

I updated openclaw from the 2026-4-23 release to the 2026-5-19 release without changing my openclaw config, and I found an issue with deepseek: openclaw/openclaw#76117

I had the default timeout of 30 minutes in openclaw. I then tried setting models.providers.NAME_OF_YOUR_PROVIDER.timeoutSeconds: 1210,
and that helped.

Now it's ok
99.765 I slot print_timing: id 0 | task 1 | prompt processing, n_tokens = 217686, progress = 1.00, t = 292.61 s / 743.96 tokens per second
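As a rough sanity check on the numbers in this comment, the expected prompt-processing time can be estimated as token count divided by throughput and compared against whichever timeout applies. This is a back-of-the-envelope sketch: the helper names are hypothetical, and it assumes throughput stays constant, which the two log lines suggest is not quite true (980.30 tokens per second at 54% progress, 743.96 at completion).

```python
def prefill_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Estimate prompt-processing time, assuming constant throughput."""
    return n_tokens / tokens_per_second


def fits_timeout(n_tokens: int, tokens_per_second: float, timeout_s: float) -> bool:
    """True if the estimated prefill finishes before the timeout expires."""
    return prefill_seconds(n_tokens, tokens_per_second) < timeout_s


# Figures from the final log line above: 217686 tokens at 743.96 tokens per second.
full = prefill_seconds(217686, 743.96)
print(f"{full:.1f} s")                      # about 292.6 s, matching the logged t = 292.61 s

print(fits_timeout(217686, 743.96, 120))    # False: the request was cut off at about 120 s
print(fits_timeout(217686, 743.96, 1210))   # True: the client timeout that resolved it
```

Any timeout along the path (client, proxy, or server) that is shorter than the estimate will abort the request before the prompt is fully processed.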

llama-server's default slot timeout is 30 s. Qwen3.6 models using SWA / hybrid memory regularly trigger a full prompt re-prefill on cache miss ("forcing full prompt re-processing due to lack of cache data"). A 39 k-token context re-prefill can easily exceed 30 s, causing the server to abort the slot with "stopping wait for next result due to should_stop condition" and return a 500 to LiteLLM ("proxy error: Failed to read connection").

Set timeout = 600 (10 min) in:
- router.nix globalKeys (INI [*] section for the llama-server backend)
- lib/scripts.nix mkLlamaScript (llama-swap per-model wrappers)

Ref: ggml-org/llama.cpp#22907
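For illustration only, the INI fragment this commit message describes (a `[*]` section carrying `timeout = 600`) could be rendered as in the Python sketch below. The section and key names are taken from the commit message itself, not from llama-server documentation, so treat the exact format as an assumption. `render_global_timeout` is a hypothetical helper.

```python
import configparser
import io


def render_global_timeout(timeout_s: int) -> str:
    """Render an INI fragment with a [*] section that sets a timeout key."""
    cp = configparser.ConfigParser()
    # "[*]" is the global section named in the commit message (assumed format).
    cp["*"] = {"timeout": str(timeout_s)}
    buf = io.StringIO()
    cp.write(buf)
    return buf.getvalue()


print(render_global_timeout(600))
```

Generating the fragment from a single value keeps the router-wide setting and any per-model wrapper flags in sync if both are derived from the same number.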

ggerganov requested a review from a team as a code owner May 10, 2026 13:40

github-actions Bot added examples server labels May 10, 2026

server : print warning when HTTP timeout exceeded

e4e3ca6

ggerganov force-pushed the gg/server-timeout-warning branch from e716e06 to e4e3ca6 Compare May 10, 2026 13:41

ServeurpersoCom approved these changes May 10, 2026


taronaeo approved these changes May 10, 2026


ggerganov merged commit 389ff61 into master May 10, 2026
46 checks passed

ggerganov deleted the gg/server-timeout-warning branch May 10, 2026 19:00

xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 12, 2026

server : print warning when HTTP timeout exceeded (ggml-org#22907)

6f3b852

komadori82 mentioned this pull request May 17, 2026

Eval bug: stopping wait for next result due to should_stop condition when Prompt Processing is >60s #22997

Closed

bi4key mentioned this pull request May 18, 2026

llama.cpp stopping , problem with pause ? Doorman11991/smallcode#1

Closed

rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026

server : print warning when HTTP timeout exceeded (ggml-org#22907)

c0c61fd

ameaninglessname mentioned this pull request May 22, 2026

runner: Remove CGO engines, use llama-server exclusively for GGML models ollama/ollama#16031

Merged

corrm mentioned this pull request May 22, 2026

server: fix checkpoints creation #22929

Merged

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

server : print warning when HTTP timeout exceeded (ggml-org#22907)

7fe7e43

carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026

server : print warning when HTTP timeout exceeded (ggml-org#22907)

88dfaf6

(cherry picked from commit 389ff61)

winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026

server : print warning when HTTP timeout exceeded (ggml-org#22907)

f7b6d30

kripper mentioned this pull request May 28, 2026

fix(server): add connection_closed shared flag for HTTP streaming stop detection #23837

Closed

ngxson mentioned this pull request May 28, 2026

server: bump timeout to 3600s #23842

Merged

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

server : print warning when HTTP timeout exceeded (ggml-org#22907)

7b7ee8c
