server : print warning when HTTP timeout exceeded#22907

Merged
ggerganov merged 1 commit into
masterfrom
gg/server-timeout-warning
May 10, 2026
@ggerganov ggerganov commented May 10, 2026

Overview

Long-lasting (i.e. more than 10 mins) generations with stream: false, in router mode, get terminated by the should_stop() condition. However, we don't see any information about it in the logs.

Promoting the debug message to a warning to help understand what is happening in such cases.
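
The behaviour described above can be pictured with a minimal sketch. This is not the llama-server code (which is C++): `wait_for_result`, `poll`, and the `srv` logger are hypothetical names, and only the `should_stop` idea and the warning text are taken from this thread.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("srv")  # hypothetical logger name


def wait_for_result(
    poll: Callable[[], Optional[str]],
    should_stop: Callable[[], bool],
    poll_interval: float = 0.01,
) -> Optional[str]:
    """Poll for a result until one arrives or should_stop() returns True."""
    while True:
        result = poll()
        if result is not None:
            return result
        if should_stop():
            # Logged at WARNING rather than DEBUG so that it is visible
            # at the default log level.
            log.warning(
                "stopping wait for next result due to should_stop condition "
                "(adjust the --timeout argument if needed)"
            )
            return None
        time.sleep(poll_interval)
```

In this sketch a request that never produces a result returns `None` once `should_stop()` fires, and the only trace of that in the logs is the warning line.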

Requirements

@ggerganov ggerganov requested a review from a team as a code owner May 10, 2026 13:40
@ggerganov ggerganov force-pushed the gg/server-timeout-warning branch from e716e06 to e4e3ca6 Compare May 10, 2026 13:41
@ggerganov ggerganov merged commit 389ff61 into master May 10, 2026
46 checks passed
@ggerganov ggerganov deleted the gg/server-timeout-warning branch May 10, 2026 19:00
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 12, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026
kripper commented May 24, 2026

This is either an HTTP timeout (llama-server is not respecting --timeout, or the HTTP libraries handle different timeouts) or the client disconnecting unexpectedly (client-side timeout).
The task is canceled and the computed cache is released (this shouldn't be done IMO), so when the client retries the request, it starts from scratch, the timeout kicks in again, and we end up in a deadlock (#22160).

ngxson commented May 28, 2026

I don't think there is a problem with server --timeout. It seems like most reported problems are due to the timeout from client side, not the server.

In any case, we can simply bump --timeout to a very large number and remove this message, which can be a bit misleading.

kripper commented May 28, 2026

> I don't think there is a problem with server --timeout. It seems like most reported problems are due to the timeout from client side, not the server.

Yes, it could be a client-side timeout.

> In any case, we can simply bump --timeout to a very large number and remove this message, which can be a bit misleading.

I think it is important to detect both situations and explicitly show a message saying "Client disconnected unexpectedly" or "The request timed out on the server (adjust the --timeout argument if needed)."
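
As a rough illustration of the suggestion above, the two situations could be reported separately along these lines. This is a sketch only: `StopReason`, `classify`, and `stop_message` are hypothetical names, and it assumes the server can observe both whether the connection is still alive and how long the request has been running, which this thread does not confirm.

```python
from enum import Enum
from typing import Optional


class StopReason(Enum):
    CLIENT_DISCONNECTED = "client_disconnected"
    SERVER_TIMEOUT = "server_timeout"


def classify(connection_alive: bool, elapsed_s: float,
             timeout_s: float) -> Optional[StopReason]:
    """Return why a request should stop, or None if it should keep running."""
    if not connection_alive:
        return StopReason.CLIENT_DISCONNECTED
    if elapsed_s >= timeout_s:
        return StopReason.SERVER_TIMEOUT
    return None


def stop_message(reason: StopReason) -> str:
    """Map a stop reason to the log messages proposed in the comment above."""
    if reason is StopReason.CLIENT_DISCONNECTED:
        return "Client disconnected unexpectedly"
    return ("The request timed out on the server "
            "(adjust the --timeout argument if needed).")
```

For example, `stop_message(classify(False, 120.0, 600.0))` yields the client-side message, because the connection dropped before the server limit was reached.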

alexhalf commented May 30, 2026

I updated llama.cpp from b8895 to b9296 and switched to the MTP versions of the same models (qwen3.5-35b-3ab and qwen3.5-9B), and I now hit this problem with both. First with the 35b model (which is slow for me, with a cached prompt of 25k context), and then with qwen3.5-9b when my context is about 190k and the server has to spend more time, even with the fast 9b model, invalidating the whole prompt cache. llama-swap is on the same version, but now I often see the stop message, and I cannot continue the chat with its context because processing is stopped.

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
alexhalf commented May 31, 2026

My apologies.

I set this experimental config in my llama-swap:

"Qwen3.5-9B-Q4_K_M":
  cmd: |
    /models/llm/llama-b9296/bin/helper.sh --port 8038 --model /models/Qwen3.5-9B-MTP-Q4_K_M.gguf --mmproj /models/qwen3.5-9b-mmproj-BF16.gguf
    --reasoning off
    -ngl 99
    --ctx-size 262144
    --spec-type draft-mtp
    --batch-size 512
    --ubatch-size 256
    --flash-attn on
    --jinja
    --metrics
    --mlock
    --no-mmap
    --cache-type-k q4_0
    --cache-type-v q4_0
    --parallel 1
    --cont-batching
    --temp 0.7
    --top-p 0.95
    --top-k 40
    --slot-save-path /models/cache/qwen3.5-9b
    --ctx-checkpoints 1024
    --timeout 10
  proxy: http://127.0.0.1:8038
  proxyTimeout: 20
  timeouts:
    read: 30
    write: 40
    responseHeader: 50

and it makes no difference, because after 120 seconds I still get the stop signal:

2.04.697.982 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 116736, progress = 0.54, t = 119.08 s / 980.30 tokens per second
2.05.359.947 W srv next: stopping wait for next result due to should_stop condition (adjust the --timeout argument if needed)
2.05.359.969 W srv next: ref: #22907

But I checked the request headers in the llama-swap log:

image

I updated openclaw from the 2026-4-23 release to the 2026-5-19 release, but I did not change my openclaw config. However, I found an issue involving deepseek: openclaw/openclaw#76117

I had the default timeout of 30 minutes in openclaw. I then tried setting models.providers.NAME_OF_YOUR_PROVIDER.timeoutSeconds: 1210, and that helped.
image
Now it's ok
99.765 I slot print_timing: id 0 | task 1 | prompt processing, n_tokens = 217686, progress = 1.00, t = 292.61 s / 743.96 tokens per second
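
The timing lines above allow a quick sanity check of whether a given timeout can cover prompt processing at all. The sketch below uses only the token counts and throughput from those log lines; the function names and the `margin` headroom factor are hypothetical, and it assumes throughput stays roughly constant over the whole prompt.

```python
def prefill_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Estimated prompt-processing time at a steady throughput."""
    return n_tokens / tokens_per_second


def timeout_is_sufficient(timeout_s: float, n_tokens: int,
                          tokens_per_second: float,
                          margin: float = 1.2) -> bool:
    """True if timeout_s covers the estimated prefill time plus headroom.

    margin is a hypothetical safety factor, not a value from this thread.
    """
    return timeout_s >= margin * prefill_seconds(n_tokens, tokens_per_second)


# Figures taken from the print_timing log lines in this comment.
estimate = prefill_seconds(217686, 743.96)
print(f"estimated prefill: {estimate:.1f} s")
print(timeout_is_sufficient(120, 217686, 743.96))   # the 120 s cut-off
print(timeout_is_sufficient(1210, 217686, 743.96))  # timeoutSeconds: 1210
```

With these numbers the estimate comes to about 293 s, so any limit of 120 s along the request path is too short for this prompt, while 1210 s leaves ample room.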

maxhbr added a commit to maxhbr/myconfig that referenced this pull request Jun 5, 2026
llama-server's default slot timeout is 30 s. Qwen3.6 models using SWA /
hybrid memory regularly trigger a full prompt re-prefill on cache miss
("forcing full prompt re-processing due to lack of cache data"). A
39 k-token context re-prefill can easily exceed 30 s, causing the server
to abort the slot with "stopping wait for next result due to
should_stop condition" and return a 500 to LiteLLM
("proxy error: Failed to read connection").

Set timeout = 600 (10 min) in:
- router.nix globalKeys  (INI [*] section for the llama-server backend)
- lib/scripts.nix mkLlamaScript  (llama-swap per-model wrappers)

Ref: ggml-org/llama.cpp#22907