server : print warning when HTTP timeout exceeded#22907
Conversation
e716e06 to
e4e3ca6
Compare
(cherry picked from commit 389ff61)
|
HTTP timeout (llama-server is not respecting the |
|
I don't think there is a problem with server --timeout. It seems like most reported problems are due to the timeout from client side, not the server. In any cases, we can simply bump --timeout to a very large number and remove this message, which can be a bit misleading |
Yes, it could be a client-side timeout.
I think it is important to detect both situations and explicitly show a message saying "Client disconnected unexpectedly" or "The request timed out on the server (adjust the --timeout argument if needed)." |
|
I update my llama from b8895 to b9296 + use the same, but now MTP versions of models (qwen3.5-35b-3ab and qwen3.5-9B) and face to the problem with both. At first with 35b (slowly for me with cached prompt 25k context) and with qwen3.5-9b when my context about 190k and server have to spend more time with fast model 9b to invalidate all prompt cache . llama-swap have the same version, now I often see the stop message. And I can not continue chat with context because handling is stoped |
|
My apologies. I set that experimented config in my llama-swap and it does not matter because after 120 seconds 2.04.697.982 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 116736, progress = 0.54, t = 119.08 s / 980.30 tokens per second but in llama-swap log I checked request headers
I updated my openclaw from 2026-4-23 to 2026-5-19 release, but I does not change my openclaw config. But I found issue with deepseek openclaw/openclaw#76117 I had ddefault timeout 30 minutes in openclaw and now |
llama-server's default slot timeout is 30 s. Qwen3.6 models using SWA /
hybrid memory regularly trigger a full prompt re-prefill on cache miss
("forcing full prompt re-processing due to lack of cache data"). A
39 k-token context re-prefill can easily exceed 30 s, causing the server
to abort the slot with "stopping wait for next result due to
should_stop condition" and return a 500 to LiteLLM
("proxy error: Failed to read connection").
Set timeout = 600 (10 min) in:
- router.nix globalKeys (INI [*] section for the llama-server backend)
- lib/scripts.nix mkLlamaScript (llama-swap per-model wrappers)
Ref: ggml-org/llama.cpp#22907


Overview
Long-lasting (i.e. more than 10 mins) generations with
stream: false, in router mode, get terminated by theshould_stop()condition. However, we don't see any information about it in the logs.Promoting the debug message to warning to help understand what is happing in such cases.
Requirements