server: bump timeout to 3600s #23842
    if (time_elapsed_ms > 30000) {
        SRV_WRN("%s", "request cancelled after 30s, likely a client-side timeout; please check your client's code\n");
    }
Note: it would be better to detect whether time_elapsed_ms exceeds the server's --timeout here and log a different message in that case, but given the way things are structured, this proved to be quite complicated.
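A minimal sketch of the check this note describes, assuming the server's `--timeout` value (in seconds) were reachable at the point where the cancellation is logged. The helper name `cancel_reason` and both parameters are hypothetical and not part of the server code; only the 30000 ms threshold comes from the diff.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: pick the explanation to log for a cancelled request.
// server_timeout_s stands in for the value of --timeout (assumption: it can
// be passed to this code path, which the note says is not easy today).
std::string cancel_reason(int64_t time_elapsed_ms, int64_t server_timeout_s) {
    if (time_elapsed_ms > server_timeout_s * 1000) {
        // The request outlived the server's own limit.
        return "server-side timeout";
    }
    if (time_elapsed_ms > 30000) {
        // Long-running but within the server limit: the client likely gave up.
        return "likely client-side timeout";
    }
    // Short requests are assumed to be ordinary client cancellations.
    return "cancelled by client";
}
```

With the new default of 3600 seconds, a request cancelled after 40 seconds would fall into the second branch, which is the case the warning in this diff targets.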
Would the client-set timeout for the request be easier to log? It would have avoided the confusion.
But how? AFAIK the client never communicates such info to the server.
I might misunderstand the PR, but isn't the log message "request cancelled after 30s" a bit confusing?
It could appear with a time_elapsed_ms anywhere between 30s and 3600s, but to a user reading this log it looks like exactly 30s.
Why not use the actual time_elapsed_ms value?
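One way this suggestion could look, as a sketch only: build the message from the measured value instead of the fixed "30s". The function name `format_cancel_msg` is hypothetical; the real code logs through `SRV_WRN`, which is not reproduced here.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format the warning with the actual elapsed time.
std::string format_cancel_msg(int64_t time_elapsed_ms) {
    char buf[160];
    // Convert milliseconds to seconds with one decimal place.
    std::snprintf(buf, sizeof(buf),
                  "request cancelled after %.1fs, likely a client-side timeout; "
                  "please check your client's code",
                  time_elapsed_ms / 1000.0);
    return std::string(buf);
}
```

The returned string could then be passed to the existing log call in place of the literal message.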
> but how? AFAIK client never communicate such info to server

You are right, I should have checked before asking. The whole log message just implied that the request was cancelled on the server's initiative.
* server: bump timeout to 3600s
* nits: change wording
Overview
IMPORTANT: the server's --timeout works fine. For users who reported timeout-related problems, check your client code first; some HTTP frameworks and browsers have a default client-side timeout. Ref discussion from #22907
Fix #23832
Fix #22997
Bump the timeout to one hour. This "ought to be enough for anybody".
Also print a message reminding users to check the client-side timeout.
How I tested this change
Here is how I tested it:
llama_decode() in server-context.cpp: std::this_thread::sleep_for(std::chrono::seconds(1000000)); this simulates a long blocking task
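The effect of such a blocking call can be illustrated in isolation. The sketch below is not the test used in this PR and does not touch the server: it runs a sleeping task (a stand-in for a blocked `llama_decode()`) on another thread and reports whether a waiting "client" with its own deadline gives up first. The function name `client_times_out` and the durations are assumptions chosen for illustration.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Returns true if the simulated client stops waiting before the task finishes.
bool client_times_out(std::chrono::milliseconds task_duration,
                      std::chrono::milliseconds client_timeout) {
    // Stand-in for a long blocking decode call.
    std::future<void> task = std::async(std::launch::async, [task_duration] {
        std::this_thread::sleep_for(task_duration);
    });
    // The client waits only up to its own timeout.
    const bool timed_out =
        task.wait_for(client_timeout) == std::future_status::timeout;
    // The future's destructor still waits for the task to end, mirroring
    // a server that keeps working after the client has disconnected.
    return timed_out;
}
```

A client whose timeout is shorter than the task duration reports a timeout even though the task itself is never interrupted, which is the situation the new warning message points users towards.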
Requirements