
server: bump timeout to 3600s #23842

Merged
ngxson merged 2 commits into master from xsn/server_bump_timeout
May 29, 2026



@ngxson ngxson commented May 28, 2026

Overview

IMPORTANT: the server's --timeout works fine. If you reported a timeout-related problem, check your client code first: some HTTP frameworks and browsers have a default client-side timeout.

Ref discussion from #22907

Fix #23832

Fix #22997

Bump the timeout to one hour. This "ought to be enough for anybody".

Also print a message to remind users about the client's timeout.

How I tested this change

Here is how I tested it:

  1. Add this either before or after llama_decode() in server-context.cpp: std::this_thread::sleep_for(std::chrono::seconds(1000000));
    This simulates a long blocking task.
  2. Remember to configure the client's timeout. In my case, that was Postman: https://stackoverflow.com/questions/36355732/how-to-increase-postman-client-request-timeout
  3. Send a request and wait.
  4. 17 minutes later, I stopped the request:
0.36.243.889 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.36.243.895 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 131072 tokens, 8589934592 est)
0.36.243.899 I srv  get_availabl: prompt cache update took 0.04 ms
0.36.244.264 I slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
17.11.781.940 W srv          next: request cancelled after 30s, likely a client-side timeout; please check your client's code
17.11.781.948 W srv          stop: cancel task, id_task = 0
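
The stall used in step 1 can be tried in isolation. The following is a minimal sketch, not the server's code: `fake_decode` and `decode_with_stall` are hypothetical names standing in for the real decode call, and the sleep duration is a parameter so it can be kept short.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the real decode call; it does no work here.
static int fake_decode() {
    return 0;
}

// Sleeps for `stall` before calling the stand-in, mimicking a long blocking
// task, and returns the measured wall-clock time in milliseconds.
static int64_t decode_with_stall(std::chrono::milliseconds stall) {
    const auto t_start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(stall);
    fake_decode();
    const auto t_end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count();
}
```

Any client that gives up before the stall ends will show up on the server side as a cancelled request, which is the situation the log above captures.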

Requirements

@ngxson ngxson requested review from a team as code owners May 28, 2026 21:49
Comment on lines +385 to +387
if (time_elapsed_ms > 30000) {
    SRV_WRN("%s", "request cancelled after 30s, likely a client-side timeout; please check your client's code\n");
}
Contributor Author


Note: it would be better to detect whether time_elapsed_ms exceeds the server's --timeout here and log a different message in that case. However, due to the way things are structured, this proved to be quite complicated.
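
For illustration only, the distinction described in this note could look like the sketch below. It assumes the elapsed time and the configured timeout are both available in one place, which is exactly what the note says is hard to arrange in the real code. `describe_cancel` and its message strings are hypothetical and not part of llama.cpp.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: picks a log message for a cancelled request.
// `server_timeout_s` is assumed to hold the value passed via --timeout.
static std::string describe_cancel(int64_t time_elapsed_ms, int64_t server_timeout_s) {
    if (time_elapsed_ms > server_timeout_s * 1000) {
        // The request outlived the server's own limit.
        return "request exceeded the server's --timeout";
    }
    if (time_elapsed_ms > 30000) {
        // Long-running, but still within the server's limit.
        return "request cancelled after more than 30s, likely a client-side timeout";
    }
    // Short requests get no extra message in this sketch.
    return "";
}
```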


@sasa7812 sasa7812 May 29, 2026


Would the client-set timeout for the request be easier to log? That would have avoided the confusion.

Contributor Author

@ngxson ngxson May 29, 2026


But how? AFAIK the client never communicates such info to the server.


@bonswouar bonswouar May 29, 2026


I might be misunderstanding the PR, but isn't the log message "request cancelled after 30s" a bit confusing?
This message can be shown with a time_elapsed_ms anywhere between 30s and 3600s, but to a user reading the log it looks like exactly 30s.
Why not use the actual time_elapsed_ms value?
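
A rough sketch of that suggestion, assuming the message is built with `snprintf` before being handed to the logger. `format_cancel_warning` is a hypothetical helper, not an existing function in the server.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: builds the warning text with the measured elapsed
// time (in seconds, one decimal place) instead of a fixed "30s".
static std::string format_cancel_warning(int64_t time_elapsed_ms) {
    char buf[160];
    std::snprintf(buf, sizeof(buf),
                  "request cancelled after %.1fs, likely a client-side timeout; "
                  "please check your client's code",
                  time_elapsed_ms / 1000.0);
    return std::string(buf);
}
```

With the 17-minute run from the PR description, this would report roughly 1031 seconds rather than 30.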


> But how? AFAIK the client never communicates such info to the server.

You are right, I should have checked before asking. The log message as a whole just implied that the request was cancelled on the server's initiative.

@ngxson ngxson merged commit cb47092 into master May 29, 2026
27 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 29, 2026
* origin/master:
vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826)
graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)
server: remove obsolete scripts (ggml-org#23870)
ci : update macos release to use macos-26 runner (ggml-org#23878)
download: add option to skip_download (ggml-org#23059)
mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975)
CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530)
server: bump timeout to 3600s (ggml-org#23842)
model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346)
llama: use f16 mask for FA to save VRAM (ggml-org#23764)
sync : ggml
ggml : bump version to 0.13.1 (ggml/1523)
ngram-mod : Add missing include (ggml-org#23857)
llama: add llm_graph_input_mtp (ggml-org#23643)
app : move licences to llama-app (ggml-org#23824)
cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)
meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
* server: bump timeout to 3600s

* nits: change wording
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
* server: bump timeout to 3600s

* nits: change wording

5 participants