Skip to content

server : add SSE keepalive#23994

Closed
joleuger wants to merge 1 commit into
ggml-org:masterfrom
joleuger:master
Closed

server : add SSE keepalive#23994
joleuger wants to merge 1 commit into
ggml-org:masterfrom
joleuger:master

Conversation

@joleuger
Copy link
Copy Markdown

@joleuger joleuger commented Jun 1, 2026

Overview

Introduce a --sse-keepalive-interval flag (default: disabled, 0 = disabled) that emits an SSE comment line (: keepalive) during silent periods of a streaming response. This prevents network infrastructure with idle-connection timeouts (proxies, load balancers, NAT, Node.js fetch at 300 s) from killing the connection during long prefill or slow generation.

  • add server_task_result_keepalive result type
  • add --sse-keepalive-interval / LLAMA_ARG_SSE_KEEPALIVE_INTERVAL
  • add keepalive_interval_seconds to server_response_reader::next and disable the keepalive for all non-streaming responses

Additional information

The SSE comment line is specified in section 9.2.6 of the Server Sent Events specification. The typescript API of openai also complies (see openai-node library).

Tested with:

  • a small test tool using the openai library. The stream via chat.completions.create({ stream: true }). keepalive comments are silently discarded, response text is clean.
  • A qwen code session with qwen-coder-next 80B that survived for longer than 600 seconds.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES:
    • Used deepwiki to analyze the code.
    • Used Anthropic Sonnet 4.5 to generate a small test tool using the openai library to check if the "keepalive" bleeds in.

Introduce a `--sse-keepalive-interval` flag (default: disabled, 0 = disabled)
that emits an SSE comment line (`: keepalive`) during silent periods of a
streaming response. This prevents network infrastructure with idle-connection
timeouts (proxies, load balancers, NAT, Node.js fetch at 300 s) from killing
the connection during long prefill or slow generation.

- add `server_task_result_keepalive`
  result type
- add `--sse-keepalive-interval` / `LLAMA_ARG_SSE_KEEPALIVE_INTERVAL`
- add `keepalive_interval_seconds` to server_response_reader::next and disable the keepalive for all non-streaming responses
@joleuger
Copy link
Copy Markdown
Author

joleuger commented Jun 1, 2026

Test tool result (tool is on github gist)

BASE_URL=http://192.168.51.234:15231/v1 bun index.ts --words 40000 --model qwencoder-deep
  padding  : 40,000 words (~52,000 tokens estimated)
SSE comment compatibility test
──────────────────────────────────────────────────
  base URL : http://192.168.51.234:15231/v1
  model    : qwencoder-deep
  messages : 2

I received this message.

──────────────────────────────────────────────────
Result: ✓ stream completed
  elapsed              : 323.1s
  SDK chunks           : 7
  tokens received      : 5
  keepalive comments   : 0
  other comments       : 0
  ~ no keepalive comments seen (direct connection, or interval not reached)

@joleuger
Copy link
Copy Markdown
Author

joleuger commented Jun 1, 2026

Results of the ci tests:
100% tests passed, 0 tests failed out of 45

Full log ci-tests.md

Results of the server tests:
========== 282 passed, 3 skipped, 199 deselected in 455.53s (0:07:35) ==========

Full log server-tests.md

@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented Jun 1, 2026

as explained for n-th times: this is NOT necessary. none of the inference runtimes (vllm/ollama/etc) support it

@ngxson ngxson closed this Jun 1, 2026
@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented Jun 1, 2026

if you have problems with timeout, THAT IS UP TO YOUR CLIENT

@nonbasketless
Copy link
Copy Markdown

I just want to point out that pi is blaming llama.cpp and llama.cpp is blaming pi. I have a workaround and am not taking sides, but it's worth saying that at least one of ya'll is incorrect.

@aldehir
Copy link
Copy Markdown
Contributor

aldehir commented Jun 2, 2026

I just want to point out that pi is blaming llama.cpp and llama.cpp is blaming pi. I have a workaround and am not taking sides, but it's worth saying that at least one of ya'll is incorrect.

See ngxson's response from your own issue in Pi's repo,

so my conclusion is that SSE keep alive / heartbeat is still needed in this situation, but I will push a proper implementation tomorrow

earendil-works/pi#5089 (comment)

Note, tool call argument streaming was broken for Qwen models with their edit tool but has since been resolved in #23173.

@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented Jun 2, 2026

@nonbasketless no, I'm blame the fact that no one has dig deep enough to show where the problem comes from.

people have been blaming --timeout option in llama.cpp / httplib for not being handled properly while it works fine, not until I pointed out the truth that their client code have problems and many people confirmed that it's true: #22997

people push vibe fix, slop code to us but no one can take some minutes to ask AI to dig into the root cause. well at least I did, that took me 20 minutes, earendil-works/pi#5089 (comment)

@joleuger
Copy link
Copy Markdown
Author

joleuger commented Jun 2, 2026

@nonbasketless no, I'm blame the fact that no one has dig deep enough to show where the problem comes from.

people have been blaming --timeout option in llama.cpp / httplib for not being handled properly while it works fine, not until I pointed out the truth that their client code have problems and many people confirmed that it's true: #22997

people push vibe fix, slop code to us but no one can take some minutes to ask AI to dig into the root cause. well at least I did, that took me 20 minutes, earendil-works/pi#5089 (comment)

Sorry, I guess I should have been more specific. For qwen code the analysis has been done in the linked issue by the qwen team. I just summed it up to "Node.js fetch at 300 s" Bun does not have the problem. But still, I think the addition is worth the effort. It is spec-compliant, I have tested it (see protocols), it should work even with the first real open source openai release of openai-node, it helps solving issues for end users, and it also works in other situations with a less reliable networking.

Regarding AI slop. I can understand that the new technology is also a burden for maintainers facing unreviewed code drops. However, I do not think that characterization applies here.

Because I was unfamiliar with this part of the codebase, I first explored an alternative implementation by making the change in server_routes::create_response(bool bypass_sleep, int keepalive_interval) (see version 1). I finally came up with the solution here, because I think it is less invasive with only 40 lines added and 4 lines removed. To ensure that it does not have a negative impact on existing users, I have executed all relevant test suites. From your perspective as a maintainer, what additional analysis, testing, or documentation would have made this contribution more useful or easier to review besides more context?

@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented Jun 2, 2026

what additional analysis, testing, or documentation would have made this contribution more useful or easier to review besides more context?

sorry in advance I might be hard here, but I won't call it an "analysis". I never take an AI's output for granted without hard evidences.

both nodejs and bun are open-source, I don't get why people can't simply link to the exact line in the source code that handle bodyTimeout and throw the error?

and if you look into earendil-works/pi#5089 (comment) , I don't just get the exact line of code where it's handled, I also investigated multiple alternative methods. that's the part many people missed before deciding what to fix

@joleuger
Copy link
Copy Markdown
Author

joleuger commented Jun 2, 2026

Thank you! I have looked into your change and it keeps the change more local and does not rely on an artifical struct to have a return value. I like that.

Btw. I really am grateful for the time you invest into llama.cpp. I especially want to highlight the router-mode and the efficiency improvements for the qwen models. Cheers!

@nonbasketless
Copy link
Copy Markdown

I just want to point out that pi is blaming llama.cpp and llama.cpp is blaming pi. I have a workaround and am not taking sides, but it's worth saying that at least one of ya'll is incorrect.

See ngxson's response from your own issue in Pi's repo,

I did! Bless ngxson's soul for being a detective. I just wanted to say something because this issue was starting to feel stale and I think it's in pi and llama.cpp's best interest to fix it. I write backend C++ and don't know jack about node/http (I know I sound stupid), but imagine an average user firing these tools up and after a few mins it times out. They shouldn't have to dig and tweak something like that.

Thanks all for the hard work though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants