server : add SSE keepalive by joleuger · Pull Request #23994 · ggml-org/llama.cpp

joleuger · 2026-06-01T20:13:10Z

Overview

Introduce a --sse-keepalive-interval flag (default: disabled, 0 = disabled) that emits an SSE comment line (: keepalive) during silent periods of a streaming response. This prevents network infrastructure with idle-connection timeouts (proxies, load balancers, NAT, Node.js fetch at 300 s) from killing the connection during long prefill or slow generation.

add server_task_result_keepalive result type
add --sse-keepalive-interval / LLAMA_ARG_SSE_KEEPALIVE_INTERVAL
add keepalive_interval_seconds to server_response_reader::next and disable the keepalive for all non-streaming responses

Additional information

The SSE comment line is specified in section 9.2.6 of the Server Sent Events specification. The typescript API of openai also complies (see openai-node library).

Tested with:

a small test tool using the openai library. The stream via chat.completions.create({ stream: true }). keepalive comments are silently discarded, response text is clean.
A qwen code session with qwen-coder-next 80B that survived for longer than 600 seconds.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:
YES:
- Used deepwiki to analyze the code.
- Used Anthropic Sonnet 4.5 to generate a small test tool using the openai library to check if the "keepalive" bleeds in.

Introduce a `--sse-keepalive-interval` flag (default: disabled, 0 = disabled) that emits an SSE comment line (`: keepalive`) during silent periods of a streaming response. This prevents network infrastructure with idle-connection timeouts (proxies, load balancers, NAT, Node.js fetch at 300 s) from killing the connection during long prefill or slow generation. - add `server_task_result_keepalive` result type - add `--sse-keepalive-interval` / `LLAMA_ARG_SSE_KEEPALIVE_INTERVAL` - add `keepalive_interval_seconds` to server_response_reader::next and disable the keepalive for all non-streaming responses

joleuger · 2026-06-01T20:19:21Z

Test tool result (tool is on github gist)

BASE_URL=http://192.168.51.234:15231/v1 bun index.ts --words 40000 --model qwencoder-deep
  padding  : 40,000 words (~52,000 tokens estimated)
SSE comment compatibility test
──────────────────────────────────────────────────
  base URL : http://192.168.51.234:15231/v1
  model    : qwencoder-deep
  messages : 2

I received this message.

──────────────────────────────────────────────────
Result: ✓ stream completed
  elapsed              : 323.1s
  SDK chunks           : 7
  tokens received      : 5
  keepalive comments   : 0
  other comments       : 0
  ~ no keepalive comments seen (direct connection, or interval not reached)

joleuger · 2026-06-01T21:31:55Z

Results of the ci tests:
100% tests passed, 0 tests failed out of 45

Full log ci-tests.md

Results of the server tests:
========== 282 passed, 3 skipped, 199 deselected in 455.53s (0:07:35) ==========

Full log server-tests.md

ngxson · 2026-06-01T21:51:06Z

as explained for n-th times: this is NOT necessary. none of the inference runtimes (vllm/ollama/etc) support it

ngxson · 2026-06-01T21:52:49Z

if you have problems with timeout, THAT IS UP TO YOUR CLIENT

nonbasketless · 2026-06-02T04:31:51Z

I just want to point out that pi is blaming llama.cpp and llama.cpp is blaming pi. I have a workaround and am not taking sides, but it's worth saying that at least one of ya'll is incorrect.

aldehir · 2026-06-02T05:23:50Z

I just want to point out that pi is blaming llama.cpp and llama.cpp is blaming pi. I have a workaround and am not taking sides, but it's worth saying that at least one of ya'll is incorrect.

See ngxson's response from your own issue in Pi's repo,

so my conclusion is that SSE keep alive / heartbeat is still needed in this situation, but I will push a proper implementation tomorrow

earendil-works/pi#5089 (comment)

Note, tool call argument streaming was broken for Qwen models with their edit tool but has since been resolved in #23173.

ngxson · 2026-06-02T07:49:32Z

@nonbasketless no, I'm blame the fact that no one has dig deep enough to show where the problem comes from.

people have been blaming --timeout option in llama.cpp / httplib for not being handled properly while it works fine, not until I pointed out the truth that their client code have problems and many people confirmed that it's true: #22997

people push vibe fix, slop code to us but no one can take some minutes to ask AI to dig into the root cause. well at least I did, that took me 20 minutes, earendil-works/pi#5089 (comment)

joleuger · 2026-06-02T08:19:47Z

@nonbasketless no, I'm blame the fact that no one has dig deep enough to show where the problem comes from.

people have been blaming --timeout option in llama.cpp / httplib for not being handled properly while it works fine, not until I pointed out the truth that their client code have problems and many people confirmed that it's true: #22997

people push vibe fix, slop code to us but no one can take some minutes to ask AI to dig into the root cause. well at least I did, that took me 20 minutes, earendil-works/pi#5089 (comment)

Sorry, I guess I should have been more specific. For qwen code the analysis has been done in the linked issue by the qwen team. I just summed it up to "Node.js fetch at 300 s" Bun does not have the problem. But still, I think the addition is worth the effort. It is spec-compliant, I have tested it (see protocols), it should work even with the first real open source openai release of openai-node, it helps solving issues for end users, and it also works in other situations with a less reliable networking.

Regarding AI slop. I can understand that the new technology is also a burden for maintainers facing unreviewed code drops. However, I do not think that characterization applies here.

Because I was unfamiliar with this part of the codebase, I first explored an alternative implementation by making the change in server_routes::create_response(bool bypass_sleep, int keepalive_interval) (see version 1). I finally came up with the solution here, because I think it is less invasive with only 40 lines added and 4 lines removed. To ensure that it does not have a negative impact on existing users, I have executed all relevant test suites. From your perspective as a maintainer, what additional analysis, testing, or documentation would have made this contribution more useful or easier to review besides more context?

ngxson · 2026-06-02T08:43:16Z

what additional analysis, testing, or documentation would have made this contribution more useful or easier to review besides more context?

sorry in advance I might be hard here, but I won't call it an "analysis". I never take an AI's output for granted without hard evidences.

both nodejs and bun are open-source, I don't get why people can't simply link to the exact line in the source code that handle bodyTimeout and throw the error?

and if you look into earendil-works/pi#5089 (comment) , I don't just get the exact line of code where it's handled, I also investigated multiple alternative methods. that's the part many people missed before deciding what to fix

joleuger · 2026-06-02T12:52:12Z

Thank you! I have looked into your change and it keeps the change more local and does not rely on an artifical struct to have a return value. I like that.

Btw. I really am grateful for the time you invest into llama.cpp. I especially want to highlight the router-mode and the efficiency improvements for the qwen models. Cheers!

nonbasketless · 2026-06-02T16:16:12Z

I just want to point out that pi is blaming llama.cpp and llama.cpp is blaming pi. I have a workaround and am not taking sides, but it's worth saying that at least one of ya'll is incorrect.

See ngxson's response from your own issue in Pi's repo,

I did! Bless ngxson's soul for being a detective. I just wanted to say something because this issue was starting to feel stale and I think it's in pi and llama.cpp's best interest to fix it. I write backend C++ and don't know jack about node/http (I know I sound stupid), but imagine an average user firing these tools up and after a few mins it times out. They shouldn't have to dig and tweak something like that.

Thanks all for the hard work though.

github-actions Bot added examples server labels Jun 1, 2026

joleuger marked this pull request as ready for review June 1, 2026 21:32

joleuger requested review from a team as code owners June 1, 2026 21:32

This was referenced Jun 1, 2026

Doesn't seem to respect timeoutMs past a certain value earendil-works/pi#5089

Closed

API Error: terminated (cause: Body Timeout Error) QwenLM/qwen-code#4604

Closed

ngxson closed this Jun 1, 2026

ngxson mentioned this pull request Jun 2, 2026

server: add SSE ping interval #24013

Merged

This was referenced Jun 2, 2026

Add keepalive messages to HTTP streaming connections #23355

Closed

feat: Add optional keepalive heartbeats for streaming endpoints ollama/ollama#14420

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server : add SSE keepalive#23994

server : add SSE keepalive#23994
joleuger wants to merge 1 commit into
ggml-org:masterfrom
joleuger:master

joleuger commented Jun 1, 2026

Uh oh!

joleuger commented Jun 1, 2026

Uh oh!

joleuger commented Jun 1, 2026

Uh oh!

ngxson commented Jun 1, 2026 •

edited

Loading

Uh oh!

ngxson commented Jun 1, 2026

Uh oh!

nonbasketless commented Jun 2, 2026

Uh oh!

aldehir commented Jun 2, 2026 •

edited

Loading

Uh oh!

ngxson commented Jun 2, 2026 •

edited

Loading

Uh oh!

joleuger commented Jun 2, 2026

Uh oh!

ngxson commented Jun 2, 2026 •

edited

Loading

Uh oh!

joleuger commented Jun 2, 2026 •

edited

Loading

Uh oh!

nonbasketless commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

joleuger commented Jun 1, 2026

Overview

Additional information

Requirements

Uh oh!

joleuger commented Jun 1, 2026

Uh oh!

joleuger commented Jun 1, 2026

Uh oh!

ngxson commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented Jun 1, 2026

Uh oh!

nonbasketless commented Jun 2, 2026

Uh oh!

aldehir commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joleuger commented Jun 2, 2026

Uh oh!

ngxson commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joleuger commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nonbasketless commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ngxson commented Jun 1, 2026 •

edited

Loading

aldehir commented Jun 2, 2026 •

edited

Loading

ngxson commented Jun 2, 2026 •

edited

Loading

ngxson commented Jun 2, 2026 •

edited

Loading

joleuger commented Jun 2, 2026 •

edited

Loading