server : add SSE keepalive #23994
Introduce a `--sse-keepalive-interval` flag (default: disabled, 0 = disabled) that emits an SSE comment line (`: keepalive`) during silent periods of a streaming response. This prevents network infrastructure with idle-connection timeouts (proxies, load balancers, NAT, Node.js fetch at 300 s) from killing the connection during long prefill or slow generation.
- add `server_task_result_keepalive` result type
- add `--sse-keepalive-interval` / `LLAMA_ARG_SSE_KEEPALIVE_INTERVAL`
- add `keepalive_interval_seconds` to `server_response_reader::next` and disable the keepalive for all non-streaming responses
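The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the C++ code from this PR: `sse_stream` and `run_demo` are hypothetical names, the result queue stands in for the server's response reader, and only the parameter name `keepalive_interval_seconds` and the `: keepalive` comment are taken from the description.

```python
import queue
import threading
import time

KEEPALIVE = ": keepalive\n\n"  # SSE comment line; compliant clients ignore it


def sse_stream(results, keepalive_interval_seconds):
    """Yield SSE frames from `results`, filling silent periods with comments.

    `results` is a queue.Queue of str chunks; None marks the end of the stream.
    An interval of 0 disables keepalives (the read blocks until the next chunk).
    """
    timeout = keepalive_interval_seconds if keepalive_interval_seconds > 0 else None
    while True:
        try:
            chunk = results.get(timeout=timeout)
        except queue.Empty:
            # Nothing arrived within the interval: keep the connection busy.
            yield KEEPALIVE
            continue
        if chunk is None:
            return
        yield f"data: {chunk}\n\n"


def run_demo(delay_seconds, keepalive_interval_seconds):
    """Simulate a slow prefill: the first token only arrives after `delay_seconds`."""
    results = queue.Queue()

    def producer():
        time.sleep(delay_seconds)
        results.put("tok")
        results.put(None)

    worker = threading.Thread(target=producer)
    worker.start()
    frames = list(sse_stream(results, keepalive_interval_seconds))
    worker.join()
    return frames
```

With a 0.3 s delay and a 0.05 s interval, `run_demo(0.3, 0.05)` returns one or more `: keepalive` frames followed by the single `data: tok` frame. With the interval set to 0, only the data frame is produced.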
|
Test tool result (tool is on github gist) |
|
Results of the ci tests: Full log ci-tests.md
Results of the server tests: Full log server-tests.md |
|
as explained for the n-th time: this is NOT necessary. none of the inference runtimes (vllm/ollama/etc) support it |
|
if you have problems with timeout, THAT IS UP TO YOUR CLIENT |
|
I just want to point out that pi is blaming llama.cpp and llama.cpp is blaming pi. I have a workaround and am not taking sides, but it's worth saying that at least one of y'all is incorrect. |
See ngxson's response from your own issue in Pi's repo, earendil-works/pi#5089 (comment).
Note: tool call argument streaming was broken for Qwen models with their edit tool, but that has since been resolved in #23173. |
|
@nonbasketless no, I'm blaming the fact that no one has dug deep enough to show where the problem comes from. people have been blaming each other and pushing vibe fixes and slop code to us, but no one can take a few minutes to ask AI to dig into the root cause. well, at least I did, and that took me 20 minutes: earendil-works/pi#5089 (comment) |
Sorry, I guess I should have been more specific. For qwen code, the analysis has been done in the linked issue by the qwen team. I just summed it up as "Node.js fetch at 300 s". Bun does not have the problem.
But still, I think the addition is worth the effort. It is spec-compliant, I have tested it (see the test logs), it should work even with the first real open-source release of openai-node, it helps solve issues for end users, and it also helps in other situations with less reliable networking.
Regarding AI slop: I can understand that the new technology is also a burden for maintainers facing unreviewed code drops. However, I do not think that characterization applies here. Because I was unfamiliar with this part of the codebase, I first explored an alternative implementation by making the change in |
sorry in advance, I might be harsh here, but I won't call it an "analysis". I never take an AI's output for granted without hard evidence. both nodejs and bun are open-source, so I don't get why people can't simply link to the exact line in the source code that handles it. and if you look into earendil-works/pi#5089 (comment), I don't just give the exact line of code where it's handled, I also investigated multiple alternative methods. that's the part many people missed before deciding what to fix |
|
Thank you! I have looked into your change: it keeps the change more local and does not rely on an artificial struct to have a return value. I like that. Btw, I really am grateful for the time you invest in llama.cpp. I especially want to highlight the router mode and the efficiency improvements for the qwen models. Cheers! |
I did! Bless ngxson's soul for being a detective. I just wanted to say something because this issue was starting to feel stale and I think it's in pi and llama.cpp's best interest to fix it. I write backend C++ and don't know jack about node/http (I know I sound stupid), but imagine an average user firing these tools up and after a few mins it times out. They shouldn't have to dig and tweak something like that. Thanks all for the hard work though. |
Additional information
The SSE comment line is specified in section 9.2.6 of the Server-Sent Events specification. The TypeScript API of openai also complies (see the openai-node library).
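To illustrate why a compliant client is unaffected by the keepalive, here is a minimal Python sketch of the relevant parsing rules: lines starting with a colon are ignored, and a blank line dispatches an event only if some data was buffered. `parse_sse` is a hypothetical helper written for this example, not part of openai-node or llama.cpp, and it deliberately handles only the `data` field.

```python
import re


def parse_sse(text):
    """Return the data payloads of the events in an SSE stream.

    Simplified sketch: only `data:` fields are handled; `event`, `id`, `retry`
    and colon-less field names are skipped.
    """
    events = []
    data_lines = []
    # The event stream format accepts CRLF, LF or CR as line terminators.
    for line in re.split(r"\r\n|\n|\r", text):
        if line == "":
            # Blank line: dispatch the event, but only if data was buffered.
            if data_lines:
                events.append("\n".join(data_lines))
                data_lines = []
        elif line.startswith(":"):
            # Comment line (e.g. ": keepalive"): ignored by the parser.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is stripped.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)
    return events
```

For example, `parse_sse(": keepalive\n\ndata: hello\n\n")` yields only the `hello` event. The comment keeps bytes flowing on the connection but never surfaces to the application.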
Tested with:
Requirements
YES: