server: in SSE mode, send HTTP headers when slot starts by ngxson · Pull Request #23884 · ggml-org/llama.cpp

ngxson · 2026-05-29T19:32:45Z

Overview

In stream (SSE) mode, send the HTTP headers when the slot starts processing.

This may fix some timeout problems on clients that wait for the response headers while the prompt is still being processed. Ref: earendil-works/pi#5089 (comment)

Test script: https://gist.github.com/ngxson/d46506aab7e4a6b0f0df9f843d721cd6

On master, the headers are sent only after prompt processing has completed (in this case, 0.8s after the request was sent):

[0.800s] STATUS: 200 OK
[0.800s] HEADERS:
[0.800s]   Content-Type: text/event-stream
[0.800s]   Keep-Alive: timeout=5, max=100
[0.800s]   Server: llama.cpp
[0.800s]   Access-Control-Allow-Origin: 
[0.800s]   Transfer-Encoding: chunked
[0.800s] --- streaming chunks ---
[0.800s] CHUNK: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1780083321,"id":"chatcmpl-eeaLzaFFF83iAY9uKZICpcMQsH8N5F3l","model":"unsloth/gemma-4-E4B-it-GGUF:Q4_K_M","system_fingerprint":"b9417-b5f52280f","object":"chat.completion.chunk"}

On this PR, the headers are sent as soon as the slot starts processing the prompt (0.008s after the request was sent):

[0.008s] STATUS: 200 OK
[0.008s] HEADERS:
[0.008s]   Content-Type: text/event-stream
[0.008s]   Keep-Alive: timeout=5, max=100
[0.008s]   Server: llama.cpp
[0.008s]   Access-Control-Allow-Origin: 
[0.008s]   Transfer-Encoding: chunked
[0.008s] --- streaming chunks ---
[0.799s] CHUNK: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1780083540,"id":"chatcmpl-oJ96sIgDZqP3a4YiY0wwx01FvCTJkGy6","model":"unsloth/gemma-4-E4B-it-GGUF:Q4_K_M","system_fingerprint":"b9419-b9d69e18e","object":"chat.completion.chunk"}
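
The measurement above can be reproduced in miniature with a self-contained sketch. This is not the gist's test script and it does not talk to llama.cpp: `EarlyHeaderHandler`, `PROMPT_DELAY_S`, `measure_stream_timing` and `run_demo` are hypothetical names, and the toy server only imitates the PR's behaviour by flushing the headers first and then sleeping to stand in for prompt processing.

```python
import http.client
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

PROMPT_DELAY_S = 0.3  # hypothetical stand-in for prompt processing time


class EarlyHeaderHandler(BaseHTTPRequestHandler):
    """Toy SSE endpoint: sends headers immediately, first chunk after a delay."""

    protocol_version = "HTTP/1.1"

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the request body

        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Transfer-Encoding", "chunked")
        self.send_header("Connection", "close")
        self.end_headers()  # headers are written to the socket here
        self.wfile.flush()

        time.sleep(PROMPT_DELAY_S)  # simulated prompt processing

        data = b'data: {"content":"hi"}\n\n'
        self.wfile.write(b"%x\r\n%s\r\n" % (len(data), data))  # one chunk
        self.wfile.write(b"0\r\n\r\n")  # end of chunked body
        self.close_connection = True

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def measure_stream_timing(host, port, path="/v1/chat/completions"):
    """Return (status, seconds until headers, seconds until first body line, line)."""
    payload = json.dumps({"stream": True, "messages": [{"role": "user", "content": "hi"}]})
    conn = http.client.HTTPConnection(host, port, timeout=10)
    start = time.monotonic()
    conn.request("POST", path, body=payload, headers={"Content-Type": "application/json"})
    resp = conn.getresponse()  # returns once the status line and headers are read
    t_headers = time.monotonic() - start
    first_line = resp.readline()  # blocks until the first body bytes arrive
    t_first_chunk = time.monotonic() - start
    resp.read()  # drain the rest so the server can finish cleanly
    conn.close()
    return resp.status, t_headers, t_first_chunk, first_line


def run_demo():
    """Start the toy server on a free local port and time one request."""
    server = HTTPServer(("127.0.0.1", 0), EarlyHeaderHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return measure_stream_timing("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, t_headers, t_first_chunk, _ = run_demo()
    print(f"[{t_headers:.3f}s] STATUS: {status}")
    print(f"[{t_first_chunk:.3f}s] first chunk")
```

Pointing `measure_stream_timing` at a real server instead of the toy one should show the same split between header time and first-chunk time, assuming the endpoint accepts the request body used here.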

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: no

ngxson · 2026-05-29T19:42:22Z

hmm hold on, one test case failed

ngxson · 2026-05-29T21:59:53Z

@ggml-org/maintainers can I have the 2nd approval please 🙏

* origin/master:
  server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
  ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
  ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
  server-bench : add speed-bench for speculative decoding benchmarking (ggml-org#23869)
  app: add llama update self updater (ggml-org#23865)
  ui: handle audio/vnd.wave as audio WAV file (ggml-org#23754)

* server: in SSE mode, send HTTP headers when slot starts
* ref to pr
* stream should be false by default

Upstream llama.cpp (ggml-org/llama.cpp#23884), pulled in by this bump, now emits an initial "begin" partial whose to_json() returns null. It exists only to signal the HTTP layer to flush 200 status headers before any token is produced.

gRPC has no such concept, and PredictStream had no guard: the null result was fed straight into build_reply_from_json, which threw an uncaught exception. That surfaced as a generic "Unexpected error in RPC handling" and the task was cancelled the instant it launched, breaking the PredictStream e2e spec.

Skip null results in both the first-result handling and the streaming loop, mirroring upstream's own `if (first_result_json == nullptr)` guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
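
The guard described in this commit message can be illustrated with a minimal sketch. It is written in Python for brevity and is not the LocalAI or llama.cpp code: `PartialResult`, `build_reply` and `stream_replies` are hypothetical names, with `None` playing the role of the null JSON returned by the "begin" partial.

```python
from typing import Iterable, Iterator, Optional


class PartialResult:
    """Hypothetical stand-in for a streamed task result."""

    def __init__(self, payload: Optional[dict]):
        self._payload = payload

    def to_json(self) -> Optional[dict]:
        # The "begin" partial carries no data, so it yields None here.
        return self._payload


def build_reply(payload: dict) -> str:
    # Hypothetical reply builder; it would raise if handed None.
    return payload["content"]


def stream_replies(results: Iterable[PartialResult]) -> Iterator[str]:
    """Yield one reply per result, skipping results with no JSON payload."""
    for result in results:
        payload = result.to_json()
        if payload is None:
            continue  # begin-of-stream marker: nothing to forward
        yield build_reply(payload)
```

For example, `list(stream_replies([PartialResult(None), PartialResult({"content": "Hel"}), PartialResult({"content": "lo"})]))` evaluates to `["Hel", "lo"]`: the leading empty partial is dropped and the remaining results pass through in order.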

…1a810c8ae18` (#10093)

* ⬆️ Update ggml-org/llama.cpp

  Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): skip begin-of-stream null partial in PredictStream

  Upstream llama.cpp (ggml-org/llama.cpp#23884), pulled in by this bump, now emits an initial "begin" partial whose to_json() returns null. It exists only to signal the HTTP layer to flush 200 status headers before any token is produced.

  gRPC has no such concept, and PredictStream had no guard: the null result was fed straight into build_reply_from_json, which threw an uncaught exception. That surfaced as a generic "Unexpected error in RPC handling" and the task was cancelled the instant it launched, breaking the PredictStream e2e spec.

  Skip null results in both the first-result handling and the streaming loop, mirroring upstream's own `if (first_result_json == nullptr)` guard.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

…wercase

* upstream/master: (27 commits)
  vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
  ui: fix ETag truncation with MSVC compiler (ggml-org#23917)
  docs : update ZenDNN docs for Q8 support (ggml-org#23791)
  llama: only use one iGPU device by default (ggml-org#23897)
  webui: add custom CSS injection via config (ggml-org#23904)
  Support `-fa auto` in llama-bench (ggml-org#23714)
  opencl: support bf16 by converting to f16 (ggml-org#23839)
  ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (ggml-org#23910)
  TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (ggml-org#23843)
  metal : restore im2col implementation for large kernels (ggml-org#23901)
  test: (test-llama-archs) log the config name first (ggml-org#23885)
  ci : update ios-xcode release job to macos-26 (ggml-org#23906)
  ggml : add some lsx support (ggml-org#23798)
  vulkan: add Flash Attention support for BFloat16 KV cache (ggml-org#23420)
  ci : fix s390x release job (ggml-org#23898)
  ci : clear cache instead of "no timestamp" keys + fix macos (ggml-org#23895)
  llama : do not skip iGPU when only RPC devices are present (ggml-org#23868)
  server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
  ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
  ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
  ...

# Conflicts:
#   gguf-py/gguf/vocab.py
#   src/llama-vocab.cpp

server: in SSE mode, send HTTP headers when slot starts

d734294

github-actions Bot added examples server labels May 29, 2026

ref to pr

b9d69e1

ngxson marked this pull request as ready for review May 29, 2026 19:35

ngxson requested a review from a team as a code owner May 29, 2026 19:35

ngxson mentioned this pull request May 29, 2026

Doesn't seem to respect timeoutMs past a certain value earendil-works/pi#5089

Closed

ServeurpersoCom approved these changes May 29, 2026

stream should be false by default

d1236c8

lhez approved these changes May 29, 2026

ngxson merged commit 0821c5f into master May 29, 2026
27 checks passed

ngxson mentioned this pull request May 29, 2026

Eval bug: stopping wait for next result due to should_stop condition when Prompt Processing is >60s #22997

Closed

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)

b8086f3

* server: in SSE mode, send HTTP headers when slot starts
* ref to pr
* stream should be false by default

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026

server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)

7e9fc2e

* server: in SSE mode, send HTTP headers when slot starts
* ref to pr
* stream should be false by default

ngxson merged 3 commits into master from xsn/stream_send_header_faster

3 participants
