server: in SSE mode, send HTTP headers when slot starts #23884

Merged: ngxson merged 3 commits into master from xsn/stream_send_header_faster on May 29, 2026

ngxson (Contributor) commented May 29, 2026

Overview

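In stream mode, send the HTTP headers as soon as the slot starts processing the request.

This may fix some client-side timeouts, since clients no longer have to wait for prompt processing to finish before they receive the response headers. Ref: earendil-works/pi#5089 (comment)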

Test script: https://gist.github.com/ngxson/d46506aab7e4a6b0f0df9f843d721cd6

On master, the headers are sent only after prompt processing has completed (in this case, 0.8 s after the request was sent):

[0.800s] STATUS: 200 OK
[0.800s] HEADERS:
[0.800s]   Content-Type: text/event-stream
[0.800s]   Keep-Alive: timeout=5, max=100
[0.800s]   Server: llama.cpp
[0.800s]   Access-Control-Allow-Origin: 
[0.800s]   Transfer-Encoding: chunked
[0.800s] --- streaming chunks ---
[0.800s] CHUNK: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1780083321,"id":"chatcmpl-eeaLzaFFF83iAY9uKZICpcMQsH8N5F3l","model":"unsloth/gemma-4-E4B-it-GGUF:Q4_K_M","system_fingerprint":"b9417-b5f52280f","object":"chat.completion.chunk"}

With this PR, the headers are sent as soon as the slot starts processing the prompt (0.008 s after the request was sent):

[0.008s] STATUS: 200 OK
[0.008s] HEADERS:
[0.008s]   Content-Type: text/event-stream
[0.008s]   Keep-Alive: timeout=5, max=100
[0.008s]   Server: llama.cpp
[0.008s]   Access-Control-Allow-Origin: 
[0.008s]   Transfer-Encoding: chunked
[0.008s] --- streaming chunks ---
[0.799s] CHUNK: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1780083540,"id":"chatcmpl-oJ96sIgDZqP3a4YiY0wwx01FvCTJkGy6","model":"unsloth/gemma-4-E4B-it-GGUF:Q4_K_M","system_fingerprint":"b9419-b9d69e18e","object":"chat.completion.chunk"}
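
To make the comparison concrete, the two traces can be reduced to a pair of numbers each: time to headers and time to first chunk. The sketch below is not the linked test script; it is a small, hypothetical parser (the names `parse_trace` and `TraceTimes` are invented here) that assumes the `[<seconds>s] STATUS:` / `[<seconds>s] CHUNK:` line format shown above and uses abbreviated copies of the traces.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as "[0.800s] STATUS: 200 OK" (format assumed from the traces above).
LINE_RE = re.compile(r"^\[(\d+\.\d+)s\]\s+(STATUS|CHUNK):")


class TraceTimes(NamedTuple):
    headers: Optional[float]      # seconds until the status line arrived
    first_chunk: Optional[float]  # seconds until the first SSE chunk arrived


def parse_trace(trace: str) -> TraceTimes:
    """Return the first STATUS and first CHUNK timestamps found in a trace."""
    headers = first_chunk = None
    for line in trace.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # header lines, separators, or anything without a known tag
        t, kind = float(m.group(1)), m.group(2)
        if kind == "STATUS" and headers is None:
            headers = t
        elif kind == "CHUNK" and first_chunk is None:
            first_chunk = t
    return TraceTimes(headers, first_chunk)


# Abbreviated copies of the two traces (header lines dropped, chunk payloads shortened).
MASTER_TRACE = """\
[0.800s] STATUS: 200 OK
[0.800s] HEADERS:
[0.800s] --- streaming chunks ---
[0.800s] CHUNK: {...}
"""
PR_TRACE = """\
[0.008s] STATUS: 200 OK
[0.008s] HEADERS:
[0.008s] --- streaming chunks ---
[0.799s] CHUNK: {...}
"""

if __name__ == "__main__":
    for name, trace in (("master", MASTER_TRACE), ("PR", PR_TRACE)):
        times = parse_trace(trace)
        print(f"{name}: headers at {times.headers:.3f}s, "
              f"first chunk at {times.first_chunk:.3f}s")
```

Read this way, the traces show that only the time to headers changes (0.800 s down to 0.008 s); the first chunk still arrives at roughly 0.8 s in both runs, so the PR affects when the client sees a response, not when the first token is produced.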

ngxson marked this pull request as ready for review on May 29, 2026 19:35
ngxson requested a review from a team as a code owner on May 29, 2026 19:35
ngxson (Contributor, Author) commented May 29, 2026

hmm hold on, one test case failed

ngxson (Contributor, Author) commented May 29, 2026

@ggml-org/maintainers can I have the 2nd approval please 🙏

ngxson merged commit 0821c5f into master on May 29, 2026
27 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 29, 2026
* origin/master:
server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
server-bench : add speed-bench for speculative decoding benchmarking (ggml-org#23869)
app: add llama update self updater (ggml-org#23865)
ui: handle audio/vnd.wave as audio WAV file (ggml-org#23754)
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
* server: in SSE mode, send HTTP headers when slot starts

* ref to pr

* stream should be false by default
localai-bot pushed a commit to ci-forks/LocalAI that referenced this pull request May 31, 2026
Upstream llama.cpp (ggml-org/llama.cpp#23884), pulled in by this bump,
now emits an initial "begin" partial whose to_json() returns null. It
exists only to signal the HTTP layer to flush 200 status headers before
any token is produced.

gRPC has no such concept, and PredictStream had no guard: the null result
was fed straight into build_reply_from_json, which threw an uncaught
exception. That surfaced as a generic "Unexpected error in RPC handling"
and the task was cancelled the instant it launched, breaking the
PredictStream e2e spec.

Skip null results in both the first-result handling and the streaming
loop, mirroring upstream's own `if (first_result_json == nullptr)` guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
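
The guard described in this commit message can be illustrated with a minimal sketch. This is not LocalAI's or llama.cpp's code: the real fix is in C++ and checks for `nullptr`, while the version below models the same idea in Python with `None`. The names `build_reply` and `stream_replies` are hypothetical stand-ins, and the `"content"` field is an assumption made only for the example.

```python
from typing import Iterable, Iterator, Optional


def build_reply(result_json: dict) -> str:
    """Hypothetical reply builder that, like the one described above,
    fails when handed a null result."""
    return result_json["content"]  # raises TypeError if result_json is None


def stream_replies(results: Iterable[Optional[dict]]) -> Iterator[str]:
    """Forward streamed results, skipping null "begin" partials."""
    for result_json in results:
        if result_json is None:
            # The "begin" partial only tells an HTTP layer to flush headers.
            # A transport without that concept has nothing to forward.
            continue
        yield build_reply(result_json)


if __name__ == "__main__":
    # The first element models the initial "begin" partial whose JSON is null.
    stream = [None, {"content": "Hel"}, {"content": "lo"}]
    print(list(stream_replies(stream)))
```

Applying the check inside the loop, rather than only to the first element, matches the commit's note that null results are skipped in both the first-result handling and the streaming loop.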
mudler added a commit to mudler/LocalAI that referenced this pull request May 31, 2026
…1a810c8ae18` (#10093)

* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): skip begin-of-stream null partial in PredictStream

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
o7si added a commit to o7si/llama.cpp that referenced this pull request May 31, 2026
…wercase

* upstream/master: (27 commits)
  vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
  ui: fix ETag truncation with MSVC compiler (ggml-org#23917)
  docs : update ZenDNN docs for Q8 support (ggml-org#23791)
  llama: only use one iGPU device by default (ggml-org#23897)
  webui: add custom CSS injection via config (ggml-org#23904)
  Support `-fa auto` in llama-bench (ggml-org#23714)
  opencl: support bf16 by converting to f16 (ggml-org#23839)
  ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (ggml-org#23910)
  TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (ggml-org#23843)
  metal : restore im2col implementation for large kernels (ggml-org#23901)
  test: (test-llama-archs) log the config name first (ggml-org#23885)
  ci : update ios-xcode release job to macos-26 (ggml-org#23906)
  ggml : add some lsx support (ggml-org#23798)
  vulkan: add Flash Attention support for BFloat16 KV cache (ggml-org#23420)
  ci : fix s390x release job (ggml-org#23898)
  ci : clear cache instead of "no timestamp" keys + fix macos (ggml-org#23895)
  llama : do not skip iGPU when only RPC devices are present (ggml-org#23868)
  server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
  ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
  ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
  ...

# Conflicts:
#	gguf-py/gguf/vocab.py
#	src/llama-vocab.cpp
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
* server: in SSE mode, send HTTP headers when slot starts

* ref to pr

* stream should be false by default