
Fix broken streaming response with --incremental-streaming-output #22549

Merged
Kangyan-Zhou merged 1 commit into main from fix/incremental-streaming-delta on Apr 12, 2026

Conversation

@kpham-sgl
Collaborator

Motivation

Fixes #22510

--incremental-streaming-output + stream=True produces garbled text for all models (not just Gemma 4).

Root cause: #21037 (c37200f) changed tokenizer_manager.py to emit incremental deltas (state.text[last_text_offset:]) instead of cumulative text (state.text) when incremental_streaming_output is enabled. serving_chat.py was not updated and still slices the already-incremental text by the accumulated buffer length, producing garbage fragments.
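The mismatch is easy to reproduce in isolation. The sketch below is a simplified model of the pre-fix code path (function name and shapes are hypothetical, not the actual sglang source): slicing by the previously seen buffer is correct for cumulative text, but mangles text that is already a delta.

```python
def buggy_delta_stream(chunks):
    """Simplified model of the pre-fix serving_chat logic: it assumed each
    incoming text was cumulative and sliced off the previously seen buffer."""
    stream_buffer = ""
    out = []
    for text in chunks:  # with --incremental-streaming-output, text is ALREADY a delta
        out.append(text[len(stream_buffer):])  # wrong slice for delta input
        stream_buffer = text
    return out

# Incremental deltas as emitted by the tokenizer manager after #21037:
deltas = ["I", " am", " Q", "wen", ","]
print(buggy_delta_stream(deltas))  # ['I', 'am', '', 'n', ''] -> "Iamn"
print("".join(deltas))             # "I am Qwen," (what should have been streamed)
```

Slicing each already-incremental chunk by the length of the previous one is exactly what produces truncated fragments like those in the repro below.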

Repro on any model:

# Server
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-0.5B-Instruct --incremental-streaming-output --port 51000

# Client
curl -s http://127.0.0.1:51000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello, who are you?"}],"stream":true}'

# Before fix: chunks = ['Hello', 'e', 'age', 's', 'g'] -> "Helloeagesg"
# After fix:  chunks = ['I', ' am', ' Q', 'wen', ',', ' a', ' large', ...] -> coherent text

Modifications

  • serving_chat.py: When incremental_streaming_output is enabled, use content["text"] directly as the delta instead of slicing by accumulated buffer length.
  • test_serving_chat.py: Add regression test test_incremental_streaming_output_delta.
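A minimal sketch of the described change (names and shapes are approximate, not the verbatim serving_chat.py code): branch on the server flag and pass the delta through untouched in incremental mode.

```python
def compute_delta(content_text, stream_buffer, incremental_streaming_output):
    """Return (delta_to_emit, new_stream_buffer). Approximate sketch of the
    fixed logic; not the actual sglang implementation."""
    if incremental_streaming_output:
        # The tokenizer manager already sent only the new text: use it directly.
        return content_text, stream_buffer + content_text
    # Legacy cumulative mode: recover the delta by slicing off what was sent.
    return content_text[len(stream_buffer):], content_text

# Cumulative mode: the slice works as before.
print(compute_delta("Hello world", "Hello", False))  # (' world', 'Hello world')
# Incremental mode: the text is already the delta.
print(compute_delta(" world", "Hello", True))        # (' world', 'Hello world')
```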

Checklist

When incremental_streaming_output is enabled, content["text"] from the
tokenizer manager is already the incremental delta (new text only), not
the full accumulated text. The delta computation in _generate_chat_stream
incorrectly sliced into this delta using the accumulated buffer length,
producing garbled fragments like "rge", "age", "eable" instead of
"large", "language", "trainable".

Fixes #22510

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kpham-sgl kpham-sgl requested a review from ispobock as a code owner April 10, 2026 21:58
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@kpham-sgl
Collaborator Author

/tag-run-ci-label

@kpham-sgl
Collaborator Author

/rerun-test test/registered/openai_server/basic/test_serving_chat.py

@github-actions
Contributor

1-gpu-5090 (1 test): View workflow run

cd test/ && python3 registered/openai_server/basic/test_serving_chat.py

@kpham-sgl kpham-sgl requested a review from Kangyan-Zhou April 10, 2026 22:05
@kpham-sgl
Collaborator Author

/rerun-failed-ci

@kpham-sgl kpham-sgl added the ready-to-merge label Apr 11, 2026
@Kangyan-Zhou Kangyan-Zhou merged commit 1f8df97 into main Apr 12, 2026
246 of 304 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the fix/incremental-streaming-delta branch April 12, 2026 22:06

Labels

ready-to-merge (The PR is ready to merge after the CI is green.)
run-ci

Linked issue: [Bug] Streaming response returns broken in Gemma