
fix: streaming detokenizer for UTF-8-safe incremental decode#195

Closed
Thump604 wants to merge 1 commit into waybarrios:main from Thump604:fix/streaming-detokenizer-unicode

Conversation

@Thump604
Collaborator

Summary

Replace raw tokenizer.decode([token]) with NaiveStreamingDetokenizer in the BatchedEngine scheduler. The raw decode splits multi-byte codepoints (emoji, CJK characters) into surrogate pairs (\ud83d\udc4b instead of actual emoji) because individual tokens may represent incomplete UTF-8 byte sequences.

NaiveStreamingDetokenizer (from mlx_lm.tokenizer_utils) buffers incomplete byte sequences and only emits valid UTF-8 segments, matching how mlx-lm's own server handles streaming output.
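The buffering principle can be shown with the standard library alone (no mlx_lm needed): Python's incremental UTF-8 decoder holds partial byte sequences until a full codepoint arrives, which is the same behavior the streaming detokenizer relies on. This is an illustrative sketch of the failure mode, not the PR's actual code path:

```python
import codecs

# UTF-8 bytes of a waving-hand emoji, split mid-codepoint to mimic
# tokens that carry incomplete byte sequences.
raw = "👋".encode("utf-8")          # 4 bytes: f0 9f 91 8b
chunks = [raw[:2], raw[2:]]

# Naive per-chunk decode: each partial sequence becomes replacement chars.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# Incremental decode: bytes are buffered until a full codepoint arrives.
dec = codecs.getincrementaldecoder("utf-8")()
incremental = "".join(dec.decode(c) for c in chunks) + dec.decode(b"", final=True)

print(naive)        # contains U+FFFD replacement characters
print(incremental)  # 👋
```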

Changes

  • Import NaiveStreamingDetokenizer from mlx_lm.tokenizer_utils
  • Add _detokenizer_pool dict to Scheduler.__init__() for per-request detokenizers
  • Replace self._decode_tokens([response.token]) with streaming detokenizer in _process_batch_responses()
  • On request finish, call detok.finalize() and use detok.text for full output
  • Cleanup in _do_abort_request(), _cleanup_finished(), and reset()
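The pool-plus-lifecycle pattern above can be sketched roughly as follows. This is not the PR's code: a stdlib incremental UTF-8 decoder stands in for `NaiveStreamingDetokenizer`, requests are keyed by an assumed `request_id`, and the method names (`feed`, `finish`, `abort`) are illustrative rather than the scheduler's actual API:

```python
import codecs


class DetokenizerPool:
    """Per-request incremental decoders, keyed by request id (sketch)."""

    def __init__(self):
        self._pool = {}

    def feed(self, request_id, raw_bytes):
        # Lazily create a decoder on the first token for this request.
        dec = self._pool.setdefault(
            request_id, codecs.getincrementaldecoder("utf-8")()
        )
        # Only complete codepoints are returned; partial bytes stay buffered.
        return dec.decode(raw_bytes)

    def finish(self, request_id):
        # Finalize path: flush trailing bytes and drop the decoder.
        dec = self._pool.pop(request_id, None)
        return dec.decode(b"", final=True) if dec else ""

    def abort(self, request_id):
        # Abort/reset path: discard state so decoders do not leak.
        self._pool.pop(request_id, None)
```

The key design point is that every terminal path (finish, abort, reset) removes the entry, so the pool cannot grow unboundedly across requests.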

Test plan

  • Send a chat message that elicits emoji response with --continuous-batching
  • Verify emoji render correctly (no surrogate pairs like \ud83d\udc4b)
  • Verify CJK characters render correctly
  • Verify normal ASCII text is unaffected
  • Verify aborted requests clean up their detokenizer
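The surrogate-pair check in the plan above can be automated with a small helper; this is a hypothetical test utility, not part of the PR, and `has_surrogates` is an assumed name:

```python
def has_surrogates(text: str) -> bool:
    """True if any character is a lone UTF-16 surrogate (U+D800..U+DFFF),
    i.e. a sign that a multi-byte codepoint was split during decode."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


print(has_surrogates("👋 hello"))      # False: properly decoded emoji
print(has_surrogates("\ud83d\udc4b"))  # True: leaked surrogate pair
```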

Fixes #130

Replace raw tokenizer.decode([token]) with NaiveStreamingDetokenizer
in the BatchedEngine scheduler. The raw decode splits multi-byte
codepoints (emoji, CJK characters) into surrogate pairs because
individual tokens may represent incomplete UTF-8 byte sequences.

NaiveStreamingDetokenizer buffers incomplete sequences and only emits
valid UTF-8 segments, matching how mlx-lm's own server handles
streaming output.

Cleanup in abort, finish, and reset paths prevents detokenizer leaks.

Fixes waybarrios#130
@Thump604
Collaborator Author

CI green. Fixes #130 — streaming detokenizer now handles multi-byte UTF-8 correctly by switching to incremental decode with a byte buffer.

@Thump604
Collaborator Author

Evidence from M2 Ultra 128GB, Qwen3.5-122B-A10B, BatchedEngine streaming:

| Test | Result |
| --- | --- |
| Emoji streaming | PASS -- emoji characters stream correctly without garbling |
| Mixed multi-byte (CJK + emoji + Latin) | PASS -- all character types preserved |
| CJK-only | FAIL (model compliance) -- model returned an empty response to the constrained prompt; not a streaming bug |

The core fix (NaiveStreamingDetokenizer for UTF-8 safe incremental decode) works. The CJK failure is prompt compliance on the 122B, not detokenizer behavior -- the emoji test proves multi-byte streaming is correct.

This is the only PR addressing Unicode streaming in BatchedEngine. Without it, multi-byte characters can be split across SSE chunks, producing replacement characters on the client side.

@waybarrios
Owner

Closing in favor of #109 which covers both scheduler.py and mllm_scheduler.py. The cleanup patterns from this PR (abort, reset, cleanup_finished) have been added to #109 as well.



Development

Successfully merging this pull request may close these issues.

Emoji encoded as surrogate pairs in BatchedEngine (--continuous-batching)

2 participants