fix: streaming detokenizer for UTF-8-safe incremental decode#195
Closed
Thump604 wants to merge 1 commit into waybarrios:main from
Conversation
Replace raw tokenizer.decode([token]) with NaiveStreamingDetokenizer in the BatchedEngine scheduler. The raw decode splits multi-byte codepoints (emoji, CJK characters) into surrogate pairs because individual tokens may represent incomplete UTF-8 byte sequences. NaiveStreamingDetokenizer buffers incomplete sequences and only emits valid UTF-8 segments, matching how mlx-lm's own server handles streaming output. Cleanup in abort, finish, and reset paths prevents detokenizer leaks. Fixes waybarrios#130
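The failure mode described here can be reproduced in pure Python without mlx-lm, using the standard library's incremental UTF-8 decoder as a stand-in for the byte buffering that `NaiveStreamingDetokenizer` performs (a minimal sketch of the concept, not the PR's actual code):

```python
import codecs

# "👋" is a single codepoint encoded as four UTF-8 bytes.
wave = "👋".encode("utf-8")  # b'\xf0\x9f\x91\x8b'

# Naive per-chunk decode: each partial byte sequence is invalid on its
# own, so every chunk becomes a U+FFFD replacement character.
naive = "".join(bytes([b]).decode("utf-8", errors="replace") for b in wave)

# Incremental decode: partial sequences are buffered until complete,
# so only valid UTF-8 segments are ever emitted.
dec = codecs.getincrementaldecoder("utf-8")()
segments = [dec.decode(bytes([b])) for b in wave]
incremental = "".join(segments) + dec.decode(b"", final=True)

print(naive)        # '\ufffd\ufffd\ufffd\ufffd'
print(incremental)  # '👋'
print(segments)     # ['', '', '', '👋'] (emitted only when complete)
```

The same principle applies per token instead of per byte: a token boundary can fall inside a multi-byte codepoint, so decoding tokens in isolation produces replacement characters or surrogate escapes on the client.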
Thump604 (Collaborator, Author):

CI green. Fixes #130 — streaming detokenizer now handles multi-byte UTF-8 correctly by switching to incremental decode with a byte buffer.
Thump604 (Collaborator, Author):

Evidence from M2 Ultra 128GB, Qwen3.5-122B-A10B, BatchedEngine streaming:

The core fix (NaiveStreamingDetokenizer for UTF-8-safe incremental decode) works. The CJK failure is prompt compliance on the 122B, not detokenizer behavior; the emoji test proves multi-byte streaming is correct. This is the only PR addressing Unicode streaming in BatchedEngine. Without it, multi-byte characters can be split across SSE chunks, producing replacement characters on the client side.
Summary
Replace raw `tokenizer.decode([token])` with `NaiveStreamingDetokenizer` in the BatchedEngine scheduler. The raw decode splits multi-byte codepoints (emoji, CJK characters) into surrogate pairs (`\ud83d\udc4b` instead of the actual emoji) because individual tokens may represent incomplete UTF-8 byte sequences. `NaiveStreamingDetokenizer` (from `mlx_lm.tokenizer_utils`) buffers incomplete byte sequences and only emits valid UTF-8 segments, matching how mlx-lm's own server handles streaming output.

Changes

- Import `NaiveStreamingDetokenizer` from `mlx_lm.tokenizer_utils`
- Add a `_detokenizer_pool` dict to `Scheduler.__init__()` for per-request detokenizers
- Replace `self._decode_tokens([response.token])` with the streaming detokenizer in `_process_batch_responses()`
- Call `detok.finalize()` and use `detok.text` for the full output
- Clean up detokenizers in `_do_abort_request()`, `_cleanup_finished()`, and `reset()`

Test plan

- Streamed generation with `--continuous-batching` and verified emoji arrive as actual characters rather than surrogate-pair escapes (`\ud83d\udc4b`)

Fixes #130
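The per-request pool and cleanup paths described above can be sketched in pure Python. This stand-in uses the standard library's incremental UTF-8 decoder in place of mlx-lm's `NaiveStreamingDetokenizer`, and the class and method names here (`SchedulerSketch`, `process_token_bytes`, `finish`) are illustrative, not the PR's actual code:

```python
import codecs

class SchedulerSketch:
    """Illustrative sketch: one streaming decoder per in-flight request."""

    def __init__(self):
        # Mirrors the PR's per-request _detokenizer_pool dict.
        self._detokenizer_pool = {}

    def _get_detok(self, request_id):
        # Lazily create a decoder for each request.
        if request_id not in self._detokenizer_pool:
            self._detokenizer_pool[request_id] = (
                codecs.getincrementaldecoder("utf-8")()
            )
        return self._detokenizer_pool[request_id]

    def process_token_bytes(self, request_id, token_bytes):
        # Emit only the valid UTF-8 segment completed by these bytes;
        # incomplete sequences stay buffered inside the decoder.
        return self._get_detok(request_id).decode(token_bytes)

    def finish(self, request_id):
        # Flush any buffered bytes and drop the decoder from the pool,
        # mirroring the cleanup in abort/finish/reset paths that
        # prevents detokenizer leaks.
        detok = self._detokenizer_pool.pop(request_id, None)
        return detok.decode(b"", final=True) if detok else ""

# Usage: stream an emoji one byte at a time; output stays valid UTF-8.
sched = SchedulerSketch()
out = "".join(
    sched.process_token_bytes("r1", bytes([b])) for b in "👋".encode("utf-8")
)
out += sched.finish("r1")
print(out)  # 👋
```

The key design point is that the pool entry is removed in every termination path; otherwise long-running servers would accumulate one decoder per completed request.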