fix: Use streaming detokenizer for UTF-8-safe incremental decode #109

Merged
waybarrios merged 3 commits into waybarrios:main from janhilgard:fix/streaming-utf8-detokenizer
Mar 31, 2026

Conversation

@janhilgard
Collaborator

Summary

  • Replace per-token tokenizer.decode([token]) with NaiveStreamingDetokenizer (or BPEStreamingDetokenizer when available) for UTF-8-safe incremental decoding during SSE streaming
  • Fix corrupted multi-byte characters (e.g. Czech ď → ��, emoji → ���) in streaming responses
  • Both LLM scheduler and MLLM scheduler are fixed

Problem

When a tokenizer produces byte-level tokens, a multi-byte UTF-8 character (like ď = 0xC4 0x8F) is split across two tokens. Decoding each byte-token individually via tokenizer.decode([single_token]) yields the Unicode replacement character U+FFFD instead of valid text, which then gets sent to clients in SSE chunks:

chunk 1: {"content":" \ufffd"}     ← 0xC4 decoded alone → invalid
chunk 2: {"content":"\ufffdáb"}    ← 0x8F decoded alone → invalid

The word "ďábelské" appears as "��ábelské" in the client.
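
The corruption is easy to reproduce in plain Python (a standalone illustration, not code from this repo): decoding each byte of a multi-byte character on its own yields U+FFFD, while decoding the complete sequence does not.

raw = "ď".encode("utf-8")                         # b'\xc4\x8f': two bytes, one character
print(raw[:1].decode("utf-8", errors="replace"))  # '�' (0xC4 alone is an incomplete sequence)
print(raw[1:].decode("utf-8", errors="replace"))  # '�' (0x8F alone is an invalid start byte)
print(raw.decode("utf-8"))                        # 'ď' (the full sequence decodes cleanly)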

Fix

Use mlx_lm's streaming detokenizer which buffers byte-tokens until a complete UTF-8 character boundary is reached:

# Before (broken):
new_text = tokenizer.decode([response.token])

# After (UTF-8 safe):
detok = self._get_detokenizer(request_id)
detok.add_token(response.token)
new_text = detok.last_segment

A per-request detokenizer pool is maintained and cleaned up when requests finish.
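
As a rough sketch of that pool (assuming mlx_lm's NaiveStreamingDetokenizer; the actual helpers live in the schedulers and may differ in detail):

from mlx_lm.tokenizer_utils import NaiveStreamingDetokenizer

class SchedulerDetokenizerPool:
    """Hypothetical condensed sketch of the helpers added to the schedulers."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self._detokenizer_pool = {}  # request_id -> per-request streaming detokenizer

    def _get_detokenizer(self, request_id):
        # One detokenizer per request, so partial UTF-8 bytes buffer independently
        if request_id not in self._detokenizer_pool:
            self._detokenizer_pool[request_id] = NaiveStreamingDetokenizer(self.tokenizer)
        return self._detokenizer_pool[request_id]

    def _cleanup_detokenizer(self, request_id):
        # Drop the entry when a request finishes so the pool does not leak
        self._detokenizer_pool.pop(request_id, None)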

Changes

File                         Change
vllm_mlx/scheduler.py        Add _detokenizer_pool, _get_detokenizer(), _cleanup_detokenizer(); use the streaming detokenizer in _process_batch_responses()
vllm_mlx/mllm_scheduler.py   Same fix for the MLLM code path

Test plan

  • uvx black --check vllm_mlx/ passes
  • Existing tests/test_streaming_detokenizer.py validates the detokenizer logic
  • Stream a response containing multi-byte characters (e.g. Příliš žluťoučký kůň úpěl ďábelské ódy) and verify no � replacement characters appear in the SSE chunks
  • Non-streaming responses still work correctly (uses detok.finalize() + detok.text for the full output; see the sketch below)
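
For reference, a hedged sketch of that non-streaming path (generated_tokens is a placeholder name, not an identifier from the repo):

detok = self._get_detokenizer(request_id)
for tok in generated_tokens:
    detok.add_token(tok)
detok.finalize()        # flush any buffered partial UTF-8 bytes
full_text = detok.text  # the complete decoded output
self._cleanup_detokenizer(request_id)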

🤖 Generated with Claude Code

janhilgard and others added 2 commits February 24, 2026 13:47
Replace per-token tokenizer.decode([token]) with a streaming
detokenizer that buffers partial UTF-8 byte sequences. This fixes
corrupted multi-byte characters (e.g. Czech 'ď' → '��') during
SSE streaming, caused by byte-level tokens being decoded individually
instead of accumulated until a complete UTF-8 character boundary.

Uses mlx_lm's NaiveStreamingDetokenizer (or the optimized
BPEStreamingDetokenizer when available via tokenizer.detokenizer)
with a per-request pool that is cleaned up on request completion.

Both LLM scheduler and MLLM scheduler are fixed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waybarrios
Owner

Pushed a cleanup commit to fix detokenizer pool leaks in all exit paths:

scheduler.py:

  • _do_abort_request now calls _cleanup_detokenizer
  • _recover_from_generation_error clears the pool after aborting all requests
  • reset() clears the pool

mllm_scheduler.py:

  • abort_request now pops from _detokenizer_pool
  • reset() clears the pool

Also closed #195 as a duplicate, since this PR covers both schedulers.
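
A minimal sketch of these exit-path cleanups (method names taken from the lists above; the surrounding logic is elided and the real code may differ):

def _do_abort_request(self, request_id):
    # ... existing abort logic ...
    self._cleanup_detokenizer(request_id)  # release the per-request detokenizer

def _recover_from_generation_error(self):
    # ... abort all in-flight requests ...
    self._detokenizer_pool.clear()  # no pooled detokenizer remains valid

def reset(self):
    # ... existing reset logic ...
    self._detokenizer_pool.clear()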

@waybarrios waybarrios merged commit 4ede902 into waybarrios:main Mar 31, 2026
7 checks passed
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Apr 1, 2026
Brings in: prompt_tokens fix (waybarrios#236), ArraysCache batching (waybarrios#160),
platform rename (waybarrios#185), mlx-lm 0.31 compat (waybarrios#183, waybarrios#227),
base64 hash fix (waybarrios#206), streaming UTF-8 detokenizer (waybarrios#109),
and cleanup commits.

Conflicts resolved:
- scheduler.py: keep make_logits_processors import (fork feature)
- mllm_scheduler.py: take upstream stop-token skip in detokenizer
- models/mllm.py: keep SHA256 hash (fork fix for collision)
- utils/tokenizer.py: merge upstream error message with fork elif chain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sysit pushed a commit to sysit/vllm-mlx that referenced this pull request Apr 1, 2026

fix: Use streaming detokenizer for UTF-8-safe incremental decode