fix: MLLM scheduler streaming detokenizer + VLM model pre-detection #242
Thump604 wants to merge 4 commits into waybarrios:main
Conversation
janhilgard
left a comment
Good changes — streaming detokenizer for MLLM and VLM pre-detection are both needed. Two issues:
Bug: shared detokenizer instance in concurrent requests
```python
if hasattr(tokenizer, "detokenizer"):
    detok = tokenizer.detokenizer
else:
    detok = NaiveStreamingDetokenizer(tokenizer)
detok.reset()
self._detokenizer_pool[request_id] = detok
```

`tokenizer.detokenizer` returns the same object every time. With concurrent requests, multiple `request_id` entries in `_detokenizer_pool` will point to the same detokenizer instance. When request B calls `.reset()`, it destroys request A's in-progress state.
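A minimal repro of the hazard, as a sketch; it assumes the `reset()` / `add_token()` / `last_segment` interface of mlx_lm streaming detokenizers, with `tok_a` standing in for a token from request A:

```python
def demo_shared_detokenizer_bug(tokenizer, tok_a: int) -> None:
    # Both lookups return the same underlying object (per the review above),
    # so "per-request" state is actually shared across concurrent requests.
    detok_a = tokenizer.detokenizer
    detok_b = tokenizer.detokenizer
    assert detok_a is detok_b

    detok_a.reset()
    detok_a.add_token(tok_a)  # request A buffers a partial UTF-8 sequence

    detok_b.reset()           # request B resets; request A's buffer is destroyed
    # Anything request A had pending in last_segment is now lost or garbled.
```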
Fix: always create a new instance per request:

```python
detok = NaiveStreamingDetokenizer(tokenizer)
self._detokenizer_pool[request_id] = detok
```

Minor: detokenizer leak on abort/timeout
`_detokenizer_pool` is only cleaned up in `_process_batch_responses` on normal finish. If a request is aborted or times out via `_cleanup_finished`, the detokenizer entry leaks. Add cleanup in `_cleanup_finished`:

```python
self._detokenizer_pool.pop(request_id, None)
```
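In context, the cleanup path could look like the sketch below; `self._requests` is a hypothetical stand-in for the scheduler's per-request bookkeeping, and only the pool `.pop()` is the actual suggested change:

```python
def _cleanup_finished(self, request_id: str) -> None:
    # Drop all per-request state, whether the request finished,
    # was aborted, or timed out.
    self._requests.pop(request_id, None)          # hypothetical request table
    self._detokenizer_pool.pop(request_id, None)  # prevents the detokenizer leak
```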
Merge conflicts

PR currently shows CONFLICTING status and needs a rebase on latest main.
Overlap note
My MTP PR (#245) also modifies `utils/tokenizer.py` (MTP weight detection in `load_model_with_fallback`). Will rebase after both land.
Please double-check the conflict files.
`mllm_scheduler.py`:
- Replace `tokenizer.decode([token])` with `NaiveStreamingDetokenizer` per request. Fixes garbled UTF-8 multibyte characters in streaming output (e.g. CJK text split across token boundaries).
- Add error response handling: preprocessing failures now produce an error `RequestOutput` instead of silently dropping the request.
- `_detokenizer_pool`: per-request detokenizer lifecycle tied to request completion (`pool.pop` on finish, `detok.finalize()` for full text).

`utils/tokenizer.py`:
- Add `_needs_strict_false()`: reads model config to detect VLM models (`vision_config` + `text_config`) before attempting load. Avoids wasting ~100GB of memory loading weights only to fail `strict=True` validation. (Sketched below.)
- Add `_load_strict_false()`: clean `strict=False` loader using mlx_lm `_download` + `load_model` + `load_tokenizer` with parameter count logging.
- `load_model_with_fallback`: add VLM pre-detection fast path and improve retry error handling (`gc.collect` + traceback clear before retry).
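Not the diff itself, but a minimal sketch of the pre-detection check as described (assumes the helper takes a local model directory and that both keys sit at the top level of `config.json`):

```python
import json
from pathlib import Path

def _needs_strict_false(model_path: str) -> bool:
    """Detect VLM checkpoints (vision_config + text_config in config.json)
    so the loader can skip the strict=True attempt that would fail anyway."""
    config_file = Path(model_path) / "config.json"
    if not config_file.is_file():
        return False
    try:
        config = json.loads(config_file.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    return "vision_config" in config and "text_config" in config
```

The retry-path improvement can be illustrated the same way; `_load_strict` / `_load_strict_false` are placeholders for the loaders named above, and `traceback.clear_frames` is the stdlib call that releases locals pinned by the exception traceback:

```python
import gc
import traceback

def load_with_retry(model_path: str):
    try:
        return _load_strict(model_path)  # hypothetical strict=True loader
    except ValueError as e:
        # The traceback keeps frame locals (and thus the half-loaded weights)
        # alive; clear them and collect before retrying with strict=False.
        traceback.clear_frames(e.__traceback__)
        gc.collect()
        return _load_strict_false(model_path)
```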
- Always create a new `NaiveStreamingDetokenizer` per request instead of reusing `tokenizer.detokenizer` (same object shared across concurrent requests, causing state corruption)
- Clean up detokenizer in `_cleanup_finished` to prevent a memory leak when requests are aborted or time out

Fixes from janhilgard review.
Force-pushed from 2618de2 to 3e82b4d
Fixed both issues and rebased on main.
Thanks for catching the shared instance bug; that would have been a nasty concurrency issue.
Merge conflict artifact: tokenizer is not in scope in this function. The caller already handles the return value.
tokenizer.detokenizer returns the same object for every call. With concurrent requests, multiple entries in _detokenizer_pool point to the same instance -- reset() on one corrupts the other. Always create NaiveStreamingDetokenizer per request, matching the fix already applied in mllm_scheduler.py.
@janhilgard, the three changes you requested are in the diff and the PR is rebased clean on current main.
Would appreciate a re-review when convenient; it would unblock this one.
@janhilgard gentle ping — the three concerns from your 2026-03-31 CHANGES_REQUESTED review were addressed on the current head ( |
janhilgard
left a comment
All three items from my previous review are addressed:
- Shared detokenizer: both `mllm_scheduler.py` and `scheduler.py` now always create `NaiveStreamingDetokenizer(...)` per request. The `tokenizer.detokenizer` reuse path is gone. ✅
- Leak on abort/timeout: `_detokenizer_pool.pop()` added to `_cleanup_finished`, and the normal-finish path uses a single `.pop()` instead of `.get()` + `.pop()`. ✅
- Merge conflicts: rebased clean on current main. ✅
LGTM. Thanks for the quick turnaround on the fixes.
Note: my MTP PR (#245) touches `utils/tokenizer.py` as well (MTP weight injection in `load_model_with_fallback`). Will rebase after this lands.
Closing this: the key fixes (streaming detokenizer per-request isolation, VLM pre-detection, error finish handling) have landed on main through subsequent commits. The conflicts are extensive enough that a rebase isn't worth it at this point.
Summary
`mllm_scheduler.py`
- Streaming detokenizer: replace `tokenizer.decode([token])` with `NaiveStreamingDetokenizer` (from `mlx_lm.tokenizer_utils`) per request (see the sketch after this list)
- `_detokenizer_pool`: per-request detokenizer lifecycle. On finish: `detok.finalize()` + `detok.text` for the authoritative full output text
- Error handling: failed requests now produce a `RequestOutput` with `finish_reason="error"` rather than silently vanishing

`utils/tokenizer.py`
- VLM pre-detection: `_needs_strict_false()` reads `config.json` before loading to detect VLM models (`vision_config` + `text_config` present). Avoids the 2x memory penalty of loading ~100GB weights twice (once with `strict=True` that fails, once with `strict=False`)
- `_load_strict_false()`: clean implementation using mlx_lm `_download` + `load_model` + `load_tokenizer` with parameter count logging for diagnostics
- `load_model_with_fallback`: VLM fast path + improved retry path (clears traceback references and calls `gc.collect()` before retry to release memory from the failed load)

Split from #224 for easier review. No dependency on the other split PRs.
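A sketch of the per-request streaming flow the summary describes (`stream_request` and `emit` are illustrative names; the detokenizer interface is mlx_lm's `reset()` / `add_token()` / `last_segment` / `finalize()`):

```python
from mlx_lm.tokenizer_utils import NaiveStreamingDetokenizer

def stream_request(tokenizer, token_stream, emit):
    # Fresh instance per request; never reuse tokenizer.detokenizer.
    detok = NaiveStreamingDetokenizer(tokenizer)
    detok.reset()
    for token in token_stream:
        detok.add_token(token)
        if detok.last_segment:        # emits only complete UTF-8 sequences
            emit(detok.last_segment)  # stream the delta to the client
    detok.finalize()
    return detok.text                 # authoritative full output text
```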
Test plan