fix: MLLM scheduler streaming detokenizer + VLM model pre-detection #242
Thump604 wants to merge 4 commits into waybarrios:main
Conversation
janhilgard
left a comment
Good changes — streaming detokenizer for MLLM and VLM pre-detection are both needed. Two issues:
Bug: shared detokenizer instance in concurrent requests
```python
if hasattr(tokenizer, "detokenizer"):
    detok = tokenizer.detokenizer
else:
    detok = NaiveStreamingDetokenizer(tokenizer)
detok.reset()
self._detokenizer_pool[request_id] = detok
```

`tokenizer.detokenizer` returns the same object every time. With concurrent requests, multiple `request_id` entries in `_detokenizer_pool` will point to the same detokenizer instance. When request B calls `.reset()`, it destroys request A's in-progress state.
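A minimal repro of the hazard, as a sketch; it assumes the `reset()` / `add_token()` / `last_segment` interface of mlx_lm streaming detokenizers, with `tok_a` standing in for a token from request A:

```python
def demo_shared_detokenizer_bug(tokenizer, tok_a: int) -> None:
    # Both lookups return the same underlying object (per the review above),
    # so "per-request" state is actually shared across concurrent requests.
    detok_a = tokenizer.detokenizer
    detok_b = tokenizer.detokenizer
    assert detok_a is detok_b

    detok_a.reset()
    detok_a.add_token(tok_a)  # request A buffers a partial UTF-8 sequence

    detok_b.reset()           # request B resets; request A's buffer is destroyed
    # Anything request A had pending in last_segment is now lost or garbled.
```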
Fix: always create a new instance per request:

```python
detok = NaiveStreamingDetokenizer(tokenizer)
self._detokenizer_pool[request_id] = detok
```

Minor: detokenizer leak on abort/timeout
`_detokenizer_pool` is only cleaned up in `_process_batch_responses` on normal finish. If a request is aborted or times out via `_cleanup_finished`, the detokenizer entry leaks. Add cleanup in `_cleanup_finished`:

```python
self._detokenizer_pool.pop(request_id, None)
```
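In context, the cleanup path could look like the sketch below; `self._requests` is a hypothetical stand-in for the scheduler's per-request bookkeeping, and only the pool `.pop()` is the actual suggested change:

```python
def _cleanup_finished(self, request_id: str) -> None:
    # Drop all per-request state, whether the request finished,
    # was aborted, or timed out.
    self._requests.pop(request_id, None)          # hypothetical request table
    self._detokenizer_pool.pop(request_id, None)  # prevents the detokenizer leak
```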
Merge conflicts

PR currently shows CONFLICTING status and needs a rebase on latest main.
Overlap note
My MTP PR (#245) also modifies `utils/tokenizer.py` (MTP weight detection in `load_model_with_fallback`). Will rebase after both land.
Please double-check the conflict files.
`mllm_scheduler.py`:
- Replace `tokenizer.decode([token])` with `NaiveStreamingDetokenizer` per request. Fixes garbled UTF-8 multibyte characters in streaming output (e.g. CJK text split across token boundaries).
- Add error response handling: preprocessing failures now produce an error `RequestOutput` instead of silently dropping the request.
- `_detokenizer_pool`: per-request detokenizer lifecycle tied to request completion (`pool.pop` on finish, `detok.finalize()` for full text).

`utils/tokenizer.py`:
- Add `_needs_strict_false()`: reads model config to detect VLM models (`vision_config` + `text_config`) before attempting load. Avoids wasting ~100GB of memory loading weights only to fail `strict=True` validation. (Sketched below.)
- Add `_load_strict_false()`: clean `strict=False` loader using mlx_lm `_download` + `load_model` + `load_tokenizer` with parameter count logging.
- `load_model_with_fallback`: add VLM pre-detection fast path and improve retry error handling (`gc.collect` + traceback clear before retry).
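Not the diff itself, but a minimal sketch of the pre-detection check as described (assumes the helper takes a local model directory and that both keys sit at the top level of `config.json`):

```python
import json
from pathlib import Path

def _needs_strict_false(model_path: str) -> bool:
    """Detect VLM checkpoints (vision_config + text_config in config.json)
    so the loader can skip the strict=True attempt that would fail anyway."""
    config_file = Path(model_path) / "config.json"
    if not config_file.is_file():
        return False
    try:
        config = json.loads(config_file.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    return "vision_config" in config and "text_config" in config
```

The retry-path improvement can be illustrated the same way; `_load_strict` / `_load_strict_false` are placeholders for the loaders named above, and `traceback.clear_frames` is the stdlib call that releases locals pinned by the exception traceback:

```python
import gc
import traceback

def load_with_retry(model_path: str):
    try:
        return _load_strict(model_path)  # hypothetical strict=True loader
    except ValueError as e:
        # The traceback keeps frame locals (and thus the half-loaded weights)
        # alive; clear them and collect before retrying with strict=False.
        traceback.clear_frames(e.__traceback__)
        gc.collect()
        return _load_strict_false(model_path)
```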
- Always create a new `NaiveStreamingDetokenizer` per request instead of reusing `tokenizer.detokenizer` (same object shared across concurrent requests, causing state corruption)
- Clean up detokenizer in `_cleanup_finished` to prevent a memory leak when requests are aborted or time out

Fixes from janhilgard review.
Force-pushed from 2618de2 to 3e82b4d
Fixed both issues and rebased on main.
Thanks for catching the shared instance bug; that would have been a nasty concurrency issue.
Merge conflict artifact: tokenizer is not in scope in this function. The caller already handles the return value.
tokenizer.detokenizer returns the same object for every call. With concurrent requests, multiple entries in _detokenizer_pool point to the same instance -- reset() on one corrupts the other. Always create NaiveStreamingDetokenizer per request, matching the fix already applied in mllm_scheduler.py.
@janhilgard, the three changes you requested are in the diff and the PR is rebased clean on current main.
Would appreciate a re-review when convenient; it would unblock this one.
@janhilgard gentle ping — the three concerns from your 2026-03-31 CHANGES_REQUESTED review were addressed on the current head ( |
janhilgard
left a comment
All three items from my previous review are addressed:
- Shared detokenizer: both `mllm_scheduler.py` and `scheduler.py` now always create `NaiveStreamingDetokenizer(...)` per request. The `tokenizer.detokenizer` reuse path is gone. ✅
- Leak on abort/timeout: `_detokenizer_pool.pop()` added to `_cleanup_finished`, and the normal-finish path uses a single `.pop()` instead of `.get()` + `.pop()`. ✅
- Merge conflicts: rebased clean on current main. ✅
LGTM. Thanks for the quick turnaround on the fixes.
Note: my MTP PR (#245) touches `utils/tokenizer.py` as well (MTP weight injection in `load_model_with_fallback`). Will rebase after this lands.
Closing this: the key fixes (streaming detokenizer per-request isolation, VLM pre-detection, error finish handling) have landed on main through subsequent commits. The conflicts are extensive enough that a rebase isn't worth it at this point.
Summary
`mllm_scheduler.py`
- Streaming detokenizer: replace `tokenizer.decode([token])` with `NaiveStreamingDetokenizer` (from `mlx_lm.tokenizer_utils`) per request (see the sketch after this list)
- `_detokenizer_pool`: per-request detokenizer lifecycle. On finish: `detok.finalize()` + `detok.text` for the authoritative full output text
- Error handling: failed requests now produce a `RequestOutput` with `finish_reason="error"` rather than silently vanishing

`utils/tokenizer.py`
- VLM pre-detection: `_needs_strict_false()` reads `config.json` before loading to detect VLM models (`vision_config` + `text_config` present). Avoids the 2x memory penalty of loading ~100GB weights twice (once with `strict=True` that fails, once with `strict=False`)
- `_load_strict_false()`: clean implementation using mlx_lm `_download` + `load_model` + `load_tokenizer` with parameter count logging for diagnostics
- `load_model_with_fallback`: VLM fast path + improved retry path (clears traceback references and calls `gc.collect()` before retry to release memory from the failed load)

Split from #224 for easier review. No dependency on the other split PRs.
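A sketch of the per-request streaming flow the summary describes (`stream_request` and `emit` are illustrative names; the detokenizer interface is mlx_lm's `reset()` / `add_token()` / `last_segment` / `finalize()`):

```python
from mlx_lm.tokenizer_utils import NaiveStreamingDetokenizer

def stream_request(tokenizer, token_stream, emit):
    # Fresh instance per request; never reuse tokenizer.detokenizer.
    detok = NaiveStreamingDetokenizer(tokenizer)
    detok.reset()
    for token in token_stream:
        detok.add_token(token)
        if detok.last_segment:        # emits only complete UTF-8 sequences
            emit(detok.last_segment)  # stream the delta to the client
    detok.finalize()
    return detok.text                 # authoritative full output text
```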
Test plan