fix: MLLM scheduler streaming detokenizer + VLM model pre-detection#242

Closed
Thump604 wants to merge 4 commits into waybarrios:main from Thump604:fix/mllm-scheduler-detokenizer

Conversation

@Thump604
Collaborator

Summary

mllm_scheduler.py - streaming detokenizer:

  • Replace tokenizer.decode([token]) with NaiveStreamingDetokenizer (from mlx_lm.tokenizer_utils) per request
  • Fixes garbled UTF-8 multibyte characters in streaming output (e.g. CJK text where a character is split across token boundaries emits replacement characters mid-stream instead of waiting for the complete codepoint)
  • _detokenizer_pool: per-request detokenizer lifecycle. On finish: detok.finalize() + detok.text for the authoritative full output text (see the sketch after this list)
  • Error response handling: requests that fail during preprocessing (from the batch generator error path) now produce a proper RequestOutput with finish_reason="error" rather than silently vanishing
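
A minimal sketch of the per-request lifecycle described above, assuming the mlx_lm streaming-detokenizer API (add_token / last_segment / finalize / text) and a scheduler that owns a _detokenizer_pool dict; the helper names are illustrative, not the PR's exact code:

from mlx_lm.tokenizer_utils import NaiveStreamingDetokenizer

def _start_request(self, request_id, tokenizer):
    # One fresh detokenizer per request; multibyte codepoints stay buffered
    # until complete instead of being decoded token-by-token.
    self._detokenizer_pool[request_id] = NaiveStreamingDetokenizer(tokenizer)

def _on_token(self, request_id, token):
    detok = self._detokenizer_pool[request_id]
    detok.add_token(token)
    # Incremental text; empty while a codepoint is still split across tokens.
    return detok.last_segment

def _on_finish(self, request_id):
    detok = self._detokenizer_pool.pop(request_id)  # single pop at the finalize site
    detok.finalize()
    return detok.text  # authoritative full output text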

utils/tokenizer.py - VLM pre-detection:

  • _needs_strict_false(): reads config.json before loading to detect VLM models (vision_config + text_config present). Avoids the 2x memory penalty of loading ~100GB weights twice (once with strict=True that fails, once with strict=False)
  • _load_strict_false(): clean implementation using mlx_lm._download + load_model + load_tokenizer with parameter count logging for diagnostics
  • load_model_with_fallback: VLM fast path + improved retry path (clears traceback references and calls gc.collect() before retry to release memory from the failed load); a sketch of the pre-detection and retry flow follows this list
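
A minimal sketch of the pre-detection and retry flow described above, assuming the model path is a local directory containing config.json; _load_strict stands in for a hypothetical strict=True loader and _load_strict_false for the strict=False loader the PR adds, so neither signature is the exact one in utils/tokenizer.py:

import gc
import json
from pathlib import Path

def _needs_strict_false(model_path: str) -> bool:
    # A VLM checkpoint carries both vision_config and text_config; detecting
    # that up front avoids loading ~100GB of weights only to fail strict=True.
    config_file = Path(model_path) / "config.json"
    if not config_file.is_file():
        return False
    with open(config_file) as f:
        config = json.load(f)
    return "vision_config" in config and "text_config" in config

def load_model_with_fallback(model_path: str):
    if _needs_strict_false(model_path):
        return _load_strict_false(model_path)  # VLM fast path: one load, strict=False
    try:
        return _load_strict(model_path)
    except ValueError as exc:
        # The traceback keeps the failed load's frames (and its weights) alive;
        # drop it and collect before retrying with strict=False.
        exc.__traceback__ = None
        gc.collect()
        return _load_strict_false(model_path)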

Split from #224 for easier review. No dependency on the other split PRs.

Test plan

  • CJK text (Chinese/Japanese/Korean) streams without replacement characters mid-stream (see the test sketch after this list)
  • VLM model (e.g. Qwen3.5-122B-VLM) loads without double-loading weights
  • Non-VLM models (Nemotron, plain Qwen) still follow their existing code paths
  • Request that fails preprocessing produces a proper error response, not a hang
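
A pytest-style sketch of the first test-plan item, assuming a loaded tokenizer fixture and NaiveStreamingDetokenizer from mlx_lm.tokenizer_utils; the sample string and assertions are illustrative, not the project's actual test:

from mlx_lm.tokenizer_utils import NaiveStreamingDetokenizer

def test_cjk_streams_without_replacement_chars(tokenizer):
    # Feed tokens one at a time, the way the scheduler streams them.
    text = "你好，世界。こんにちは。안녕하세요."
    detok = NaiveStreamingDetokenizer(tokenizer)
    streamed = []
    for token in tokenizer.encode(text):
        detok.add_token(token)
        streamed.append(detok.last_segment)
    detok.finalize()
    # No U+FFFD replacement characters should appear mid-stream, and the
    # finalized text should match a plain full decode of the same tokens.
    assert "\ufffd" not in "".join(streamed)
    assert detok.text == tokenizer.decode(tokenizer.encode(text))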

Collaborator

@janhilgard left a comment


Good changes — streaming detokenizer for MLLM and VLM pre-detection are both needed. Two issues:

Bug: shared detokenizer instance in concurrent requests

if hasattr(tokenizer, "detokenizer"):
    detok = tokenizer.detokenizer
else:
    detok = NaiveStreamingDetokenizer(tokenizer)
detok.reset()
self._detokenizer_pool[request_id] = detok

tokenizer.detokenizer returns the same object every time. With concurrent requests, multiple request_id entries in _detokenizer_pool will point to the same detokenizer instance. When request B calls .reset(), it destroys request A's in-progress state.
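
Illustrative only, assuming tokenizer is an mlx_lm TokenizerWrapper whose detokenizer property returns a cached instance, as described above (token_id is a placeholder):

detok_a = tokenizer.detokenizer  # "request A"
detok_b = tokenizer.detokenizer  # "request B"
assert detok_a is detok_b        # same object, not a fresh copy
detok_a.add_token(token_id)      # A starts streaming
detok_b.reset()                  # B's reset() wipes A's buffered state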

Fix: always create a new instance per request:

detok = NaiveStreamingDetokenizer(tokenizer)
self._detokenizer_pool[request_id] = detok

Minor: detokenizer leak on abort/timeout

_detokenizer_pool is only cleaned up in _process_batch_responses on normal finish. If a request is aborted or times out via _cleanup_finished, the detokenizer entry leaks. Add cleanup in _cleanup_finished:

self._detokenizer_pool.pop(request_id, None)

Merge conflicts

PR currently shows CONFLICTING status — needs rebase on latest main.

Overlap note

My MTP PR (#245) also modifies utils/tokenizer.py (MTP weight detection in load_model_with_fallback). Will rebase after both land.

@waybarrios
Owner

Please double-check the conflicting files.

Thump604 added 2 commits April 1, 2026 17:06
mllm_scheduler.py:
- Replace tokenizer.decode([token]) with NaiveStreamingDetokenizer per
  request. Fixes garbled UTF-8 multibyte characters in streaming output
  (e.g. CJK text split across token boundaries).
- Add error response handling: preprocessing failures now produce an
  error RequestOutput instead of silently dropping the request.
- _detokenizer_pool: per-request detokenizer lifecycle tied to request
  completion (pool.pop on finish, detok.finalize() for full text).

utils/tokenizer.py:
- Add _needs_strict_false(): reads model config to detect VLM models
  (vision_config + text_config) before attempting load. Avoids wasting
  ~100GB of memory loading weights only to fail strict=True validation.
- Add _load_strict_false(): clean strict=False loader using mlx_lm
  _download + load_model + load_tokenizer with parameter count logging.
- load_model_with_fallback: add VLM pre-detection fast path and improve
  retry error handling (gc.collect + traceback clear before retry).
- Always create new NaiveStreamingDetokenizer per request instead of
  reusing tokenizer.detokenizer (same object shared across concurrent
  requests, causing state corruption)
- Clean up detokenizer in _cleanup_finished to prevent memory leak
  when requests are aborted or time out

Fixes from janhilgard review.
@Thump604 force-pushed the fix/mllm-scheduler-detokenizer branch from 2618de2 to 3e82b4d on April 1, 2026 at 22:07
@Thump604
Collaborator Author

Thump604 commented Apr 1, 2026

Fixed both issues and rebased on main:

  1. Shared detokenizer: always create NaiveStreamingDetokenizer per request instead of reusing tokenizer.detokenizer
  2. Leak on abort: added _detokenizer_pool.pop(request_id, None) in _cleanup_finished
  3. Merge conflicts: resolved, rebased on b4fa030

Thanks for catching the shared-instance bug; that would have been a nasty concurrency issue.

Thump604 added 2 commits April 1, 2026 17:10
Merge conflict artifact: tokenizer is not in scope in this function.
The caller already handles the return value.
tokenizer.detokenizer returns the same object for every call.
With concurrent requests, multiple entries in _detokenizer_pool
point to the same instance -- reset() on one corrupts the other.

Always create NaiveStreamingDetokenizer per request, matching
the fix already applied in mllm_scheduler.py.
@Thump604
Collaborator Author

Thump604 commented Apr 7, 2026

@janhilgard, the three changes you requested are in the diff and the PR is rebased clean on current main:

  1. Shared detokenizer instance: Both call sites in vllm_mlx/scheduler.py and vllm_mlx/engine/batched.py now always create NaiveStreamingDetokenizer(...) per request. The old tokenizer.detokenizer reuse path is gone.

  2. Leak on abort: _detokenizer_pool.pop(request_id, None) added to _cleanup_finished. Also tightened the normal-finish path from .get() then later .pop() to a single .pop() at the finalize site so the pool entry is removed atomically (before/after sketched below).

  3. Merge conflicts: rebased clean. mergeable=MERGEABLE.
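
For point 2, a before/after sketch of the finalize-site change (names as in the description above, not the exact diff):

# Before: lookup, then a separate pop later in the finish path
detok = self._detokenizer_pool.get(request_id)
detok.finalize()
final_text = detok.text
# ... later ...
self._detokenizer_pool.pop(request_id, None)

# After: a single pop removes the pool entry where the text is finalized
detok = self._detokenizer_pool.pop(request_id)
detok.finalize()
final_text = detok.text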

Would appreciate a re-review when convenient; it would unblock this one.

@Thump604
Collaborator Author

Thump604 commented Apr 9, 2026

@janhilgard gentle ping — the three concerns from your 2026-03-31 CHANGES_REQUESTED review were addressed on the current head (c2438058) in the 2026-04-07 refresh. Branch is mergeable against current main, CI is green across lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, and tests. Would you mind another pass so the reviewDecision can flip? Happy to walk through any of the three points if that helps.

Collaborator

@janhilgard left a comment


All three items from my previous review are addressed:

  1. Shared detokenizer — both mllm_scheduler.py and scheduler.py now always create NaiveStreamingDetokenizer(...) per request. The tokenizer.detokenizer reuse path is gone. ✅
  2. Leak on abort/timeout — _detokenizer_pool.pop() added to _cleanup_finished, and the normal-finish path uses a single .pop() instead of .get() + .pop(). ✅
  3. Merge conflicts — rebased clean on current main. ✅

LGTM. Thanks for the quick turnaround on the fixes.

Note: my MTP PR (#245) touches utils/tokenizer.py as well (MTP weight injection in load_model_with_fallback). Will rebase after this lands.

@Thump604
Collaborator Author

Closing this - the key fixes (streaming detokenizer per-request isolation, VLM pre-detection, error finish handling) have landed on main through subsequent commits. The conflicts are extensive enough that a rebase isn't worth it at this point.

@Thump604 closed this Apr 13, 2026