fix(mllm_scheduler): drain requests dict and clear metal cache after completions #154
Merged
waybarrios merged 1 commit into waybarrios:main · Mar 12, 2026
Conversation
waybarrios (Owner)
Nice work on this one @kol22, the OOM fix is legit. Tested numbers look great too. Only thing I'd maybe think about later is doing the cache clearing on a timer like the LLM scheduler does, instead of only when requests finish, since a big video request could still pile up memory if nothing completes for a while. But honestly that's a separate thing, this fixes the actual problem.
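For reference, the timer-style clearing mentioned above can be sketched as a step counter whose clear interval shrinks as concurrency grows (PR waybarrios#157 later ported exactly this to the MLLM scheduler). This is a minimal sketch, not the actual vllm-mlx code: `CLEAR_BASE_INTERVAL` and the injectable `clear_cache` callback (standing in for `mx.clear_cache()`) are assumed names.

```python
# Hedged sketch: periodic cache clearing driven by the scheduler step
# counter rather than by request completions, so memory is bounded even
# when nothing completes for a while. `clear_cache` stands in for
# mx.clear_cache(); the interval constant is illustrative.

CLEAR_BASE_INTERVAL = 64  # steps between clears at 1 active sequence (assumed)


def maybe_clear_cache(step_count: int, active_seqs: int, clear_cache) -> bool:
    """Clear roughly every CLEAR_BASE_INTERVAL // active_seqs steps.

    The interval scales inversely with the number of active sequences:
    more concurrency means more allocation churn, so clear more often.
    """
    interval = max(1, CLEAR_BASE_INTERVAL // max(1, active_seqs))
    if step_count % interval == 0:
        clear_cache()
        return True
    return False
```

With one active sequence the pool is cleared every 64 steps; with eight it is cleared every 8 steps, keeping pressure bounded under a single long-running video request.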
waybarrios (Owner)
Let me know if you wanna work in the last thing I mention, @kol22.
kol22 (Contributor, Author)
@waybarrios I'll work this in.
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request on Mar 21, 2026:
…sion

Cherry-pick upstream merged fixes + local improvements:

- PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal AGXAllocation handles; adaptive cache clear interval scales with concurrency
- PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first 1000 chars to prevent cross-request image cache collisions

Additional fixes:

- batched.py: Disable thinking mode for coder models, promote MLLM stats
- mllm_batch_generator.py: Downgrade prompt size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
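The PR waybarrios#126 item above (hash the full base64 payload with SHA-256 instead of MD5 over the first 1000 chars) can be illustrated with a minimal sketch; the function name `image_cache_key` is hypothetical, not the actual vllm-mlx API.

```python
import hashlib


def image_cache_key(image_b64: str) -> str:
    """Hypothetical cache key: SHA-256 over the full base64 string.

    Hashing only a prefix (the old MD5-of-first-1000-chars scheme) lets two
    different images that share a long common prefix collide in the cache,
    so one request could be served another request's cached image features.
    """
    return hashlib.sha256(image_b64.encode("ascii")).hexdigest()


# Two payloads with an identical 1000-char prefix now get distinct keys,
# while the old prefix-only MD5 scheme would have collided.
a = "A" * 2000
b = "A" * 1000 + "B" * 1000
assert image_cache_key(a) != image_cache_key(b)
assert hashlib.md5(a[:1000].encode()).hexdigest() == hashlib.md5(b[:1000].encode()).hexdigest()
```

Base64 images routinely share long common prefixes (same encoder, same header bytes), which is why prefix hashing is an unsafe cache key here.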
Summary

- Add self.requests.pop(request_id, None) in abort_request() and _cleanup_finished() to drain the request tracking dict
- Call mx.clear_cache() after _cleanup_finished() in step() to release MLX metal buffer pool memory back to the OS

Context
The MLLMScheduler has two resource leaks that compound under continuous batching with vision requests:

1. self.requests dict never drained. Requests are added at add_request() but never removed during the normal lifecycle. _cleanup_finished() marks requests in finished_req_ids and cleans up running/UID mappings, but leaves self.requests intact. A remove_finished_request() method exists but is only called from EngineCore, which uses the regular Scheduler; the MLLM scheduler's own async loop never calls it. Over time, self.requests grows without bound (7,000+ orphaned entries observed per backend before OOM).

2. MLX metal buffer pool monotonic growth. When KV cache tensors are deallocated after request completion, MLX returns the GPU memory to its internal metal buffer pool rather than the OS. Without an explicit mx.clear_cache() call, the pool grows monotonically. The regular Scheduler already received this fix in PR #44 (merged Feb 9), but the MLLMScheduler was missed.

Together, these leaks cause backends to hit
kIOGPUCommandBufferCallbackErrorOutOfMemory after hours of continuous serving with multimodal requests.

Observed impact
Tested on Mac Studio M3 Ultra (96 GB) running two vllm-mlx backends with --continuous-batching and Qwen3.5-35B-A3B (4-bit):

- kIOGPUCommandBufferCallbackErrorOutOfMemory after ~15 min
- self.requests size per backend