fix(mllm_scheduler): drain requests dict and clear metal cache after completions#154

Merged
waybarrios merged 1 commit into waybarrios:main from kol22:fix/mllm-scheduler-resource-leaks
Mar 12, 2026
Conversation

@kol22
Contributor

@kol22 kol22 commented Mar 11, 2026

Summary

  • Add self.requests.pop(request_id, None) in abort_request() and _cleanup_finished() to drain the request tracking dict
  • Add mx.clear_cache() after _cleanup_finished() in step() to release MLX metal buffer pool memory back to the OS
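The two changes above can be sketched as follows. This is a minimal, self-contained sketch, not the actual MLLMScheduler code: the class shape and attribute names other than `self.requests`, `abort_request()`, `_cleanup_finished()`, and `step()` are assumptions, and the real `mx.clear_cache()` call (from `mlx.core`) is shown only as a comment so the sketch runs without MLX.

```python
# Sketch of the draining fix. Hypothetical class shape; only the method
# names and self.requests come from the PR description.

class MLLMSchedulerSketch:
    def __init__(self):
        self.requests = {}           # request_id -> request state
        self.finished_req_ids = set()

    def add_request(self, request_id, request):
        self.requests[request_id] = request

    def abort_request(self, request_id):
        # Fix 1a: drain the tracking dict on abort as well.
        self.requests.pop(request_id, None)

    def _cleanup_finished(self):
        for request_id in self.finished_req_ids:
            # Fix 1b: remove completed entries instead of leaving them forever.
            self.requests.pop(request_id, None)
        self.finished_req_ids.clear()

    def step(self):
        self._cleanup_finished()
        # Fix 2: release Metal buffer-pool memory back to the OS.
        # In the real scheduler this is `mx.clear_cache()` from mlx.core;
        # omitted here so the sketch runs anywhere.
```

The key point is that `pop(request_id, None)` is idempotent, so calling it in both `abort_request()` and `_cleanup_finished()` is safe even if a request is aborted and later marked finished.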

Context

The MLLMScheduler has two resource leaks that compound under continuous batching with vision requests:

1. self.requests dict never drained. Requests are added in add_request() but never removed during the normal lifecycle. _cleanup_finished() marks requests in finished_req_ids and cleans up running/UID mappings, but leaves self.requests intact. A remove_finished_request() method exists, but it is only called from EngineCore, which uses the regular Scheduler; the MLLM scheduler's own async loop never calls it. Over time, self.requests grows without bound (7,000+ orphaned entries observed per backend before OOM).

2. MLX metal buffer pool monotonic growth. When KV cache tensors are deallocated after request completion, MLX returns the GPU memory to its internal metal buffer pool rather than the OS. Without an explicit mx.clear_cache() call, the pool grows monotonically. The regular Scheduler already received this fix in PR #44 (merged Feb 9), but the MLLMScheduler was missed.

Together, these leaks cause backends to hit kIOGPUCommandBufferCallbackErrorOutOfMemory after hours of continuous serving with multimodal requests.

Observed impact

Tested on Mac Studio M3 Ultra (96 GB) running two vllm-mlx backends with --continuous-batching and Qwen3.5-35B-A3B (4-bit):

| Metric | Before | After |
| --- | --- | --- |
| Soak test result | 94.4% error rate (1201/1272 failures) | 0% error rate (8279/8279 OK) |
| Failure mode | kIOGPUCommandBufferCallbackErrorOutOfMemory after ~15 min | Stable over 100+ min |
| self.requests size per backend | 7,085 orphaned entries at crash | 0 (properly drained) |

@waybarrios
Owner

Nice work on this one @kol22, the OOM fix is legit. The tested numbers look great too. The only thing I'd maybe think about later is doing the cache clearing on a timer like the LLM scheduler does, instead of only when requests finish, since a big video request could still pile up memory if nothing completes for a while. But honestly that's a separate thing; this fixes the actual problem.
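The timer-based idea above could look something like the following. This is a hypothetical sketch, not code from this PR or from the LLM scheduler; the class and parameter names are invented, and the actual clear call is passed in as a function so the sketch runs without MLX.

```python
# Hypothetical periodic-clear helper: clear the Metal buffer cache every
# `interval_s` seconds regardless of completions, so a long-running request
# can't let pool memory pile up while nothing finishes.

import time


class PeriodicCacheClearer:
    def __init__(self, interval_s: float = 30.0):
        self.interval_s = interval_s
        self._last_clear = time.monotonic()

    def maybe_clear(self, clear_fn) -> bool:
        """Call clear_fn (e.g. mx.clear_cache in the real scheduler) if the
        interval has elapsed; return True if a clear happened."""
        now = time.monotonic()
        if now - self._last_clear >= self.interval_s:
            clear_fn()
            self._last_clear = now
            return True
        return False
```

A scheduler loop would call `clearer.maybe_clear(mx.clear_cache)` once per step, making clears time-driven rather than completion-driven.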

@waybarrios waybarrios merged commit eed6cde into waybarrios:main Mar 12, 2026
7 checks passed
@waybarrios
Owner

Let me know if you want to work in that last thing I mentioned, @kol22.

@kol22
Contributor Author

kol22 commented Mar 13, 2026

@waybarrios I'll work this in

@kol22 kol22 deleted the fix/mllm-scheduler-resource-leaks branch March 13, 2026 01:22
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Mar 21, 2026
…sion

Cherry-pick upstream merged fixes + local improvements:

- PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal
  AGXAllocation handles; adaptive cache clear interval scales with concurrency
- PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent
  linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to
  MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths
  in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first
  1000 chars to prevent cross-request image cache collisions

Additional fixes:
- batched.py: Disable thinking mode for coder models, promote MLLM stats
- mllm_batch_generator.py: Downgrade prompt size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>