fix(mllm_scheduler): drain requests dict and clear metal cache after completions#154

Merged
waybarrios merged 1 commit into waybarrios:main from kol22:fix/mllm-scheduler-resource-leaks
Mar 12, 2026
Conversation

@kol22
Contributor

@kol22 kol22 commented Mar 11, 2026

Summary

  • Add self.requests.pop(request_id, None) in abort_request() and _cleanup_finished() to drain the request tracking dict
  • Add mx.clear_cache() after _cleanup_finished() in step() to release MLX metal buffer pool memory back to the OS
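The two changes above can be sketched as follows. This is a minimal, self-contained sketch, not the actual MLLMScheduler code: the class shape and attribute names other than `self.requests`, `abort_request()`, `_cleanup_finished()`, and `step()` are assumptions, and the real `mx.clear_cache()` call (from `mlx.core`) is shown only as a comment so the sketch runs without MLX.

```python
# Sketch of the draining fix. Hypothetical class shape; only the method
# names and self.requests come from the PR description.

class MLLMSchedulerSketch:
    def __init__(self):
        self.requests = {}           # request_id -> request state
        self.finished_req_ids = set()

    def add_request(self, request_id, request):
        self.requests[request_id] = request

    def abort_request(self, request_id):
        # Fix 1a: drain the tracking dict on abort as well.
        self.requests.pop(request_id, None)

    def _cleanup_finished(self):
        for request_id in self.finished_req_ids:
            # Fix 1b: remove completed entries instead of leaving them forever.
            self.requests.pop(request_id, None)
        self.finished_req_ids.clear()

    def step(self):
        self._cleanup_finished()
        # Fix 2: release Metal buffer-pool memory back to the OS.
        # In the real scheduler this is `mx.clear_cache()` from mlx.core;
        # omitted here so the sketch runs anywhere.
```

The key point is that `pop(request_id, None)` is idempotent, so calling it in both `abort_request()` and `_cleanup_finished()` is safe even if a request is aborted and later marked finished.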

Context

The MLLMScheduler has two resource leaks that compound under continuous batching with vision requests:

1. self.requests dict never drained. Requests are added in add_request() but never removed during the normal lifecycle. _cleanup_finished() marks requests in finished_req_ids and cleans up running/UID mappings, but leaves self.requests intact. A remove_finished_request() method exists, but it is only called from EngineCore, which uses the regular Scheduler; the MLLM scheduler's own async loop never calls it. Over time, self.requests grows without bound (7,000+ orphaned entries observed per backend before OOM).

2. MLX metal buffer pool monotonic growth. When KV cache tensors are deallocated after request completion, MLX returns the GPU memory to its internal metal buffer pool rather than the OS. Without an explicit mx.clear_cache() call, the pool grows monotonically. The regular Scheduler already received this fix in PR #44 (merged Feb 9), but the MLLMScheduler was missed.

Together, these leaks cause backends to hit kIOGPUCommandBufferCallbackErrorOutOfMemory after hours of continuous serving with multimodal requests.

Observed impact

Tested on Mac Studio M3 Ultra (96 GB) running two vllm-mlx backends with --continuous-batching and Qwen3.5-35B-A3B (4-bit):

| Metric | Before | After |
| --- | --- | --- |
| Soak test result | 94.4% error rate (1201/1272 failures) | 0% error rate (8279/8279 OK) |
| Failure mode | kIOGPUCommandBufferCallbackErrorOutOfMemory after ~15 min | Stable over 100+ min |
| self.requests size per backend | 7,085 orphaned entries at crash | 0 (properly drained) |

@waybarrios
Owner

Nice work on this one @kol22, the OOM fix is legit. The tested numbers look great too. The only thing I'd maybe think about later is doing the cache clearing on a timer like the LLM scheduler does, instead of only when requests finish, since a big video request could still pile up memory if nothing completes for a while. But honestly that's a separate thing; this fixes the actual problem.
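The timer-based idea above could look something like the following. This is a hypothetical sketch, not code from this PR or from the LLM scheduler; the class and parameter names are invented, and the actual clear call is passed in as a function so the sketch runs without MLX.

```python
# Hypothetical periodic-clear helper: clear the Metal buffer cache every
# `interval_s` seconds regardless of completions, so a long-running request
# can't let pool memory pile up while nothing finishes.

import time


class PeriodicCacheClearer:
    def __init__(self, interval_s: float = 30.0):
        self.interval_s = interval_s
        self._last_clear = time.monotonic()

    def maybe_clear(self, clear_fn) -> bool:
        """Call clear_fn (e.g. mx.clear_cache in the real scheduler) if the
        interval has elapsed; return True if a clear happened."""
        now = time.monotonic()
        if now - self._last_clear >= self.interval_s:
            clear_fn()
            self._last_clear = now
            return True
        return False
```

A scheduler loop would call `clearer.maybe_clear(mx.clear_cache)` once per step, making clears time-driven rather than completion-driven.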

@waybarrios waybarrios merged commit eed6cde into waybarrios:main Mar 12, 2026
7 checks passed
@waybarrios
Owner

Let me know if you want to work in that last thing I mentioned, @kol22.

@kol22
Contributor Author

kol22 commented Mar 13, 2026

@waybarrios I'll work this in

@kol22 kol22 deleted the fix/mllm-scheduler-resource-leaks branch March 13, 2026 01:22
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Mar 21, 2026
…sion

Cherry-pick upstream merged fixes + local improvements:

- PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal
  AGXAllocation handles; adaptive cache clear interval scales with concurrency
- PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent
  linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to
  MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths
  in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first
  1000 chars to prevent cross-request image cache collisions

Additional fixes:
- batched.py: Disable thinking mode for coder models, promote MLLM stats
- mllm_batch_generator.py: Downgrade prompt size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>