Fix Metal resource leak under high concurrency#92
waybarrios merged 2 commits into waybarrios:main
Conversation
Addresses a Metal buffer leak where batch.tokens grows via mx.concatenate() each generation step without evaluation, causing computation graph nodes to hold AGXAllocation handles indefinitely.

Changes:
- Add mx.async_eval(*batch.tokens) after each generation step to eagerly evaluate accumulated token concatenations and release Metal buffers
- Make the cache clear interval adaptive: it scales inversely with the active sequence count (minimum interval 8) to prevent Metal resource handle exhaustion under high-concurrency workloads
- Add an explicit mx.eval(*tokens) during the periodic cache clear to collapse any remaining lazy concatenation chains before clearing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
I've tried this and it seems to be working. Without the fix, two 4B Qwen models loaded tanked my memory (48 GB); with the fix, usage is around 6 GB. Can you please check and merge if it looks good?
2 issues found. Do you mind taking a look, @janhilgard? Other than that, the PR is good!
    # from lazy mx.concatenate() chains holding AGXAllocation handles
    if batch.tokens:
        mx.async_eval(*batch.tokens)
mx.async_eval(*batch.tokens) may be redundant here. batch.tokens[i] is built via mx.concatenate((toks, y[i:i+1])) which depends on batch.y. Two lines below, y = y.tolist() forces synchronous evaluation of y, which cascades to evaluate these token concatenation chains. This async_eval either (a) is a no-op since the sync point already forces evaluation, or (b) inserts extra async work between the batch.y async eval above and its consumption via y.tolist(), potentially causing premature materialization of shared computation graph nodes.
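The redundancy argument can be illustrated with a toy lazy-evaluation model (pure Python, not MLX; the class and names below are invented for illustration): once a downstream sync point such as tolist() forces the chain, every upstream node is materialized anyway, so an earlier explicit eval of the same chain adds nothing.

```python
# Toy model of lazy concatenation (NOT MLX). `materialized` stands in for
# a Metal buffer; `tolist()` is the sync point analogous to y.tolist().

class LazyConcat:
    def __init__(self, parts):
        self.parts = parts          # upstream lazy nodes or concrete lists
        self.materialized = None    # filled in only when evaluated

    def eval(self):
        if self.materialized is None:
            out = []
            for p in self.parts:
                out.extend(p.eval() if isinstance(p, LazyConcat) else p)
            self.materialized = out
        return self.materialized

    def tolist(self):
        # Sync point: forces the full chain, cascading to upstream nodes.
        return self.eval()

toks = LazyConcat([[1, 2]])
chain = LazyConcat([toks, [3]])   # like mx.concatenate((toks, y[i:i+1]))
result = chain.tolist()           # one sync eval materializes `toks` too
print(result, toks.materialized)  # both nodes are now materialized
```

In this model, calling an extra eval() on chain before tolist() changes nothing, which is the (a) no-op case the review describes.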
vllm_mlx/scheduler.py
Outdated
    # Metal resource handle exhaustion under high-concurrency workloads.
    active_seqs = len(self.running)
    effective_interval = max(
        8, self._clear_cache_interval // max(1, active_seqs // 8)
    )
The hardcoded floor of 8 causes effective_interval to collapse to 8 for any active_seqs >= 32 (since 32 // (32//8) = 8) — a 4x increase in cache-clear frequency. The comment at lines 1051-1052 documents _clear_cache_interval as the operator-configurable knob, but this formula bypasses it. Consider deriving the floor from the configured value (e.g. self._clear_cache_interval // 4) instead of a magic constant. Also, recalculating effective_interval every step() with a monotonic _step_count means the modulo trigger can be skipped for extended periods when concurrency drops suddenly.
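The collapse described above is easy to check numerically. A minimal sketch, assuming a configured interval of 64 (an example value, not the project's default), comparing the formula as written in the diff against the suggested derived floor:

```python
# Numeric check of the adaptive cache-clear interval formulas.

def interval_as_written(clear_cache_interval: int, active_seqs: int) -> int:
    # Formula from the diff: shrinks with concurrency, hardcoded floor of 8.
    return max(8, clear_cache_interval // max(1, active_seqs // 8))

def interval_derived_floor(clear_cache_interval: int, active_seqs: int) -> int:
    # Reviewer's suggestion: floor derived from the operator-configured knob
    # instead of a magic constant.
    return max(clear_cache_interval // 4,
               clear_cache_interval // max(1, active_seqs // 8))

for seqs in (8, 16, 32, 64, 128):
    print(seqs,
          interval_as_written(64, seqs),
          interval_derived_floor(64, seqs))
```

With the hardcoded floor, the interval pins at 8 once concurrency is high enough regardless of the configured value; the derived floor keeps the operator's knob meaningful (floor of 16 when configured to 64).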
waybarrios
left a comment
The PR is good, just minor edits.
pushed a small fix (30c5ad4):
…sion

Cherry-pick upstream merged fixes + local improvements:
- PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal AGXAllocation handles; adaptive cache clear interval scales with concurrency
- PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first 1000 chars to prevent cross-request image cache collisions

Additional fixes:
- batched.py: Disable thinking mode for coder models, promote MLLM stats
- mllm_batch_generator.py: Downgrade prompt size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
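The image-cache collision fix cherry-picked as PR waybarrios#126 is easy to demonstrate. A minimal sketch (function names are illustrative, not the project's actual API): two distinct base64 payloads sharing a 1000-character prefix collide under the old keying scheme but not the new one.

```python
# Sketch of the cache-key change: MD5 over a 1000-char prefix vs.
# SHA-256 over the full payload. old_key/new_key are hypothetical names.
import hashlib

def old_key(b64: str) -> str:
    # Old scheme: any two images sharing the first 1000 chars collide.
    return hashlib.md5(b64[:1000].encode()).hexdigest()

def new_key(b64: str) -> str:
    # New scheme: hash the entire base64 string.
    return hashlib.sha256(b64.encode()).hexdigest()

img_a = "A" * 1000 + "tail-one"
img_b = "A" * 1000 + "tail-two"
print(old_key(img_a) == old_key(img_b))  # True: cross-request collision
print(new_key(img_a) == new_key(img_b))  # False: distinct cache entries
```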
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:
- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots, fast SSE templates, tool injection, cloud routing, prompt cache, etc.) while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fixes #91 — Metal buffer leak under high concurrency.
Root cause:
batch.tokens grows via mx.concatenate() each generation step but is never evaluated, so computation graph nodes hold AGXAllocation handles indefinitely. Under high concurrency this exhausts Metal resource handles.

Changes:
- mx.async_eval(*batch.tokens) after each generation step to eagerly evaluate accumulated token concatenations and release Metal buffers
- mx.eval(*tokens) during the periodic cache clear to collapse any remaining lazy concatenation chains

How it works
Fix A — Eager evaluation (line ~202):
Fix B — Adaptive cache clearing (line ~2239):
Evaluates batch.tokens to collapse lazy chains.

Test plan
ioclasscount | grep AGXAllocation stays stable under sustained high-concurrency load

🤖 Generated with Claude Code