Fix Metal resource leak under high concurrency#92
waybarrios merged 2 commits into waybarrios:main
Conversation
Addresses a Metal buffer leak where batch.tokens grows via mx.concatenate() each generation step without evaluation, causing computation graph nodes to hold AGXAllocation handles indefinitely.

Changes:
- Add mx.async_eval(*batch.tokens) after each generation step to eagerly evaluate accumulated token concatenations and release Metal buffers
- Make the cache clear interval adaptive: it scales inversely with the active sequence count (minimum interval 8) to prevent Metal resource handle exhaustion under high-concurrency workloads
- Add an explicit mx.eval(*tokens) during the periodic cache clear to collapse any remaining lazy concatenation chains before clearing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
I've tried this and it seems to be working. Without the fix, two 4B Qwen models loaded tanked my memory (48 GB); with the fix, usage is around 6 GB. Can you please check and merge if it looks good?
2 issues found. Do you mind taking a look, @janhilgard? Other than that, the PR is good!
    # from lazy mx.concatenate() chains holding AGXAllocation handles
    if batch.tokens:
        mx.async_eval(*batch.tokens)
mx.async_eval(*batch.tokens) may be redundant here. batch.tokens[i] is built via mx.concatenate((toks, y[i:i+1])) which depends on batch.y. Two lines below, y = y.tolist() forces synchronous evaluation of y, which cascades to evaluate these token concatenation chains. This async_eval either (a) is a no-op since the sync point already forces evaluation, or (b) inserts extra async work between the batch.y async eval above and its consumption via y.tolist(), potentially causing premature materialization of shared computation graph nodes.
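The redundancy argument can be illustrated with a toy lazy-evaluation model (pure Python, not MLX; the class and names below are invented for illustration): once a downstream sync point such as tolist() forces the chain, every upstream node is materialized anyway, so an earlier explicit eval of the same chain adds nothing.

```python
# Toy model of lazy concatenation (NOT MLX). `materialized` stands in for
# a Metal buffer; `tolist()` is the sync point analogous to y.tolist().

class LazyConcat:
    def __init__(self, parts):
        self.parts = parts          # upstream lazy nodes or concrete lists
        self.materialized = None    # filled in only when evaluated

    def eval(self):
        if self.materialized is None:
            out = []
            for p in self.parts:
                out.extend(p.eval() if isinstance(p, LazyConcat) else p)
            self.materialized = out
        return self.materialized

    def tolist(self):
        # Sync point: forces the full chain, cascading to upstream nodes.
        return self.eval()

toks = LazyConcat([[1, 2]])
chain = LazyConcat([toks, [3]])   # like mx.concatenate((toks, y[i:i+1]))
result = chain.tolist()           # one sync eval materializes `toks` too
print(result, toks.materialized)  # both nodes are now materialized
```

In this model, calling an extra eval() on chain before tolist() changes nothing, which is the (a) no-op case the review describes.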
vllm_mlx/scheduler.py
Outdated
    # Metal resource handle exhaustion under high-concurrency workloads.
    active_seqs = len(self.running)
    effective_interval = max(
        8, self._clear_cache_interval // max(1, active_seqs // 8)
    )
The hardcoded floor of 8 causes effective_interval to collapse to 8 for any active_seqs >= 32 (since 32 // (32//8) = 8) — a 4x increase in cache-clear frequency. The comment at lines 1051-1052 documents _clear_cache_interval as the operator-configurable knob, but this formula bypasses it. Consider deriving the floor from the configured value (e.g. self._clear_cache_interval // 4) instead of a magic constant. Also, recalculating effective_interval every step() with a monotonic _step_count means the modulo trigger can be skipped for extended periods when concurrency drops suddenly.
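The collapse described above is easy to check numerically. A minimal sketch, assuming a configured interval of 64 (an example value, not the project's default), comparing the formula as written in the diff against the suggested derived floor:

```python
# Numeric check of the adaptive cache-clear interval formulas.

def interval_as_written(clear_cache_interval: int, active_seqs: int) -> int:
    # Formula from the diff: shrinks with concurrency, hardcoded floor of 8.
    return max(8, clear_cache_interval // max(1, active_seqs // 8))

def interval_derived_floor(clear_cache_interval: int, active_seqs: int) -> int:
    # Reviewer's suggestion: floor derived from the operator-configured knob
    # instead of a magic constant.
    return max(clear_cache_interval // 4,
               clear_cache_interval // max(1, active_seqs // 8))

for seqs in (8, 16, 32, 64, 128):
    print(seqs,
          interval_as_written(64, seqs),
          interval_derived_floor(64, seqs))
```

With the hardcoded floor, the interval pins at 8 once concurrency is high enough regardless of the configured value; the derived floor keeps the operator's knob meaningful (floor of 16 when configured to 64).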
waybarrios
left a comment
The PR is good, just minor edits.
pushed a small fix (30c5ad4):
…sion

Cherry-pick upstream merged fixes + local improvements:
- PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal AGXAllocation handles; adaptive cache clear interval scales with concurrency
- PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first 1000 chars to prevent cross-request image cache collisions

Additional fixes:
- batched.py: Disable thinking mode for coder models, promote MLLM stats
- mllm_batch_generator.py: Downgrade prompt size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
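The image-cache collision fix cherry-picked as PR waybarrios#126 is easy to demonstrate. A minimal sketch (function names are illustrative, not the project's actual API): two distinct base64 payloads sharing a 1000-character prefix collide under the old keying scheme but not the new one.

```python
# Sketch of the cache-key change: MD5 over a 1000-char prefix vs.
# SHA-256 over the full payload. old_key/new_key are hypothetical names.
import hashlib

def old_key(b64: str) -> str:
    # Old scheme: any two images sharing the first 1000 chars collide.
    return hashlib.md5(b64[:1000].encode()).hexdigest()

def new_key(b64: str) -> str:
    # New scheme: hash the entire base64 string.
    return hashlib.sha256(b64.encode()).hexdigest()

img_a = "A" * 1000 + "tail-one"
img_b = "A" * 1000 + "tail-two"
print(old_key(img_a) == old_key(img_b))  # True: cross-request collision
print(new_key(img_a) == new_key(img_b))  # False: distinct cache entries
```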
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:
- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots, fast SSE templates, tool injection, cloud routing, prompt cache, etc.) while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fixes #91 — Metal buffer leak under high concurrency.
Root cause:
batch.tokens grows via mx.concatenate() each generation step but is never evaluated, so computation graph nodes hold AGXAllocation handles indefinitely. Under high concurrency this exhausts Metal resource handles.

Changes:
- mx.async_eval(*batch.tokens) after each generation step to eagerly evaluate accumulated token concatenations and release Metal buffers
- mx.eval(*tokens) during the periodic cache clear to collapse any remaining lazy concatenation chains

How it works
Fix A — Eager evaluation (line ~202):
Fix B — Adaptive cache clearing (line ~2239):
Evaluates batch.tokens to collapse lazy chains.

Test plan
ioclasscount | grep AGXAllocation stays stable under sustained high-concurrency load

🤖 Generated with Claude Code