
Fix Metal resource leak under high concurrency #92

Merged
waybarrios merged 2 commits into waybarrios:main from janhilgard:fix/metal-resource-leak
Mar 12, 2026

Conversation

@janhilgard
Collaborator

Summary

Fixes #91 — Metal buffer leak under high concurrency.

Root cause: batch.tokens grows via mx.concatenate() each generation step but is never evaluated, so computation graph nodes hold AGXAllocation handles indefinitely. Under high concurrency this exhausts Metal resource handles.

Changes:

  • Add mx.async_eval(*batch.tokens) after each generation step to eagerly evaluate accumulated token concatenations and release Metal buffers
  • Make cache clear interval adaptive: scales inversely with active sequence count (min interval 8, base interval 32) to prevent Metal resource handle exhaustion under high-concurrency workloads
  • Add explicit mx.eval(*tokens) during periodic cache clear to collapse any remaining lazy concatenation chains

How it works

Fix A — Eager evaluation (line ~202):

mx.async_eval(batch.y, batch.logprobs)
# NEW: evaluate accumulated tokens to prevent Metal buffer buildup
if batch.tokens:
    mx.async_eval(*batch.tokens)

Fix B — Adaptive cache clearing (line ~2239):

  • Base interval: 32 steps (unchanged for single-user)
  • At 16+ concurrent sequences: interval halves to 16
  • At 32+ concurrent sequences: interval drops to 8 (minimum)
  • Each clear also evaluates batch.tokens to collapse lazy chains

Test plan

  • Verify ioclasscount | grep AGXAllocation stays stable under sustained high-concurrency load
  • Benchmark single-user throughput (should be unchanged)
  • Benchmark multi-user throughput with 8+ concurrent requests
  • Verify no regression in generation quality

🤖 Generated with Claude Code

Addresses Metal buffer leak where batch.tokens grows via mx.concatenate()
each generation step without evaluation, causing computation graph nodes
to hold AGXAllocation handles indefinitely.

Changes:
- Add mx.async_eval(*batch.tokens) after each generation step to eagerly
  evaluate accumulated token concatenations and release Metal buffers
- Make cache clear interval adaptive: scales inversely with active
  sequence count (min interval 8) to prevent Metal resource handle
  exhaustion under high-concurrency workloads
- Add explicit mx.eval(*tokens) during periodic cache clear to collapse
  any remaining lazy concatenation chains before clearing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dinnyosz

dinnyosz commented Mar 3, 2026

I've tried this and it seems to be working. Without the fix, loading two 4B Qwen models tanked my memory (48GB); with the fix, usage is around 6GB. Can you please check and merge if it looks good?

@waybarrios waybarrios self-requested a review March 12, 2026 20:08
@waybarrios waybarrios added the bug (Something isn't working) and UNDER REVIEW labels Mar 12, 2026
@waybarrios waybarrios self-assigned this Mar 12, 2026
Owner

@waybarrios waybarrios left a comment


2 issues found. Do you mind taking a look, @janhilgard? Other than that, the PR is good!

# from lazy mx.concatenate() chains holding AGXAllocation handles
if batch.tokens:
    mx.async_eval(*batch.tokens)

Owner


mx.async_eval(*batch.tokens) may be redundant here. batch.tokens[i] is built via mx.concatenate((toks, y[i:i+1])) which depends on batch.y. Two lines below, y = y.tolist() forces synchronous evaluation of y, which cascades to evaluate these token concatenation chains. This async_eval either (a) is a no-op since the sync point already forces evaluation, or (b) inserts extra async work between the batch.y async eval above and its consumption via y.tolist(), potentially causing premature materialization of shared computation graph nodes.

# Metal resource handle exhaustion under high-concurrency workloads.
active_seqs = len(self.running)
effective_interval = max(
    8, self._clear_cache_interval // max(1, active_seqs // 8)
)
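For reference, here is how that expression scales (a standalone sketch in plain Python; the function wrapper and the base value of 32 are assumptions for illustration):

```python
def effective_interval(active_seqs: int, base_interval: int = 32) -> int:
    # Same arithmetic as the diff above: halve the interval for every
    # extra 8 active sequences, with a hardcoded floor of 8.
    return max(8, base_interval // max(1, active_seqs // 8))

print(effective_interval(1))   # 32: base interval unchanged
print(effective_interval(16))  # 16: 32 // (16 // 8)
print(effective_interval(32))  # 8: floor reached, stays 8 beyond this
```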
Owner


The hardcoded floor of 8 causes effective_interval to collapse to 8 for any active_seqs >= 32 (since 32 // (32//8) = 8) — a 4x increase in cache-clear frequency. The comment at lines 1051-1052 documents _clear_cache_interval as the operator-configurable knob, but this formula bypasses it. Consider deriving the floor from the configured value (e.g. self._clear_cache_interval // 4) instead of a magic constant. Also, recalculating effective_interval every step() with a monotonic _step_count means the modulo trigger can be skipped for extended periods when concurrency drops suddenly.
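A sketch of the suggested change (a hypothetical standalone function, not a patch; it derives the floor from the configured interval rather than hardcoding 8):

```python
def effective_interval(active_seqs: int, base_interval: int) -> int:
    # Floor derived from the operator-configured interval (base // 4),
    # so raising the configured knob also raises the minimum clear interval.
    floor = max(1, base_interval // 4)
    return max(floor, base_interval // max(1, active_seqs // 8))

print(effective_interval(64, base_interval=32))   # 8: unchanged at the default
print(effective_interval(64, base_interval=128))  # 32: floor follows the config
```

With the hardcoded constant, an operator who raises `_clear_cache_interval` to reduce clear overhead would still be clamped to clearing every 8 steps under load; deriving the floor keeps the knob meaningful.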

waybarrios

This comment was marked as resolved.

@waybarrios waybarrios self-requested a review March 12, 2026 20:20
Owner

@waybarrios waybarrios left a comment


The PR is good, just minor edits.

@waybarrios
Owner

pushed a small fix (30c5ad4):

  • removed the mx.async_eval(*batch.tokens) — turns out y.tolist() right below already forces sync eval of the token chains, so it was redundant
  • changed the interval floor from a hardcoded 8 to self._clear_cache_interval // 4 so it respects whatever the operator configures

@waybarrios waybarrios merged commit 22dcbf8 into waybarrios:main Mar 12, 2026
7 checks passed
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Mar 21, 2026
…sion

Cherry-pick upstream merged fixes + local improvements:

- PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal
  AGXAllocation handles; adaptive cache clear interval scales with concurrency
- PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent
  linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to
  MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths
  in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first
  1000 chars to prevent cross-request image cache collisions

Additional fixes:
- batched.py: Disable thinking mode for coder models, promote MLLM stats
- mllm_batch_generator.py: Downgrade prompt size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai pushed a commit to raullenchai/Rapid-MLX that referenced this pull request Mar 26, 2026
…ection, served-model-name

Merge 16 upstream commits (22dcbf8..d235c37) into our fork:

- feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)
- feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE (waybarrios#150)
- fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection (waybarrios#97)
- feat: Add --served-model-name CLI parameter (waybarrios#125)
- feat: Add Qwen3.5 text-only loading and dynamic memory threshold (waybarrios#127)
- fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)
- fix: Metal resource leak under high concurrency (waybarrios#92)

Conflict resolution strategy: keep all fork features (DeltaNet snapshots,
fast SSE templates, tool injection, cloud routing, prompt cache, etc.)
while incorporating upstream's new functionality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
