Fix memory leak: close BatchGenerator properly and clear Metal cache #44

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:fix/memory-management-cache-clearing

Conversation


@janhilgard janhilgard commented Feb 5, 2026

Summary

  • Fix VRAM (wired memory) growth during generation by properly closing BatchGenerator and clearing Metal cache periodically
  • Without this fix, wired memory spikes from ~70GB to 130GB+ during generation on large contexts
  • With fix: stable ~108GB during generation, returns to baseline within ~10s after completion
  • KV cache is preserved for agentic multi-turn reuse

Root Cause

Three issues caused unbounded VRAM growth:

  1. BatchGenerator.close() never called — when the scheduler replaced a BatchGenerator, it set self.batch_generator = None without calling .close(), leaving wired_limit elevated and Metal buffers unreleased

  2. No periodic cache clearing — MLX accumulates command buffers and intermediate tensors during generation. Without explicit mx.clear_cache(), wired memory grows with every generated token

  3. No engine-level cleanup — after the scheduler finishes processing requests, the engine core never freed Metal buffers
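The first root cause above is a close-before-discard bug. A minimal sketch of the fix pattern, using a stand-in `BatchGenerator` class (the real one wraps MLX/Metal state — this is illustrative, not the project's actual code):

```python
# Sketch of the close-before-discard pattern. BatchGenerator here is a
# stand-in: the real class holds Metal buffers and an elevated
# wired_limit that are only released via .close().

class BatchGenerator:
    def __init__(self):
        self.closed = False

    def close(self):
        # In the real generator this restores wired_limit and frees
        # Metal buffers; here we just record that cleanup happened.
        self.closed = True


class Scheduler:
    def __init__(self):
        self.batch_generator = None

    def _close_batch_generator(self):
        # Close the old generator before dropping the reference,
        # instead of the leaky bare `self.batch_generator = None`.
        if self.batch_generator is not None:
            self.batch_generator.close()
            self.batch_generator = None
```

Dropping the reference without `.close()` leaves the buffers alive until (at best) garbage collection, which for wired GPU memory means they effectively never come back.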

Changes

| File | Change |
| --- | --- |
| scheduler.py | Add `_close_batch_generator()` — calls `.close()` before discarding |
| scheduler.py | `mx.clear_cache()` every 32 scheduler steps (key: limits VRAM spike during generation) |
| scheduler.py | `mx.clear_cache()` when requests finish in `_cleanup_finished` |
| scheduler.py | `mx.eval()` + `mx.clear_cache()` after KV cache extraction (prevents deferred spike) |
| scheduler.py | Use `_close_batch_generator()` in `reset()` and `_recover_from_cache_error()` |
| scheduler.py | Add Metal memory stats (active/peak/cache GB) to `get_stats()` |
| engine_core.py | `mx.clear_cache()` after distributing finished request outputs |
| tests/test_memory_stability.py | Unit tests for all memory management paths |
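The periodic-clearing row above amounts to a modulo counter in the scheduler loop. A hedged sketch, with the cache-clearing call injected as a callable so it runs without MLX installed (in the real scheduler this would be `mx.clear_cache()`; the interval of 32 matches the table above):

```python
# Sketch of periodic cache clearing during a scheduler loop.
# `clear_cache` stands in for mx.clear_cache().

CLEAR_INTERVAL = 32  # clear every 32 scheduler steps, per the change above

def run_steps(num_steps, clear_cache):
    """Run `num_steps` scheduler steps, clearing the cache periodically.

    Returns how many times the cache was cleared.
    """
    cleared = 0
    for step in range(1, num_steps + 1):
        # ... batch generation work would happen here ...
        if step % CLEAR_INTERVAL == 0:
            clear_cache()  # frees accumulated command buffers / tensors
            cleared += 1
    return cleared
```

The interval is a throughput/memory trade-off: clearing every step would stall the Metal command stream, while never clearing lets intermediate buffers accumulate for the whole generation.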

Test Results

Qwen3-Next-80B-A3B-6bit on 256GB M3 Ultra, ~5K token contexts:

During generation (wired memory):

| Without fix | With fix |
| --- | --- |
| 70 → 132 GB (spike) | 107 → 109 GB (stable) |

Multi-request stability (5 sequential requests):

| Request | Before | Peak (during gen) | +10s after |
| --- | --- | --- | --- |
| 1 | 45.32 GB | 106.45 GB | 45.67 GB |
| 2 | 45.67 GB | 106.57 GB | 45.67 GB |
| 3 | 45.67 GB | 106.68 GB | 45.67 GB |
| 4 | 45.67 GB | 106.81 GB | 45.67 GB |
| 5 | 45.67 GB | 106.92 GB | 30.37 GB |

Memory returns to baseline after each request — no monotonic growth.

Test plan

  • Unit tests pass (tests/test_memory_stability.py)
  • VRAM spike during generation reduced from +62GB to +2GB
  • Multi-request: wired memory bounded, returns to baseline
  • KV cache properly stored for agentic multi-turn
  • Throughput unchanged (~62-64 tok/s)
  • CI passes

🤖 Generated with Claude Code

@janhilgard force-pushed the fix/memory-management-cache-clearing branch from cbb84af to 0617913 on February 6, 2026 at 17:32
@janhilgard changed the title from "Fix memory spikes during generation with periodic mx.clear_cache()" to "Fix memory leak: close BatchGenerator properly and clear Metal cache" on Feb 6, 2026

Fix memory leak: close BatchGenerator properly and clear Metal cache periodically

Without this fix, VRAM (wired memory) grows during generation because:
1. BatchGenerator.close() was never called when replacing generators,
   leaving wired_limit elevated and Metal buffers unreleased
2. Accumulated MLX async operations and intermediate tensors were never
   explicitly freed during generation
3. No cache cleanup in engine_core after requests finish

Changes in scheduler.py:
- Add _close_batch_generator() that calls .close() before discarding
- Call mx.clear_cache() every 32 scheduler steps (keeps VRAM stable at
  ~108GB during generation instead of spiking to ~130GB)
- Call mx.clear_cache() when requests finish in _cleanup_finished
- Force mx.eval() + mx.clear_cache() after KV cache extraction to
  prevent deferred VRAM spike during later access
- Use _close_batch_generator() in reset() and _recover_from_cache_error()
- Add Metal memory stats (active/peak/cache GB) to get_stats()

Changes in engine_core.py:
- Add mx.clear_cache() after distributing finished request outputs

Verified on Qwen3-Next-80B-A3B-6bit (256GB M4 Ultra):
- During generation: stable ~108GB wired (was 130GB+ without fix)
- After completion: returns to baseline within ~10s
- 5 sequential requests: wired memory bounded, no monotonic growth
- KV cache properly stored for agentic multi-turn reuse

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard force-pushed the fix/memory-management-cache-clearing branch from 0617913 to aa817dc on February 6, 2026 at 18:09
@janhilgard

Closing — all changes from this PR were already merged into main via PR #46 (b191aec).

