Fix memory leak: close BatchGenerator properly and clear Metal cache #44
Closed
janhilgard wants to merge 1 commit into waybarrios:main from
Conversation
Force-pushed from cbb84af to 0617913
Fix memory leak: close BatchGenerator properly and clear Metal cache periodically

Without this fix, VRAM (wired memory) grows during generation because:

1. BatchGenerator.close() was never called when replacing generators, leaving wired_limit elevated and Metal buffers unreleased
2. Accumulated MLX async operations and intermediate tensors were never explicitly freed during generation
3. No cache cleanup happened in engine_core after requests finished

Changes in scheduler.py:
- Add _close_batch_generator(), which calls .close() before discarding a generator
- Call mx.clear_cache() every 32 scheduler steps (prevents the VRAM spike during generation: ~130GB without, stable ~108GB with)
- Call mx.clear_cache() when requests finish in _cleanup_finished
- Force mx.eval() + mx.clear_cache() after KV cache extraction to prevent a deferred VRAM spike on later access
- Use _close_batch_generator() in reset() and _recover_from_cache_error()
- Add Metal memory stats (active/peak/cache GB) to get_stats()

Changes in engine_core.py:
- Add mx.clear_cache() after distributing finished request outputs

Verified on Qwen3-Next-80B-A3B-6bit (256GB M4 Ultra):
- During generation: stable ~108GB wired (was 130GB+ without the fix)
- After completion: returns to baseline within ~10s
- 5 sequential requests: wired memory bounded, no monotonic growth
- KV cache properly stored for agentic multi-turn reuse

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
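The cache-clearing, forced-evaluation, and memory-stats steps in the commit message above can be sketched as follows. This is a minimal illustration assuming the `mlx.core` API (`mx.clear_cache`, `mx.eval`, `mx.get_active_memory`, `mx.get_peak_memory`, `mx.get_cache_memory`); the helper names and the standalone structure are hypothetical, not the PR's actual code.

```python
# Sketch of the memory-hygiene helpers described in the commit message.
# Helper names are hypothetical; the guarded import lets the control-flow
# logic run even without an Apple-silicon Metal device.
try:
    import mlx.core as mx
except ImportError:  # no MLX available: fall back to no-ops
    mx = None

CLEAR_INTERVAL = 32  # clear the Metal cache every 32 scheduler steps


def maybe_clear_cache(step: int) -> bool:
    """Periodically release cached Metal buffers during generation.

    Returns True when the cache was (or would be) cleared on this step.
    """
    if step % CLEAR_INTERVAL != 0:
        return False
    if mx is not None:
        mx.clear_cache()  # hand cached Metal buffers back to the OS
    return True


def finalize_kv_extraction(kv_arrays):
    """Force pending async MLX work to complete after KV cache extraction,
    then clear the cache, so VRAM does not spike later on first access."""
    if mx is not None:
        mx.eval(kv_arrays)  # materialize lazily evaluated arrays now
        mx.clear_cache()
    return kv_arrays


def metal_memory_stats() -> dict:
    """Metal memory stats in GB, like those the PR adds to get_stats()."""
    if mx is None:
        return {}
    gb = 1024 ** 3
    return {
        "active_gb": mx.get_active_memory() / gb,
        "peak_gb": mx.get_peak_memory() / gb,
        "cache_gb": mx.get_cache_memory() / gb,
    }
```

The step counter, rather than a timer, keeps clearing deterministic relative to generation progress, which matches the "every 32 scheduler steps" wording above.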
Force-pushed from 0617913 to aa817dc
Summary

Fix memory leak by closing BatchGenerator properly and clearing Metal cache periodically.

Root Cause

Three issues caused unbounded VRAM growth:
1. BatchGenerator.close() never called: when the scheduler replaced a BatchGenerator, it set self.batch_generator = None without calling .close(), leaving wired_limit elevated and Metal buffers unreleased.
2. No periodic cache clearing: MLX accumulates command buffers and intermediate tensors during generation. Without explicit mx.clear_cache(), wired memory grows with every generated token.
3. No engine-level cleanup: after the scheduler finished processing requests, the engine core never freed Metal buffers.
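The first issue and its fix amount to a close-before-discard pattern. A minimal sketch, with a hypothetical class shape (the real scheduler.py differs):

```python
# Sketch of the close-before-discard fix for issue 1. The leaky code set
# self.batch_generator = None (or to a new generator) without closing the
# old one; closing first releases Metal buffers and restores wired_limit.
class SchedulerSketch:
    def __init__(self):
        self.batch_generator = None

    def _close_batch_generator(self):
        """Close the current generator before dropping the reference."""
        if self.batch_generator is not None:
            self.batch_generator.close()  # previously never called
            self.batch_generator = None

    def replace_batch_generator(self, new_generator):
        # Before the fix: self.batch_generator = new_generator (leak).
        self._close_batch_generator()
        self.batch_generator = new_generator
```

Routing every discard through one helper also gives reset() and _recover_from_cache_error() a single place to release resources.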
Changes
- scheduler.py: Add _close_batch_generator(), which calls .close() before discarding a generator
- scheduler.py: Call mx.clear_cache() every 32 scheduler steps (key: limits the VRAM spike during generation)
- scheduler.py: Call mx.clear_cache() when requests finish in _cleanup_finished
- scheduler.py: Force mx.eval() + mx.clear_cache() after KV cache extraction (prevents a deferred spike)
- scheduler.py: Use _close_batch_generator() in reset() and _recover_from_cache_error()
- scheduler.py: Add Metal memory stats (active/peak/cache GB) to get_stats()
- engine_core.py: Add mx.clear_cache() after distributing finished request outputs
- tests/test_memory_stability.py: New memory stability test

Test Results
Qwen3-Next-80B-A3B-6bit on 256GB M3 Ultra, ~5K token contexts:
During generation (wired memory): stable ~108GB, versus 130GB+ without the fix; returns to baseline within ~10s after completion.

Multi-request stability (5 sequential requests):
Memory returns to baseline after each request — no monotonic growth.
Test plan

Run the new memory stability test (tests/test_memory_stability.py).

🤖 Generated with Claude Code
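The bounded-memory property the test plan verifies can be expressed as a simple check. This is a hypothetical sketch of the kind of assertion tests/test_memory_stability.py might make; the real test measures wired memory on-device, and the 2GB tolerance is an assumed threshold, not taken from the PR.

```python
def memory_is_bounded(wired_gb_samples, tolerance_gb=2.0):
    """True when every post-request measurement stays within tolerance_gb
    of the first (baseline) sample, i.e. no monotonic growth across
    sequential requests. tolerance_gb is a hypothetical threshold."""
    baseline = wired_gb_samples[0]
    return all(s <= baseline + tolerance_gb for s in wired_gb_samples[1:])
```

For example, samples hovering around a ~108GB baseline pass, while a run that climbs a few GB per request fails.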