Fix memory leak: close BatchGenerator properly and clear Metal cache #44
Closed
janhilgard wants to merge 1 commit into waybarrios:main from
Conversation
Force-pushed from cbb84af to 0617913
Fix memory leak: close BatchGenerator properly and clear Metal cache periodically

Without this fix, VRAM (wired memory) grows during generation because:

1. BatchGenerator.close() was never called when replacing generators, leaving wired_limit elevated and Metal buffers unreleased
2. Accumulated MLX async operations and intermediate tensors were never explicitly freed during generation
3. No cache cleanup happened in engine_core after requests finished

Changes in scheduler.py:
- Add _close_batch_generator(), which calls .close() before discarding a generator
- Call mx.clear_cache() every 32 scheduler steps (prevents the VRAM spike during generation: ~130GB without, stable ~108GB with)
- Call mx.clear_cache() when requests finish in _cleanup_finished
- Force mx.eval() + mx.clear_cache() after KV cache extraction to prevent a deferred VRAM spike on later access
- Use _close_batch_generator() in reset() and _recover_from_cache_error()
- Add Metal memory stats (active/peak/cache GB) to get_stats()

Changes in engine_core.py:
- Add mx.clear_cache() after distributing finished request outputs

Verified on Qwen3-Next-80B-A3B-6bit (256GB M4 Ultra):
- During generation: stable ~108GB wired (was 130GB+ without the fix)
- After completion: returns to baseline within ~10s
- 5 sequential requests: wired memory bounded, no monotonic growth
- KV cache properly stored for agentic multi-turn reuse

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
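The cache-clearing, forced-evaluation, and memory-stats steps in the commit message above can be sketched as follows. This is a minimal illustration assuming the `mlx.core` API (`mx.clear_cache`, `mx.eval`, `mx.get_active_memory`, `mx.get_peak_memory`, `mx.get_cache_memory`); the helper names and the standalone structure are hypothetical, not the PR's actual code.

```python
# Sketch of the memory-hygiene helpers described in the commit message.
# Helper names are hypothetical; the guarded import lets the control-flow
# logic run even without an Apple-silicon Metal device.
try:
    import mlx.core as mx
except ImportError:  # no MLX available: fall back to no-ops
    mx = None

CLEAR_INTERVAL = 32  # clear the Metal cache every 32 scheduler steps


def maybe_clear_cache(step: int) -> bool:
    """Periodically release cached Metal buffers during generation.

    Returns True when the cache was (or would be) cleared on this step.
    """
    if step % CLEAR_INTERVAL != 0:
        return False
    if mx is not None:
        mx.clear_cache()  # hand cached Metal buffers back to the OS
    return True


def finalize_kv_extraction(kv_arrays):
    """Force pending async MLX work to complete after KV cache extraction,
    then clear the cache, so VRAM does not spike later on first access."""
    if mx is not None:
        mx.eval(kv_arrays)  # materialize lazily evaluated arrays now
        mx.clear_cache()
    return kv_arrays


def metal_memory_stats() -> dict:
    """Metal memory stats in GB, like those the PR adds to get_stats()."""
    if mx is None:
        return {}
    gb = 1024 ** 3
    return {
        "active_gb": mx.get_active_memory() / gb,
        "peak_gb": mx.get_peak_memory() / gb,
        "cache_gb": mx.get_cache_memory() / gb,
    }
```

The step counter, rather than a timer, keeps clearing deterministic relative to generation progress, which matches the "every 32 scheduler steps" wording above.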
Force-pushed from 0617913 to aa817dc
Summary

Fix memory leak by closing BatchGenerator properly and clearing Metal cache periodically.

Root Cause

Three issues caused unbounded VRAM growth:
1. BatchGenerator.close() never called: when the scheduler replaced a BatchGenerator, it set self.batch_generator = None without calling .close(), leaving wired_limit elevated and Metal buffers unreleased.
2. No periodic cache clearing: MLX accumulates command buffers and intermediate tensors during generation. Without explicit mx.clear_cache(), wired memory grows with every generated token.
3. No engine-level cleanup: after the scheduler finished processing requests, the engine core never freed Metal buffers.
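The first issue and its fix amount to a close-before-discard pattern. A minimal sketch, with a hypothetical class shape (the real scheduler.py differs):

```python
# Sketch of the close-before-discard fix for issue 1. The leaky code set
# self.batch_generator = None (or to a new generator) without closing the
# old one; closing first releases Metal buffers and restores wired_limit.
class SchedulerSketch:
    def __init__(self):
        self.batch_generator = None

    def _close_batch_generator(self):
        """Close the current generator before dropping the reference."""
        if self.batch_generator is not None:
            self.batch_generator.close()  # previously never called
            self.batch_generator = None

    def replace_batch_generator(self, new_generator):
        # Before the fix: self.batch_generator = new_generator (leak).
        self._close_batch_generator()
        self.batch_generator = new_generator
```

Routing every discard through one helper also gives reset() and _recover_from_cache_error() a single place to release resources.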
Changes
- scheduler.py: Add _close_batch_generator(), which calls .close() before discarding a generator
- scheduler.py: Call mx.clear_cache() every 32 scheduler steps (key: limits the VRAM spike during generation)
- scheduler.py: Call mx.clear_cache() when requests finish in _cleanup_finished
- scheduler.py: Force mx.eval() + mx.clear_cache() after KV cache extraction (prevents a deferred spike)
- scheduler.py: Use _close_batch_generator() in reset() and _recover_from_cache_error()
- scheduler.py: Add Metal memory stats (active/peak/cache GB) to get_stats()
- engine_core.py: Add mx.clear_cache() after distributing finished request outputs
- tests/test_memory_stability.py: New memory stability test

Test Results
Qwen3-Next-80B-A3B-6bit on 256GB M3 Ultra, ~5K token contexts:
During generation (wired memory): stable ~108GB, versus 130GB+ without the fix; returns to baseline within ~10s after completion.

Multi-request stability (5 sequential requests):
Memory returns to baseline after each request — no monotonic growth.
Test plan

Run the new memory stability test (tests/test_memory_stability.py).

🤖 Generated with Claude Code
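The bounded-memory property the test plan verifies can be expressed as a simple check. This is a hypothetical sketch of the kind of assertion tests/test_memory_stability.py might make; the real test measures wired memory on-device, and the 2GB tolerance is an assumed threshold, not taken from the PR.

```python
def memory_is_bounded(wired_gb_samples, tolerance_gb=2.0):
    """True when every post-request measurement stays within tolerance_gb
    of the first (baseline) sample, i.e. no monotonic growth across
    sequential requests. tolerance_gb is a hypothetical threshold."""
    baseline = wired_gb_samples[0]
    return all(s <= baseline + tolerance_gb for s in wired_gb_samples[1:])
```

For example, samples hovering around a ~108GB baseline pass, while a run that climbs a few GB per request fails.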