fix: skip RNN snapshots in MTP optimistic mode to prevent memory leak #196
Conversation
Force-pushed from b534f74 to f811002
|
CI green. Fixes a memory leak in MTP optimistic mode — skips RNN state snapshots when they aren't needed for rollback.
|
Evidence from an M2 Ultra 128GB running Qwen3.5-122B-A10B-VLM-MTP-5bit on BatchedEngine with MTP routing:

Test: 20 sequential requests, server RSS delta < 200 MB -- PASS

Without this fix, each MTP optimistic-mode generation creates a full RNN state snapshot that is never freed, growing ~2 GB per 256 generation steps on the 122B. After 20 requests the server would OOM and crash. With the fix (skip RNN snapshots in optimistic mode), RSS stays stable across requests.

The fix is minimal: one conditional check in the MTP generation loop. It has been running 24/7 in production with this applied, with no memory drift observed over multi-day sessions.
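For reference, a sketch of the kind of RSS-delta check described above; the endpoint, payload, and PID are hypothetical stand-ins, not the project's actual test harness:

```python
# Hypothetical RSS-delta check; URL, payload, and PID below are placeholders.
import psutil
import requests

SERVER_PID = 12345                               # serving process PID (placeholder)
URL = "http://localhost:8080/v1/completions"     # placeholder endpoint
payload = {"prompt": "Hello", "max_tokens": 256}

proc = psutil.Process(SERVER_PID)
rss_before = proc.memory_info().rss
for _ in range(20):                              # 20 sequential requests
    requests.post(URL, json=payload, timeout=600)
rss_after = proc.memory_info().rss

delta_mb = (rss_after - rss_before) / 1e6
print(f"RSS delta: {delta_mb:.0f} MB")           # expected < 200 MB with the fix
```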
|
@janhilgard - MTP memory leak fix. RSS grows unbounded without this. Would appreciate a review. |
janhilgard
left a comment
Reviewed the diff. Minimal, targeted fix for a real OOM issue. Gating _rnn_snapshots with if not optimistic: is correct — optimistic mode never rejects, so the state copies are pure waste (~147 MB/step of lazy graph nodes holding Metal buffer references). Including batch.tokens in mx.async_eval to collapse the concatenation chain is a good secondary fix.
Single file, 12 lines changed, clear root cause analysis in the PR description.
|
@waybarrios, status update: CI green, mergeable. @janhilgard formally approved. One file, 12 lines. Skips the RNN state snapshot in MTP optimistic mode, where the verifier can never reject, so the snapshot is pure waste. Worse, the snapshot holds lazy MLX graph references that prevent Metal buffer reuse, which is what causes the leak. Plus a secondary fix that adds batch.tokens to mx.async_eval to collapse the token history concatenation chain. Validated on M2 Ultra 128GB with Qwen3.5-122B-A10B-VLM-MTP-5bit across 20 sequential requests: RSS delta under 200 MB, versus OOM without. Ready to merge whenever convenient.
|
Hey @Thump604 — this is a solid fix, already applied it to our production 122B server. The RNN snapshot skip in optimistic mode is a real memory saver. The PR has merge conflicts with current main, though. Note: the batch.tokens async_eval hunk looks like it's already covered on main by #278, so the optimistic-mode snapshot guard is the part that still needs to land.
|
Force-pushed from f811002 to a28fa7b
|
Restacked on current main. I dropped the already-covered batch.tokens eval hunk and kept only the optimistic-mode RNN snapshot guard, which is the part still missing after #278. Local verification on the rebased branch:
|
Force-pushed from a28fa7b to 380b12d
Summary
In the MTP always-advance decode path, every step copies all recurrent layer states (GatedDeltaNet SSM) for potential rollback on draft token rejection. In optimistic mode, rejection never happens — the copies are pure waste that causes a memory leak.
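For orientation, a minimal sketch of the decode-step shape this describes; the interface names (recurrent_layers, ssm_state, verify_and_advance) and the .copy() call are hypothetical stand-ins, not the actual codebase API:

```python
# Illustrative sketch only: model interface names below are placeholders,
# not the project's real API.

def mtp_decode_step(model, batch, optimistic: bool):
    snapshot = None
    if not optimistic:
        # Verified mode may reject draft tokens, so keep a pre-step copy of
        # every recurrent (SSM) state for potential rollback.
        snapshot = [layer.ssm_state.copy() for layer in model.recurrent_layers]

    accepted, tokens = model.verify_and_advance(batch)

    if snapshot is not None and not accepted:
        # Rejection: roll the recurrent states back to the pre-step copies.
        for layer, state in zip(model.recurrent_layers, snapshot):
            layer.ssm_state = state

    return tokens
```

In optimistic mode the rejection branch is unreachable, so the snapshot list is never read, which is why the guard above removes it entirely.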
Root Cause
Each SSM state is [1, 64, 128, 128] float32 = ~4 MB. With 36 linear attention layers (e.g., Qwen3.5-122B-A10B), that's ~147 MB of .copy() graph nodes per step. The lazy copies hold references to pre-verify Metal buffers, preventing the allocator from freeing them. With a GPU pipeline depth of 2-3 steps, this creates 300-450 MB of memory pressure that grows during long generations.
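As a quick sanity check on those numbers (assuming the stated state shape and 36 layers):

```python
# Back-of-the-envelope check of the per-step snapshot cost described above.
import math

state_shape = (1, 64, 128, 128)               # one GatedDeltaNet SSM state
bytes_per_state = math.prod(state_shape) * 4  # float32 = 4 bytes per element
layers = 36                                   # linear attention layers

per_step = bytes_per_state * layers
print(bytes_per_state / 2**20, "MiB per state")             # 4.0 MiB
print(round(per_step / 1e6), "MB per step")                 # ~151 MB (≈ the ~147 MB above)
print(round(3 * per_step / 1e6), "MB at pipeline depth 3")  # ~453 MB
```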
Observed Impact
On M2 Ultra 128GB with Qwen3.5-122B-A10B (5-bit, ~82 GB weights):
- [METAL] Command buffer execution failed: Insufficient Memory during long generations
Changes
- Gate the RNN state snapshots with if not optimistic: — snapshots are only needed for the verified reject path
- Include batch.tokens in mx.async_eval to collapse the token history concatenation chain (toy sketch below)
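A toy illustration of that second change, assuming the loop builds the token history with mx.concatenate; the shapes and variable names here are made up for the example:

```python
# Toy repro of the lazy concatenation chain; shapes and names are illustrative.
import mlx.core as mx

tokens = mx.zeros((1, 0), dtype=mx.int32)      # stand-in for batch.tokens
for _ in range(256):
    new_tok = mx.ones((1, 1), dtype=mx.int32)  # stand-in for the step's output token
    tokens = mx.concatenate([tokens, new_tok], axis=1)
    # Passing the token history to async_eval collapses the concatenation
    # chain each step instead of letting lazy graph nodes accumulate.
    mx.async_eval(tokens)
```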
Test plan
- Run with --enable-mtp on a hybrid model (Qwen3.5)
- Verify the rollback path still snapshots and restores state with verification enabled (optimistic=False)
- Watch [Metal memory] log lines across sequential requests to confirm usage stays flat