fix: skip RNN snapshots in MTP optimistic mode to prevent memory leak #196

Merged

Thump604 merged 1 commit into waybarrios:main from Thump604:fix/mtp-optimistic-rnn-snapshot-leak on Apr 11, 2026
Conversation

@Thump604
Collaborator

Summary

In the MTP always-advance decode path, every step copies all recurrent layer states (GatedDeltaNet SSM) for potential rollback on draft token rejection. In optimistic mode, rejection never happens — the copies are pure waste that causes a memory leak.

Root Cause

Each SSM state is [1, 64, 128, 128] float32 = ~4 MB. With 36 linear attention layers (e.g., Qwen3.5-122B-A10B), that's ~147 MB of .copy() graph nodes per step. The lazy copies hold references to pre-verify Metal buffers, preventing the allocator from freeing them. With GPU pipeline depth of 2-3 steps, this creates 300-450 MB of memory pressure that grows during long generations.
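
A quick back-of-the-envelope check of those numbers (treat the totals as approximate; the shape and layer count are the ones quoted above, and the exact figure depends on MB vs. MiB rounding and on how many layers actually get snapshotted):

```python
# Per-step snapshot cost, using the shape and layer count from the PR text.
bytes_per_state = 1 * 64 * 128 * 128 * 4   # float32 -> ~4.2 MB per layer
num_linear_attn_layers = 36                # Qwen3.5-122B-A10B per the description
per_step = bytes_per_state * num_linear_attn_layers
print(f"{bytes_per_state / 1e6:.1f} MB per state, {per_step / 1e6:.0f} MB per step")
# -> 4.2 MB per state, 151 MB per step; with 2-3 in-flight pipeline steps,
#    that is the 300-450 MB of pinned pre-verify buffers described above.
```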

Observed Impact

On M2 Ultra 128GB with Qwen3.5-122B-A10B (5-bit, ~82 GB weights):

  • Memory grows ~2 GB per 256 decode steps
  • After ~4000 steps: active=108.7 GB, peak=119.4 GB → Metal OOM crash
  • Server log: [METAL] Command buffer execution failed: Insufficient Memory

Changes

  1. Gate the RNN snapshot loop with if not optimistic: — snapshots are only needed for the verified reject path (a sketch follows this list)
  2. Include batch.tokens in mx.async_eval to collapse the token history concatenation chain
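
For reference, a minimal sketch of what the two changes might look like in the decode step. This is illustrative only: the function, the `recurrent_states` argument, and the use of `.copy()` are placeholders mirroring the description above, not the actual scheduler.py code.

```python
import mlx.core as mx

def mtp_decode_step(model, batch, recurrent_states, optimistic: bool):
    """Hypothetical shape of the MTP always-advance decode step."""
    # Change 1: snapshot recurrent (GatedDeltaNet SSM) states only when the
    # verified path can actually reject draft tokens and needs a rollback
    # target. In optimistic mode the copies would just pin pre-verify Metal
    # buffers and leak memory.
    rnn_snapshots = None
    if not optimistic:
        # .copy() mirrors the lazy copies described in the root cause; the
        # real snapshot mechanism in scheduler.py may differ.
        rnn_snapshots = [state.copy() for state in recurrent_states]

    logits = model(batch)

    # Change 2: evaluate the token-history concatenation chain together with
    # the logits so the lazy graph collapses every step instead of growing.
    mx.async_eval(logits, batch.tokens)
    return logits, rnn_snapshots
```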

Test plan

  • Run 4000+ token generation with --enable-mtp on a hybrid model (Qwen3.5)
  • Verify memory stays flat during decode (should not grow ~2 GB/256 steps)
  • Verify MTP verified mode still works (snapshots still taken when optimistic=False)
  • Monitor with [Metal memory] log lines (a small programmatic helper is sketched below)
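
As a programmatic complement to the log-line check, a helper along these lines can be called every N decode steps. The memory functions shown are the current top-level MLX ones as far as I know; older MLX releases expose the same calls under mx.metal, so adjust to whatever the installed version provides.

```python
import mlx.core as mx

def log_metal_memory(step: int) -> None:
    # Recent MLX: mx.get_active_memory() / mx.get_peak_memory();
    # older releases: mx.metal.get_active_memory() / mx.metal.get_peak_memory().
    active_gb = mx.get_active_memory() / 1e9
    peak_gb = mx.get_peak_memory() / 1e9
    print(f"[Metal memory] step={step} active={active_gb:.1f}GB peak={peak_gb:.1f}GB")
```

With the fix, active memory should plateau after the first few steps; without it, it climbs roughly 2 GB per 256 steps on the 122B model.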

Thump604 force-pushed the fix/mtp-optimistic-rnn-snapshot-leak branch from b534f74 to f811002 on March 22, 2026 at 12:17
@Thump604
Collaborator Author

CI green. Fixes a memory leak in MTP optimistic mode — skips RNN state snapshots when they aren't needed for rollback.

@Thump604
Collaborator Author

Evidence from M2 Ultra 128GB, Qwen3.5-122B-A10B-VLM-MTP-5bit, BatchedEngine with MTP routing:

Test: 20 sequential requests, server RSS delta < 200MB -- PASS

Without this fix, each MTP optimistic-mode generation creates a full RNN state snapshot that is never freed, growing ~2GB per 256 generation steps on the 122B. After 20 requests, the server would OOM and crash.

With the fix (skip RNN snapshots in optimistic mode), RSS stays stable across requests. The fix is minimal: one conditional check in the MTP generation loop.

Running 24/7 in production with this applied. No memory drift observed over multi-day sessions.
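
For anyone reproducing the RSS check, a rough outline is below. The server PID lookup and the request driver are placeholders for however your setup launches the server and sends an MTP generation; only the before/after RSS comparison is the point.

```python
import psutil

def rss_delta_mb(server_pid: int, send_request, n_requests: int = 20) -> float:
    """Return the server's RSS growth (MB) across n_requests generations."""
    proc = psutil.Process(server_pid)
    before = proc.memory_info().rss
    for _ in range(n_requests):
        send_request()  # placeholder: one long --enable-mtp generation
    return (proc.memory_info().rss - before) / 1e6  # expect < 200 with the fix
```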

@Thump604
Collaborator Author

@janhilgard - MTP memory leak fix. RSS grows unbounded without this. Would appreciate a review.

@janhilgard
Collaborator

Reviewed the diff. Minimal, targeted fix for a real OOM issue. Gating _rnn_snapshots with if not optimistic: is correct — optimistic mode never rejects, so the state copies are pure waste (~147 MB/step of lazy graph nodes holding Metal buffer references). Including batch.tokens in mx.async_eval to collapse the concatenation chain is a good secondary fix.

Single file, 12 lines changed, clear root cause analysis in the PR description.

@Thump604
Collaborator Author

Thump604 commented Apr 7, 2026

@waybarrios, status update: CI green, mergeable. @janhilgard formally approved. One file, 12 lines. Skips the RNN state snapshot in MTP optimistic mode, where the verifier can never reject so the snapshot is pure waste. Worse, the snapshot holds lazy MLX graph references that prevent Metal buffer reuse, which is what causes the leak. Plus a secondary fix that adds batch.tokens to the mx.async_eval call to collapse the lazy graph chain earlier.

Validated on M2 Ultra 128GB with Qwen3.5-122B-A10B-VLM-MTP-5bit across 20 sequential requests: RSS delta stayed under 200 MB, versus an OOM crash without the fix.

Ready to merge whenever convenient.

@janhilgard
Collaborator

Hey @Thump604 — this is a solid fix, already applied it to our production 122B server. The RNN snapshot skip in optimistic mode is a real memory saver.

The PR has merge conflicts with current main though (scheduler.py changed in PR #278). Could you rebase on main? The fix itself should apply cleanly — just the if not optimistic: guard around the snapshot loop.

Note: the batch.tokens async_eval change is already covered in main (lines 206-207 do explicit eval after the condition), so you may only need the snapshot guard part.

Thump604 force-pushed the fix/mtp-optimistic-rnn-snapshot-leak branch from f811002 to a28fa7b on April 11, 2026 at 15:59
@Thump604
Collaborator Author

Restacked on current main.

I dropped the already-covered batch.tokens eval hunk and kept only the optimistic-mode RNN snapshot guard, which is the part still missing after #278.

Local verification on the rebased branch:

  • python -m py_compile vllm_mlx/scheduler.py
  • pytest -q tests/test_continuous_batching.py -k 'scheduler_accepts_multiple_requests or scheduler_config_batching_params'

Thump604 force-pushed the fix/mtp-optimistic-rnn-snapshot-leak branch from a28fa7b to 380b12d on April 11, 2026 at 16:00
Thump604 merged commit 575f58b into waybarrios:main on Apr 11, 2026
7 checks passed
@janhilgard
Collaborator

@Thump604 — this is good to merge once you rebase on current main. The only conflict is in scheduler.py which changed in PR #278, but the fix itself (the if not optimistic: guard) should apply cleanly.

We'll merge it to main once it's rebased and CI passes.
