fix: break shared-buffer memory leak in GatedDeltaNet cache#1077
angeloskath merged 4 commits into ml-explore:main
Conversation
GatedDeltaNet stores cache entries as slices of larger parent tensors:
cache[0] = conv_input[:, -2:] # 2-position slice of (T+2) tensor
cache[1] = state # kernel output sharing input buffers
MLX slices share the parent array's Data buffer via shared_ptr. During
multi-chunk prefill, each chunk's tiny cache slice keeps the entire
parent conv_input/state alive, which in turn pins the full forward pass
computation graph in memory. This causes ~540 KB/tok memory growth
that limits Qwen3.5-397B to ~24K tokens on 128GB before OOM.
mx.contiguous() creates an independent copy with its own buffer,
breaking the reference chain and allowing the parent to be freed.
Before: ~540 KB/tok growth, 24K token ceiling on 128GB
After: ~20 KB/tok growth, 50K+ tokens verified, 100K+ feasible
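The shared-buffer behavior described above can be demonstrated with a NumPy analogy (this is not the actual MLX code, but NumPy slices likewise view the parent buffer): a tiny slice pins the entire parent allocation until an explicit copy is made.

```python
import numpy as np

# NumPy analogy (not the actual MLX code): slices are views that keep
# the whole parent buffer alive via their .base reference.
parent = np.zeros((1, 4096, 1024), dtype=np.float32)  # large "conv_input"

view = parent[:, -2:]          # tiny 2-position slice
assert view.base is parent     # slice pins the entire parent allocation

copy = parent[:, -2:].copy()   # independent buffer, like mx.contiguous()
assert copy.base is None       # parent can now be freed
```

The mechanism differs in detail (MLX tracks the parent via a `shared_ptr` on the Data buffer rather than a `.base` attribute), but the lifetime consequence is the same: the parent cannot be released while any slice of it is retained.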
Thump604 left a comment:
I run Qwen3.5-122B-A10B (hybrid: 12 full-attention + 36 GatedDeltaNet layers) in production on M2 Ultra 128GB with continuous batching and MTP.
This is a real leak I have hit in production. The cache slice holds a shared_ptr reference back to the full parent tensor from the forward pass, preventing deallocation during multi-chunk prefill. At ~540 KB/tok, a 60K prefill leaks ~32GB on top of the ~82GB model weights. On a 128GB machine that is fatal.
I independently applied the same fix to my local fork (my GDN code is refactored into a _process_chunk method but the cache assignment site has the identical leak). mx.contiguous() is the correct approach since it materializes an independent copy, breaking the reference chain. No regressions observed after applying.
Good catch. This should merge quickly given how many people are running Qwen3.5 hybrid models at long context.
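As a quick sanity check on the figures quoted in the comment above (a ~540 KB/tok leak over a 60K-token prefill), the arithmetic does land in the low 30s of GB, consistent with the ~32GB claim:

```python
# Figures quoted in the comment above (assumed: 540 KB/tok, 60K-token prefill)
leak_kib_per_tok = 540
prefill_tokens = 60_000

leaked_gib = leak_kib_per_tok * prefill_tokens / 1024**2  # KiB -> GiB
print(f"{leaked_gib:.1f} GiB")  # prints "30.9 GiB"
```

On top of ~82GB of model weights, that indeed exhausts a 128GB machine well before the prefill completes.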
Remove one unnecessary contiguous call and the slightly misleading comment.
Summary
mx.contiguous() on cache[0] and cache[1] in GatedDeltaNet.__call__ to break a shared-buffer memory leak during multi-chunk prefill.

Problem
GatedDeltaNet stores cache entries as slices of larger parent tensors:
MLX slices share the parent array's Data buffer via shared_ptr. During multi-chunk prefill, each chunk's tiny cache slice keeps the entire parent conv_input/state alive, which in turn pins the full forward-pass computation graph in memory.

This causes ~540 KB/tok memory growth during prefill, limiting Qwen3.5-397B-A17B-4bit to ~24K tokens on a 128GB M4 Max before OOM.
Fix
mx.contiguous() creates an independent copy with its own buffer, breaking the reference chain and allowing the parent to be freed.

Results
Tested on 2× M4 Max 128GB in pipeline-parallel with Qwen3.5-397B-A17B-4bit:
No throughput regression: prefill tok/s remains 415-480, and generation output is identical.
Test plan