
fix: break shared-buffer memory leak in GatedDeltaNet cache#1077

Merged
angeloskath merged 4 commits into ml-explore:main from adurham:fix/deltanet-contiguous-cache-leak
Apr 1, 2026

Conversation

@adurham (Contributor) commented Mar 31, 2026

Summary

  • Add mx.contiguous() on cache[0] and cache[1] in GatedDeltaNet.__call__ to break a shared-buffer memory leak during multi-chunk prefill

Problem

GatedDeltaNet stores cache entries as slices of larger parent tensors:

    cache[0] = conv_input[:, -2:]   # 2-position slice of (T+2) tensor
    cache[1] = state                # kernel output sharing input buffers

MLX slices share the parent array's Data buffer via shared_ptr. During multi-chunk prefill, each chunk's tiny cache slice keeps the entire parent conv_input/state alive, which in turn pins the full forward pass computation graph in memory.

This causes ~540 KB/tok memory growth during prefill, limiting Qwen3.5-397B-A17B-4bit to ~24K tokens on 128GB M4 Max before OOM.
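The retention pattern described above is easy to reproduce with NumPy views, which keep their parent array alive through `.base` much like an MLX slice holds the parent's Data buffer via shared_ptr. A minimal sketch of the analogy (NumPy here, not MLX; array sizes are illustrative):

```python
import gc
import weakref

import numpy as np

# A basic slice is a view whose .base reference pins the entire parent
# buffer, analogous to the MLX cache slice pinning conv_input.
parent = np.zeros((1, 1024, 64))   # stands in for the large conv_input
alive = weakref.ref(parent)

view = parent[:, -2:]              # tiny 2-position slice, like cache[0]
del parent
gc.collect()
assert alive() is not None         # full parent still pinned by the view

owned = view.copy()                # analogous to mx.contiguous(): own buffer
del view
gc.collect()
assert alive() is None             # parent buffer can finally be freed
```

The `.copy()` at the end plays the role of `mx.contiguous()`: once the slice owns its own storage, nothing references the parent and it can be deallocated.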

Fix

mx.contiguous() creates an independent copy with its own buffer, breaking the reference chain and allowing the parent to be freed.

Results

Tested on 2× M4 Max 128GB in pipeline-parallel with Qwen3.5-397B-A17B-4bit:

| Metric | Before | After |
| --- | --- | --- |
| Memory growth | ~540 KB/tok | ~20 KB/tok |
| 16K prefill | 124.7 GB | 117.8 GB |
| 24K prefill | 129.7 GB (near OOM) | 119.3 GB |
| 32K prefill | OOM | 121.0 GB |
| 50K prefill | OOM | 124.8 GB |

No throughput regression — prefill tok/s remains 415-480, generation output is identical.

Test plan

  • Verified correct output at short context (128 tokens)
  • Verified correct output at long context (50K tokens)
  • Verified no throughput regression
  • Verified memory growth reduced from ~540 KB/tok to ~20 KB/tok
  • Only affects Qwen3.5 models (GatedDeltaNet) — other architectures are unchanged


@Thump604 Thump604 left a comment


I run Qwen3.5-122B-A10B (hybrid: 12 full-attention + 36 GatedDeltaNet layers) in production on M2 Ultra 128GB with continuous batching and MTP.

This is a real leak I have hit in production. The cache slice holds a shared_ptr reference back to the full parent tensor from the forward pass, preventing deallocation during multi-chunk prefill. At ~540 KB/tok, a 60K prefill leaks ~32GB on top of the ~82GB model weights. On a 128GB machine that is fatal.

I independently applied the same fix to my local fork (my GDN code is refactored into a _process_chunk method but the cache assignment site has the identical leak). mx.contiguous() is the correct approach since it materializes an independent copy, breaking the reference chain. No regressions observed after applying.

Good catch. This should merge quickly given how many people are running Qwen3.5 hybrid models at long context.

Remove one unnecessary contiguous call and the slightly misleading comment.
@angeloskath angeloskath merged commit 9dcefa5 into ml-explore:main Apr 1, 2026
2 checks passed