fix: break shared-buffer memory leak in GatedDeltaNet cache#1077
angeloskath merged 4 commits into ml-explore:main
Conversation
GatedDeltaNet stores cache entries as slices of larger parent tensors:
cache[0] = conv_input[:, -2:] # 2-position slice of (T+2) tensor
cache[1] = state # kernel output sharing input buffers
MLX slices share the parent array's Data buffer via shared_ptr. During
multi-chunk prefill, each chunk's tiny cache slice keeps the entire
parent conv_input/state alive, which in turn pins the full forward pass
computation graph in memory. This causes ~540 KB/tok memory growth
that limits Qwen3.5-397B to ~24K tokens on 128GB before OOM.
mx.contiguous() creates an independent copy with its own buffer,
breaking the reference chain and allowing the parent to be freed.
Before: ~540 KB/tok growth, 24K token ceiling on 128GB
After: ~20 KB/tok growth, 50K+ tokens verified, 100K+ feasible
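The shared-buffer behavior described above can be demonstrated with a NumPy analogy (this is not the actual MLX code, but NumPy slices likewise view the parent buffer): a tiny slice pins the entire parent allocation until an explicit copy is made.

```python
import numpy as np

# NumPy analogy (not the actual MLX code): slices are views that keep
# the whole parent buffer alive via their .base reference.
parent = np.zeros((1, 4096, 1024), dtype=np.float32)  # large "conv_input"

view = parent[:, -2:]          # tiny 2-position slice
assert view.base is parent     # slice pins the entire parent allocation

copy = parent[:, -2:].copy()   # independent buffer, like mx.contiguous()
assert copy.base is None       # parent can now be freed
```

The mechanism differs in detail (MLX tracks the parent via a `shared_ptr` on the Data buffer rather than a `.base` attribute), but the lifetime consequence is the same: the parent cannot be released while any slice of it is retained.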
Thump604 left a comment:
I run Qwen3.5-122B-A10B (hybrid: 12 full-attention + 36 GatedDeltaNet layers) in production on M2 Ultra 128GB with continuous batching and MTP.
This is a real leak I have hit in production. The cache slice holds a shared_ptr reference back to the full parent tensor from the forward pass, preventing deallocation during multi-chunk prefill. At ~540 KB/tok, a 60K prefill leaks ~32GB on top of the ~82GB model weights. On a 128GB machine that is fatal.
I independently applied the same fix to my local fork (my GDN code is refactored into a _process_chunk method but the cache assignment site has the identical leak). mx.contiguous() is the correct approach since it materializes an independent copy, breaking the reference chain. No regressions observed after applying.
Good catch. This should merge quickly given how many people are running Qwen3.5 hybrid models at long context.
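As a quick sanity check on the figures quoted in the comment above (a ~540 KB/tok leak over a 60K-token prefill), the arithmetic does land in the low 30s of GB, consistent with the ~32GB claim:

```python
# Figures quoted in the comment above (assumed: 540 KB/tok, 60K-token prefill)
leak_kib_per_tok = 540
prefill_tokens = 60_000

leaked_gib = leak_kib_per_tok * prefill_tokens / 1024**2  # KiB -> GiB
print(f"{leaked_gib:.1f} GiB")  # prints "30.9 GiB"
```

On top of ~82GB of model weights, that indeed exhausts a 128GB machine well before the prefill completes.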
Remove one unnecessary contiguous call and the slightly misleading comment.
Summary
mx.contiguous() on cache[0] and cache[1] in GatedDeltaNet.__call__ to break a shared-buffer memory leak during multi-chunk prefill.

Problem
GatedDeltaNet stores cache entries as slices of larger parent tensors:
MLX slices share the parent array's Data buffer via shared_ptr. During multi-chunk prefill, each chunk's tiny cache slice keeps the entire parent conv_input/state alive, which in turn pins the full forward-pass computation graph in memory.

This causes ~540 KB/tok memory growth during prefill, limiting Qwen3.5-397B-A17B-4bit to ~24K tokens on a 128GB M4 Max before OOM.
Fix
mx.contiguous() creates an independent copy with its own buffer, breaking the reference chain and allowing the parent to be freed.

Results
Tested on 2× M4 Max 128GB in pipeline-parallel with Qwen3.5-397B-A17B-4bit:
No throughput regression: prefill tok/s remains 415-480, and generation output is identical.
Test plan