UPSTREAM PR #16982: Mamba2 SSD #63

DajanaV · 2025-11-04T00:16:50Z

DRAFT STATUS

This PR will remain in Draft until the items in the discussion section are resolved.

Description

This PR is a draft implementation of the Structured State Space Duality (SSD) described in the original Mamba2 paper, which reframes the SSM_SCAN op as a pseudo-attention operation. The paper describes it in great detail, but the short version is that when performing a multi-token scan, the recurrent formulation of SSM_SCAN is inefficient because it cannot parallelize over the sequence dimension the way an attention calculation can. With the SSD formulation, the logical attention matrix is decomposed into chunks and the state is updated at the chunk boundaries, allowing prefill to "jump" by the size of the chunk rather than proceed one token at a time.
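To make the chunking idea concrete, here is a minimal sketch for a scalar state, assuming the simplified recurrence h_t = a_t * h_{t-1} + b_t * x_t with output y_t = c_t * h_t. The function names and the scalar setup are illustrative only and do not correspond to the ggml graph in this PR. Inside each chunk every output depends only on the chunk's inputs and the state carried in, so those sums could run in parallel; only the boundary state is passed along sequentially.

```python
import math


def scan_recurrent(a, b, c, x, h0=0.0):
    """Token-by-token scan: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, ys = h0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys, h


def scan_chunked(a, b, c, x, chunk, h0=0.0):
    """Same result, computed chunk by chunk (illustrative, not optimized)."""

    def decay(j, i):
        # Product of a[j+1..i]: the weight with which token j's input
        # reaches position i. An empty product (j == i) is 1.
        return math.prod(a[j + 1:i + 1])

    h, ys = h0, []
    for s in range(0, len(x), chunk):
        e = min(s + chunk, len(x))
        for i in range(s, e):
            # Contribution of the state carried in from earlier chunks.
            carried = math.prod(a[s:i + 1]) * h
            # Causal, attention-like sum over tokens inside this chunk.
            intra = sum(decay(j, i) * b[j] * x[j] for j in range(s, i + 1))
            ys.append(c[i] * (carried + intra))
        # The state is updated once per chunk, at the chunk boundary.
        h = math.prod(a[s:e]) * h + sum(
            decay(j, e - 1) * b[j] * x[j] for j in range(s, e)
        )
    return ys, h
```

Both functions should agree up to floating-point rounding for any chunk size, including one that does not divide the sequence length evenly.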

Reference Links

Original Paper: https://arxiv.org/pdf/2405.21060
Optimized triton implementation by paper authors: https://github.com/state-spaces/mamba/blob/main/mamba_ssm/ops/triton/ssd_combined.py
Unified implementation in mlx-lm: https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/models/ssm.py

Changes

Introduce new primitive operations in ggml:
- ggml_cumsum / ggml_cumsum_0: Perform a cumulative sum along a given dimension
  - NOTE: This adds the ability to specify dimension on top of the implementation in #16623 and relaxes the need for contiguous rows
- ggml_tri_dims / ggml_tri / ggml_tri_keep: Apply a triangular mask to the given matrix
  - NOTE: This adds the ability to specify dimension on top of the implementation in #16623 and relaxes the need for contiguous rows with ggml_tri_dims
- ggml_softplus: Perform the unary softplus operation
Implement an alternate path through llm_graph_context_mamba::build_mamba2_layer when a multi-token update is detected
- This path is the core of the SSD implementation and avoids calling SSM_SCAN in favor of the chunked pseudo-attention formulation
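For reference, the three primitives can be sketched on plain Python lists as follows. This shows only the intended math, not the ggml signatures or their handling of dimensions. The `decay_matrix` helper is a hypothetical illustration of one way a cumulative sum and a triangular mask combine into the causal decay weights of a chunk, assuming the per-token decays are given as logarithms.

```python
import math


def cumsum(xs):
    """Running total: out[i] = xs[0] + ... + xs[i]."""
    out, total = [], 0.0
    for v in xs:
        total += v
        out.append(total)
    return out


def tri_lower(m, fill=0.0):
    """Keep the lower triangle (diagonal included); replace the rest with fill."""
    return [
        [v if j <= i else fill for j, v in enumerate(row)]
        for i, row in enumerate(m)
    ]


def softplus(v):
    """log(1 + exp(v)), written to avoid overflow for large |v|."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def decay_matrix(log_a):
    """Hypothetical helper: L[i][j] = exp(sum of log_a[j+1..i]) for j <= i, else 0."""
    cs = cumsum(log_a)
    n = len(cs)
    diff = [[cs[i] - cs[j] for j in range(n)] for i in range(n)]
    # Masking with -inf before exp turns the non-causal entries into exact zeros.
    masked = tri_lower(diff, fill=-math.inf)
    return [[math.exp(v) for v in row] for row in masked]
```

Taking differences of the cumulative sum gives every segment sum at once, which is why a cumsum along a chosen dimension plus a triangular mask is enough to build the whole matrix.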

Discussion

There are a number of outstanding discussion points on this work that need to be resolved before moving it forward:

Performance: Currently, this implementation appears to be significantly slower than simply using SSM_SCAN, which roundly defeats the purpose of the change! I suspect that the performance issues are due to the number of ggml_permute / ggml_cont ops that are added to the graph, but I could use assistance figuring out how to eliminate them or identifying other sources of slowness.
To chunk or not to chunk: In this PR I have sub-ubatch chunking implemented. I had it mostly working before the corresponding discussion on Qwen3Next. The inter-chunk update would be needed anyway, so I didn't strip it out, but it would be fairly trivial to do so and might offer some performance improvements.
Handling of repeat_interleave: Similar to the issue that came up when initially implementing NemotronH support, I believe that ggml_repeat behaves differently than mx.repeat, resulting in incorrect results for models with n_groups > 1 (tested with NemotronH).
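The suspected repeat_interleave mismatch can be illustrated with two plain-Python helpers. This assumes that mx.repeat repeats each element in place (as numpy.repeat does) while ggml_repeat tiles the whole source; the ggml side of that assumption is unverified here, and the helper names are made up for the example.

```python
def repeat_interleave(groups, n):
    """Each group is repeated n times in place: g0, g0, ..., g1, g1, ..."""
    return [g for g in groups for _ in range(n)]


def tile(groups, n):
    """The whole sequence of groups is repeated n times: g0, g1, ..., g0, g1, ..."""
    return [g for _ in range(n) for g in groups]


# Expanding 2 groups to 4 heads gives different head-to-group assignments.
heads_interleaved = repeat_interleave(["g0", "g1"], 2)  # ['g0', 'g0', 'g1', 'g1']
heads_tiled = tile(["g0", "g1"], 2)                     # ['g0', 'g1', 'g0', 'g1']

# With a single group the two orderings coincide.
single_group_same = repeat_interleave(["g0"], 4) == tile(["g0"], 4)  # True
```

If the assumption holds, it would be consistent with models that have n_groups == 1 matching SSM_SCAN while models with n_groups > 1 do not.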

Testing

I've tested this locally with various members of the Granite 4 family and with nvidia/NVIDIA-Nemotron-Nano-9B-v2. For the Granite 4 models with n_groups == 1, I get nearly identical results to running with purely SSM_SCAN, but NemotronH still struggles due to repeat_interleave issues (see above). I'll flesh out more testing results once we've worked through some of the above issues.

cc @compilade since I know this has been on your TODO list since the original mamba2 implementation.

This reverts commit 00f115f.

* gg/metal-mul-mat-fixes: metal : fix mul-mm condition + fix mul-mv permuted kernels

Cherry-picked and edited from 7ec2df64a46f4697d9c95f3f07753c3e3b1926fa

The original commit contained the DELTA_NET op as well, which I've removed in this cherry-picked version.

Co-Authored-By: Piotr Wilkin <[email protected]>
Signed-off-by: Gabe Goodhart <[email protected]>

…sors Branch: Mamba2Perf Signed-off-by: Gabe Goodhart <[email protected]>