Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
This issue is to track work to support IBM's Granite 4 model architecture (`GraniteMoEHybrid` in `transformers`). The model uses a number of components that are not yet supported in llama.cpp, but that are being worked on independently, so I'm raising this issue to triangulate the different work streams that will be needed to support the model.
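For orientation, a quick way to see what the architecture contains is to inspect the Hugging Face config. A minimal sketch, assuming a published checkpoint id (placeholder below) and a per-layer block-type attribute whose exact name may differ in the released config class:

```python
# Sketch: inspect a GraniteMoEHybrid config to see which layers are
# attention vs. mamba2. The checkpoint id is a placeholder, and the
# `layers_block_type` attribute name is an assumption -- check the
# actual config class in transformers.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("ibm-granite/<granite-4-checkpoint>")  # placeholder id
print(cfg.model_type)  # expected: "granitemoehybrid"
layer_types = getattr(cfg, "layers_block_type", None)  # assumed attribute name
if layer_types is not None:
    print(layer_types)  # e.g. ["mamba", "mamba", "attention", ...]
```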
Necessary Components
- Mamba2 layers (a reference recurrence is sketched after this list)
  - Ongoing work by @compilade: llama : initial Mamba-2 support #9126
- Refactored KV Cache to an abstract interface: kv-cache : separate recurrent vs non-recurrent impl #12799
- Support for hybrid attention / recurrent cache (see the dispatch sketch after this list)
  - Initial implementation for `jamba` by @compilade: llama : support Jamba hybrid Transformer-Mamba models #7531
  - Initial implementation for `bamba`: Bamba architecture #10810
  - Updated implementation for `bamba` that's also out-of-date: https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitectureRefactor
  - First-cut implementation against the current abstract interfaces: https://github.com/gabe-l-hart/llama.cpp/tree/HybridCache
- Support for `GraniteMoEShared` layers (sketched after this list): Model: Granite MoE shared #13269
- Support for `mamba2` in non-CPU backends
  - I'm not totally clear on the state here, so there may well be ongoing work
  - CUDA support for some of the necessary features was added in Faster ssm scan #10558
  - Some of the `metal` backend needs look like they're already addressed in llama : initial Mamba-2 support #9126, but that still doesn't work for me on my M3 (assertion error about non-contiguous data)
- Support for NoPE positional encoding instead of RoPE (sketched after this list)
  - I haven't fully investigated what is required for this, so it may already work as-is, but I'm putting this here as a placeholder in case further work is needed
- End-to-end `GraniteMoEHybrid` support tying all of the other pieces together
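For the Mamba2 item, a sequential reference of the state-space recurrence may help clarify what a backend kernel has to compute. The scalar-A-per-head form follows the Mamba-2 paper; this naive loop is only a sketch, not the fused/chunked scan an optimized backend (CUDA, Metal) would implement:

```python
# Minimal single-head Mamba-2-style SSM recurrence (sequential reference,
# illustrative only -- not llama.cpp's implementation).
import numpy as np

def ssm_scan(x, dt, A, B, C):
    # x:  (T, P)  input per timestep (head dim P)
    # dt: (T,)    per-timestep step size (after softplus)
    # A:  scalar  negative decay rate (Mamba-2 uses one scalar per head)
    # B:  (T, N)  input projection into state space (state dim N)
    # C:  (T, N)  output projection out of state space
    T, P = x.shape
    N = B.shape[1]
    h = np.zeros((P, N))  # recurrent state carried across timesteps
    y = np.zeros((T, P))
    for t in range(T):
        decay = np.exp(dt[t] * A)                      # scalar decay for this step
        h = decay * h + dt[t] * np.outer(x[t], B[t])   # state update
        y[t] = h @ C[t]                                # readout
    return y

T, P, N = 8, 4, 16
rng = np.random.default_rng(0)
y = ssm_scan(rng.normal(size=(T, P)), np.full(T, 0.1), -1.0,
             rng.normal(size=(T, N)), rng.normal(size=(T, N)))
print(y.shape)  # (8, 4)
```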
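For the hybrid-cache item, the core requirement is that attention layers and recurrent layers keep different kinds of state behind one interface: attention caches grow with sequence length, while recurrent caches overwrite a fixed-size state. A toy sketch (class and function names are illustrative, not the actual llama.cpp kv-cache API):

```python
# Sketch of per-layer cache dispatch in a hybrid model. Illustrative only.
import numpy as np

class AttentionCache:
    def __init__(self):
        self.k, self.v = [], []
    def update(self, k_t, v_t):
        self.k.append(k_t); self.v.append(v_t)  # grows with sequence length
        return np.stack(self.k), np.stack(self.v)

class RecurrentCache:
    def __init__(self, state_shape):
        self.h = np.zeros(state_shape)  # fixed size, independent of sequence length
    def update(self, h_new):
        self.h = h_new
        return self.h

def make_layer_caches(layer_types, state_shape):
    # One cache object per layer, chosen by the layer's block type.
    return [AttentionCache() if t == "attention" else RecurrentCache(state_shape)
            for t in layer_types]

caches = make_layer_caches(["mamba", "mamba", "attention", "mamba"], (4, 16))
print([type(c).__name__ for c in caches])
```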
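For the `GraniteMoEShared` item, the distinguishing piece is a shared expert that runs on every token and is added unconditionally alongside the routed experts. A minimal numpy sketch under that assumption (dense matmuls stand in for the real expert FFNs; this is not the llama.cpp graph):

```python
# Sketch of a MoE block with a shared expert. Illustrative only.
import numpy as np

def moe_with_shared(x, experts, shared, router_logits, top_k=2):
    # x: (T, D); experts: list of (D, D) matrices; shared: (D, D)
    T, D = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        top = np.argsort(router_logits[t])[-top_k:]  # routed experts for this token
        w = np.exp(router_logits[t][top])
        w /= w.sum()                                 # softmax over the top-k
        for wi, e in zip(w, top):
            out[t] += wi * (x[t] @ experts[e])
        out[t] += x[t] @ shared                      # shared expert, always on
    return out

rng = np.random.default_rng(0)
T, D, E = 3, 8, 4
y = moe_with_shared(rng.normal(size=(T, D)),
                    [rng.normal(size=(D, D)) for _ in range(E)],
                    rng.normal(size=(D, D)),
                    rng.normal(size=(T, E)))
print(y.shape)  # (3, 8)
```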
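For the NoPE item, my understanding is that the only difference from a RoPE layer is that the rotary rotation of Q/K is skipped entirely. The rotation below uses the split-half pairing convention; implementations differ on this, so treat it purely as an illustration:

```python
# Sketch contrasting RoPE and NoPE at the attention input. Illustrative only.
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotary embedding on split-half channel pairs of x: (D,) at position pos.
    D = x.shape[0]
    half = D // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

q = np.ones(8)
q_rope = rope(q, pos=5)  # RoPE layer: rotate by position
q_nope = q               # NoPE layer: use q unchanged
print(np.allclose(q_rope, q_nope))  # False -- the rotation is the only difference
```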
Motivation
I lead IBM's efforts to ensure that Granite models work everywhere, and llama.cpp is a critical part of "everywhere!"
Possible Implementation
No response