Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
This issue is to track work to support IBM's Granite 4 model architecture (`GraniteMoEHybrid` in `transformers`). The model uses a number of components that are not yet supported in llama.cpp, but that are being worked on independently, so I'm raising this issue to triangulate the different work streams that will be needed to support the model.
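For orientation, a quick way to see what the architecture contains is to inspect the Hugging Face config. A minimal sketch, assuming a published checkpoint id (placeholder below) and a per-layer block-type attribute whose exact name may differ in the released config class:

```python
# Sketch: inspect a GraniteMoEHybrid config to see which layers are
# attention vs. mamba2. The checkpoint id is a placeholder, and the
# `layers_block_type` attribute name is an assumption -- check the
# actual config class in transformers.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("ibm-granite/<granite-4-checkpoint>")  # placeholder id
print(cfg.model_type)  # expected: "granitemoehybrid"
layer_types = getattr(cfg, "layers_block_type", None)  # assumed attribute name
if layer_types is not None:
    print(layer_types)  # e.g. ["mamba", "mamba", "attention", ...]
```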
Necessary Components
- Mamba2 layers (a reference recurrence is sketched after this list)
  - Ongoing work by @compilade: llama : initial Mamba-2 support #9126
- Refactored KV Cache to an abstract interface: kv-cache : separate recurrent vs non-recurrent impl #12799
- Support for hybrid attention / recurrent cache (see the dispatch sketch after this list)
  - Initial implementation for `jamba` by @compilade: llama : support Jamba hybrid Transformer-Mamba models #7531
  - Initial implementation for `bamba`: Bamba architecture #10810
  - Updated implementation for `bamba` that's also out-of-date: https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitectureRefactor
  - First-cut implementation against the current abstract interfaces: https://github.com/gabe-l-hart/llama.cpp/tree/HybridCache
- Support for `GraniteMoEShared` layers (sketched after this list): Model: Granite MoE shared #13269
- Support for `mamba2` in non-CPU backends
  - I'm not totally clear on the state here, so there may well be ongoing work
  - CUDA support for some of the necessary features was added in Faster ssm scan #10558
  - Some of the `metal` backend needs look like they're already addressed in llama : initial Mamba-2 support #9126, but that still doesn't work for me on my M3 (assertion error about non-contiguous data)
- Support for NoPE positional encoding instead of RoPE (sketched after this list)
  - I haven't fully investigated what is required for this, so it may already work as-is, but I'm putting this here as a placeholder in case further work is needed
- End-to-end `GraniteMoEHybrid` support tying all of the other pieces together
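For the Mamba2 item, a sequential reference of the state-space recurrence may help clarify what a backend kernel has to compute. The scalar-A-per-head form follows the Mamba-2 paper; this naive loop is only a sketch, not the fused/chunked scan an optimized backend (CUDA, Metal) would implement:

```python
# Minimal single-head Mamba-2-style SSM recurrence (sequential reference,
# illustrative only -- not llama.cpp's implementation).
import numpy as np

def ssm_scan(x, dt, A, B, C):
    # x:  (T, P)  input per timestep (head dim P)
    # dt: (T,)    per-timestep step size (after softplus)
    # A:  scalar  negative decay rate (Mamba-2 uses one scalar per head)
    # B:  (T, N)  input projection into state space (state dim N)
    # C:  (T, N)  output projection out of state space
    T, P = x.shape
    N = B.shape[1]
    h = np.zeros((P, N))  # recurrent state carried across timesteps
    y = np.zeros((T, P))
    for t in range(T):
        decay = np.exp(dt[t] * A)                      # scalar decay for this step
        h = decay * h + dt[t] * np.outer(x[t], B[t])   # state update
        y[t] = h @ C[t]                                # readout
    return y

T, P, N = 8, 4, 16
rng = np.random.default_rng(0)
y = ssm_scan(rng.normal(size=(T, P)), np.full(T, 0.1), -1.0,
             rng.normal(size=(T, N)), rng.normal(size=(T, N)))
print(y.shape)  # (8, 4)
```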
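For the hybrid-cache item, the core requirement is that attention layers and recurrent layers keep different kinds of state behind one interface: attention caches grow with sequence length, while recurrent caches overwrite a fixed-size state. A toy sketch (class and function names are illustrative, not the actual llama.cpp kv-cache API):

```python
# Sketch of per-layer cache dispatch in a hybrid model. Illustrative only.
import numpy as np

class AttentionCache:
    def __init__(self):
        self.k, self.v = [], []
    def update(self, k_t, v_t):
        self.k.append(k_t); self.v.append(v_t)  # grows with sequence length
        return np.stack(self.k), np.stack(self.v)

class RecurrentCache:
    def __init__(self, state_shape):
        self.h = np.zeros(state_shape)  # fixed size, independent of sequence length
    def update(self, h_new):
        self.h = h_new
        return self.h

def make_layer_caches(layer_types, state_shape):
    # One cache object per layer, chosen by the layer's block type.
    return [AttentionCache() if t == "attention" else RecurrentCache(state_shape)
            for t in layer_types]

caches = make_layer_caches(["mamba", "mamba", "attention", "mamba"], (4, 16))
print([type(c).__name__ for c in caches])
```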
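For the `GraniteMoEShared` item, the distinguishing piece is a shared expert that runs on every token and is added unconditionally alongside the routed experts. A minimal numpy sketch under that assumption (dense matmuls stand in for the real expert FFNs; this is not the llama.cpp graph):

```python
# Sketch of a MoE block with a shared expert. Illustrative only.
import numpy as np

def moe_with_shared(x, experts, shared, router_logits, top_k=2):
    # x: (T, D); experts: list of (D, D) matrices; shared: (D, D)
    T, D = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        top = np.argsort(router_logits[t])[-top_k:]  # routed experts for this token
        w = np.exp(router_logits[t][top])
        w /= w.sum()                                 # softmax over the top-k
        for wi, e in zip(w, top):
            out[t] += wi * (x[t] @ experts[e])
        out[t] += x[t] @ shared                      # shared expert, always on
    return out

rng = np.random.default_rng(0)
T, D, E = 3, 8, 4
y = moe_with_shared(rng.normal(size=(T, D)),
                    [rng.normal(size=(D, D)) for _ in range(E)],
                    rng.normal(size=(D, D)),
                    rng.normal(size=(T, E)))
print(y.shape)  # (3, 8)
```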
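For the NoPE item, my understanding is that the only difference from a RoPE layer is that the rotary rotation of Q/K is skipped entirely. The rotation below uses the split-half pairing convention; implementations differ on this, so treat it purely as an illustration:

```python
# Sketch contrasting RoPE and NoPE at the attention input. Illustrative only.
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotary embedding on split-half channel pairs of x: (D,) at position pos.
    D = x.shape[0]
    half = D // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

q = np.ones(8)
q_rope = rope(q, pos=5)  # RoPE layer: rotate by position
q_nope = q               # NoPE layer: use q unchanged
print(np.allclose(q_rope, q_nope))  # False -- the rotation is the only difference
```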
Motivation
I lead IBM's efforts to ensure that Granite models work everywhere, and llama.cpp is a critical part of "everywhere!"
Possible Implementation
No response