model: GraniteMoeHybrid #442
Conversation
Looks pretty good! Have you tested it? Is it functional?
🤦 I got so caught up dumping the Claude Code artifacts I forgot to add my testing results! Yes, this does work. Here's the simple example I use everywhere:

```python
from mlx_lm import generate, load

model_path = '/Users/ghart/models/granite-4.0-tiny-preview/'
prompt = 'Tell me a story about a developer and their dog'

model, tokenizer = load(model_path)
result = generate(model, tokenizer, prompt=prompt, verbose=True)
```

For comparison, here's roughly the same thing with `llama.cpp`:

```sh
./bin/llama-cli -m ~/models/granite-4.0-tiny-preview/Granite-4.0-Tiny-Preview-62x915M-F16.gguf -no-cnv -p "Tell me a story about a developer and their dog" --temp 0
```

The results are not identical, so there are clearly some precision differences somewhere in the calculations, but the story of Alex and Max is consistent and coherent.
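If anyone wants to chase down where those precision differences creep in, a direct logit comparison against `transformers` is a reasonable first step. The following is a rough sketch rather than code from this PR; it assumes the local model path from the comment above, a `transformers` build recent enough to load GraniteMoeHybrid checkpoints, and that both tokenizers produce the same ids for the prompt.

```python
# Rough sketch (not from this PR): compare next-token logits between mlx_lm
# and transformers for the same prompt. Paths and dtypes are assumptions.
import mlx.core as mx
import numpy as np
import torch
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/Users/ghart/models/granite-4.0-tiny-preview/"
prompt = "Tell me a story about a developer and their dog"

# MLX side: run the prompt through the model and keep the last-position logits.
mlx_model, mlx_tokenizer = load(model_path)
tokens = mlx_tokenizer.encode(prompt)
cache = make_prompt_cache(mlx_model)
mlx_logits = mlx_model(mx.array(tokens)[None], cache=cache)[0, -1]

# transformers side: same prompt, fp32 to take dtype out of the equation.
hf_tokenizer = AutoTokenizer.from_pretrained(model_path)
hf_model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
with torch.no_grad():
    hf_logits = hf_model(**hf_tokenizer(prompt, return_tensors="pt")).logits[0, -1]

# If the two tokenizations differ (e.g. BOS handling), fix that before comparing.
diff = np.abs(np.array(mlx_logits, dtype=np.float32) - hf_logits.numpy())
print("max |Δlogit|:", float(diff.max()))
print("argmax match:", mx.argmax(mlx_logits).item() == hf_logits.argmax().item())
```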
The specific model I'm testing with is: https://huggingface.co/ibm-granite/granite-4.0-tiny-preview
…d by Claude Code
This commit was entirely generated using Claude Code and the following
prompt:
---
I've got an in-depth feature request for you to add. I need you to add support for the GraniteMoeHybrid architecture to the `mlx-lm` project. The task is to extend the existing set of model architecture implementations in `mlx_lm/models` by adding a new module named `granitemoehybrid.py`. Here are a few key pointers on this model architecture:
* It is a hybrid-recurrent model that uses `mamba2` for some layers (recurrent) and `granitemoe` for some layers (attention)
* It is very similar to the `nemotron_h` architecture implemented in `mlx_lm/models/nemotron_h.py`, but with a few key differences
* In `GraniteMoeHybrid`, each layer has either a `mamba2` block or a `granitemoe` attention block AND a MoE block, whereas in `nemotron_h`, each "layer" is a single block that is either `mamba2`, `attention` (llama), or `ffn` (not MoE).
* The config for `GraniteMoeHybrid` uses the `layer_types` field to determine whether to use `mamba2` or `granitemoe` attention for each layer
* The `transformers` implementation can be found at https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py
* The config can be found at https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/configuration_granitemoehybrid.py
* The PR adding support in `llama.cpp` is: ggml-org/llama.cpp#13550
* NOTE: In `llama.cpp`, I made the architecture slightly more flexible such that each layer could use either a MoE block OR a fully-connected FFN block after the recurrent/attention block
* For the `granitemoe` attention, the architecture is very similar to standard `llama` attention, but it includes 4 additional scalar multipliers that are pulled from config:
* `embedding_multiplier`:
* Multiply the input embeddings by this scalar before the first layer
* Used here in `transformers` https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py#L1347
* `attention_multiplier`:
* Used as the scaling factor in standard attention in place of the default 1/sqrt(n_embed_head)
* Used here in `transformers`: https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py#L217
The goal of this project is to create a fully working local implementation of the model in `mlx_lm`. You can find a local model to test with at /Users/ghart/models/granite-4.0-tiny-preview/. You can find a version of the `nemotron_h` model to test with at /Users/ghart/models/nvidia/NVIDIA-Nemotron-Nano-9B-v2/. To accomplish this project, you'll need to take the following steps:
1. Get a development environment working (you can use `uv` to manage your virtual env) and install the necessary dependencies
2. Run a sample inference with a model that is already known to work (eg `/Users/ghart/models/nvidia/NVIDIA-Nemotron-Nano-9B-v2/`)
3. Create the new module at `mlx_lm/models/granitemoehybrid.py`
4. Implement the model architecture, test, and iterate until you've got things working locally
Once you've got it working, let me know and I'll review and commit
---
Branch: GraniteHybrid
Signed-off-by: Gabe Goodhart <[email protected]>
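As a reading aid for the prompt above (this is not code from the PR itself): the structural difference from `nemotron_h` boils down to each layer pairing one mixer, chosen per layer via `layer_types`, with a MoE block. A minimal runnable sketch with stand-in classes, all names illustrative:

```python
# Illustrative sketch only: layer_types drives the mixer choice, and every
# layer also carries a MoE block (unlike nemotron_h, where each "layer" is a
# single mamba2 / attention / ffn block). Class names are made up.
import mlx.core as mx
import mlx.nn as nn


class Mamba2Block(nn.Module):          # stand-in for the real mamba2 recurrent block
    def __init__(self, dims):
        super().__init__()
        self.proj = nn.Linear(dims, dims)

    def __call__(self, x, cache=None):
        return self.proj(x)


class GraniteMoeAttention(nn.Module):  # stand-in for the real granitemoe attention block
    def __init__(self, dims):
        super().__init__()
        self.proj = nn.Linear(dims, dims)

    def __call__(self, x, cache=None):
        return self.proj(x)


class MoEBlock(nn.Module):             # stand-in for the real MoE block
    def __init__(self, dims):
        super().__init__()
        self.proj = nn.Linear(dims, dims)

    def __call__(self, x):
        return self.proj(x)


class GraniteMoeHybridLayer(nn.Module):
    """Each layer: a mamba2 OR attention mixer, followed by a MoE block."""

    def __init__(self, dims, layer_type):
        super().__init__()
        self.pre_norm = nn.RMSNorm(dims)
        self.post_norm = nn.RMSNorm(dims)
        self.mixer = Mamba2Block(dims) if layer_type == "mamba" else GraniteMoeAttention(dims)
        self.moe = MoEBlock(dims)

    def __call__(self, x, cache=None):
        x = x + self.mixer(self.pre_norm(x), cache=cache)
        x = x + self.moe(self.post_norm(x))
        return x


# config.layer_types is a list like ["mamba", "mamba", "attention", ...]
layer_types = ["mamba", "mamba", "attention", "mamba"]
layers = [GraniteMoeHybridLayer(64, t) for t in layer_types]
x = mx.zeros((1, 8, 64))
for layer in layers:
    x = layer(x)
print(x.shape)
```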
Inference now matches transformers. Further refinement by me coming next.
Branch: GraniteHybrid
Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteHybrid
Signed-off-by: Gabe Goodhart <[email protected]>
…odels
This keeps the implementation of the attention block closer to GraniteMoe for an easier diff view in the future. The functionality is identical.
Branch: GraniteHybrid
Signed-off-by: Gabe Goodhart <[email protected]>
Looks great, thanks!
Thanks for the cleanup fixes!
Description
Addresses #256
This PR adds support for the `GraniteMoeHybrid` model architecture. It was heavily written using Claude Code with my input and guidance. I have deep knowledge of the model architecture, having implemented it for `llama.cpp` along with similar architectures like `nemotron_h`, so I gave Claude Code very specific guidance (though after the fact I found several gaps in my guidance that Claude Code was able to work around).

Claude Code Artifacts
I used this as an opportunity to test Claude Code's capabilities, so I'm recording as much of the session here as possible for posterity.
Input Prompt
I've got an in-depth feature request for you to add. I need you to add support for the GraniteMoeHybrid architecture to the `mlx-lm` project. The task is to extend the existing set of model architecture implementations in `mlx_lm/models` by adding a new module named `granitemoehybrid.py`. Here are a few key pointers on this model architecture:
* It is a hybrid-recurrent model that uses `mamba2` for some layers (recurrent) and `granitemoe` for some layers (attention)
* It is very similar to the `nemotron_h` architecture implemented in `mlx_lm/models/nemotron_h.py`, but with a few key differences
* In `GraniteMoeHybrid`, each layer has either a `mamba2` block or a `granitemoe` attention block AND a MoE block, whereas in `nemotron_h`, each "layer" is a single block that is either `mamba2`, `attention` (llama), or `ffn` (not MoE).
* The config for `GraniteMoeHybrid` uses the `layer_types` field to determine whether to use `mamba2` or `granitemoe` attention for each layer
* The `transformers` implementation can be found at https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py
* The config can be found at https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/configuration_granitemoehybrid.py
* The PR adding support in `llama.cpp` is: Granite Four ggml-org/llama.cpp#13550
* NOTE: In `llama.cpp`, I made the architecture slightly more flexible such that each layer could use either a MoE block OR a fully-connected FFN block after the recurrent/attention block
* For the `granitemoe` attention, the architecture is very similar to standard `llama` attention, but it includes 4 additional scalar multipliers that are pulled from config:
* `embedding_multiplier`:
* Multiply the input embeddings by this scalar before the first layer
* Used here in `transformers`: https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py#L1347
* `attention_multiplier`:
* Used as the scaling factor in standard attention in place of the default 1/sqrt(n_embed_head)
* Used here in `transformers`: https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py#L217
The goal of this project is to create a fully working local implementation of the model in `mlx_lm`. You can find a local model to test with at /Users/ghart/models/granite-4.0-tiny-preview/. You can find a version of the `nemotron_h` model to test with at /Users/ghart/models/nvidia/NVIDIA-Nemotron-Nano-9B-v2/. To accomplish this project, you'll need to take the following steps:
1. Get a development environment working (you can use `uv` to manage your virtual env) and install the necessary dependencies
2. Run a sample inference with a model that is already known to work (eg `/Users/ghart/models/nvidia/NVIDIA-Nemotron-Nano-9B-v2/`)
3. Create the new module at `mlx_lm/models/granitemoehybrid.py`
4. Implement the model architecture, test, and iterate until you've got things working locally
Once you've got it working, let me know and I'll review and commit
tmp_mlx.py
tmp_transformers.py
debug_layer_call.py
debug_layers.py
debug_mamba.py
test_cache_mlx.py
test_cache_transformers.py
claude-trace.jsonl.txt
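As a footnote on the two scalar multipliers called out in the prompt: `embedding_multiplier` scales the token embeddings once before the first layer, and `attention_multiplier` is used as the attention scale in place of the default 1/sqrt(head_dim). The snippet below is a small illustration with made-up values, not code from this PR.

```python
# Hedged sketch of the two multipliers named in the prompt. The multiplier
# values here are illustrative; in the real model they come from the config.
import math
import mlx.core as mx

hidden_size, n_heads, head_dim, seq_len = 64, 4, 16, 8
embedding_multiplier = 12.0      # illustrative value
attention_multiplier = 0.015625  # illustrative value

# Input embeddings are scaled once, before the first decoder layer.
embeddings = mx.random.normal((1, seq_len, hidden_size))
h = embedding_multiplier * embeddings

# Inside granitemoe attention, the config value replaces the default scale.
q = mx.random.normal((1, n_heads, seq_len, head_dim))
k = mx.random.normal((1, n_heads, seq_len, head_dim))
v = mx.random.normal((1, n_heads, seq_len, head_dim))
default_scale = 1.0 / math.sqrt(head_dim)  # what plain llama attention would use
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=attention_multiplier)
print(h.shape, out.shape, default_scale)
```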