model: add Mellum architecture #23966

Merged
ggerganov merged 8 commits into ggml-org:master from Xarbirus:mellum2 on Jun 2, 2026

@Xarbirus
Contributor

Xarbirus commented Jun 1, 2026

Overview

This PR adds support for the new Mellum architecture (see the model page on Hugging Face).

Additional information

  • This PR also updates the transformers version, because the converter does not work without a bug fix included in the newer version.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES for
    • analysis of current implementation
    • test execution and results analysis

@github-actions github-actions Bot added the model (Model specific), testing (Everything test related), and python (python script changes) labels Jun 1, 2026
@ggml-gh-bot

This comment was marked as resolved.

@CISC
Member

CISC commented Jun 1, 2026

Undo the formatting changes please. :)

Also fill in the AI disclosure in OP.

Comment thread on requirements/requirements-convert_legacy_llama.txt (outdated)
@CISC
Member

CISC commented Jun 2, 2026

Sigh, old transformers limit huggingface_hub versions, try changing requirements/requirements-tool_bench.txt and tools/server/tests/requirements.txt lower requirement to >=0.34.0.

Edit: Scratch that, just remove huggingface_hub altogether, not sure why it's a dependency at all?

@Xarbirus
Contributor Author

Xarbirus commented Jun 2, 2026

> Sigh, old transformers limit huggingface_hub versions, try changing requirements/requirements-tool_bench.txt and tools/server/tests/requirements.txt lower requirement to >=0.34.0.
>
> Edit: Scratch that, just remove huggingface_hub altogether, not sure why it's a dependency at all?

Yep, I'll do that. I checked pyproject.toml, and everything seemed correct in there:( I'll remove the dependency from requirements.txt now.

@CISC
Member

CISC commented Jun 2, 2026

> > Sigh, old transformers limit huggingface_hub versions, try changing requirements/requirements-tool_bench.txt and tools/server/tests/requirements.txt lower requirement to >=0.34.0.
> > Edit: Scratch that, just remove huggingface_hub altogether, not sure why it's a dependency at all?
>
> Yep, I'll do that. I checked pyproject.toml, and everything seemed correct in there:( I'll remove the dependency from requirements.txt now.

Also in tools/server/tests/requirements.txt.

@g0t4

g0t4 commented Jun 2, 2026

383 tokens/sec generation on RTX 6000 Pro using Q8_0 from JetBrains/Mellum2-12B-A2.5B-Thinking

btw I replicated the conversion to GGUF f16 => Q8_0

nice work @Xarbirus

@CISC CISC added the merge ready label (A maintainer can use this label to indicate that they consider the changes final and ready to merge.) Jun 2, 2026
@CISC
Member

CISC commented Jun 2, 2026

Failing CIs are not relevant, GTG.

@CISC CISC removed the testing (Everything test related), examples, and server labels Jun 2, 2026
@ggerganov ggerganov merged commit 4fb16ec into ggml-org:master Jun 2, 2026
30 of 32 checks passed
arichiardi pushed a commit to arichiardi/llama.cpp that referenced this pull request Jun 2, 2026
* model: support for Mellum architecture

* model: improve mellum.py formatting

* model: improve mellum.py formatting once again

* deps: downgrade transformers to 4.57.6 (to fix CI)

* deps: remove huggingface_hub dependency

* deps: remove huggingface_hub from test requirements

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@coder543

coder543 commented Jun 3, 2026

Does this PR add support for Mellum2's MTP that is described in the Mellum2 paper?

@pilot7747

@coder543 Not yet, will need to implement it in transformers first. But it's on the roadmap

@chris-hatton

chris-hatton commented Jun 4, 2026

Can anyone report success using Tools with Mellum2 and llama.cpp?

First, we need to apply the official Jinja template to get tool calling to work at all. Even then, 'Failed to parse' failures are frequent, e.g.:

"Failed to parse input at pos 239: <tool_call>\n{"name": "explore", "arguments": {"path": "/", "pattern": "App.js", "depth": 2}}\n</tool_call>"

The model is claimed to have been trained for agentic operation, so the high failure rate (~60%) is unexpected.

Update: Setting --reasoning-format deepseek improves things further, but it still throws 'Failed to parse input' fairly often.

My full llama-server command:

./llama.cpp/build/bin/llama-server \
  --model ./Models/Mellum2-12B-A2.5B-Thinking-Q6_K.gguf \
  --host 0.0.0.0 \
  -fa on \
  --ctx-size 100000 \
  --n-gpu-layers 999 \
  --ubatch-size 256 \
  --threads 8 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --chat-template-file ./Templates/mellum2.jinja \
  --jinja \
  --reasoning-format deepseek
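
One way to narrow down the 'Failed to parse' error quoted above is to check whether the tool call payload is itself valid JSON. The following is a minimal sketch for that check only. It is not llama.cpp's parser, and `extract_tool_calls` is a hypothetical helper name introduced for illustration.

```python
import json
import re

# Hypothetical helper, not part of llama.cpp: it pulls the JSON payloads out of
# <tool_call>...</tool_call> blocks in raw model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every tool call in `text` whose body parses as a JSON object."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            parsed = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed payload: skip it
        if isinstance(parsed, dict) and "name" in parsed:
            calls.append(parsed)
    return calls


# The tool call text quoted in the error message above.
sample = (
    "<tool_call>\n"
    '{"name": "explore", "arguments": {"path": "/", "pattern": "App.js", "depth": 2}}\n'
    "</tool_call>"
)

calls = extract_tool_calls(sample)
print(calls[0]["name"], calls[0]["arguments"])
```

The quoted payload parses cleanly here, which suggests the JSON itself is not the problem in that particular failure. This check cannot say where the server-side parse actually fails.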

@Cyrille37

Cyrille37 commented Jun 5, 2026

Thanks @Xarbirus 💌

Works great with "Mellum2-12B-A2.5B-Thinking-Q6_K.gguf" on an RTX 3060 12 GB with opencode.ai:

  • prompt eval time = 9777.32 ms / 10596 tokens ( 0.92 ms per token, 1083.73 t/s)
  • eval time = 26929.82 ms / 934 tokens ( 28.83 ms per token, 34.68 t/s)
  • total time = 36707.14 ms / 11530 tokens
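
The per-token and tokens-per-second figures in these timing lines follow directly from the elapsed time and the token count. A small sketch of that arithmetic, using only the numbers reported above (the function names are illustrative, not llama.cpp APIs):

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Throughput in tokens per second for `tokens` processed in `elapsed_ms`."""
    return tokens / (elapsed_ms / 1000.0)


def ms_per_token(elapsed_ms: float, tokens: int) -> float:
    """Average latency per token in milliseconds."""
    return elapsed_ms / tokens


# Figures copied from the timing lines above.
prompt_ms, prompt_tokens = 9777.32, 10596
eval_ms, eval_tokens = 26929.82, 934

print(f"prompt: {ms_per_token(prompt_ms, prompt_tokens):.2f} ms/token, "
      f"{tokens_per_second(prompt_ms, prompt_tokens):.2f} t/s")
print(f"eval:   {ms_per_token(eval_ms, eval_tokens):.2f} ms/token, "
      f"{tokens_per_second(eval_ms, eval_tokens):.2f} t/s")
print(f"total:  {prompt_ms + eval_ms:.2f} ms / {prompt_tokens + eval_tokens} tokens")
```

The same two functions can be applied to other llama-server timing output to compare runs with different quantizations or launch flags.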

@chris-hatton

chris-hatton commented Jun 5, 2026

> Thanks @Xarbirus 💌
>
> Works great with "Mellum2-12B-A2.5B-Thinking-Q6_K.gguf" on an RTX 3060 12 GB with opencode.ai:
>
>   • prompt eval time = 9777.32 ms / 10596 tokens ( 0.92 ms per token, 1083.73 t/s)
>   • eval time = 26929.82 ms / 934 tokens ( 28.83 ms per token, 34.68 t/s)
>   • total time = 36707.14 ms / 11530 tokens

@Cyrille37 How is tool calling working for you? It's problematic for me, using a model file of the same name. Tools work, but with an unusually high failure rate.

@Cyrille37

> How is the tool calling for you?

I only tried with opencode: tools like glob, grep, read, edit ... work fine.

@chris-hatton

chris-hatton commented Jun 5, 2026

> > How is the tool calling for you?
>
> I only tried with opencode: tools like glob, grep, read, edit ... work fine.

I'm also using OpenCode. Are you willing to share your llama.cpp launch command for comparison? Mine's above.

@Cyrille37

> llama.cpp launch command

llama-server compiled with CUDA 12.9, compute capability 8.6:

llama-server -m Mellum2-12B-A2.5B-Thinking-Q6_K.gguf \
    --host 0.0.0.0 --port 8012 \
    --verbosity 3 \
    --threads-http 2 \
    --no-mmap \
    --flash-attn on \
    -sm row \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --repeat-penalty 1.0 --presence-penalty 0.0 \
    --temp 0.2 \
    --jinja \
    --reasoning-format deepseek \
    -c 0

@Ar4l

Ar4l commented Jun 5, 2026

@chris-hatton we uploaded a GGUF collection yesterday whose defaults should work for both tools and reasoning. Please let us know if the issue persists!

llama-server -hf JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q8_0 --tools all

