Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 121 additions & 0 deletions models/JetBrains/Mellum2-12B-A2.5B-Instruct.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
meta:
title: "Mellum2-12B-A2.5B-Instruct"
slug: "mellum2-12b-a2.5b-instruct"
provider: "JetBrains"
description: "JetBrains' instruction-tuned code MoE (12B total / 2.5B active) that answers directly without an externalized chain of thought — low-latency coding and tool use"
date_updated: 2026-06-02
difficulty: intermediate
tasks:
- text
performance_headline: "78.4 EvalPlus, 67.1 MultiPL-E — direct answers, fits on a single GPU"
related_recipes:
- "JetBrains/Mellum2-12B-A2.5B-Thinking"
hardware:
h200: verified

model:
model_id: "JetBrains/Mellum2-12B-A2.5B-Instruct"
min_vllm_version: "nightly"
nightly_required: true
architecture: moe
parameter_count: "12B"
active_parameters: "2.5B"
context_length: 131072
base_args:
- "--max-model-len"
- "131072"
base_env: {}

features:
tool_calling:
description: "Hermes tool-call parser for function calling"
args:
- "--enable-auto-tool-choice"
- "--tool-call-parser"
- "hermes"

opt_in_features: []

variants:
default:
precision: bf16
vram_minimum_gb: 29
description: "Native bfloat16 weights; fits comfortably on a single H200/H100/A100"

compatible_strategies:
- single_node_tp
- single_node_tep
- single_node_dep

strategy_overrides:
single_node_tp:
tp: 1

guide: |
## Overview

[Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct)
is JetBrains' instruction-tuned code assistant. It shares the same Mixture-of-Experts
backbone as the rest of the Mellum2 family — 64 experts (8 activated per token), 12B total
/ 2.5B active parameters, sliding-window + full-attention layers, 131,072-token context —
but is post-trained (SFT + RLVR on math, coding, tool use, instruction following, reasoning,
and knowledge) to **answer directly, without an externalized chain of thought**. For complex
debugging, multi-step planning, or math/reasoning-heavy tasks where you want explicit
reasoning traces, use the
[Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking) variant instead.

## Prerequisites

- Hardware: a single H200, H100, or A100 (~29 GB at bf16) is plenty
- vLLM **nightly** — `MellumForCausalLM` support landed after v0.22.0 and is not yet in a
stable release. Install the nightly wheels until the next tagged release ships.

### Install vLLM (nightly)

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```

## Launch command

Unlike the Thinking checkpoint, Instruct does not emit `<think>` blocks, so no
`--reasoning-parser` is needed.

```bash
# Plain serving
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
--max-model-len 131072

# Add tool calling
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
--max-model-len 131072 \
--enable-auto-tool-choice \
--tool-call-parser hermes
```

## Client usage

JetBrains recommends sampling at `temperature=0.6`, `top_p=0.95`, `top_k=20`.

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
model="JetBrains/Mellum2-12B-A2.5B-Instruct",
messages=[{"role": "user", "content": "Write a Python function to reverse a string."}],
max_tokens=81920,
temperature=0.6,
top_p=0.95,
extra_body={"top_k": 20},
)
print(resp.choices[0].message.content)
```

## References

- [Model card](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct)
- [vLLM support PR #43992](https://github.com/vllm-project/vllm/pull/43992)
- [Mellum2 Technical Report](https://arxiv.org/abs/2605.31268)
125 changes: 125 additions & 0 deletions models/JetBrains/Mellum2-12B-A2.5B-Thinking.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
meta:
title: "Mellum2-12B-A2.5B-Thinking"
slug: "mellum2-12b-a2.5b-thinking"
provider: "JetBrains"
description: "JetBrains' reasoning-augmented code MoE (12B total / 2.5B active) that emits explicit <think> chains for debugging, planning, and agentic coding"
date_updated: 2026-06-02
difficulty: intermediate
tasks:
- text
performance_headline: "69.9 LiveCodeBench v6, 58.4 AIME — fits on a single GPU"
related_recipes:
- "JetBrains/Mellum2-12B-A2.5B-Instruct"
hardware:
h200: verified

model:
model_id: "JetBrains/Mellum2-12B-A2.5B-Thinking"
min_vllm_version: "nightly"
nightly_required: true
architecture: moe
parameter_count: "12B"
active_parameters: "2.5B"
context_length: 131072
base_args:
- "--max-model-len"
- "131072"
Comment on lines +24 to +26
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Setting the default --max-model-len to the absolute maximum of 131072 in base_args will cause vLLM to attempt to allocate a massive KV cache on startup. For a model of this size (12B total parameters, ~24 GB in bf16), the KV cache for 131k tokens will require an additional ~8 GB+ of VRAM. On GPUs near the minimum VRAM requirement of 29 GB (or even 32GB/40GB GPUs), this will likely result in an Out-Of-Memory (OOM) error during initialization.\n\nConsider setting a more conservative default (e.g., 32768 or 16384) in base_args to ensure the recipe runs out-of-the-box on standard GPUs, and document in the guide that users can scale it up to 131072 if they have higher-end hardware (like an A100 80GB or H100).

  base_args:
    - "--max-model-len"
    - "32768"

base_env: {}

features:
reasoning:
description: "Parse the model's <think>...</think> reasoning blocks (Qwen3-style parser)"
args:
- "--reasoning-parser"
- "qwen3"
tool_calling:
description: "Hermes tool-call parser for function calling"
args:
- "--enable-auto-tool-choice"
- "--tool-call-parser"
- "hermes"

opt_in_features: []

variants:
default:
precision: bf16
vram_minimum_gb: 29
description: "Native bfloat16 weights; fits comfortably on a single H200/H100/A100"

compatible_strategies:
- single_node_tp
- single_node_tep
- single_node_dep

strategy_overrides:
single_node_tp:
tp: 1

guide: |
## Overview

[Mellum2-12B-A2.5B-Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking)
is JetBrains' reasoning-augmented code assistant. It uses a Mixture-of-Experts architecture
with 64 experts (8 activated per token) — 12B total parameters, 2.5B active — combining
sliding-window and full-attention layers for a 131,072-token context. The model emits its
chain-of-thought inside `<think>...</think>` blocks before the final answer, making it
suited to complex debugging, multi-step planning, and agentic workflows. For direct,
low-latency answers without reasoning traces, use the
[Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct) variant instead.

## Prerequisites

- Hardware: a single H200, H100, or A100 (~29 GB at bf16) is plenty
- vLLM **nightly** — `MellumForCausalLM` support landed after v0.22.0 and is not yet in a
stable release. Install the nightly wheels until the next tagged release ships.

### Install vLLM (nightly)

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```

## Launch command

```bash
# With reasoning (recommended for the Thinking checkpoint)
vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking \
--max-model-len 131072 \
--reasoning-parser qwen3

# Add tool calling
vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking \
--max-model-len 131072 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser hermes
```

## Client usage

JetBrains recommends sampling at `temperature=0.6`, `top_p=0.95`, `top_k=20` for the
Thinking checkpoint.

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
model="JetBrains/Mellum2-12B-A2.5B-Thinking",
messages=[{"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."}],
max_tokens=81920,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The max_tokens parameter in the Python client usage example is set to 81920. This is extremely high for a single chat completion response and is likely a typo for 8192 (8k), which is the typical maximum generation length for reasoning models. Setting it excessively high can lead to client-side validation issues or unexpected behavior if the model gets stuck in a loop.

      max_tokens=8192,

temperature=0.6,
top_p=0.95,
extra_body={"top_k": 20},
)
print(resp.choices[0].message.content)
```

## References

- [Model card](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking)
- [vLLM support PR #43992](https://github.com/vllm-project/vllm/pull/43992)
- [Mellum2 Technical Report](https://arxiv.org/abs/2605.31268)
1 change: 1 addition & 0 deletions src/lib/providers.js
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ export const PROVIDERS = {
"stabilityai": { display_name: "Stability AI", logo: "/providers/stabilityai.png" },
"stepfun-ai": { display_name: "StepFun", logo: "/providers/stepfun-ai.png" },
"poolside": { display_name: "Poolside", logo: "/providers/poolside.png" },
"JetBrains": { display_name: "JetBrains", logo: "/providers/JetBrains.png" },
};

export function getProviderLogo(hfOrg) {
Expand Down