Add JetBrains Mellum2 recipes (Thinking + Instruct) by esmeetu · Pull Request #503 · vllm-project/recipes

esmeetu · 2026-06-02T00:56:19Z

Adds vLLM recipes for JetBrains' Mellum2 family — the reasoning-augmented Thinking checkpoint and its direct-answer Instruct sibling (both 12B total / 2.5B active, 64 experts / 8 active, 131K context, bf16).

Shared details

Architecture: MoE (MellumForCausalLM), bf16, ~29 GB — fits on a single H200/H100/A100. single_node_tp defaults to TP=1.
vLLM version: nightly. MellumForCausalLM support merged in vllm-project/vllm#43992 on 2026-06-01, after the latest stable v0.22.0 (2026-05-29), so it is not yet in a tagged release. Both recipes set nightly_required: true.
Tool calling: --enable-auto-tool-choice --tool-call-parser hermes (both).

Thinking vs Instruct

Thinking emits <think>...</think> chains before the answer → adds --reasoning-parser qwen3. Suited to complex debugging, planning, agentic/math-heavy tasks.
Instruct answers directly (no externalized CoT) → no reasoning parser. Lower-latency coding and tool use.
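
The flag differences between the two variants can be sketched as a small command builder. This is only an illustration assembled from the flags named in this description, not the recipe files themselves. The `serve_command` helper and the `MODELS` mapping are hypothetical names, and the sketch assumes both variants use the same `--max-model-len`, which the excerpt shown in this PR confirms only for Thinking.

```python
import shlex

# Model IDs as they appear in the commit titles of this PR.
MODELS = {
    "thinking": "JetBrains/Mellum2-12B-A2.5B-Thinking",
    "instruct": "JetBrains/Mellum2-12B-A2.5B-Instruct",
}


def serve_command(variant: str, max_model_len: int = 131072, tp: int = 1) -> list[str]:
    """Build a `vllm serve` argument list for one Mellum2 variant (hypothetical helper)."""
    cmd = [
        "vllm", "serve", MODELS[variant],
        "--tensor-parallel-size", str(tp),
        "--max-model-len", str(max_model_len),
        # Tool calling is enabled for both variants.
        "--enable-auto-tool-choice",
        "--tool-call-parser", "hermes",
    ]
    if variant == "thinking":
        # Only the Thinking checkpoint emits <think>...</think> blocks.
        cmd += ["--reasoning-parser", "qwen3"]
    return cmd


if __name__ == "__main__":
    for name in MODELS:
        print(shlex.join(serve_command(name)))
```

Keeping the reasoning parser as the only conditional flag mirrors the description: everything else is shared between the two recipes.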

The two recipes cross-link via related_recipes.

🤖 Generated with Claude Code

vercel · 2026-06-02T00:56:24Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
vllm-recipes	Ready	Preview, Comment	Jun 2, 2026 1:03am

gemini-code-assist

Code Review

This pull request adds a configuration file for the new Mellum2-12B-A2.5B-Thinking model by JetBrains and registers JetBrains as a provider. Feedback on the configuration suggests lowering the default --max-model-len from 131072 to 32768 to avoid potential Out-Of-Memory (OOM) errors on standard GPUs. Additionally, it is recommended to correct a likely typo in the Python client example, changing max_tokens from 81920 to 8192.

gemini-code-assist · 2026-06-02T00:57:45Z

+  base_args:
+    - "--max-model-len"
+    - "131072"


Setting the default --max-model-len to the absolute maximum of 131072 in base_args will cause vLLM to attempt to allocate a massive KV cache on startup. For a model of this size (12B total parameters, ~24 GB in bf16), the KV cache for 131k tokens will require an additional ~8 GB+ of VRAM. On GPUs near the minimum VRAM requirement of 29 GB (or even 32 GB/40 GB GPUs), this will likely result in an Out-Of-Memory (OOM) error during initialization.

Consider setting a more conservative default (e.g., 32768 or 16384) in base_args to ensure the recipe runs out-of-the-box on standard GPUs, and document in the guide that users can scale it up to 131072 if they have higher-end hardware (like an A100 80GB or H100).

  base_args:
    - "--max-model-len"
    - "32768"
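
The memory argument in this comment can be checked with back-of-the-envelope arithmetic. The sketch below uses the standard per-token KV-cache estimate (keys plus values, per layer, per KV head). The values in `HYPOTHETICAL_CONFIG` are placeholders chosen for illustration and are not taken from the Mellum2 model config, so the printed sizes show how the estimate scales with `--max-model-len`, not what Mellum2 actually needs.

```python
def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone; bf16 stores 2 bytes per parameter."""
    return num_params * bytes_per_param


def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache for one sequence of seq_len tokens."""
    # Factor 2 covers the key tensor and the value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


# ASSUMPTION: placeholder attention shape, NOT the real Mellum2 configuration.
HYPOTHETICAL_CONFIG = dict(num_layers=32, num_kv_heads=4, head_dim=128)

if __name__ == "__main__":
    print(f"weights (12B params, bf16): {weight_bytes(12_000_000_000) / 1e9:.0f} GB")
    for max_len in (16384, 32768, 131072):
        gib = kv_cache_bytes(max_len, **HYPOTHETICAL_CONFIG) / 2**30
        print(f"max_model_len={max_len:>6}: {gib:.1f} GiB KV cache per full-length sequence")
```

The estimate is linear in sequence length, so dropping from 131072 to 32768 cuts the per-sequence cache requirement to a quarter whatever the real layer and head counts are.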

gemini-code-assist · 2026-06-02T00:57:45Z

+  resp = client.chat.completions.create(
+      model="JetBrains/Mellum2-12B-A2.5B-Thinking",
+      messages=[{"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."}],
+      max_tokens=81920,


The max_tokens parameter in the Python client usage example is set to 81920. This is extremely high for a single chat completion response and is likely a typo for 8192 (8k), which is the typical maximum generation length for reasoning models. Setting it excessively high can lead to client-side validation issues or unexpected behavior if the model gets stuck in a loop.

max_tokens=8192,
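
For context, a complete version of the client call under discussion might look like the sketch below. It assumes a vLLM server is listening on `http://localhost:8000/v1` and that the `openai` Python package is installed. `build_request` and `ask` are hypothetical helper names, and the default of 8192 is the reviewer's suggestion, not a value confirmed by the merged recipe. The name of the field carrying the reasoning text has differed between vLLM versions, so the sketch checks both `reasoning_content` and `reasoning`.

```python
MODEL = "JetBrains/Mellum2-12B-A2.5B-Thinking"


def build_request(prompt: str, max_tokens: int = 8192) -> dict:
    """Assemble keyword arguments for client.chat.completions.create()."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def ask(prompt: str, base_url: str = "http://localhost:8000/v1", max_tokens: int = 8192):
    """Send one chat request and return (reasoning, answer)."""
    # Imported lazily so build_request() can be used without the openai package.
    from openai import OpenAI

    # ASSUMPTION: local server without authentication; the key is a dummy value.
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(**build_request(prompt, max_tokens))
    msg = resp.choices[0].message
    # With a reasoning parser enabled, the chain of thought arrives in a separate field.
    reasoning = getattr(msg, "reasoning_content", None) or getattr(msg, "reasoning", None)
    return reasoning, msg.content


if __name__ == "__main__":
    reasoning, answer = ask("Is 1024 a power of 2? Explain your reasoning.")
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Because the Thinking checkpoint spends tokens on its `<think>` block before answering, `max_tokens` has to cover both the reasoning and the final answer.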

vercel Bot deployed to Preview June 2, 2026 00:57 View deployment

Add JetBrains/Mellum2-12B-A2.5B-Thinking recipe

36214a7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yasong.wang <yasong.wang@inferact.ai>

gemini-code-assist Bot reviewed Jun 2, 2026

esmeetu force-pushed the add-mellum2-12b-thinking branch from 34a2605 to 36214a7 Compare June 2, 2026 00:58

vercel Bot deployed to Preview June 2, 2026 00:59 View deployment

Add JetBrains/Mellum2-12B-A2.5B-Instruct recipe

3e8a469

Direct-answer sibling of the Thinking checkpoint (no reasoning parser). Cross-link the two via related_recipes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yasong.wang <yasong.wang@inferact.ai>

esmeetu changed the title from "Add JetBrains/Mellum2-12B-A2.5B-Thinking recipe" to "Add JetBrains Mellum2 recipes (Thinking + Instruct)" on Jun 2, 2026

vercel Bot deployed to Preview June 2, 2026 01:04 View deployment

esmeetu merged commit dbe9862 into vllm-project:main Jun 2, 2026
4 checks passed

esmeetu merged 2 commits into vllm-project:main from esmeetu:add-mellum2-12b-thinking
