Add JetBrains Mellum2 recipes (Thinking + Instruct) #503
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yasong.wang <yasong.wang@inferact.ai>
Code Review
This pull request adds a configuration file for the new Mellum2-12B-A2.5B-Thinking model by JetBrains and registers JetBrains as a provider. Feedback on the configuration suggests lowering the default --max-model-len from 131072 to 32768 to avoid potential Out-Of-Memory (OOM) errors on standard GPUs. Additionally, it is recommended to correct a likely typo in the Python client example, changing max_tokens from 81920 to 8192.
base_args:
- "--max-model-len"
- "131072"
Setting the default --max-model-len to the absolute maximum of 131072 in base_args will cause vLLM to attempt to allocate a massive KV cache on startup. For a model of this size (12B total parameters, ~24 GB in bf16), the KV cache for 131k tokens will require an additional ~8 GB+ of VRAM. On GPUs near the minimum VRAM requirement of 29 GB (or even 32 GB/40 GB GPUs), this will likely result in an Out-Of-Memory (OOM) error during initialization.

Consider setting a more conservative default (e.g., 32768 or 16384) in base_args to ensure the recipe runs out-of-the-box on standard GPUs, and document in the guide that users can scale it up to 131072 if they have higher-end hardware (like an A100 80GB or H100).
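To make the memory argument concrete, here is a rough sketch of how the KV cache for one full-length sequence scales with --max-model-len. The layer count, KV head count, and head dimension below are placeholder values chosen for illustration, not read from the Mellum2 config; substitute the real values from the model's config.json before drawing conclusions.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache needed to hold num_tokens for a single sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# HYPOTHETICAL architecture values (assumption, not the actual Mellum2 config).
HYPOTHETICAL = dict(num_layers=32, num_kv_heads=4, head_dim=128)

for max_len in (16384, 32768, 131072):
    gib = kv_cache_bytes(max_len, **HYPOTHETICAL) / 2**30
    print(f"--max-model-len {max_len}: {gib:.1f} GiB per full-length sequence")
```

Under these assumed values the cost is linear in the context length, so dropping from 131072 to 32768 cuts the per-sequence KV footprint by a factor of four.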
base_args:
- "--max-model-len"
- "32768"

resp = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Thinking",
    messages=[{"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."}],
    max_tokens=81920,
The max_tokens parameter in the Python client usage example is set to 81920. This is extremely high for a single chat completion response and is likely a typo for 8192 (8k), which is the typical maximum generation length for reasoning models. Setting it excessively high can lead to client-side validation issues or unexpected behavior if the model gets stuck in a loop.
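The two review comments interact: a completion budget only makes sense relative to the server's context window. The helper below is a minimal sketch of clamping a requested max_tokens so that prompt plus completion fits within --max-model-len. The function name and the prompt token count are hypothetical, and the premise that the server rejects requests exceeding the context window is an assumption worth checking against the vLLM version in use.

```python
def completion_budget(max_model_len, prompt_tokens, requested_max_tokens):
    """Clamp requested_max_tokens so prompt + completion fits in the context window."""
    available = max_model_len - prompt_tokens
    if available <= 0:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested_max_tokens, available)


# Hypothetical prompt length of 1000 tokens (assumption for illustration).
print(completion_budget(131072, 1000, 81920))  # fits unchanged at the full context length
print(completion_budget(32768, 1000, 81920))   # clamped when the context is lowered
```

With the full 131072 context the requested 81920 passes through unchanged, while a 32768 context leaves room for at most 31768 completion tokens after a 1000-token prompt.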
max_tokens=8192,

34a2605 to 36214a7
Direct-answer sibling of the Thinking checkpoint (no reasoning parser). Cross-link the two via related_recipes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: yasong.wang <yasong.wang@inferact.ai>
Adds vLLM recipes for JetBrains' Mellum2 family — the reasoning-augmented Thinking checkpoint and its direct-answer Instruct sibling (both 12B total / 2.5B active, 64 experts / 8 active, 131K context, bf16).
Shared details
MellumForCausalLM), bf16, ~29 GB; fits on a single H200/H100/A100. single_node_tp defaults to TP=1.
MellumForCausalLM support merged in vllm-project/vllm#43992 on 2026-06-01, after the latest stable v0.22.0 (2026-05-29), so it is not yet in a tagged release. Both recipes set nightly_required: true.
--enable-auto-tool-choice --tool-call-parser hermes (both).

Thinking vs Instruct
<think>...</think> chains before the answer → adds --reasoning-parser qwen3. Suited to complex debugging, planning, agentic/math-heavy tasks.

The two recipes cross-link via related_recipes.

🤖 Generated with Claude Code
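As a rough illustration of how the two recipes differ, the sketch below assembles a `vllm serve` command from the flags listed in the description: the tool-calling flags are shared, and only the Thinking variant adds the reasoning parser. The helper name `build_serve_command` is hypothetical, and the actual recipe files may structure their arguments differently.

```python
import shlex

# Flags the PR description lists for both recipes.
COMMON_ARGS = ["--enable-auto-tool-choice", "--tool-call-parser", "hermes"]


def build_serve_command(model_id, thinking, max_model_len=131072):
    """Return the argv list for serving one Mellum2 variant (illustrative only)."""
    cmd = ["vllm", "serve", model_id, "--max-model-len", str(max_model_len)]
    cmd += COMMON_ARGS
    if thinking:
        # Only the Thinking checkpoint emits <think>...</think> chains.
        cmd += ["--reasoning-parser", "qwen3"]
    return cmd


print(shlex.join(build_serve_command("JetBrains/Mellum2-12B-A2.5B-Thinking", thinking=True)))
```

Passing `thinking=False` with the Instruct model id yields the same command minus the `--reasoning-parser qwen3` pair.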