studio: tool calling for DeepSeek (R1/V3/V3.1), GLM 4.x, Kimi K2 on safetensors + MLX by danielhanchen · Pull Request #5624 · unslothai/unsloth

danielhanchen · 2026-05-19T15:05:02Z

DRAFT - do not merge. Stacked on PR #5620 (Llama-3 / Mistral / Gemma 4 + healing parity). Validate against real models before merging.

Adds three more emission-family parsers to tool_call_parser.py so the shared safetensors / MLX / GGUF agentic loop covers the major open-weight reasoning families. Patterns ported from llama.cpp (common/chat-parser.cpp legacy pre-PEG branch), vLLM (tool_parsers/deepseekv3*, glm4_moe, kimi_k2), and SGLang (function_call/deepseekv31_detector, glm4_moe_detector, kimik2_detector). All three references are MIT (llama.cpp) or Apache-2.0 (vLLM, SGLang).

Formats covered:

  DeepSeek R1
      <｜tool▁calls▁begin｜><｜tool▁call▁begin｜>function
      <｜tool▁sep｜>NAME\n```json\n{...}\n```<｜tool▁call▁end｜>
      <｜tool▁calls▁end｜>
    -- args wrapped in a Markdown json fence, ``function`` literal prefix per llama.cpp common_chat_parse_deepseek_r1 (chat-parser.cpp:801-820)

  DeepSeek V3/V3.1
      <｜tool▁calls▁begin｜><｜tool▁call▁begin｜>NAME
      <｜tool▁sep｜>{json}<｜tool▁call▁end｜><｜tool▁calls▁end｜>
    -- bare JSON, no code fence, no ``function`` prefix per llama.cpp common_chat_parse_deepseek_v3_1 (chat-parser.cpp:822-879)

  GLM 4.5/4.6/4.7
      <tool_call>NAME\n<arg_key>k1</arg_key>
      \n<arg_value>v1</arg_value>...</tool_call>
    -- strings raw, non-strings JSON-encoded per chat_template.jinja; multi-call is back-to-back blocks. Per llama.cpp common_chat_parse_glm_4_5 (chat-parser.cpp:1040-1052)

  Kimi K2
      <|tool_calls_section_begin|><|tool_call_begin|>
      functions.NAME:IDX<|tool_call_argument_begin|>{json}
      <|tool_call_end|><|tool_calls_section_end|>
    -- bare name recovered by stripping ``functions.`` prefix and ``:IDX`` suffix; full id preserved as tool_calls[i].id so the roundtrip replays verbatim. Per llama.cpp common_chat_parse_kimi_k2 (chat-parser.cpp:896-913)

Marker collisions

GLM uses the same ``<tool_call>`` opener as Qwen but with a bare function name + ``<arg_key>`` body (Qwen has ``\s*{`` after the tag). The dispatch keeps Qwen first; Qwen's _TC_JSON_START_RE returns no matches on a GLM emission, so the fall-through to _parse_glm_tool_calls handles it correctly. Existing Qwen tests confirm zero regression.

Streaming buffer

TOOL_XML_SIGNALS extended from 5 markers to 12 so the BUFFERING state machine wakes on every new family's section opener. Added the DeepSeek alternative markers (ASCII underscores, short ``<｜tool▁calls｜>`` form) because real checkpoints emit those variants.

Strip patterns

_TOOL_CLOSED_PATS adds DeepSeek envelope (``<｜tool▁calls▁begin｜>... <｜tool▁calls▁end｜>``) and Kimi section (``<|tool_calls_section_begin|> ...<|tool_calls_section_end|>``). _TOOL_ALL_PATS adds the same plus the unclosed-tail variants so a truncated stream does not leak markup.

Route gate

_detect_safetensors_features._PARSER_MARKERS grows to include DeepSeek and Kimi markers plus ``<arg_key>`` (the unique GLM signal). _TOOL_XML_RE (the route-layer markup-strip regex) gets DeepSeek and Kimi closed-pair patterns. _TOOL_TEMPLATE_MARKERS in llama_cpp.py adds ``message['role'] == 'tool'``, ``message['tool_calls']``, and ``tool_calls is defined`` so the classifier recognises DeepSeek's subscripted-access template style (it has no top-level ``{% if tools %}`` block).

Tests (39 new):

- TestParserDeepSeek (7) -- R1 fence, short-form opener, V3.1 bare, multi-call, with-reasoning, strip, signal-wakes-streaming
- TestParserGLM (6) -- single, mixed types, multi-call, unclosed-heal, no-Qwen-regression, strip
- TestParserKimi (6) -- single, multi-call, dotted-name, unclosed, strip, signal-wakes-streaming
- TestParserCrossFormatRouting (2) -- dispatch routing, signal coverage
- TestLoopBasic loop integration (3) -- DeepSeek / GLM / Kimi end-to-end
- Capability advertise (3) -- DeepSeek / GLM / Kimi templates flip supports_tools=True

All 398 targeted tests pass locally (115 safetensors + 27 capability + rest of tool / inference / sandbox / model-config suites).

Builds on PR #5620 (parser + healing parity for Llama-3 / Mistral / Gemma 4); will rebase cleanly onto main once #5620 lands. PR opened as draft - do not merge until validated against real models for each family.

Sources

- llama.cpp common/chat-parser.cpp lines 801-913, 1040-1052 (MIT)
- vLLM vllm/tool_parsers/deepseekv31_tool_parser.py (Apache-2.0)
- vLLM vllm/tool_parsers/glm4_moe_tool_parser.py (Apache-2.0)
- vLLM vllm/tool_parsers/kimi_k2_tool_parser.py (Apache-2.0)
- SGLang python/sglang/srt/function_call/{deepseekv31,glm4_moe,kimik2}_detector.py (Apache-2.0)
- Live chat templates: deepseek-ai/DeepSeek-V3.1, zai-org/GLM-4.6, moonshotai/Kimi-K2-Instruct, unsloth/DeepSeek-V3-0324, unsloth/GLM-4.5-Air, unsloth/Kimi-K2-Instruct
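
To make the Kimi K2 shape above concrete, here is a minimal sketch of how closed calls could be pulled out of a section and how the bare name could be recovered from the ``functions.NAME:IDX`` id while the full id is kept. It is not the PR's implementation: the function name ``parse_kimi_calls``, both regexes, and the returned dict layout are assumptions made for illustration, and truncated or unclosed calls are ignored here.

```python
import json
import re

# Hypothetical patterns for illustration only; the PR's real regexes may differ.
_KIMI_CALL_RE = re.compile(
    r"<\|tool_call_begin\|>\s*(?P<id>[^<]+?)\s*"
    r"<\|tool_call_argument_begin\|>\s*(?P<args>\{.*?\})\s*"
    r"<\|tool_call_end\|>",
    re.DOTALL,
)
_KIMI_ID_RE = re.compile(r"^functions\.(?P<name>.+):\d+$")


def parse_kimi_calls(text):
    """Return a list of {id, name, arguments} dicts for closed Kimi K2 calls."""
    calls = []
    for match in _KIMI_CALL_RE.finditer(text):
        full_id = match.group("id")
        id_match = _KIMI_ID_RE.match(full_id)
        if id_match is None:
            # Ids that do not follow functions.NAME:IDX are skipped in this sketch.
            continue
        try:
            arguments = json.loads(match.group("args"))
        except json.JSONDecodeError:
            # Skip a malformed call rather than abandoning the whole section.
            continue
        calls.append(
            {"id": full_id, "name": id_match.group("name"), "arguments": arguments}
        )
    return calls
```

Keeping ``id`` verbatim is what lets a later turn replay the call exactly as the model emitted it, while ``name`` is what a tool registry would look up.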

Mirrors PR unslothai#5624: three more emission-family parsers for the shared tool_call_parser plus CI updates that exercise the new fixtures cross-OS.

- DeepSeek R1 / V3 / V3.1: <｜tool▁calls▁begin｜>...<｜tool▁sep｜>...
- GLM 4.5 / 4.6 / 4.7: <tool_call>NAME\n<arg_key>K</arg_key>\n
  <arg_value>V</arg_value>...</tool_call>
- Kimi K2 / Moonshot: <|tool_calls_section_begin|>...<|tool_call_argument_begin|>...

Ported from llama.cpp common/chat-parser.cpp lines 801-913, 1040-1052 (MIT), vLLM tool_parsers/{deepseekv31, glm4_moe, kimi_k2}_tool_parser.py (Apache-2.0), and SGLang function_call/{deepseekv31, glm4_moe, kimik2}_detector.py (Apache-2.0).

CI multi-format probe extended from 9 to 13 fixtures so all four new families run on ubuntu / macos-14 / windows.

gemini-code-assist

Code Review

This pull request adds tool call parsing support for DeepSeek (R1, V3, V3.1), GLM (4.5, 4.6, 4.7), and Kimi K2 models. The implementation includes new regex patterns, specialized parsing functions, and updates to the inference routing logic and test suites. Reviewers suggested making the DeepSeek R1 parser more robust by handling leading whitespace before JSON blocks and recommended moving the ast import to the top level to avoid performance overhead during streaming inference.

gemini-code-assist · 2026-05-19T15:13:12Z

+        json_start = m.end()
+        # Walk a balanced ``{`` even if the trailing fence is truncated.
+        if json_start >= len(body) or body[json_start] != "{":


The DeepSeek R1 path should skip any leading whitespace between the Markdown code fence and the start of the JSON object, similar to the V3 and Kimi parsers. This makes the parser more robust against variations in model output formatting.

Suggested change

-        json_start = m.end()
-        # Walk a balanced ``{`` even if the trailing fence is truncated.
-        if json_start >= len(body) or body[json_start] != "{":
+        json_start = m.end()
+        # Skip any whitespace before the JSON.
+        while json_start < len(body) and body[json_start] in " \t\n\r":
+            json_start += 1
+        # Walk a balanced { even if the trailing fence is truncated.
+        if json_start >= len(body) or body[json_start] != "{":
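
For context, the whitespace skip and the "balanced ``{``" walk mentioned in the comment could look roughly like the sketch below. The helper name ``extract_balanced_json`` and its string-aware brace counting are assumptions for illustration; the PR's own walker is not shown on this page.

```python
def extract_balanced_json(body, start):
    """Return the balanced {...} substring at or after `start`, or None.

    Hypothetical helper: skips leading whitespace, then counts braces while
    ignoring any that appear inside JSON string literals.
    """
    n = len(body)
    while start < n and body[start] in " \t\n\r":
        start += 1
    if start >= n or body[start] != "{":
        return None

    depth = 0
    in_string = False
    escaped = False
    for i in range(start, n):
        ch = body[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return body[start : i + 1]
    # The object never closed (for example, the stream was cut off).
    return None
```

Because the walk stops at the matching ``}``, a missing or truncated closing fence after the object does not affect the result.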

gemini-code-assist · 2026-05-19T15:13:12Z

+            except (json.JSONDecodeError, ValueError):
+                pass
+            try:
+                import ast as _ast


Importing ast inside nested loops (while and for) can impact performance, especially since this parser is called frequently during streaming inference. Move the import to the top of the file with the other imports.

Four fixes addressing review of the parent commit:

1. GLM <arg_value> coercion: tighten the json.loads -> ast.literal_eval -> raw cascade to only deserialize when the body unambiguously looks like a JSON literal (object, array, JSON-encoded string, true/false/null, or numeric). Strings like ``True`` / ``None`` (Python literals, not JSON) and arbitrary prose now stay raw. The bare-numeric / bare-boolean ambiguity with string args remains an inherent limitation of the template without schema access -- documented in the new comment. Drops the ast import entirely (closes Gemini's :1036 suggestion).

2. Kimi K2 bare-counter ids (e.g. ``<|tool_call_begin|>3``) are now dropped rather than surfaced as a tool literally named "3". Matches vLLM behaviour; SGLang's schema-infer fallback is out of scope at the parse site. Real Kimi K2 emissions use ``functions.NAME:IDX`` so this is the exception path.

3. Restore the elaborate ``<|python_tag|>(?:[^<]|<(?!\|))*`` clause in routes.inference._TOOL_XML_RE -- the simpler ``[^\n<]*`` form regressed PR #5620's multi-line / literal-``<`` python_tag fix. Restore ``TestRoutesPythonTagStrip`` (8 tests) adapted to call ``_TOOL_XML_RE.sub`` directly since the ``_strip_tool_xml`` helper was inlined this PR.

4. Add the spaced and backslash-escaped DeepSeek opener variants (``<｜tool calls begin｜>``, ``<｜tool\_calls\_begin｜>``) to ``TOOL_XML_SIGNALS`` for streaming-gate parity with ``_DEEPSEEK_BEGIN_RE``.

Also updates the llama.cpp / vLLM citations in the parser docstrings: ``common/chat-parser.cpp`` was split into ``common/chat.cpp`` + ``common/chat-peg-parser.cpp`` by llama.cpp PR #18675, and vLLM moved the tool parsers from ``vllm/entrypoints/openai/tool_parsers/`` to ``vllm/tool_parsers/``. Pin to pre-refactor commit ``51fa458a92d6`` where the cited line numbers still resolve.

New regression tests in ``test_pr5624_regressions.py`` cover the GLM coercion heuristic shapes, GLM literal-``<`` in arg_value, Kimi K2 dotted name, Kimi K2 bare-counter drop, DeepSeek V3.1 truncated mid-stream, and routes-layer strip across all three new families.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py studio/backend/tests/test_pr5624_regressions.py -q
  -> 170 passed in 1.91s
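
The GLM coercion rule in fix 1 can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's code: the name ``coerce_glm_arg_value`` and the number regex are hypothetical, and the value is returned unchanged whenever it does not decode.

```python
import json
import re

# JSON number grammar (no leading zeros, optional fraction and exponent).
_JSON_NUMBER_RE = re.compile(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?")


def coerce_glm_arg_value(raw):
    """Decode `raw` only when it clearly looks like a JSON literal."""
    probe = raw.strip()
    looks_like_json = (
        probe[:1] in ("{", "[", '"')
        or probe in ("true", "false", "null")
        or _JSON_NUMBER_RE.fullmatch(probe) is not None
    )
    if not looks_like_json:
        # Python literals such as True / None and ordinary prose stay raw.
        return raw
    try:
        return json.loads(probe)
    except json.JSONDecodeError:
        return raw
```

As the commit message notes, a bare ``42`` or ``true`` that the schema meant as a string is still decoded here; without schema access the parser cannot tell the two cases apart.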

…k-glm-kimi

Resolves three conflicts against the updated 5620 base (which itself merged main after main moved on with PRs #5735 / #5775 / #5803 etc. touching the same routes/inference.py and tool_call_parser.py surface):

* studio/backend/core/inference/tool_call_parser.py
  ``_TOOL_CLOSED_PATS``: kept 5624's full set (Mistral pre-v11 array, Mistral v11+ name{json}, DeepSeek envelope, Kimi section) on top of the 3-pattern base. New 5620 base reverted to the 3 base patterns because main never carried the tool-format extensions.

* studio/backend/routes/inference.py
  Merged the two regex bodies: kept 5624's elaborate python_tag ``(?:[^<]|<(?!\|))*`` clause and the new-family closed-pair patterns (DeepSeek envelope, Kimi section). DROPPED the inlined Mistral patterns in favour of the base's ``_strip_tool_xml`` helper which delegates Mistral handling to the parser module's ``_strip_mistral_closed_calls`` -- the non-greedy ``\{.*?\}`` form truncates at the first ``}`` of a nested JSON arg, so balanced brace/bracket scanning is correct here. Also kept the base's orphan-close and tail-only ``</parameter>`` patterns from the speculative-buffer split boundary work. Net: 9 call sites continue to use ``_strip_tool_xml(...)`` (Mistral-safe).

* studio/backend/tests/test_safetensors_tool_loop.py
  ``TestRoutesPythonTagStrip``: kept the base's wording for the section header (the two were near-identical) and switched the helper ``_strip`` back to ``_strip_tool_xml`` since the helper is restored.

Also retargeted ``test_pr5624_regressions.py``'s routes-layer strip tests to ``_strip_tool_xml`` for consistency with the restored helper.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py studio/backend/tests/test_pr5624_regressions.py -q
  -> 170 passed in 1.93s
  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2034 passed, 15 failed (pre-existing on the 5620 base; same set as before the merge: test_training_worker_flash_attn, test_desktop_auth, test_studio_api integration shims).

for more information, see https://pre-commit.ci

…k-glm-kimi

Second merge pass to absorb the base's "tighten verbose comments in tool-call parser sections" commit plus pre-commit auto-fixes that landed on both sides since the previous merge.

Conflicts:

* studio/backend/core/inference/tool_call_parser.py
  - Regex prelude: kept 5624's per-family blocks (DeepSeek, GLM, Kimi) and trimmed comments to the base's tightened style for Gemma 4. New families need the explicit constants regardless of comment density.
  - parse_tool_calls_from_text: dropped the base's compact for-loop over the 5 original parsers (it would have double-run Qwen and function_xml ahead of the explicit chain). Kept 5624's interleaved dispatch (DeepSeek/Kimi first, Qwen, GLM-after-Qwen, function_xml, python_tag, Mistral, Gemma, bare-JSON fallback), with tightened single-line per-family comments.

* studio/backend/routes/inference.py
  - _PARSER_MARKERS comment: collapsed to base's tighter style while keeping the per-family marker list so the next reader knows what triggers the pill.
  - _TOOL_XML_RE comment block: same compression, but the four speculative-buffer leak shapes and the Mistral balanced-brace delegation note are preserved.
  - _TOOL_XML_RE pattern list: kept all six 5624 alternations (python_tag elaborate, DeepSeek envelope, Kimi section, plus the base's orphan-close and tail-only </parameter>) in the order that matches the comment block.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py studio/backend/tests/test_pr5624_regressions.py -q
  -> 170 passed in 2.22s
  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2034 passed, 15 failed (same pre-existing CI gaps on the base: test_training_worker_flash_attn, test_desktop_auth, test_studio_api integration shims).

Two fixes surfaced by triple-confirm verification against the live HF chat templates and upstream llama.cpp / vLLM / SGLang parsers.

1. GLM 4.7 silent drop

   ``zai-org/GLM-4.7/chat_template.jinja`` line 65 uses ``{{- '<tool_call>' + tc.name -}}`` which Jinja strips trailing whitespace from, so the first ``<arg_key>`` follows the function name with NO ``\n`` between them. Real emissions look like ``<tool_call>get_weather<arg_key>city</arg_key><arg_value>London </arg_value></tool_call>``. The previous ``_GLM_TC_OPEN_RE`` ended the name with ``\n`` so GLM-4.7 calls were silently dropped (parser returned ``[]``).

   Fix: relax the name terminator to a lookahead that accepts EITHER ``\n`` OR the next ``<arg_key>``:

       _GLM_TC_OPEN_RE = re.compile(
           r"<tool_call>\s*([^\n<{][^\n<]*?)\s*(?=\n|<arg_key>)"
       )

   The first-char restriction ``[^\n<{]`` still excludes Qwen's ``<tool_call>{json}`` form so the Qwen-vs-GLM dispatch remains mutually exclusive.

2. Kimi multi-section parity with vLLM / SGLang

   ``vllm/tool_parsers/kimi_k2_tool_parser.py`` and SGLang's ``kimik2_detector.py`` both use ``re.findall`` and so collect every ``<|tool_calls_section_begin|>...<|tool_calls_section_end|>`` block in a single stream. The previous implementation stopped at the first ``<|tool_calls_section_end|>``. Kimi K2 doesn't emit multi-section in practice, but parity is cheap.

   Fix: wrap the existing per-call body parser in an outer loop that advances past each ``<|tool_calls_section_end|>`` and continues to the next ``<|tool_calls_section_begin|>``. Body parsing extracted to ``_parse_kimi_section_body`` for clarity. Truncated final section is still surfaced via the existing in-body balanced-brace walk.

Verified independently against the live HF templates:

* GLM-4.7 emission constructed from the live template parses to the expected ``{name, arguments}`` shape.
* GLM-4.5 / 4.6 newline shape continues to parse (the lookahead also matches ``\n``).
* Qwen ``<tool_call>{json}`` still dispatches to the Qwen path -- the first-char restriction stops the GLM regex from biting JSON bodies.
* Kimi two-section stream surfaces both calls in order with full ids preserved.
* Bare-counter Kimi ids still drop.

Tests added in ``test_pr5624_regressions.py``:

* ``test_glm_4_7_no_newlines_between_name_and_arg_key``
* ``test_glm_4_7_no_newlines_multi_call``
* ``test_glm_4_7_does_not_break_qwen_path``
* ``test_kimi_two_sections_in_one_stream_both_parse``

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py studio/backend/tests/test_pr5624_regressions.py -q
  -> 174 passed in 1.93s
  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2038 passed, 15 failed (pre-existing CI gaps).
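
The relaxed opener can be exercised directly. The regex below is the one quoted in the commit message; the wrapper ``glm_function_name`` and the three sample strings are illustrative assumptions, not fixtures from the PR.

```python
import re

# Regex as quoted in the commit message above.
_GLM_TC_OPEN_RE = re.compile(
    r"<tool_call>\s*([^\n<{][^\n<]*?)\s*(?=\n|<arg_key>)"
)


def glm_function_name(text):
    """Return the GLM function name from the first opener, or None."""
    match = _GLM_TC_OPEN_RE.search(text)
    return match.group(1) if match else None


# GLM 4.7 shape: no newline between the name and the first <arg_key>.
glm_47 = (
    "<tool_call>get_weather<arg_key>city</arg_key>"
    "<arg_value>London</arg_value></tool_call>"
)
# GLM 4.5 / 4.6 shape: newline after the name.
glm_45 = (
    "<tool_call>get_weather\n<arg_key>city</arg_key>\n"
    "<arg_value>London</arg_value>\n</tool_call>"
)
# Qwen shape: JSON object after the tag, which the first-char class rejects.
qwen = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "London"}}\n</tool_call>'

print(glm_function_name(glm_47))
print(glm_function_name(glm_45))
print(glm_function_name(qwen))
```

The lookahead does not consume the ``\n`` or ``<arg_key>``, so whatever parses the argument pairs can start from the end of the match in both shapes.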


Pure comment / docstring tightening on top of the GLM 4.7 + Kimi multi-section fixes. No behavioural change.

* Drop multi-paragraph prelude and post-refactor citation chatter in the DeepSeek, GLM and Kimi parser docstrings; keep the shape and upstream-commit pin.
* Collapse ``parse_tool_calls_from_text``'s 9 per-family blocks into a single ordered loop with one combined comment.
* Tighten the GLM coercion, Kimi bare-counter and ``_TOOL_XML_RE`` comments to one or two lines each.
* Same trim pass on ``_PARSER_MARKERS`` and the regression-test docstrings.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py studio/backend/tests/test_pr5624_regressions.py -q
  -> 174 passed in 2.00s

Adversarial input ``<｜tool▁calls▁begin｜><｜tool▁call▁begin｜>fn<｜tool▁sep｜>`` followed by a long body that does NOT contain a closing brace caused the V3 path's ``([^\n<]+?)<｜tool▁sep｜>`` regex to backtrack quadratically: at each position the lazy quantifier extends one char at a time looking for a sep that isn't there, taking ~19s on 50k chars.

Replace the regex search with ``str.find`` on the sep marker plus a left-walk to recover the name. ``str.find`` is O(N); the walk stops on ``\n`` (turn boundary), ``<`` (start of a tag), or ``>`` (end of an optional ``<｜tool▁call▁begin｜>`` prefix). Same observable behaviour as the regex on every canonical input.

Tests:
  test_deepseek_v3_1_huge_truncated_body_is_linear (new) -- 50k chars must parse in < 1s.
  pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py studio/backend/tests/test_pr5624_regressions.py -q
  -> 175 passed in 1.97s
  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2038 passed, 15 pre-existing failures unchanged.
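
A minimal sketch of the ``str.find`` plus left-walk approach described above, assuming a hypothetical helper named ``find_name_before_sep``; the PR's actual function and its handling of the alternative marker spellings are not shown here.

```python
SEP = "<｜tool▁sep｜>"


def find_name_before_sep(text, start=0):
    """Locate the next separator and recover the name to its left.

    Returns (name, index_after_sep), or None when there is no separator
    or the recovered name is empty.
    """
    sep_at = text.find(SEP, start)  # single linear scan, no backtracking
    if sep_at == -1:
        return None

    # Walk left until a turn boundary or tag delimiter is reached.
    i = sep_at
    while i > start and text[i - 1] not in "\n<>":
        i -= 1

    name = text[i:sep_at].strip()
    if not name:
        return None
    return name, sep_at + len(SEP)
```

When the separator is absent the function returns after one pass over the input, which is the property the new 50k-character regression test checks for.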


The JSON sub-path of ``_parse_llama3_python_tag`` was fabricating ``{"value": args}`` when the model emitted a non-dict / non-string ``arguments`` value (e.g. ``42``, ``[1,2,3]``, ``null``, ``true``). This silently turned a malformed emission into a real tool call, which the agentic loop would then execute with arguments the model never intended.

Tightened: skip the call instead of fabricating. This now matches the bare-JSON guard tightened earlier (strict-guard merge from PR #5620, inherited via merge here). Added a regression test covering the four non-dict / non-string shapes. Pass count on this branch: 158 -> 159.

Sites in ``_parse_tool_call_json`` and ``_consume_mistral_call`` keep the existing looser behaviour for now; both are reached only after explicit ``<tool_call>`` / ``[TOOL_CALLS]`` markers so the false-positive surface there is much narrower.
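
The guard can be illustrated with a small sketch. The helper ``normalise_arguments`` is hypothetical, and its treatment of string values (decode as JSON, accept only an object) is an assumption; the commit message only states that non-dict / non-string values now cause the call to be skipped.

```python
import json


def normalise_arguments(arguments):
    """Return a dict of arguments, or None when the call should be skipped."""
    if isinstance(arguments, dict):
        return arguments
    if isinstance(arguments, str):
        # Assumption for this sketch: a string must hold a JSON object.
        try:
            decoded = json.loads(arguments)
        except json.JSONDecodeError:
            return None
        return decoded if isinstance(decoded, dict) else None
    # Numbers, lists, null and booleans are never wrapped into {"value": ...}.
    return None
```

A caller would drop the tool call whenever this returns ``None`` instead of executing it with arguments the model never produced.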


…a.cpp

Four GGUF-parity fixes for the GLM and Kimi K2 families:

- GLM 4.7 zero-argument inline call <tool_call>name</tool_call> was dropped: the open-tag lookahead only allowed \n or <arg_key> after the name. Allow </tool_call> too so a no-arg call parses to empty args (vLLM / SGLang / llama.cpp all parse it).
- GLM string argument values were stripped, losing significant leading / trailing whitespace in code / diff arguments. Keep the raw value for the string fallback and only strip the copy used to probe for a JSON literal, matching vLLM glm4_moe which never strips string args.
- Kimi K2 calls emitted without the <|tool_calls_section_begin|> wrapper were dropped. llama.cpp makes the section optional (Kimi can call a tool straight after reasoning without opening a section); parse a bare <|tool_call_begin|> when no section is present.
- Kimi K2 malformed / truncated JSON in one call dropped every later call in the section. Skip the bad call and keep parsing so valid subsequent calls are recovered (vLLM parity).

Adds regression tests for all four.


danielhanchen · 2026-05-31T15:47:16Z

Validated this end to end on 4 B200 GPUs, one Studio per model family, running each safetensors model and its unsloth UD-Q4_K_XL GGUF through the same 6 prompts (weather, GitHub issue search, Python math, latest PyTorch version, Hacker News top story, hash plus date math). Identical generation params on both backends (temperature 0.7, top_p 0.8, top_k 20, min_p 0.0), 3 seeds each, so 18 runs per variant. Tool calls counted from the tool_start stream events.

Results, safetensors vs GGUF, tools fired out of 18:

Llama-3.1-8B: 18/18 vs 18/18, both averaging 3.3 calls. Safetensors passed 5/6 of the deterministic code checks vs 2/6 for the GGUF.
Qwen3-14B: 14/18 vs 15/18, essentially identical.
Mistral-Small-3.2-24B: 0/18 vs 0/18. The model refuses to call tools in both backends with the same "I don't have the capability to access real-time information" answer, so this is model behaviour rather than the parser.
gemma-3-27b-it: 0/18 vs 0/18. gemma-3 has no tool-calling chat template on either side, so neither backend can fire.

So on the families whose chat template actually advertises tools, the safetensors path now matches the GGUF. Where safetensors does not fire, the GGUF does not fire either, so there is no remaining backend gap.

I also cross-checked each family's parsing against llama.cpp common/chat.cpp, vLLM tool_parsers, and SGLang function_call. The fixes layered on top of the base parser:

Llama-3.2 bare-JSON form {"name": ..., "parameters": ...} now fires even with no XML signal, since the skip-special-token stream drops <|python_tag|>.
Mistral [CALL_ID] marker skipped and [THINK]...[/THINK] stripped before parsing.
<function name="..."> attribute form recognised as a tool signal and stripped.
GLM zero-argument inline calls parsed, and GLM string-value whitespace preserved.
Kimi K2 calls parsed without the section wrapper, with malformed-JSON recovery continuing to later calls.

All of these pass on the safetensors tool-loop suite.

I confirmed the same behaviour in the Studio UI. On Qwen3-14B the "Search the web for the latest stable PyTorch release, then compute the sum of primes below 1000" prompt shows a "2 tool calls" badge, real web_search source chips, and the correct python result of 76,127. On Llama-3.1-8B the weather plus Fibonacci prompt returns real web_search source chips and runs the python tool.

One follow-up worth noting separately, unrelated to this parser: for Mistral the Studio swaps in the Unsloth mistral chat template, which does not render the tools schema (the model's native template does), so the safetensors path never advertises tools to Mistral. It does not change the comparison here since the GGUF also fires 0, but it is the reason a fix to Mistral tool calling will need the native template, not just the parser.

…ripped parser)

Gemma-4 safetensors fired no tools while its GGUF fired reliably. Three gaps:

- The Studio swaps in the Unsloth "gemma-4" chat template, which does not render the tools schema (the model's native template does), so the model never saw the tools. Fall back to the model's native template when the override template renders identically with and without tools. Same fix helps any family whose override template drops tools.
- skip_special_tokens strips the <|tool_call> wrapper and <|"|> string markers, so a streamed Gemma-4 call arrives as a bare call:NAME{k:v, ...} with unquoted values. Parse that form, keeping commas/braces inside a code or command value, normalising surrounding quotes, and stripping the leaked markup from the final answer.
- Without a grammar a small model can loop, repeating one call for the whole tool budget. Collapse exact-duplicate calls within a turn and force a final answer after a turn that made no new tool progress (llama-server's lazy grammar prevents this loop on the GGUF side).

Adds parser tests for the bare/stripped Gemma-4 form.


danielhanchen · 2026-06-01T02:38:42Z

Added gemma-4-E4B-it to the comparison, and this one was a real case of the GGUF being better: safetensors fired 0/18 while the GGUF fired 18/18. Three gaps, all fixed in 4127d1f:

The Studio swaps in the Unsloth gemma-4 chat template, which does not render the tools schema (the model's native template does), so the model never saw the tools. The text path now falls back to the native template when the override renders identically with and without tools. This also covers any other family whose override template drops tools (Mistral has the same template issue).
skip_special_tokens strips the <|tool_call> wrapper and the <|"|> string markers, so a streamed call arrives as a bare call:NAME{k:v, ...} with unquoted values. The parser now reads that form, keeping commas and braces inside a code value, normalising quotes, and stripping the leaked markup from the final answer.
Without a grammar the small E4B model loops on one call. The agentic loop now collapses exact-duplicate calls in a turn and forces a final answer after a turn that made no new progress. llama-server's lazy grammar prevents this loop on the GGUF side.
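
As a rough illustration of the duplicate-collapse step in the third point, the sketch below keeps the first occurrence of each exact (name, arguments) pair within a turn. The function name ``collapse_duplicate_calls`` and the call dict layout are assumptions, not the PR's code, and the "force a final answer" logic is not covered.

```python
import json


def collapse_duplicate_calls(calls):
    """Drop exact repeats of a tool call within one turn, preserving order."""
    seen = set()
    unique = []
    for call in calls:
        # Serialise arguments with sorted keys so key order does not matter.
        key = (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))
        if key in seen:
            continue
        seen.add(key)
        unique.append(call)
    return unique
```

Only exact duplicates are collapsed, so two searches with different queries in the same turn would both still run.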

After the fix, safetensors vs GGUF over 18 runs each (same prompts, same generation params):

tool fire rate: 18/18 (was 0/18) vs 18/18
avg tool calls: 2.3 vs 2.2
tool types: web_search 33, python 8 vs web_search 32, python 7
code-answer correct: 0/6 vs 4/6
avg latency: 81s vs 8s

Tool firing now matches the GGUF in both rate and the count and kind of tools called. Two gaps remain and are structural to running a small model without llama.cpp's grammar: the E4B model relays computed values into its final answer less reliably (0/6 vs 4/6 on the deterministic code checks) and takes more turns, so it is slower. In the Studio UI the model fires web_search against real weather sources and runs python, returning "The 30th Fibonacci number is: 832040".

Parser tests for the bare and stripped Gemma-4 form are included. The safetensors tool-loop suite is at 157 passing.

danielhanchen · 2026-06-01T05:11:52Z

One more validation, this time on a new MoE: Qwen3.6-35B-A3B safetensors vs its MTP-GGUF (unsloth/Qwen3.6-35B-A3B-MTP-GGUF, UD-Q4_K_XL). No code change needed here, the multi-format parser and the qwen3-thinking template already handle it, so I am just posting the numbers for the record.

Architecture is qwen3_5_moe (35B total, ~3B active, thinking). safetensors is 71.9 GB and the MTP-GGUF is 22.9 GB, both under the 90 GB budget. The MTP (multi-token-prediction) variant loads and runs fine through llama-server. Both the native and the Unsloth qwen3-thinking override template render tools (Hermes <tools> system block plus <tool_call>{json}). Ran at 32k context since the reasoning plus full-page web_search results overflow 8k.

safetensors vs MTP-GGUF, 18 runs each, same prompts and params:

tool fire rate: 18/18 vs 18/18
avg tool calls: 2.5 vs 4.2
tool types: web_search 37, python 8 vs web_search 69, python 6
code-answer correct: 6/6 vs 3/6
avg latency: 165s vs 10s (the GGUF gets MTP speculative decoding, around 190 tok/s)

Both fire on every question, so there is no safetensors gap here. If anything the safetensors path is the more accurate one, passing all 6 deterministic code checks vs 3/6 for the GGUF, and it calls fewer tools (the MTP-GGUF over-searches, averaging about 11 web_search calls on the GitHub-issues question). Confirmed in the Studio UI: the MTP-GGUF fires web_search against real PyTorch sources and runs python, returning the correct sum of primes below 1000 (76,127).

gemini-code-assist Bot reviewed May 19, 2026


danielhanchen and others added 16 commits May 27, 2026 12:06

[pre-commit.ci] auto fixes from pre-commit.com hooks

4983ffe

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

af9e466

for more information, see https://pre-commit.ci

Merge branch 'studio-tools-multi-format-v2' into studio-tools-deepsee…

a41f8ad

…k-glm-kimi

[pre-commit.ci] auto fixes from pre-commit.com hooks

4a36288

for more information, see https://pre-commit.ci

Merge remote-tracking branch 'origin/studio-tools-multi-format-v2' in…

eafc1e5

…to pr-5624-head

Merge remote-tracking branch 'origin/studio-tools-multi-format-v2' in…

67c0951

…to studio-tools-deepseek-glm-kimi

Merge branch 'studio-tools-multi-format-v2' into studio-tools-deepsee…

60d6861

…k-glm-kimi

Merge branch 'studio-tools-multi-format-v2' into studio-tools-deepsee…

6328614

…k-glm-kimi

danielhanchen mentioned this pull request May 31, 2026

studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX #5620

Draft

danielhanchen and others added 2 commits June 1, 2026 02:01

[pre-commit.ci] auto fixes from pre-commit.com hooks

3f3cb36

for more information, see https://pre-commit.ci

danielhanchen wants to merge 19 commits into studio-tools-multi-format-v2 from studio-tools-deepseek-glm-kimi
