
studio: tool calling for DeepSeek (R1/V3/V3.1), GLM 4.x, Kimi K2 on safetensors + MLX#5624

Draft
danielhanchen wants to merge 19 commits into studio-tools-multi-format-v2 from studio-tools-deepseek-glm-kimi

Conversation

@danielhanchen

Member

DRAFT - do not merge. Stacked on PR #5620 (Llama-3 / Mistral / Gemma 4 + healing parity). Validate against real models before merging.

Adds three more emission-family parsers to tool_call_parser.py so the
shared safetensors / MLX / GGUF agentic loop covers the major open-
weight reasoning families. Patterns ported from llama.cpp
(common/chat-parser.cpp legacy pre-PEG branch), vLLM
(tool_parsers/deepseekv3*, glm4_moe, kimi_k2), and SGLang
(function_call/deepseekv31_detector, glm4_moe_detector, kimik2_detector).
All three references are MIT (llama.cpp) or Apache-2.0 (vLLM, SGLang).

Formats covered:

  DeepSeek R1     <|tool▁calls▁begin|><|tool▁call▁begin|>function
                  <|tool▁sep|>NAME\n```json\n{...}\n```<|tool▁call▁end|>
                  <|tool▁calls▁end|>
                  -- args wrapped in a Markdown json fence, ``function``
                  literal prefix per llama.cpp
                  common_chat_parse_deepseek_r1 (chat-parser.cpp:801-820)

  DeepSeek V3/V3.1
                  <|tool▁calls▁begin|><|tool▁call▁begin|>NAME
                  <|tool▁sep|>{json}<|tool▁call▁end|><|tool▁calls▁end|>
                  -- bare JSON, no code fence, no ``function`` prefix
                  per llama.cpp common_chat_parse_deepseek_v3_1
                  (chat-parser.cpp:822-879)

  GLM 4.5/4.6/4.7 <tool_call>NAME\n<arg_key>k1</arg_key>
                  \n<arg_value>v1</arg_value>...</tool_call>
                  -- strings raw, non-strings JSON-encoded per
                  chat_template.jinja; multi-call is back-to-back
                  blocks. Per llama.cpp common_chat_parse_glm_4_5
                  (chat-parser.cpp:1040-1052)

  Kimi K2         <|tool_calls_section_begin|><|tool_call_begin|>
                  functions.NAME:IDX<|tool_call_argument_begin|>{json}
                  <|tool_call_end|><|tool_calls_section_end|>
                  -- bare name recovered by stripping ``functions.``
                  prefix and ``:IDX`` suffix; full id preserved as
                  tool_calls[i].id so the roundtrip replays verbatim.
                  Per llama.cpp common_chat_parse_kimi_k2
                  (chat-parser.cpp:896-913)
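
The Kimi K2 row above can be illustrated with a small sketch. It is not the
PR's implementation: ``bare_name``, ``parse_kimi_canonical`` and the regex are
hypothetical names, and it assumes a well-formed emission (every call closed,
valid JSON arguments).

```python
import json
import re

# Hypothetical pattern for the canonical, fully closed Kimi K2 call shape.
_KIMI_CALL_RE = re.compile(
    r"<\|tool_call_begin\|>(?P<id>.+?)<\|tool_call_argument_begin\|>"
    r"(?P<args>.+?)<\|tool_call_end\|>",
    re.DOTALL,
)


def bare_name(full_id):
    """Strip the ``functions.`` prefix and the ``:IDX`` suffix from a Kimi id."""
    name = full_id.strip()
    if name.startswith("functions."):
        name = name[len("functions."):]
    # rsplit keeps dotted names such as ``fs.read`` intact.
    return name.rsplit(":", 1)[0]


def parse_kimi_canonical(text):
    """Return one dict per call, keeping the full id for the roundtrip."""
    calls = []
    for m in _KIMI_CALL_RE.finditer(text):
        calls.append(
            {
                "id": m["id"].strip(),
                "name": bare_name(m["id"]),
                "arguments": json.loads(m["args"]),
            }
        )
    return calls
```

Keeping ``id`` separate from ``name`` is what lets the assistant turn be
replayed verbatim while the tool registry is still looked up by bare name.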

Marker collisions

GLM uses the same ``<tool_call>`` opener as Qwen but with a bare
function name + ``<arg_key>`` body (Qwen has ``\s*{`` after the tag).
The dispatch keeps Qwen first; Qwen's _TC_JSON_START_RE returns no
matches on a GLM emission, so the fall-through to
_parse_glm_tool_calls handles it correctly. Existing Qwen tests
confirm zero regression.
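
A minimal sketch of that Qwen-first fall-through, assuming simplified stand-in
regexes (``_QWEN_START_RE``, ``_GLM_BLOCK_RE`` and ``_GLM_PAIR_RE`` are
hypothetical, not the module's real constants) and the newline-separated
GLM 4.5 / 4.6 shape:

```python
import json
import re

# Stand-in patterns for illustration only.
_QWEN_START_RE = re.compile(r"<tool_call>\s*\{")
_GLM_BLOCK_RE = re.compile(
    r"<tool_call>\s*([^\s<{][^\n<]*?)\s*\n(.*?)</tool_call>", re.DOTALL
)
_GLM_PAIR_RE = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)


def parse_qwen(text):
    """Qwen shape: <tool_call> followed by a JSON object."""
    decoder = json.JSONDecoder()
    calls = []
    for m in _QWEN_START_RE.finditer(text):
        try:
            obj, _ = decoder.raw_decode(text, m.end() - 1)  # index of "{"
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


def parse_glm(text):
    """GLM shape: bare name, then <arg_key>/<arg_value> pairs (values kept raw)."""
    calls = []
    for m in _GLM_BLOCK_RE.finditer(text):
        args = {k.strip(): v for k, v in _GLM_PAIR_RE.findall(m.group(2))}
        calls.append({"name": m.group(1), "arguments": args})
    return calls


def parse(text):
    # Qwen first; an empty result falls through to the GLM parser.
    return parse_qwen(text) or parse_glm(text)
```

Because the GLM name pattern refuses a leading ``{``, the two parsers never
both match the same block in this sketch.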

Streaming buffer

TOOL_XML_SIGNALS extended from 5 markers to 12 so the BUFFERING state
machine wakes on every new family's section opener. Added the
DeepSeek alternative markers (ASCII underscores, short ``<|tool▁calls|>``
form) because real checkpoints emit those variants.
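
How a streaming gate might use such a signal list is sketched below. The
signal tuple is an illustrative subset, ``split_safe`` is a hypothetical
helper, and holding back a partial marker at the end of the buffer is an
assumption about the state machine rather than something this PR states.

```python
# Illustrative subset; the real TOOL_XML_SIGNALS has 12 entries.
EXAMPLE_SIGNALS = (
    "<tool_call>",
    "<|tool▁calls▁begin|>",
    "<|tool_calls_section_begin|>",
)


def split_safe(buffer, signals=EXAMPLE_SIGNALS):
    """Return (emit, hold, found).

    emit  -- text that can be streamed to the client now
    hold  -- text that must stay buffered
    found -- True when a complete signal is present in the buffer
    """
    hits = [i for i in (buffer.find(s) for s in signals) if i != -1]
    if hits:
        first = min(hits)
        return buffer[:first], buffer[first:], True
    # No full signal: hold back the longest tail that could still grow into one.
    longest = max(len(s) for s in signals) - 1
    for k in range(min(longest, len(buffer)), 0, -1):
        tail = buffer[-k:]
        if any(s.startswith(tail) for s in signals):
            return buffer[:-k], tail, False
    return buffer, "", False
```

With this shape, adding a family only means adding its opener to the tuple;
the gate logic itself does not change.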

Strip patterns

_TOOL_CLOSED_PATS adds DeepSeek envelope (``<|tool▁calls▁begin|>...
<|tool▁calls▁end|>``) and Kimi section (``<|tool_calls_section_begin|>
...<|tool_calls_section_end|>``). _TOOL_ALL_PATS adds the same plus
the unclosed-tail variants so a truncated stream does not leak
markup.
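
The closed-pair versus unclosed-tail distinction can be sketched like this.
The pattern lists and ``strip_tool_markup`` are hypothetical stand-ins for
_TOOL_CLOSED_PATS / _TOOL_ALL_PATS, covering only the DeepSeek and Kimi
envelopes:

```python
import re

# Closed envelopes: safe to remove at any point in the stream.
_EXAMPLE_CLOSED = [
    re.compile(r"<\|tool▁calls▁begin\|>.*?<\|tool▁calls▁end\|>", re.DOTALL),
    re.compile(
        r"<\|tool_calls_section_begin\|>.*?<\|tool_calls_section_end\|>", re.DOTALL
    ),
]

# Unclosed tails: only removed once the stream has ended.
_EXAMPLE_UNCLOSED_TAIL = [
    re.compile(r"<\|tool▁calls▁begin\|>.*\Z", re.DOTALL),
    re.compile(r"<\|tool_calls_section_begin\|>.*\Z", re.DOTALL),
]


def strip_tool_markup(text, final=False):
    """Remove closed envelopes; on the final pass also drop a truncated tail."""
    for pat in _EXAMPLE_CLOSED:
        text = pat.sub("", text)
    if final:
        for pat in _EXAMPLE_UNCLOSED_TAIL:
            text = pat.sub("", text)
    return text
```

Applying the tail patterns only on the final pass avoids deleting an envelope
that is still being streamed and may yet close.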

Route gate

_detect_safetensors_features._PARSER_MARKERS grows to include
DeepSeek and Kimi markers plus ``<arg_key>`` (the unique GLM signal).
_TOOL_XML_RE (the route-layer markup-strip regex) gets DeepSeek and
Kimi closed-pair patterns. _TOOL_TEMPLATE_MARKERS in llama_cpp.py
adds ``message['role'] == 'tool'``, ``message['tool_calls']``, and
``tool_calls is defined`` so the classifier recognises DeepSeek's
subscripted-access template style (it has no top-level
``{% if tools %}`` block).
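
A substring-based classifier of this kind could look like the sketch below.
``EXAMPLE_TEMPLATE_MARKERS`` and ``template_mentions_tools`` are hypothetical
names, and the real marker list in llama_cpp.py may contain further entries.

```python
# Illustrative marker list built from the strings named above.
EXAMPLE_TEMPLATE_MARKERS = (
    "{% if tools %}",
    "message['role'] == 'tool'",
    "message['tool_calls']",
    "tool_calls is defined",
)


def template_mentions_tools(chat_template):
    """True when the Jinja source contains any known tool-handling marker."""
    return any(marker in chat_template for marker in EXAMPLE_TEMPLATE_MARKERS)
```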

Tests (39 new):

  TestParserDeepSeek  (7) -- R1 fence, short-form opener, V3.1 bare,
                             multi-call, with-reasoning, strip,
                             signal-wakes-streaming
  TestParserGLM       (6) -- single, mixed types, multi-call,
                             unclosed-heal, no-Qwen-regression, strip
  TestParserKimi      (6) -- single, multi-call, dotted-name, unclosed,
                             strip, signal-wakes-streaming
  TestParserCrossFormatRouting (2) -- dispatch routing, signal coverage
  TestLoopBasic loop integration (3) -- DeepSeek / GLM / Kimi end-to-end
  Capability advertise (3) -- DeepSeek / GLM / Kimi templates flip
                             supports_tools=True

All 398 targeted tests pass locally (115 safetensors + 27 capability
+ rest of tool / inference / sandbox / model-config suites). Builds
on PR #5620 (parser + healing parity for Llama-3 / Mistral / Gemma 4);
will rebase cleanly onto main once #5620 lands. PR opened as draft -
do not merge until validated against real models for each family.

Sources

- llama.cpp common/chat-parser.cpp lines 801-913, 1040-1052 (MIT)
- vLLM vllm/tool_parsers/deepseekv31_tool_parser.py (Apache-2.0)
- vLLM vllm/tool_parsers/glm4_moe_tool_parser.py (Apache-2.0)
- vLLM vllm/tool_parsers/kimi_k2_tool_parser.py (Apache-2.0)
- SGLang python/sglang/srt/function_call/{deepseekv31,glm4_moe,kimik2}_detector.py
  (Apache-2.0)
- Live chat templates: deepseek-ai/DeepSeek-V3.1, zai-org/GLM-4.6,
  moonshotai/Kimi-K2-Instruct, unsloth/DeepSeek-V3-0324,
  unsloth/GLM-4.5-Air, unsloth/Kimi-K2-Instruct
danielhanchen added a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
Mirrors PR unslothai#5624: three more emission-family parsers for the shared
tool_call_parser plus CI updates that exercise the new fixtures
cross-OS.

- DeepSeek R1 / V3 / V3.1: <|tool▁calls▁begin|>...<|tool▁sep|>...
- GLM 4.5 / 4.6 / 4.7: <tool_call>NAME\n<arg_key>K</arg_key>\n
  <arg_value>V</arg_value>...</tool_call>
- Kimi K2 / Moonshot: <|tool_calls_section_begin|>...
  <|tool_call_argument_begin|>...

Ported from llama.cpp common/chat-parser.cpp lines 801-913,
1040-1052 (MIT), vLLM tool_parsers/ {deepseekv31, glm4_moe,
kimi_k2}_tool_parser.py (Apache-2.0), and SGLang function_call/
{deepseekv31, glm4_moe, kimik2}_detector.py (Apache-2.0).

CI multi-format probe extended from 9 to 13 fixtures so all four new
families run on ubuntu / macos-14 / windows.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds tool call parsing support for DeepSeek (R1, V3, V3.1), GLM (4.5, 4.6, 4.7), and Kimi K2 models. The implementation includes new regex patterns, specialized parsing functions, and updates to the inference routing logic and test suites. Reviewers suggested making the DeepSeek R1 parser more robust by handling leading whitespace before JSON blocks and recommended moving the ast import to the top level to avoid performance overhead during streaming inference.

Comment on lines +927 to +929
json_start = m.end()
# Walk a balanced ``{`` even if the trailing fence is truncated.
if json_start >= len(body) or body[json_start] != "{":

Contributor

medium

The DeepSeek R1 path should skip any leading whitespace between the Markdown code fence and the start of the JSON object, similar to the V3 and Kimi parsers. This makes the parser more robust against variations in model output formatting.

Suggested change
- json_start = m.end()
- # Walk a balanced ``{`` even if the trailing fence is truncated.
- if json_start >= len(body) or body[json_start] != "{":
+ json_start = m.end()
+ # Skip any whitespace before the JSON.
+ while json_start < len(body) and body[json_start] in " \t\n\r":
+     json_start += 1
+ # Walk a balanced { even if the trailing fence is truncated.
+ if json_start >= len(body) or body[json_start] != "{":

except (json.JSONDecodeError, ValueError):
    pass
try:
    import ast as _ast

Contributor

medium

Importing ast inside nested loops (while and for) can impact performance, especially since this parser is called frequently during streaming inference. Move the import to the top of the file with the other imports.

danielhanchen and others added 16 commits May 27, 2026 12:06
Four fixes addressing review of the parent commit:

1. GLM <arg_value> coercion: tighten the
   json.loads -> ast.literal_eval -> raw cascade to only deserialize
   when the body unambiguously looks like a JSON literal (object,
   array, JSON-encoded string, true/false/null, or numeric). Strings
   like ``True`` / ``None`` (Python literals, not JSON) and arbitrary
   prose now stay raw. The bare-numeric / bare-boolean ambiguity with
   string args remains an inherent limitation of the template without
   schema access -- documented in the new comment. Drops the ast
   import entirely (closes Gemini's :1036 suggestion).

2. Kimi K2 bare-counter ids (e.g. ``<|tool_call_begin|>3``) are now
   dropped rather than surfaced as a tool literally named "3". Matches
   vLLM behaviour; SGLang's schema-infer fallback is out of scope at
   the parse site. Real Kimi K2 emissions use ``functions.NAME:IDX``
   so this is the exception path.

3. Restore the elaborate ``<|python_tag|>(?:[^<]|<(?!\|))*`` clause in
   routes.inference._TOOL_XML_RE -- the simpler ``[^\n<]*`` form
   regressed PR #5620's multi-line / literal-``<`` python_tag fix.
   Restore ``TestRoutesPythonTagStrip`` (8 tests) adapted to call
   ``_TOOL_XML_RE.sub`` directly since the ``_strip_tool_xml`` helper
   was inlined this PR.

4. Add the spaced and backslash-escaped DeepSeek opener variants
   (``<|tool calls begin|>``, ``<|tool\_calls\_begin|>``) to
   ``TOOL_XML_SIGNALS`` for streaming-gate parity with
   ``_DEEPSEEK_BEGIN_RE``.
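
The coercion rule from fix 1 can be sketched as follows. ``coerce_glm_value``
and ``_NUMERIC_RE`` are hypothetical names; the sketch only shows the
"deserialize when it unambiguously looks like JSON, otherwise stay raw" idea
and is not the code in the PR.

```python
import json
import re

# JSON number grammar: optional minus, integer part, optional fraction/exponent.
_NUMERIC_RE = re.compile(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\Z")


def coerce_glm_value(raw):
    """Decode an <arg_value> body only when it looks like a JSON literal."""
    probe = raw.strip()
    looks_json = (
        probe[:1] in ("{", "[", '"')
        or probe in ("true", "false", "null")
        or _NUMERIC_RE.match(probe) is not None
    )
    if not looks_json:
        return raw  # Python literals (True / None) and prose stay raw
    try:
        return json.loads(probe)
    except json.JSONDecodeError:
        return raw  # looked like JSON but was not; fall back to the raw text
```

A bare ``42`` still decodes to a number even when the tool expected the string
"42"; without schema access that ambiguity cannot be resolved at this layer.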

Also updates the llama.cpp / vLLM citations in the parser docstrings:
``common/chat-parser.cpp`` was split into ``common/chat.cpp`` +
``common/chat-peg-parser.cpp`` by llama.cpp PR #18675, and vLLM
moved the tool parsers from ``vllm/entrypoints/openai/tool_parsers/``
to ``vllm/tool_parsers/``. Pin to pre-refactor commit ``51fa458a92d6``
where the cited line numbers still resolve.

New regression tests in ``test_pr5624_regressions.py`` cover the GLM
coercion heuristic shapes, GLM literal-``<`` in arg_value, Kimi K2
dotted name, Kimi K2 bare-counter drop, DeepSeek V3.1 truncated
mid-stream, and routes-layer strip across all three new families.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py
         studio/backend/tests/test_safetensors_capability_advertise.py
         studio/backend/tests/test_pr5624_regressions.py -q
  -> 170 passed in 1.91s
…k-glm-kimi

Resolves three conflicts against the updated 5620 base (which itself
merged main after main moved on with PRs #5735 / #5775 / #5803 etc.
touching the same routes/inference.py and tool_call_parser.py surface):

* studio/backend/core/inference/tool_call_parser.py
  ``_TOOL_CLOSED_PATS``: kept 5624's full set (Mistral pre-v11 array,
  Mistral v11+ name{json}, DeepSeek envelope, Kimi section) on top of
  the 3-pattern base. New 5620 base reverted to the 3 base patterns
  because main never carried the tool-format extensions.

* studio/backend/routes/inference.py
  Merged the two regex bodies: kept 5624's elaborate python_tag
  ``(?:[^<]|<(?!\|))*`` clause and the new-family closed-pair
  patterns (DeepSeek envelope, Kimi section). DROPPED the inlined
  Mistral patterns in favour of the base's ``_strip_tool_xml`` helper
  which delegates Mistral handling to the parser module's
  ``_strip_mistral_closed_calls`` -- the non-greedy ``\{.*?\}`` form
  truncates at the first ``}`` of a nested JSON arg, so balanced
  brace/bracket scanning is correct here. Also kept the base's
  orphan-close and tail-only ``</parameter>`` patterns from the
  speculative-buffer split boundary work. Net: 9 call sites continue
  to use ``_strip_tool_xml(...)`` (Mistral-safe).

* studio/backend/tests/test_safetensors_tool_loop.py
  ``TestRoutesPythonTagStrip``: kept the base's wording for the
  section header (the two were near-identical) and switched the helper
  ``_strip`` back to ``_strip_tool_xml`` since the helper is restored.
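
Why the non-greedy ``\{.*?\}`` form is unsafe for nested arguments, and what a
balanced scan does instead, is shown in the sketch below.
``balanced_json_end`` is a hypothetical helper; the real
``_strip_mistral_closed_calls`` is not reproduced here.

```python
import re


def balanced_json_end(text, start):
    """Return the index just past the brace closing text[start], or -1."""
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Braces inside JSON strings must not affect the depth counter.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return -1  # truncated or unbalanced


sample = 'get_weather{"loc": {"city": "Paris"}, "unit": "C"} trailing'
start = sample.index("{")
lazy = re.search(r"\{.*?\}", sample).group(0)
full = sample[start:balanced_json_end(sample, start)]
```

Here ``lazy`` stops at the inner object's closing brace, while ``full`` spans
the whole argument object.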

Also retargeted ``test_pr5624_regressions.py``'s routes-layer strip
tests to ``_strip_tool_xml`` for consistency with the restored helper.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py
         studio/backend/tests/test_safetensors_capability_advertise.py
         studio/backend/tests/test_pr5624_regressions.py -q
  -> 170 passed in 1.93s

  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2034 passed, 15 failed (pre-existing on the 5620 base; same
     set as before the merge: test_training_worker_flash_attn,
     test_desktop_auth, test_studio_api integration shims).
…k-glm-kimi

Second merge pass to absorb the base's "tighten verbose comments in
tool-call parser sections" commit plus pre-commit auto-fixes that
landed on both sides since the previous merge.

Conflicts:

* studio/backend/core/inference/tool_call_parser.py
  - Regex prelude: kept 5624's per-family blocks (DeepSeek, GLM,
    Kimi) and trimmed comments to the base's tightened style for
    Gemma 4. New families need the explicit constants regardless of
    comment density.
  - parse_tool_calls_from_text: dropped the base's compact for-loop
    over the 5 original parsers (it would have double-run Qwen and
    function_xml ahead of the explicit chain). Kept 5624's interleaved
    dispatch (DeepSeek/Kimi first, Qwen, GLM-after-Qwen,
    function_xml, python_tag, Mistral, Gemma, bare-JSON fallback),
    with tightened single-line per-family comments.

* studio/backend/routes/inference.py
  - _PARSER_MARKERS comment: collapsed to base's tighter style while
    keeping the per-family marker list so the next reader knows what
    triggers the pill.
  - _TOOL_XML_RE comment block: same compression, but the four
    speculative-buffer leak shapes and the Mistral balanced-brace
    delegation note are preserved.
  - _TOOL_XML_RE pattern list: kept all six 5624 alternations
    (python_tag elaborate, DeepSeek envelope, Kimi section, plus the
    base's orphan-close and tail-only </parameter>) in the order that
    matches the comment block.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py
         studio/backend/tests/test_safetensors_capability_advertise.py
         studio/backend/tests/test_pr5624_regressions.py -q
  -> 170 passed in 2.22s

  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2034 passed, 15 failed (same pre-existing CI gaps on the base:
     test_training_worker_flash_attn, test_desktop_auth,
     test_studio_api integration shims).
Two fixes surfaced by triple-confirm verification against the live
HF chat templates and upstream llama.cpp / vLLM / SGLang parsers.

1. GLM 4.7 silent drop
   ``zai-org/GLM-4.7/chat_template.jinja`` line 65 uses
   ``{{- '<tool_call>' + tc.name -}}`` which Jinja strips trailing
   whitespace from, so the first ``<arg_key>`` follows the function
   name with NO ``\n`` between them. Real emissions look like
   ``<tool_call>get_weather<arg_key>city</arg_key><arg_value>London
   </arg_value></tool_call>``. The previous ``_GLM_TC_OPEN_RE`` ended
   the name with ``\n`` so GLM-4.7 calls were silently dropped
   (parser returned ``[]``).

   Fix: relax the name terminator to a lookahead that accepts EITHER
   ``\n`` OR the next ``<arg_key>``:
       _GLM_TC_OPEN_RE = re.compile(
           r"<tool_call>\s*([^\n<{][^\n<]*?)\s*(?=\n|<arg_key>)"
       )
   The first-char restriction ``[^\n<{]`` still excludes Qwen's
   ``<tool_call>{json}`` form so the Qwen-vs-GLM dispatch remains
   mutually exclusive.

2. Kimi multi-section parity with vLLM / SGLang
   ``vllm/tool_parsers/kimi_k2_tool_parser.py`` and SGLang's
   ``kimik2_detector.py`` both use ``re.findall`` and so collect every
   ``<|tool_calls_section_begin|>...<|tool_calls_section_end|>`` block
   in a single stream. The previous implementation stopped at the
   first ``<|tool_calls_section_end|>``. Kimi K2 doesn't emit
   multi-section in practice, but parity is cheap.

   Fix: wrap the existing per-call body parser in an outer loop that
   advances past each ``<|tool_calls_section_end|>`` and continues to
   the next ``<|tool_calls_section_begin|>``. Body parsing extracted
   to ``_parse_kimi_section_body`` for clarity. Truncated final
   section is still surfaced via the existing in-body balanced-brace
   walk.
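
The relaxed opener from fix 1 can be checked directly. The regex is copied
from the commit message above; the sample strings and the ``glm_name`` helper
are illustrative.

```python
import re

_GLM_TC_OPEN_RE = re.compile(
    r"<tool_call>\s*([^\n<{][^\n<]*?)\s*(?=\n|<arg_key>)"
)


def glm_name(text):
    """Return the function name if the GLM opener matches, else None."""
    m = _GLM_TC_OPEN_RE.search(text)
    return m.group(1) if m else None


# GLM 4.7: no newline between the name and the first <arg_key>.
glm_47 = (
    "<tool_call>get_weather<arg_key>city</arg_key>"
    "<arg_value>London</arg_value></tool_call>"
)
# GLM 4.5 / 4.6: newline after the name.
glm_45 = (
    "<tool_call>get_weather\n<arg_key>city</arg_key>\n"
    "<arg_value>London</arg_value>\n</tool_call>"
)
# Qwen: JSON object after the tag, which the first-char class rejects.
qwen = '<tool_call>\n{"name": "get_weather", "arguments": {}}\n</tool_call>'
```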

Verified independently against the live HF templates:
* GLM-4.7 emission constructed from the live template parses to the
  expected ``{name, arguments}`` shape.
* GLM-4.5 / 4.6 newline shape continues to parse (the lookahead also
  matches ``\n``).
* Qwen ``<tool_call>{json}`` still dispatches to the Qwen path -- the
  first-char restriction stops the GLM regex from biting JSON bodies.
* Kimi two-section stream surfaces both calls in order with full ids
  preserved.
* Bare-counter Kimi ids still drop.

Tests added in ``test_pr5624_regressions.py``:
* ``test_glm_4_7_no_newlines_between_name_and_arg_key``
* ``test_glm_4_7_no_newlines_multi_call``
* ``test_glm_4_7_does_not_break_qwen_path``
* ``test_kimi_two_sections_in_one_stream_both_parse``

  pytest studio/backend/tests/test_safetensors_tool_loop.py
         studio/backend/tests/test_safetensors_capability_advertise.py
         studio/backend/tests/test_pr5624_regressions.py -q
  -> 174 passed in 1.93s

  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2038 passed, 15 failed (pre-existing CI gaps).
Pure comment / docstring tightening on top of the GLM 4.7 + Kimi
multi-section fixes. No behavioural change.

* Drop multi-paragraph prelude and post-refactor citation chatter in
  the DeepSeek, GLM and Kimi parser docstrings; keep the shape and
  upstream-commit pin.
* Collapse ``parse_tool_calls_from_text``'s 9 per-family blocks into
  a single ordered loop with one combined comment.
* Tighten the GLM coercion, Kimi bare-counter and ``_TOOL_XML_RE``
  comments to one or two lines each.
* Same trim pass on ``_PARSER_MARKERS`` and the regression-test
  docstrings.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py
         studio/backend/tests/test_safetensors_capability_advertise.py
         studio/backend/tests/test_pr5624_regressions.py -q
  -> 174 passed in 2.00s
Adversarial input ``<|tool▁calls▁begin|><|tool▁call▁begin|>fn<|tool▁sep|>``
followed by a long body that does NOT contain a closing brace caused
the V3 path's ``([^\n<]+?)<|tool▁sep|>`` regex to backtrack
quadratically: at each position the lazy quantifier extends one char
at a time looking for a sep that isn't there, taking ~19s on 50k
chars.

Replace the regex search with ``str.find`` on the sep marker plus a
left-walk to recover the name. ``str.find`` is O(N); the walk stops
on ``\n`` (turn boundary), ``<`` (start of a tag), or ``>`` (end of
an optional ``<|tool▁call▁begin|>`` prefix). Same observable
behaviour as the regex on every canonical input.
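
A minimal sketch of the ``str.find`` plus left-walk approach, assuming the
``<|tool▁sep|>`` marker shown earlier. ``find_name_before_sep`` is a
hypothetical name and the real function may handle more edge cases.

```python
SEP = "<|tool▁sep|>"


def find_name_before_sep(text, start=0):
    """Return (name, index_after_sep), or None when no usable sep exists."""
    sep_at = text.find(SEP, start)  # single linear scan, no backtracking
    if sep_at == -1:
        return None
    i = sep_at
    # Walk left until a turn boundary, a tag start, or the end of a prefix tag.
    while i > start and text[i - 1] not in "\n<>":
        i -= 1
    name = text[i:sep_at].strip()
    if not name:
        return None
    return name, sep_at + len(SEP)
```

The left-walk is bounded by the length of the name, so the total cost stays
linear in the input even when no further separator follows.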

Tests:
  test_deepseek_v3_1_huge_truncated_body_is_linear (new) -- 50k chars
  must parse in < 1s.
  pytest studio/backend/tests/test_safetensors_tool_loop.py
         studio/backend/tests/test_safetensors_capability_advertise.py
         studio/backend/tests/test_pr5624_regressions.py -q
  -> 175 passed in 1.97s
  pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
  -> 2038 passed, 15 pre-existing failures unchanged.
The JSON sub-path of ``_parse_llama3_python_tag`` was fabricating
``{"value": args}`` when the model emitted a non-dict / non-string
``arguments`` value (e.g. ``42``, ``[1,2,3]``, ``null``, ``true``).
This silently turned a malformed emission into a real tool call,
which the agentic loop would then execute with arguments the model
never intended.

Tightened: skip the call instead of fabricating arguments. This now
matches the bare-JSON guard tightened earlier (strict-guard merge
from PR #5620, inherited via merge here).

Added a regression test covering the four non-dict / non-string
shapes.
Pass count on this branch: 158 -> 159.

Sites in ``_parse_tool_call_json`` and ``_consume_mistral_call``
keep the existing looser behaviour for now; both are reached
only after explicit ``<tool_call>`` / ``[TOOL_CALLS]`` markers
so the false-positive surface there is much narrower.
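
The skip-instead-of-fabricate guard can be sketched as below.
``call_from_json`` is a hypothetical helper; decoding a string ``arguments``
value as JSON and accepting ``parameters`` as an alias are assumptions made
for the example.

```python
import json


def call_from_json(obj):
    """Return a tool-call dict, or None when the emission should be skipped."""
    if not isinstance(obj, dict) or not isinstance(obj.get("name"), str):
        return None
    args = obj.get("arguments", obj.get("parameters", {}))
    if isinstance(args, str):
        # Assumption: a string value carries JSON-encoded arguments.
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return None
    if not isinstance(args, dict):
        # 42, [1, 2, 3], null, true: skip rather than wrap as {"value": args}.
        return None
    return {"name": obj["name"], "arguments": args}
```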
…a.cpp

Four GGUF-parity fixes for the GLM and Kimi K2 families:

- GLM 4.7 zero-argument inline call <tool_call>name</tool_call> was dropped:
  the open-tag lookahead only allowed \n or <arg_key> after the name. Allow
  </tool_call> too so a no-arg call parses to empty args (vLLM / SGLang /
  llama.cpp all parse it).

- GLM string argument values were stripped, losing significant leading /
  trailing whitespace in code / diff arguments. Keep the raw value for the
  string fallback and only strip the copy used to probe for a JSON literal,
  matching vLLM glm4_moe which never strips string args.

- Kimi K2 calls emitted without the <|tool_calls_section_begin|> wrapper
  were dropped. llama.cpp makes the section optional (Kimi can call a tool
  straight after reasoning without opening a section); parse a bare
  <|tool_call_begin|> when no section is present.

- Kimi K2 malformed / truncated JSON in one call dropped every later call in
  the section. Skip the bad call and keep parsing so valid subsequent calls
  are recovered (vLLM parity).

Adds regression tests for all four.
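
The two Kimi K2 fixes (optional section wrapper, skip a bad call and keep
going) can be sketched together. ``parse_kimi_lenient`` and ``_CALL_HEAD_RE``
are hypothetical names, not the PR's code.

```python
import json
import re

# Matches the call header only; no <|tool_calls_section_begin|> is required.
_CALL_HEAD_RE = re.compile(
    r"<\|tool_call_begin\|>\s*([^<\s]+)\s*<\|tool_call_argument_begin\|>\s*"
)


def parse_kimi_lenient(text):
    """Collect every call whose arguments decode, skipping malformed ones."""
    decoder = json.JSONDecoder()
    calls = []
    for m in _CALL_HEAD_RE.finditer(text):
        try:
            # raw_decode reads one JSON value and ignores whatever follows it.
            args, _ = decoder.raw_decode(text, m.end())
        except json.JSONDecodeError:
            continue  # malformed or truncated: skip this call only
        if isinstance(args, dict):
            calls.append({"id": m.group(1), "arguments": args})
    return calls
```

Iterating over call headers rather than over sections is what makes the
wrapper optional in this sketch.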
@danielhanchen

Member Author

Validated this end to end on 4 B200 GPUs, one Studio per model family, running each safetensors model and its unsloth UD-Q4_K_XL GGUF through the same 6 prompts (weather, GitHub issue search, Python math, latest PyTorch version, Hacker News top story, hash plus date math). Identical generation params on both backends (temperature 0.7, top_p 0.8, top_k 20, min_p 0.0), 3 seeds each, so 18 runs per variant. Tool calls counted from the tool_start stream events.

Results, safetensors vs GGUF, tools fired out of 18:

  • Llama-3.1-8B: 18/18 vs 18/18, both averaging 3.3 calls. Safetensors passed 5/6 of the deterministic code checks vs 2/6 for the GGUF.
  • Qwen3-14B: 14/18 vs 15/18, essentially identical.
  • Mistral-Small-3.2-24B: 0/18 vs 0/18. The model refuses to call tools in both backends with the same "I don't have the capability to access real-time information" answer, so this is model behaviour rather than the parser.
  • gemma-3-27b-it: 0/18 vs 0/18. gemma-3 has no tool-calling chat template on either side, so neither backend can fire.

So on the families whose chat template actually advertises tools, the safetensors path now matches the GGUF. Where safetensors does not fire, the GGUF does not fire either, so there is no remaining backend gap.

I also cross-checked each family's parsing against llama.cpp common/chat.cpp, vLLM tool_parsers, and SGLang function_call. The fixes layered on top of the base parser:

  • Llama-3.2 bare-JSON form {"name": ..., "parameters": ...} now fires even with no XML signal, since the skip-special-token stream drops <|python_tag|>.
  • Mistral [CALL_ID] marker skipped and [THINK]...[/THINK] stripped before parsing.
  • <function name="..."> attribute form recognised as a tool signal and stripped.
  • GLM zero-argument inline calls parsed, and GLM string-value whitespace preserved.
  • Kimi K2 calls parsed without the section wrapper, with malformed-JSON recovery continuing to later calls.

All of these pass on the safetensors tool-loop suite.

I confirmed the same behaviour in the Studio UI. On Qwen3-14B the "Search the web for the latest stable PyTorch release, then compute the sum of primes below 1000" prompt shows a "2 tool calls" badge, real web_search source chips, and the correct python result of 76,127. On Llama-3.1-8B the weather plus Fibonacci prompt returns real web_search source chips and runs the python tool.

One follow-up worth noting separately, unrelated to this parser: for Mistral the Studio swaps in the Unsloth mistral chat template, which does not render the tools schema (the model's native template does), so the safetensors path never advertises tools to Mistral. It does not change the comparison here since the GGUF also fires 0, but it is the reason a fix to Mistral tool calling will need the native template, not just the parser.

danielhanchen and others added 2 commits June 1, 2026 02:01
…ripped parser)

Gemma-4 safetensors fired no tools while its GGUF fired reliably. Three gaps:

- The Studio swaps in the Unsloth "gemma-4" chat template, which does not
  render the tools schema (the model's native template does), so the model
  never saw the tools. Fall back to the model's native template when the
  override template renders identically with and without tools. Same fix
  helps any family whose override template drops tools.
- skip_special_tokens strips the <|tool_call> wrapper and <|"|> string
  markers, so a streamed Gemma-4 call arrives as a bare call:NAME{k:v, ...}
  with unquoted values. Parse that form, keeping commas/braces inside a
  code or command value, normalising surrounding quotes, and stripping the
  leaked markup from the final answer.
- Without a grammar a small model can loop, repeating one call for the whole
  tool budget. Collapse exact-duplicate calls within a turn and force a final
  answer after a turn that made no new tool progress (llama-server's lazy
  grammar prevents this loop on the GGUF side).

Adds parser tests for the bare/stripped Gemma-4 form.
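
The loop guard from the third item can be sketched as follows. ``call_key``
and ``plan_turn`` are hypothetical names, and reading "no new tool progress"
as "every call in the turn was already executed earlier" is an assumption.

```python
import json


def call_key(call):
    """Stable identity for a call: name plus canonically ordered arguments."""
    return (call["name"], json.dumps(call["arguments"], sort_keys=True))


def plan_turn(calls, executed):
    """Return (unique_calls, force_final).

    calls    -- tool calls parsed from the current turn
    executed -- set of call_key values run in earlier turns
    """
    seen = set()
    unique = []
    for call in calls:
        key = call_key(call)
        if key in seen:
            continue  # exact duplicate within this turn
        seen.add(key)
        unique.append(call)
    # No call that was not already executed means no new progress.
    force_final = all(call_key(c) in executed for c in unique)
    return unique, force_final
```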
@danielhanchen

Member Author

Added gemma-4-E4B-it to the comparison, and this one was a real case of the GGUF being better: safetensors fired 0/18 while the GGUF fired 18/18. Three gaps, all fixed in 4127d1f:

  1. The Studio swaps in the Unsloth gemma-4 chat template, which does not render the tools schema (the model's native template does), so the model never saw the tools. The text path now falls back to the native template when the override renders identically with and without tools. This also covers any other family whose override template drops tools (Mistral has the same template issue).
  2. skip_special_tokens strips the <|tool_call> wrapper and the <|"|> string markers, so a streamed call arrives as a bare call:NAME{k:v, ...} with unquoted values. The parser now reads that form, keeping commas and braces inside a code value, normalising quotes, and stripping the leaked markup from the final answer.
  3. Without a grammar the small E4B model loops on one call. The agentic loop now collapses exact-duplicate calls in a turn and forces a final answer after a turn that made no new progress. llama-server's lazy grammar prevents this loop on the GGUF side.

After the fix, safetensors vs GGUF over 18 runs each (same prompts, same generation params):

  • tool fire rate: 18/18 (was 0/18) vs 18/18
  • avg tool calls: 2.3 vs 2.2
  • tool types: web_search 33, python 8 vs web_search 32, python 7
  • code-answer correct: 0/6 vs 4/6
  • avg latency: 81s vs 8s

Tool firing now matches the GGUF in both rate and the count and kind of tools called. Two gaps remain and are structural to running a small model without llama.cpp's grammar: the E4B model relays computed values into its final answer less reliably (0/6 vs 4/6 on the deterministic code checks) and takes more turns, so it is slower. In the Studio UI the model fires web_search against real weather sources and runs python, returning "The 30th Fibonacci number is: 832040".

Parser tests for the bare and stripped Gemma-4 form are included. The safetensors tool-loop suite is at 157 passing.

@danielhanchen

Member Author

One more validation, this time on a new MoE: Qwen3.6-35B-A3B safetensors vs its MTP-GGUF (unsloth/Qwen3.6-35B-A3B-MTP-GGUF, UD-Q4_K_XL). No code change needed here, the multi-format parser and the qwen3-thinking template already handle it, so I am just posting the numbers for the record.

Architecture is qwen3_5_moe (35B total, ~3B active, thinking). safetensors is 71.9 GB and the MTP-GGUF is 22.9 GB, both under the 90 GB budget. The MTP (multi-token-prediction) variant loads and runs fine through llama-server. Both the native and the Unsloth qwen3-thinking override template render tools (Hermes <tools> system block plus <tool_call>{json}). Ran at 32k context since the reasoning plus full-page web_search results overflow 8k.

safetensors vs MTP-GGUF, 18 runs each, same prompts and params:

  • tool fire rate: 18/18 vs 18/18
  • avg tool calls: 2.5 vs 4.2
  • tool types: web_search 37, python 8 vs web_search 69, python 6
  • code-answer correct: 6/6 vs 3/6
  • avg latency: 165s vs 10s (the GGUF gets MTP speculative decoding, around 190 tok/s)

Both fire on every question, so there is no safetensors gap here. If anything the safetensors path is the more accurate one, passing all 6 deterministic code checks vs 3/6 for the GGUF, and it calls fewer tools (the MTP-GGUF over-searches, averaging about 11 web_search calls on the GitHub-issues question). Confirmed in the Studio UI: the MTP-GGUF fires web_search against real PyTorch sources and runs python, returning the correct sum of primes below 1000 (76,127).
